APACHE SPARK: WHAT DOES IT ACHIEVE THAT HADOOP DIDN'T?
Apache Spark is, by definition, a cluster-computing framework. It is open source and is backed by a community of more than 1,200 committers. Used chiefly for data analytics and machine learning tasks, Spark has found great popularity not only in business analytics applications but also in the research community around the world.
Why was Hadoop not so successful for HPC?
- For iterative and interactive workloads, Hadoop MapReduce's mandatory dumping of output to disk after every Map/Reduce stage proved to be a major bottleneck. Machine learning users, who rely heavily on iterative processes to train, test, and retrain their models, found that this slowed their work considerably.
- HDFS impedes access to data stored outside the Apache Hadoop family. Traditional Hadoop applications required the data to be copied into HDFS before processing it, which at times also created network slowdowns.
- Mappers required a data localisation phase, in which data was written to the local file system to provide resilience.
These bottlenecks can essentially be attributed to the heavy network traffic and storage demands such workflows place on HPC systems.
What does Spark achieve?
- With its in-memory processing paradigm, Spark lowers disk I/O overhead substantially. Spark records each transformation applied to a parallelised dataset in a Directed Acyclic Graph (DAG) and does not evaluate the transformations until a result is actually required (a short sketch follows this list).
- Spark works equally well with HDFS, any POSIX-style file system, or even cloud-based storage such as S3, Azure Blob Storage, etc.
- Resilience in Spark also comes from the DAG: a lost RDD (the parallelised quantum of data in Spark) is recomputed by retracing the lineage through which it was created. The need for data localisation is therefore minimal in Spark.
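To make the lazy-evaluation and lineage ideas above concrete, here is a minimal PySpark sketch; the dataset, application name, and file paths are illustrative and not part of any actual cluster setup:

    # Transformations only record steps in the DAG; nothing runs until an action is called.
    from pyspark import SparkContext

    sc = SparkContext(appName="LazyEvaluationSketch")

    # Build a chain of transformations on a parallelised dataset.
    numbers = sc.parallelize(range(1, 1000001))
    squares = numbers.map(lambda x: x * x)
    evens   = squares.filter(lambda x: x % 2 == 0)

    # cache() keeps the computed partitions in memory so iterative jobs
    # can reuse them instead of recomputing or re-reading from disk.
    evens.cache()

    # Actions trigger execution of the DAG.
    print(evens.count())   # first action: the whole chain is evaluated
    print(evens.take(5))   # later actions reuse the cached partitions

    # The lineage that Spark would replay to rebuild a lost partition:
    print(evens.toDebugString())

    # The same API reads from local, HDFS or cloud storage via the URI scheme,
    # e.g. sc.textFile("hdfs:///path/file.txt") or sc.textFile("s3a://bucket/key").

    sc.stop()

The same pattern applies to real datasets: replace parallelize() with textFile() pointing at HDFS, a POSIX file system, or cloud storage.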
Spark on Commodity Hardware vs. Spark on HPC
In-memory processing, one of Spark's strengths, also proves to be one of its weaknesses at times. The memory required for in-memory processing is high and hard to provide on compute nodes built from commodity hardware, whose memory is small. This becomes a problem for programs that run many iterations or make intermittent calls to data serialisation, and caching also becomes difficult when iterating over high-volume datasets.
On an HPC cluster, by contrast, memory is plentiful, usually in the range of hundreds of gigabytes per node. Users can therefore capitalise on this high-capacity memory to run their tasks in memory and make full use of the parallelism Spark provides.
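As a rough illustration of how this plays out in practice, the sketch below raises executor memory to suit a large-memory HPC node and picks a persistence level that can spill to disk when memory is scarce; the memory figures, file path, and application name are placeholders, not the actual configuration of our cluster:

    from pyspark import SparkConf, SparkContext, StorageLevel

    # Placeholder settings, sized for a fat HPC node rather than taken from a real cluster.
    conf = (SparkConf()
            .setAppName("MemoryTuningSketch")
            .set("spark.executor.memory", "100g")
            .set("spark.driver.memory", "16g"))

    sc = SparkContext(conf=conf)

    data = sc.textFile("hdfs:///data/large_dataset.txt")   # illustrative path

    # MEMORY_ONLY keeps all partitions in RAM; on small commodity nodes,
    # MEMORY_AND_DISK lets partitions spill to disk instead of being recomputed.
    data.persist(StorageLevel.MEMORY_AND_DISK)

    print(data.count())
    sc.stop()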
Spark @ NUSIT-HPC updates
The Research Data Repository and Analytics system running on the HPC cluster at NUS IT has an Apache Spark 1.6.2 installation available for use.
Support for Machine Learning
We support various machine learning algorithms through Spark MLlib as well as Python-based libraries such as scikit-learn. Machine learning in Scala is also supported.
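As a small example of the MLlib support, the sketch below trains a k-means model with the RDD-based MLlib API that ships with Spark 1.6.x; the toy points and application name are made up purely for illustration:

    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="MLlibKMeansSketch")

    # Toy two-dimensional points; a real job would load data from HDFS or disk.
    points = sc.parallelize([
        [0.0, 0.0], [0.5, 0.3], [0.2, 0.1],
        [9.0, 8.5], [8.7, 9.2], [9.5, 9.0],
    ])

    # Train a k-means model with two clusters.
    model = KMeans.train(points, k=2, maxIterations=20)

    print(model.clusterCenters)          # learned cluster centres
    print(model.predict([0.1, 0.2]))     # cluster assignment for a new point

    sc.stop()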
Support for Bioinformatics
For bioinformatics users, we have GATK4 with Spark ready for use. GATK4 is a genome variant discovery toolkit for analysing high-throughput sequencing data; it uses Apache Spark to achieve high speed and scalability.
For more details and to use our Research Data Repository, contact us at nusit-hpc@nus.edu.sg or data.engineering@nus.edu.sg.