CAN HPC AND BIG-DATA ANALYTICS CO-EXIST?
Traditional HPC and emerging Big-Data applications are both compute and data intensive. We will examine whether they have enough requirements in common to share the same resources.
Similarities and Differences
In terms of hardware, both use commodity clusters and manycore systems to accelerate computation. Today, the majority of HPC applications run on x86-based Linux clusters or GPGPU systems. Similarly, for cost-effectiveness and scalability, Hadoop Big-Data computing platforms are built on clusters. Other Big-Data computations, such as those in Machine Learning, also run well on manycore systems such as GPGPUs.
When it comes to storage and file-system design, the two take quite different approaches. While HPC uses centrally provisioned parallel file systems such as Lustre or the Network File System (NFS), most Hadoop file systems are built from local storage in each cluster node. As most HPC cluster nodes do not come with large local storage, re-purposing an HPC cluster for Hadoop applications is a challenge. Fortunately, there are now solutions that emulate the Hadoop file system on a central storage system. We will discuss this new approach further in the next section.
Besides storage and file-system implementation, HPC and Big Data also differ markedly in how application software is developed and executed. While HPC applications are developed mainly in C or Fortran, Python and Java are used more often in Big-Data applications. The HPC and Hadoop computing platforms also manage jobs quite differently, using different schedulers, partly because they execute different types of jobs: HPC workloads are mainly batch jobs, while Big-Data applications include real-time queries and processing.
Closing the Gap
Recently we installed a central storage system that serves HPC and can also emulate the Hadoop File System. With this implementation, we can now build a Hadoop computing cluster on the existing HPC cluster by integrating it with the central storage system. This approach has two key advantages. First, the cluster and the storage capacity can be scaled or upgraded independently. Second, the usable storage capacity of the central storage system is much higher than that of a traditional Hadoop cluster. To provide the same level of redundancy, a traditional Hadoop cluster needs multiple data replicas spread across the cluster (e.g. only ~33% of raw capacity is usable with two extra replicas, i.e. three copies of each block), whereas the central storage system yields around 80% usable capacity.
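The capacity difference above comes down to simple arithmetic: replication stores whole copies of every block, while a central storage system typically protects data with parity or erasure coding. A minimal sketch (the raw capacity and the 8+2 layout are illustrative assumptions, not figures from our system):

```python
# Illustrative capacity comparison (numbers are assumptions, not measurements).
# HDFS-style replication keeps N full copies of every block, while a central
# storage system typically uses a parity / erasure-coded layout.

def usable_fraction_replication(copies):
    """Usable fraction when each block is stored `copies` times."""
    return 1.0 / copies

def usable_fraction_erasure(data_shards, parity_shards):
    """Usable fraction for a data + parity erasure-coded layout."""
    return data_shards / (data_shards + parity_shards)

raw_tb = 300  # hypothetical raw capacity in TB

# Hadoop's default replication factor of 3 -> ~33% usable.
hdfs_tb = raw_tb * usable_fraction_replication(3)

# e.g. an 8 data + 2 parity erasure-coded layout -> 80% usable.
central_tb = raw_tb * usable_fraction_erasure(8, 2)

print(f"HDFS (3 copies):  {hdfs_tb:.0f} TB usable (~{hdfs_tb / raw_tb:.0%})")
print(f"Central (8+2 EC): {central_tb:.0f} TB usable ({central_tb / raw_tb:.0%})")
```

With the same 300 TB of raw disk, replication leaves about 100 TB usable while the erasure-coded central system leaves 240 TB, which is where the ~33% versus ~80% figures come from.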
In the original Hadoop cluster design, the application is moved to the location of the data for processing. With this new approach, data instead has to be transferred from the central storage system over the network to the compute cluster. To address the potential performance impact, adequate network bandwidth has to be provisioned.
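How much bandwidth is "adequate" can be estimated with a back-of-envelope calculation. The sketch below is illustrative only: the dataset size, link speeds and 80% link-efficiency figure are assumptions, not measurements from our network.

```python
# Back-of-envelope estimate of data-transfer time from central storage
# to the compute cluster. All figures are illustrative assumptions.

def transfer_time_seconds(data_gb, link_gbps, efficiency=0.8):
    """Time to move `data_gb` gigabytes over a `link_gbps` link,
    assuming only `efficiency` of the raw bandwidth is achievable."""
    data_bits = data_gb * 8e9              # gigabytes -> bits
    effective_bps = link_gbps * 1e9 * efficiency
    return data_bits / effective_bps

# Moving a hypothetical 1 TB dataset to the compute cluster:
for gbps in (10, 40, 100):
    t = transfer_time_seconds(1000, gbps)
    print(f"{gbps:3d} Gb/s link: ~{t / 60:.1f} minutes")
```

At an assumed 80% efficiency, a 1 TB dataset takes roughly 17 minutes over 10 Gb/s but under 2 minutes over 100 Gb/s, which shows why the network provisioning matters once data locality is given up.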
Hadoop is not the only way to do Big-Data analytics. With the proper software, numerical libraries and tools set up, non-Hadoop analytics applications written in Python, R and Matlab can be supported on the HPC cluster. Similarly, GPU systems designed for HPC can also be set up to support data-intensive Machine-Learning applications such as Deep Learning. I believe future generations of manycore processors, including both GPUs and Xeon Phi, will make HPC systems more multi-purpose and close the gap further.
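As a small illustration of non-Hadoop analytics on an HPC node, the sketch below runs a "group by and aggregate" query, the classic MapReduce example, in plain Python with only the standard library. The node names and temperature readings are made up for the example; in practice the data would be read from the shared Lustre or NFS file system.

```python
# A minimal sketch of non-Hadoop analytics on an HPC node: plain Python
# with the standard library is often enough when the data fits on the
# shared file system. The CSV content here is hypothetical sample data.
import csv
import io
import statistics

# In practice this would be open("/lustre/.../readings.csv").
raw = io.StringIO("node,temp\nn01,61.5\nn02,72.0\nn01,63.1\nn02,70.4\n")

# Group readings by node -- the "map/shuffle" step of a MapReduce job.
temps_by_node = {}
for row in csv.DictReader(raw):
    temps_by_node.setdefault(row["node"], []).append(float(row["temp"]))

# Aggregate per node -- the "reduce" step.
for node, temps in sorted(temps_by_node.items()):
    print(f"{node}: mean temp {statistics.mean(temps):.1f}")
```

The same pattern scales up naturally with NumPy or pandas when those libraries are installed on the cluster, without any Hadoop infrastructure.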
Conclusion
HPC and Big-Data applications can co-exist in the same HPC cluster environment. By using a central storage system to emulate the Hadoop File System, a Hadoop cluster can be built on top of the HPC cluster. If you are interested in running data analytics on a Hadoop cluster, do check out the article on the Hadoop testbed by Sundy.