HADOOP 3.1.0: GPU, CONTAINERS AND BEYOND
Stages of Hadoop evolution
Since its birth in 2006, Hadoop has gone through numerous changes in paradigm and architecture. Its usage pattern has shifted from ad-hoc, per-user clusters to long-running, highly available deployments. This evolution can be divided into the four stages described below:
Stage 1: Ad Hoc/ Per User Clusters
A typical user would manually spin up a cluster on a few nodes, load their data into the Hadoop Distributed File System (HDFS), write MapReduce jobs to obtain the required results, and then tear down the cluster. Everything from creation to teardown of a cluster was a highly manual process.
Stage 2: Hadoop on Demand (HOD)
The main aim of Hadoop on Demand was to address the multitenancy issues of ad-hoc clusters over shared HDFS and to automate cluster creation and teardown. The architecture was based on a resource manager named Torque and a job scheduler named Maui. This approach eliminated the manual deallocation of resources and helped achieve multitenancy, but cluster utilization remained very poor.
Stage 3: Shared Computes and Storage
HOD’s restricted API and inefficient architecture forced users to become experts in resource management just to make the simplest jobs work. This led to the evolution of a model comprising shared MapReduce clusters running on top of shared HDFS instances.
Stage 4: YARN Oriented (Shared and Highly Available)
MapReduce helped users solve many use cases, but it was not the ideal solution for all large-scale computations. Machine learning and graph-processing algorithms, for instance, proved very costly in terms of scheduling effort and resources. These shortcomings, coupled with the need for highly scalable and fault-tolerant architectures, led to the evolution of YARN. YARN, which stands for Yet Another Resource Negotiator, had serviceability, high utilization, high availability, and diversified programmability at its core.
Evolution of YARN to support HPC-like applications:
YARN, first released with Hadoop in 2012 and deployed to production at Yahoo! in 2013, has been a continuously evolving project. Its evolution, however, was focused largely on improving resource utilization and throughput using either commodity hardware or high-capacity CPU servers. Applications relying on non-conventional hardware such as GPUs, whether for accelerated computation or heavy graphical processing, were not natively supported. With the recent release of Hadoop 3.1.0, first-class support for container orchestration and for GPU and FPGA scheduling and isolation has become possible.
The YARN service framework provides first-class support for hosting long-running services on YARN, letting users program their algorithms without having to worry about cluster resource management and availability.
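As an illustrative sketch of how such a long-running service is described (the service and component names here are hypothetical; verify the spec fields against your Hadoop distribution), the service framework accepts a declarative JSON specification:

```json
{
  "name": "sleeper-service",
  "version": "1.0",
  "components": [
    {
      "name": "sleeper",
      "number_of_containers": 2,
      "launch_command": "sleep 900000",
      "resource": {
        "cpus": 1,
        "memory": "256"
      }
    }
  ]
}
```

Saved as, say, `sleeper.json`, a spec like this can be launched with something like `yarn app -launch sleeper-service sleeper.json`; YARN then keeps the requested number of containers running and restarts them on failure.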
Like Kubernetes, YARN now provides a container orchestration platform for managing containerized services. Docker containers as well as traditional process-based containers are natively supported. This helps users develop their applications quickly without having to worry about the fine details of resource management and the deployment environment.
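A minimal sketch of what this looks like in practice (the jar path and image name are assumptions for illustration): once a NodeManager's `yarn-site.xml` allows the Docker runtime (`yarn.nodemanager.runtime.linux.allowed-runtimes=default,docker`), an application can opt into Docker per container through environment variables:

```shell
# Run the distributed-shell example app inside a Docker container.
# The runtime is selected per container via YARN_CONTAINER_RUNTIME_* env vars.
yarn jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-*.jar \
  -jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-*.jar \
  -shell_command "cat /etc/os-release" \
  -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker \
  -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=centos:7 \
  -num_containers 1
```

Leaving out the two `-shell_env` variables falls back to the traditional process-based container runtime, so the same application can run in either mode.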
The biggest shift towards supporting HPC users has come in the form of first-class GPU scheduling and isolation on YARN. Although only NVIDIA GPUs are supported as of now, both Docker and non-Docker containers can be used as the runtime context, and GPU scheduling and isolation work in either mode.
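As a sketch of how this is used (the jar path and binary location are illustrative), GPUs are requested as a countable resource named `yarn.io/gpu`, assuming the cluster administrator has enabled the GPU resource plugin (`yarn.nodemanager.resource-plugins=yarn.io/gpu` in `yarn-site.xml`):

```shell
# Ask YARN for 2 containers, each with 3 GB of memory, 1 vcore and 2 GPUs.
# YARN isolates the allocated GPUs so containers cannot see each other's devices.
yarn jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-*.jar \
  -jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-*.jar \
  -shell_command /usr/local/nvidia/bin/nvidia-smi \
  -container_resources memory-mb=3072,vcores=1,yarn.io/gpu=2 \
  -num_containers 2
```

Because the GPU count is just another dimension of the container's resource request, the scheduler can pack GPU and non-GPU workloads onto the same cluster.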
The FPGA (Field-Programmable Gate Array) resource type is also supported by YARN, though for now it ships with only one vendor plugin, “IntelFpgaOpenclPlugin”. With the introduction of FPGAs, today's Hadoop clusters can, in principle, be applied to virtually any computable problem.
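Sketching the configuration side (property names follow the Hadoop 3.1 resource-plugin convention; treat the exact values as assumptions to verify against your distribution's documentation), FPGAs are exposed as the countable resource `yarn.io/fpga` on each NodeManager:

```xml
<!-- yarn-site.xml on each NodeManager: enable the FPGA resource plugin -->
<property>
  <name>yarn.nodemanager.resource-plugins</name>
  <value>yarn.io/fpga</value>
</property>
```

Containers then request FPGAs the same way they request GPUs, e.g. `-container_resources memory-mb=2048,vcores=1,yarn.io/fpga=1` in a distributed-shell invocation.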
The added support for GPUs, container orchestration, and FPGAs means that users can now:
- Use GPUs to accelerate their big data applications
- Develop AI applications involving Deep Learning/Machine Learning with ease
- Deploy their applications in containers and orchestrate them natively from YARN
- Realize Bioinformatics/Medical Imaging/Voice recognition use cases using FPGAs and YARN
At NUS, support for Hadoop-enabled applications is provided through DRAS (Data Repository and Analytics System), which runs Hortonworks Data Platform (HDP 2.5.3) with Hadoop 2.7.3. In the coming days, the Hadoop version on DRAS will be upgraded to 3.1.0 (which ships with HDP 3). This will help researchers realize use cases that were previously limited by the lack of support for non-conventional hardware on Hadoop.
For any queries or further details regarding the Hadoop infrastructure at NUSIT HPC, send us an email at data.engineering@nus.edu.sg.