BLURRING LINES BETWEEN HADOOP AND DEEP LEARNING
Artificial Intelligence is easier to adopt than ever. Developers and enthusiasts are using the technology to solve problems like never before, not just in conventional computing but in non-computing areas such as farming, medicine, species conservation, and administration, all of which are reaping the benefits of the recent rise in AI.
The Hadoop ecosystem and deep learning technologies have long developed along separate tracks, focusing on different architectures and employing different kinds of computing resources. While Hadoop was CPU-only until recently, the development of deep learning for AI has relied heavily on GPU resources.
With Hadoop 3.1, native GPU support has become available. This opens a new avenue of development in which big data systems can run deep learning natively; Hadoop is no longer just a big data preprocessing system.
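As a minimal sketch, GPU scheduling in Hadoop 3.1 is switched on through YARN configuration. The property names below follow the upstream Hadoop 3.1 GPU documentation, though the full setup (device discovery, cgroups isolation, Docker runtime) depends on the cluster:

    <!-- resource-types.xml: declare GPU as a schedulable resource type -->
    <configuration>
      <property>
        <name>yarn.resource-types</name>
        <value>yarn.io/gpu</value>
      </property>
    </configuration>

    <!-- yarn-site.xml: enable the GPU plugin on each NodeManager -->
    <property>
      <name>yarn.nodemanager.resource-plugins</name>
      <value>yarn.io/gpu</value>
    </property>

With this in place, applications can request GPUs from YARN the same way they request memory and vcores.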
Apache Hadoop Submarine:
Apache Hadoop Submarine is a Hadoop subproject. It aims to empower users to run deep learning applications on big data resource management platforms like YARN or Kubernetes.
Submarine runs distributed deep learning frameworks in the form of Docker containers. Since YARN now natively supports Docker containers, coordinating the various services becomes easy.
Submarine supports most of the popular deep learning frameworks, such as TensorFlow, PyTorch, MXNet, and Caffe. Its reach is not limited to just running applications: it helps users with the entire process of algorithm development, model training, model management, batch training, etc.
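To give a feel for the workflow, here is a sketch of submitting a distributed TensorFlow job through the Submarine CLI. The jar version, Docker image name, HDFS path, and training script are hypothetical placeholders; resource flags follow the Submarine job submission documentation:

    # Hypothetical example: my-tf-gpu, the dataset path, and train.py are placeholders.
    yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
      --name tf-job-001 \
      --docker_image my-tf-gpu:latest \
      --input_path hdfs:///datasets/cifar-10 \
      --num_workers 2 \
      --worker_resources memory=8G,vcores=2,gpu=1 \
      --worker_launch_cmd "python train.py" \
      --num_ps 1 \
      --ps_resources memory=4G,vcores=2 \
      --ps_launch_cmd "python train.py"

Each worker and parameter server lands in its own Docker container scheduled by YARN, which is what makes the coordination described above straightforward.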
Support for a Submarine interpreter is available in Zeppelin as well as Jupyter notebooks, making it easy for developers to do interactive model development.
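For instance, a Zeppelin notebook cell might look like the sketch below. The interpreter prefix follows the Zeppelin Submarine interpreter documentation but varies by Zeppelin version, and the toy model itself is only illustrative:

    %submarine.python
    # Hypothetical notebook cell: a toy Keras model developed interactively;
    # Submarine handles shipping the code to the cluster for training.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')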
HDP 3.1
Hortonworks Data Platform (HDP) 3.1 is out, capturing the best of the latest Hadoop releases. It now has native support for TensorFlow under its machine learning and deep learning application support area, along with the easy portability brought by Docker and support for hybrid and cloud-based storage architectures.
Whether Hadoop is the right choice of big data processing framework was debated for a long time. With the advent of technologies like Submarine, the parallel processing capabilities of Apache Spark, and YARN's native GPU resource management and scheduling, that question has become redundant. The lines between traditional Hadoop and deep learning have blurred.