DEEP LEARNING ON APACHE SPARK: INTEL BIGDL
In the present-day data deluge, Data Lakes have become the repository technology of choice. For processing the Big Data they hold, Hadoop/Spark-based ecosystems provide scalable, parallelised and reliable environments.
With the advent of AI, the surge of interest in Deep Learning has posed a fundamental problem for Deep Learning based applications and research: processing the data in-situ and then running Deep Learning algorithms on the processed data. The added overhead of performing ETL on the data collected in data lakes, and then moving that data out to separate systems with Deep Learning capabilities, requires a lot of effort and expertise.
Intel BigDL addresses some of these issues. BigDL is a distributed deep learning library for Apache Spark, and Deep Learning on Spark-based systems becomes more viable because:
- Large amounts of data can be analysed on the same Big Data (Hadoop/Spark) cluster where the data is stored (HDFS, HBase, Hive, etc.)
- Data preparation activities like ETL, warehousing, feature engineering, etc. can be offloaded to the Spark/Hadoop application stack, and
- The barrier to entry for data scientists and data engineers, who are usually not Deep Learning experts, is relatively low. With BigDL they can use the familiar environment and programming models of a typical Spark/Hadoop based environment and still take the less-travelled path of Deep Learning (a minimal training sketch follows this list)
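To make this concrete, here is a minimal sketch of distributed training on a Spark cluster with BigDL's Python API (BigDL 0.x style). The data, model shape and hyperparameters are illustrative assumptions, and exact argument names may differ between BigDL versions; in a real job the RDD would be built from data already sitting in HDFS/Hive/HBase.

```python
import numpy as np
from pyspark import SparkContext
from bigdl.util.common import create_spark_conf, init_engine, Sample
from bigdl.nn.layer import Sequential, Linear, ReLU
from bigdl.nn.criterion import MSECriterion
from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

# Create a Spark context with BigDL's recommended configuration and
# initialise the BigDL engine on the executors
sc = SparkContext(conf=create_spark_conf().setAppName("bigdl-sketch"))
init_engine()

# In practice the data already lives in the cluster; here a small synthetic RDD
# of (features, label) pairs is wrapped into BigDL Samples
raw_rdd = sc.parallelize([(np.random.rand(10), np.array([1.0])) for _ in range(1000)])
train_rdd = raw_rdd.map(lambda p: Sample.from_ndarray(p[0], p[1]))

# A tiny feed-forward model built from BigDL layers
model = Sequential().add(Linear(10, 32)).add(ReLU()).add(Linear(32, 1))

# Distributed training runs as ordinary Spark jobs on the same cluster
optimizer = Optimizer(model=model,
                      training_rdd=train_rdd,
                      criterion=MSECriterion(),
                      optim_method=SGD(learningrate=0.01),
                      end_trigger=MaxEpoch(2),
                      batch_size=128)
trained_model = optimizer.optimize()
```

Because training is expressed as plain Spark jobs over an RDD, the same cluster, scheduler and data-locality machinery used for ETL also drives the Deep Learning workload.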
Cross Compatibility of BigDL models
Transfer Learning can be carried out in BigDL, since pretrained models built in Keras, Torch, Caffe or TensorFlow can be loaded into BigDL without any hassle. BigDL models can also be loaded back into the existing DL frameworks.
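A hedged sketch of the loading direction is shown below using BigDL's Python `Model` loaders. The file paths are placeholders and the exact loader signatures may vary across BigDL versions.

```python
from bigdl.nn.layer import Model

# Load a Caffe model (prototxt definition + binary weights) as a BigDL model
caffe_model = Model.load_caffe_model("deploy.prototxt", "weights.caffemodel")

# Load a Keras model definition (JSON) together with its HDF5 weights
keras_model = Model.load_keras(json_path="model.json", hdf5_path="weights.h5")

# Load a frozen TensorFlow graph, naming its input and output tensors
tf_model = Model.load_tensorflow("frozen_graph.pb",
                                 inputs=["input"],
                                 outputs=["output"])
```

Once loaded, these models behave like any other BigDL module, so they can be fine-tuned on the cluster or used directly for inference; BigDL likewise provides save methods for exporting models in the other direction.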
Fault Tolerance in BigDL
Existing deep learning frameworks are typically deployed as multiple long-running, potentially stateful tasks, which interact with each other (in a blocking fashion to support synchronous mini-batch SGD) for model computation and parameter synchronisation.
In contrast, BigDL runs a series of short-lived Spark jobs (for example, two jobs per mini-batch, as described in earlier sections), and each task in the job is stateless and non-blocking. As a result, BigDL programs can automatically adapt to dynamic resource changes (for example, pre-emption, failures, incremental scaling, resource sharing, etc.) in a timely fashion.
BigDL in action: Analytics Zoo
Analytics Zoo is a unified analytics and AI platform based on BigDL which seamlessly unites Spark, TensorFlow, Keras and BigDL into an integrated pipeline. With Analytics Zoo, users can do:
- Data wrangling and analysis using PySpark
- Deep learning model development using TensorFlow or Keras
- Distributed training/inference on Spark and BigDL
- Common feature engineering operations (for images, text, 3D images, etc.)
- Deep learning using Spark DataFrames (a minimal pipeline sketch follows this list)
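The sketch below shows how these pieces fit together: a Spark DataFrame feeds a Keras-style Analytics Zoo model through an NNEstimator in a Spark ML pipeline. Column names, the input path, the model shape and the hyperparameters are assumptions, and exact setter names may differ between Analytics Zoo versions.

```python
from pyspark.sql import SparkSession
from zoo.common.nncontext import init_nncontext
from zoo.pipeline.api.keras.models import Sequential
from zoo.pipeline.api.keras.layers import Dense
from zoo.pipeline.nnframes import NNEstimator
from bigdl.nn.criterion import BCECriterion

# Spark context with Analytics Zoo / BigDL initialised on the executors
sc = init_nncontext("analytics-zoo-sketch")
spark = SparkSession(sc)

# Data wrangling with plain Spark DataFrames; assumes a "features" column
# (array/vector of length 10) and a binary "label" column
df = spark.read.parquet("hdfs:///data/features.parquet")

# A Keras-style model defined with Analytics Zoo layers
model = Sequential()
model.add(Dense(16, activation="relu", input_shape=(10,)))
model.add(Dense(1, activation="sigmoid"))

# NNEstimator plugs the model into a Spark ML pipeline for distributed training
estimator = NNEstimator(model, BCECriterion()) \
    .setLearningRate(0.01) \
    .setBatchSize(128) \
    .setMaxEpoch(2) \
    .setFeaturesCol("features") \
    .setLabelCol("label")

nn_model = estimator.fit(df)          # distributed training on the cluster
predictions = nn_model.transform(df)  # distributed inference as a DataFrame op
```

The design point is that the whole workflow, from wrangling to training to inference, stays inside a single Spark application, with no data movement out of the cluster.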
The Analytics Zoo platform APIs are available for use in Python and Scala. The APIs also provide support for productionising model inference using POJOs (Plain Old Java Objects).
The potential that BigDL and related platforms like Analytics Zoo bring is considerable, and they can be used to address diverse use cases such as large-scale inference, hyperparameter tuning and transfer learning.
If you have any queries, or Deep Learning use cases to be implemented on Big Data or in GPU-based environments, feel free to drop us a mail at gs.ude.sun@gnireenigneatad