DEEP LEARNING ON APACHE SPARK: INTEL BIGDL
In the present-day data deluge, Data Lakes have become the repository technology of choice. For processing the Big Data they hold, Hadoop/Spark-based ecosystems provide scalable, parallelised and reliable environments.
With the advent of AI, the surge of interest in Deep Learning has posed a fundamental problem for Deep Learning based applications and research: processing the data in-situ and then running Deep Learning algorithms on the processed data. The added overhead of performing ETL on the data collected in data lakes, and then moving that data out to separate systems with Deep Learning capabilities, requires a lot of effort and expertise.
Intel BigDL addresses some of these issues. BigDL is a distributed deep learning library for Apache Spark, and Deep Learning on Spark-based systems becomes more viable because:
- Large amounts of data can be analysed on the same Big Data (Hadoop/Spark) cluster where the data is stored (HDFS, HBase, Hive, etc.)
- Data preparation activities like ETL, warehousing, feature engineering, etc. can be offloaded to the Spark/Hadoop application stack, and
- The barrier to entry for data scientists and data engineers, who are usually not Deep Learning experts, is relatively low. With BigDL they can use the familiar environment and programming models of a typical Spark/Hadoop based environment and still take the less-travelled path of Deep Learning (a minimal training sketch follows this list)
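To make this concrete, here is a minimal sketch of distributed training on a Spark cluster with BigDL's Python API (BigDL 0.x style). The data, model shape and hyperparameters are illustrative assumptions, and exact argument names may differ between BigDL versions; in a real job the RDD would be built from data already sitting in HDFS/Hive/HBase.

```python
import numpy as np
from pyspark import SparkContext
from bigdl.util.common import create_spark_conf, init_engine, Sample
from bigdl.nn.layer import Sequential, Linear, ReLU
from bigdl.nn.criterion import MSECriterion
from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

# Create a Spark context with BigDL's recommended configuration and
# initialise the BigDL engine on the executors
sc = SparkContext(conf=create_spark_conf().setAppName("bigdl-sketch"))
init_engine()

# In practice the data already lives in the cluster; here a small synthetic RDD
# of (features, label) pairs is wrapped into BigDL Samples
raw_rdd = sc.parallelize([(np.random.rand(10), np.array([1.0])) for _ in range(1000)])
train_rdd = raw_rdd.map(lambda p: Sample.from_ndarray(p[0], p[1]))

# A tiny feed-forward model built from BigDL layers
model = Sequential().add(Linear(10, 32)).add(ReLU()).add(Linear(32, 1))

# Distributed training runs as ordinary Spark jobs on the same cluster
optimizer = Optimizer(model=model,
                      training_rdd=train_rdd,
                      criterion=MSECriterion(),
                      optim_method=SGD(learningrate=0.01),
                      end_trigger=MaxEpoch(2),
                      batch_size=128)
trained_model = optimizer.optimize()
```

Because training is expressed as plain Spark jobs over an RDD, the same cluster, scheduler and data-locality machinery used for ETL also drives the Deep Learning workload.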
Cross Compatibility of BigDL models
Transfer Learning can be carried out in BigDL, since pretrained models built in Keras, Torch, Caffe or TensorFlow can be loaded into BigDL without any hassle. BigDL models can also be loaded back into the existing DL frameworks.
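A hedged sketch of the loading direction is shown below using BigDL's Python `Model` loaders. The file paths are placeholders and the exact loader signatures may vary across BigDL versions.

```python
from bigdl.nn.layer import Model

# Load a Caffe model (prototxt definition + binary weights) as a BigDL model
caffe_model = Model.load_caffe_model("deploy.prototxt", "weights.caffemodel")

# Load a Keras model definition (JSON) together with its HDF5 weights
keras_model = Model.load_keras(json_path="model.json", hdf5_path="weights.h5")

# Load a frozen TensorFlow graph, naming its input and output tensors
tf_model = Model.load_tensorflow("frozen_graph.pb",
                                 inputs=["input"],
                                 outputs=["output"])
```

Once loaded, these models behave like any other BigDL module, so they can be fine-tuned on the cluster or used directly for inference; BigDL likewise provides save methods for exporting models in the other direction.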
Fault Tolerance in BigDL
Existing deep learning frameworks are typically deployed as multiple long-running, potentially stateful tasks, which interact with each other (in a blocking fashion to support synchronous mini-batch SGD) for model computation and parameter synchronisation.
In contrast, BigDL runs a series of short-lived Spark jobs (for example, two jobs per mini-batch, as described in earlier sections), and each task in the job is stateless and non-blocking. As a result, BigDL programs can automatically adapt to dynamic resource changes (for example, pre-emption, failures, incremental scaling, resource sharing, etc.) in a timely fashion.
BigDL in action: Analytics Zoo
Analytics Zoo is a unified analytics and AI platform based on BigDL which seamlessly unites Spark, TensorFlow, Keras and BigDL into an integrated pipeline. With Analytics Zoo, users can do:
- Data wrangling and analysis using PySpark
- Deep learning model development using TensorFlow or Keras
- Distributed training/inference on Spark and BigDL
- Common feature engineering operations (for images, text, 3D images, etc.)
- Deep learning using Spark DataFrames (a minimal pipeline sketch follows this list)
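The sketch below shows how these pieces fit together: a Spark DataFrame feeds a Keras-style Analytics Zoo model through an NNEstimator in a Spark ML pipeline. Column names, the input path, the model shape and the hyperparameters are assumptions, and exact setter names may differ between Analytics Zoo versions.

```python
from pyspark.sql import SparkSession
from zoo.common.nncontext import init_nncontext
from zoo.pipeline.api.keras.models import Sequential
from zoo.pipeline.api.keras.layers import Dense
from zoo.pipeline.nnframes import NNEstimator
from bigdl.nn.criterion import BCECriterion

# Spark context with Analytics Zoo / BigDL initialised on the executors
sc = init_nncontext("analytics-zoo-sketch")
spark = SparkSession(sc)

# Data wrangling with plain Spark DataFrames; assumes a "features" column
# (array/vector of length 10) and a binary "label" column
df = spark.read.parquet("hdfs:///data/features.parquet")

# A Keras-style model defined with Analytics Zoo layers
model = Sequential()
model.add(Dense(16, activation="relu", input_shape=(10,)))
model.add(Dense(1, activation="sigmoid"))

# NNEstimator plugs the model into a Spark ML pipeline for distributed training
estimator = NNEstimator(model, BCECriterion()) \
    .setLearningRate(0.01) \
    .setBatchSize(128) \
    .setMaxEpoch(2) \
    .setFeaturesCol("features") \
    .setLabelCol("label")

nn_model = estimator.fit(df)          # distributed training on the cluster
predictions = nn_model.transform(df)  # distributed inference as a DataFrame op
```

The design point is that the whole workflow, from wrangling to training to inference, stays inside a single Spark application, with no data movement out of the cluster.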
The Analytics Zoo platform APIs are available for use in Python and Scala. The APIs also provide support for productionising model inference using POJOs (Plain Old Java Objects).
The potential that BigDL and related platforms like Analytics Zoo bring is considerable, and they can be used to address diverse use cases such as large-scale inference, hyperparameter tuning and transfer learning.
If you have any queries, or Deep Learning use cases to be implemented on Big Data or in GPU-based environments, feel free to drop us a mail at gs.ude.sun@gnireenigneatad