DATA LAKE – 5 W’S AND BENEFITS
THE 5 W’S TO ASK BEFORE SETTING UP A DATA LAKE
What is a Data Lake?
There is one defining factor that separates a data lake from a traditional data warehouse: a data lake provides the flexibility to store raw data exactly as the source delivers it. No fixed schema needs to be imposed on the data, unlike the rigid tabular structure of its predecessor. The Data Lake provides a common pool in which multiple data points can be combined and shaped into useful insights tuned to the consumer’s needs and requirements. Governance standards are applied in the Data Lake to track lineage, enforce security and enable centralized auditing.
The Data Lake is thus a platform for efficient data storage that also supports tools for understanding the data, from quick exploration to advanced analytics – be it descriptive, diagnostic, predictive or prescriptive.
Why do we need a Data Lake?
A Data Lake enables an organisation to merge its different data silos and provide a uniform representation of its data assets. It lays the foundation for Data Science – analysis and insights that would otherwise be hard to derive.
When is a Data Lake required?
A Data Lake is required when the business needs to handle any two or more of the four major V’s – Volume, Velocity, Variety and Veracity – and achieve the fifth V, Value. Although the velocity aspect is gaining separate attention, with the newer term ‘Fast Data’ describing data in motion and ‘Big Data’ describing data at rest, the Data Lake still serves as the main storage repository to address this need.
Where do we host a Data Lake?
A Data Lake can be hosted on premises as well as on the cloud. Customers can procure hardware up front and manage a cluster in their existing on-premises data centres, or spin up on-demand instances on the cloud and subscribe to resources on a monthly basis. Alternatively, a hybrid solution is also possible.
Which technologies are best suited?
Since a Data Lake is typically built to handle ‘Big’ data, the specialised frameworks available in the Hadoop and Spark ecosystem are best suited. They are robust distributed tools created and used by giants like Google, Yahoo, Microsoft, Facebook, Twitter, Netflix, NASA and the NSA as solutions to the data explosion experienced even before the term ‘Big Data’ was coined. These tools are the current means of building a stable Data Lake.
A brief description of some of the common Apache frameworks: Hadoop – comprising HDFS, YARN and MapReduce – has been around since the beginning of the commercial Big Data stack. HDFS is still the default storage platform adopted today, and YARN is the de-facto resource management and job scheduling technology. The Spark engine has now largely replaced the Hadoop MapReduce paradigm owing to its faster, in-memory computation. For RT/NRT (Real Time/Near Real Time) processing there are Storm, Spark Streaming and Flink. Kafka acts as a distributed message broker and carries streams between components. ZooKeeper accompanies Kafka and Storm for cluster coordination and distributed synchronization. Machine Learning libraries such as Spark MLlib, Mahout and H2O integrate easily with the existing pipeline.
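To illustrate why Spark has displaced hand-written MapReduce jobs, here is a minimal PySpark sketch of the classic word count, assuming a cluster where Spark reads from and writes to HDFS; the paths and application name are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-on-the-lake").getOrCreate()

lines = spark.read.text("hdfs:///lake/raw/logs/")              # read raw text files from HDFS
counts = (lines.rdd
          .flatMap(lambda row: row.value.split())              # map: split each line into words
          .map(lambda word: (word, 1))                         # map: emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))                    # reduce: sum the counts per word

counts.toDF(["word", "count"]).write.mode("overwrite").parquet("hdfs:///lake/curated/word_counts/")
spark.stop()

The same logic expressed as a MapReduce job would need separate mapper, reducer and driver classes; in Spark it fits in a few lines and can also be run interactively.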
Some noteworthy Big Data vendors are Cloudera, Hortonworks and MapR (on premises), and AWS Elastic MapReduce, Microsoft Azure HDInsight and Google Dataproc (on the cloud).
Benefits of a Data Lake
Some benefits of a Data Lake are as follows:
Complementary and not an alternative
Data lakes complement an organization’s existing enterprise data warehouse and need not be thought of as its replacement. The data lake and the enterprise data warehouse can co-exist and work together as components of a logical data warehouse.
Imagine a vehicle manufacturer in the automotive industry that collects data from its distributors – geolocation of sold vehicles, contract details, equipment warranties and so on. The company initiates a plan to install sensors on vital components to study the performance of vehicle parts. Data is continuously generated every second from millions of vehicles sold in different places. The Data Lake is an ideal candidate to ingest such high-velocity data, and merging it with the existing data creates the potential for powerful use cases such as predicting when a part will fail and prescribing a replacement just before the failure occurs, as sketched below. ‘Hot’ transactional data stored in OLTP systems can function independently and support the ‘cold’, analytics-oriented data stored in the Data Lake.
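A hedged sketch of such a merge in PySpark, assuming hypothetical Parquet datasets for sensor readings and warranty records already landed in the lake; the column names and threshold are purely illustrative.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("predictive-maintenance-sketch").getOrCreate()

sensors  = spark.read.parquet("hdfs:///lake/raw/vehicle_sensors/")     # assumed sensor landing zone
warranty = spark.read.parquet("hdfs:///lake/curated/warranty/")        # assumed warehouse export

at_risk = (sensors.join(warranty, "vehicle_id")                        # merge ingested and warehouse data
           .where(F.col("vibration_rms") > 4.2)                        # illustrative wear threshold
           .select("vehicle_id", "part_id", "warranty_end_date"))

at_risk.write.mode("overwrite").parquet("hdfs:///lake/insights/at_risk_parts/")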
Handling relational data vs data of variety
Traditional systems have been built to support relational models in which data is stored as fields spread across columns and every new observation is a fresh record in a table. Though a warehouse holds data from disparate sources, an ETL process is required to put it into that structure. This approach either restricts the type of data stored and imposes a size/character limit, or requires a transformation (e.g. converting an image to a BLOB). The data lake design overcomes this limitation by allowing data of any format and any size to be stored.
The Data Lake can hold structured (relational), unstructured (images, videos, PDFs or encoded text files) and semi-structured (XML, JSON) data, all under one roof, thereby avoiding polyglot persistence.
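As a small illustration, a single Spark session can read all three classes of data from the same lake; the paths are hypothetical, and the binaryFile source assumes a recent Spark release.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("variety-sketch").getOrCreate()

orders = spark.read.option("header", "true").csv("hdfs:///lake/raw/orders/")    # structured
events = spark.read.json("hdfs:///lake/raw/clickstream/")                       # semi-structured
images = spark.read.format("binaryFile").load("hdfs:///lake/raw/images/")       # unstructured blobs

print(orders.count(), events.count(), images.count())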
Schema-on-write vs Schema-on-read
Schema-on-write systems impose a pre-defined (generally relational) schema on the data. In other words, data is not loaded until the use for it has been defined; traditional databases enforce the schema at load time.
Schema-on-read systems, on the other hand, do not force data to follow a particular schema. Data is written first and the schema is defined during analysis at a later stage.
You may use schema-on-write when:
• There is a standard ETL process to cleanse and transform data for a particular use case
• Schema is already decided before you store data
• Schema is static and does not evolve over time
You may choose schema-on-read when:
• You want to have different views of the same data
• You want flexibility in how your data is interpreted at read time, including a dynamic schema that evolves over time
• You would like to load your data before you know what to do with it
• You need flexibility in being able to store unstructured data
If the data at hand can potentially serve multiple use cases, some of which are analytical, the schema-on-read principle comes in handy and the data can be stored in the Data Lake, as in the sketch below.
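A minimal schema-on-read sketch in PySpark, assuming hypothetical telemetry files: ingestion stores the raw JSON untouched, and each consumer supplies its own schema only when reading.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Ingestion: no schema imposed, the files are stored exactly as received.
spark.read.text("hdfs:///staging/telemetry/") \
     .write.mode("append").text("hdfs:///lake/raw/telemetry/")

# Analysis, possibly months later: this consumer projects its own schema onto the raw files.
telemetry_schema = StructType([
    StructField("vehicle_id", StringType()),
    StructField("reading",    DoubleType()),
    StructField("ts",         TimestampType()),
])
readings = spark.read.schema(telemetry_schema).json("hdfs:///lake/raw/telemetry/")
readings.groupBy("vehicle_id").avg("reading").show()

A second consumer could read the very same files with a different schema, which is what gives several views of the same data.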
Access for multiple users
A Data Lake is not restricted to just the operational users. Users defined by their functional role can be granted access with restrictions and limits on their rights and usage. Existing user accounts can be linked using LDAP, or separate Kerberos-based session tickets can be issued to each user for a particular session period. This creates a fence around groups of users so they can focus on their tasks without disrupting others’ data, while still allowing data sharing under authorised control.
Offload heavy processing for faster results at lower cost
Workloads that consume a lot of processing power and memory on the main system can be offloaded to run on a clustered environment. A subset of the data can be moved temporarily from the source to the Data Lake, processed there, and the results returned to the original source system, as in the sketch below.
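A hedged sketch of this offloading pattern using Spark’s JDBC source; the PostgreSQL connection details and table names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("offload-sketch").getOrCreate()

jdbc_url = "jdbc:postgresql://source-db:5432/sales"
props = {"user": "etl_user", "password": "***", "driver": "org.postgresql.Driver"}

# Pull only the subset needed from the operational system.
orders = spark.read.jdbc(jdbc_url, "public.orders", properties=props)

# Do the heavy aggregation on the cluster instead of the source database.
monthly = (orders.groupBy("region", "order_month")
                 .sum("amount")
                 .withColumnRenamed("sum(amount)", "total_amount"))

# Return only the small result set to the original source system.
monthly.write.jdbc(jdbc_url, "public.monthly_totals", mode="overwrite", properties=props)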
Preserve raw data for Data Science and Exploration
The data lake can provide a self-sufficient, self-service analytics environment for data scientists and data analysts, where data exploration and other analytical tasks can be performed without waiting for the EDW team to model the data and then load it.
Data in its raw format may be perceived from different angles by different stakeholders. What looks like an ordinary dump today may hold the vital clue to convert copper into gold in the future.
Handling data at speed
If your application involves real-time, continuous streaming of data – from sensors, social media, high-speed routers and switches, etc. – it is best to use the Data Lake as the means of storage and handle the velocity using specialized technology that interfaces directly with the lake. Fast data has many applications, and its potential is best tapped using reliable tools like Spark Streaming, Flink, Kafka and Storm, which are built for scale and speed.
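As a minimal sketch, assuming a hypothetical Kafka topic named vehicle-sensors and the Spark-Kafka connector on the classpath, a Spark Structured Streaming job can land events continuously in the lake.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fast-data-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")                                              # requires the spark-sql-kafka package
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "vehicle-sensors")
          .load()
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp"))

query = (events.writeStream
         .format("parquet")                                             # append raw events to the lake
         .option("path", "hdfs:///lake/raw/vehicle_sensors/")
         .option("checkpointLocation", "hdfs:///lake/checkpoints/vehicle_sensors/")
         .start())

query.awaitTermination()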