Hadoop on AWS: Benefits of EMR
by Kumar Sambhav, Research Computing, NUS Information Technology
The way Big Data is managed on Hadoop clusters has shifted considerably in recent years, from clusters that sysadmins managed at the command line to centrally managed on-prem platforms such as Cloudera, Hortonworks and MapR. All of these platforms share a primary problem: they depend on physical hardware resources. This core problem gives rise to several other issues that Hadoop deployments on physical systems face. Some of those issues are listed and discussed here:
Difficulties in Upgrades: Scalability Issues
This issue raises two concerns. First, not every hardware configuration is optimized for new and evolving releases. This leaves system owners unable to upgrade the software or to adopt new features selectively.
Second, hardware upgrades require significant investment, so finding the right configuration for a particular stack can turn out to be very expensive (especially in HPC-like configurations).
Lack of Flexibility in execution
Most cluster services in a multi-service, multi-tenant system have to run around the clock. This constantly drains operational resources such as power, cooling and space, even while the cluster is idle.
Monitoring Problems
The Hadoop ecosystem consists of many interdependent services, which makes management and monitoring a nightmare for sysadmins. Managed service platforms like Cloudera or Hortonworks close this gap a little, but the problem largely remains.
Cloud based solution: AWS EMR
The problems mentioned above are among those that Amazon EMR solves. EMR stands for Elastic MapReduce, but its functionality is not limited to the Hadoop MapReduce programming model. EMR is a managed service platform that lets users run their big data workloads in the ecosystem of their choice (such as Apache Hadoop or Apache Spark).
An EMR cluster runs on EC2 instances in the region and hardware configuration of the user's choice. Organizations can therefore cut costs substantially by using reserved instances as well as spot instances.
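As a rough illustration of mixing purchase options, the sketch below builds the InstanceGroups portion of a request for boto3's EMR `run_job_flow` call, keeping the master and core nodes on-demand and placing task nodes on the spot market. The helper name `build_instance_groups`, the instance type and the node counts are assumptions chosen for illustration, not recommendations. Reserved-instance pricing is a billing arrangement rather than a request parameter, so it does not appear here.

```python
def build_instance_groups(core_count=2, spot_task_count=4,
                          instance_type="m5.xlarge"):
    """Return an InstanceGroups list for boto3's EMR run_job_flow call.

    Assumption for this sketch: master and core nodes stay on-demand
    because they hold cluster state and HDFS data, while task nodes only
    add compute and are therefore safer candidates for spot capacity.
    """
    groups = [
        {
            "Name": "Master",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "Core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    # Add a spot-priced task group only when task nodes are requested.
    if spot_task_count > 0:
        groups.append(
            {
                "Name": "Task (spot)",
                "InstanceRole": "TASK",
                "Market": "SPOT",
                "InstanceType": instance_type,
                "InstanceCount": spot_task_count,
            }
        )
    return groups
```

The returned list would be passed as `Instances={"InstanceGroups": build_instance_groups(), ...}` when creating the cluster.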
The seamless integration of EMR with other AWS services makes activities like network management (VPC), cluster usage monitoring (CloudWatch), auditing (CloudTrail) and user identity management (IAM) simple to achieve and manage.
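To make the CloudWatch integration concrete, here is a minimal sketch that prepares a query for the EMR `IsIdle` metric, which reports whether a running cluster is doing any work. The helper name `idle_metric_query` and the cluster ID used in the usage comment are hypothetical; the resulting dictionary is meant to be passed to boto3's CloudWatch `get_metric_statistics` call.

```python
from datetime import datetime, timedelta, timezone


def idle_metric_query(cluster_id, hours=1, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics.

    The query asks for the average of the EMR IsIdle metric over the last
    `hours` hours, in five-minute buckets, for one cluster.
    """
    now = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "IsIdle",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 300,  # seconds per data point
        "Statistics": ["Average"],
    }


# Usage (requires boto3 and AWS credentials; "j-EXAMPLE" is a placeholder):
#   import boto3
#   query = idle_metric_query("j-EXAMPLE", hours=2)
#   stats = boto3.client("cloudwatch").get_metric_statistics(**query)
```

A cluster whose average stays close to 1 over a long window is a candidate for termination.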
Apart from these usual benefits that come with EMR being part of the AWS ecosystem, one major benefit that addresses the lack of flexibility is the choice of execution modes. EMR can execute in Cluster mode or Step mode, as described here:
CLUSTER | The cluster continues to run until the user chooses to terminate it. This suits users who want certain services available 24x7 or for a considerable duration. |
STEP | In step mode, the user defines the steps to run, and the cluster is terminated once those steps finish executing. This suits users who have a very specific purpose for the cluster. |
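The two modes can be sketched with boto3's EMR `run_job_flow` call, where the `KeepJobFlowAliveWhenNoSteps` flag decides whether the cluster outlives its steps. The helper names, the release label, the instance types and the S3 path below are assumptions made for illustration; the default role names are the ones EMR creates by default and may differ in a given account.

```python
def spark_step(name, script_uri):
    """Describe one Spark job as an EMR step run through command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


def build_cluster_request(name, steps=None, keep_alive=False):
    """Build keyword arguments for boto3.client("emr").run_job_flow().

    keep_alive=True  -> cluster mode: stays up until explicitly terminated.
    keep_alive=False -> step mode: terminates after the last step finishes.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release, adjust as needed
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance type
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
        },
        "Steps": steps or [],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Usage (requires boto3 and AWS credentials; the bucket is a placeholder):
#   import boto3
#   step_mode = build_cluster_request(
#       "nightly-etl", steps=[spark_step("etl", "s3://example-bucket/etl.py")])
#   cluster_mode = build_cluster_request("shared-analytics", keep_alive=True)
#   boto3.client("emr").run_job_flow(**step_mode)
```

Keeping the request construction separate from the API call makes it possible to inspect or test the configuration without launching anything.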
On top of all these benefits, users can access the clusters through Jupyter notebooks without having to provision any resources for them.
Thus, EMR simplifies almost every task that is difficult on on-prem managed platforms. This shortens the time it takes for Big Data projects as a whole to be accepted and started.