Using Amazon Elastic MapReduce (EMR) to Scale Machine Learning Systems
Authors: Xuchen Zhang, Sreenidhi Sundaram
--
Overview
Machine Learning systems are ubiquitous, and now more than ever there is a need to scale, deploy, monitor, and maintain ML systems in production. There are several software engineering tools on the market that can be used to scale ML systems. In this blog post we dive into how Amazon Elastic MapReduce can be used to scale an ML system. We take the example of a Movie Recommendation service to demonstrate how Amazon EMR makes it easier to deploy an ML system into production.
What is Amazon EMR?
Amazon Elastic MapReduce (EMR) is a cloud-based big data platform for processing large amounts of data. It integrates easily with open-source tools. Amazon EMR makes it easy to set up and scale systems by automating time-consuming tasks. EMR offers built-in machine learning tools, and custom AMIs and bootstrap actions let you add your preferred libraries and tools. EMR also supports real-time streaming data from services like Apache Kafka, which can be used to create long-running, highly available, and fault-tolerant streaming data pipelines.
Scenario: Movie Recommendation Service
The specific scenario we would like to demonstrate is a Movie Recommendation service.
Dataset
The dataset is obtained from multiple sources. Some of the data comes from an event stream (Apache Kafka) that records which user watched which movie, along with their ratings for those movies. Each record has the following format: <user_id>, <movie_id>, <movie_type> and <rating>. The rest of the data is taken from the IMDB dataset. We need to constantly update our data and model to improve accuracy, which gives rise to a need for real-time data processing.
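To make the stream format concrete, here is a minimal sketch (in Python, using the kafka-python client) of how such rating events could be consumed and parsed. The topic name, broker address, and comma-separated encoding are assumptions for illustration and should be adapted to your own Kafka setup.

```python
# Minimal sketch of consuming rating events from the Kafka stream.
# Topic name, broker address, and message encoding are assumptions
# for illustration, not part of the original setup.
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "movie-ratings",                       # hypothetical topic name
    bootstrap_servers=["localhost:9092"],  # hypothetical broker address
    value_deserializer=lambda m: m.decode("utf-8"),
)

for message in consumer:
    # Each event is assumed to be a comma-separated record:
    # <user_id>,<movie_id>,<movie_type>,<rating>
    user_id, movie_id, movie_type, rating = message.value.split(",")
    print(user_id, movie_id, movie_type, float(rating))
```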
Benefits of Amazon EMR
For the movie recommendation service scenario, we obtain data from an event stream, so we need to be able to incorporate dynamic changes. Amazon EMR helps us achieve this. Following are the benefits of using Amazon EMR for this service —
- Integration with Kafka — helps achieve real-time data processing with Hadoop
- On-Demand — Elastic Load Balancing and Auto Scaling can help us reduce cost and achieve better performance.
Getting Started with Amazon EMR
The central component of Amazon EMR is the cluster. A cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances. Each instance in the cluster is called a node, and each node has a role within the cluster, referred to as the node type. You can read more about node types in the Amazon EMR documentation.
To deploy the Movie Recommendation service on Amazon EMR, you first need to launch an EMR cluster. The cluster should use Auto Scaling so that processing the movie-rating data with Hadoop and Kafka can scale while keeping costs down.
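As a sketch of what launching such a cluster could look like programmatically, the snippet below uses boto3's run_job_flow call. It assumes your AWS credentials and region are already configured; the cluster name, EMR release label, instance types and counts, S3 log path, and IAM roles are placeholder values, not a prescription.

```python
# Minimal sketch of launching an EMR cluster with boto3.
# All names, sizes, and paths below are placeholders.
import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="movie-recommendation-cluster",       # hypothetical cluster name
    ReleaseLabel="emr-6.10.0",                 # example EMR release
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    LogUri="s3://your-bucket/emr-logs/",       # placeholder S3 bucket
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])
```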
AWS Auto Scaling
AWS Auto Scaling monitors the application and automatically adjusts capacity to maintain steady, predictable performance at the lowest possible cost. Using AWS Auto Scaling, it is easy to set up application scaling for multiple resources across multiple services in minutes. The service provides a simple, powerful user interface that lets you build scaling plans for resources including Amazon EC2 instances and Spot Fleets, Amazon ECS tasks, Amazon DynamoDB tables and indexes, and Amazon Aurora Replicas. AWS Auto Scaling helps optimize performance and costs.
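For an EMR cluster specifically, scaling is usually attached directly to an instance group as an automatic scaling policy rather than managed through the AWS Auto Scaling console. Below is a minimal sketch using boto3's put_auto_scaling_policy; the cluster ID, instance group ID, capacity bounds, and memory threshold are placeholders chosen for illustration.

```python
# Minimal sketch of attaching an automatic scaling policy to an EMR
# instance group. IDs and thresholds are placeholders; the rule adds
# one core node when available YARN memory drops below 15%.
import boto3

emr = boto3.client("emr")

emr.put_auto_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",          # placeholder cluster ID
    InstanceGroupId="ig-XXXXXXXXXXXXX",   # placeholder core group ID
    AutoScalingPolicy={
        "Constraints": {"MinCapacity": 2, "MaxCapacity": 10},
        "Rules": [
            {
                "Name": "ScaleOutOnLowMemory",
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 1,
                        "CoolDown": 300,
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Namespace": "AWS/ElasticMapReduce",
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Threshold": 15.0,
                        "Unit": "PERCENT",
                    }
                },
            }
        ],
    },
)
```

With a policy like this in place, EMR adds a core node whenever available YARN memory stays below the threshold for the evaluation period, and you only pay for the extra capacity while it is running.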
Below are the general steps to follow to deploy a Movie Recommendation service on EMR with Auto Scaling —
Step 1: Create
Step 2: Upload
Step 3: Create security group
Step 4: Create Target Group
Step 5: Create Launch Configuration
Step 6: Create Load Balancer
Step 7: Create Auto Scaling Group and Policy
Step 8: Set up Kafka on AWS
To set up Kafka on AWS, you can follow the AWS documentation, which provides detailed steps for integrating Kafka with AWS.
Step 9: Monitor with CloudWatch
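As a sketch of what Step 9 could look like programmatically, the snippet below pulls one of EMR's standard CloudWatch metrics with boto3. The cluster ID is a placeholder; YARNMemoryAvailablePercentage is one of the metrics EMR publishes to the AWS/ElasticMapReduce namespace.

```python
# Minimal sketch of reading an EMR cluster metric from CloudWatch.
# The JobFlowId value is a placeholder cluster ID.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ElasticMapReduce",
    MetricName="YARNMemoryAvailablePercentage",
    Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXXX"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)

# Print the last hour of datapoints in time order.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```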
Analysis & Conclusion
Based on the walkthrough above, below is an analysis of the strengths and limitations of using Amazon EMR to deploy an ML system such as a Movie Recommendation service in this scenario.
Strengths:
- Low cost — Amazon EMR is very cost efficient. You pay a per-instance rate for every second used, with a one-minute minimum charge. You can launch a 10-node EMR cluster for as little as $0.15 per hour. Storing data in Amazon S3 instead of on the cluster can further reduce costs.
- Reliable — Amazon EMR lets you spend less time tuning and monitoring your cluster. EMR is tuned for the cloud and constantly monitors your cluster, retrying failed tasks and automatically replacing poorly performing instances. Clusters are highly available and automatically fail over in the event of a node failure. EMR provides the latest stable open-source software releases, so you don’t have to manage updates and bug fixes, which leads to fewer issues and less effort to maintain your environment. Fault tolerance is therefore high.
- Elastic — Unlike the rigid infrastructure of on-premises clusters, EMR decouples compute and storage, giving you the ability to scale each independently and take advantage of the tiered storage of Amazon S3. With EMR, you can provision one, hundreds, or thousands of compute instances or containers to process data at any scale. The number of instances can be increased or decreased automatically using Auto Scaling (which manages cluster sizes based on utilization) and you only pay for what you use.
- Ease of use — EMR is easy to set up and use. You simply specify the version of the EMR applications and the type of compute you want, and EMR takes care of provisioning, configuring, and tuning clusters so that you can focus on running analytics.
Limitations:
- Security — Although EMR automatically configures firewall settings, the model resides in the cloud rather than on-premise, which places it in the hands of a third-party provider.
- Limited ML capabilities — The machine learning capabilities of Elastic MapReduce (through big data tools such as Hadoop and Spark) are good, but they are not as easy to use as dedicated machine learning tools.
In conclusion, Amazon EMR is a very effective tool that can be used to scale ML systems with ease, so go ahead and get your hands dirty. Thanks for reading!