Learn how Map Reduce Algorithm works with a simple example.. (Tutorial Day 7)

MapReduce Algorithm

MapReduce is a Distributed Data Processing Algorithm, introduced by Google.It is mainly inspired by Functional Programming model. MapReduce algorithm is mainly useful to process huge amount of data in parallel, reliable and efficient way in cluster environments. Its uses the technique “Divide and Conquer” algorithm to process large amount of data.It divides input task into smaller and manageable sub-tasks to execute them in-parallel.

MapReduce Algorithm Steps

MapReduce Algorithm uses the following three main steps:

Map Function : Map Function is the first step in MapReduce Algorithm. It takes input tasks (say DataSets. I have given only one DataSet in below diagram.) and divides them into smaller sub-tasks. Then perform required computation on each sub-task in parallel.

This step performs the following two sub-steps:

  • Splitting :Splitting step takes input DataSet from Source and divide into smaller Sub-DataSets.
  • Mapping :Mapping step takes those smaller Sub-DataSets and perform required action or computation on each Sub-DataSet.

The output of this Map Function is a set of key and value pairs as <Key, Value> as shown in the below diagram.

Sort & Shuffle Function :

It is the second step in MapReduce Algorithm. Shuffle Function is also know as “Combine Function”. It takes a list of outputs coming from “Map Function” and perform these two following sub-steps on each and every key-value pair:

  1. Merging
  2. Sorting
  • Merging step combines all key-value pairs which have same keys (that is grouping key-value pairs by comparing “Key”). This step returns <Key, List<Value>>.
  • Sorting step takes input from Merging step and sort all key-value pairs by using Keys. This step also returns <Key, List<Value>> output but with sorted key-value pairs.

 

Reduce Function :It is the final step in MapReduce Algorithm. It performs only one step : Reduce step. It takes list of <Key, List<Value>> sorted pairs from Shuffle Function and perform reduce operation as shown below.

Final step output looks like first step output. However final step <Key, Value> pairs are different than first step <Key, Value> pairs. Final step <Key, Value> pairs are computed and sorted pairs.

We can observe the difference between first step output and final step output with some simple example.

Example 1: Word Counting

Problem Statement:
Count the number of occurrences of each word available in a DataSet. So in below example, count how many times Fruits appear and what is the count ?

Input DataSet
Please find below example Input DataSet file. This is small set of data, but in real-time applications , they use very huge amount of Data.

i1

Output: Counts of each fruit.

How it works:

Step1: MapReduce – Map Function (Split Step)

i2

Step2: MapReduce – Map Function (Mapping Step)

i3

Step 3: MapReduce – Sort & Shuffle Function

i4

Step4: Reduce Function

i5

So we can accumulate our learning from this example as :

  1. The input dataset can be divided into n number of chunks depending upon the amount of data and block size.
  2. In Map function, all the chunks are processed simultaneously at the same time, which embraces the parallel processing of data.
  3. In shuffling, happens which leads to aggregation of similar patterns.
  4. Finally, reducers combine them all to get a consolidated output as per the logic.

 

Now in next tutorial we will discuss what is Scope ?

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s