MapReduce is a Distributed Data Processing Algorithm, introduced by Google.It is mainly inspired by Functional Programming model. MapReduce algorithm is mainly useful to process huge amount of data in parallel, reliable and efficient way in cluster environments. Its uses the technique “Divide and Conquer” algorithm to process large amount of data.It divides input task into smaller and manageable sub-tasks to execute them in-parallel.
MapReduce Algorithm Steps
MapReduce Algorithm uses the following three main steps:
Map Function : Map Function is the first step in MapReduce Algorithm. It takes input tasks (say DataSets. I have given only one DataSet in below diagram.) and divides them into smaller sub-tasks. Then perform required computation on each sub-task in parallel.
This step performs the following two sub-steps:
- Splitting :Splitting step takes input DataSet from Source and divide into smaller Sub-DataSets.
- Mapping :Mapping step takes those smaller Sub-DataSets and perform required action or computation on each Sub-DataSet.
The output of this Map Function is a set of key and value pairs as <Key, Value> as shown in the below diagram.
Sort & Shuffle Function :
It is the second step in MapReduce Algorithm. Shuffle Function is also know as “Combine Function”. It takes a list of outputs coming from “Map Function” and perform these two following sub-steps on each and every key-value pair:
- Merging step combines all key-value pairs which have same keys (that is grouping key-value pairs by comparing “Key”). This step returns <Key, List<Value>>.
- Sorting step takes input from Merging step and sort all key-value pairs by using Keys. This step also returns <Key, List<Value>> output but with sorted key-value pairs.
Reduce Function :It is the final step in MapReduce Algorithm. It performs only one step : Reduce step. It takes list of <Key, List<Value>> sorted pairs from Shuffle Function and perform reduce operation as shown below.
Final step output looks like first step output. However final step <Key, Value> pairs are different than first step <Key, Value> pairs. Final step <Key, Value> pairs are computed and sorted pairs.
We can observe the difference between first step output and final step output with some simple example.
Example 1: Word Counting
Count the number of occurrences of each word available in a DataSet. So in below example, count how many times Fruits appear and what is the count ?
Please find below example Input DataSet file. This is small set of data, but in real-time applications , they use very huge amount of Data.
Output: Counts of each fruit.
How it works:
Step1: MapReduce – Map Function (Split Step)
Step2: MapReduce – Map Function (Mapping Step)
Step 3: MapReduce – Sort & Shuffle Function
Step4: Reduce Function
So we can accumulate our learning from this example as :
- The input dataset can be divided into n number of chunks depending upon the amount of data and block size.
- In Map function, all the chunks are processed simultaneously at the same time, which embraces the parallel processing of data.
- In shuffling, happens which leads to aggregation of similar patterns.
- Finally, reducers combine them all to get a consolidated output as per the logic.
Now in next tutorial we will discuss what is Scope ?