Simple Introduction to Map Reduce
The term MapReduce
actually is a compound word, which simply is a programming model/architecture used for processing large data sets in parallel,
normally in a distributed setting.
The above figure
shows the typical phase in a MapReduce program.
Phase 1:
In the initial phase the file contents are been read by the program
into an InputStream.
Phase 2:
In this phase (Some applications will combine phase 2 & 3
(mapping)) each line of the input file is read into a separate mapper instance that
will be executed in parallel, sometimes in a distributed setting.
Phase 3:
In this phase each
line from the previous phase is then fed into a Map function that
tokenize each term/data item and thereby converting it into a Key Value pair
e.g. <Term, 1>. In the figure above, the Key
represents the type of fruit while
the Value represents
the number of occurrence for a given term (Some programs aggregate
the values on a given line for each term before assigning it, while
some programs simply iterate over the entire set giving it an initial
value of 1)
More
about Map
Map
is actually a concept popular in functional programming, it's a
high(er) order function (high(er) order functions are functions that
take another function as input). A Map
simply applies a function to every
element in a list/collection. An example of this is given below :
We
define a function square, that's simply squares a number
// code is written in JavaScript
function square(x) {
return x * x
}
var numbers = [1,2,3,4,5] //array of numbers
number.map(square) // returns [1,4,9,16,25]
The map basically iterate though the list applying the function. A simple implementation of the map function in JavaScript is shown below :
Array.prototype.map = function(func) {
var result = [];
var arr = this // represents the array
for(var i = 0; i < arr.length; i++) {
result.push(func(arr[i]))
}
return result;
}
NB: For a more robust implementation of the map function please take a look at Mozilla's pollyfill implementation
Phase 4: In
this phase the Key-Value<K,V> pairs are sorted into buckets of familiar terms. The
process of moving items from one bucket to another and vice verse
can be seen as a shuffling effort, but it doesn't mean that a
shuffle sorting algorithm was used. This
stage reduce the number of map instances because as the buckets are
sorted and shuffled the empty buckets are GC'ed and memory is freed
up.
Phase
5: The buckets from the previous phase are then aggregated using
the reduce function.
Each
bucket is iterated over (same as with a map) and all the values of
each Key Value pair is aggregated to form a partial answer. After
the values have been aggregated the partial answers of each Bucket is then
collected and combined by the MapReduce program to produce a final answer as shown the figure.
More
about Reduce (Sometimes called Foldl)
Reduce
is actually a concept popular in functional programming. A Reduce
simply applies a function to every
element in a list/collection which then
combines the resulting value(s) to form a result.
Actually a foldr can be used to construct
a map function as seen below :
map f = foldr ((:) . f) []
In
the above example the function composition operation (.) simply combines
the array operand (:) with the function f. This simply means the function is applied to
the current item and then added to the array [] (default value).
A simple implementation of the reduce function in JavaScript is shown below : // code is written in JavaScript function add(a, b) { return a * b } var numbers = [1,2,3,4,5] //array of numbers number.reduce(add) // returns15
Below is an implementation of the reduce function is JavaScript. I'll try to explain how it works.
The function initially takes a
reducer
function, in our case the function is add and an initial value. The program iterates over the collection, if the initialValue/value was set then it executes the callback function passing in the :
aggregated value so far/value,
the current element being process
The reducer function simply takes the aggregated value and the current value and apply an operation to further reduce the value; the result of this operation is assigned to the
value
variable and the process repeats until all the elements in the collection is being processed, after which the variable containing the result is returned by the program as the final output.
Array.prototype.reduce = function(func, initVal) { var value = null, len = this.length; if (1 < arguments.length) { value = initVal; } for(var i = 0; i < len; i++) { if(value!=null) { value = func(value, this[i]); } else { value = this[i]; } } return value; } NB: For a more robust implementation of the reduce function please take a look at Mozilla's pollyfill implementation
For more information on Map Reduce, here are some readings : Google's White Paper on MapReduce Wikipedia Article on Map Wikipedia Article on Reduce Google's Article on Data Processing using MapReduce IBM's What is MapReduce Article
Comments
Post a Comment