Simple Introduction to Map Reduce

The term MapReduce actually is a compound word, which simply is a programming model/architecture used for processing large data sets in parallel, normally in a distributed setting.




The above figure shows the typical phase in a MapReduce program.

Phase 1: In the initial phase the file contents are been read by the program into an InputStream.

Phase 2: In this phase (Some applications will combine phase 2 & 3 (mapping)) each line of the input file is read into a separate mapper instance that will be executed in parallel, sometimes in a distributed setting.

Phase 3: In this phase each line from the previous phase is then fed into a Map function that tokenize each term/data item and thereby converting it into a Key Value pair e.g. <Term, 1>. In the figure above, the Key represents the type of fruit while the Value represents the number of occurrence for a given term (Some programs aggregate the values on a given line for each term before assigning it, while some programs simply iterate over the entire set giving it an initial value of 1)
More about Map

Map is actually a concept popular in functional programming, it's a high(er) order function (high(er) order functions are functions that take another function as input). A Map simply applies a function to every element in a list/collection. An example of this is given below :

We define a function square, that's simply squares a number

// code is written in JavaScript

function square(x) {
  return x * x
}

var numbers = [1,2,3,4,5] //array of numbers

number.map(square) // returns [1,4,9,16,25]

The map basically iterate though the list applying the function. A simple implementation of the  map function in JavaScript is shown below :

Array.prototype.map =  function(func) {
var result = [];
var arr = this // represents the array
for(var i = 0; i < arr.length; i++) {
    result.push(func(arr[i]))
}
return result;
}

NB: For a more robust implementation of the map function please take a look at Mozilla's pollyfill implementation

Phase 4: In this phase the Key-Value<K,V> pairs are sorted into buckets of familiar terms. The process of moving items from one bucket to another and vice verse can be seen as a shuffling effort, but it doesn't mean that a shuffle sorting algorithm was used. This stage reduce the number of map instances because as the buckets are sorted and shuffled the empty buckets are GC'ed and memory is freed up.

Phase 5: The buckets from the previous phase are then aggregated using the reduce function. Each bucket is iterated over (same as with a map) and all the values of each Key Value pair is aggregated to form a partial answer. After the values have been aggregated the partial answers of each Bucket is then collected and combined by the MapReduce program to produce a final answer as shown the figure.

More about Reduce (Sometimes called Foldl)

Reduce is actually a concept popular in functional programming. A Reduce simply applies a function to every element in a list/collection which then combines the resulting value(s) to form a result. Actually a foldr can be used to construct a map function as seen below :


map f = foldr ((:) . f) []


In the above example the function composition operation (.) simply combines the array operand (:) with the function f. This simply means the function is applied to the current item and then added to the array [] (default value).


A simple implementation of the reduce function in JavaScript is shown below :

// code is written in JavaScript

function add(a, b) {
  return a * b
}

var numbers = [1,2,3,4,5] //array of numbers

number.reduce(add) // returns 15


Below is an implementation of the reduce function is JavaScript. I'll try to explain how it works.

The function initially takes a reducer function, in our case the function is add and an initial value.  The program iterates over the collection, if the initialValue/value was set then it executes the callback function passing in the : 
  1. aggregated value so far/value, 
  2. the current element being process

The reducer function simply takes the aggregated value and the current value and apply an operation to further reduce the value; the result of this operation is assigned to the value variable and the process repeats until all the elements in the collection is being processed, after which the variable containing the result is returned by the program as the final output. 


Array.prototype.reduce = function(func, initVal) {
    var value = null, len = this.length;
    if (1 < arguments.length) {
      value = initVal;
    }
    
    for(var i = 0; i < len; i++) {
        if(value!=null) {
            value = func(value, this[i]);
        } else {
            value = this[i];
        }
    }
    
    return value;  
}

NB: For a more robust implementation of the reduce function please take a look at Mozilla's pollyfill implementation


For more information on Map Reduce, here are some readings :

Google's White Paper on MapReduce
Wikipedia Article on Map
Wikipedia Article on Reduce
Google's Article on Data Processing using MapReduce 
IBM's What is MapReduce Article

Comments

Popular posts from this blog

JavaScript Module Pattern: 2 Forms

Pseudo-Random UUID Generation with mask support

Mocking Ajax with the JQuery Mockjax Library