When people talk about ‘Hadoop’ they are usually referring to either the efficient storage or the efficient processing of large amounts of data. The standard approach to reliable, scalable data storage in Hadoop is HDFS (Hadoop Distributed File System), which may be a topic for a future blog post. MapReduce is a framework for efficient processing using a parallel, distributed algorithm. Over the past 18 months we have used MapReduce for a variety of analytic needs, building up use cases as we went along.
The first rule of MapReduce is that there are four words you need to understand: Key, Value, Map and Reduce.
- Keys and Values come as pairs: for every key, there is a value. However, don’t think of them as just two columns of numbers, for example ID and balance. The key and the value can be ANYTHING, as long as there is some sort of list of keys and corresponding values.
- Map: a function which takes a set of values and returns another set of keys and values. It’s just a function; it can be anything you can write.
- Reduce: a function which takes a key and its associated values, and returns a value and optionally a key. It is important to note that each time MapReduce calls the Reduce function, it passes in exactly one key together with all of that key’s values. If the data has many keys, the Reduce function will be called many times, once per key.
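To make the four terms a little more tangible, here is a minimal single-machine sketch in Python. It is not Hadoop's API: `run_mapreduce`, `map_colour` and `reduce_count` are hypothetical names invented for illustration, and a real framework would spread the work across many machines. It only shows the contract described above: Map emits key/value pairs, and Reduce is called once per key with all of that key's values.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy, in-memory imitation of the MapReduce flow (illustration only)."""
    grouped = defaultdict(list)
    # Map phase: every input (key, value) pair may emit zero or more pairs.
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)
    # Reduce phase: one call per distinct key, with ALL of that key's values.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}

def map_colour(row_number, colour):
    # Emit the colour as the key and 1 as the value.
    yield colour, 1

def reduce_count(colour, ones):
    # Called once per colour; 'ones' holds every value emitted for it.
    return sum(ones)

# Hypothetical input: (row number, card colour) pairs.
records = [(1, "blue"), (2, "gold"), (3, "blue")]
colour_counts = run_mapreduce(records, map_colour, reduce_count)
print(colour_counts)  # {'blue': 2, 'gold': 1}
```

Note that `reduce_count` runs twice here, once for "blue" and once for "gold", which is the "one key and all of its values" behaviour in miniature.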
Below I have tried to make this concrete using a few examples on a fictional dataset. Assume we have some data on credit card customers: for each customer the table holds an ID, the colour of their card and their balance (see example below).
Here are a few problems you may have to solve and how you would use MapReduce to do this:
Example 1: Selecting all the ‘blue’ cards
This is a very simple example, so simple in fact that we do not need a Reduce function. All we need is a Map function which looks at the data and outputs only the accounts with blue cards. We do not supply any input keys, and the input value here is CARD_COLOUR. The Map excludes accounts where the card colour is not blue. In this case there is no output key, and the output values are all the accounts with blue cards, as we expected.
It’s also worth noting that the values have retained their row numbers from the original table. The row number acted as the original input key, and we can think of it as where the data was stored originally. It is not required in this example, but note that the values are sorted in order of original key.
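As a rough sketch of Example 1, the map-only job could look like the Python below. The rows in `ACCOUNTS` are made up for illustration (they are not the table from this post), and `run_map_only` and `map_blue_cards` are hypothetical helper names rather than part of any Hadoop library. The row number is passed along as the key so that the output keeps its original position.

```python
# Hypothetical sample data; the values are invented for illustration.
ACCOUNTS = [
    {"ID": 101, "CARD_COLOUR": "blue", "BALANCE": 250},
    {"ID": 102, "CARD_COLOUR": "gold", "BALANCE": 1200},
    {"ID": 103, "CARD_COLOUR": "blue", "BALANCE": 75},
    {"ID": 104, "CARD_COLOUR": "green", "BALANCE": 0},
]

def map_blue_cards(row_number, account):
    # Emit the account only when its card is blue; otherwise emit nothing.
    if account["CARD_COLOUR"] == "blue":
        yield row_number, account

def run_map_only(records, map_fn):
    # No Reduce step: simply collect everything the Map function emits.
    output = []
    for key, value in records:
        output.extend(map_fn(key, value))
    return output

# Row numbers start at 1 and play the role of the original input key.
blue_cards = run_map_only(enumerate(ACCOUNTS, start=1), map_blue_cards)
for row_number, account in blue_cards:
    print(row_number, account["ID"], account["BALANCE"])
```

With this sample data, rows 1 and 3 come back, still in their original order.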
Example 2: Summing the card balance by card colour
This is another relatively straightforward example: a query for the total balance on each card type. In this instance we want to limit the amount of data we retrieve to just the card colour and balance, which might be appropriate if we had a lot of accounts. The Map function therefore returns every account, but only the card colour and balance for each one.
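A small Python sketch of this idea follows. The sample rows and the function names (`map_colour_balance`, `reduce_sum`) are assumptions made for illustration, and the grouping loop is a stand-in for the shuffling a real MapReduce framework does between Map and Reduce. The summing step is the one implied by the example's title.

```python
from collections import defaultdict

# Hypothetical sample data; the values are invented for illustration.
ACCOUNTS = [
    {"ID": 101, "CARD_COLOUR": "blue", "BALANCE": 250},
    {"ID": 102, "CARD_COLOUR": "gold", "BALANCE": 1200},
    {"ID": 103, "CARD_COLOUR": "blue", "BALANCE": 75},
    {"ID": 104, "CARD_COLOUR": "green", "BALANCE": 0},
    {"ID": 105, "CARD_COLOUR": "gold", "BALANCE": 300},
]

def map_colour_balance(row_number, account):
    # Keep every account, but pass on only the colour (key) and balance (value).
    yield account["CARD_COLOUR"], account["BALANCE"]

def reduce_sum(colour, balances):
    # Called once per colour with all of that colour's balances.
    return sum(balances)

# Map phase, with the emitted pairs grouped by key.
grouped = defaultdict(list)
for row_number, account in enumerate(ACCOUNTS, start=1):
    for colour, balance in map_colour_balance(row_number, account):
        grouped[colour].append(balance)

# Reduce phase: one call per card colour.
totals = {colour: reduce_sum(colour, balances) for colour, balances in grouped.items()}
print(totals)  # {'blue': 325, 'gold': 1500, 'green': 0}
```

Dropping the ID in the Map step is what keeps the intermediate data small: only two fields per account ever reach the Reduce step.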