Making data science accessible

When people talk about ‘Hadoop’ they are usually referring to one of two things: the efficient storage of large amounts of data, or its efficient processing. The standard approach to reliable, scalable data storage in Hadoop is HDFS (the Hadoop Distributed File System), which may be a topic for a future blog. MapReduce is a framework for efficient processing using a parallel, distributed algorithm. Over the past 18 months we have used MapReduce for a variety of analytic needs, building up use cases as we went along.

The first rule of MapReduce is that there are four words you need to understand: Key, Value, Map and Reduce:

- Keys and Values come as a pair: for every key, there is a value. However, don’t think of it as just two columns of numbers, for example ID and balance. The key and the value can be ANYTHING, as long as there is some sort of list of keys and corresponding values.

- Map: a function which takes a set of values and returns another set of keys and values. It’s just a function; it can be anything you can write.

- Reduce: a function which takes a key and its associated values, and returns a value and optionally a key. It is important to note that when MapReduce calls the Reduce function, it calls it with one key and all of that key's values. If the data has many keys, the Reduce function will be called many times, once per key.
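The four words above can be sketched in plain Python. This is a minimal single-machine sketch of the flow, not how Hadoop is implemented (the real framework distributes the work across many machines); the function names `run_mapreduce`, `mapper` and `reducer` are illustrative choices of mine.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    # Map: each input record may emit any number of (key, value) pairs
    mapped = [pair for record in records for pair in mapper(record)]
    # Shuffle: sort the pairs so that equal keys sit together
    mapped.sort(key=itemgetter(0))
    # Reduce: called once per key, with all of that key's values
    return {key: reducer(key, [value for _, value in group])
            for key, group in groupby(mapped, key=itemgetter(0))}
```

For example, a word count would use a `mapper` that emits `(word, 1)` for every word in a line, and a `reducer` that sums the values for each word.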

Below I have tried to make this concrete using a few examples on a fictional dataset. Assume we have some data on credit card customers: a table with an ID, the colour of their card and their balance (see example below).

Here are a few problems you may have to solve and how you would use MapReduce to do this:

Example 1: Selecting all the ‘blue’ cards 

This is a very simple example, in fact so simple that we do not need a Reduce function. All we need is a Map function which looks at the data and outputs only the accounts with blue cards. There is no input key, and the input value here is CARD_COLOUR. The Map excludes accounts where the card colour is not blue. In this case there is no output key, and the output values are all the accounts with blue cards, as expected.

It is also worth noting that the values have retained their row numbers from the original table. This was the original input key, and we can think of it as where the data was stored originally. It is not required in this example, but note that the values come back sorted in order of the original key.
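A map-only version of Example 1 can be sketched as follows. The `accounts` list is a made-up stand-in for the fictional table (the original sample table is not reproduced here), and the field names are my assumptions.

```python
# Hypothetical stand-in for the fictional credit card table
accounts = [
    {"id": 1, "card_colour": "blue", "balance": 150.0},
    {"id": 2, "card_colour": "gold", "balance": 900.0},
    {"id": 3, "card_colour": "blue", "balance": 320.0},
    {"id": 4, "card_colour": "red",  "balance": 75.0},
]

def map_blue(row_number, account):
    # Map-only job: emit the account only when the card is blue.
    # The original row number (the input key) is carried through unchanged.
    if account["card_colour"] == "blue":
        yield (row_number, account)

# No Reduce step: the Map output is the final answer,
# still in order of the original row-number key.
blue_cards = [pair for row, acct in enumerate(accounts)
              for pair in map_blue(row, acct)]
```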

Example 2: Summing the card balance by card colour 

This is another relatively straightforward example: a query summing the balance on each of the card types. In this instance we want to limit the amount of data we retrieve to just the card colour and balance, which might be appropriate if we had a lot of accounts. The Map function therefore returns all the accounts, but only the card colour and balance for each account.
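The full Map-plus-Reduce flow for this sum can be sketched as below. Again the data and field names are illustrative stand-ins for the fictional table, and Python's `itertools.groupby` plays the part of the shuffle that groups each key's values together before Reduce is called.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical stand-in for the fictional credit card table
accounts = [
    {"id": 1, "card_colour": "blue", "balance": 150.0},
    {"id": 2, "card_colour": "gold", "balance": 900.0},
    {"id": 3, "card_colour": "blue", "balance": 320.0},
    {"id": 4, "card_colour": "red",  "balance": 75.0},
]

def map_colour_balance(account):
    # Keep only the two fields we need: (card_colour, balance)
    yield (account["card_colour"], account["balance"])

def reduce_sum(colour, balances):
    # Called once per colour, with all of that colour's balances
    return colour, sum(balances)

mapped = [pair for acct in accounts for pair in map_colour_balance(acct)]
mapped.sort(key=itemgetter(0))                      # the shuffle step
totals = dict(reduce_sum(colour, [b for _, b in grp])
              for colour, grp in groupby(mapped, key=itemgetter(0)))
# totals is {"blue": 470.0, "gold": 900.0, "red": 75.0}
```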

 


