Making data science accessible

Making data science accessible

Making data science accessible

When people talk about ‘Hadoop’ they are usually referring to either the efficient storing or processing of large amounts of data. The standard approach to reliable, scalable data storage in Hadoop is through the use of HDFS (Hadoop Distributed File System) which may be a topic for a future blog. MapReduce is a framework for efficient processing using a parallel, distributed algorithm. Over the past 18 months we have used MapReduce for a variety of analytic needs, building up use cases as we went along.

The first rule of MapReduce is that there are four words you need to understand: Key, Value, Map and Reduce:

-       Keys and Values come as pair, for every key, there is a value. However, don’t think of it as just a column of two numbers, for example ID and balance. The key and the value can be ANYTHING as long as there is some sort of list of keys and corresponding values.

-       Map: a function which takes a set of values and returns another set of keys and values. It’s just a function; it can be anything you can write.

Read Also:
Top 5 misconceptions about Big Data

-       Reduce: a function which takes a key and its associated values, and returns a value and optionally a key. It is important to note when the Reduce function is called by MapReduce, it only calls it using one key and all of that key's values. If the data has many keys, then many Reduce functions will be called.

Below I have tried to make this concrete using a few examples on a fictional dataset. If I assume we have some data on credit card customers – in the table we have an ID, the colour of their card and their balance (see example below).

Here are a few problems you may have to solve and how you would use MapReduce to do this:

Example 1: Selecting all the ‘blue’ cards 

This is a very simple example, in fact so simple that we do not need a Reduce function. All we need is a map function which looks at the data and outputs only the accounts with blue cards. There are no input Keys and the input Value here is CARD_COLOUR. The Map here excludes accounts where card colour is not blue. In this case, there is no output key and the output values are all the accounts with blue cards, as we expected.

Read Also:
How Restaurants are Cooking Up Insights with Analytics

It’s worth also noting the values have retained their row numbers from the original table. This was the original input key, and we can think of this as where the data was stored originally. It is not required in this example, but note the values are sorted in order of original key.

Example 2: Summing the card balance by card colour 

This is another relatively straightforward example, a query of the sum of balance on each of the card types. In this instance we want to limit the amount of data we retrieve to just the card colour and balance – this might be appropriate if we had a lot of accounts. The map function therefore returns all the accounts, but just the card colour and balance for each account.

 



HR & Workforce Analytics Summit 2017 San Francisco

19
Jun
2017
HR & Workforce Analytics Summit 2017 San Francisco

$200 off with code DATA200

Read Also:
How Universities and Big Business are Solving Big Data Problems Together
Read Also:
Cloudera Corporate Overview - Driving Big Value with Big Data [Video]

M.I.E. SUMMIT BERLIN 2017

20
Jun
2017
M.I.E. SUMMIT BERLIN 2017

15% off with code 7databe

Read Also:
Business Intelligence ROI: 5 Critical Ways to Show BI Value

Sentiment Analysis Symposium

27
Jun
2017
Sentiment Analysis Symposium

15% off with code 7WDATA

Read Also:
Top 5 misconceptions about Big Data

Data Analytics and Behavioural Science Applied to Retail and Consumer Markets

28
Jun
2017
Data Analytics and Behavioural Science Applied to Retail and Consumer Markets

15% off with code 7WDATA

Read Also:
Hadoop and NoSQL drive big data boom

AI, Machine Learning and Sentiment Analysis Applied to Finance

28
Jun
2017
AI, Machine Learning and Sentiment Analysis Applied to Finance

15% off with code 7WDATA

Read Also:
5 Reasons That Business Intelligence on Hadoop Projects Fails

Leave a Reply

Your email address will not be published. Required fields are marked *