The misanthrope’s vain struggle with big data

Thousands of companies today declare that they plan to invest in Big Data, sometimes for legitimate reasons. They want to hire scientists, invest in infrastructure, revolutionize the field, and so on. This is why many people are looking into changing careers and building the skills meant to secure them a position as a data scientist. Since few faculties are dedicated to Big Data science, it is safe to assume that most graduate students are in the same position.

How does one become a data scientist?

They need to either learn more than the curriculum teaches them or acquire the necessary skills after graduating. So the natural question is how exactly to prepare. To answer that, one needs to understand the basic needs of the companies hiring data scientists. You would think that such questions are easy to answer, since this domain is all the rage today. Surprisingly, they are not.

Exactly how big is Big Data? What is the average data set size that a typical hiring company works with? What is the distribution of data set sizes? What is the usual signal-to-noise ratio in the data sets used by these companies? These are all simple questions that should have numerical answers. They matter because the size of the data determines the minimal skill set required of a future hire.

There is not enough quantitative knowledge on how big Big Data is.

With so much attention on big data, one would expect quantitative estimates to be readily available. Yet a targeted, systematic search through dozens of scholarly papers, books, statistical reports, research news and media articles revealed that no one knows the answers to these questions. Answering them would require a study that collects the data from the hiring companies and from the data scientists currently working for them.

The state of the field

The most on-topic report available on data set size, “The State of Big Data Infrastructure”, does not say how many respondents it is based on (p.7, Fig. 6). The same report states that 90% of companies are reaping the benefits of big data and that most respondents have data sets over 1PB.

The State of Big Data Infrastructure, p.7, April 2015, based on a number of “X” respondents. Set size distribution as % of the total

This conflicts with two other reports stating that most companies fail at using big data and that only 29% of companies are actually using big data to make predictions. It also conflicts with the data from the KDnuggets surveys, in which the largest data set is less than 1TB for 70% of the respondents. Differences in findings can often be put down to the exact wording of the questions, which highlights another problematic aspect of statistical research.

While the discrepancy in declared beneficial outcomes can be put down to the nature of the questions, the difference of three orders of magnitude between the stated data set sizes (over 1PB versus under 1TB) remains questionable.
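To make the size gap concrete, here is a minimal sketch of the arithmetic, assuming decimal (SI) units for the two thresholds quoted by the reports. The function name `orders_of_magnitude` is only illustrative and does not come from either survey.

```python
import math

# Assumption: decimal (SI) units. Binary units (TiB, PiB) give a ratio
# of 1024, which is still roughly three orders of magnitude.
TB = 10**12  # bytes in one terabyte
PB = 10**15  # bytes in one petabyte


def orders_of_magnitude(larger_bytes: int, smaller_bytes: int) -> float:
    """Return log10 of the ratio between two sizes given in bytes."""
    return math.log10(larger_bytes / smaller_bytes)


gap = orders_of_magnitude(PB, TB)
print(f"1 PB is roughly 10^{round(gap)} times larger than 1 TB")
```

In other words, a data set at the 1PB threshold is about a thousand times larger than one at the 1TB threshold, which is far more than a difference in survey wording would normally explain.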

Current studies only seem to care about how much companies are planning to invest in data research, or about defining big data, categorizing it, or discussing the 3Vs, which are a nice but strictly qualitative, descriptive approach. Other studies look at particular data sets possibly just because they are already available, even if the data is specific to a business model that is not easy to replicate or even useful to most companies. Researchers look at the industry giants, but there is no word about the data set size in the average company. The KDnuggets surveys ask about the largest data set handled by a given respondent, without asking about the typical set size. It is indeed wise to be able to handle data surges; however, the daily load is much more useful to know for those trying to prepare for such a job.

Some papers even go as far as to forecast the future of the field, but the analysis is plainly qualitative, with no solid quantitative basis to justify the predictions.

 


