Big Data or Not Big Data: What is <your> question?

Before jumping on the Big Data bandwagon, I think it is important to ask whether the problem you have actually requires much data. That is, I think it's important to determine when Big Data is relevant to the problem at hand.

The question of relevancy is important for two reasons: (i) if the data are irrelevant, you can't draw appropriate conclusions (collecting more of the wrong data leads absolutely nowhere), and (ii) any mismatch between the problem statement, the underlying process of interest, and the data in question is critical to understand if you are going to distill any great truths from your data.

Big Data is relevant when you see some evidence of a non-linear or non-stationary generative process that varies with time (or at least, sampling time), anywhere on the spectrum from random drift to full-blown chaotic behavior. Non-stationary behaviors can arise from complex (often 'hidden') interactions within the underlying process generating your observable data. If you observe non-linear relationships but the underlying process is stationary, the problem reduces to a sampling issue. Big Data implicitly becomes relevant when we are dealing with processes embedded in a high-dimensional context (i.e., what's left after dimension reduction). With higher embedding dimensions, we need more and more well-distributed samples to understand the underlying process. For problems where the underlying process is both linear and stationary, we don't necessarily need much data at all.
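
To make the point about embedding dimension concrete, here is a minimal sketch of how quickly a fixed number of samples stops covering the space as dimensions are added. The grid of 10 bins per axis, the uniform sampling, and the function names `cells_needed` and `occupied_fraction` are all assumptions made for illustration; they are not taken from any particular method.

```python
import random


def cells_needed(dim, bins_per_axis=10):
    """Number of grid cells in a unit hypercube with bins_per_axis bins on each axis."""
    return bins_per_axis ** dim


def occupied_fraction(n_samples, dim, bins_per_axis=10, seed=0):
    """Fraction of grid cells that contain at least one of n_samples uniform random points."""
    rng = random.Random(seed)
    occupied = set()
    for _ in range(n_samples):
        # Map each coordinate in [0, 1) to the index of the bin it falls into.
        cell = tuple(int(rng.random() * bins_per_axis) for _ in range(dim))
        occupied.add(cell)
    return len(occupied) / cells_needed(dim, bins_per_axis)


if __name__ == "__main__":
    # Same sample budget, increasing dimension: coverage falls off quickly.
    for dim in (1, 2, 3, 4):
        print(dim, cells_needed(dim), round(occupied_fraction(1000, dim), 3))
```

With 1,000 samples, every cell is easily covered in one dimension, while in four dimensions at most a tenth of the 10,000 cells can be touched at all. The number of cells grows as a power of the dimension, which is one way to read "more and more well-distributed samples".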


Note: The size of the circles does not reflect the frequency of observing any particular type of data. Complex (small, non-linear, non-stationary) but under-sampled data are not rare. However, for complex processes, you need more samples to capture the underlying variability and higher-order statistical structure, so the "need" for big data is greater. Whether you actually "have" sufficient data is a different issue. Likewise, for a simple linear stationary process, you need very little data.

The wrench here is in knowing when you are dealing with a non-linear or non-stationary process.
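
As a rough first look at that question, one crude heuristic is to split a series into consecutive windows and ask whether the window means move around more than the noise within each window would explain. The sketch below assumes a one-dimensional series sampled at regular intervals; the name `drift_score`, the choice of four windows, and the synthetic data are all hypothetical. It is not a formal test such as the augmented Dickey-Fuller test, and it only reacts to shifts in the mean.

```python
import random
import statistics


def drift_score(series, n_windows=4):
    """Range of the window means divided by the average within-window spread.

    Values well below 1 suggest a stable mean; values well above 1 suggest
    the mean is moving over time (one symptom of non-stationarity).
    """
    size = len(series) // n_windows
    windows = [series[i * size:(i + 1) * size] for i in range(n_windows)]
    means = [statistics.mean(w) for w in windows]
    spreads = [statistics.pstdev(w) for w in windows]
    return (max(means) - min(means)) / statistics.mean(spreads)


# Synthetic examples (assumed data, for illustration only).
rng = random.Random(1)
stationary = [rng.gauss(0.0, 1.0) for _ in range(1000)]
trending = [0.01 * t + rng.gauss(0.0, 1.0) for t in range(1000)]

if __name__ == "__main__":
    print("stationary noise:", round(drift_score(stationary), 2))
    print("noise plus trend:", round(drift_score(trending), 2))
```

A low score does not prove stationarity, since the variance or the dependence structure can change while the mean stays put, so treat it as a prompt for closer inspection rather than a verdict.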


