data-or-not-big-data-what-is-question-300x242.png" alt="Big Data or Not Big Data: What is question?" width="300" />

Before jumping on the Big Data bandwagon, I think it is important to ask whether the problem you have actually requires much data. That is, I think it's important to determine when Big Data is relevant to the problem at hand.

The question of relevancy is important for two reasons: (i) if the data are irrelevant, you can't draw appropriate conclusions (collecting more of the wrong data leads absolutely nowhere); and (ii) any mismatch between the problem statement, the underlying process of interest, and the data in question is critical to understand if you are going to distill any great truths from your data.

Big Data is relevant when you see some evidence of a non-linear or non-stationary generative process that varies with time (or at least, sampling time), anywhere on the spectrum from random drift to full-blown chaotic behavior. Non-stationary behaviors can arise from complex (often 'hidden') interactions within the underlying process generating your observable data. If you observe non-linear relationships but the underlying process is stationary, the problem reduces to a sampling issue. Big Data implicitly becomes relevant when we are dealing with processes embedded in a high-dimensional context (i.e., the dimensions that are left after dimension reduction). With higher embedding dimensions, we need more and more well-distributed samples to understand the underlying process. For problems where the underlying process is both linear and stationary, we don't necessarily need much data at all.
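To make the embedding-dimension point concrete, here is a minimal sketch of my own (not part of any particular method). It assumes samples drawn uniformly from a unit hypercube and a grid of 4 bins per axis, both arbitrary choices, and the function name `occupied_fraction` is hypothetical. It measures what fraction of grid cells a fixed budget of 1,000 samples actually reaches as the dimension grows.

```python
import random


def occupied_fraction(n_samples, dim, bins_per_axis=4, seed=0):
    """Fraction of grid cells in the unit hypercube hit by at least one sample.

    Assumption for illustration: samples are uniform and independent,
    which is the friendliest possible case for coverage.
    """
    rng = random.Random(seed)
    cells = set()
    for _ in range(n_samples):
        # Map each coordinate in [0, 1) to a bin index 0..bins_per_axis-1.
        cell = tuple(int(rng.random() * bins_per_axis) for _ in range(dim))
        cells.add(cell)
    return len(cells) / bins_per_axis ** dim


if __name__ == "__main__":
    # Same sample budget, increasing embedding dimension.
    for dim in (1, 2, 4, 6, 8):
        print(dim, round(occupied_fraction(1000, dim), 3))
```

With 8 dimensions there are 4^8 = 65,536 cells, so 1,000 samples cannot reach more than about 1.5% of them, however well distributed they are. Real data are rarely uniform, so this is only a rough picture of why the sample requirement grows with dimension.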

Note: The size of the circles does not reflect the frequency of observing any particular type of data. Complex (small, non-linear, non-stationary) but under-sampled data are not rare. However, for complex processes, you need more samples to capture the underlying variability and higher-order statistical structure, so the "need" for big data is greater. Whether you actually "have" sufficient data is a different issue. Likewise, for a simple linear stationary process, you need very little data.

The wrench here is in knowing when you are dealing with a non-linear or non-stationary process.
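As a first, very crude look at stationarity, one option is to split a series into consecutive windows and compare how far the window means wander relative to the noise inside each window. The sketch below is only a heuristic under assumptions I am making for illustration: the function name `window_mean_spread`, the choice of four windows, and the synthetic series are all hypothetical, and it is not a substitute for a formal test.

```python
import random
import statistics


def window_mean_spread(series, n_windows=4):
    """Range of the window means, scaled by the average within-window std dev.

    Values well below 1 suggest a stable mean; values well above 1
    suggest the mean is shifting over (sampling) time.
    """
    size = len(series) // n_windows
    windows = [series[i * size:(i + 1) * size] for i in range(n_windows)]
    means = [statistics.fmean(w) for w in windows]
    within_sd = statistics.fmean([statistics.stdev(w) for w in windows])
    return (max(means) - min(means)) / within_sd


rng = random.Random(0)

# Stationary example: white noise around a fixed mean of zero.
stationary = [rng.gauss(0, 1) for _ in range(2000)]

# Non-stationary example: the same kind of noise plus a slow drift in the mean.
drifting = [rng.gauss(0, 1) + 0.005 * t for t in range(2000)]

if __name__ == "__main__":
    print("stationary:", round(window_mean_spread(stationary), 2))
    print("drifting:  ", round(window_mean_spread(drifting), 2))
```

This only catches a shifting mean. Changes in variance, non-linear dependence, or chaotic dynamics would need other diagnostics, which is part of why this step is the hard one.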
