I wrote in a previous post about the fallacy of the word “big” in the phrase “Big Data”. This catchphrase, which has become associated with “everything having to do with the analysis of data”, is a poisonous co-opting of the analytics space to fit a vendor’s needs. With its reliance on capturing massive amounts of data, it creates a major barrier to adoption for analytics and confusion over exactly what data analytics can and can’t do. If you accept the premise that “Big Data” is a lie, and that anyone can benefit from analyzing data, then the next logical question is, “now what?”
You already have a LOT of data. We’ve all seen the examples of “how much data is created every second/minute” floating around (here is one if you have not), but those describe web-scale, massively consumed platforms (and are a huge part of the Big Data Fallacy, I might add). Even so, any reasonably sized organization is likely curating anywhere from 10 to 100 TB of just “data”, with most organizations adding around 50% more each year. That data comes in many forms and formats: RDBMS (databases), files, logs, images, video, audio, source code, web pages, emails, forums and Intranet sites, and a bunch of others I am not thinking of so early in the morning. The value in analytics comes from linking data together, so in a world full of data and, most critically, data TYPES, how do you actually get started?
There are two “big bucket” schools of thought for this:
1) The “Big Data Fallacy” school: Put all the data in one place. On this view, it is the only way to capture all the information you have in a usable format. This is not just the storage vendor view (think of the term “Data Lake”); it is also the view of common analytics tools like Hadoop: “Dump everything in here and then you can analyze it”.