When I was in engineering school and wanted to get some serious data crunching done (shout out to Patran/Nastran aficionados), I would go downstairs to the lab and chat up Harvey. He owned the interface to the powerful and expensive mainframe, and nothing was going to get slotted in or processed without the blessing of this high priest of what, at the time, was Big Data. As much as I liked Harvey, boy, am I glad that times have changed. But better number-crunching technology did not arrive overnight.
First Wave — Easier Data Access
The democratization of data access within the business has been years in the making. Twenty years ago, there were only a limited number of databases; they often ran on large systems, carried high barriers to entry such as licensing costs, and were used mainly by people with advanced SQL training. Since then, the world has evolved in a number of different directions. With the advent of MySQL in the mid-1990s, it became free and easy to start storing data, even with minimal relational database knowledge. MySQL then went on to power much of the website revolution of the late 1990s.
Second Wave — Cost Effective, Powerful Processing
Just as the first wave was starting to build, the need for a second wave was already emerging. While it became easier to stand up a website and its underlying database, the explosive growth of the internet was creating other problems. Indexing and searching all the content being created was a daunting task. The major search engines such as Yahoo and Google were struggling with traditional ways of searching stacks of data. So instead of indexing every piece of hay in the proverbial haystack, they found ways to break it up and “mapreduce” it over several batches. The foundations for Hadoop came out of these efforts, along with the work on Nutch by Doug Cutting.
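For readers who have not seen the pattern, here is a minimal, single-machine sketch of the map-and-reduce idea applied to counting words. It is an illustration only, not how Hadoop or any search engine actually implements it; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample batches are hypothetical.

```python
from collections import defaultdict


def map_phase(batch):
    """Emit a (word, 1) pair for every word in one batch of documents."""
    return [(word.lower(), 1) for doc in batch for word in doc.split()]


def shuffle(pairs):
    """Group the emitted values by word so each word can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each word's list of counts into a single total."""
    return {word: sum(values) for word, values in grouped.items()}


def word_count(batches):
    """Map each batch independently, then shuffle and reduce the combined output."""
    pairs = []
    for batch in batches:  # in a real cluster, each batch could go to a different machine
        pairs.extend(map_phase(batch))
    return reduce_phase(shuffle(pairs))


# Hypothetical haystack, split into two batches of documents.
batches = [["hay hay needle"], ["hay straw", "needle hay"]]
counts = word_count(batches)
print(counts)
```

The point of the split is that each batch is mapped without knowing about the others, which is what lets the work spread across many inexpensive machines instead of one large one.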
Third Wave – The Bridge to Easy and Powerful
While cost-effective and easy access sounds great, this democratization presents both opportunities and challenges, since most existing data still sits on legacy SQL systems. On the one hand, there are powerful new ways to combine NoSQL, SQL, and Hadoop for new areas such as IoT, as Matt Asay points out. On the other hand, there is now a modern Data Supply Chain (as Dan Woods notes) that must be put into place to manage all of this. That complexity can be intimidating for data architects, who only a decade ago were focused primarily on SQL and didn’t have to piece together NoSQL and Hadoop as well.