Classic analytics, traditionally supported by a data warehouse, yields focus and insight by examining an organization’s past actions. One example is measuring the supply chain of materials into a product or service that is brought to market, a practice common across industries from telecom and financial services to pharmaceuticals and retail. Other examples measure sales across product channels and customer demographics, or cash flow into, out of, and through an organization. Most organizations use some sort of data warehouse or set of business intelligence tools to dissect their own transactional data. Mature organizations conduct analytics in real time, and some even use predictive modeling to forecast and support decisions.
Big Data analytics shifts the focus from analyzing internal mechanisms to understanding events that happen outside an organization. It is now possible to leverage data about those external events. Tapping into social media, news feeds, and product review data can provide insight into how customers view an organization’s products and services. In some cases, tapping into machine logs will yield insight into how an organization’s key stakeholders (customers, employees, etc.) deliver or use the final products. Big Data management systems like Hadoop help manage the storage of this information. Moving this information into a system like Hadoop can be challenging, and teasing this massive amount of data into actionable intelligence can seem impossible.
Big Data analytics requires new approaches and techniques for integrating data, beyond those of classic data warehousing ETL. Specifically, repurposing data quality techniques can help solve Big Data integration challenges. Recently I set up two of the leading discovery and visualization tools, Tableau and Qlik Sense, to access data in Hadoop. These tools require access through Hive, which provides a SQL-like interface to the data. To get the Hadoop data into Hive, we had to structure it. This meant parsing the text from the logs and articles into traditional, relational columns and rows. Processing massive amounts of text data from a nearly limitless pool of file formats is resource intensive and an almost irrational endeavor.
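To make the structuring step concrete, here is a minimal sketch of parsing raw log text into rows and columns. It is not the process we actually ran. The log layout, the column names, and the functions parse_line and logs_to_tsv are all hypothetical, and real logs and articles would need a pattern per format.

```python
import re

# Hypothetical log layout: "<date> <time> <LEVEL> <free-text message>".
# Real sources vary widely; each format would need its own pattern.
LOG_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)$"
)
COLUMNS = ["date", "time", "level", "message"]


def parse_line(line):
    """Return the column values for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    return [match.group(name) for name in COLUMNS]


def logs_to_tsv(lines):
    """Convert raw log lines to tab-separated rows, skipping unparseable lines."""
    rows = []
    for line in lines:
        fields = parse_line(line)
        if fields is None:
            continue  # in practice, route rejects to a separate file for review
        # Replace embedded tabs so they cannot be mistaken for column delimiters.
        rows.append("\t".join(field.replace("\t", " ") for field in fields))
    return "".join(row + "\n" for row in rows)


# Made-up sample input, for illustration only.
sample = [
    "2015-03-02 10:15:01 ERROR Payment gateway timeout",
    "not a log line",
    "2015-03-02 10:15:07 INFO Order 1182 shipped",
]
print(logs_to_tsv(sample), end="")
```

Tab-separated output like this could then be loaded into a Hive table declared with ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t', assuming the table's columns match the ones chosen here. Even in this toy case, the second sample line is silently dropped, which hints at why doing this across many file formats gets expensive quickly.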