There is a lot of hype surrounding data and analytics. Firms are constantly exhorted to put strategies in place to collect and analyze big data, and warned about the potential negative consequences of not doing so. For example, the Wall Street Journal recently suggested that companies sit on a treasure trove of customer data but for the most part do not know how to use it. In this article we explore why. Based on our work with companies that are trying to extract concrete and usable insights from petabytes of data, we have identified four common mistakes managers make when it comes to data.
The first challenge limiting the value of big data to firms is compatibility and integration. One of the key characteristics of big data is that it comes from a variety of sources. However, if this data is not naturally congruent or easy to integrate, the variety of sources can make it difficult for firms to actually save money or create value for customers. For example, in one of our projects we worked with a firm that had beautiful data on customer purchases and loyalty and a separate database on online browsing behavior, but almost no way of cross-referencing the two sources to understand whether certain browsing behavior was predictive of sales. Firms can respond to the challenge by creating “data lakes”, which hold vast amounts of data in unstructured form. However, the very fact that the vast swathes of data now available to firms are often unstructured, such as strings of text, means they are very difficult to store in as structured a way as was possible when data was merely binary. That, in turn, often makes the data extremely difficult to integrate across sources.
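The cross-referencing problem can be made concrete with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the field names (`loyalty_id`, `cookie_id`, `spend`, `product_views`), the sample records, and the `id_map` lookup table that links a browsing cookie to a loyalty account. It is not the firm's actual setup. The point is simply that browsing records can be tied to purchases only where some shared identifier exists.

```python
# Hypothetical purchase records, keyed by loyalty-programme ID.
purchases = [
    {"loyalty_id": "L1", "spend": 120.0},
    {"loyalty_id": "L2", "spend": 0.0},
    {"loyalty_id": "L3", "spend": 45.5},
]

# Hypothetical browsing records, keyed by an anonymous cookie ID.
browsing = [
    {"cookie_id": "C9", "product_views": 14},
    {"cookie_id": "C7", "product_views": 2},
    {"cookie_id": "C4", "product_views": 8},
]

# Assumed lookup table linking cookies to loyalty accounts.
# Cookie C4 has no known owner, so it cannot be linked.
id_map = {"C9": "L1", "C7": "L2"}


def link_sources(purchases, browsing, id_map):
    """Join browsing sessions to purchases through the ID lookup table.

    Returns the linked rows and the cookie IDs that could not be matched.
    """
    spend_by_customer = {p["loyalty_id"]: p["spend"] for p in purchases}
    linked, unmatched = [], []
    for session in browsing:
        loyalty_id = id_map.get(session["cookie_id"])
        if loyalty_id in spend_by_customer:
            linked.append({
                "loyalty_id": loyalty_id,
                "product_views": session["product_views"],
                "spend": spend_by_customer[loyalty_id],
            })
        else:
            unmatched.append(session["cookie_id"])
    return linked, unmatched


linked, unmatched = link_sources(purchases, browsing, id_map)
print(f"linked {len(linked)} of {len(browsing)} browsing records")
print("could not link:", unmatched)
```

Without something like `id_map`, every browsing record would land in `unmatched`, and the question of whether browsing predicts sales could not even be posed.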
The second challenge to making big data valuable is its unstructured nature. Specialized advances are being made in mining text-based data, where context and technique can lead to insights similar to those obtained from structured data, but other forms, such as video data, are still not easily analyzed. For example, despite state-of-the-art facial recognition software, authorities were unable to identify the two Boston Marathon bombing suspects from a multitude of video data, as the software struggled to cope with photos of their faces taken from a variety of angles.
Given the challenges of gaining insights from unstructured data, firms have been most successful when they use it initially to augment the speed and accuracy of existing data analysis practices. For example, in oil and gas exploration, big data is used to enhance existing operations and data analysis surrounding seismic drilling. Though the data these firms use may have increased in velocity, variety, and volume, ultimately it is still being used for the same purpose. In general, starting out with the hope of using unstructured data to generate new hypotheses is problematic until firms have “practiced” and gained expertise in using unstructured data to enhance their answers to an existing question.
The third challenge, and in our opinion the most important factor limiting how valuable big data is to firms, is the difficulty of establishing causal relationships within large pools of overlapping observational data. Very large data sets usually contain a number of very similar or virtually identical observations, which can produce spurious correlations and, as a result, mislead managers in their decision-making. The Economist recently pointed out that ‘in a world of big data the correlations surface almost by themselves’, and a Sloan Management Review blog post emphasized that while many firms have access to big data, such data is not ‘objective’, since the difficulty lies in distilling ‘true’ actionable insights from it. Similarly, typical machine learning algorithms used to analyze big data identify correlations that may not necessarily offer causal, and therefore actionable, managerial insights.
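One way to see how correlations can ‘surface almost by themselves’ is a small simulation. The sketch below is an illustration under stated assumptions, not an analysis of any real data set: it generates a hypothetical `sales` series of pure random noise and then checks it against many other random series that are, by construction, unrelated to it. The sample sizes, the seed, and the function names are all chosen for illustration. It shows only one mechanism behind spurious correlation (searching across many candidate variables), not the overlapping-observations problem specifically.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(n_obs=20, n_candidates=1000, seed=0):
    """Largest |r| between a noise 'sales' series and unrelated noise series."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sales = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_candidates):
        # Each candidate is independent noise: it cannot cause or predict sales.
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(sales, candidate)))
    return best


best = strongest_chance_correlation()
print(f"strongest |r| among 1000 unrelated variables: {best:.2f}")
```

With only 20 observations and 1,000 candidate variables, at least one unrelated variable will typically appear strongly correlated with sales. A manager who sees only that "winning" variable has no way to tell it apart from a real driver, which is why a correlation found by search needs a causal test, such as an experiment, before it is acted on.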