Predictive analytics, data science, artificial intelligence, bots. The waves of advances in the application of data keep on coming. You can’t read the pages of the mainstream or business media without being impressed by the opportunity. Yet, although the power of analytics is common currency, it’s spoken of far more often than it’s practiced. The biggest obstacle to using advanced data analysis isn’t skill base or technology; it’s plain old access to the data.
Every CIO I meet tells me that they are excited about the potential of analytics for their business, with one caveat: they can’t get their hands on the data in the first place. Embracing data as a competitive advantage is a necessity for today’s business, so why is it so hard to get access to the data we need?
There is a cost to using data. Behind the glamor of powerful analytical insights is a backlog of tedious data preparation. Since the popular emergence of data science as a field, its practitioners have asserted that 80% of the work involved is acquiring and preparing data. Despite efforts among software vendors to create self-service tools for data preparation, this proportion of work is likely to stay the same for the foreseeable future, for a couple of reasons.
First, you can’t cleanly separate the data from its intended use. Depending on your desired application, you need to format, filter, and manipulate the data accordingly. Every new problem has unique aspects that usually reach back into data acquisition and preparation. Second, data confers insight and advantage. Once you have harvested the low-hanging fruit (the easy-to-prepare data), you’re falling behind if you’re not looking for the next level of insight. So you must pursue the data that is harder to find and use, which drives up the time spent on preparation.
But there is a bigger and costlier demon that lurks in enterprises. A demon that can drive up that 80% and often makes initiatives impossible: data silos. These silos are isolated islands of data, and they make it prohibitively costly to extract data and put it to other uses. They can arise for multiple reasons.
Structural. Software applications are written at one point in time, for a particular group in the company. In a world of limited resources, applications are optimized for their main function. The incentives of individual teams are unlikely to encourage data sharing as a primary requirement. This focus on function may, for instance, result in recent sales being stored in a different system from historical sales, presenting an immediate barrier to boosting sales through personalized product recommendations, which need to draw on both.
Political. Knowledge is power, and groups within an organization become suspicious of others wanting to use their data, often with some justification: the scope for misuse, even accidental, is broad. Data isn’t a neutral entity; you must interpret it with knowledge of its history and context. Yet this sense of proprietorship can act against the interests of the organization as a whole.
Growth. Any long-lived company has grown through multiple generations of leaders, philosophies, and acquisitions, resulting in multiple incompatible systems. Even if there are no political issues in integrating data, it is costly to reconcile and integrate sets of data that embody different approaches to important business concepts.