In many environments, the maturity of your reporting and business analytics functions depends on how effective you are at managing data before it’s time to analyze it. Traditional environments relied on a provisioning effort to conduct data preparation for analytics. After extracting data from source systems, the data landed at a staging area for cleansing, standardization and reorganization before loading it in a data warehouse.
Recently, there has been signification innovation in the evolution of end-user discovery and analysis tools. Often, these systems allow the analyst to bypass the traditional data warehouse by accessing the source data sets directly. This is putting more data – and analysis of that data – in the hands of more people. This encourages “undirected analysis,” which doesn’t pose any significant problems; the analysts are free to point their tools at any (or all!) data sets, with the hope of identifying some nugget of actionable knowledge that can be exploited.
However, it would be naïve to presume that many organizations are willing to allow a significant amount of “data-crunching” time to be spent on purely undirected discovery. Rather, data scientists have specific directions to solve particular types of business problems, such as analyzing: Logistics and facets of the supply chain to optimize the delivery channels.
Different challenges have different data needs, but if the analysts need to use data from the original sources, it’s worth considering an alternate approach to the conventional means of data preparation. The data warehouse approach balances two key goals: organized data inclusion (a large amount of data is integrated into a single data platform), and objective presentation (data is managed in an abstract data model specifically suited for querying and reporting).
A new approach to data preparation for analytics Does the data warehouse approach work in more modern, “built-to-suit” analytics? Maybe not, especially if data scientists go directly to the data – bypassing the data warehouse altogether. For data scientists, armed with analytics at their fingertips, let’s consider a rational, five-step approach to problem-solving. Clarify the question you want to answer. Identify the information necessary to answer the question.
Determine what information is available and what is not available. Acquire the information that is not available. In this process, steps 2, 3, and 4 all deal with data assessment and acquisition – but in a way that is parametrically opposed to the data warehouse approach. First, the warehouse’s data inclusion is pre-defined, which means that the data that is not available at step 3 may not be immediately accessible from the warehouse in step 4.