Organizations increasingly rely on analytics and advanced data visualization techniques to deliver incremental business value. However, when their efforts are hampered by data quality issues, the credibility of their entire analytics strategy comes into question.
Because analytics has traditionally been seen as presenting a broad landscape of data points, it is often assumed that data quality issues can be ignored because they would not affect broader trends. But should bad data be ignored so that analytics can proceed? Or should analytics stall until data quality issues are addressed?
In this article, we use a shipping industry scenario to highlight the dependence on quality data and discuss how companies can address data quality in parallel with the deployment of their analytics platforms to deliver even greater business value.
An Analytics Use Case: Fuel Consumption in the Shipping Industry
Shipping companies are increasingly analyzing the financial and operational performance of their vessels against competitors, industry benchmarks and other vessels within their fleet. A three-month voyage, such as a round trip from the US West Coast to the Arabian Gulf, can generate a large volume of operational data, most of which is manually collected and reported by the onboard crew.
Fuel is one of the largest cost components for a shipping company. Balancing fuel consumption against vessel speed is difficult for most companies. The data collected daily by the fleet is essential for analyzing the best-fit speed and consumption curve.
But consider an example of an exponential speed versus fuel consumption curve, plotted to determine the optimum speed range at which the ships should operate. With only a few data entry errors by the crew (such as a misplaced decimal point), the resulting analysis is unusable for making decisions. The poor quality of the data makes it impossible to determine the relationship between a change in speed and the proportional change in fuel consumption.
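To make the effect of a single misplaced decimal concrete, here is a minimal sketch in Python. It assumes the curve has the form fuel = a * exp(b * speed) and fits it by ordinary least squares on the logarithm of fuel. The data is synthetic, generated from a known curve purely for illustration, and the function name fit_exponential is hypothetical rather than taken from any particular tool.

```python
import math

def fit_exponential(speeds, fuels):
    """Fit fuel = a * exp(b * speed) by linear least squares on log(fuel)."""
    n = len(speeds)
    logs = [math.log(f) for f in fuels]
    mean_x = sum(speeds) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in speeds)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(speeds, logs))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Synthetic daily reports (assumption: not real vessel data).
speeds = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]           # knots
clean_fuel = [20.0 * math.exp(0.12 * s) for s in speeds]  # tonnes per day

# Simulate one misplaced decimal point in the highest-speed report.
bad_fuel = list(clean_fuel)
bad_fuel[-1] = clean_fuel[-1] / 10

a_clean, b_clean = fit_exponential(speeds, clean_fuel)
a_bad, b_bad = fit_exponential(speeds, bad_fuel)

print(f"clean fit:     a={a_clean:.2f}, b={b_clean:.3f}")
print(f"one bad entry: a={a_bad:.2f}, b={b_bad:.3f}")
```

In this synthetic case, one bad entry out of six is enough to turn the fitted growth rate negative, which would suggest that fuel consumption falls as speed rises.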
Most analytics programs are designed on the belief that removing outliers is all that is needed to make sense of the data, and many data analysis tools are available to help with that. However, what if some of those points are not true outliers but the result of a scenario that needs to be considered?
For instance, in the shipping example, what if some of the outliers were actual fuel consumption readings captured when the ship encountered inclement weather? By ignoring these data points, users make assumptions without considering important dimensions, and that could lead to very different decisions. This approach not only makes the analysis dubious, but also often leads to incorrect conclusions.
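One way to avoid discarding meaningful readings is to label suspect points instead of deleting them. The sketch below is only an illustration under stated assumptions: the DailyReport fields, the bad_weather flag, the 25 percent tolerance and the expected_fuel baseline are all hypothetical and would need to come from the company's own data and domain knowledge.

```python
from dataclasses import dataclass

@dataclass
class DailyReport:
    speed_knots: float
    fuel_tonnes: float
    bad_weather: bool  # hypothetical flag taken from the crew's log

def classify(report, expected_fuel, tolerance=0.25):
    """Label a report rather than deleting it from the data set."""
    ratio = report.fuel_tonnes / expected_fuel
    if abs(ratio - 1.0) <= tolerance:
        return "ok"
    if report.bad_weather:
        # Plausibly a real reading: keep it and analyze it separately.
        return "weather"
    # A value close to 10x or 0.1x the baseline suggests a misplaced decimal.
    for factor in (10.0, 0.1):
        if abs(ratio / factor - 1.0) <= tolerance:
            return "suspected_decimal_error"
    return "needs_review"

# Illustrative reports against an assumed baseline of 30 tonnes per day.
reports = [
    DailyReport(12.0, 30.0, False),
    DailyReport(12.0, 42.0, True),
    DailyReport(12.0, 3.1, False),
    DailyReport(12.0, 55.0, False),
]
labels = [classify(r, expected_fuel=30.0) for r in reports]
print(labels)
```

Labeling keeps weather-affected readings available as their own dimension of the analysis, while entry errors can be routed back for correction.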
In some cases, the practice of removing outliers can lead to the deletion of a significant number of data points from the analysis. But can users get the answer they are looking for by ignoring 40 percent of the data set?
Companies need to determine, with far more certainty, the speed at which their vessels are most efficient. Data quality issues only reduce confidence in the analysis. In the shipping example, a difference in speed of 1 to 2 knots can result in a difference of $500,000 to $700,000 in fuel consumption for a round-trip voyage from the US West Coast to the Arabian Gulf at the current bunker price.
Does this mean that data needs to be validated 100 percent before it can be used for analytics? Does the entire universe of data need to be clean before it is useful for analytics? Absolutely not. In fact, companies should only clean the data they intend to use. The right approach can help to determine which issues should be addressed to manage data quality.
Data Used for Analytics: Where Should I Use My Cleansing Tools?
Analytics use cases have specific needs in terms of which pieces of data are critical to the analysis.