Bad Data Is Ruining Machine Learning, Here’s How To Fix It

In a recent interview with us, Vijay D'Souza, Director of the Center for Enhanced Analytics at the US Government Accountability Office, noted that, ‘Regardless of the goals, it’s important to understand the quality of the data you have. The quality determines how much you can rely on the data to make good decisions.’ As it stands, bad data is ruining companies' data initiatives. It is one of the primary reasons why just 25% of businesses are successfully using their data to optimize revenue, despite the tremendous resources being pumped into these initiatives. IBM estimates that bad data is costing organizations some $3.1 trillion a year in the US alone, while in Experian’s Data Quality survey, 83% of companies said their revenue is affected by inaccurate and incomplete customer or prospect data.

Bad data is going to become an even bigger problem as Machine Learning adoption increases. Machine Learning algorithms have to be trained on data, and the majority of machine learning practitioners consider that data to be more important than the algorithms themselves. In CrowdFlower’s 2017 Data Scientist report, when asked to identify the biggest bottleneck in successfully completing AI projects, over half the respondents cited getting good quality training data or improving the training dataset. Jérôme Selles, Director of Data Science & Analytics at Turo, argues that, ‘Depending on the quality of the data that is being used, automating the learning loop can be a challenge and, today, requires manual supervision. A good illustration of that is what happened with the Microsoft chatbot Tay that became racist within 24 hours. For Machine Learning to achieve its own potential, the learning process needs to be kept under control and values need to be respected. Data Quality for the models is as important as education values in our society and we need more automated and systematic ways to make this happen.’

However, while data scientists may realize the importance of quality data, the same is not necessarily true of business leaders. They are more focused on adopting the technology as soon as possible to maintain a competitive edge, and often understand little about the practicalities. This means people are using whatever is in their datasets and hoping for the best, under the misapprehension that quantity can compensate for quality. Mistakes in the training data infect a system like urine in a swimming pool, polluting all the results and rendering any insights untrustworthy. They lead to wrong solutions, bad decision making, and potentially unethical AI. These issues could ultimately destroy the technology, as organizations that don't get the return on investment they expected grow disillusioned and abandon ship.

Bad data is a problem that the majority of companies will, unfortunately, struggle to deal with. All datasets are flawed; some are just less flawed than others. Bad data is the result of a number of factors, the first of which is basic data decay. At any given time, as many as 70% of datasets are outdated. People's lives are constantly changing - they move house, they switch jobs, they change phone numbers, and so forth. As a result, about 23% of email addresses change every year, as do 20% of all postal addresses and roughly 18% of all telephone numbers. If you’re using this data, your machine learning algorithms will be coming to conclusions that may have worked yesterday, but will not represent the realities of today and tomorrow.
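To give a rough sense of how quickly this decay compounds, here is a minimal Python sketch using the annual change rates quoted above. It rests on an assumption that is ours, not the article's sources': each record changes independently at a constant yearly rate, so the share still unchanged after a number of years is (1 - rate) raised to that number. The names `ANNUAL_CHANGE_RATE`, `fraction_stale` and `staleness_report` are hypothetical, chosen purely for illustration.

```python
# Annual change rates quoted in the article (share of records changing per year).
ANNUAL_CHANGE_RATE = {
    "email": 0.23,
    "postal_address": 0.20,
    "phone": 0.18,
}


def fraction_stale(annual_rate: float, years: float) -> float:
    """Estimate the share of records that are outdated after `years`.

    Assumption (ours, for illustration only): records change independently
    at a constant annual rate, so the unchanged share compounds as
    (1 - annual_rate) ** years.
    """
    if not 0.0 <= annual_rate <= 1.0:
        raise ValueError("annual_rate must be between 0 and 1")
    if years < 0:
        raise ValueError("years must be non-negative")
    return 1.0 - (1.0 - annual_rate) ** years


def staleness_report(years: float) -> dict:
    """Return the estimated stale share per field, rounded to 3 decimals."""
    return {
        field: round(fraction_stale(rate, years), 3)
        for field, rate in ANNUAL_CHANGE_RATE.items()
    }


if __name__ == "__main__":
    # Print the estimated stale share for datasets of different ages.
    for age_in_years in (1, 2, 3):
        print(age_in_years, staleness_report(age_in_years))
```

Under that simplifying assumption, a contact list left untouched for three years would have roughly 54% of its email addresses out of date. Real decay is unlikely to be this uniform, so treat the numbers as an order-of-magnitude guide rather than a measurement.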

Another cause of bad data is bias at the source. Data is liable to be guided by false assumptions and drawn from sources like ill-considered market research.
