Bad Data Is Ruining Machine Learning, Here’s How To Fix It

by 7wData
November 11, 2017

In a recent interview with us, Vijay D'Souza, Director of the Center for Enhanced Analytics at the US Government Accountability Office, noted that, ‘Regardless of the goals, it’s important to understand the quality of the data you have. The quality determines how much you can rely on the data to make good decisions.’ As it stands, bad data is ruining companies' data initiatives. It is one of the primary reasons why just 25% of businesses are successfully using their data to optimize revenue, despite the tremendous resources being pumped into these initiatives. IBM estimates that bad data is costing organizations some $3.1 trillion a year in the US alone, while in Experian’s Data Quality survey, 83% of companies said their revenue is affected by inaccurate and incomplete customer or prospect data.

Bad data is going to become an even bigger problem as machine learning adoption increases. Machine learning algorithms are only as good as the data used to train them. Indeed, the majority of machine learning practitioners consider the data to be more important than the algorithms themselves. In CrowdFlower’s 2017 Data Scientist report, when asked to identify the biggest bottleneck in successfully completing AI projects, over half the respondents cited getting good quality training data or improving the training dataset. Jérôme Selles, Director of Data Science & Analytics at Turo, argues that, ‘Depending on the quality of the data that is being used, automating the learning loop can be a challenge and, today, requires manual supervision. A good illustration of that is what happened with the Microsoft chatbot Tay that became racist within 24 hours. For Machine Learning to achieve its own potential, the learning process needs to be kept under control and values need to be respected. Data Quality for the models is as important as education values in our society and we need more automated and systematic ways to make this happen.’

However, while data scientists may realize the importance of quality data, the same is not necessarily true of business leaders. They are more focused on adopting the technology as soon as possible to maintain a competitive edge, and often understand little about the practicalities. This means people are using whatever is in their datasets and hoping for the best, under the misapprehension that quantity can compensate for quality. Mistakes in the training data infect a system like urine in a swimming pool, polluting all the results and rendering any insights untrustworthy. The result is wrong solutions, bad decision making, and potentially unethical AI. These issues could ultimately destroy the technology, as organizations that don't get the return on investment they expected grow disillusioned and abandon ship.

Bad data is a problem that the majority of companies will, unfortunately, struggle to deal with. All datasets are flawed; some are just less flawed than others. Bad data is the result of a number of factors, the first of which is basic data decay. At any given time, as much as 70% of datasets are outdated. People's lives are constantly changing: they move house, they switch jobs, they change phone numbers, and so forth. As a result, about 23% of email addresses change every year, as do 20% of all postal addresses and roughly 18% of all telephone numbers. If you’re using this data, your machine learning algorithms will be coming to conclusions that may have worked yesterday, but will not represent the realities of today and tomorrow.
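To get a feel for how quickly those decay rates compound, here is a minimal Python sketch. It takes the annual change rates quoted above and assumes, purely for illustration, that each rate stays constant and applies independently every year. Real decay is unlikely to be that tidy, and the function and variable names are hypothetical.

```python
# Annual change rates quoted in the article.
ANNUAL_CHANGE_RATE = {
    "email": 0.23,
    "postal_address": 0.20,
    "phone": 0.18,
}


def fraction_still_current(annual_change_rate: float, years: float) -> float:
    """Estimate the share of records that are still unchanged after `years`.

    Assumption (not from the article): the rate is constant and each
    year's changes are independent, so survival compounds geometrically.
    """
    if not 0.0 <= annual_change_rate <= 1.0:
        raise ValueError("annual_change_rate must be between 0 and 1")
    if years < 0:
        raise ValueError("years must be non-negative")
    return (1.0 - annual_change_rate) ** years


if __name__ == "__main__":
    for field, rate in ANNUAL_CHANGE_RATE.items():
        for years in (1, 2, 3):
            share = fraction_still_current(rate, years)
            print(f"{field}: about {share:.0%} unchanged after {years} year(s)")
```

Under these simplifying assumptions, fewer than half of the email addresses in a dataset left untouched for three years would still be current (0.77 cubed is roughly 0.46).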

Another cause of bad data is bias at the source. Data is liable to be guided by false assumptions and drawn from sources like ill-considered market research.
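One rough way to look for this kind of bias is to compare the make-up of a dataset against a trusted reference for the population it is supposed to represent. The Python sketch below is only an illustration of that idea: the function name, the group labels and the reference shares are all hypothetical, and a large gap is a prompt to investigate rather than proof of bias.

```python
from collections import Counter


def representation_gaps(sample_labels, reference_shares):
    """Return (sample share - reference share) for every group.

    `sample_labels` is an iterable of group labels, one per record.
    `reference_shares` maps each group to its expected share (0 to 1).
    Positive values mean the group is over-represented in the sample.
    """
    counts = Counter(sample_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample_labels is empty")
    groups = set(counts) | set(reference_shares)
    return {
        group: counts.get(group, 0) / total - reference_shares.get(group, 0.0)
        for group in groups
    }


if __name__ == "__main__":
    # Hypothetical survey sample and hypothetical population shares.
    sample = ["urban"] * 80 + ["rural"] * 20
    reference = {"urban": 0.60, "rural": 0.40}
    for group, gap in sorted(representation_gaps(sample, reference).items()):
        print(f"{group}: {gap:+.0%} relative to the reference")
```

In this made-up example, urban respondents are over-represented by about 20 percentage points, which would be worth understanding before training a model on the data.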
