Bad Data Is Ruining Machine Learning, Here’s How To Fix It
- by 7wData
In a recent interview with us, Vijay D'Souza, Director of the Center for Enhanced Analytics at the US Government Accountability Office, noted that, ‘Regardless of the goals, it’s important to understand the quality of the data you have. The quality determines how much you can rely on the data to make good decisions.’ As it stands, bad data is ruining companies' data initiatives. This is one of the primary reasons why just 25% of businesses are successfully using their data to optimize revenue, despite the tremendous resources being pumped into those initiatives. IBM estimates that bad data costs organizations some $3.1 trillion a year in the US alone, while in Experian’s Data Quality survey, 83% of companies said their revenue is affected by inaccurate and incomplete customer or prospect data.
The issue of bad data is going to become an even bigger problem as machine learning adoption increases. Machine learning algorithms depend on the data used to train them. Indeed, the majority of machine learning practitioners consider the data to be more important than the algorithms themselves. In CrowdFlower’s 2017 Data Scientist report, when asked to identify the biggest bottleneck in successfully completing AI projects, over half the respondents cited getting good quality training data or improving the training dataset. Jérôme Selles, Director of Data Science & Analytics at Turo, argues that, ‘Depending on the quality of the data that is being used, automating the learning loop can be a challenge and, today, requires manual supervision. A good illustration of that is what happened with the Microsoft chatbot Tay that became racist within 24 hours. For Machine Learning to achieve its own potential, the learning process needs to be kept under control and values need to be respected. Data Quality for the models is as important as education values in our society and we need more automated and systematic ways to make this happen.’
However, while data scientists may realize the importance of quality data, the same is not necessarily true of business leaders. They are more focused on adopting the technology as soon as possible to maintain a competitive edge, and often understand little about the practicalities. This means people are using whatever is in their datasets and hoping for the best, under the misapprehension that quantity can compensate for quality. Mistakes in the training data infect a system like urine in a swimming pool, polluting all the results and rendering any insights untrustworthy. This leads to wrong solutions, bad decision making, and potentially unethical AI. These issues could ultimately destroy the technology, as organizations that don't get the return on investment they expected grow disillusioned and abandon ship.
Bad data is a problem that the majority of companies will, unfortunately, struggle to deal with. All datasets are flawed; some are just less flawed than others. Bad data is the result of a number of factors, the first of which is basic data decay. At any given time, as much as 70% of data sets are outdated. People's lives are constantly changing: they move house, they switch jobs, they change phone numbers, and so forth. As a result, email addresses change at a rate of about 23% a year, as do 20% of all postal addresses and roughly 18% of all telephone numbers. If you’re using this data, your machine learning algorithms will be coming to conclusions that may have held yesterday, but will not reflect the realities of today and tomorrow.
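To put those decay rates in perspective, here is a rough Python sketch of how quickly a contact dataset could go stale. It assumes each annual change rate stays constant and compounds from year to year, which is a simplification the figures above do not state; the function and dictionary names are hypothetical, chosen only for illustration.

```python
# Annual change rates quoted in the article (fraction of records changing per year).
ANNUAL_CHANGE_RATES = {
    "email": 0.23,
    "postal_address": 0.20,
    "phone": 0.18,
}


def fraction_still_valid(annual_rate: float, years: float) -> float:
    """Estimate the share of records still accurate after `years`.

    Assumption (not from the article): the annual change rate is constant
    and compounds, so the surviving share is (1 - rate) ** years.
    """
    if not 0.0 <= annual_rate <= 1.0:
        raise ValueError("annual_rate must be between 0 and 1")
    if years < 0:
        raise ValueError("years must be non-negative")
    return (1.0 - annual_rate) ** years


if __name__ == "__main__":
    for field, rate in ANNUAL_CHANGE_RATES.items():
        for years in (1, 3, 5):
            share = fraction_still_valid(rate, years)
            print(f"{field}: {share:.0%} still valid after {years} year(s)")
```

Under these assumptions, fewer than half of the email addresses collected three years ago would still be valid (0.77 cubed is roughly 0.46), which is why a model trained on an unrefreshed dataset can drift away from reality fairly quickly.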
Another cause of bad data is bias at the source. Data is liable to be guided by false assumptions and drawn from sources like ill-considered market research.