Danger! You’re Using The Wrong Data To Teach AI!
- by 7wData
Data is the fuel for artificial intelligence. The more data we have, the better the AI will learn and find those hidden patterns, right? Unfortunately, not so much. We have the ability to collect LOTS of data: consider the nearly 31 billion IoT devices producing information for machine consumption. However, lots of data does not translate into good data. As humans, we have not fully grasped which slice of data creates the real value in developing AI solutions. At the heart of this challenge are three major obstacles: 1) understanding what the real data is, 2) validity of belief, and 3) implicit bias.
So, what is “real data?” Simply put, it is the data the machine actually needs in order to learn and perform work. We have fallen into the trap of assuming that having big data gives us the key information to enable AI learning. The problem, though, is that more data can lead to more misconstructions and more opportunities for bias. Consider what Dr. De Kai from the Hong Kong University of Science and Technology has shared: it takes an AI system hearing roughly 100 million words to learn a language, but a human child needs to hear only about 15 million words to learn it. Why such a delta? We don’t fully know, but there is a strong argument that particular words and phrases, not sheer volume, demonstrate the intricacies of language. This suggests the secret lies in medium data, not big data. In other words, true AI skill development lies in using the critical data, not just large volumes.
To see the power of medium data, we can look at fake news detection. Unfortunately, there is a lot of fake news out there, with a very large amount of variability. More variability means more data the AI needs to learn from. However, computer scientists at the University of Washington and the Allen Institute for AI took a different approach. They created a system called Grover that learned how to write fake news so that it could better detect fake news. To write fake news articles, Grover had to learn what real news looks like by reading real news articles, which have much less variability than fake news. In effect, through their training strategy, they shrank the data requirement from vast big data down to a reduced data set.
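Grover itself is a large neural language model, but the core intuition — a generator’s own statistics are a strong detector of its own output, because machine-generated text looks suspiciously probable to the model that produced it — can be sketched with a toy bigram model. Everything below (the corpus, sentences, and function names) is illustrative, not the actual Grover implementation:

```python
import math
import random
from collections import defaultdict

# Tiny stand-in for a "real news" corpus (hypothetical sentences).
REAL_NEWS = [
    "the city council approved the budget after a long debate",
    "officials said the new bridge will open to traffic next month",
    "the mayor announced a plan to expand the public transit system",
    "researchers at the university published a study on water quality",
]

def train_bigram(sentences):
    """Count word-to-word transitions: our toy 'generator' model."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        for a, b in zip(tokens, tokens[1:]):
            counts[a][b] += 1
    return counts

def generate(counts, rng, max_len=20):
    """Sample a 'fake news' sentence by walking the bigram chain."""
    out, cur = [], "<s>"
    for _ in range(max_len):
        nxt = rng.choices(list(counts[cur]),
                          weights=list(counts[cur].values()))[0]
        if nxt == "</s>":
            break
        out.append(nxt)
        cur = nxt
    return " ".join(out)

def avg_logprob(counts, sentence, floor=1e-6):
    """Average per-token log-probability under the generator's own model.

    Machine-generated text only ever uses transitions the model has seen,
    so it scores high; fresh human text usually contains unseen
    transitions and scores much lower. That gap is the detection signal.
    """
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for a, b in zip(tokens, tokens[1:]):
        seen = sum(counts[a].values())
        p = counts[a][b] / seen if seen and counts[a][b] else floor
        total += math.log(p)
    return total / (len(tokens) - 1)

counts = train_bigram(REAL_NEWS)
fake = generate(counts, random.Random(0))
human = "aliens reportedly landed near the old stadium yesterday"
print("fake :", avg_logprob(counts, fake))
print("human:", avg_logprob(counts, human))
```

The design point mirrors the article’s argument: the model is trained only on low-variability “real news,” yet that narrow training set is exactly what lets it flag the machine-written text it could have produced itself.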
The obstacle of validity of belief is trickier to manage. Fundamentally, each of us has assumptions that we consider to be true and wind up holding as fact. For example, what color is the sun? Most people would say yellow, or maybe red or orange at sunset. However, the sun is actually white. (Sorry, Superman fans, but he shouldn’t really get any powers from a yellow sun.) Most people believe the sun is yellow because Earth’s atmosphere scatters out the short-wavelength blue light, so the sun winds up looking yellow. What’s the big deal, right? Imagine that we teach AI systems, as fact, that the sun is yellow.