AI still doesn’t have the common sense to understand human language
- by 7wData
Until pretty recently, computers were hopeless at producing sentences that actually made sense. But the field of natural-language processing (NLP) has taken huge strides, and machines can now generate convincing passages with the push of a button.
These advances have been driven by deep-learning techniques, which pick out statistical patterns in word usage and argument structure from vast troves of text. But a new paper from the Allen Institute for Artificial Intelligence calls attention to something still missing: machines don’t really understand what they’re writing (or reading).
This is a fundamental challenge in the grand pursuit of generalizable AI—but beyond academia, it’s relevant for consumers, too. Chatbots and voice assistants built on state-of-the-art natural-language models, for example, have become the interface for many financial institutions, health-care providers, and government agencies. Without a genuine understanding of language, these systems are more prone to fail, slowing access to important services.
The researchers built on the Winograd Schema Challenge, a test created in 2011 to evaluate the common-sense reasoning of NLP systems. The challenge uses a set of 273 questions involving pairs of sentences that are identical except for one word. That word, known as a trigger, flips the meaning of each sentence’s pronoun, as seen in the example below:
The trophy doesn’t fit into the brown suitcase because it’s too large.
The trophy doesn’t fit into the brown suitcase because it’s too small.
To succeed, an NLP system must figure out which of two options the pronoun refers to. In this case, it would need to select “trophy” for the first and “suitcase” for the second to correctly solve the problem.
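The structure of such a problem can be sketched in a few lines of Python. This is only an illustration, not the benchmark's actual data format or scoring code: the names `SchemaItem`, `trigger_words`, `accuracy`, and `always_first` are hypothetical, and the sentence wording is the commonly cited trophy and suitcase pair, assumed here for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class SchemaItem:
    """One sentence of a twin pair (hypothetical layout, not the official format)."""
    sentence: str              # sentence containing the ambiguous pronoun
    options: Tuple[str, str]   # the two candidate referents
    answer: str                # the referent a human would pick


def trigger_words(first: str, second: str) -> Tuple[str, str]:
    """Return the one pair of words at which two twin sentences differ."""
    words_a = [w.strip(".,;:!?") for w in first.split()]
    words_b = [w.strip(".,;:!?") for w in second.split()]
    if len(words_a) != len(words_b):
        raise ValueError("twin sentences must have the same number of words")
    diffs = [(a, b) for a, b in zip(words_a, words_b) if a != b]
    if len(diffs) != 1:
        raise ValueError("twin sentences must differ in exactly one word")
    return diffs[0]


def accuracy(items: Sequence[SchemaItem],
             resolver: Callable[[SchemaItem], str]) -> float:
    """Fraction of items for which the resolver picks the correct referent."""
    correct = sum(resolver(item) == item.answer for item in items)
    return correct / len(items)


# Assumed wording of the trophy/suitcase pair, used purely for illustration.
PAIR = [
    SchemaItem(
        "The trophy doesn't fit into the brown suitcase because it's too large.",
        ("trophy", "suitcase"),
        "trophy",
    ),
    SchemaItem(
        "The trophy doesn't fit into the brown suitcase because it's too small.",
        ("trophy", "suitcase"),
        "suitcase",
    ),
]


def always_first(item: SchemaItem) -> str:
    """Naive baseline that ignores the sentence and picks the first option."""
    return item.options[0]


if __name__ == "__main__":
    print(trigger_words(PAIR[0].sentence, PAIR[1].sentence))
    print(accuracy(PAIR, always_first))
```

Because the two sentences share the same options but have opposite answers, a resolver that ignores the trigger word, such as `always_first`, can get at most one of the two right. That is the property the twin-sentence design relies on.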
The test was originally designed with the idea that such problems couldn’t be answered without a deeper grasp of semantics. State-of-the-art deep-learning models can now reach around 90% accuracy, so it would seem that NLP has gotten closer to its goal. But in their paper, which will receive the Outstanding Paper Award at next month’s AAAI conference, the researchers challenge the effectiveness of the benchmark and, thus, the level of progress that the field has actually made.
They created a significantly larger data set, dubbed WinoGrande, with 44,000 of the same types of problems. To do so, they designed a crowdsourcing scheme to quickly create and validate new sentence pairs. (Part of the reason the Winograd data set is so small is that it was hand-crafted by experts.)