Hand labeling is the past. The future is #NoLabel AI
- by 7wData
Data labeling is so hot right now… but could this rapidly emerging market face disruption from a small team at Stanford and the Snorkel open source project, which enables highly efficient programmatic labeling that is 10 to 1,000x as efficient as hand labeling?
We are witnessing a data labeling market explosion: labeling platforms have hit prime time. S&P Global released an October 11 report entitled *Avoiding Garbage in Machine Learning* in which it termed unlabeled data “garbage data” to highlight the importance of labeling in AI. The Economist recently noted that while spending on AI is growing from $38bn this year to $98bn in 2023, only 1 in 5 companies interested in AI has deployed machine learning models because of a shortage of labeled data. This is why “the market for data-labeling services may triple to $5bn by 2023.” It is difficult not to notice the abundance of recently funded labeling startups chasing this market.
The deep learning revolution has brought new levels of sophistication to machine learning and artificial intelligence, making them applicable to a wide range of business problems. Traditional modeling treated feature engineering as a primary activity: selecting a subset of features to simplify the model. Representation learning with neural networks uses massive models and large volumes of training data to learn features automatically. This has shifted the main activity in machine learning from feature engineering to dataset management [Ratner, A, 2019]. Acquiring and managing labeled data can be the most expensive part of building AI into a business, and this is a major gating factor in AI adoption.
Far from being an implementation detail, the availability of labeled data is often what determines whether a problem is even approachable. Pete Skomoroch, the AI veteran and investor, recently said, “Data labeling is a good proxy for whether machine learning is cost-effective for a problem. If you can build labeling into normal user activities you track like Facebook, Google, and Amazon consumer applications, you have a shot. Otherwise, you burn money paying for labeled data. Many people still try to apply machine learning on high profile problems without oxygen, and burn lots of money in the process without solving them.”
Large technology companies have a substantial lead in AI because their labeling costs are so low. They were the companies that created the field of big data processing, and now that infrastructure is being used to drive AI using their vast reserves of labeled data. Other companies are struggling to get started, let alone catch up. As they rush to do so, a major question they must answer is: how can I acquire the labeled data needed to transform my business with AI to keep it relevant in the changing market? Another related question is: how much labeling technology even applies to private data that can’t be shipped outside of an organization?
What if there were a shortcut to labeling data? What if subject matter experts could write data labeling programs that each acted as weak labels that an unsupervised model - one requiring no labels - could combine into strong labels? That is the promise of the Snorkel Project that emerged from Stanford’s HazyResearch group. “Back in 2016, we were surprised to notice that a lot of our collaborators in ML were starting to spend the majority of their time building, managing, cleaning, and most of all, labeling massive training datasets — and we asked why there wasn’t a system where practitioners could label and manage their training data in higher-level, programmatic, and ultimately faster ways?”
An overview of the Snorkel system. (1) SME users write labeling functions (LFs) that express weak supervision sources like distant supervision, patterns, and heuristics. (2) Snorkel applies the LFs over unlabeled data and learns a generative model to combine the LFs' outputs into probabilistic labels.
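To make the idea of labeling functions concrete, here is a minimal sketch in plain Python. It does not use the Snorkel library; the spam/ham task, the function names, and the example texts are all hypothetical. Most importantly, it combines the LF outputs with a simple vote share among non-abstaining functions, which is only a stand-in for the generative model Snorkel learns to weigh the LFs.

```python
# Label values; ABSTAIN means "this labeling function has no opinion".
ABSTAIN, HAM, SPAM = -1, 0, 1

# Hypothetical labeling functions: each encodes one noisy heuristic.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_mentions_prize(text):
    lowered = text.lower()
    return SPAM if "prize" in lowered or "winner" in lowered else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) <= 4 else ABSTAIN

LFS = [lf_contains_link, lf_mentions_prize, lf_short_message]

def apply_lfs(texts, lfs=LFS):
    """Build a label matrix: one row per text, one column per LF."""
    return [[lf(t) for lf in lfs] for t in texts]

def vote_to_probability(votes):
    """Estimate P(SPAM) as the share of SPAM votes among non-abstaining LFs.

    Simplification: Snorkel instead learns a generative model over the
    label matrix; this unweighted vote is only an illustration.
    Returns None when every LF abstains.
    """
    cast = [v for v in votes if v != ABSTAIN]
    if not cast:
        return None
    return sum(1 for v in cast if v == SPAM) / len(cast)

texts = [
    "You are a winner! Claim your prize at http://example.com",
    "See you at lunch",
    "The quarterly report is attached for your review before the meeting",
]

label_matrix = apply_lfs(texts)
prob_labels = [vote_to_probability(row) for row in label_matrix]

for text, p in zip(texts, prob_labels):
    print(p, "|", text)
```

The third text receives no label because every function abstains. This shows why coverage matters: the more complementary heuristics subject matter experts write, the more of the unlabeled data ends up with a usable probabilistic label.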