Hand labeling is the past. The future is #NoLabel AI

Data labeling is so hot right now… but could this rapidly emerging market face disruption from a small team at Stanford and the Snorkel open source project, which enables highly efficient programmatic labeling that is 10 to 1,000x as efficient as hand labeling?

We are witnessing a data labeling market explosion: labeling platforms have hit prime time. S&P Global released an October 11 report entitled *Avoiding Garbage in Machine Learning* in which it termed unlabeled data “garbage data” to highlight the importance of labeling in AI. The Economist recently noted that while spending on AI is growing from $38bn this year to $98bn in 2023, only 1 in 5 companies interested in AI has deployed Machine Learning models because of a shortage of labeled data. This is why “the market for data-labeling services may triple to $5bn by 2023.” It is difficult not to notice the abundance of labeling startups being funded of late that are chasing after this market.

The deep learning revolution has brought new levels of sophistication to machine learning and artificial intelligence, making them applicable to a wide range of business problems. Traditional modeling treated feature engineering as a primary activity: practitioners selected a subset of features to simplify the model. Representation learning with neural networks instead uses massive models and large volumes of training data to learn features automatically. This has shifted the main activity in machine learning from feature engineering to dataset management [Ratner, A, 2019]. Acquiring and managing labeled data can be the most expensive part of building AI into a business, and this is a major gating factor in AI adoption.

Far from being an implementation detail, the availability of labeled data is often what determines whether a problem is even approachable. Pete Skomoroch, the AI veteran and investor, recently said, “Data labeling is a good proxy for whether machine learning is cost-effective for a problem. If you can build labeling into normal user activities you track like Facebook, Google, and Amazon consumer applications, you have a shot. Otherwise, you burn money paying for labeled data. Many people still try to apply machine learning on high profile problems without oxygen, and burn lots of money in the process without solving them.”

Large technology companies have a big lead in AI because their labeling costs are so low. They created the field of big data processing, and that infrastructure now drives AI using their vast reserves of labeled data. Other companies are struggling to get started, let alone catch up. As they rush to do so, a major question they must answer is: how can I acquire the labeled data needed to transform my business with AI to keep it relevant in the changing market? Another related question is: how much labeling technology even applies to private data that can’t be shipped outside of an organization?

What if there were a shortcut to labeling data? What if subject matter experts could write data labeling programs that each acted as weak labels that an unsupervised model - one requiring no labels - could combine into strong labels? That is the promise of the Snorkel Project that emerged from Stanford’s HazyResearch group.  “Back in 2016, we were surprised to notice that a lot of our collaborators in ML were starting to spend the majority of their time building, managing, cleaning, and most of all, labeling massive training datasets — and we asked why there wasn’t a system where practitioners could label and manage their training data in higher-level, programmatic, and ultimately faster ways?”

An overview of the Snorkel system. (1) SME users write labeling functions (LFs) that express weak supervision sources like distant supervision, patterns, and heuristics. (2) Snorkel applies the LFs over unlabeled data and learns a generative model to combine the LFs' outputs into probabilistic labels.
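To make the two steps above concrete, here is a minimal sketch in plain Python. It does not use the Snorkel library or its API. The spam/ham task, the three labeling functions, and the helper names (`apply_lfs`, `majority_vote_proba`) are all hypothetical and chosen only for illustration. Snorkel learns a generative model to weigh the labeling functions; this sketch swaps in a simple majority vote as a stand-in, so treat the output as a rough approximation of the idea rather than what Snorkel computes.

```python
from collections import Counter

# Label conventions assumed for this sketch (not taken from the article).
ABSTAIN, HAM, SPAM = -1, 0, 1


# Step 1: a subject matter expert encodes heuristics as labeling functions.
# Each function votes for a label or abstains when its heuristic does not apply.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text):
    lowered = text.lower()
    return SPAM if "prize" in lowered or "winner" in lowered else ABSTAIN


def lf_short_message(text):
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]


# Step 2a: apply every labeling function to every unlabeled example.
def apply_lfs(texts, lfs=LABELING_FUNCTIONS):
    """Return a label matrix: one row per text, one column per labeling function."""
    return [[lf(text) for lf in lfs] for text in texts]


# Step 2b: combine the weak votes into a probabilistic label.
# Majority vote is a stand-in here for Snorkel's learned generative model.
def majority_vote_proba(votes):
    """Return the fraction of non-abstaining votes that say SPAM, or None if all abstain."""
    counted = Counter(vote for vote in votes if vote != ABSTAIN)
    total = sum(counted.values())
    if total == 0:
        return None
    return counted[SPAM] / total


unlabeled = [
    "You are a winner! Claim your prize at http://example.com",
    "See you at lunch",
    "The quarterly report is attached for your review before Friday",
]

label_matrix = apply_lfs(unlabeled)
probabilistic_labels = [majority_vote_proba(row) for row in label_matrix]

for text, proba in zip(unlabeled, probabilistic_labels):
    print(f"P(spam)={proba} :: {text}")
```

The third message gets no label because every labeling function abstains on it. That is the coverage gap weak supervision has to deal with, typically by adding more labeling functions. A learned label model would also weigh the functions by their estimated accuracy instead of counting each vote equally.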
