The Tech Marketer

Devang Sachdev, Snorkel AI: On easing the laborious process of labelling data

Last updated:
3 years ago

Correctly labelling training data for AI models is vital to avoid serious problems, as is using sufficiently large datasets. However, manually labelling massive amounts of data is time-consuming and laborious.

Using pre-labelled datasets can be problematic, as evidenced by MIT having to pull its 80 Million Tiny Images dataset. For those unaware, the popular dataset was found to contain thousands of racist and misogynistic labels that could have been used to train AI models.

AI News caught up with Devang Sachdev, VP of Marketing at Snorkel AI, to find out how the company is easing the laborious process of labelling data in a safe and effective way.

AI News: How is Snorkel helping to ease the laborious process of labelling data?

Devang Sachdev: Snorkel Flow changes the paradigm of training data labelling from the traditional manual process—which is slow, expensive, and unadaptable—to a programmatic process that we’ve proven accelerates training data creation 10x-100x.

Users can capture their knowledge and existing resources, both internal (e.g., ontologies) and external (e.g., foundation models), as labelling functions, which are applied to training data at scale.

Unlike a rules-based approach, these labelling functions can be imprecise, lack coverage, and conflict with each other. Snorkel Flow uses theoretically grounded weak supervision techniques to intelligently combine the labelling functions and auto-label your training data set en masse using an optimal Snorkel Flow label model.
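To make the idea concrete, here is a minimal sketch of labelling functions that are allowed to abstain and to disagree. It is not Snorkel Flow's API: the function names, the spam/ham task, and the sample texts are all hypothetical, and a simple majority vote stands in for the label model, which in practice would weigh each function by its estimated accuracy.

```python
from collections import Counter

ABSTAIN = -1
SPAM, HAM = 1, 0

# Hypothetical labelling functions: each encodes one heuristic and may abstain.
def lf_contains_offer(text):
    return SPAM if "offer" in text.lower() else ABSTAIN

def lf_has_greeting(text):
    return HAM if text.lower().startswith(("hi", "hello")) else ABSTAIN

def lf_many_exclamations(text):
    return SPAM if text.count("!") >= 2 else ABSTAIN

LABELLING_FUNCTIONS = [lf_contains_offer, lf_has_greeting, lf_many_exclamations]

def apply_lfs(texts, lfs=LABELLING_FUNCTIONS):
    """Build a label matrix: one row per text, one vote per labelling function."""
    return [[lf(text) for lf in lfs] for text in texts]

def majority_vote(votes):
    """Combine one row of votes; abstain if nothing fired or the top two tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

texts = [
    "Hello team, minutes attached",   # only the greeting heuristic fires
    "Limited offer!! Click now!",     # two spam heuristics agree
    "Hi, special offer inside",       # heuristics conflict
    "See you tomorrow",               # no heuristic fires (a coverage gap)
]
label_matrix = apply_lfs(texts)
labels = [majority_vote(row) for row in label_matrix]
print(labels)
```

The third and fourth texts show the two failure modes mentioned above: a conflict and a coverage gap. Both are left unlabelled here rather than guessed.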

Using this initial training data set, users train a larger machine learning model of their choice (with the click of a button from our ‘Model Zoo’) in order to:

  1. Generalise beyond the output of the label model.
  2. Generate model-guided error analysis to know exactly where the model is confused and how to iterate. This includes auto-generated suggestions, as well as analysis tools to explore and tag data to identify what labelling functions to edit or add. 
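One simple form of the error analysis described above can be sketched as follows, assuming a small hand-labelled development set is available. The function name, labelling function names, and data are illustrative only and do not reflect Snorkel Flow's tooling. The sketch counts, for each labelling function, how often it voted for the wrong class on examples the trained model got wrong.

```python
ABSTAIN = -1

def lf_error_report(label_matrix, predictions, gold, lf_names):
    """Count wrong votes per labelling function on the model's mistakes."""
    report = {name: {"fired": 0, "wrong_votes": 0} for name in lf_names}
    for votes, pred, truth in zip(label_matrix, predictions, gold):
        if pred == truth:
            continue  # only inspect examples where the model is wrong
        for name, vote in zip(lf_names, votes):
            if vote == ABSTAIN:
                continue
            report[name]["fired"] += 1
            if vote != truth:
                report[name]["wrong_votes"] += 1
    return report

# Illustrative development set: labelling function votes, model output, gold labels.
lf_names = ["lf_keyword", "lf_pattern", "lf_lookup"]
label_matrix = [
    [1, -1, 1],
    [1, 0, -1],
    [-1, 0, 0],
    [1, -1, -1],
]
predictions = [1, 1, 0, 1]
gold = [1, 0, 0, 0]

report = lf_error_report(label_matrix, predictions, gold, lf_names)
# The function with the most wrong votes is the first candidate to edit.
suspect = max(report, key=lambda name: report[name]["wrong_votes"])
print(suspect, report[suspect])
```

A report like this points at which labelling function to revisit first, which is the kind of iteration step the second list item describes.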

This rapid, iterative, and adaptable process becomes much more like software development than a tedious, manual process that cannot scale. And much like software development, it allows users to inspect and adapt the code that produced training data labels.

AN: Are there dangers to implementing too much automation in the labelling process?

DS: The labelling process can inherently introduce dangers simply for the fact that as humans, we’re fallible. Human labellers can be fatigued, make mistakes, or have a conscious or unconscious bias which they encode into the model via their manual labels.

When mistakes or biases occur, and they will, the danger is that the model or downstream application amplifies the isolated label. These amplifications can lead to consequential impacts at scale: for example, inequities in lending, discrimination in hiring, and missed diagnoses for patients. Automation can help.

In addition to these dangers—which have major downstream consequences—there are also more practical risks of attempting to automate too much or taking the human out of the loop of training data development.

Training data is how humans encode their expertise to machine learning models. While there are some cases where specialised expertise isn’t required to label data, in most enterprise settings it is. For this training data to be effective, it needs to capture the fullness of subject matter experts’ knowledge and the diverse resources they rely on to make a decision on any given datapoint.

However, as we have all experienced, having highly in-demand experts label data manually one-by-one simply isn’t scalable. It also leaves an enormous amount of value on the table by losing the knowledge behind each manual label. We must take a programmatic approach to data labelling and engage in data-centric, rather than model-centric, AI development workflows. 

Here’s what this entails: 

  • Elevating how domain experts label training data from tediously labelling one-by-one to encoding their expertise—the rationale behind what would be their labelling decisions—in a way that can be applied at scale. 
  • Using weak supervision to intelligently auto-label at scale—this is not auto-magic, of course; it’s an inherently transparent, theoretically grounded approach. Every training data label that’s applied in this step can be inspected to understand why it was labelled as it was. 
  • Bringing experts into the core AI development loop to assist with iteration and troubleshooting. Using streamlined workflows within the Snorkel Flow platform, data scientists—as subject matter experts—are able to collaborate to identify the root cause of error modes and how to correct them by making simple labelling function updates, additions, or, at times, correcting ground truth or “gold standard” labels that error analysis reveals to be wrong.

AN: How easy is it to identify and update labels based on real-world changes?

DS: A fundamental value of Snorkel Flow’s data-centric approach to AI development is adaptability. We all know that real-world changes are inevitable, whether that’s production data drift or business goals that evolve. Because Snorkel Flow uses programmatic labelling, it’s extremely efficient to respond to these changes.

In the traditional paradigm, if the business comes to you with a change in objectives (say, they were classifying documents three ways but now need a 10-way schema), you’d effectively need to relabel your training data set (often thousands or hundreds of thousands of data points) from scratch. This would mean weeks or months of work before you could deliver on the new objective.

In contrast, with Snorkel Flow, updating the schema is as simple as writing a few additional labelling functions to cover the new classes and applying weak supervision to combine all of your labelling functions and retrain your model. 
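As a rough illustration of why a schema change is cheap under programmatic labelling, the sketch below extends a three-way schema with two new classes by adding labelling functions and re-running them. Everything in it (the `keyword_lf` helper, the class names, the documents) is hypothetical, and a plurality vote stands in for the label model.

```python
from collections import Counter

ABSTAIN = None

def keyword_lf(label, keywords):
    """Build a hypothetical labelling function that votes `label` on a keyword hit."""
    def lf(text):
        lowered = text.lower()
        return label if any(k in lowered for k in keywords) else ABSTAIN
    return lf

def label_all(texts, lfs):
    """Re-label every text by plurality vote across the labelling functions."""
    labels = []
    for text in texts:
        votes = Counter(v for v in (lf(text) for lf in lfs) if v is not ABSTAIN)
        labels.append(votes.most_common(1)[0][0] if votes else ABSTAIN)
    return labels

documents = [
    "Invoice for March services",
    "Master services contract, signed",
    "Quarterly revenue report",
    "Non-disclosure agreement draft",
    "Purchase order 1182",
]

# Original three-way schema.
lfs = [
    keyword_lf("invoice", ["invoice"]),
    keyword_lf("contract", ["contract"]),
    keyword_lf("report", ["report"]),
]
before = label_all(documents, lfs)

# The schema grows: add labelling functions for the new classes and re-run.
lfs += [
    keyword_lf("nda", ["non-disclosure"]),
    keyword_lf("purchase_order", ["purchase order"]),
]
after = label_all(documents, lfs)
print(before)
print(after)
```

No existing label is edited by hand: the documents that were unlabelled under the old schema pick up the new classes on the second pass.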

To identify data drift in production, you can rely on your monitoring system or use Snorkel Flow’s production APIs to bring live data back into the platform and see how your model performs against real-world data.
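A basic monitoring check of this kind might look like the sketch below. It assumes a sample of live predictions has been given ground-truth labels, and it does not use or represent Snorkel Flow's production APIs. The function names, the baseline figure, and the `max_drop` threshold are illustrative choices.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    if not labels:
        raise ValueError("cannot score an empty batch")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def detect_degradation(baseline_acc, live_batches, max_drop=0.05):
    """Return indices of batches whose accuracy fell more than max_drop below baseline."""
    flagged = []
    for i, (preds, labels) in enumerate(live_batches):
        if baseline_acc - accuracy(preds, labels) > max_drop:
            flagged.append(i)
    return flagged

# Illustrative batches of (model predictions, ground-truth labels).
live_batches = [
    ([1, 0, 1, 1, 0, 1, 0, 0, 1, 1], [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]),  # 9 of 10 correct
    ([1, 1, 1, 1, 0, 1, 0, 0, 1, 1], [0, 0, 1, 0, 0, 1, 0, 0, 1, 1]),  # 7 of 10 correct
]
flagged = detect_degradation(baseline_acc=0.9, live_batches=live_batches)
print(flagged)
```

Flagged batches are the ones worth pulling back in for error analysis.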

As you spot performance degradation, you’re able to follow the same workflow: use error analysis to understand patterns, apply auto-suggested actions, and iterate in collaboration with your subject matter experts to refine and add labelling functions.

AN: MIT was forced to pull its ‘80 Million Tiny Images’ dataset after it was found to contain racist and misogynistic labels due to its use of an “automated data collection procedure” based on WordNet. How is Snorkel ensuring that it avoids this labelling problem that is leading to harmful biases in AI systems?

DS: Bias can start anywhere in the system: pre-processing, post-processing, task design, modelling choices, and, in particular, labelled training data.

To understand underlying bias, it is important to understand the rationale used by labellers. This is impractical when every datapoint is hand labelled and the logic behind labelling it one way or another is not captured. Moreover, information about label author and dataset versioning is rarely available. Often labelling is outsourced, or in-house labellers have moved on to other projects or organisations.

Snorkel AI’s programmatic labelling approach helps discover, manage, and mitigate bias. Instead of discarding the rationale behind each manually labelled datapoint, Snorkel Flow, our data-centric AI platform, captures the labellers’ (subject matter experts, data scientists, and others) knowledge as a labelling function and generates probabilistic labels using theoretically grounded algorithms encoded in a novel label model.

With Snorkel Flow, users can understand exactly why a certain datapoint was labelled the way it is. This process, along with versioning of labelling functions and labelled datasets, allows users to audit, interpret, and even explain model behaviours. This shift from manual to programmatic labelling is key to managing bias.
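The auditability described here can be pictured as a record that stores, for every label, the named and versioned labelling functions that voted for it. The sketch below is an illustration under assumptions, not Snorkel Flow's data model: the `LabellingFunction` class, the version strings, and the support-ticket task are all hypothetical.

```python
from dataclasses import dataclass

ABSTAIN = None

@dataclass(frozen=True)
class LabellingFunction:
    """A hypothetical keyword rule that carries its own name and version."""
    name: str
    version: str
    label: str
    keyword: str

    def __call__(self, text):
        return self.label if self.keyword in text.lower() else ABSTAIN

def explain_label(text, lfs):
    """Return the plurality label together with the versioned votes behind it."""
    votes = [(f"{lf.name}@{lf.version}", lf(text)) for lf in lfs]
    fired = [(name, vote) for name, vote in votes if vote is not ABSTAIN]
    tally = {}
    for _, vote in fired:
        tally[vote] = tally.get(vote, 0) + 1
    label = max(tally, key=tally.get) if tally else ABSTAIN
    return {"text": text, "label": label, "votes": fired}

lfs = [
    LabellingFunction("lf_urgent", "v2", "escalate", "urgent"),
    LabellingFunction("lf_refund", "v1", "billing", "refund"),
    LabellingFunction("lf_chargeback", "v3", "billing", "chargeback"),
]
record = explain_label("Urgent: refund not received after chargeback", lfs)
print(record["label"])
for name, vote in record["votes"]:
    print(f"  {name} voted {vote}")
```

If a label later turns out to be biased or wrong, the record identifies which rule, at which version, is responsible.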

AN: A group led by Snorkel researcher Stephen Bach recently had their paper on Zero-Shot Learning with Common Sense Knowledge Graphs (ZSL-KG) published. I’d direct readers to the paper for the full details, but can you give us a brief overview of what it is and how it improves over existing WordNet-based methods?

DS: ZSL-KG improves graph-based zero-shot learning in two ways: richer models and richer data. On the modelling side, ZSL-KG is based on a new type of graph neural network called a transformer graph convolutional network (TrGCN).

Many graph neural networks learn to represent nodes in a graph through linear combinations of neighbouring representations, which is limiting. TrGCN uses small transformers at each node to combine neighbourhood representations in more complex ways.
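The contrast between the two aggregation styles can be sketched in a few lines. This is a deliberately simplified illustration and not the TrGCN architecture from the paper: it uses a single attention pass with identity projections and no learned weights, and the final mean pooling is this sketch's own choice to keep the result independent of neighbour order.

```python
import math

def mean_aggregate(neighbours):
    """Linear aggregation: element-wise mean of the neighbour vectors."""
    dim = len(neighbours[0])
    return [sum(v[i] for v in neighbours) / len(neighbours) for i in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_aggregate(neighbours):
    """Non-linear aggregation: each neighbour attends to all neighbours
    (scaled dot-product, identity projections), then the results are mean-pooled."""
    dim = len(neighbours[0])
    scale = math.sqrt(dim)
    attended = []
    for query in neighbours:
        weights = softmax([dot(query, key) / scale for key in neighbours])
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, neighbours)) for i in range(dim)]
        )
    return mean_aggregate(attended)

# Illustrative neighbourhood of three 2-dimensional node representations.
neighbours = [[1.0, 0.0], [0.0, 1.0], [4.0, 0.0]]
linear = mean_aggregate(neighbours)
nonlinear = attention_aggregate(neighbours)
print(linear)
print(nonlinear)
```

The attention version lets neighbours influence one another before pooling, so its output depends on how the neighbour vectors relate rather than only on their average.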

On the data side, ZSL-KG uses common sense knowledge graphs, which use natural language and graph structures to make explicit many types of relationships among concepts. They are much richer than the typical ImageNet subtype hierarchy.

AN: Gartner designated Snorkel a ‘Cool Vendor’ in its 2022 AI Core Technologies report. What do you think makes you stand out from the competition?

DS: Data labelling is one of the biggest challenges for enterprise AI. Most organisations realise that current approaches are unscalable and often ridden with quality, explainability, and adaptability issues. Snorkel AI not only provides a solution for automating data labelling but also uniquely offers an AI development platform to adopt a data-centric approach and leverage knowledge resources including subject matter experts and existing systems.

In addition to the technology, Snorkel AI brings together 7+ years of R&D (which began at the Stanford AI Lab) and a highly talented team of machine learning engineers, success managers, and researchers to assist and advise customers on development as well as bring new innovations to market.

Snorkel Flow unifies all the necessary components of a programmatic, data-centric AI development workflow—training data creation/management, model iteration, error analysis tooling, and data/application export or deployment—while also being completely interoperable at each stage via a Python SDK and a range of other connectors.

This unified platform also provides an intuitive interface and streamlined workflow for critical collaboration between SME annotators, data scientists, and other roles, to accelerate AI development. It allows data science and ML teams to iterate on both data and models within a single platform and use insights from one to guide the development of the other, leading to rapid development cycles.
