The Tech Marketer

Devang Sachdev, Snorkel AI: On easing the laborious process of labelling data

Last updated: 3 years ago

Correctly labelling training data for AI models is vital to avoid serious problems, as is using sufficiently large datasets. However, manually labelling massive amounts of data is time-consuming and laborious.

Using pre-labelled datasets can be problematic, as evidenced by MIT having to pull its 80 Million Tiny Images dataset. For those unaware, the popular dataset was found to contain thousands of racist and misogynistic labels that could have been used to train AI models.

AI News caught up with Devang Sachdev, VP of Marketing at Snorkel AI, to find out how the company is easing the laborious process of labelling data in a safe and effective way.

AI News: How is Snorkel helping to ease the laborious process of labelling data?

Devang Sachdev: Snorkel Flow changes the paradigm of training data labelling from the traditional manual process—which is slow, expensive, and unadaptable—to a programmatic process that we’ve proven accelerates training data creation 10x-100x.

Users are able to capture their knowledge and existing resources (both internal, e.g., ontologies, and external, e.g., foundation models) as labelling functions, which are applied to training data at scale.

Unlike a rules-based approach, these labelling functions can be imprecise, lack coverage, and conflict with each other. Snorkel Flow uses theoretically grounded weak supervision techniques to intelligently combine the labelling functions and auto-label your training data set en masse using an optimal Snorkel Flow label model.
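As a rough illustration of the idea (not Snorkel Flow's actual API or label model), the sketch below defines three hypothetical labelling functions for a spam task and combines them with a plain majority vote. The names `ABSTAIN`, `lf_contains_link`, `lf_mentions_prize`, `lf_short_reply`, and `majority_vote` are assumptions made for this example, and majority voting is only a simple stand-in for the weak supervision techniques described above.

```python
from collections import Counter

# Hypothetical label values for a toy spam task.
ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_contains_link(text):
    # Noisy heuristic: messages with links are often spam.
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_mentions_prize(text):
    # Noisy heuristic: talk of prizes is often spam.
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_short_reply(text):
    # Noisy heuristic: very short messages are usually genuine replies.
    return HAM if len(text.split()) <= 3 else ABSTAIN

LABELLING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_reply]

def majority_vote(text):
    """Combine the labelling functions; abstain when none fire or they tie."""
    votes = [lf(text) for lf in LABELLING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting votes with no clear winner
    return ranked[0][0]
```

Each function is allowed to be wrong or to abstain; the combination step is what turns several imprecise signals into one training label.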

Using this initial training data set, users train a larger machine learning model of their choice (with the click of a button from our ‘Model Zoo’) in order to:

  1. Generalise beyond the output of the label model.
  2. Generate model-guided error analysis to know exactly where the model is confused and how to iterate. This includes auto-generated suggestions, as well as analysis tools to explore and tag data to identify what labelling functions to edit or add. 
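One simple way to picture the error-analysis step is a per-function report on a small hand-labelled development set. The sketch below is an assumption-laden simplification, not Snorkel Flow's tooling: `lf_summary` and its input layout are hypothetical, and it only reports coverage and accuracy for each labelling function so that a user can see which ones to edit.

```python
ABSTAIN = -1  # hypothetical marker for "this labelling function did not vote"

def lf_summary(vote_matrix, gold):
    """Report coverage and accuracy for each labelling function.

    vote_matrix[i][j] is function j's vote on example i;
    gold[i] is the trusted label for example i.
    """
    n_examples = len(vote_matrix)
    n_lfs = len(vote_matrix[0])
    summary = []
    for j in range(n_lfs):
        # Keep only the examples on which function j actually voted.
        voted = [(row[j], y) for row, y in zip(vote_matrix, gold) if row[j] != ABSTAIN]
        coverage = len(voted) / n_examples
        accuracy = sum(v == y for v, y in voted) / len(voted) if voted else None
        summary.append({"lf": j, "coverage": coverage, "accuracy": accuracy})
    return summary
```

A function with high coverage but low accuracy is a candidate for rewriting, while low overall coverage suggests new functions are needed.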

This rapid, iterative, and adaptable process becomes much more like software development than a tedious, manual process that cannot scale. And much like software development, it allows users to inspect and adapt the code that produced training data labels.

AN: Are there dangers to implementing too much automation in the labelling process?

DS: The labelling process can inherently introduce dangers simply because, as humans, we’re fallible. Human labellers can be fatigued, make mistakes, or have a conscious or unconscious bias which they encode into the model via their manual labels.

When mistakes or biases occur (and they will), the danger is that the model or downstream application essentially amplifies the isolated label. These amplifications can lead to consequential impacts at scale: for example, inequities in lending, discrimination in hiring, missed diagnoses for patients, and more. Automation can help.

In addition to these dangers—which have major downstream consequences—there are also more practical risks of attempting to automate too much or taking the human out of the loop of training data development.

Training data is how humans encode their expertise to machine learning models. While there are some cases where specialised expertise isn’t required to label data, in most enterprise settings, it is. For this training data to be effective, it needs to capture the fullness of subject matter experts’ knowledge and the diverse resources they rely on to make a decision on any given datapoint.

However, as we have all experienced, having highly in-demand experts label data manually one-by-one simply isn’t scalable. It also leaves an enormous amount of value on the table by losing the knowledge behind each manual label. We must take a programmatic approach to data labelling and engage in data-centric, rather than model-centric, AI development workflows. 

Here’s what this entails: 

  • Elevating how domain experts label training data from tediously labelling one-by-one to encoding their expertise—the rationale behind what would be their labelling decisions—in a way that can be applied at scale. 
  • Using weak supervision to intelligently auto-label at scale—this is not auto-magic, of course; it’s an inherently transparent, theoretically grounded approach. Every training data label that’s applied in this step can be inspected to understand why it was labelled as it was. 
  • Bringing experts into the core AI development loop to assist with iteration and troubleshooting. Using streamlined workflows within the Snorkel Flow platform, data scientists—as subject matter experts—are able to collaborate to identify the root cause of error modes and how to correct them by making simple labelling function updates, additions, or, at times, correcting ground truth or “gold standard” labels that error analysis reveals to be wrong.

AN: How easy is it to identify and update labels based on real-world changes?

DS: A fundamental value of Snorkel Flow’s data-centric approach to AI development is adaptability. We all know that real-world changes are inevitable, whether that’s production data drift or business goals that evolve. Because Snorkel Flow uses programmatic labelling, it’s extremely efficient to respond to these changes.

In the traditional paradigm, if the business comes to you with a change in objectives (say, they were classifying documents three ways but now need a 10-way schema), you’d effectively need to relabel your training data set (often thousands or hundreds of thousands of data points) from scratch. This would mean weeks or months of work before you could deliver on the new objective.

In contrast, with Snorkel Flow, updating the schema is as simple as writing a few additional labelling functions to cover the new classes and applying weak supervision to combine all of your labelling functions and retrain your model. 
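To make the schema-change argument concrete, here is a minimal sketch under stated assumptions: the document classes, the keyword heuristics, and the `label` helper are all hypothetical, and the combination rule (accept a label only when the functions that fire agree) is a deliberately simple stand-in for weak supervision. The point is that adding a class means adding a function and re-running, not relabelling by hand.

```python
ABSTAIN = "ABSTAIN"  # hypothetical marker for "no vote"

def lf_invoice(doc):
    return "invoice" if "amount due" in doc.lower() else ABSTAIN

def lf_contract(doc):
    return "contract" if "hereinafter" in doc.lower() else ABSTAIN

# Labelling functions covering the original schema.
labelling_functions = [lf_invoice, lf_contract]

def label(doc, lfs):
    """Return the single label the firing functions agree on, else abstain."""
    votes = {lf(doc) for lf in lfs} - {ABSTAIN}
    return votes.pop() if len(votes) == 1 else ABSTAIN

# The schema grows: cover the new class with one more function
# and re-run labelling over the same documents.
def lf_nda(doc):
    return "nda" if "confidential information" in doc.lower() else ABSTAIN

extended_functions = labelling_functions + [lf_nda]
```

The existing functions are reused unchanged, so earlier work on the original classes carries over to the new schema.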

To identify data drift in production, you can rely on your monitoring system or use Snorkel Flow’s production APIs to bring live data back into the platform and see how your model performs against real-world data.

As you spot performance degradation, you’re able to follow the same workflow: using error analysis to understand patterns, applying auto-suggested actions, and iterating in collaboration with your subject matter experts to refine and add labelling functions.
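A very small sketch of what spotting degradation can look like in practice, assuming you have a labelled sample of live data and a baseline accuracy from development. The function names and the `tolerance` threshold are hypothetical and are not part of any Snorkel Flow API.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the trusted labels."""
    return sum(p == y for p, y in zip(predictions, gold)) / len(gold)

def has_degraded(baseline_accuracy, live_predictions, live_gold, tolerance=0.05):
    """Flag degradation when live accuracy falls more than `tolerance` below baseline."""
    return baseline_accuracy - accuracy(live_predictions, live_gold) > tolerance
```

When the check fires, the same iterate-on-labelling-functions workflow described above applies to the live sample.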

AN: MIT was forced to pull its ‘80 Million Tiny Images’ dataset after it was found to contain racist and misogynistic labels due to its use of an “automated data collection procedure” based on WordNet. How is Snorkel ensuring that it avoids this labelling problem that is leading to harmful biases in AI systems?

DS: Bias can start anywhere in the system: pre-processing, post-processing, task design, modelling choices, and, in particular, issues with labelled training data.

To understand underlying bias, it is important to understand the rationale used by labellers. This is impractical when every datapoint is hand-labelled and the logic behind labelling it one way or another is not captured. Moreover, information about label author and dataset versioning is rarely available. Often, labelling is outsourced or in-house labellers have moved on to other projects or organisations.

Snorkel AI’s programmatic labelling approach helps discover, manage, and mitigate bias. Instead of discarding the rationale behind each manually labelled datapoint, Snorkel Flow, our data-centric AI platform, captures the labellers’ (subject matter experts, data scientists, and others) knowledge as a labelling function and generates probabilistic labels using theoretically grounded algorithms encoded in a novel label model.

With Snorkel Flow, users can understand exactly why a certain datapoint was labelled the way it is. This process, along with label function and label dataset versioning, allows users to audit, interpret, and even explain model behaviours. This shift from manual to programmatic labelling is key to managing bias.
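The auditability claim can be sketched as follows. This is an illustrative simplification, not Snorkel Flow's label model: the labelling functions, the fixed `weights`, and `explain_label` are hypothetical, and a real label model would estimate how much to trust each function rather than take weights as given. The sketch returns a probabilistic label together with the list of functions that produced it.

```python
ABSTAIN = None  # hypothetical marker for "no vote"

def lf_says_good(text):
    return "pos" if "good" in text else ABSTAIN

def lf_says_not(text):
    return "neg" if "not" in text else ABSTAIN

def explain_label(text, lfs, weights):
    """Return (probabilities, provenance) for one datapoint.

    provenance lists every function that voted, its vote, and its weight,
    so the resulting label can be traced back to its sources.
    """
    scores, fired = {}, []
    for lf in lfs:
        vote = lf(text)
        if vote is ABSTAIN:
            continue
        weight = weights[lf.__name__]
        scores[vote] = scores.get(vote, 0.0) + weight
        fired.append((lf.__name__, vote, weight))
    total = sum(scores.values())
    probabilities = {label: s / total for label, s in scores.items()} if total else {}
    return probabilities, fired
```

For example, `explain_label("not good", [lf_says_good, lf_says_not], {"lf_says_good": 3.0, "lf_says_not": 1.0})` yields a 0.75 probability for `"pos"` and records both votes, which is the kind of trace a reviewer could audit for bias.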

AN: A group led by Snorkel researcher Stephen Bach recently had their paper on Zero-Shot Learning with Common Sense Knowledge Graphs (ZSL-KG) published. I’d direct readers to the paper for the full details, but can you give us a brief overview of what it is and how it improves over existing WordNet-based methods?

DS: ZSL-KG improves graph-based zero-shot learning in two ways: richer models and richer data. On the modelling side, ZSL-KG is based on a new type of graph neural network called a transformer graph convolutional network (TrGCN).

Many graph neural networks learn to represent nodes in a graph through linear combinations of neighbouring representations, which is limiting. TrGCN uses small transformers at each node to combine neighbourhood representations in more complex ways.
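The contrast between the two ways of combining neighbours can be sketched in plain Python. This is not the TrGCN architecture from the paper, which uses learned transformer layers; it is a toy comparison, with hypothetical function names, between a linear average of neighbour vectors and an attention-based mix in which each neighbour is reweighted by its similarity to the others before pooling.

```python
import math

def mean_aggregate(neighbours):
    """Linear combination: element-wise average of the neighbour vectors."""
    n = len(neighbours)
    return [sum(column) / n for column in zip(*neighbours)]

def attention_aggregate(neighbours):
    """Attention-based combination (no learned weights, for illustration only).

    Each neighbour attends to all neighbours via scaled dot-product
    attention; the attended vectors are then averaged.
    """
    dim = len(neighbours[0])
    attended = []
    for query in neighbours:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in neighbours]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        weights = [e / norm for e in exps]
        attended.append([sum(w * v[i] for w, v in zip(weights, neighbours))
                         for i in range(dim)])
    return mean_aggregate(attended)
```

In the averaging version every neighbour contributes equally; in the attention version the contribution of each neighbour depends on the others, which is the extra expressiveness the answer above refers to.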

On the data side, ZSL-KG uses common sense knowledge graphs, which use natural language and graph structures to make explicit many types of relationships among concepts. They are much richer than the typical ImageNet subtype hierarchy.

AN: Gartner designated Snorkel a ‘Cool Vendor’ in its 2022 AI Core Technologies report. What do you think makes you stand out from the competition?

DS: Data labelling is one of the biggest challenges for enterprise AI. Most organisations realise that current approaches are unscalable and often ridden with quality, explainability, and adaptability issues. Snorkel AI not only provides a solution for automating data labelling but also uniquely offers an AI development platform to adopt a data-centric approach and leverage knowledge resources including subject matter experts and existing systems.

In addition to the technology, Snorkel AI brings together 7+ years of R&D (which began at the Stanford AI Lab) and a highly talented team of machine learning engineers, success managers, and researchers to successfully assist and advise customer development as well as bring new innovations to market.

Snorkel Flow unifies all the necessary components of a programmatic, data-centric AI development workflow—training data creation/management, model iteration, error analysis tooling, and data/application export or deployment—while also being completely interoperable at each stage via a Python SDK and a range of other connectors.

This unified platform also provides an intuitive interface and streamlined workflow for critical collaboration between SME annotators, data scientists, and other roles, to accelerate AI development. It allows data science and ML teams to iterate on both data and models within a single platform and use insights from one to guide the development of the other, leading to rapid development cycles.
