By using this site, you agree to the Privacy Policy and Terms of Use.
Accept
The Tech MarketerThe Tech MarketerThe Tech Marketer
  • Home
  • Technology
  • Entertainment
    • Memes
    • Quiz
  • Marketing
  • Politics
  • Visionary Vault
    • Whitepaper
Reading: Devang Sachdev, Snorkel AI: On easing the laborious process of labelling data
Share
Notification Show More
Font ResizerAa
The Tech MarketerThe Tech Marketer
Font ResizerAa
  • Home
  • Technology
  • Entertainment
  • Marketing
  • Politics
  • Visionary Vault
  • Home
  • Technology
  • Entertainment
    • Memes
    • Quiz
  • Marketing
  • Politics
  • Visionary Vault
    • Whitepaper
Have an existing account? Sign In
Follow US
© The Tech Marketer. All Rights Reserved.
The Tech Marketer > Blog > Technology > Devang Sachdev, Snorkel AI: On easing the laborious process of labelling data
Technology

Devang Sachdev, Snorkel AI: On easing the laborious process of labelling data

Last updated:
3 years ago
Share
SHARE

Correctly labelling training data for AI models is vital to avoid serious problems, as is using sufficiently large datasets. However, manually labelling massive amounts of data is time-consuming and laborious.

Using pre-labelled datasets can be problematic, as evidenced by MIT having to pull its 80 Million Tiny Images datasets. For those unaware, the popular dataset was found to contain thousands of racist and misogynistic labels that could have been used to train AI models.

AI News caught up with Devang Sachdev, VP of Marketing at Snorkel AI, to find out how the company is easing the laborious process of labelling data in a safe and effective way.

AI News: How is Snorkel helping to ease the laborious process of labelling data?

Devang Sachdev: Snorkel Flow changes the paradigm of training data labelling from the traditional manual process—which is slow, expensive, and unadaptable—to a programmatic process that we’ve proven accelerates training data creation 10x-100x.

Users are able to capture their knowledge and existing resources (both internal, e.g., ontologies and external, e.g., foundation models) as labelling functions, which are applied to training data at scale. 

Unlike a rules-based approach, these labelling functions can be imprecise, lack coverage, and conflict with each other. Snorkel Flow uses theoretically grounded weak supervision techniques to intelligently combine the labelling functions to auto-label your training data set en-masse using an optimal Snorkel Flow label model. 

Using this initial training data set, users train a larger machine learning model of their choice (with the click of a button from our ‘Model Zoo’) in order to:

  1. Generalise beyond the output of the label model.
  2. Generate model-guided error analysis to know exactly where the model is confused and how to iterate. This includes auto-generated suggestions, as well as analysis tools to explore and tag data to identify what labelling functions to edit or add. 

This rapid, iterative, and adaptable process becomes much more like software development rather than a tedious, manual process that cannot scale. And much like software development, it allows users to inspect and adapt the code that produced training data labels.

AN: Are there dangers to implementing too much automation in the labelling process?

DS: The labelling process can inherently introduce dangers simply for the fact that as humans, we’re fallible. Human labellers can be fatigued, make mistakes, or have a conscious or unconscious bias which they encode into the model via their manual labels.

When mistakes or biases occur—and they will—the danger is the model or downstream application essentially amplifies the isolated label. These amplifications can lead to consequential impacts at scale. For example, inequities in lending, discrimination in hiring, missed diagnoses for patients, and more. Automation can help.

In addition to these dangers—which have major downstream consequences—there are also more practical risks of attempting to automate too much or taking the human out of the loop of training data development.

Training data is how humans encode their expertise to machine learning models. While there are some cases where specialised expertise isn’t required to label data, in most enterprise settings, there is. For this training data to be effective, it needs to capture the fullness of subject matter experts’ knowledge and the diverse resources they rely on to make a decision on any given datapoint.

However, as we have all experienced, having highly in-demand experts label data manually one-by-one simply isn’t scalable. It also leaves an enormous amount of value on the table by losing the knowledge behind each manual label. We must take a programmatic approach to data labelling and engage in data-centric, rather than model-centric, AI development workflows. 

Here’s what this entails: 

  • Elevating how domain experts label training data from tediously labelling one-by-one to encoding their expertise—the rationale behind what would be their labelling decisions—in a way that can be applied at scale. 
  • Using weak supervision to intelligently auto-label at scale—this is not auto-magic, of course; it’s an inherently transparent, theoretically grounded approach. Every training data label that’s applied in this step can be inspected to understand why it was labelled as it was. 
  • Bringing experts into the core AI development loop to assist with iteration and troubleshooting. Using streamlined workflows within the Snorkel Flow platform, data scientists—as subject matter experts—are able to collaborate to identify the root cause of error modes and how to correct them by making simple labelling function updates, additions, or, at times, correcting ground truth or “gold standard” labels that error analysis reveals to be wrong.

AN: How easy is it to identify and update labels based on real-world changes?

DS: A fundamental value of Snorkel Flow’s data-centric approach to AI development is adaptability. We all know that real-world changes are inevitable, whether that’s production data drift or business goals that evolve. Because Snorkel Flow uses programmatic labelling, it’s extremely efficient to respond to these changes.

In the traditional paradigm, if the business comes to you with a change in objectives—say, they were classifying documents three ways but now need a 10-way schema, you’d effectively need to relabel your training data set (often thousands or hundreds of thousands of data points) from scratch. This would mean weeks or months of work before you could deliver on the new objective. 

In contrast, with Snorkel Flow, updating the schema is as simple as writing a few additional labelling functions to cover the new classes and applying weak supervision to combine all of your labelling functions and retrain your model. 

To identify data drift in production, you can rely on your monitoring system or use Snorkel Flow’s production APIs to bring live data back into the platform and see how your model performs against real-world data.

As you spot performance degradation, you’re able to follow the same workflow: using error analysis to understand patterns, apply auto-suggested actions, and iterate in collaboration with your subject matter experts to refine and add labelling functions. 

AN: MIT was forced to pull its ‘80 Million Tiny Images’ dataset after it was found to contain racist and misogynistic labels due to its use of an “automated data collection procedure” based on WordNet. How is Snorkel ensuring that it avoids this labelling problem that is leading to harmful biases in AI systems?

DS: Bias can start anywhere in the system – pre-processing, post-processing, with task design, with modelling choices, etc. And in particular issues with labelled training data.

To understand underlying bias, it is important to understand the rationale used by labellers. This is impractical when every datapoint is hand labelled and the logic behind labelling it one way or another is not captured. Moreover, information about label author and dataset versioning is rarely available. Often labelling is outsourced or in-house labellers have moved on to other projects or organizations. 

Snorkel AI’s programmatic labelling approach helps discover, manage, and mitigate bias. Instead of discarding the rationale behind each manually labelled datapoint, Snorkel Flow, our data-centric AI platform, captures the labellers’ (subject matter experts, data scientists, and others) knowledge as a labelling function and generates probabilistic labels using theoretical grounded algorithms encoded in a novel label model.

With Snorkel Flow, users can understand exactly why a certain datapoint was labelled the way it is. This process, along with label function and label dataset versioning, allows users to audit, interpret, and even explain model behaviours. This shift from manual to programmatic labelling is key to managing bias.

AN: A group led by Snorkel researcher Stephen Bach recently had their paper on Zero-Shot Learning with Common Sense Knowledge Graphs (ZSL-KG) published. I’d direct readers to the paper for the full details, but can you give us a brief overview of what it is and how it improves over existing WordNet-based methods?

DS: ZSL-KG improves graph-based zero-shot learning in two ways: richer models and richer data. On the modelling side, ZSL-KG is based on a new type of graph neural network called a transformer graph convolutional network (TrGCN).

Many graph neural networks learn to represent nodes in a graph through linear combinations of neighbouring representations, which is limiting. TrGCN uses small transformers at each node to combine neighbourhood representations in more complex ways.

On the data side, ZSL-KG uses common sense knowledge graphs, which use natural language and graph structures to make explicit many types of relationships among concepts. They are much richer than the typical ImageNet subtype hierarchy.

AN: Gartner designated Snorkel a ‘Cool Vendor’ in its 2022 AI Core Technologies report. What do you think makes you stand out from the competition?

DS: Data labelling is one of the biggest challenges for enterprise AI. Most organisations realise that current approaches are unscalable and often ridden with quality, explainability, and adaptability issues. Snorkel AI not only provides a solution for automating data labelling but also uniquely offers an AI development platform to adopt a data-centric approach and leverage knowledge resources including subject matter experts and existing systems.

In addition to the technology, Snorkel AI brings together 7+ years of R&D (which began at the Stanford AI Lab) and a highly-talented team of machine learning engineers, success managers, and researchers to successfully assist and advise customer development as well as bring new innovations to market.

Snorkel Flow unifies all the necessary components of a programmatic, data-centric AI development workflow—training data creation/management, model iteration, error analysis tooling, and data/application export or deployment—while also being completely interoperable at each stage via a Python SDK and a range of other connectors.

This unified platform also provides an intuitive interface and streamlined workflow for critical collaboration between SME annotators, data scientists, and other roles, to accelerate AI development. It allows data science and ML teams to iterate on both data and models within a single platform and use insights from one to guide the development of the other, leading to rapid development cycles.

You Might Also Like

Nintendo Switch Joy-Con Searches Explode After Switch 2 Color Reveal

Starlink Satellites Trigger Industry Pushback as SpaceX Expands Cellular Ambitions

ASUS Unveils ROG Flow Z13 Kojima Edition Inspired by Hideo Kojima

Amazon Fire TV Stick 4K Max Search Surges as Amazon Unveils Major Fire TV Overhaul

DLSS 4.5 Announced: Nvidia’s Biggest AI Graphics Leap Yet

Share This Article
Facebook LinkedIn Email Copy Link Print
What do you think?
Love0
Sad0
Happy0
Sleepy0
Angry0
Dead0
Wink0
Previous Article Benten Technologies: A secure, passwordless future
Next Article Irish Distillers adds digital label solution to support responsible choices 

Latest News

  • Betterment’s financial app sends customers a $10,000 crypto scam message

    Betterment, a financial app, sent a sketchy-looking notification on Friday asking users to send $10,000 to Bitcoin and Ethereum crypto wallets and promising to "triple your crypto," according to a thread on Reddit. The Betterment account says in an X thread that this was an "unauthorized message" that was sent via a "third-party system." Here's

  • Amazon is planning a Super Amazon-mart store near Chicago

    Amazon could be taking another, even bigger swing at physical stores, this time with a Walmart-like supercenter. On Tuesday, the Orland Park Plan Commission in the Chicago suburb Orland Park, Illinois voted 6-1 to approve Amazon's proposal to develop 35 acres of land for a 229,000-square foot retail center, as reported by The Information. The

  • X accuses music publishers of ‘weaponizing’ DMCA takedowns

    X is suing music publishers and their trade group, the National Music Publishers' Association (NMPA), accusing them of attempted coercion in their ongoing battle over licensing, as reported earlier by The Hollywood Reporter. The Elon Musk-owned platform accuses music publishers of colluding with the NMPA to "coerce X into taking licenses to musical works from

  • This company could help bring Auracast to an iPhone near you

    One of the issues holding Auracast back from wider mainstream use is some companies' lack of support for the Bluetooth technology - Apple being a prime example. With iOS having 58 percent of the market share in North America and nearly 28 percent worldwide, a decision by Apple to enable native Auracast support would potentially

  • AI is coming for collectibles next

    AI toys, companions, and robots have been everywhere at CES this year, but among the horde of waddling plushies and light-up emoji eyes, two stood out to me. HeyMates and Buddyo are each betting that the collectible figurine boom is going to come back with an AI-powered vengeance, letting us chat to sports stars and

- Advertisement -
about us

We influence 20 million users and is the number one business and technology news network on the planet.

Advertise

  • Advertise With Us
  • Newsletters
  • Partnerships
  • Brand Collaborations
  • Press Enquiries

Top Categories

  • Artificial Intelligence
  • Technology
  • Bussiness
  • Politics
  • Marketing
  • Science
  • Sports
  • White Paper

Legal

  • About Us
  • Contact Us
  • Privacy Policy
  • Affiliate Disclaimer
  • Legal

Find Us on Socials

The Tech MarketerThe Tech Marketer
© The Tech Marketer. All Rights Reserved.
Join Us!

Subscribe to our newsletter and never miss our latest news, podcasts etc..

Zero spam, Unsubscribe at any time.
Welcome Back!

Sign in to your account

Lost your password?