Correctly labelling training data for AI models is vital to avoid serious problems, as is using sufficiently large datasets. However, manually labelling massive amounts of data is time-consuming and laborious.
Using pre-labelled datasets can be problematic, as evidenced by MIT having to pull its 80 Million Tiny Images datasets. For those unaware, the popular dataset was found to contain thousands of racist and misogynistic labels that could have been used to train AI models.
AI News caught up with Devang Sachdev, VP of Marketing at Snorkel AI, to find out how the company is easing the laborious process of labelling data in a safe and effective way.
AI News: How is Snorkel helping to ease the laborious process of labelling data?
Devang Sachdev: Snorkel Flow changes the paradigm of training data labelling from the traditional manual process—which is slow, expensive, and unadaptable—to a programmatic process that we’ve proven accelerates training data creation 10x-100x.
Users are able to capture their knowledge and existing resources (both internal, e.g., ontologies and external, e.g., foundation models) as labelling functions, which are applied to training data at scale.
Unlike a rules-based approach, these labelling functions can be imprecise, lack coverage, and conflict with each other. Snorkel Flow uses theoretically grounded weak supervision techniques to intelligently combine the labelling functions to auto-label your training data set en-masse using an optimal Snorkel Flow label model.
Using this initial training data set, users train a larger machine learning model of their choice (with the click of a button from our ‘Model Zoo’) in order to:
- Generalise beyond the output of the label model.
- Generate model-guided error analysis to know exactly where the model is confused and how to iterate. This includes auto-generated suggestions, as well as analysis tools to explore and tag data to identify what labelling functions to edit or add.
This rapid, iterative, and adaptable process becomes much more like software development rather than a tedious, manual process that cannot scale. And much like software development, it allows users to inspect and adapt the code that produced training data labels.
AN: Are there dangers to implementing too much automation in the labelling process?
DS: The labelling process can inherently introduce dangers simply for the fact that as humans, we’re fallible. Human labellers can be fatigued, make mistakes, or have a conscious or unconscious bias which they encode into the model via their manual labels.
When mistakes or biases occur—and they will—the danger is the model or downstream application essentially amplifies the isolated label. These amplifications can lead to consequential impacts at scale. For example, inequities in lending, discrimination in hiring, missed diagnoses for patients, and more. Automation can help.
In addition to these dangers—which have major downstream consequences—there are also more practical risks of attempting to automate too much or taking the human out of the loop of training data development.
Training data is how humans encode their expertise to machine learning models. While there are some cases where specialised expertise isn’t required to label data, in most enterprise settings, there is. For this training data to be effective, it needs to capture the fullness of subject matter experts’ knowledge and the diverse resources they rely on to make a decision on any given datapoint.
However, as we have all experienced, having highly in-demand experts label data manually one-by-one simply isn’t scalable. It also leaves an enormous amount of value on the table by losing the knowledge behind each manual label. We must take a programmatic approach to data labelling and engage in data-centric, rather than model-centric, AI development workflows.
Here’s what this entails:
- Elevating how domain experts label training data from tediously labelling one-by-one to encoding their expertise—the rationale behind what would be their labelling decisions—in a way that can be applied at scale.
- Using weak supervision to intelligently auto-label at scale—this is not auto-magic, of course; it’s an inherently transparent, theoretically grounded approach. Every training data label that’s applied in this step can be inspected to understand why it was labelled as it was.
- Bringing experts into the core AI development loop to assist with iteration and troubleshooting. Using streamlined workflows within the Snorkel Flow platform, data scientists—as subject matter experts—are able to collaborate to identify the root cause of error modes and how to correct them by making simple labelling function updates, additions, or, at times, correcting ground truth or “gold standard” labels that error analysis reveals to be wrong.
AN: How easy is it to identify and update labels based on real-world changes?
DS: A fundamental value of Snorkel Flow’s data-centric approach to AI development is adaptability. We all know that real-world changes are inevitable, whether that’s production data drift or business goals that evolve. Because Snorkel Flow uses programmatic labelling, it’s extremely efficient to respond to these changes.
In the traditional paradigm, if the business comes to you with a change in objectives—say, they were classifying documents three ways but now need a 10-way schema, you’d effectively need to relabel your training data set (often thousands or hundreds of thousands of data points) from scratch. This would mean weeks or months of work before you could deliver on the new objective.
In contrast, with Snorkel Flow, updating the schema is as simple as writing a few additional labelling functions to cover the new classes and applying weak supervision to combine all of your labelling functions and retrain your model.
To identify data drift in production, you can rely on your monitoring system or use Snorkel Flow’s production APIs to bring live data back into the platform and see how your model performs against real-world data.
As you spot performance degradation, you’re able to follow the same workflow: using error analysis to understand patterns, apply auto-suggested actions, and iterate in collaboration with your subject matter experts to refine and add labelling functions.
AN: MIT was forced to pull its ‘80 Million Tiny Images’ dataset after it was found to contain racist and misogynistic labels due to its use of an “automated data collection procedure” based on WordNet. How is Snorkel ensuring that it avoids this labelling problem that is leading to harmful biases in AI systems?
DS: Bias can start anywhere in the system – pre-processing, post-processing, with task design, with modelling choices, etc. And in particular issues with labelled training data.
To understand underlying bias, it is important to understand the rationale used by labellers. This is impractical when every datapoint is hand labelled and the logic behind labelling it one way or another is not captured. Moreover, information about label author and dataset versioning is rarely available. Often labelling is outsourced or in-house labellers have moved on to other projects or organizations.
Snorkel AI’s programmatic labelling approach helps discover, manage, and mitigate bias. Instead of discarding the rationale behind each manually labelled datapoint, Snorkel Flow, our data-centric AI platform, captures the labellers’ (subject matter experts, data scientists, and others) knowledge as a labelling function and generates probabilistic labels using theoretical grounded algorithms encoded in a novel label model.
With Snorkel Flow, users can understand exactly why a certain datapoint was labelled the way it is. This process, along with label function and label dataset versioning, allows users to audit, interpret, and even explain model behaviours. This shift from manual to programmatic labelling is key to managing bias.
AN: A group led by Snorkel researcher Stephen Bach recently had their paper on Zero-Shot Learning with Common Sense Knowledge Graphs (ZSL-KG) published. I’d direct readers to the paper for the full details, but can you give us a brief overview of what it is and how it improves over existing WordNet-based methods?
DS: ZSL-KG improves graph-based zero-shot learning in two ways: richer models and richer data. On the modelling side, ZSL-KG is based on a new type of graph neural network called a transformer graph convolutional network (TrGCN).
Many graph neural networks learn to represent nodes in a graph through linear combinations of neighbouring representations, which is limiting. TrGCN uses small transformers at each node to combine neighbourhood representations in more complex ways.
On the data side, ZSL-KG uses common sense knowledge graphs, which use natural language and graph structures to make explicit many types of relationships among concepts. They are much richer than the typical ImageNet subtype hierarchy.
AN: Gartner designated Snorkel a ‘Cool Vendor’ in its 2022 AI Core Technologies report. What do you think makes you stand out from the competition?
DS: Data labelling is one of the biggest challenges for enterprise AI. Most organisations realise that current approaches are unscalable and often ridden with quality, explainability, and adaptability issues. Snorkel AI not only provides a solution for automating data labelling but also uniquely offers an AI development platform to adopt a data-centric approach and leverage knowledge resources including subject matter experts and existing systems.
In addition to the technology, Snorkel AI brings together 7+ years of R&D (which began at the Stanford AI Lab) and a highly-talented team of machine learning engineers, success managers, and researchers to successfully assist and advise customer development as well as bring new innovations to market.
Snorkel Flow unifies all the necessary components of a programmatic, data-centric AI development workflow—training data creation/management, model iteration, error analysis tooling, and data/application export or deployment—while also being completely interoperable at each stage via a Python SDK and a range of other connectors.
This unified platform also provides an intuitive interface and streamlined workflow for critical collaboration between SME annotators, data scientists, and other roles, to accelerate AI development. It allows data science and ML teams to iterate on both data and models within a single platform and use insights from one to guide the development of the other, leading to rapid development cycles.