# Zero-Shot Learning: Learning Without Labels

**Source:** [https://www.youtube.com/watch?v=pVpr4GYLzAo](https://www.youtube.com/watch?v=pVpr4GYLzAo)
**Duration:** 00:08:54

## Key Points

- Humans can recognize objects (e.g., a pen) by matching them to known attributes, enabling us to distinguish roughly 30,000 categorical concepts without seeing every instance.
- Traditional supervised deep‑learning models require large, labeled datasets for each category, making it costly and computationally intensive to achieve human‑level breadth across thousands of classes.
- N‑shot learning (few‑shot, one‑shot) mitigates this by leveraging transfer and meta‑learning to generalize from very few examples, but it still depends on at least one labeled instance per new class.
- Zero‑shot learning eliminates the need for any labeled examples by using semantic knowledge (e.g., textual descriptions or attribute lists) to infer new categories, mirroring how a child can identify a “bird” after reading a description rather than seeing a picture.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=pVpr4GYLzAo&t=0s) **From Supervised to Few‑Shot Learning** - The passage contrasts human ability to recognize tens of thousands of objects with the data‑intensive demands of supervised deep learning, motivating N‑shot (few‑shot) approaches that use transfer and meta‑learning to generalize to many categories with minimal training.
- [00:03:08](https://www.youtube.com/watch?v=pVpr4GYLzAo&t=188s) **Attribute-Based Zero‑Shot Learning Explained** - The speaker illustrates how zero‑shot learning uses descriptive attributes (e.g., color, shape, wings) to enable a model to recognize unseen classes like birds or pens by inferring their labels from learned feature concepts.
- [00:06:14](https://www.youtube.com/watch?v=pVpr4GYLzAo&t=374s) **Embedding and Generative Zero‑Shot Methods** - The passage explains how joint embedding spaces align multimodal vectors for embedding‑based zero‑shot learning and how generative approaches—including large language models and GANs—create synthetic examples to recognize unseen classes.

## Full Transcript
If I asked you to identify this object in my hand, you'd likely say
it's a pen.
Even if you've never seen this specific pen, which is a special marker for light boards,
it shares enough attributes with other pens for you to recognize it.
In fact, you and most humans can recognize approximately
30,000 individually distinguishable object categories.
Now, to train a deep learning model to recognize objects,
we often turn to something called supervised learning,
a form of deep learning that requires many labeled examples.
Models learn by making predictions on a bunch of labeled examples in a data set.
These labels provide the correct answers,
or the ground truth for each example.
The model adjusts its weights to minimize the difference between its predictions and the ground truth, and this process needs a whole bunch
of labeled samples for many rounds of training.
So if we want AI models to remotely approach human capabilities
using supervised learning, they must be explicitly trained on labeled
data for something like 30,000 object categories.
That's a lot of time, cost and compute.
So the need for machine learning models to be able to generalize quickly
to a large number of semantic categories with minimal training overhead,
has given rise to something called N-shot learning.
That's a subset of machine learning that includes a number of related approaches.
So we have few shot learning, which uses transfer learning and meta learning methods to train models to recognize new classes.
Then we have one shot learning,
which just uses a single labeled example to learn.
But what if we don't want to use any labeled examples at all?
Well, that is the focus of this video.
And that is zero shot learning where instead of providing labeled examples,
the model is asked to make predictions on unseen classes post training.
Zero shot learning has become a notable area of research in data science,
particularly in the fields of computer vision and natural language processing.
So how does it work without explicit annotations to guide it?
Zero shot learning requires a more fundamental understanding of the label's meaning, because after all, that's how we humans do it.
So imagine a child wants to learn what a bird looks like.
In a process similar to few shot learning, the child would learn by looking at images labeled bird in a book of animal pictures.
Moving forward, she'll recognize a bird because it resembles the bird images she's already seen.
But in a zero shot learning scenario, no such labeled examples are available.
So instead, the child might read a written description of birds: they're small or medium sized
animals with feathers, beaks and wings that can fly through the air.
So she should then be able to recognize a bird in the real world,
even though she's never seen one before,
because she has learned the concept of a bird.
Just as even if you've never seen this pen before,
you can still classify it because of its cylindrical shape,
its tip, and the fact that it leaves colored markings
when it comes in contact with the glass in front of me.
And yes, there is actually glass in front of me.
I'm not writing into thin air.
Now, what I've described is one way to implement zero shot learning,
and that method is called attribute based.
So this is an attribute based zero shot learning method.
That's where we train on labeled features
like color, shape, and other characteristics.
Even without seeing the target classes during the training,
the model infers labels based on similar attributes.
So for example, a model can learn about different types of animals.
So let's say it starts off by learning about stripes,
and it learns about stripes from looking at images of tigers and zebras.
Then it can learn about yellow
the color yellow from images, let's say of canaries.
And then it can learn about flying insects
from, let's say, just looking at,
well, pictures of flies.
The model can now perform zero shot classification of a new animal,
let's say bees, despite the absence of bee images in the training set,
because it can understand them as a combination of learned features
so striped plus yellow
plus flying insect that might equal
bee.
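The striped-plus-yellow-plus-flying-insect reasoning above can be sketched in a few lines. This is a toy illustration, not a real trained model: the attribute signatures and the "detected" attribute vector are made up, and a real system would predict the attributes with a classifier trained on the seen classes.

```python
import numpy as np

# Hypothetical attribute vocabulary: [striped, yellow, flying_insect].
# Seen classes teach the model what each attribute looks like;
# the unseen class is described only by its attribute signature.
class_attributes = {
    "tiger":  np.array([1, 0, 0]),   # seen: striped
    "canary": np.array([0, 1, 0]),   # seen: yellow
    "fly":    np.array([0, 0, 1]),   # seen: flying insect
    "bee":    np.array([1, 1, 1]),   # UNSEEN: striped + yellow + flying insect
}

def predict_class(detected_attributes, candidates):
    """Pick the candidate class whose attribute signature best
    matches the attributes detected in the input image."""
    scores = {
        name: np.dot(detected_attributes, attrs)
        for name, attrs in candidates.items()
    }
    return max(scores, key=scores.get)

# Suppose an attribute detector (trained only on the seen classes)
# reports that a new image is striped, yellow, and a flying insect.
detected = np.array([1, 1, 1])
print(predict_class(detected, class_attributes))  # bee
```

Even though no bee images were ever seen, the class with the best-matching combination of learned attributes wins.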
Now attribute based methods are quite versatile, but they do have some drawbacks.
They rely on the assumption that every class can be described
with a single vector of attributes, which isn't always the case.
So for example, like a Tesla Cybertruck and a Volkswagen Beetle,
they're both cars, but they differ greatly in shape, size, materials, and features.
Now, many zero shot learning methods use an alternative approach,
and that is known as embedding,
so an embedding based approach
to zero shot learning.
Now this works by representing both classes
and samples as vector embeddings that reflect their features and relationships.
Classification is determined by measuring similarity between these embeddings
using metrics like cosine similarity or Euclidean distance.
Similar to the k-nearest neighbors algorithm.
And because embedding based methods typically process inputs from
multiple modalities like word embeddings that describe a class label
and image embeddings of a photograph that might belong to the same class,
they require a way to compare between embeddings of different data types,
and that's where joint embedding space can help normalize those vector embeddings.
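The similarity-based classification just described can be sketched as follows. The embeddings here are invented toy vectors; in practice they would come from trained text and image encoders that project into a shared joint embedding space.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical class-label embeddings in a shared (joint) space,
# e.g. produced by a text encoder from the class names.
label_embeddings = {
    "bird": np.array([0.9, 0.1, 0.3]),
    "car":  np.array([0.1, 0.8, 0.5]),
    "pen":  np.array([0.2, 0.3, 0.9]),
}

def classify(image_embedding, labels):
    """Assign the label whose embedding is most similar to the image
    embedding, much like a 1-nearest-neighbor lookup in the joint space."""
    return max(labels, key=lambda name: cosine_similarity(image_embedding, labels[name]))

image_vec = np.array([0.85, 0.15, 0.25])  # pretend image-encoder output
print(classify(image_vec, label_embeddings))  # bird
```

Because labels and images live in the same space, a class never seen at training time can still be recognized as long as its name (or description) can be embedded.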
Now, another method of
zero shot learning relates to generative
based methods of zero shot learning.
Now, if we think of the first example
of generative based, we have to think of LLMs.
That's large language models and large language models have a natural ability
to perform zero shot learning based on their ability to fundamentally
understand the meaning of the words used to name data classes.
LLMs are pre-trained through self-supervised
learning on a massive corpus of text that may contain incidental references
to knowledge about unseen data classes, which the LLM can learn to make sense of.
And then, beyond just LLMs, another zero shot generative
based approach is my favorite type of neural network.
And that is a GAN.
GAN, that's an acronym for Generative Adversarial Network.
And it actually consists of two competing neural networks
jointly trained in an adversarial zero sum game.
So there's a generator component
that uses semantic attributes and Gaussian noise to synthesize samples.
And then there is a discriminator network as well.
which determines whether samples are real or fake,
fake meaning they were synthesized by the generator.
Now feedback from the discriminator is used to train the generator
until the discriminator can no longer distinguish
between the real and the synthetic samples.
That's perfect for generating synthetic data that mimics
the attributes of unseen classes, thereby enabling models
to learn from these synthesized examples as if they were labeled.
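The generative idea can be sketched in miniature. This is a heavily simplified illustration that skips the adversarial training loop entirely: the generator below stands in for an already-trained GAN generator as a fixed linear map (its weights are made up), mapping semantic attributes plus Gaussian noise to feature vectors. The synthetic features then give a downstream classifier a prototype for the unseen class.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(attributes, noise):
    """Stand-in for a trained GAN generator: maps a semantic attribute
    vector plus Gaussian noise to a feature vector. Here it's just a
    fixed linear map with hypothetical weights, for illustration."""
    W = np.array([[2.0, 0.0],
                  [0.0, 2.0],
                  [1.0, 1.0]])
    return attributes @ W + 0.1 * noise

# Unseen class "bee", described only by attributes [striped, yellow, flying].
bee_attributes = np.array([1.0, 1.0, 1.0])

# Synthesize labeled examples for the unseen class...
synthetic = np.array([generator(bee_attributes, rng.normal(size=2))
                      for _ in range(50)])

# ...and use them as if they were real labeled data: a nearest-class-mean
# classifier now has a prototype for "bee" without any real bee images.
bee_prototype = synthetic.mean(axis=0)

test_feature = np.array([3.0, 3.0])  # pretend feature of a real bee image
print(np.linalg.norm(test_feature - bee_prototype))  # small distance -> "bee"
```

In a real system the generator's weights would be learned adversarially against the discriminator, and the synthetic features would train a full classifier rather than a single prototype.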
So that is zero shot learning.
It's something you do effortlessly every time you see a new object.
And it's something deep learning models can be taught to do as well.
Zero shot learning shows AI's potential to generalize from minimal information,
saving time, compute, and the hassle of labeling data.
If you like this video and
want to see more like it, please like and subscribe.
If you have any questions or want to share your thoughts about this topic,
please leave a comment below.