This article is meant as an introduction to LLMs, diffusion models, the architectures underlying them, and their implications for video classification as a whole.
First, let me make my stance on this project's feasibility and solution explicit from the start, so there is no confusion:
On the topic of video classification for our project, I believe that using LLM/Diffusion-based systems to do zero-shot/few-shot video classification for our desired tags is infeasible.
That said, I believe it is completely solvable using a supervised, fine-tuned custom model, which might itself be LLM/Diffusion-based.
In this section, I’ll quickly review the typical AI lingo that we might stumble upon, both in this document and in general. A complete glossary can be found on this page:
These terms are often thrown around to distinguish between model types, but they actually refer to entirely different concepts!
Supervised, unsupervised and self-supervised describe different methods of training an AI model, while zero-shot, few-shot and many-shot (or n-shot) can refer either to different inference methods or to different capabilities the model has.
To make things more confusing, many models today are trained both supervised and self-supervised, and some models support several different shot types.
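To make the n-shot distinction concrete, here is a minimal sketch of a zero-shot prompt versus a few-shot prompt. The sentiment task and the prompt wording are purely illustrative assumptions on my part, not something from our project, and no real model or API is called:

```python
# A minimal sketch of zero-shot vs. few-shot prompting for classification.
# The prompts below are illustrative only; no real model or API is assumed.

ZERO_SHOT_PROMPT = """Classify the sentiment of the following review as positive or negative.

Review: The plot was predictable and the acting was flat.
Sentiment:"""

FEW_SHOT_PROMPT = """Classify the sentiment of the following reviews as positive or negative.

Review: I loved every minute of it.
Sentiment: positive

Review: A complete waste of two hours.
Sentiment: negative

Review: The plot was predictable and the acting was flat.
Sentiment:"""

# Zero-shot: the model gets only a task description.
# Few-shot: the same task description, preceded by a handful of solved examples.
print(ZERO_SHOT_PROMPT)
print(FEW_SHOT_PROMPT)
```

Note that the only difference is the handful of solved examples prepended in the few-shot case; the model’s weights are untouched either way, which is what separates n-shot inference from actual training.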
Supervised, self-supervised and unsupervised learning/training distinguish whether the dataset used is labelled (supervised), whether labels are created automatically from the data itself (self-supervised), or whether the data contains no labels at all (unsupervised).
For example, the GPT family of models is first pretrained on text in a self-supervised manner: the model is given a chunk of text with the end cut off and is asked to guess which word comes next. As with supervised learning, how “bad” the model’s guess was is used to correct the model, so it can learn.
<aside> 🤓
During pretraining of an LLM like GPT-4, a piece of text is taken from the dataset:
The quick brown fox jumps over the lazy dog
This text is then truncated, meaning we cut off and hide one or more words from the end:
The quick brown fox jumps over the lazy __
We now have a label for our dataset for free: we can give the truncated text as input and ask the model to guess the next word, and the label is the word that actually follows, dog:
Input: The quick brown fox jumps over the lazy __
Label: dog
Generally, this only gives you a model that can guess the next word; however, in doing so the model learns the patterns and structure of our language.
</aside>
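To make the aside concrete, here is a minimal sketch of how such (input, label) pairs can be generated automatically from raw text for next-word prediction. Real pretraining operates on tokens rather than words and at vastly larger scale; this is only meant to show that no human labelling is required:

```python
# A minimal sketch of self-supervised label generation, as described in
# the aside above. Every prefix of the sentence becomes an input, and the
# word that follows it becomes the label -- no human annotation needed.

text = "The quick brown fox jumps over the lazy dog"
words = text.split()

# Build (input, label) pairs: ("The", "quick"), ("The quick", "brown"), ...
pairs = [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]

for inp, label in pairs:
    print(f"Input: {inp!r} -> Label: {label!r}")
```

Running this on our example sentence yields eight training pairs from a single nine-word string, which hints at why self-supervised pretraining scales so well: every piece of raw text is its own labelled dataset.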