This article is meant as an introduction to LLMs, diffusion models, the architectures underlying them, and their implications for video classification as a whole.
First, let me make my stance on this project's feasibility and solution explicit from the start, so there is no confusion:
On the topic of video classification for our project, I believe that using LLM/Diffusion-based systems to do zero-shot/few-shot video classification for our desired tags is infeasible.
That said, I believe it is completely solvable using a supervised, fine-tuned custom model, which might itself be LLM/Diffusion-based.
In this section, I’ll quickly review the typical AI lingo that we might stumble upon, both in this document and in general. A complete glossary can be found on this page:
These terms are often thrown around to distinguish between model types, but they actually refer to entirely different concepts!
Supervised, unsupervised and self-supervised describe different methods of training an AI model, while zero-shot, few-shot and many-shot (or n-shot) can refer either to different inference methods or to different capabilities the model has.
To make things more confusing, many models today are trained both supervised and self-supervised, and some models support several different shot types.
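To make the n-shot distinction concrete, here is a minimal sketch of a zero-shot prompt versus a few-shot prompt. The sentiment task and the prompt wording are purely illustrative assumptions on my part, not something from our project, and no real model or API is called:

```python
# A minimal sketch of zero-shot vs. few-shot prompting for classification.
# The prompts below are illustrative only; no real model or API is assumed.

ZERO_SHOT_PROMPT = """Classify the sentiment of the following review as positive or negative.

Review: The plot was predictable and the acting was flat.
Sentiment:"""

FEW_SHOT_PROMPT = """Classify the sentiment of the following reviews as positive or negative.

Review: I loved every minute of it.
Sentiment: positive

Review: A complete waste of two hours.
Sentiment: negative

Review: The plot was predictable and the acting was flat.
Sentiment:"""

# Zero-shot: the model gets only a task description.
# Few-shot: the same task description, preceded by a handful of solved examples.
print(ZERO_SHOT_PROMPT)
print(FEW_SHOT_PROMPT)
```

Note that the only difference is the handful of solved examples prepended in the few-shot case; the model’s weights are untouched either way, which is what separates n-shot inference from actual training.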
Supervised, self-supervised and unsupervised learning/training distinguish whether the dataset used is labelled (supervised), whether labels are created automatically from the data itself (self-supervised), or whether the data contains no labels at all (unsupervised).
For example, the GPT family of models is first pretrained on text in a self-supervised manner: the model is given a chunk of text with the end cut off and is asked to guess which word comes next. As with supervised learning, how “bad” the model’s guess was is used to correct the model, so it can learn.
<aside> 🤓
During pretraining of an LLM like GPT-4, a piece of text is taken from the dataset:
The quick brown fox jumps over the lazy dog
This text is then truncated, meaning we cut off and hide one or more words from the end:
The quick brown fox jumps over the lazy __
We now have a label for our dataset for free: we can give the truncated text as input and ask the model to guess the next word, and the label is the word that actually follows, dog:
Input: The quick brown fox jumps over the lazy __
Label: dog
Generally, this only gives you a model that can guess the next word; however, in doing so the model learns the patterns and structure of our language.
</aside>
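To make the aside concrete, here is a minimal sketch of how such (input, label) pairs can be generated automatically from raw text for next-word prediction. Real pretraining operates on tokens rather than words and at vastly larger scale; this is only meant to show that no human labelling is required:

```python
# A minimal sketch of self-supervised label generation, as described in
# the aside above. Every prefix of the sentence becomes an input, and the
# word that follows it becomes the label -- no human annotation needed.

text = "The quick brown fox jumps over the lazy dog"
words = text.split()

# Build (input, label) pairs: ("The", "quick"), ("The quick", "brown"), ...
pairs = [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]

for inp, label in pairs:
    print(f"Input: {inp!r} -> Label: {label!r}")
```

Running this on our example sentence yields eight training pairs from a single nine-word string, which hints at why self-supervised pretraining scales so well: every piece of raw text is its own labelled dataset.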