This article is meant as an introduction to LLMs, diffusion models, their underlying architectures, and their implications for video classification as a whole.

First, let me make my stance on this project's feasibility and solution explicitly clear from the start, so there is no confusion:

On the topic of video classification for our project, I believe that using LLM/Diffusion-based systems to do zero-shot/few-shot video classification for our desired tags is infeasible.

That said, I believe it is completely solvable using a supervised, fine-tuned custom model, which might itself be LLM/Diffusion-based.

Introduction - glossary and “what does it mean?”

In this section, I’ll quickly review all the typical AI lingo that we might stumble upon, both in this document and in general. A complete glossary can be found on this page:

Glossary

Supervised, unsupervised, few-shot, zero-shot, n-shot

These terms are often thrown around to distinguish between model types, but they actually refer to entirely different concepts!

Supervised, unsupervised and self-supervised are different methods of training an AI model, while zero-shot, few-shot and many-shot (or n-shot) can refer either to different inference methods or to different capabilities of the model.

To make things more confusing, many models today are trained with both supervised and self-supervised methods, and some models support several different shot types.
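To make the inference-side distinction concrete, here is a minimal sketch of zero-shot vs. few-shot prompting of a text LLM for our kind of tagging task. The `llm` callable, the prompts and the tag names are hypothetical placeholders, not an actual implementation of our pipeline:

```python
# Zero-shot vs. few-shot *inference*: same model, same weights --
# only the prompt (and thus the number of in-context examples,
# the "shots") differs. All names below are illustrative only.

ZERO_SHOT_PROMPT = """Classify the following video description as one of:
sports, cooking, gaming.

Description: {description}
Tag:"""

FEW_SHOT_PROMPT = """Classify the following video description as one of:
sports, cooking, gaming.

Description: A chef dices onions and sears a steak.
Tag: cooking

Description: A streamer speedruns a platformer level.
Tag: gaming

Description: {description}
Tag:"""

def classify(llm, description: str, few_shot: bool = False) -> str:
    """Ask a (hypothetical) LLM callable for a tag; `few_shot=True`
    simply swaps in the prompt that carries worked examples."""
    prompt = FEW_SHOT_PROMPT if few_shot else ZERO_SHOT_PROMPT
    return llm(prompt.format(description=description)).strip()
```

The point of the sketch: the shot count lives in the prompt, not in the model, which is why the same model can “do” several shot types.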

Training types

Supervised, self-supervised and unsupervised learning/training distinguish whether the dataset used is labelled, whether labels are created automatically from the data itself, or whether the data contains no labels at all.
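The difference is easiest to see in the shape of the training data itself. The following sketch uses made-up file names and a deliberately simple self-supervised objective (next-frame prediction) purely for illustration:

```python
# How the three training regimes differ in their data.
# All contents below are made up for illustration.

# Supervised: every example comes with a human-provided label.
supervised_data = [
    ("clip_001.mp4", "sports"),
    ("clip_002.mp4", "cooking"),
]

# Self-supervised: labels are derived automatically from the data
# itself, e.g. "predict the next frame" turns a raw frame sequence
# into (input, target) pairs with no human labelling involved.
def make_self_supervised_pairs(frames):
    return [(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]

# Unsupervised: no labels at all; the model only ever sees raw
# inputs, e.g. for clustering or density estimation.
unsupervised_data = ["clip_001.mp4", "clip_002.mp4"]
```

In other words, the three terms describe where the training signal comes from: a human, the data itself, or nowhere.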