This assignment focuses on implementing and training Vision Transformers (ViT) for action recognition on the KTH-Actions dataset. The project explores transformer-based architectures for video classification, comparing different patch sizes and evaluating performance against RNN models from Assignment 3.
- Implement a Vision Transformer (ViT) for action recognition on video sequences
- Process video frames by breaking them into patches and processing them jointly with a transformer
- Compare different patch sizes and their impact on model performance
- Evaluate ViT performance against RNN models from Assignment 3
- Explore Video Vision Transformer (ViViT) with Space-Time attention (extra credit)
KTH-Actions Dataset - 6-class action recognition dataset
- Task: Action recognition from video sequences
- Classes: walking, jogging, running, boxing, handwaving, handclapping
- Image size: 64×64×1 (grayscale)
- Frame processing:
- Maximum 80 frames per sequence
- Temporal slicing with step size of 8 (resulting in 10 frames per training sample)
- Random temporal sampling to handle varying sequence content and avoid empty frames
- Split:
- Training: Person IDs 0-16
- Testing: Person IDs 17-25
The dataset is located at /home/nfs/inf6/data/datasets/kth_actions/processed/
A transformer-based architecture for video action recognition:
Architecture Components:
- Patchifier: Breaks each frame into patches (configurable patch size)
- Patch Projection: Projects patches to token dimension with LayerNorm
- CLS Token: Learnable classification token for each frame
- Positional Encoding: Adds positional information to tokens
- Transformer Blocks: Stack of transformer encoder blocks with:
- Multi-Head Self-Attention
- MLP (Feed-Forward Network)
- Residual connections and LayerNorm
- Classifier: Linear layer for final classification
Key Features:
- Processes video sequences frame-by-frame
- Each frame is divided into patches and processed independently
- CLS tokens from all frames are averaged for final classification
- Supports configurable patch sizes, token dimensions, and number of layers
Default Configuration:
- Patch size: 16×16 (configurable: 8, 16, 32, 64)
- Token dimension: 192
- Attention dimension: 192
- Number of heads: 4
- MLP size: 768
- Number of transformer layers: 6
- Number of classes: 6
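The pipeline described above (patchify, project, prepend a CLS token, encode, average CLS tokens across frames, classify) can be sketched as follows. This is a minimal illustration assuming the default configuration, not the assignment's actual `models.py`; the class and method names are hypothetical:

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Sketch of a frame-level ViT with the default configuration above."""
    def __init__(self, img_size=64, patch_size=16, token_dim=192,
                 num_layers=6, num_heads=4, mlp_size=768, num_classes=6):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_size = patch_size
        # Patch projection: flatten each patch, project to token_dim, normalize
        self.proj = nn.Sequential(
            nn.Linear(patch_size * patch_size, token_dim), nn.LayerNorm(token_dim))
        self.cls_token = nn.Parameter(torch.zeros(1, 1, token_dim))
        self.pos_emb = nn.Parameter(torch.zeros(1, num_patches + 1, token_dim))
        enc_layer = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=num_heads, dim_feedforward=mlp_size,
            activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        self.classifier = nn.Linear(token_dim, num_classes)

    def patchify(self, frames):
        # frames: (B*T, 1, H, W) -> (B*T, num_patches, patch_size**2)
        p = self.patch_size
        patches = frames.unfold(2, p, p).unfold(3, p, p)
        return patches.reshape(frames.shape[0], -1, p * p)

    def forward(self, video):
        # video: (B, T, 1, H, W); each frame is processed independently
        B, T = video.shape[:2]
        frames = video.flatten(0, 1)                   # (B*T, 1, H, W)
        tokens = self.proj(self.patchify(frames))      # (B*T, N, D)
        cls = self.cls_token.expand(tokens.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_emb
        cls_out = self.encoder(tokens)[:, 0]           # CLS token per frame
        cls_avg = cls_out.reshape(B, T, -1).mean(dim=1)  # average over frames
        return self.classifier(cls_avg)
```

With 16×16 patches on 64×64 frames, each frame yields 16 patch tokens plus one CLS token.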
Implements scaled dot-product attention with multiple heads:
- Efficient head splitting and merging for batch processing
- Supports 4D input tensors (batch, sequence, tokens, dimensions)
- Attention maps can be extracted for visualization
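The head splitting and merging over 4D inputs can be sketched like this (an illustrative implementation, not the assignment's exact one; names are hypothetical):

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Sketch of scaled dot-product attention over 4D inputs
    (batch, sequence, tokens, dim)."""
    def __init__(self, dim=192, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x, return_attn=False):
        B, S, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split heads: (B, S, N, D) -> (B, S, heads, N, head_dim)
        split = lambda t: t.reshape(B, S, N, self.num_heads, self.head_dim).transpose(2, 3)
        q, k, v = split(q), split(k), split(v)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)                # attention maps for visualization
        y = attn @ v                               # (B, S, heads, N, head_dim)
        y = y.transpose(2, 3).reshape(B, S, N, D)  # merge heads back
        y = self.out(y)
        return (y, attn) if return_attn else y
```

Returning `attn` alongside the output is one way to expose the attention maps mentioned above.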
Standard transformer encoder block:
- Multi-Head Self-Attention
- Residual connections
- Layer Normalization
- MLP with GELU activation
- Dropout for regularization
The project includes experiments comparing different configurations:
| Configuration | Patch Size | Epochs | Token Dim | MLP Size | Layers |
|---|---|---|---|---|---|
| ViT_patch_size_8 | 8×8 | 60 | 128 | 512 | 4 |
| ViT_patch_size_16_epochs_60 | 16×16 | 60 | 192 | 768 | 6 |
| ViT_patch_size_16_epochs_100 | 16×16 | 100 | 192 | 768 | 6 |
| ViT_patch_size_32 | 32×32 | 60 | - | - | - |
| ViT_patch_size_64 | 64×64 | 60 | - | - | - |
- Optimizer: Adam
- Learning rate: 3e-4
- Batch size: 32
- Epochs: 60-100 (configurable)
- Loss function: CrossEntropyLoss
- Scheduler: StepLR (step_size=10, gamma=1/3, optional)
- Validation: Evaluated on test set
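The setup above corresponds roughly to the following sketch (`model` is a stand-in linear layer and the data is dummy; only the optimizer, scheduler, and loss settings come from the configuration):

```python
import torch
import torch.nn as nn

model = nn.Linear(192, 6)  # stand-in for the ViT
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
# Optional StepLR: multiply the learning rate by 1/3 every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=1 / 3)
criterion = nn.CrossEntropyLoss()

# One illustrative step on dummy data (batch_size=32)
inputs, labels = torch.randn(32, 192), torch.randint(0, 6, (32,))
optimizer.zero_grad()
loss = criterion(model(inputs), labels)
loss.backward()
optimizer.step()
scheduler.step()  # called once per epoch in the real loop
```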
Temporal Processing:
- Random temporal sampling: Selects 80 frames with random start index from each sequence
- Temporal slicing: Samples every 8th frame (resulting in 10 frames per sample)
- Handles dataset disparities by avoiding empty frames
Spatial Augmentations (Training):
- Random horizontal flip (p=0.5)
- Random rotation (±25 degrees)
Temporal Augmentations (Training):
- Random temporal sampling (slicing step=8)
- Random temporal reversal (p=0.3)
Test Transforms:
- Temporal sampling only (no augmentation)
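The temporal pipeline above can be sketched as a single hypothetical helper (padding of short sequences is assumed to happen in the dataset class, as described later):

```python
import torch

def sample_clip(video, max_frames=80, step=8, train=True):
    """Sketch: pick a window of up to `max_frames` frames (random start
    during training), take every `step`-th frame (10 frames when the
    window is full), and reverse in time with p=0.3 during training."""
    total = video.shape[0]
    if total > max_frames and train:
        start = torch.randint(0, total - max_frames + 1, (1,)).item()
    else:
        start = 0
    clip = video[start:start + max_frames:step]
    if train and torch.rand(1).item() < 0.3:
        clip = clip.flip(0)  # random temporal reversal
    return clip

# e.g. a 120-frame sequence yields a 10-frame clip
clip = sample_clip(torch.randn(120, 1, 64, 64))
```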
- TensorBoard logging: Training/validation loss, accuracy, and learning rate curves
- Model checkpointing: Saves best models with training configurations
- Progress tracking: Real-time training progress with tqdm
- Evaluation metrics: Accuracy, loss tracking
- Configuration management: YAML-based config files for experiment tracking
- Reproducibility: Fixed random seeds for consistent results
KTHActionDataset:
- Handles video sequence loading
- Supports train/test splits based on person IDs
- Random frame selection within sequences
- Padding for sequences shorter than max_frames
- Grayscale image conversion and resizing
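The split-by-person and padding logic can be sketched as follows (an in-memory illustration only; the real KTHActionDataset loads and resizes frames from disk):

```python
import torch
from torch.utils.data import Dataset

class KTHSketchDataset(Dataset):
    """Bare-bones sketch: persons 0-16 form the train split,
    17-25 the test split; short sequences are padded to max_frames."""
    def __init__(self, sequences, labels, person_ids, split="train", max_frames=80):
        train_ids = set(range(0, 17))
        keep = [i for i, p in enumerate(person_ids)
                if (p in train_ids) == (split == "train")]
        self.sequences = [sequences[i] for i in keep]
        self.labels = [labels[i] for i in keep]
        self.max_frames = max_frames

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        seq = self.sequences[idx]
        if seq.shape[0] < self.max_frames:
            # Pad by repeating the last frame (one possible padding scheme)
            pad = seq[-1:].repeat(self.max_frames - seq.shape[0], 1, 1, 1)
            seq = torch.cat([seq, pad], dim=0)
        return seq[: self.max_frames], self.labels[idx]
```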
Assignment7/
├── Assignment7.ipynb # Main assignment notebook
├── Session7.ipynb # Lab session materials
├── models.py # ViT and transformer components
├── trainer.py # Training script
├── utils.py # Utility functions (training, evaluation, visualization)
├── dataloader.py # KTHActionDataset implementation
├── transformations.py # Data augmentation transforms
├── configs/ # Experiment configuration files
│ ├── ViT_patch_size_16_epochs_100.yaml
│ └── ViT_patch_size_16_epochs_60.yaml
├── tboard_logs/ # TensorBoard logs
│ ├── ViT_patch_size_8_epochs_60/
│ ├── ViT_patch_size_16_epochs_60/
│ ├── ViT_patch_size_16_epochs_100/
│ ├── ViT_patch_size_32_epochs_60/
│ └── ViT_patch_size_64_epochs_60/
└── resources/ # Reference images and documentation
├── vit_img.png
├── seminar.png
└── ...
The notebook includes comprehensive analysis:
- Learning curves: Training vs validation loss over epochs
- Accuracy metrics: Overall classification accuracy
- Patch size comparison: Performance across different patch sizes
- Comparison with RNN models: Evaluation against Assignment 3 results
- Patch Size Impact: Smaller patch sizes (8×8) provide more tokens per frame but increase computational cost
- Temporal Processing: Averaging CLS tokens across frames effectively captures temporal information
- Data Handling: Random temporal sampling helps avoid empty frames and improves training stability
- Transformer Architecture: ViT shows competitive performance for action recognition tasks
- Attention Mechanisms: Multi-head attention allows the model to focus on different spatial regions
1. Install dependencies:

   ```bash
   pip install torch torchvision numpy matplotlib seaborn tqdm pyyaml tensorboard pytorch-lightning
   ```

2. Ensure the dataset is available:
   - The dataset should be located at `/home/nfs/inf6/data/datasets/kth_actions/processed/`
   - Alternatively, modify `root_dir` in the training script

3. Open the notebook:

   ```bash
   jupyter notebook Assignment7.ipynb
   ```
4. Using the notebook, execute cells sequentially to:
   - Load and inspect the dataset
   - Define and initialize the ViT model
   - Train with different configurations
   - Evaluate models and visualize results
5. Using the training script:

   ```bash
   python trainer.py
   ```

   - Modify the `configs` dictionary in `trainer.py` to change hyperparameters
   - Configurations are automatically saved to YAML files
6. Custom configuration:

   ```python
   configs = {
       "model_name": "ViT",
       "batch_size": 32,
       "num_epochs": 100,
       "lr": 3e-4,
       "patch_size": 16,
       "token_dim": 192,
       "attn_dim": 192,
       "num_heads": 4,
       "mlp_size": 768,
       "num_tf_layers": 6,
       "num_classes": 6,
       "max_frames": 80,
       "slicing_step": 8,
   }
   ```
```bash
tensorboard --logdir=tboard_logs
```

Then open http://localhost:6006 in your browser to view the training curves.
```python
import torch
from utils import load_model

checkpoint = torch.load('checkpoints/checkpoint_ViT_patch_size_16_epochs_100.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
stats = checkpoint['stats']
```

The `utils.py` file provides:
- `train_model()`: Complete training loop with TensorBoard logging
- `train_epoch()`: Training for one epoch
- `eval_model()`: Model evaluation on the validation/test set
- `save_model()` / `load_model()`: Model checkpointing
- `count_model_params()`: Count learnable parameters
- `smooth()`: Loss curve smoothing for visualization
- `set_random_seed()`: Reproducibility utilities
The assignment mentions implementing ViViT with Space-Time attention as an extra credit task. ViViT extends ViT to explicitly model temporal relationships in video sequences using space-time attention mechanisms.
Key Differences from Standard ViT:
- Space-Time Attention: Jointly attends to spatial and temporal dimensions
- Temporal Modeling: Explicitly models relationships between frames
- 3D Patches: Can process spatiotemporal patches instead of frame-by-frame
Reference: ViViT: A Video Vision Transformer
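The 3D-patch idea can be sketched with a "tubelet" embedding, as described in the ViViT paper: a `Conv3d` whose kernel and stride match extracts non-overlapping spatiotemporal patches. This is illustrative only; the extra-credit implementation may differ:

```python
import torch
import torch.nn as nn

# Tubelet embedding: 2 frames x 16x16 pixels per token (assumed sizes)
tubelet = nn.Conv3d(in_channels=1, out_channels=192,
                    kernel_size=(2, 16, 16), stride=(2, 16, 16))

video = torch.randn(4, 1, 10, 64, 64)      # (B, C, T, H, W)
tokens = tubelet(video)                    # (4, 192, 5, 4, 4)
tokens = tokens.flatten(2).transpose(1, 2) # (4, 80, 192): 5*4*4 spatiotemporal tokens
```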
- Vision Transformer (ViT) Paper
- ViViT: A Video Vision Transformer
- KTH-Actions Dataset
- PyTorch Documentation
- TensorBoard
- Attention Is All You Need
If you found this project helpful, you can support my work by buying me a coffee or via PayPal!
This assignment demonstrates transformer-based architectures for video action recognition, exploring how Vision Transformers can be adapted for temporal sequence modeling.
