This assignment focuses on implementing and training Vision Transformers (ViT) for action recognition on the KTH-Actions dataset. The project explores transformer-based architectures for video classification, comparing different patch sizes and evaluating performance against RNN models from Assignment 3.
- Implement a Vision Transformer (ViT) for action recognition on video sequences
- Process video frames by breaking them into patches and processing them jointly with a transformer
- Compare different patch sizes and their impact on model performance
- Evaluate ViT performance against RNN models from Assignment 3
- Explore Video Vision Transformer (ViViT) with Space-Time attention (extra credit)
KTH-Actions Dataset - 6-class action recognition dataset
- Task: Action recognition from video sequences
- Classes: walking, jogging, running, boxing, handwaving, handclapping
- Image size: 64×64×1 (grayscale)
- Frame processing:
- Maximum 80 frames per sequence
- Temporal slicing with step size of 8 (resulting in 10 frames per training sample)
- Random temporal sampling to handle varying sequence content and avoid empty frames
- Split:
- Training: Person IDs 0-16
- Testing: Person IDs 17-25
The dataset is located at /home/nfs/inf6/data/datasets/kth_actions/processed/
A transformer-based architecture for video action recognition:
Architecture Components:
- Patchifier: Breaks each frame into patches (configurable patch size)
- Patch Projection: Projects patches to token dimension with LayerNorm
- CLS Token: Learnable classification token for each frame
- Positional Encoding: Adds positional information to tokens
- Transformer Blocks: Stack of transformer encoder blocks with:
- Multi-Head Self-Attention
- MLP (Feed-Forward Network)
- Residual connections and LayerNorm
- Classifier: Linear layer for final classification
Key Features:
- Processes video sequences frame-by-frame
- Each frame is divided into patches and processed independently
- CLS tokens from all frames are averaged for final classification
- Supports configurable patch sizes, token dimensions, and number of layers
Default Configuration:
- Patch size: 16×16 (configurable: 8, 16, 32, 64)
- Token dimension: 192
- Attention dimension: 192
- Number of heads: 4
- MLP size: 768
- Number of transformer layers: 6
- Number of classes: 6
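The pipeline described above (patchify, project, prepend a CLS token, encode, average CLS tokens across frames, classify) can be sketched as follows. This is a minimal illustration assuming the default configuration, not the assignment's actual `models.py`; the class and method names are hypothetical:

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Sketch of a frame-level ViT with the default configuration above."""
    def __init__(self, img_size=64, patch_size=16, token_dim=192,
                 num_layers=6, num_heads=4, mlp_size=768, num_classes=6):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_size = patch_size
        # Patch projection: flatten each patch, project to token_dim, normalize
        self.proj = nn.Sequential(
            nn.Linear(patch_size * patch_size, token_dim), nn.LayerNorm(token_dim))
        self.cls_token = nn.Parameter(torch.zeros(1, 1, token_dim))
        self.pos_emb = nn.Parameter(torch.zeros(1, num_patches + 1, token_dim))
        enc_layer = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=num_heads, dim_feedforward=mlp_size,
            activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        self.classifier = nn.Linear(token_dim, num_classes)

    def patchify(self, frames):
        # frames: (B*T, 1, H, W) -> (B*T, num_patches, patch_size**2)
        p = self.patch_size
        patches = frames.unfold(2, p, p).unfold(3, p, p)
        return patches.reshape(frames.shape[0], -1, p * p)

    def forward(self, video):
        # video: (B, T, 1, H, W); each frame is processed independently
        B, T = video.shape[:2]
        frames = video.flatten(0, 1)                   # (B*T, 1, H, W)
        tokens = self.proj(self.patchify(frames))      # (B*T, N, D)
        cls = self.cls_token.expand(tokens.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_emb
        cls_out = self.encoder(tokens)[:, 0]           # CLS token per frame
        cls_avg = cls_out.reshape(B, T, -1).mean(dim=1)  # average over frames
        return self.classifier(cls_avg)
```

With 16×16 patches on 64×64 frames, each frame yields 16 patch tokens plus one CLS token.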
Implements scaled dot-product attention with multiple heads:
- Efficient head splitting and merging for batch processing
- Supports 4D input tensors (batch, sequence, tokens, dimensions)
- Attention maps can be extracted for visualization
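The head splitting and merging over 4D inputs can be sketched like this (an illustrative implementation, not the assignment's exact one; names are hypothetical):

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Sketch of scaled dot-product attention over 4D inputs
    (batch, sequence, tokens, dim)."""
    def __init__(self, dim=192, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x, return_attn=False):
        B, S, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split heads: (B, S, N, D) -> (B, S, heads, N, head_dim)
        split = lambda t: t.reshape(B, S, N, self.num_heads, self.head_dim).transpose(2, 3)
        q, k, v = split(q), split(k), split(v)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)                # attention maps for visualization
        y = attn @ v                               # (B, S, heads, N, head_dim)
        y = y.transpose(2, 3).reshape(B, S, N, D)  # merge heads back
        y = self.out(y)
        return (y, attn) if return_attn else y
```

Returning `attn` alongside the output is one way to expose the attention maps mentioned above.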
Standard transformer encoder block:
- Multi-Head Self-Attention
- Residual connections
- Layer Normalization
- MLP with GELU activation
- Dropout for regularization
The project includes experiments comparing different configurations:
| Configuration | Patch Size | Epochs | Token Dim | MLP Size | Layers |
|---|---|---|---|---|---|
| ViT_patch_size_8 | 8×8 | 60 | 128 | 512 | 4 |
| ViT_patch_size_16_epochs_60 | 16×16 | 60 | 192 | 768 | 6 |
| ViT_patch_size_16_epochs_100 | 16×16 | 100 | 192 | 768 | 6 |
| ViT_patch_size_32 | 32×32 | 60 | - | - | - |
| ViT_patch_size_64 | 64×64 | 60 | - | - | - |
- Optimizer: Adam
- Learning rate: 3e-4
- Batch size: 32
- Epochs: 60-100 (configurable)
- Loss function: CrossEntropyLoss
- Scheduler: StepLR (step_size=10, gamma=1/3, optional)
- Validation: Evaluated on test set
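The setup above corresponds roughly to the following sketch (`model` is a stand-in linear layer and the data is dummy; only the optimizer, scheduler, and loss settings come from the configuration):

```python
import torch
import torch.nn as nn

model = nn.Linear(192, 6)  # stand-in for the ViT
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
# Optional StepLR: multiply the learning rate by 1/3 every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=1 / 3)
criterion = nn.CrossEntropyLoss()

# One illustrative step on dummy data (batch_size=32)
inputs, labels = torch.randn(32, 192), torch.randint(0, 6, (32,))
optimizer.zero_grad()
loss = criterion(model(inputs), labels)
loss.backward()
optimizer.step()
scheduler.step()  # called once per epoch in the real loop
```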
Temporal Processing:
- Random temporal sampling: Selects 80 frames with random start index from each sequence
- Temporal slicing: Samples every 8th frame (resulting in 10 frames per sample)
- Handles dataset disparities by avoiding empty frames
Spatial Augmentations (Training):
- Random horizontal flip (p=0.5)
- Random rotation (±25 degrees)
Temporal Augmentations (Training):
- Random temporal sampling (slicing step=8)
- Random temporal reversal (p=0.3)
Test Transforms:
- Temporal sampling only (no augmentation)
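The temporal pipeline above can be sketched as a single hypothetical helper (padding of short sequences is assumed to happen in the dataset class, as described later):

```python
import torch

def sample_clip(video, max_frames=80, step=8, train=True):
    """Sketch: pick a window of up to `max_frames` frames (random start
    during training), take every `step`-th frame (10 frames when the
    window is full), and reverse in time with p=0.3 during training."""
    total = video.shape[0]
    if total > max_frames and train:
        start = torch.randint(0, total - max_frames + 1, (1,)).item()
    else:
        start = 0
    clip = video[start:start + max_frames:step]
    if train and torch.rand(1).item() < 0.3:
        clip = clip.flip(0)  # random temporal reversal
    return clip

# e.g. a 120-frame sequence yields a 10-frame clip
clip = sample_clip(torch.randn(120, 1, 64, 64))
```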
- TensorBoard logging: Training/validation loss, accuracy, and learning rate curves
- Model checkpointing: Saves best models with training configurations
- Progress tracking: Real-time training progress with tqdm
- Evaluation metrics: Accuracy, loss tracking
- Configuration management: YAML-based config files for experiment tracking
- Reproducibility: Fixed random seeds for consistent results
KTHActionDataset:
- Handles video sequence loading
- Supports train/test splits based on person IDs
- Random frame selection within sequences
- Padding for sequences shorter than max_frames
- Grayscale image conversion and resizing
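The split-by-person and padding logic can be sketched as follows (an in-memory illustration only; the real KTHActionDataset loads and resizes frames from disk):

```python
import torch
from torch.utils.data import Dataset

class KTHSketchDataset(Dataset):
    """Bare-bones sketch: persons 0-16 form the train split,
    17-25 the test split; short sequences are padded to max_frames."""
    def __init__(self, sequences, labels, person_ids, split="train", max_frames=80):
        train_ids = set(range(0, 17))
        keep = [i for i, p in enumerate(person_ids)
                if (p in train_ids) == (split == "train")]
        self.sequences = [sequences[i] for i in keep]
        self.labels = [labels[i] for i in keep]
        self.max_frames = max_frames

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        seq = self.sequences[idx]
        if seq.shape[0] < self.max_frames:
            # Pad by repeating the last frame (one possible padding scheme)
            pad = seq[-1:].repeat(self.max_frames - seq.shape[0], 1, 1, 1)
            seq = torch.cat([seq, pad], dim=0)
        return seq[: self.max_frames], self.labels[idx]
```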
Assignment7/
├── Assignment7.ipynb # Main assignment notebook
├── Session7.ipynb # Lab session materials
├── models.py # ViT and transformer components
├── trainer.py # Training script
├── utils.py # Utility functions (training, evaluation, visualization)
├── dataloader.py # KTHActionDataset implementation
├── transformations.py # Data augmentation transforms
├── configs/ # Experiment configuration files
│ ├── ViT_patch_size_16_epochs_100.yaml
│ └── ViT_patch_size_16_epochs_60.yaml
├── tboard_logs/ # TensorBoard logs
│ ├── ViT_patch_size_8_epochs_60/
│ ├── ViT_patch_size_16_epochs_60/
│ ├── ViT_patch_size_16_epochs_100/
│ ├── ViT_patch_size_32_epochs_60/
│ └── ViT_patch_size_64_epochs_60/
└── resources/ # Reference images and documentation
├── vit_img.png
├── seminar.png
└── ...
The notebook includes comprehensive analysis:
- Learning curves: Training vs validation loss over epochs
- Accuracy metrics: Overall classification accuracy
- Patch size comparison: Performance across different patch sizes
- Comparison with RNN models: Evaluation against Assignment 3 results
- Patch Size Impact: Smaller patch sizes (8×8) provide more tokens per frame but increase computational cost
- Temporal Processing: Averaging CLS tokens across frames effectively captures temporal information
- Data Handling: Random temporal sampling helps avoid empty frames and improves training stability
- Transformer Architecture: ViT shows competitive performance for action recognition tasks
- Attention Mechanisms: Multi-head attention allows the model to focus on different spatial regions
1. Install dependencies:

   ```bash
   pip install torch torchvision numpy matplotlib seaborn tqdm pyyaml tensorboard pytorch-lightning
   ```

2. Ensure the dataset is available:
   - The dataset should be located at `/home/nfs/inf6/data/datasets/kth_actions/processed/`
   - Alternatively, modify `root_dir` in the training script

3. Open the notebook:

   ```bash
   jupyter notebook Assignment7.ipynb
   ```
4. Using the notebook, execute cells sequentially to:
   - Load and inspect the dataset
   - Define and initialize the ViT model
   - Train with different configurations
   - Evaluate models and visualize results
5. Using the training script:

   ```bash
   python trainer.py
   ```

   - Modify the `configs` dictionary in `trainer.py` to change hyperparameters
   - Configurations are automatically saved to YAML files
6. Custom configuration:

   ```python
   configs = {
       "model_name": "ViT",
       "batch_size": 32,
       "num_epochs": 100,
       "lr": 3e-4,
       "patch_size": 16,
       "token_dim": 192,
       "attn_dim": 192,
       "num_heads": 4,
       "mlp_size": 768,
       "num_tf_layers": 6,
       "num_classes": 6,
       "max_frames": 80,
       "slicing_step": 8,
   }
   ```
```bash
tensorboard --logdir=tboard_logs
```

Then open http://localhost:6006 in your browser to view the training curves.
```python
import torch
from utils import load_model

checkpoint = torch.load('checkpoints/checkpoint_ViT_patch_size_16_epochs_100.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
stats = checkpoint['stats']
```

The `utils.py` file provides:
- `train_model()`: Complete training loop with TensorBoard logging
- `train_epoch()`: Training for one epoch
- `eval_model()`: Model evaluation on the validation/test set
- `save_model()` / `load_model()`: Model checkpointing
- `count_model_params()`: Count learnable parameters
- `smooth()`: Loss curve smoothing for visualization
- `set_random_seed()`: Reproducibility utilities
The assignment mentions implementing ViViT with Space-Time attention as an extra credit task. ViViT extends ViT to explicitly model temporal relationships in video sequences using space-time attention mechanisms.
Key Differences from Standard ViT:
- Space-Time Attention: Jointly attends to spatial and temporal dimensions
- Temporal Modeling: Explicitly models relationships between frames
- 3D Patches: Can process spatiotemporal patches instead of frame-by-frame
Reference: ViViT: A Video Vision Transformer
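The 3D-patch idea can be sketched with a "tubelet" embedding, as described in the ViViT paper: a `Conv3d` whose kernel and stride match extracts non-overlapping spatiotemporal patches. This is illustrative only; the extra-credit implementation may differ:

```python
import torch
import torch.nn as nn

# Tubelet embedding: 2 frames x 16x16 pixels per token (assumed sizes)
tubelet = nn.Conv3d(in_channels=1, out_channels=192,
                    kernel_size=(2, 16, 16), stride=(2, 16, 16))

video = torch.randn(4, 1, 10, 64, 64)      # (B, C, T, H, W)
tokens = tubelet(video)                    # (4, 192, 5, 4, 4)
tokens = tokens.flatten(2).transpose(1, 2) # (4, 80, 192): 5*4*4 spatiotemporal tokens
```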
- Vision Transformer (ViT) Paper
- ViViT: A Video Vision Transformer
- KTH-Actions Dataset
- PyTorch Documentation
- TensorBoard
- Attention Is All You Need
If you found this project helpful, you can support my work by buying me a coffee or via PayPal!
This assignment demonstrates transformer-based architectures for video action recognition, exploring how Vision Transformers can be adapted for temporal sequence modeling.
