This assignment focuses on implementing Recurrent Neural Networks (RNNs) from scratch and applying them to action recognition tasks. The project implements custom LSTM and Convolutional LSTM cells, then compares them with PyTorch's built-in RNN modules on the KTH-Actions dataset for video action classification.
- Implement LSTM and ConvLSTM cells from scratch
- Build an action recognition pipeline using RNNs
- Compare different RNN architectures (LSTMCell, GRUCell, custom implementations)
- Evaluate models on accuracy, training/inference time, and parameter count
- Implement 3D-CNN (R(2+1)d-Net) for action classification (extra credit)
KTH-Actions Dataset - Human action recognition dataset
- Actions: walking, jogging, running, boxing, handwaving, handclapping
- Frame size: 64×64 pixels (grayscale)
- Sequence length: 10 frames per sample
- Split: Person IDs 0-16 for training, 17-25 for testing
- Source: KTH-Actions Dataset
The dataset is automatically loaded using the custom `KTHActionDataset` class in `src/dataloader.py`.
A fully custom LSTM implementation from scratch with the following components:
Architecture:
- Forget Gate: f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
- Input Gate: i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
- Candidate Gate: C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
- Output Gate: o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
- Cell State: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
- Hidden State: h_t = o_t ⊙ tanh(C_t)
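The gate equations above can be traced with a minimal, dependency-free sketch of a single LSTM step on scalar states (toy weights chosen purely for illustration, not the repository's OwnLSTM):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step for scalar states; W maps each gate to a
    (weight_h, weight_x) pair, mirroring W · [h_{t-1}, x_t] + b."""
    f_t = sigmoid(W['f'][0] * h_prev + W['f'][1] * x_t + b['f'])        # forget gate
    i_t = sigmoid(W['i'][0] * h_prev + W['i'][1] * x_t + b['i'])        # input gate
    c_tilde = math.tanh(W['c'][0] * h_prev + W['c'][1] * x_t + b['c'])  # candidate
    o_t = sigmoid(W['o'][0] * h_prev + W['o'][1] * x_t + b['o'])        # output gate
    c_t = f_t * c_prev + i_t * c_tilde  # cell state update
    h_t = o_t * math.tanh(c_t)          # hidden state
    return h_t, c_t

# Toy weights: every gate uses (0.5, 0.5) and zero bias
W = {g: (0.5, 0.5) for g in 'fico'}
b = {g: 0.0 for g in 'fico'}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:  # a short input sequence
    h, c = lstm_step(x, h, c, W, b)
```

Because h_t is a sigmoid times a tanh, the hidden state always stays in (-1, 1), while the cell state can accumulate beyond that range.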
Features:
- Xavier weight initialization
- Supports both single-step and sequence inputs
- Custom forward pass implementation
- Final linear layer for classification output
A convolutional variant of LSTM that preserves spatial information:
Architecture:
- Uses 1D convolutions instead of linear layers
- Maintains spatial dimensions through the sequence
- Separate convolutional layers for each gate (forget, input, candidate, output)
- Kernel size: 3 (default), with padding to preserve dimensions
Features:
- Processes spatial-temporal data efficiently
- Suitable for video sequences with spatial structure
- Custom implementation matching standard ConvLSTM formulation
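A ConvLSTM-style cell can be sketched as follows. This is a hypothetical illustration, not the repository's OwnConvLSTM: it fuses all four gates into a single Conv1d (the actual implementation uses separate per-gate layers, as described above), and the class name and shapes are assumptions.

```python
import torch
import torch.nn as nn

class TinyConvLSTMCell(nn.Module):
    """Illustrative ConvLSTM cell: one Conv1d computes all four gates
    from the concatenated input and hidden state."""
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        # padding preserves the spatial length, as in the description
        self.gates = nn.Conv1d(in_ch + hid_ch, 4 * hid_ch,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        z = self.gates(torch.cat([x, h], dim=1))
        i, f, o, g = torch.chunk(z, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c + i * g       # cell state update, as in the standard LSTM
        h = o * torch.tanh(c)   # hidden state keeps its spatial layout
        return h, c

cell = TinyConvLSTMCell(in_ch=1, hid_ch=8)
x = torch.randn(2, 1, 64)                       # (batch, channels, length)
h = torch.zeros(2, 8, 64); c = torch.zeros(2, 8, 64)
h, c = cell(x, (h, c))
```

Because the convolutions are padded, the hidden and cell states keep the input's spatial size at every step.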
A complete action recognition model with three main components:
- Option 1: Custom CNN encoder
  - 5 convolutional blocks with BatchNorm and GELU activation
  - Progressive channel expansion: 1 → 16 → 32 → 64 → 128 → emb_dim
  - Adaptive average pooling to a fixed size
- Option 2: Pretrained ResNet18 encoder
  - First layer modified for grayscale input (1 channel)
  - Feature extraction with projection to the embedding dimension
Supports multiple RNN architectures:
- LSTMCell: PyTorch's built-in LSTM cell
- GRUCell: PyTorch's built-in GRU cell
- OwnLSTM: Custom LSTM implementation
- OwnConvLSTM: Custom ConvLSTM implementation
Classification head:
- Conv1d layer for temporal feature extraction
- Adaptive average pooling
- Fully connected layer for final classification (6 classes)
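The three components can be wired together roughly as follows. This is a hypothetical sketch, not the repository's model class: the encoder is reduced to a single conv block, and only the layer sizes (emb_dim 128, hidden 128, 6 classes) follow the defaults listed in this README.

```python
import torch
import torch.nn as nn

class ActionRecognizer(nn.Module):
    """Encoder -> recurrent cell over time -> Conv1d head (sketch)."""
    def __init__(self, emb_dim=128, hidden=128, n_classes=6):
        super().__init__()
        self.encoder = nn.Sequential(              # stand-in CNN encoder
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, emb_dim))
        self.rnn = nn.LSTMCell(emb_dim, hidden)
        self.temporal = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.head = nn.Linear(hidden, n_classes)
        self.hidden = hidden

    def forward(self, frames):                     # (B, T, 1, H, W)
        B, T = frames.shape[:2]
        h = frames.new_zeros(B, self.hidden)
        c = frames.new_zeros(B, self.hidden)
        hs = []
        for t in range(T):                         # unroll over the sequence
            h, c = self.rnn(self.encoder(frames[:, t]), (h, c))
            hs.append(h)
        feats = torch.stack(hs, dim=2)             # (B, hidden, T)
        pooled = self.temporal(feats).mean(dim=2)  # temporal conv + avg pool
        return self.head(pooled)                   # (B, n_classes)

model = ActionRecognizer()
logits = model(torch.randn(2, 10, 1, 64, 64))      # 10 grayscale 64x64 frames
```

Swapping `nn.LSTMCell` for `nn.GRUCell` (which carries no cell state) or a custom cell is what distinguishes the experiments below.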
The project includes multiple experiments comparing different RNN architectures:
| Experiment | RNN Type | Pretrained Encoder | Scheduler | Description |
|---|---|---|---|---|
| LSTMCell | PyTorch LSTM | ❌ | ✅ | Baseline with PyTorch LSTM |
| LSTMCell_NoScheduler | PyTorch LSTM | ❌ | ❌ | LSTM without learning rate scheduling |
| GRUCell | PyTorch GRU | ❌ | ✅ | GRU-based model |
| GRUCell_NoScheduler | PyTorch GRU | ❌ | ❌ | GRU without scheduling |
| OwnLSTM | Custom LSTM | ❌ | ✅ | Custom LSTM implementation |
| LSTMCell_PretEncoder | PyTorch LSTM | ✅ | ❌ | LSTM with pretrained ResNet encoder |
| LSTMCell_PretEncoder_Scheduler | PyTorch LSTM | ✅ | ✅ | LSTM with pretrained encoder + scheduler |
All experiments use:
- Optimizer: Adam
- Learning rate: 0.001 (with optional scheduler)
- Batch size: 32
- Epochs: 50-100 (varies by experiment)
- Loss function: CrossEntropyLoss
- Embedding dimension: 128
- Hidden dimension: 128
- Number of layers: 2
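A minimal training-step sketch under these settings (Adam at lr 0.001, batch size 32, CrossEntropyLoss). The model here is a toy stand-in, and `StepLR` is an assumption for the optional scheduler, whose type the README does not name:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 6)            # toy stand-in for the action classifier
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
criterion = nn.CrossEntropyLoss()

x = torch.randn(32, 128)             # batch size 32
y = torch.randint(0, 6, (32,))       # 6 action classes

for epoch in range(2):               # real runs use 50-100 epochs
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()                 # skipped in the *_NoScheduler runs
```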
Spatial Augmentations:
- Random horizontal flip (p=0.5)
- Random rotation (±25 degrees)
Temporal Augmentations:
- Random temporal sampling (slicing step)
- Random temporal reversal (p=0.3)
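The two temporal augmentations reduce to slicing; a dependency-free sketch (a hypothetical helper, not the transforms in src/transformations.py):

```python
import random

def temporal_augment(frames, max_step=2, p_reverse=0.3, rng=random):
    """Randomly subsample the sequence with a slicing step,
    then reverse it with probability p_reverse."""
    step = rng.randint(1, max_step)   # random temporal sampling
    frames = frames[::step]
    if rng.random() < p_reverse:      # random temporal reversal
        frames = frames[::-1]
    return frames

random.seed(0)
clip = list(range(20))                # stand-in for 20 video frames
out = temporal_augment(clip)
```

Both operations leave the frame contents untouched; only the ordering and sampling rate change, which is why they improve generalization without distorting the actions themselves.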
- TensorBoard logging: Training/validation loss, accuracy, and learning rate curves
- Model checkpointing: Saves best models with training configurations
- Progress tracking: Real-time training progress with tqdm
- Evaluation metrics: Accuracy, per-class performance
- Experiment management: YAML configuration files for each experiment
- Seed management: Reproducible experiments
- Model evaluation: Comprehensive evaluation functions
- Visualization: Sequence visualization tools
- Data loading: Efficient dataset handling with proper train/test splits
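The per-class accuracy mentioned under evaluation metrics reduces to a small tally; a dependency-free sketch (hypothetical helper, not the repository's evaluation code):

```python
from collections import defaultdict

ACTIONS = ['walking', 'jogging', 'running', 'boxing', 'handwaving', 'handclapping']

def per_class_accuracy(y_true, y_pred):
    """Return accuracy per action class plus the overall accuracy."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    per_class = {ACTIONS[c]: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / len(y_true)
    return per_class, overall

# Toy labels: two walking clips (one misclassified), one jogging, one boxing
per_class, overall = per_class_accuracy([0, 0, 1, 3], [0, 1, 1, 3])
```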
Assignment3/
├── Assignment3.ipynb # Main assignment notebook
├── session3.ipynb # Lab session materials
├── src/
│ ├── models.py # Custom LSTM and ConvLSTM implementations
│ ├── dataloader.py # KTHActionDataset class
│ ├── transformations.py # Data augmentation transforms
│ ├── utils.py # Training and evaluation utilities
│ └── devel/
│ ├── task1.ipynb # Task 1 development notebook
│ ├── task2.ipynb # Task 2 development notebook
│ └── task3.ipynb # Task 3 (extra credit) notebook
├── data/
│ └── README.md # Dataset information
├── models/
│ └── README.md # Model checkpoints directory
├── tboard_logs/ # TensorBoard logs for all experiments
│ ├── LSTMCell/
│ ├── GRUCell/
│ ├── OwnLSTM/
│ └── ...
└── imgs/ # Visualization images and GIFs
├── pipeline.png
├── gif_*.gif
└── ...
The notebook includes comprehensive analysis:
- Learning curves: Training vs validation loss and accuracy over epochs
- Performance metrics: Overall and per-class accuracy
- Parameter count: Comparison of model sizes
- Training/inference time: Efficiency analysis
- Failure case analysis: Visualization of misclassified sequences
- GRU Performance: GRUCell achieved the best performance on the dataset
- LSTM vs GRU: GRU's simpler architecture (no cell state) can be more efficient while maintaining performance
- Custom Implementation: OwnLSTM showed competitive results, validating the implementation
- Pretrained Encoders: Using pretrained ResNet encoders improved feature extraction
- Learning Rate Scheduling: Schedulers helped stabilize training and improve convergence
- Temporal Augmentations: Effective for improving generalization
- Install dependencies: `pip install torch torchvision numpy matplotlib seaborn tqdm pyyaml tensorboard pillow`
- Download the KTH-Actions dataset:
  - Place the dataset in the appropriate directory, or
  - Modify the `root_dir` parameter in `KTHActionDataset`
- Open the notebook: `jupyter notebook Assignment3.ipynb`
- Run the experiments: execute the cells sequentially to:
  - Implement custom LSTM and ConvLSTM cells (Task 1)
  - Load and preprocess the KTH-Actions dataset
  - Train different RNN architectures (Task 2)
  - Evaluate and compare models
  - Visualize results
Launch TensorBoard with `tensorboard --logdir=tboard_logs`, then open http://localhost:6006 in your browser to view training curves for all experiments.
Loading a saved checkpoint:

```python
checkpoint = torch.load('models/experiment_name/checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
```

Using the custom modules:

```python
from src.models import OwnLSTM, ConvLSTMCell
from src.dataloader import KTHActionDataset
from src.transformations import get_train_transforms, get_test_transforms

# Initialize custom LSTM
lstm = OwnLSTM(input_size=128, hidden_size=128, output_size=128)

# Load dataset
train_dataset = KTHActionDataset(
    root_dir='path/to/kth_actions',
    split='train',
    transform=get_train_transforms(slicing_step=2),
    max_frames=10,
    img_size=(64, 64)
)
```

The project includes an implementation of R(2+1)d-Net for action recognition:
- Architecture: Factorized 3D convolutions (2D spatial + 1D temporal)
- Advantages: More efficient than full 3D convolutions while maintaining performance
- Comparison: Evaluated against RNN-based models
See src/devel/task3.ipynb for implementation details.
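The factorization can be sketched in a few lines: a full 3×3×3 convolution is replaced by a 1×3×3 spatial convolution followed by a 3×1×1 temporal one. This is an illustrative block, not the repository's R(2+1)d-Net; the class name and channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class R2Plus1dBlock(nn.Module):
    """Factorized 3D convolution: 2D spatial conv, then 1D temporal conv."""
    def __init__(self, in_ch, out_ch, mid_ch=16):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1))   # 2D conv applied per frame
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0))  # 1D conv over time
        self.act = nn.ReLU()

    def forward(self, x):                 # x: (B, C, T, H, W)
        return self.act(self.temporal(self.act(self.spatial(x))))

block = R2Plus1dBlock(1, 32)
y = block(torch.randn(2, 1, 10, 64, 64))  # 10 grayscale 64x64 frames
```

The extra nonlinearity between the two factors is one reason R(2+1)D blocks can outperform a single full 3D convolution at a similar parameter budget.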
- KTH-Actions Dataset
- Understanding LSTMs
- Convolutional LSTM Network
- R(2+1)D Networks
- PyTorch Documentation
- TensorBoard
Date: 18.05.2025
If you found this project helpful, you can support my work by buying me a coffee or via PayPal!
This assignment demonstrates deep understanding of recurrent neural networks, including custom implementations of LSTM and ConvLSTM cells, and their application to video action recognition tasks.
