Large-Scale AI Course Project

Project for the Large-Scale AI Engineering course at ETH Zurich. This repo contains custom implementations of DeltaNet and the Mamba SSM, with the Transformer as a baseline for comparison. We use distributed data parallel (DDP) training and gradient checkpointing to fully utilize the available compute, Weights & Biases to monitor and evaluate runs, and Hydra configurations to adapt hyperparameters and launch runs quickly from the command line.

Setup Instructions

Environment

We use uenv to manage the development environment on the cluster. To set it up, pull the image with the following command:

uenv image pull prgenv-gnu/25.6:v2

This uenv will be used when launching jobs on the Slurm cluster that require NVIDIA Nsight Systems tracing.
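A job script for such a traced run might look like the sketch below. This is illustrative only: the account and partition are taken from the srun example further down, the time limit and output name are placeholders, and the `--uenv`/`--view` directives assume the cluster's uenv Slurm plugin is available.

```shell
#!/bin/bash
#SBATCH --account=large-sc-2          # same account as the interactive example below
#SBATCH --partition=debug
#SBATCH --time=00:30:00               # placeholder time limit
#SBATCH --uenv=prgenv-gnu/25.6:v2     # requires the uenv Slurm plugin
#SBATCH --view=default

# Wrap the training run in Nsight Systems to capture a trace.
srun nsys profile -o trace uv run python -m project.train model=transformer
```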

You can also start an interactive shell within the uenv. The --view=default flag ensures that the default view is used, which includes necessary libraries and tools. For example:

uenv start --view=default prgenv-gnu/25.6:v2

Dependencies

Install uv:

curl -LsSf https://astral.sh/uv/install.sh | sh
# Restart your shell to have uv in your PATH

Create a virtual environment and install dependencies:

uv venv --python 3.13 # Create a new virtual environment

uv sync --all-extras # Install dependencies from pyproject.toml

Ensure that torch is installed with CUDA support (the following command should print True and 12.6):

uv run python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"

Pre-commit Hooks

We use pre-commit to ensure code quality and formatting. It will automatically run checks (like ruff for linting and formatting) before every commit. To install the pre-commit hooks:

uv run pre-commit install

To run the hooks manually on all files:

uv run pre-commit run --all-files

Usage

Interactive Session

For development, you can get an interactive session on the cluster with:

srun -A large-sc-2 --partition debug -t 1:00:00 --pty ${SHELL:-bash}

To run training, use the following command:

uv run python -m project.train model=transformer
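Since training uses DDP (as noted above), a multi-GPU run can be launched through `torchrun` instead of plain `python`. This is a sketch under the assumption that the training entry point initializes its process group from the standard torchrun environment variables:

```shell
# Launch 4 DDP processes on a single node; torchrun's -m flag runs
# project.train as a module, mirroring the single-process command above.
uv run torchrun --standalone --nproc_per_node=4 -m project.train model=transformer
```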

Available models:

  • deltanet
  • mamba
  • transformer (baseline)

You can override any parameter from the command line using Hydra syntax. For example:

# Override batch size
uv run python -m project.train model=transformer trainer.batch_size=32

# Override model architecture
uv run python -m project.train model=transformer model.n_layer=12

# Override multiple parameters
uv run python -m project.train model=transformer trainer.batch_size=32 trainer.learning_rate=1e-4

You can run hyperparameter sweeps using the --multirun (or -m) flag. This allows you to run the same code with different configurations. For example:

# Run with two different learning rates
uv run python -m project.train --multirun model=transformer trainer.learning_rate=1e-4,1e-5
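The same mechanism also sweeps over config groups, e.g. all three models from the list above, and sweeps combine as a Cartesian product:

```shell
# One sweep: 3 models x 2 learning rates = 6 runs
uv run python -m project.train --multirun model=deltanet,mamba,transformer trainer.learning_rate=1e-4,1e-5
```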

Reproducing Results

To compute all results, run:

bash slurm_all.sh
