Large-Scale AI Course Project

Project for the Large-Scale AI Engineering course at ETH Zurich. This repo contains custom implementations of DeltaNet and the Mamba SSM, with the Transformer as a baseline for comparison. We use distributed data parallel (DDP) training and gradient checkpointing to fully utilize the available compute, Weights & Biases to monitor and evaluate runs, and Hydra configurations to adapt hyperparameters and launch runs quickly from the command line.

Setup Instructions

Environment

We use uenv to manage the development environment on the cluster. To set it up, pull the image with the following command:

uenv image pull prgenv-gnu/25.6:v2

This uenv will be used when launching jobs on the Slurm cluster that require NVIDIA Nsight Systems tracing.
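A job script for such a traced run might look like the sketch below. This is illustrative only: the account and partition are taken from the srun example further down, the time limit and output name are placeholders, and the `--uenv`/`--view` directives assume the cluster's uenv Slurm plugin is available.

```shell
#!/bin/bash
#SBATCH --account=large-sc-2          # same account as the interactive example below
#SBATCH --partition=debug
#SBATCH --time=00:30:00               # placeholder time limit
#SBATCH --uenv=prgenv-gnu/25.6:v2     # requires the uenv Slurm plugin
#SBATCH --view=default

# Wrap the training run in Nsight Systems to capture a trace.
srun nsys profile -o trace uv run python -m project.train model=transformer
```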

You can also start an interactive shell within the uenv. The --view=default flag ensures that the default view is used, which includes necessary libraries and tools. For example:

uenv start --view=default prgenv-gnu/25.6:v2

Dependencies

Install uv:

curl -LsSf https://astral.sh/uv/install.sh | sh
# Restart your shell to have uv in your PATH

Create a virtual environment and install dependencies:

uv venv --python 3.13 # Create a new virtual environment

uv sync --all-extras # Install dependencies from pyproject.toml

Ensure that torch is installed with CUDA support (the following command should print True and 12.6):

uv run python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"

Pre-commit Hooks

We use pre-commit to ensure code quality and formatting. It will automatically run checks (like ruff for linting and formatting) before every commit. To install the pre-commit hooks:

uv run pre-commit install

To run the hooks manually on all files:

uv run pre-commit run --all-files

Usage

Interactive Session

For development, you can get an interactive session on the cluster with:

srun -A large-sc-2 --partition debug -t 1:00:00 --pty ${SHELL:-bash}

To run training, use the following command:

uv run python -m project.train model=transformer
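Since training uses DDP (as noted above), a multi-GPU run can be launched through `torchrun` instead of plain `python`. This is a sketch under the assumption that the training entry point initializes its process group from the standard torchrun environment variables:

```shell
# Launch 4 DDP processes on a single node; torchrun's -m flag runs
# project.train as a module, mirroring the single-process command above.
uv run torchrun --standalone --nproc_per_node=4 -m project.train model=transformer
```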

Available models:

  • deltanet
  • mamba
  • transformer (baseline)

You can override any parameter from the command line using Hydra syntax. For example:

# Override batch size
uv run python -m project.train model=transformer trainer.batch_size=32

# Override model architecture
uv run python -m project.train model=transformer model.n_layer=12

# Override multiple parameters
uv run python -m project.train model=transformer trainer.batch_size=32 trainer.learning_rate=1e-4

You can run hyperparameter sweeps using the --multirun (or -m) flag. This allows you to run the same code with different configurations. For example:

# Run with two different learning rates
uv run python -m project.train --multirun model=transformer trainer.learning_rate=1e-4,1e-5
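The same mechanism also sweeps over config groups, e.g. all three models from the list above, and sweeps combine as a Cartesian product:

```shell
# One sweep: 3 models x 2 learning rates = 6 runs
uv run python -m project.train --multirun model=deltanet,mamba,transformer trainer.learning_rate=1e-4,1e-5
```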

Reproducing Results

To compute all results, run:

bash slurm_all.sh
