From b53ce2aa4c3760b0377ccefecb1b34cd198d3b53 Mon Sep 17 00:00:00 2001
From: tykoo-chen
Date: Wed, 11 Mar 2026 18:53:43 +0000
Subject: [PATCH] docs: add Troubleshooting section to README

Common issues and solutions for new users:
- No CUDA GPU detected
- CUDA out of memory
- kernels/Flash Attention errors
- Loss not decreasing
- Script hangs at startup
---
 README.md | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/README.md b/README.md
index 2bc30516..0ac8c8a9 100644
--- a/README.md
+++ b/README.md
@@ -86,6 +86,30 @@ I think these would be the reasonable hyperparameters to play with. Ask your fav
 
 - [trevin-creator/autoresearch-mlx](https://github.com/trevin-creator/autoresearch-mlx) (MacOS)
 - [jsegov/autoresearch-win-rtx](https://github.com/jsegov/autoresearch-win-rtx) (Windows)
+
+## Troubleshooting
+
+**"No CUDA GPU detected"**
+- This script requires an NVIDIA GPU with CUDA support
+- Mac users: see the MLX forks in the Notable forks section
+- Windows/Linux: ensure CUDA drivers are installed (`nvidia-smi` should work)
+
+**"CUDA out of memory"**
+- Reduce `DEVICE_BATCH_SIZE` in `train.py` (e.g., from 128 to 64)
+- For GPUs with <40GB VRAM, also consider reducing `DEPTH`
+
+**"kernels module not found" or Flash Attention errors**
+- Ensure you're using `uv run` (not plain `python`)
+- Try `uv sync --reinstall` to rebuild dependencies
+
+**Training runs but loss doesn't decrease**
+- This is expected for the first ~10 steps (warmup)
+- If loss stays flat after step 50+, the experiment may need different hyperparameters
+
+**Script hangs at startup**
+- The first run compiles the model with `torch.compile`, which can take 1-2 minutes
+- Subsequent runs should start faster
+
 ## License
 
 MIT
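
A note for reviewers: the first two troubleshooting items could be folded into one diagnostic snippet. This is only a sketch, not part of the patch; `suggest_device_batch_size` is a hypothetical helper, and the 40 GB cutoff and 128→64 halving are just the rule of thumb the section states.

```python
# gpu_check.py -- hypothetical diagnostic sketch for the two GPU items above.
# The torch import is optional so the script degrades gracefully when
# dependencies have not been synced yet.

def suggest_device_batch_size(vram_gb: float, default: int = 128) -> int:
    """Rule of thumb from the Troubleshooting section: halve
    DEVICE_BATCH_SIZE on GPUs with less than 40 GB of VRAM."""
    return default // 2 if vram_gb < 40 else default

if __name__ == "__main__":
    try:
        import torch
        if torch.cuda.is_available():
            props = torch.cuda.get_device_properties(0)
            vram_gb = props.total_memory / 1024 ** 3
            print(f"{props.name}: {vram_gb:.0f} GB VRAM, "
                  f"try DEVICE_BATCH_SIZE={suggest_device_batch_size(vram_gb)}")
        else:
            print("No CUDA GPU detected -- check drivers with nvidia-smi")
    except ImportError:
        print("PyTorch not installed -- run `uv sync` first")
```

Happy to drop this if it is out of scope for a docs-only change.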