Practical onboarding for the Nelli bioinformatics team. This repository gives new joiners a curated walk-through of the core metagenomics workflow—read QC, assembly, binning, clustering, and phylogenetics—using tiny, reproducible datasets and a single environment definition.
The goal: make it possible to land on a fresh laptop and, within an afternoon, understand what we do, why each step matters, and how to reproduce it.
- Install pixi and Docker (BBTools helpers auto-fallback to a container when no native binaries are found). Rootless Apptainer also works.
- Fetch the sandbox datasets and regenerate the mock reads (progress and file sizes are printed): `bash scripts/setup_test_data.sh`
- Solve the environment exactly as defined in `pixi.toml`: `pixi install`
- Run the interactive workflow walk-through: `pixi run demo` (press Enter between stages unless you export `DEMO_AUTO=1`; adjust threads with `DEMO_THREADS`)
Every command is verbose and will stop with a clear error message if a dependency is missing.
pixi run demo orchestrates the entire QC→assembly→binning→clustering→phylogeny toy pipeline in a single, reproducible command.
```mermaid
flowchart LR
    setup[Setup test data<br/>scripts/setup_test_data.sh] --> env[pixi install]
    env --> demo[[pixi run demo]]
    demo --> qc[Read QC]
    qc --> assembly[Assembly]
    assembly --> binning[Binning]
    binning --> clustering[Clustering]
    clustering --> phylo[Phylogenetics]
    phylo --> review[Results & review]
```
pixi run demo runs scripts/demo_all.sh, printing the exact command for each stage, the rationale, the container/native backend in use, and a post-run summary (contig counts, ortholog headers, etc.). The walkthrough defaults to 16 threads—override with DEMO_THREADS—and is interactive by default; export DEMO_AUTO=1 to replay it non-interactively. Reruns overwrite the deterministic artefacts under results/.
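The stage-announcement and pause/auto behaviour described above can be sketched in Python. This is an illustrative re-implementation, not the actual `scripts/demo_all.sh`; the stage names, commands, and flags below are placeholders:

```python
import os
import sys

def run_stage(name, cmd, rationale):
    """Announce a stage the way the driver does, then pause unless DEMO_AUTO=1."""
    print(f"[{name}] {rationale}")
    print(f"  $ {cmd}")
    # interactive by default; DEMO_AUTO=1 (or a non-interactive stdin) skips the pause
    if os.environ.get("DEMO_AUTO") != "1" and sys.stdin.isatty():
        input("Press Enter for the next stage...")

threads = os.environ.get("DEMO_THREADS", "16")  # the walkthrough defaults to 16 threads
run_stage("qc", f"bbduk.sh threads={threads} in=raw.fq out=clean.fq",
          "trim adapters and low-quality bases")
run_stage("assembly", f"spades.py --meta -t {threads} -o results/assembly",
          "assemble cleaned reads into contigs")
```

The same pattern extends to the remaining stages; the real driver additionally reports the container/native backend and a post-run summary.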
- pixi (recommended) — provides Python plus the CLI tools we rely on (SPAdes, MetaBAT2, MAFFT, IQ-TREE, MMseqs2, Pyrodigal, etc.). The manifest lives in `pixi.toml` and mirrors `envs/pixi.toml`.
- Docker/Apptainer — `scripts/run_bbtools.sh` transparently mounts the repo and launches the `bryce911/bbtools:latest` image when `bbmap.sh`, `bbduk.sh`, `randomreads.sh`, etc. are absent locally.
- uv — fastest route to the Python packages and notebooks in `notebooks/`. Use it when you do not need heavy CLI tools: `uv sync -p 3.11 && uv run jupyter lab`.
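The native-vs-container fallback that `scripts/run_bbtools.sh` implements can be sketched in Python. Only the image name comes from this repo; the function name, mount layout, and working-directory convention are illustrative assumptions:

```python
import os
import shutil

def bbtools_command(tool, *args):
    """Prefer a native BBTools binary on PATH; otherwise wrap the call in the container image."""
    if shutil.which(tool):
        return [tool, *args]
    # fall back to Docker, mounting the repo so input/output paths resolve inside the container
    return ["docker", "run", "--rm",
            "-v", f"{os.getcwd()}:/work", "-w", "/work",
            "bryce911/bbtools:latest", tool, *args]

print(bbtools_command("bbduk.sh", "in=reads.fq", "out=clean.fq"))
```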
Follow docs/onboarding.md for a guided tour. It covers:
- Orientation and workflow diagram
- Data standards (FASTA headers, metadata expectations)
- Hands-on labs for QC, assembly, binning, clustering, tree building, and taxonomy
- How to interpret outputs and decide the next action at each step
Each topic links to deeper reference notes (for example docs/04_assembly.md for assembly details) and the companion scripts in scripts/ that implement the reproducible commands.
- `docs/` — onboarding guide plus topic deep dives.
- `scripts/` — small, composable shell wrappers for every workflow step.
- `python/` — Python helpers, e.g. FASTA header normalization.
- `notebooks/` — exploration notebooks that mirror the scripted pipeline.
- `data/` — tiny example genomes, FASTQs, and download scripts.
- `results/` — automatically created; holds QC, assembly, binning, clustering, annotation, and phylogeny outputs; safe to delete.
- `envs/` — alternative environment manifests (`pixi.toml`, `pyproject.toml`).
```
results/
  qc/clean_R{1,2}.fq.gz           # trimmed reads from bbduk
  mapping/{mapped.bam,depth.txt}
  assembly/spades/contigs.fasta
  binning/quickbin/*.fna
  annotations/quickbin/*.faa      # per-bin proteins from Pyrodigal
  phylogeny/quickbin/             # ortholog FASTA, alignment, IQ-TREE outputs
  clustering/mmseqs2/clusters.tsv
  taxonomy/quickclade*.tsv        # per-contig assignments + per-bin consensus
  reference/example.norm.faa      # header-normalised reference proteins
```
`make clean` removes the entire tree; `make clean-hard` also wipes cached downloads and mock reads under `data/`.
- `data/examples/example.fna` bundles three bacterial genomes, each split over multiple contigs. FASTA headers use the pattern `GENOME_ID|CONTIG_ID`. The interactive demo previews the genome→contig mapping before the read simulation step.
- `scripts/make_mock_reads.sh` uses BBTools `randomreads.sh` with 150 bp paired reads, 50× coverage, and mild SNP/indel noise to mimic a mini metagenome.
- `data/examples/example.faa` contains the corresponding protein-coding sequences; `scripts/normalize_headers.py` remaps headers to the `GENOME|PRODUCT` scheme for clustering demos.
- `scripts/setup_test_data.sh` runs `data/downloads.sh`, which grabs a <50 MB CAMI II toy pair and logs the on-disk sizes so you know the download footprint.
- QuickBin-derived ortholog alignments and MAG phylogenies are written to `results/phylogeny/quickbin/`.
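The header rewrite that `scripts/normalize_headers.py` performs can be sketched like this. It is an illustrative re-implementation, not the script itself; the sanitisation regex and the rule of keeping only the first whitespace-delimited token are assumptions:

```python
import re

def normalize_headers(lines, genome_id):
    """Rewrite FASTA headers to the GENOME|ID pattern; sequence lines pass through unchanged."""
    for line in lines:
        if line.startswith(">"):
            contig_id = line[1:].split()[0]  # drop any description after the first token
            # assumed rule: replace characters that commonly break downstream tools
            contig_id = re.sub(r"[^A-Za-z0-9_.-]", "_", contig_id)
            yield f">{genome_id}|{contig_id}"
        else:
            yield line

fasta = [">contig_1 len=5000", "ACGTACGT", ">contig.2|v1", "GGCCGGCC"]
print(list(normalize_headers(fasta, "SAMPLE1")))
```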
Consistent FASTA headers keep downstream tooling happy. Use `scripts/normalize_headers.py` to enforce the `>GENOME|ID` pattern:
```
python scripts/normalize_headers.py -i input.fna -o normalized.fna --genome-id SAMPLE1
python scripts/normalize_headers.py -i input.faa -o normalized.faa --genome-id SAMPLE1
```

- Activate the environment and launch Jupyter Lab without opening a browser on the server:
```
pixi run nb -- --no-browser --ip=127.0.0.1 --port=8888
```
- From your laptop, create an SSH tunnel to the server:
```
ssh -N -L 8888:localhost:8888 <user>@<server>
```
- Open http://localhost:8888 in your local browser, paste the token shown in the server log, and select the `Python 3 (pixi)` kernel.
- Recommended notebooks under `notebooks/`:
  - `01_qc_and_stats.ipynb` – inspect read/coverage summaries after `pixi run demo`.
  - `02_downstream_ml.ipynb` – build a simple classifier from GC + coverage features.
- Run `pixi run demo` end-to-end to prove your setup and capture notes on the workflow.
- Walk through the annotated sections in `docs/onboarding.md` with the scripts and notebooks open. Capture notes on coverage profiles, contig statistics, bin composition, and tree topology.
- Use the Makefile targets (`make demo`, `make assemble-spades`, …) to practice running individual stages or to rerun from the middle of the workflow.
- Run `bash scripts/grade_bins.sh` to capture BBTools GradeBins completeness/contamination estimates in `results/binning/quickbin/gradebins.txt`.
- Run `make phylo-quickbin` (a wrapper around `scripts/phylo_quickbin.sh`) to regenerate proteins, the ortholog alignment, and the MAG tree in `results/phylogeny/quickbin/tree.treefile`.
- Pair with a teammate on a real dataset once you are comfortable with the mock runs, and update these docs with any gaps you discover.
If you spot a gap or need a new tutorial, open an issue or drop your notes in docs/—onboarding is a living document.