Practical onboarding for the Nelli bioinformatics team. This repository gives new joiners a curated walk-through of the core metagenomics workflow—read QC, assembly, binning, clustering, and phylogenetics—using tiny, reproducible datasets and a single environment definition.
The goal: make it possible to land on a fresh laptop and, within an afternoon, understand what we do, why each step matters, and how to reproduce it.
- Install pixi and Docker (BBTools helpers auto-fallback to a container when no native binaries are found). Rootless Apptainer also works.
- Fetch the sandbox datasets and regenerate the mock reads (progress and file sizes are printed): `bash scripts/setup_test_data.sh`
- Solve the environment exactly as defined in `pixi.toml`: `pixi install`
- Run the interactive workflow walk-through: `pixi run demo` (press Enter between stages unless you export `DEMO_AUTO=1`; adjust threads with `DEMO_THREADS`)
Every command is verbose and will stop with a clear error message if a dependency is missing.
pixi run demo orchestrates the entire QC→assembly→binning→clustering→phylogeny toy pipeline in a single, reproducible command.
```mermaid
flowchart LR
    setup[Setup test data<br/>scripts/setup_test_data.sh] --> env[pixi install]
    env --> demo[[pixi run demo]]
    demo --> qc[Read QC]
    qc --> assembly[Assembly]
    assembly --> binning[Binning]
    binning --> clustering[Clustering]
    clustering --> phylo[Phylogenetics]
    phylo --> review[Results & review]
```
pixi run demo runs scripts/demo_all.sh, printing the exact command for each stage, the rationale, the container/native backend in use, and a post-run summary (contig counts, ortholog headers, etc.). The walkthrough defaults to 16 threads—override with DEMO_THREADS—and is interactive by default; export DEMO_AUTO=1 to replay it non-interactively. Reruns overwrite the deterministic artefacts under results/.
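The stage-announcement and pause/auto behaviour described above can be sketched in Python. This is an illustrative re-implementation, not the actual `scripts/demo_all.sh`; the stage names, commands, and flags below are placeholders:

```python
import os
import sys

def run_stage(name, cmd, rationale):
    """Announce a stage the way the driver does, then pause unless DEMO_AUTO=1."""
    print(f"[{name}] {rationale}")
    print(f"  $ {cmd}")
    # interactive by default; DEMO_AUTO=1 (or a non-interactive stdin) skips the pause
    if os.environ.get("DEMO_AUTO") != "1" and sys.stdin.isatty():
        input("Press Enter for the next stage...")

threads = os.environ.get("DEMO_THREADS", "16")  # the walkthrough defaults to 16 threads
run_stage("qc", f"bbduk.sh threads={threads} in=raw.fq out=clean.fq",
          "trim adapters and low-quality bases")
run_stage("assembly", f"spades.py --meta -t {threads} -o results/assembly",
          "assemble cleaned reads into contigs")
```

The same pattern extends to the remaining stages; the real driver additionally reports the container/native backend and a post-run summary.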
- pixi (recommended) — provides Python plus the CLI tools we rely on (SPAdes, MetaBAT2, MAFFT, IQ-TREE, MMseqs2, Pyrodigal, etc.). The manifest lives in `pixi.toml` and mirrors `envs/pixi.toml`.
- Docker/Apptainer — `scripts/run_bbtools.sh` transparently mounts the repo and launches the `bryce911/bbtools:latest` image when `bbmap.sh`, `bbduk.sh`, `randomreads.sh`, etc. are absent locally.
- uv — fastest route to the Python packages and notebooks in `notebooks/`. Use it when you do not need heavy CLI tools: `uv sync -p 3.11 && uv run jupyter lab`.
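The native-vs-container fallback that `scripts/run_bbtools.sh` implements can be sketched in Python. Only the image name comes from this repo; the function name, mount layout, and working-directory convention are illustrative assumptions:

```python
import os
import shutil

def bbtools_command(tool, *args):
    """Prefer a native BBTools binary on PATH; otherwise wrap the call in the container image."""
    if shutil.which(tool):
        return [tool, *args]
    # fall back to Docker, mounting the repo so input/output paths resolve inside the container
    return ["docker", "run", "--rm",
            "-v", f"{os.getcwd()}:/work", "-w", "/work",
            "bryce911/bbtools:latest", tool, *args]

print(bbtools_command("bbduk.sh", "in=reads.fq", "out=clean.fq"))
```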
Follow docs/onboarding.md for a guided tour. It covers:
- Orientation and workflow diagram
- Data standards (FASTA headers, metadata expectations)
- Hands-on labs for QC, assembly, binning, clustering, tree building, and taxonomy
- How to interpret outputs and decide the next action at each step
Each topic links to deeper reference notes (for example docs/04_assembly.md for assembly details) and the companion scripts in scripts/ that implement the reproducible commands.
- `docs/` — onboarding guide plus topic deep dives.
- `scripts/` — small, composable shell wrappers for every workflow step.
- `python/` — Python helpers, e.g. FASTA header normalization.
- `notebooks/` — exploration notebooks that mirror the scripted pipeline.
- `data/` — tiny example genomes, FASTQs, and download scripts.
- `results/` — automatically created; holds QC, assembly, binning, clustering, annotation, and phylogeny outputs; safe to delete.
- `envs/` — alternative environment manifests (`pixi.toml`, `pyproject.toml`).
```
results/
  qc/clean_R{1,2}.fq.gz           # trimmed reads from bbduk
  mapping/{mapped.bam,depth.txt}
  assembly/spades/contigs.fasta
  binning/quickbin/*.fna
  annotations/quickbin/*.faa      # per-bin proteins from Pyrodigal
  phylogeny/quickbin/             # ortholog FASTA, alignment, IQ-TREE outputs
  clustering/mmseqs2/clusters.tsv
  taxonomy/quickclade*.tsv        # per-contig assignments + per-bin consensus
  reference/example.norm.faa      # header-normalised reference proteins
```
`make clean` removes the entire tree; `make clean-hard` also wipes cached downloads and mock reads under `data/`.
- `data/examples/example.fna` bundles three bacterial genomes, each split over multiple contigs. FASTA headers use the pattern `GENOME_ID|CONTIG_ID`. The interactive demo previews the genome→contig mapping before the read simulation step.
- `scripts/make_mock_reads.sh` uses BBTools `randomreads.sh` with 150 bp paired reads, 50× coverage, and mild SNP/indel noise to mimic a mini metagenome.
- `data/examples/example.faa` contains the corresponding protein-coding sequences; `scripts/normalize_headers.py` remaps headers to the `GENOME|PRODUCT` scheme for clustering demos.
- `scripts/setup_test_data.sh` runs `data/downloads.sh`, which grabs a <50 MB CAMI II toy pair and logs the on-disk sizes so you know the download footprint.
- QuickBin-derived ortholog alignments and MAG phylogenies are written to `results/phylogeny/quickbin/`.
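The header rewrite that `scripts/normalize_headers.py` performs can be sketched like this. It is an illustrative re-implementation, not the script itself; the sanitisation regex and the rule of keeping only the first whitespace-delimited token are assumptions:

```python
import re

def normalize_headers(lines, genome_id):
    """Rewrite FASTA headers to the GENOME|ID pattern; sequence lines pass through unchanged."""
    for line in lines:
        if line.startswith(">"):
            contig_id = line[1:].split()[0]  # drop any description after the first token
            # assumed rule: replace characters that commonly break downstream tools
            contig_id = re.sub(r"[^A-Za-z0-9_.-]", "_", contig_id)
            yield f">{genome_id}|{contig_id}"
        else:
            yield line

fasta = [">contig_1 len=5000", "ACGTACGT", ">contig.2|v1", "GGCCGGCC"]
print(list(normalize_headers(fasta, "SAMPLE1")))
```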
Consistent FASTA headers keep downstream tooling happy. Use `scripts/normalize_headers.py` to enforce the `>GENOME|ID` pattern:
```
python scripts/normalize_headers.py -i input.fna -o normalized.fna --genome-id SAMPLE1
python scripts/normalize_headers.py -i input.faa -o normalized.faa --genome-id SAMPLE1
```

- Activate the environment and launch Jupyter Lab without opening a browser on the server:
```
pixi run nb -- --no-browser --ip=127.0.0.1 --port=8888
```
- From your laptop, create an SSH tunnel to the server:
```
ssh -N -L 8888:localhost:8888 <user>@<server>
```
- Open http://localhost:8888 in your local browser, paste the token shown in the server log, and select the `Python 3 (pixi)` kernel.
- Recommended notebooks under `notebooks/`:
  - `01_qc_and_stats.ipynb` – inspect read/coverage summaries after `pixi run demo`.
  - `02_downstream_ml.ipynb` – build a simple classifier from GC + coverage features.
- Run `pixi run demo` end-to-end to prove your setup and capture notes on the workflow.
- Walk through the annotated sections in `docs/onboarding.md` with the scripts and notebooks open. Capture notes on coverage profiles, contig statistics, bin composition, and tree topology.
- Use the Makefile targets (`make demo`, `make assemble-spades`, …) to practice running individual stages or to rerun from the middle of the workflow.
- Run `bash scripts/grade_bins.sh` to capture BBTools GradeBins completeness/contamination estimates in `results/binning/quickbin/gradebins.txt`.
- Run `make phylo-quickbin` (a wrapper around `scripts/phylo_quickbin.sh`) to regenerate proteins, the ortholog alignment, and the MAG tree in `results/phylogeny/quickbin/tree.treefile`.
- Pair with a teammate on a real dataset once you are comfortable with the mock runs, and update these docs with any gaps you discover.
If you spot a gap or need a new tutorial, open an issue or drop your notes in docs/—onboarding is a living document.