nelli-getting-started

Practical onboarding for the NeLLi bioinformatics team. This repository gives new joiners a curated walk-through of the core metagenomics workflow—read QC, assembly, binning, clustering, and phylogenetics—using tiny, reproducible datasets and a single environment definition.

The goal: make it possible to start from a fresh laptop and, within an afternoon, understand what we do, why each step matters, and how to reproduce it.

Quick Start Checklist (≈10 min)

  1. Install pixi and Docker (BBTools helpers auto-fallback to a container when no native binaries are found). Rootless Apptainer also works.
  2. Fetch the sandbox datasets and regenerate the mock reads (progress + file sizes are printed): bash scripts/setup_test_data.sh
  3. Solve the environment exactly as defined in pixi.toml: pixi install
  4. Run the interactive workflow walk-through: pixi run demo (press Enter between stages unless you export DEMO_AUTO=1; adjust threads with DEMO_THREADS)

Every command is verbose and will stop with a clear error message if a dependency is missing.
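Before step 1, you can probe for the prerequisites yourself. This helper is purely illustrative (it is not one of the repository's scripts); it only checks whether pixi and Docker are on PATH:

```shell
# Illustrative prerequisite probe (not part of this repo's scripts):
# report whether pixi and docker are available before running the quick start.
for tool in pixi docker; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "ok: $tool found"
  else
    echo "missing: $tool (install it before continuing)"
  fi
done
```

If Docker is missing but rootless Apptainer is installed, the BBTools wrapper can still fall back to a container as described above.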

pixi run demo orchestrates the entire QC→assembly→binning→clustering→phylogeny toy pipeline in a single, reproducible command.

Workflow at a Glance

flowchart LR
    setup[Setup test data<br/>scripts/setup_test_data.sh] --> env[pixi install]
    env --> demo[[pixi run demo]]
    demo --> qc[Read QC]
    qc --> assembly[Assembly]
    assembly --> binning[Binning]
    binning --> clustering[Clustering]
    clustering --> phylo[Phylogenetics]
    phylo --> review[Results & review]

Interactive Demo

pixi run demo runs scripts/demo_all.sh, printing the exact command for each stage, the rationale, the container/native backend in use, and a post-run summary (contig counts, ortholog headers, etc.). The walkthrough defaults to 16 threads—override with DEMO_THREADS—and is interactive by default; export DEMO_AUTO=1 to replay it non-interactively. Reruns overwrite the deterministic artefacts under results/.

Environment Options

  • pixi (recommended) — provides Python plus the CLI tools we rely on (SPAdes, MetaBAT2, MAFFT, IQ-TREE, MMseqs2, Pyrodigal, etc.). The manifest lives in pixi.toml and mirrors envs/pixi.toml.
  • Docker/Apptainer — scripts/run_bbtools.sh transparently mounts the repo and launches the bryce911/bbtools:latest image when bbmap.sh, bbduk.sh, randomreads.sh, etc. are absent locally.
  • uv — fastest route to the Python packages and notebooks in notebooks/. Use it when you do not need heavy CLI tools: uv sync -p 3.11 && uv run jupyter lab.

Learning Roadmap

Follow docs/onboarding.md for a guided tour. It covers:

  • Orientation and workflow diagram
  • Data standards (FASTA headers, metadata expectations)
  • Hands-on labs for QC, assembly, binning, clustering, tree building, and taxonomy
  • How to interpret outputs and decide the next action at each step

Each topic links to deeper reference notes (for example docs/04_assembly.md for assembly details) and the companion scripts in scripts/ that implement the reproducible commands.

Repository Layout

  • docs/ — onboarding guide plus topic deep dives.
  • scripts/ — small, composable shell wrappers for every workflow step.
  • python/ — Python helpers, e.g. FASTA header normalization.
  • notebooks/ — exploration notebooks that mirror the scripted pipeline.
  • data/ — tiny example genomes, FASTQs, and download scripts.
  • results/ — automatically created; holds QC, assembly, binning, clustering, annotations, and phylogeny outputs; safe to delete.
  • envs/ — alternative environment manifests (pixi.toml, pyproject.toml).

Results Layout

results/
  qc/clean_R{1,2}.fq.gz         # trimmed reads from bbduk
  mapping/{mapped.bam,depth.txt}
  assembly/spades/contigs.fasta
  binning/quickbin/*.fna
  annotations/quickbin/*.faa    # per-bin proteins from Pyrodigal
  phylogeny/quickbin/           # ortholog FASTA, alignment, IQ-TREE outputs
  clustering/mmseqs2/clusters.tsv
  taxonomy/quickclade*.tsv      # per-contig assignments + per-bin consensus
  reference/example.norm.faa    # header-normalised reference proteins

make clean removes the entire tree; make clean-hard also wipes cached downloads and mock reads under data/.
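The clustering/mmseqs2/clusters.tsv file follows the standard MMseqs2 cluster TSV layout: one representative-member pair per line, tab-separated. A minimal parser sketch (the function name and grouping choice are ours, not the repo's):

```python
from collections import defaultdict

def read_mmseqs_clusters(tsv_lines):
    """Group MMseqs2 cluster TSV lines ('representative<TAB>member')
    into {representative: [members]}."""
    clusters = defaultdict(list)
    for line in tsv_lines:
        rep, member = line.rstrip("\n").split("\t")
        clusters[rep].append(member)
    return dict(clusters)
```

Note that MMseqs2 lists each representative as a member of its own cluster, so singleton clusters appear as one self-referential line.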

Toy Dataset

  • data/examples/example.fna bundles three bacterial genomes, each split over multiple contigs. FASTA headers use the pattern GENOME_ID|CONTIG_ID. The interactive demo previews the genome→contig mapping before the read simulation step.
  • scripts/make_mock_reads.sh uses BBTools randomreads.sh with 150 bp paired reads, coverage 50×, and mild SNP/indel noise to mimic a mini metagenome.
  • data/examples/example.faa contains the corresponding protein-coding sequences; scripts/normalize_headers.py remaps headers to the GENOME|PRODUCT scheme for clustering demos.
  • scripts/setup_test_data.sh runs data/downloads.sh, which grabs a <50 MB CAMI II toy pair and logs the on-disk sizes so you know the download footprint.
  • QuickBin-derived ortholog alignments and MAG phylogenies are written to results/phylogeny/quickbin/.
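Because headers follow the GENOME_ID|CONTIG_ID pattern, the genome-to-contig preview the demo prints can be reproduced in a few lines of Python. This is a sketch of the idea; the demo's own implementation may differ:

```python
from collections import defaultdict

def genome_contig_map(fasta_lines):
    """Group contig IDs by genome, assuming '>GENOME_ID|CONTIG_ID' headers."""
    mapping = defaultdict(list)
    for line in fasta_lines:
        if line.startswith(">"):
            genome, _, contig = line[1:].strip().partition("|")
            mapping[genome].append(contig)
    return dict(mapping)
```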

Header Normalization Utility

Consistent FASTA headers keep downstream tooling happy. Use scripts/normalize_headers.py to enforce the >GENOME|ID pattern:

python scripts/normalize_headers.py -i input.fna -o normalized.fna --genome-id SAMPLE1
python scripts/normalize_headers.py -i input.faa -o normalized.faa --genome-id SAMPLE1
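The core of the pattern enforcement can be sketched in a few lines of Python. This is illustrative only; the real logic lives in scripts/normalize_headers.py, which also handles file I/O and the --genome-id flag:

```python
def normalize_headers(lines, genome_id):
    """Rewrite FASTA headers to '>GENOME|ID'; sequence lines pass through.
    Illustrative sketch, not the actual scripts/normalize_headers.py."""
    out = []
    for line in lines:
        if line.startswith(">"):
            # Keep only the first whitespace-delimited token of the original ID.
            original_id = line[1:].split()[0]
            out.append(f">{genome_id}|{original_id}")
        else:
            out.append(line)
    return out
```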

Running Notebooks (Remote-Friendly)

  • Activate the environment and launch Jupyter Lab without opening a browser on the server:
    pixi run nb -- --no-browser --ip=127.0.0.1 --port=8888
  • From your laptop, create an SSH tunnel to the server:
    ssh -N -L 8888:localhost:8888 <user>@<server>
  • Open http://localhost:8888 in your local browser, paste the token shown in the server log, and select the Python 3 (pixi) kernel.
  • Recommended notebooks under notebooks/:
    1. 01_qc_and_stats.ipynb – inspect read/coverage summaries after pixi run demo.
    2. 02_downstream_ml.ipynb – build a simple classifier from GC + coverage features.
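The GC feature behind the classifier notebook reduces to a simple per-contig computation. A hedged sketch (the notebook's exact feature engineering may differ):

```python
def gc_fraction(seq):
    """Fraction of G/C bases in a nucleotide sequence (case-insensitive)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def feature_row(seq, mean_depth):
    """Two-feature vector per contig: GC content plus mean read depth."""
    return [gc_fraction(seq), mean_depth]
```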

Where to Go Next

  1. Run pixi run demo end-to-end to prove your setup and capture notes on the workflow.
  2. Walk through the annotated sections in docs/onboarding.md with the scripts and notebooks open. Capture notes on coverage profiles, contig statistics, bin composition, and tree topology.
  3. Use the Makefile targets (make demo, make assemble-spades, …) to practice running individual stages or to rerun from the middle of the workflow.
  4. Run bash scripts/grade_bins.sh to capture BBTools GradeBins completeness/contamination estimates in results/binning/quickbin/gradebins.txt.
  5. Run make phylo-quickbin (wrapper around scripts/phylo_quickbin.sh) to regenerate proteins, ortholog alignment, and the MAG tree in results/phylogeny/quickbin/tree.treefile.
  6. Pair with a teammate on a real dataset once you are comfortable with the mock runs—and update these docs with any gaps you discover.

If you spot a gap or need a new tutorial, open an issue or drop your notes in docs/ — onboarding is a living document.
