CMU × NVIDIA Hackathon — January 7–9, 2026
Pangenomes offer a powerful alternative to single-reference genomes by representing genetic variation across many individuals. However, current pangenome graphs typically include only tens of samples, while modern biobanks (UK Biobank, All of Us, HPRC) now contain thousands of long-read genomes spanning diverse populations.
A major barrier remains:
There is no practical way for independent biobanks to collaboratively construct population-scale pangenomes without sharing raw genomic data.
This leads to:
- Silos across cohorts
- Persistent population bias
- Limited downstream interpretability
OmniGenome explores whether federated learning principles can be applied to pangenome construction and genetic association analysis.
Instead of sharing raw FASTA files, participating sites:
- Build local pangenome graphs
- Share graph representations only
- Aggregate information per chromosome
We demonstrate this idea through:
- Federated pangenome graph construction (HPRC)
- Genomic background hashing for phenotype association at the APOE locus
This quick start runs a single local PGGB job to demonstrate the core building block used throughout the project.
Requirements:
- Docker (with Compose)
- ~8 GB RAM
- bgzip-compatible FASTA input
# Start Docker
systemctl start docker
# Build containers
docker compose build
# Run PGGB on an example FASTA
docker compose run pggb pggb \
-i /data/input.fa.gz \
-o /output/my_graph \
-n 12 \
-t 8 -p 90 -s 10000
⚠️ FASTA files must be bgzip-compressed (BGZF), not plain gzip. PGGB also expects the compressed FASTA to be indexed with `samtools faidx`.
This produces a local GFA pangenome graph, which is the unit shared in the federated workflow.
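Because a plain gzip file and a bgzip file share the same `.gz` extension, it can help to check the input before launching a long PGGB run. The following is a minimal sketch, not part of the project: the function names are hypothetical, and the check assumes the `BC` extra subfield comes first in the header, which is how bgzip writes it.

```python
# First four bytes of a BGZF block: gzip magic (1f 8b), deflate (08), FEXTRA flag (04).
BGZF_MAGIC = b"\x1f\x8b\x08\x04"

# The standard 28-byte BGZF end-of-file block, used here as a known-good sample.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 "
    "1b 00 03 00 00 00 00 00 00 00 00 00"
)


def looks_like_bgzf(header: bytes) -> bool:
    """Return True if `header` (the first 16+ bytes of a file) starts a BGZF block."""
    if len(header) < 16 or header[:4] != BGZF_MAGIC:
        return False
    # Bytes 12-13 hold the first extra-subfield ID; BGZF uses 'B', 'C'.
    return header[12:14] == b"BC"


def is_bgzip_file(path: str) -> bool:
    """Heuristic check of a file on disk (hypothetical helper)."""
    with open(path, "rb") as handle:
        return looks_like_bgzf(handle.read(16))
```

A file that fails this check can usually be fixed by decompressing it and recompressing with `bgzip`.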
The federated pangenome construction component simulates how multiple biobanks can collaboratively build pangenomes while keeping raw genomic data local.
The genomic background hashing component uses localized pangenome context to encode anonymized haploblock structure for federated association analysis.
Together, these form an end-to-end framework for privacy-preserving, population-scale pangenomics.
Data:
- Human Pangenome Reference Consortium (HPRC)
- Chromosomes 19 and 22 (tractable proof-of-principle)
Each simulated “site”:
- Holds private FASTA data
- Builds a local pangenome graph using PGGB
- Shares only the resulting GFA graph
Graphs are then:
- Aggregated per chromosome
- Used to refine local graphs iteratively
⚠️ Graphs are never combined across chromosomes
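The per-chromosome rule above can be sketched as a grouping step that runs before any aggregation command is built. This is an illustrative sketch only: the manifest layout and function names are hypothetical, and the input paths are assumed to already be in a format `vg combine` can read (any conversion from GFA is not shown).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical manifest entry: (site name, chromosome, path to that site's graph).
SiteGraph = Tuple[str, str, str]


def group_by_chromosome(graphs: List[SiteGraph]) -> Dict[str, List[str]]:
    """Collect graph paths per chromosome so chromosomes are never mixed."""
    grouped = defaultdict(list)
    for _site, chrom, path in graphs:
        grouped[chrom].append(path)
    return {chrom: sorted(paths) for chrom, paths in sorted(grouped.items())}


def combine_commands(graphs: List[SiteGraph]) -> Dict[str, List[str]]:
    """Build one `vg combine` argument list per chromosome.

    Each command writes the merged graph to stdout, so the caller would
    redirect it to a per-chromosome output file.
    """
    return {
        chrom: ["vg", "combine", *paths]
        for chrom, paths in group_by_chromosome(graphs).items()
    }
```

Keeping the grouping separate from command construction makes it easy to assert that no command ever receives graphs from two chromosomes.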
Pipeline steps:
- Download HPRC assemblies
- Extract chr19 / chr22
- Subsample individuals to simulate biobank cohorts
- Run PGGB locally
- Aggregate graphs with `vg combine`
- (Optional) Feedback using `minigraph`
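The subsampling step can be sketched as a seeded, disjoint partition of sample IDs into simulated sites. This is an assumption about how cohorts might be drawn, not the project's actual procedure; the function name, site labels, and round-robin split are all hypothetical.

```python
import random
from typing import Dict, List


def assign_cohorts(samples: List[str], n_sites: int, seed: int = 0) -> Dict[str, List[str]]:
    """Split samples into `n_sites` disjoint cohorts, reproducibly.

    Samples are sorted first so the result depends only on the seed,
    not on the order in which IDs were listed.
    """
    if n_sites < 1:
        raise ValueError("n_sites must be at least 1")
    shuffled = sorted(samples)
    random.Random(seed).shuffle(shuffled)
    # Round-robin slices keep cohort sizes within one sample of each other.
    return {
        f"site{i + 1}": sorted(shuffled[i::n_sites])
        for i in range(n_sites)
    }
```

For example, `assign_cohorts(["HG002", "HG00438", "HG005", "HG00621"], 2, seed=1)` yields two cohorts of two samples each; every sample appears in exactly one site, which mirrors the constraint that each site holds private data.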
docker compose run pggb pggb \
-i /data/input.fa.gz \
-o /output/run_name \
-n 20 \
-t 8 -p 90 -s 10000
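When several simulated sites run the same command with different inputs, a small builder keeps the parameters consistent. The sketch below only assembles the argument list shown above; the function and parameter names are hypothetical. In PGGB, `-n` is the number of haplotypes in the input, `-p` the minimum mapping identity in percent, `-s` the segment length, and `-t` the thread count.

```python
from typing import List


def pggb_command(
    fasta: str,
    out_dir: str,
    n_haplotypes: int,
    threads: int = 8,
    pct_identity: int = 90,
    segment_length: int = 10000,
) -> List[str]:
    """Return the docker compose invocation for one local PGGB run."""
    return [
        "docker", "compose", "run", "pggb", "pggb",
        "-i", fasta,
        "-o", out_dir,
        "-n", str(n_haplotypes),  # should match the number of haplotypes in the FASTA
        "-t", str(threads),
        "-p", str(pct_identity),
        "-s", str(segment_length),
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` on a machine where the containers have been built.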
To demonstrate downstream utility, we focus on the APOE locus, where common variants are a major genetic risk factor for Alzheimer’s disease.
GWAS hits often lack genomic context:
- Identical risk alleles can appear in different haplotypic backgrounds
- These backgrounds may influence penetrance or downstream effects
Pangenomes provide a natural representation of this structure.
Approach:
- Identify APOE-associated loci from GWAS summary statistics
- Extract the corresponding pangenome subgraph
- Encode anonymized haploblock structure as hashes
- Use hashes as a federated genomic background feature
This enables:
- Privacy-preserving association testing
- Context-aware interpretation of risk alleles
- Cross-cohort comparison without raw genotype sharing
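The hashing step could look roughly like the sketch below. The source does not specify how hashes are computed, so everything here is an assumption: a haploblock is represented as the ordered list of node IDs a haplotype traverses in the extracted subgraph, and sites share a secret key and use HMAC-SHA256. Node IDs would need to come from a graph with a shared coordinate space (for example the aggregated graph) for hashes to be comparable across sites.

```python
import hashlib
import hmac
from typing import Sequence


def background_hash(node_path: Sequence[int], shared_key: bytes, digest_chars: int = 16) -> str:
    """Return a keyed hash of a haplotype's walk through the subgraph.

    Identical walks give identical hashes, so sites can compare genomic
    backgrounds without exchanging the underlying sequence.
    """
    # Order-preserving, unambiguous encoding of the walk (hypothetical format).
    encoded = ",".join(str(node_id) for node_id in node_path).encode("ascii")
    digest = hmac.new(shared_key, encoded, hashlib.sha256).hexdigest()
    return digest[:digest_chars]
```

Two carriers of the same risk allele on different backgrounds would then receive different hashes, which is the property the association analysis relies on. A keyed hash hides the walk from parties without the key, but it is not by itself a formal privacy guarantee.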
| Tool | Role |
|---|---|
| PGGB | Local pangenome construction |
| vg | Graph conversion & aggregation |
| minigraph | Incremental graph feedback |
| odgi | Graph slicing & analysis |
| Docker | Reproducibility |
Team (CMU × NVIDIA Hackathon 2026):
- Rob Loughnan
- Adam Kehl
- Jedrzej Kubica
- Kumar Koushik Telaprolu
- Jeff Winchell
- Sanjnaa Sridhar
- Dhruv Gor
- Samarpan Mohanty
- Wightman DP et al. Nature Genetics (2021)
- Garrison E et al. Nature Methods (2024)
- Garrison E et al. Nature Biotechnology (2018)
- Guarracino A et al. Bioinformatics (2022)
- Sirén J et al. Science (2021)
- Liao W-W et al. Nature (2023)
- Li H et al. Genome Biology (2020)
MIT License