This document provides a high-level architectural overview of the just-dna-lite ecosystem and its interconnected components. If you are a contributor looking to understand how everything fits together or where to make specific changes, this guide will orient you.
The ecosystem is designed to take raw genomic data (VCF files), normalize it, and annotate it using curated biological knowledge and Polygenic Risk Scores (PRS). The architecture strictly separates data preparation (upstream) from data consumption (local client).
The core philosophy is local-first privacy: while reference datasets are downloaded from the cloud, the user's actual genome never leaves their computer. All computationally intensive joins and scoring happen locally using Polars and DuckDB.
```mermaid
graph TD
    subgraph "Upstream Data Preparation (Cloud / CI)"
        PA[prepare-annotations\nDagster Pipeline]
        PRSP[just-prs/prs-pipeline\nDagster Pipeline]
        RawEnsembl[(Raw Ensembl VCF)] --> PA
        RawOakVar[(OakVar Modules)] --> PA
        RawClinVar[(Raw ClinVar)] --> PA
        RawPGS[(Raw PGS Catalog FTP)] --> PRSP
        RefPanel[(1000G / HGDP Reference)] --> PRSP
    end

    subgraph "Hugging Face Hub (CDN)"
        HF_Mods[(just-dna-seq/annotators\nModule Parquets)]
        HF_PGS[(just-dna-seq/polygenic_risk_scores\nCleaned Metadata)]
        HF_Perc[(just-dna-seq/prs-percentiles\nReference Distributions)]
        PA -- Publishes --> HF_Mods
        PRSP -- Publishes --> HF_PGS
        PRSP -- Publishes --> HF_Perc
    end

    subgraph "External Knowledge Bases"
        BioContext["BioContext KB MCP\n(Ensembl, PubMed, UniProt)"]
    end

    subgraph "Local Client (User's Machine)"
        JDL[just-dna-lite\nDagster + Web UI]
        JPRS[just-prs\nCore Library]
        PRSUI[prs-ui\nReflex Components]
        RMDG[reflex-mui-datagrid\nUI Components]
        AIAgent[AI Module Manager\nAgno Solo or Team]
        UserVCF[(User's Raw VCF)] --> JDL
        Research[(Research PDF / CSV)] --> AIAgent
        BioContext -. Tools .-> AIAgent
        AIAgent -- Compiles Custom Module --> JDL
        HF_Mods -- Downloads via fsspec --> JDL
        HF_PGS -- Downloads via fsspec --> JPRS
        HF_Perc -- Downloads via fsspec --> JPRS
        JPRS --> JDL
        PRSUI --> JDL
        RMDG --> JDL
        RMDG --> PRSUI
        JDL -- Polars / DuckDB Join --> Annotated[(Annotated Local Parquet)]
        JDL -- Generates --> Report[Interactive Web / PDF Report]
        PRSUI -- Computes --> PRSResults[PRS Scores & Percentiles]
        PRSResults -. Displayed In .-> JDL
    end
```
The ecosystem consists of four main projects, plus an agentic layer for AI module creation:

- `just-dna-lite`: The main user-facing application (pipelines + web UI) with AI module creation.
- `just-prs`: The engine and UI components for Polygenic Risk Scores.
- `prepare-annotations`: The upstream data preparation pipelines.
- `reflex-mui-datagrid`: The core UI component library for rendering large genomic datasets.
- AI Module Manager (Agno Agents): An agentic pipeline that turns arbitrary input (articles, CSVs) into compiled modules.
Repository: just-dna-lite
This is the primary workspace and the entry point for end users. It acts as the orchestrator that brings all other components together.
- `just-dna-pipelines`: A Dagster-based pipeline library. It handles VCF ingestion, normalization (stripping `chr` prefixes, quality filtering), and the actual annotation. It downloads reference datasets from Hugging Face and performs fast, memory-efficient joins (using Polars or DuckDB) against the user's normalized VCF. The final outputs of these pipelines are annotated Parquet files and rich, structured HTML/PDF reports.
- `webui`: A Reflex-based web frontend. It provides the interface for users to upload VCF files, select annotation modules, trigger Dagster jobs, and view the resulting annotated data and visual reports. It also houses the AI Module Manager (see AI Module Creation), an agentic chat interface for creating custom annotation modules from research papers.
Role in Ecosystem:
- The central orchestrator that users interact with.
- Merges user data with curated knowledge to generate interactive HTML/PDF reports and annotated Parquet files for downstream analysis.
Repository: just-prs (included as a workspace member in just-dna-lite)
A dedicated suite for computing Polygenic Risk Scores from the PGS Catalog. It acts as a pure Python alternative to tools like PLINK2.
- `just_prs` (Core Library): Parses scoring files, normalizes variants, and computes scores using `pgenlib`, Polars, and NumPy.
- `prs-ui`: Reusable Reflex components (`PRSComputeStateMixin`, `prs_section()`) that provide the UI for browsing the PGS Catalog and computing scores. `just-dna-lite` embeds these components directly.
- `prs-pipeline`: A Dagster pipeline that computes reference distributions (percentiles) by scoring thousands of PGS IDs against population panels (like 1000 Genomes).
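At its core, a polygenic risk score is a weighted sum of effect-allele dosages. A minimal sketch (the weights and dosages below are made up; the real library reads them from PGS Catalog scoring files and `.pgen` genotypes):

```python
import numpy as np

# Hypothetical per-variant effect weights from a PGS scoring file,
# and a user's dosages (0, 1, or 2 copies of each effect allele).
effect_weights = np.array([0.12, -0.05, 0.30])
dosages = np.array([2, 1, 0])

# The PRS is the dot product of dosages and effect weights.
prs = float(np.dot(dosages, effect_weights))
print(prs)
```

Real scoring additionally handles strand flips, missing genotypes, and allele matching, which is where most of the library's complexity lives.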
Role in Ecosystem:
- Provides the heavy-lifting logic for calculating Polygenic Risk Scores locally.
- Provides embeddable UI components for interacting with the PGS Catalog.
- Computes and surfaces PRS results and population percentiles alongside traditional annotations in the `just-dna-lite` UI.
Repository: prepare-annotations
This is a separate, upstream repository. It is a Dagster project dedicated to downloading massive, messy public datasets (Ensembl, ClinVar, OakVar modules) and converting them into clean, standardized Parquet files.
The Unified Schema: It converts various knowledge sources into three standard tables:
- `annotations.parquet`: Variant-level facts (gene, phenotype).
- `studies.parquet`: Literature evidence (PubMed IDs, p-values).
- `weights.parquet`: Curator-defined scoring (genotype-specific weights).
Role in Ecosystem:
- Processes large, messy upstream biological datasets (Ensembl, ClinVar, OakVar) and standardizes them into Parquet format.
- Publishes these datasets to Hugging Face Hub for consumption by `just-dna-lite`.
Repository: reflex-mui-datagrid
A standalone Reflex wrapper for the MUI X DataGrid React component.
- It is heavily optimized for rendering genomic data.
- It natively supports Polars `LazyFrame` and integrates with `polars-bio` to render VCFs directly without fully loading them into memory.
- It provides server-side pagination, sorting, and filtering via the `LazyFrameGridMixin`.
Role in Ecosystem:
- Handles the UI challenge of visualizing multi-million row Parquet/VCF files smoothly in the browser.
- Used across the web apps (`just-dna-lite` and `prs-ui`) to display data grids.
Location: `just-dna-pipelines/src/just_dna_pipelines/agents` within `just-dna-lite`
See AI Module Creation for full documentation of the DSL and agent prompts.
We actively encourage users to build their own AI-based annotation modules rather than relying strictly on upstream curators. The ecosystem includes an agentic architecture (powered by Agno) that transforms arbitrary unstructured inputs—such as a PDF research article, a CSV, or a natural language description—into a fully functional annotation module.
The Workflow:
- Agent(s): A user submits a query and optionally attaches research files. The agent parses the text, extracts variants (resolving them via Ensembl if necessary), assigns weights/conclusions, and outputs a structured Domain Specific Language (DSL) consisting of `module_spec.yaml` and CSV files.
- Compiler: A deterministic Python function takes the DSL, validates it against schemas, resolves missing positions, normalizes genotypes, and generates Parquet files (`weights.parquet`, `annotations.parquet`, `studies.parquet`).
- Registry: The compiled Parquet files are written to disk and registered in the `modules.yaml` configuration, making the new custom module instantly available in the Dagster pipeline and web UI.
Solo vs. Team Mode:
- Solo Mode: A single agent (powered by Gemini) utilizes all tools (compile, validate, register) and queries the knowledge base to build the module from start to finish.
- Team Mode (Research Team): A Principal Investigator (PI) coordinates a multi-agent team.
- Researchers: 1-3 agents (can be a mix of Gemini, Claude, and OpenAI depending on available API keys) that research specific variants using the knowledge base.
- Reviewer: A quality-control agent equipped with Google Search to fact-check the drafted module.
- Principal Investigator (PI): Coordinates the researchers and reviewer, handles the deterministic compilation tools, and produces the final module.
BioContext KB MCP:
The AI agents are supercharged by a hosted Model Context Protocol (MCP) server named BioContext KB. When building modules, researchers and solo agents query this MCP server to fetch real-time variant details from Ensembl, literature references from EuropePMC, pathway data from Reactome, and protein data from UniProt. Grounding the agents in these live lookups sharply reduces hallucinated genetic positions and reference alleles.
Role in Ecosystem:
- Democratizes the creation of genetic interpretation modules.
- Allows rapid integration of newly published research without waiting for centralized curation.
A critical part of the architecture is how data moves from the upstream preparation pipelines to the local client. To avoid distributing massive flat files or querying slow APIs at runtime, the ecosystem relies heavily on Hugging Face Hub as a CDN for Parquet files.
- Annotation Modules (`prepare-annotations`)
  - The `prepare-annotations` pipelines publish standardized modules (e.g., `longevitymap`, `lipidmetabolism`, `coronary`) to the Hugging Face dataset: just-dna-seq/annotators.
  - Each module contains the unified `annotations.parquet`, `studies.parquet`, and `weights.parquet` files.
- PGS Catalog Metadata (`just-prs`)
  - The `just-prs` cleanup pipeline processes raw EBI FTP data and publishes cleaned, snake_case Parquet files to: just-dna-seq/polygenic_risk_scores.
- PRS Reference Percentiles (`just-prs`)
  - The `prs-pipeline` scores the 1000 Genomes panel to generate statistical distributions for every PGS score, allowing the app to tell a user what percentile they fall into. These distributions are published to: just-dna-seq/prs-percentiles.
When a user runs an annotation job in just-dna-lite:
- The local Dagster pipeline (`just-dna-pipelines`) discovers available modules via `modules.yaml`.
- It uses `fsspec` (via `HfFileSystem`) or direct `hf://` Polars URIs to download the required Parquet files from Hugging Face.
- These files are cached locally in the user's `~/.cache/just-dna-pipelines/` directory to avoid redundant downloads.
- The pipeline performs a fast, local join between the downloaded weights/annotations and the user's normalized VCF.
The system is designed to gracefully handle updates at various levels—whether it's adding entirely new annotation modules, updating the Ensembl reference genome, or importing the latest Polygenic Risk Scores from the PGS Catalog.
What it means: A module (e.g., longevitymap, superhuman) needs to be refreshed from a new underlying dataset, or a user wants to create a new one using the AI Module Manager.
How to update:
- Pre-curated / Upstream Modules:
  - Open the `prepare-annotations` repository.
  - If the data source has been updated (e.g., a new release from OakVar), run the corresponding Dagster pipeline (e.g., `uv run prepare longevitymap`).
  - The pipeline will automatically fetch the new data, convert it to the unified Parquet schema (`annotations.parquet`, `studies.parquet`, `weights.parquet`), and publish the new versions to the Hugging Face `just-dna-seq/annotators` dataset.
  - Local clients running `just-dna-lite` will automatically pull the updated Parquet files the next time they run an annotation job (provided their local cache is invalidated or updated).
- Custom AI Modules:
  - In the `just-dna-lite` web UI, open the Module Manager.
  - The user interacts with the AI agents to define a new module (or update an existing one) based on research papers or prompt inputs.
  - Once the AI finalizes the module, the user clicks Register Module. This compiles the generated annotations into Parquet format and saves it locally.
  - The module is immediately available for selection in the main annotation pipeline without needing to touch Hugging Face.
What it means: A new release of the Ensembl reference database (e.g., v111 -> v112) or ClinVar is available.
How to update:
- Open the `prepare-annotations` repository.
- In the `prepare-annotations` pipeline configuration, update the target release version (e.g., Ensembl release number or ClinVar FTP URL).
- Run the Ensembl/ClinVar Dagster pipeline (`uv run dagster-ensembl`).
- This pipeline is heavy. It will download the massive VCFs, split them by chromosome, parse them into Parquet format, and upload the partitioned dataset to Hugging Face.
- In `just-dna-lite`, update the Ensembl reference pointer in the `modules.yaml` configuration to point to the newly published dataset version, so local pipelines know to fetch the latest reference chunks.
For deep details on PRS, see the just-prs documentation or the Architecture Overview.
What it means: The PGS Catalog has published new scoring files, or the underlying reference panels (1000G, HGDP) used to calculate percentiles need to be refreshed.
How to update:
- Open the `just-prs` repository.
- To update the PGS Catalog Metadata (the searchable list of all scores and their performance metrics):
  - Run the cleanup pipeline: `prs catalog bulk clean-metadata`.
  - This downloads the latest bulk FTP data from EBI, cleans up genome builds and metric strings, and produces `scores.parquet`, `performance.parquet`, and `best_performance.parquet`.
  - Push the new metadata to Hugging Face: `prs push-hf`.
- To update Reference Percentiles (the statistical distributions that determine a user's percentile for a given score):
  - Run the batch scoring pipeline using the CLI: `prs reference score-batch` (or launch the `prs-pipeline` Dagster UI).
  - The engine uses DuckDB and Polars to score all ~5,000+ PGS IDs against the reference panel (`.pgen` files).
  - This computes the mean, standard deviation, and full distribution metrics for every score across different ancestries, producing `{panel}_distributions.parquet`.
  - The resulting percentiles data is pushed to the `just-dna-seq/prs-percentiles` Hugging Face repository.
- When users launch the `prs-ui` or run `just-dna-lite`, the `PRSCatalog` class will automatically pull the updated metadata and percentile Parquets from Hugging Face into their local cache.
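To illustrate how the published distributions turn a raw score into a percentile: if a score's reference distribution is summarized by a mean and standard deviation (a simplification — the real files store fuller distribution metrics per ancestry), the lookup reduces to a CDF evaluation. The numbers below are made up:

```python
from statistics import NormalDist

# Illustrative reference-distribution summary for one PGS ID
# (real values come from {panel}_distributions.parquet).
ref_mean, ref_std = 0.42, 0.10
user_score = 0.55

# Percentile under a normal approximation of the reference distribution.
percentile = NormalDist(mu=ref_mean, sigma=ref_std).cdf(user_score) * 100
print(round(percentile, 1))
```

A non-parametric variant would instead rank the user's score against the stored empirical distribution, which is more robust for skewed scores.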