AI Module Creation

Vision

Annotation modules in just-dna-lite are curated SNP filter sets: a geneticist selects variants, assigns per-genotype weights/states/conclusions, links literature evidence, and packages everything as three parquet tables (weights, annotations, studies).

This work is labour-intensive but structurally simple, which makes it a good target for AI assistance.

Goal: an agentic pipeline that turns arbitrary input (research article, CSV dump, free-text panel description) into a valid, deployable annotation module.

┌──────────────┐      ┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  Arbitrary   │      │   Module     │      │   Parquet    │      │  Registered  │
│   Input      │─────▶│   Spec (DSL) │─────▶│   Module     │─────▶│  & Live in   │
│  (article,   │  AI  │  YAML + CSV  │ Det. │  weights/    │ Reg. │  UI + CLI    │
│   CSV, md)   │ Agent│              │Compil│  annotations/│ istry│              │
└──────────────┘      └──────────────┘  er  │  studies     │      └──────────────┘
                                            └──────────────┘

The chain has three parts:

Part	Input → Output	Nature
Agent (Geneticist)	Arbitrary text → Module Spec	Creative, AI-driven
Compiler	Module Spec → Parquet	Deterministic, tested
Registry	Parquet → Live module	Automatic (persists to modules.yaml, refreshes discovery)

The compiler and registry are tools the agent calls to validate, build, and deploy its output, which gives it a fast feedback loop.

DSL Format: Module Spec

A module spec is a directory containing:

my_module/
├── module_spec.yaml   # metadata + settings
├── variants.csv       # weights + annotations (combined)
└── studies.csv        # literature evidence (optional)

`module_spec.yaml`

schema_version: "1.0"

module:
  name: my_module                       # machine name, lowercase, underscores
  title: "My Module"                    # human-readable title
  description: "What this module does"  # one-liner
  report_title: "Report Section Title"  # title in PDF/HTML reports
  icon: heart-pulse                     # Fomantic UI icon name
  color: "#21ba45"                      # hex color for UI

defaults:
  curator: ai-module-creator            # default curator for all rows
  method: literature-review             # default annotation method
  priority: medium                      # default priority

genome_build: GRCh38                    # reference genome (positions must match)

`variants.csv`

One row per (rsid, genotype) combination. Combines weights + annotation data.

Column	Required	Type	Description
`rsid`	yes	string	dbSNP ID, e.g. `rs1801133`
`chrom`	no*	string	Chromosome without "chr" prefix, e.g. `1`
`start`	no*	int	1-based genomic position (GRCh38)
`ref`	no*	string	Reference allele
`alts`	no*	string	Alt allele(s), comma-separated if multiple
`genotype`	yes	string	Slash-separated sorted alleles, e.g. `A/G`
`weight`	yes	float	Annotation score (positive = protective, negative = risk)
`state`	yes	string	One of: `risk`, `protective`, `neutral`, `significant`, `alt`, `ref`
`conclusion`	yes	string	Human-readable interpretation for this genotype
`priority`	no	string	Priority level (overrides default)
`gene`	yes	string	Gene symbol, e.g. `MTHFR`
`phenotype`	yes	string	Associated trait/phenotype
`category`	yes	string	Grouping category within module
`clinvar`	no	bool	Is this variant in ClinVar?
`pathogenic`	no	bool	ClinVar pathogenic flag
`benign`	no	bool	ClinVar benign flag
`curator`	no	string	Overrides default curator
`method`	no	string	Overrides default method

* Position columns (chrom, start, ref, alts) can be omitted if the compiler has rsid resolution enabled. The compiler resolves them from dbSNP. If provided, they're used as-is (faster, no network call).

Genotype convention: alleles are alphabetically sorted and slash-separated, so write A/G, not G/A. Homozygous genotypes repeat the allele, e.g. T/T. The compiler normalizes to List[String] for parquet.
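The genotype convention can be illustrated with a small stdlib-only sketch. The function names `normalize_genotype` and `is_canonical` are hypothetical, and the sketch assumes exactly two alleles per genotype and upper-case allele strings; the real compiler may handle more cases.

```python
def normalize_genotype(genotype: str) -> list[str]:
    """Split a slash-separated genotype and return its alleles sorted alphabetically."""
    alleles = [allele.strip().upper() for allele in genotype.split("/")]
    # Assumption: diploid genotypes only, so exactly two non-empty alleles.
    if len(alleles) != 2 or not all(alleles):
        raise ValueError(f"expected two slash-separated alleles, got {genotype!r}")
    return sorted(alleles)


def is_canonical(genotype: str) -> bool:
    """True when the string is already in the sorted, slash-separated form."""
    try:
        return "/".join(normalize_genotype(genotype)) == genotype
    except ValueError:
        return False
```

Here `normalize_genotype("G/A")` yields the list form `["A", "G"]`, while `is_canonical("G/A")` is false because the alleles are not in sorted order.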

`studies.csv`

One row per (rsid, pmid) combination.

Column	Required	Type	Description
`rsid`	yes	string	dbSNP ID
`pmid`	yes	string	PubMed ID (digits only)
`population`	no	string	Study population, e.g. `European`
`p_value`	no	string	Statistical significance, e.g. `<0.001`
`conclusion`	yes	string	Study-specific conclusion
`study_design`	no	string	e.g. `meta-analysis`, `GWAS`, `case-control`

Compiler: Spec → Parquet

The compiler is a deterministic Python function (no AI). It:

Validates the spec (Pydantic models, required columns, value ranges)
Resolves missing positions via rsid lookup (optional, requires Ensembl DuckDB)
Normalizes genotypes ("A/G" → ["A", "G"]), chromosomes, types
Splits variants.csv into weights.parquet + annotations.parquet
Writes studies.parquet from studies.csv
Validates output against the existing pipeline's schema expectations

Split logic (variants.csv → two parquet tables)

variants.csv
    │
    ├──▶ weights.parquet
    │      columns: rsid, chrom, start, ref, alts, genotype, weight,
    │               state, conclusion, priority, module, curator, method,
    │               clinvar, pathogenic, benign, likely_pathogenic, likely_benign
    │
    └──▶ annotations.parquet
           columns: rsid, gene, phenotype, category
           (deduplicated — one row per unique rsid)
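As a rough illustration of the split, the sketch below works on plain dict rows rather than DataFrames. `split_variants` is a hypothetical name, not the compiler's actual function, and the two `likely_*` columns are left out because they have no counterpart in variants.csv.

```python
WEIGHT_COLUMNS = (
    "rsid", "chrom", "start", "ref", "alts", "genotype", "weight",
    "state", "conclusion", "priority", "curator", "method",
    "clinvar", "pathogenic", "benign",
)
ANNOTATION_COLUMNS = ("rsid", "gene", "phenotype", "category")


def split_variants(rows: list[dict], module_name: str) -> tuple[list[dict], list[dict]]:
    """Return (weights_rows, annotations_rows) for the two output tables."""
    weights: list[dict] = []
    annotations: dict[str, dict] = {}
    for row in rows:
        # One weights row per (rsid, genotype), tagged with the module name.
        weight_row = {col: row.get(col) for col in WEIGHT_COLUMNS}
        weight_row["module"] = module_name
        weights.append(weight_row)
        # One annotations row per rsid: setdefault keeps the first row seen.
        annotations.setdefault(
            row["rsid"], {col: row.get(col) for col in ANNOTATION_COLUMNS}
        )
    return weights, list(annotations.values())
```

Because dicts preserve insertion order, the annotations rows come out in the order each rsid first appears in variants.csv.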

Compiler output structure

output/my_module/
├── weights.parquet
├── annotations.parquet
└── studies.parquet

This is directly compatible with the existing module discovery system — drop it into any configured source and it's auto-discovered.

Package layout

just-dna-pipelines/src/just_dna_pipelines/module_compiler/
├── __init__.py      # Public API re-exports
├── models.py        # Pydantic 2 models (DSL schema + result types)
├── compiler.py      # validate_spec() + compile_module()
└── cli.py           # Typer commands (module validate, module compile)

Compilation pipeline

Validation runs before any output is written. Errors are collected and returned as a list.

Field-level (Pydantic): rsid matches ^rs\d+$, genotype is alphabetically sorted slash-separated alleles, state is one of six valid enum values, chrom is 1-22/X/Y/MT, pmid is digits-only, module.name is lowercase alphanumeric + underscores.

Cross-row: positional consistency (all rows for the same rsid share chrom/start), uniqueness of (rsid, genotype) and (rsid, pmid) pairs, and study rsid coverage warnings.

Directional warnings (not errors): state=risk with positive weight, state=protective with negative weight.
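The field-level rules and directional warnings above could be checked roughly as follows. This is a stdlib-only sketch, whereas the compiler itself uses Pydantic models; `check_variant_row` and the message wording are hypothetical.

```python
import re

VALID_STATES = {"risk", "protective", "neutral", "significant", "alt", "ref"}
VALID_CHROMS = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}
RSID_RE = re.compile(r"rs\d+")  # used with fullmatch, equivalent to ^rs\d+$


def check_variant_row(row: dict) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for one variants.csv row without raising."""
    errors: list[str] = []
    warnings: list[str] = []
    if not RSID_RE.fullmatch(row.get("rsid") or ""):
        errors.append("rsid must look like rs<digits>")
    alleles = (row.get("genotype") or "").split("/")
    if len(alleles) < 2 or not all(alleles) or alleles != sorted(alleles):
        errors.append("genotype must be sorted, slash-separated alleles")
    state = row.get("state")
    if state not in VALID_STATES:
        errors.append(f"state must be one of {sorted(VALID_STATES)}")
    chrom = row.get("chrom")
    if chrom and chrom not in VALID_CHROMS:  # chrom is optional
        errors.append("chrom must be 1-22, X, Y or MT")
    try:
        weight = float(row.get("weight"))
    except (TypeError, ValueError):
        errors.append("weight must be a number")
        weight = None
    # Directional mismatches are reported as warnings, not errors.
    if weight is not None:
        if state == "risk" and weight > 0:
            warnings.append("state=risk with positive weight")
        if state == "protective" and weight < 0:
            warnings.append("state=protective with negative weight")
    return errors, warnings
```

A row with `chrom` set to `chr1` is rejected because the spec expects the chromosome without the "chr" prefix.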

Validation vs compilation

	`validate_spec()`	`compile_module()`
Side effects	None	Writes parquet files
On error	`valid=False` + error list	`success=False` + error list

compile_module calls validate_spec internally — if validation fails, no output is produced.
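The relationship between the two entry points can be sketched as below. Everything here is a simplified stand-in: `validate_rows`, `compile_rows`, and the injected `write` callable are hypothetical, and the result classes carry only a subset of the fields the real `ValidationResult` and `CompilationResult` expose.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Optional

Row = dict[str, str]


@dataclass
class ValidationResult:
    valid: bool
    errors: list[str] = field(default_factory=list)


@dataclass
class CompilationResult:
    success: bool
    errors: list[str] = field(default_factory=list)
    output_dir: Optional[Path] = None


def validate_rows(rows: list[Row]) -> ValidationResult:
    """Pure check: reports problems and touches nothing on disk."""
    errors = [
        f"row {i}: missing {col}"
        for i, row in enumerate(rows, start=1)
        for col in ("rsid", "genotype")
        if not row.get(col)
    ]
    return ValidationResult(valid=not errors, errors=errors)


def compile_rows(
    rows: list[Row], output_dir: Path, write: Callable[[list[Row], Path], None]
) -> CompilationResult:
    """Validate first; create output only when validation passes."""
    validation = validate_rows(rows)
    if not validation.valid:
        return CompilationResult(success=False, errors=validation.errors)
    output_dir.mkdir(parents=True, exist_ok=True)
    write(rows, output_dir)  # stand-in for the parquet writer
    return CompilationResult(success=True, output_dir=output_dir)
```

Returning early on a failed validation is what guarantees that a broken spec never leaves partial output behind.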

Design decisions

CSV is read via stdlib csv.DictReader (not Polars) because the conclusion field contains commas, quotes, and long text that Polars CSV inference can misparse. Each row is validated through Pydantic before being collected into a DataFrame.
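A minimal illustration of why csv.DictReader copes with such fields: quoted text containing commas and doubled quotes comes back as a single string. The helper name `read_spec_rows` and the sample row are made up for this example.

```python
import csv
import io
from typing import TextIO


def read_spec_rows(stream: TextIO) -> list[dict[str, str]]:
    """Parse CSV text into one dict per row, keyed by the header line."""
    return list(csv.DictReader(stream))


# The conclusion field contains a comma and escaped ("") quotes.
sample = io.StringIO(
    "rsid,genotype,weight,state,conclusion\n"
    'rs1801133,C/T,-0.6,risk,"Reduced MTHFR activity, see ""Hello World"" example."\n'
)
rows = read_spec_rows(sample)
print(rows[0]["conclusion"])
```

All values arrive as strings (the weight is `"-0.6"`), which is why each row still goes through typed validation before it is collected into a DataFrame.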

Validation collects all errors and never raises. When the validation functions are used as agent tools, the caller needs a complete error list to fix all issues in one pass.
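The collect-everything style can be shown with the cross-row checks. The sketch below is an assumption-laden illustration: `cross_row_errors` is a hypothetical name, and "study rsid coverage" is interpreted here as studies that reference an rsid with no variant rows, which may not match the compiler's exact rule.

```python
from collections import Counter


def cross_row_errors(
    variants: list[dict], studies: list[dict]
) -> tuple[list[str], list[str]]:
    """Run every cross-row check and return (errors, warnings) without raising."""
    errors: list[str] = []
    warnings: list[str] = []

    # Uniqueness of (rsid, genotype) and (rsid, pmid) pairs.
    for (rsid, genotype), n in Counter((v["rsid"], v["genotype"]) for v in variants).items():
        if n > 1:
            errors.append(f"duplicate (rsid, genotype): {rsid} {genotype}")
    for (rsid, pmid), n in Counter((s["rsid"], s["pmid"]) for s in studies).items():
        if n > 1:
            errors.append(f"duplicate (rsid, pmid): {rsid} {pmid}")

    # Positional consistency: all rows for one rsid share chrom/start.
    first_position: dict[str, tuple] = {}
    for v in variants:
        position = (v.get("chrom"), v.get("start"))
        if first_position.setdefault(v["rsid"], position) != position:
            errors.append(f"{v['rsid']}: inconsistent chrom/start across rows")

    # Coverage (assumed direction): studies pointing at unknown rsids.
    known = {v["rsid"] for v in variants}
    for rsid in sorted({s["rsid"] for s in studies} - known):
        warnings.append(f"study references {rsid}, which has no variant rows")

    return errors, warnings
```

Each check appends to the shared lists and moves on, so a single call reports every problem the spec has.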

annotations.parquet has one row per rsid. When the same rsid appears with multiple genotypes, the first row's gene/phenotype/category wins. This matches the pipeline's expectation that annotations are variant-level, not genotype-level.

Module Registry — Python API & CLI

The module registry exposes a complete API for programmatic module management.

Python API (`just_dna_pipelines.module_registry`)

from just_dna_pipelines.module_registry import (
    validate_module_spec,     # dry-run validation (no side effects)
    register_custom_module,   # compile + register + refresh
    unregister_custom_module, # remove + refresh
    list_custom_modules,      # names of custom modules on disk
    get_custom_module_specs,  # {name: output_dir} dict
    refresh_module_registry,  # reload modules.yaml + rediscover
    CUSTOM_MODULES_DIR,       # Path to compiled custom modules
)

Function	Input	Output	Side effects
`validate_module_spec(spec_dir)`	Path to spec folder	`ValidationResult` (.valid, .errors, .warnings, .stats)	None
`register_custom_module(spec_dir)`	Path to spec folder	`CompilationResult` (.success, .errors, .stats, .output_dir)	Writes parquet, updates modules.yaml, refreshes globals
`unregister_custom_module(name)`	Module machine name	`bool` (True if removed)	Deletes parquet, updates modules.yaml, refreshes globals
`list_custom_modules()`	—	`List[str]` of names	None
`refresh_module_registry()`	—	`List[str]` of all modules	Reloads config, re-discovers modules

CLI (`uv run pipelines module ...`)

uv run pipelines module validate data/module_specs/my_panel/
uv run pipelines module register data/module_specs/my_panel/
uv run pipelines module unregister my_panel
uv run pipelines module list-custom
uv run pipelines module compile data/module_specs/my_panel/ -o data/output/modules/my_panel/

What happens on `register`

Validates the spec (Pydantic models, CSV rows, cross-row checks)
Compiles to parquet in data/output/modules/<module_name>/
Ensures a local collection source for data/output/modules/ exists in modules.yaml
Adds display metadata (title, description, icon, color) from module_spec.yaml to modules.yaml
Refreshes in-memory module discovery (MODULE_INFOS, DISCOVERED_MODULES)
Module is immediately selectable in the web UI and CLI without restart

register_custom_module is idempotent — calling it again for the same spec overwrites the existing parquet and refreshes metadata. This enables iterative development: edit the DSL spec, re-register, check annotation results, repeat.

Agent Design (Agno)

Solo mode

┌────────────────────────────────────────────┐
│              Module Creator Agent           │
│                                            │
│  System prompt: geneticist persona         │
│  Model: Gemini Pro                         │
│                                            │
│  Tools:                                    │
│  ├─ validate_module_spec → dry-run check   │
│  ├─ register_custom_module → deploy module │
│  ├─ lookup_rsid     → dbSNP/Ensembl info   │
│  ├─ search_pubmed   → find PMIDs           │
│  └─ diff_modules    → compare with ref     │
│                                            │
│  Output: module_spec.yaml + CSVs           │
└────────────────────────────────────────────┘

Workflow: parse input → identify variants → look up rsid details → assign per-genotype weights → find PubMed references → write spec files → validate → fix errors (loop up to 10 iterations) → register → module is live.
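The validate, fix, register tail of this workflow might look like the loop below. It is only a sketch: the three callables are injected stand-ins (in the real agent they would be the registry tools and an LLM-driven repair step), and `validate_fix_loop` is a hypothetical name.

```python
from typing import Callable, Optional


def validate_fix_loop(
    spec_dir: str,
    validate: Callable[[str], list[str]],
    fix: Callable[[str, list[str]], None],
    register: Callable[[str], object],
    max_iterations: int = 10,
) -> Optional[int]:
    """Return the attempt number on which the spec registered, or None."""
    for attempt in range(1, max_iterations + 1):
        errors = validate(spec_dir)
        if not errors:
            register(spec_dir)  # deploy only a spec that validated cleanly
            return attempt
        if attempt < max_iterations:
            fix(spec_dir, errors)  # hand the full error list to the fixer
    return None  # iteration budget exhausted, nothing registered
```

Passing the complete error list to `fix` on each round is what lets the agent repair several issues per iteration instead of one at a time.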

Team mode (research swarm)

The team has 3–5 agents depending on which API keys are configured:

Agent	Role	Model
PI (Principal Investigator)	Coordinator. Delegates tasks, synthesizes consensus, writes final module.	Gemini Pro
Researcher 1	Independent variant research via BioContext MCP. Always present.	Gemini Pro
Researcher 2	Same role, different LLM for diversity. Present if `OPENAI_API_KEY` is set.	GPT
Researcher 3	Same role, third perspective. Present if `ANTHROPIC_API_KEY` is set.	Claude Sonnet
Reviewer	Quality review: variant integrity, provenance, weight consistency, PMID validity.	Gemini Flash

Team flow: PI delegates research to all Researchers in parallel → each returns a variant list → PI synthesizes consensus (variants confirmed by ≥2 researchers are included, weight disagreements use median) → PI sends draft to Reviewer → Reviewer returns ERRORS/WARNINGS/OK → PI fixes errors → writes and registers module.
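The consensus rule can be written down as a few lines of Python. This is a sketch under assumptions: each researcher report is modelled as a dict from (rsid, genotype) to weight, support is counted per rsid, and `build_consensus` is a hypothetical name. In the actual team the PI agent performs this synthesis through its prompt, not through code.

```python
from collections import Counter, defaultdict
from statistics import median

Report = dict[tuple[str, str], float]  # (rsid, genotype) -> proposed weight


def build_consensus(reports: list[Report], min_support: int = 2) -> Report:
    """Keep variants reported by at least min_support researchers; take median weights."""
    # Count how many researchers mention each rsid (once per researcher).
    support = Counter(
        rsid for report in reports for rsid in {key[0] for key in report}
    )
    votes: dict[tuple[str, str], list[float]] = defaultdict(list)
    for report in reports:
        for (rsid, genotype), weight in report.items():
            if support[rsid] >= min_support:
                votes[(rsid, genotype)].append(weight)
    # Weight disagreements are settled by the median of the proposals.
    return {key: median(weights) for key, weights in votes.items()}
```

With three researchers proposing -0.5, -0.7 and -0.6 for the same genotype, the consensus weight is -0.6; a variant mentioned by only one researcher is dropped.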

There are no human-approve/reject steps between phases. The PI orchestrates end-to-end. After completion, the user reviews the module in the editing slot and can iterate via chat.

BioContext KB MCP

Variant lookup and literature search use a hosted MCP server: BioContext KB (https://biocontext-kb.fastmcp.app/mcp), integrated via Agno's MCPTools(url=...).

Available databases: Ensembl, EuropePMC, UniProt, Open Targets, Reactome, Human Protein Atlas, KEGG, ClinicalTrials.gov, AlphaFold, InterPro, OLS, STRINGDb, bioRxiv, Google Scholar.

Limitations

Max 5 file attachments per message (PDF, CSV, Markdown, plain text)
Each Researcher is limited to 9 tool calls to prevent context overflow
No human-in-the-loop between agent phases
Gemini key is required; OpenAI and Anthropic keys are optional (add researchers)
Only GRCh38 and GRCh37 genome builds
Only SNPs/indels (no structural or copy-number variants)
No persistent audit trail across page reloads

How modules are discovered and loaded

Configuration is read from modules.yaml at project root (or fallback inside the package)
discover_all_modules() iterates over all configured sources
For each source, the loader checks the protocol: hf:// → HuggingFace, github:// → GitHub, s3:// → S3, https:// → HTTP, /absolute/path → local filesystem
Auto-detection: if weights.parquet exists at root → single module; otherwise scan subfolders for a collection
Results stored in MODULE_INFOS and DISCOVERED_MODULES globals, populated at import time
refresh_modules() re-reads config and re-discovers without process restart
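The protocol check and auto-detection steps above can be approximated as follows. Both function names are hypothetical, and the rule that a subfolder counts as a module only if it contains weights.parquet is an assumption about how collections are scanned.

```python
from pathlib import Path

# URI prefix -> loader kind, as listed in the discovery steps.
SCHEMES = {
    "hf://": "huggingface",
    "github://": "github",
    "s3://": "s3",
    "https://": "http",
}


def classify_source(uri: str) -> str:
    """Map a configured source string to the loader that should handle it."""
    for prefix, kind in SCHEMES.items():
        if uri.startswith(prefix):
            return kind
    if uri.startswith("/"):
        return "local"
    raise ValueError(f"unsupported module source: {uri!r}")


def find_local_modules(root: Path) -> list[Path]:
    """Single module if weights.parquet sits at the root, else scan subfolders."""
    if (root / "weights.parquet").exists():
        return [root]
    return sorted(
        child
        for child in root.iterdir()
        if child.is_dir() and (child / "weights.parquet").exists()
    )
```

For a local collection such as data/output/modules/, each compiled module folder is therefore picked up individually.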

Modules are pure data: a YAML + CSV spec compiled to parquet tables. They have no Python dependencies. The annotation engine loads modules via Polars LazyFrame, so no module code is executed. The pipeline joins weights.parquet against the normalized VCF using rsid or position matching and appends annotation columns.

Minimal "Hello World" module

module_spec.yaml:

schema_version: "1.0"
module:
  name: hello_world
  version: 1
  title: "Hello World"
  description: "Minimal example module with one variant."
  report_title: "Hello World Annotations"
  icon: dna
  color: "#21ba45"
defaults:
  curator: human
  method: literature-review
  priority: low
genome_build: GRCh38

variants.csv:

rsid,genotype,weight,state,conclusion,gene,phenotype,category
rs1801133,C/C,0,neutral,"Typical MTHFR activity.",MTHFR,Metabolism,Example
rs1801133,C/T,-0.6,risk,"Reduced MTHFR activity.",MTHFR,Metabolism,Example
rs1801133,T/T,-1.1,risk,"Severely reduced MTHFR activity.",MTHFR,Metabolism,Example

Eval test inputs

Two reference panels serve as agent evaluation cases:

data/module_specs/evals/cyp_panel/ — Pharmacogenomics CYP panel
data/module_specs/evals/mthfr_nad/ — Methylation & NAD+ metabolism panel

Evaluation criteria: schema validity (must pass), variant coverage, genotype completeness, weight directionality (risk=negative, protective=positive), state correctness, conclusion quality, study reference validity, gene/phenotype accuracy.

Implementation status

The core phases are complete: DSL schema, reverse-engineering existing modules, compiler, round-trip testing, custom module registry (API + Web UI + CLI), Agno agent (solo + team), BioContext KB MCP integration. Remaining work: eval iteration on test inputs (Phase 9) and per-module report enhancement (Phase 10, deferred).
