diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 106 additions & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 126 additions & 4 deletions
@@ -1,19 +1,25 @@
 # Prepare Annotations Agent Guidelines
 
-This repository is dedicated to the preparation of genomic annotation data (Ensembl, ClinVar, dbSNP, gnomAD, etc.).
+This repository is dedicated to preparing genomic annotation data (Ensembl, ClinVar, dbSNP, gnomAD, etc.) and converting OakVar modules from the [dna-seq GitHub organization](https://github.com/orgs/dna-seq/repositories).
 
 ## Repository Layout (uv package)
 
 - `src/prepare_annotations/`: Core logic and CLI.
   - `preparation/`: Source-specific preparation pipelines (Prefect-based).
     - `pipelines.py`: Main flow and pipeline definitions.
+    - `oakvar/`: OakVar module management and conversion.
+      - `modules.py`: CLI for downloading and managing OakVar modules.
+      - `convert_longevitymap.py`: LongevityMap conversion to unified schema.
+      - `convert_module.py`: Generic module conversion utilities.
   - `vortex/`: Vortex data conversion utilities.
   - `cli.py`: Main Typer CLI entrypoint.
   - `io.py`: VCF/Parquet I/O utilities.
   - `runtime.py`: Execution environment and profiling.
   - `models.py`: Pydantic models for results.
 - `dataset_cards/`: Markdown templates for Hugging Face dataset cards.
 - `tests/`: Unit and integration tests.
+  - `conftest.py`: Shared fixtures including OakVar module download helpers.
+  - `test_longevitymap_module.py`: Comprehensive validation of longevitymap conversion.
 
 ## Coding Standards
 
@@ -27,10 +33,109 @@ This repository is dedicated to the preparation of genomic annotation data (Ense
 
 ## Commands
 
+### Main Genomic Data Pipelines
+
 - `uv run prepare-annotations ensembl`: Download and prepare Ensembl variations.
 - `uv run prepare-annotations clinvar`: Download and prepare ClinVar data.
+- `uv run prepare-annotations dbsnp`: Download and prepare dbSNP data.
+- `uv run prepare-annotations gnomad`: Download and prepare gnomAD data.
+
+### OakVar Module Management
+
+- `uv run modules data --repo dna-seq/just_longevitymap`: Download module data files.
+- `uv run modules clone --repo dna-seq/just_longevitymap`: Clone full module repository.
+- `uv run modules convert-longevitymap`: Convert LongevityMap to the unified annotation schema.
+
+### Unified Annotation Schema
+
+Module conversion produces three standardized Parquet files:
+
+1. **annotations.parquet**: Variant-level facts
+   - Schema: `rsid, module, gene, phenotype, category`
+   - Links variants to genes and phenotype categories
+
+2. **studies.parquet**: Per-study evidence
+   - Schema: `rsid, module, pmid, population, p_value, conclusion, study_design`
+   - Scientific evidence from publications
+
+3. **weights.parquet**: Curator-defined scoring
+   - Schema: `rsid, genotype, module, weight, state, priority, conclusion, curator, method`
+   - Curated weight assignments for variant impact
+   - State: `protective`, `risk`, or `neutral`
+   - Genotype: Normalized (e.g., `CT`, `TT`, `AA`)
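The three schemas above can be written down as plain column lists and used to check a converted row. This is a minimal sketch, not the repository's code: the `SCHEMAS` mapping, the `missing_columns` helper, and the sample row values are hypothetical and exist only for illustration. Only the column names are taken from the schema listed above.

```python
# Column names copied from the unified schema described above.
# SCHEMAS and missing_columns are illustrative names, not part of the package.
SCHEMAS = {
    "annotations": ("rsid", "module", "gene", "phenotype", "category"),
    "studies": (
        "rsid", "module", "pmid", "population",
        "p_value", "conclusion", "study_design",
    ),
    "weights": (
        "rsid", "genotype", "module", "weight", "state",
        "priority", "conclusion", "curator", "method",
    ),
}


def missing_columns(table: str, row: dict) -> list[str]:
    """Return the schema columns that are absent from a row, in schema order."""
    return [col for col in SCHEMAS[table] if col not in row]


# Illustrative row with made-up values; two columns are deliberately left out.
row = {"rsid": "rs7412", "module": "longevitymap", "gene": "APOE"}
print(missing_columns("annotations", row))  # ['phenotype', 'category']
```

Keeping the column lists in one place would let a test compare them against the columns actually found in each output file.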
+
+### Available Modules
+
+Modules from https://github.com/orgs/dna-seq/repositories:
+- `just_longevitymap`: Longevity-associated variants
+- `just_pathogenic`: Pathogenic variant annotations
+- `just_cancer`: Cancer-associated genes
+- `just_coronary`: Coronary disease variants
+- `just_vo2max`: VO2max-related variants
+- `just_lipidmetabolism`: Lipid metabolism variants
+- `just_prs`: Polygenic risk score data
+- `just_drugs`: Pharmacogenomic data
+- `just_superhuman`: Elite performance genetics
 
 ## Deployment
 
 Datasets are typically uploaded to the `just-dna-seq` organization on Hugging Face Hub.
 
+## Testing
+
+### Test Philosophy
+
+- **Integration tests**: Use real data, no mocking unless necessary
+- **Auto-download**: Tests automatically download required data from GitHub
+- **Validation**: Comprehensive checks ensuring data integrity during conversion
+
+### Running Tests
+
+```bash
+# Run all tests (excluding large downloads)
+uv run pytest
+
+# Run specific module tests
+uv run pytest tests/test_longevitymap_module.py -v
+
+# Run with verbose output
+uv run pytest -vvv
+```
+
+### Test Fixtures
+
+`conftest.py` provides shared fixtures for OakVar module testing:
+
+- `ensure_oakvar_module_data()`: Downloads module data if not present
+- `download_oakvar_module_data()`: Downloads module data directly from GitHub repositories
+
+Test modules use these fixtures automatically to ensure the data is available.
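The "download only if missing" behaviour described above can be sketched as follows. This is an assumption-laden illustration, not the actual `conftest.py` code: `ensure_data_file` and its parameters are hypothetical names, and the downloader is passed in as a callable so that the example runs without network access.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_data_file(
    data_dir: Path, filename: str, download: Callable[[Path], None]
) -> Path:
    """Return the path to a data file, calling `download` only if it is missing."""
    target = data_dir / filename
    if not target.exists():
        data_dir.mkdir(parents=True, exist_ok=True)
        download(target)  # a real helper would fetch the file from GitHub here
    return target


# Stand-in downloader that records each call and writes an empty file.
calls: list[str] = []


def fake_download(target: Path) -> None:
    calls.append(target.name)
    target.write_bytes(b"")


with tempfile.TemporaryDirectory() as tmp:
    module_dir = Path(tmp) / "just_longevitymap"
    first = ensure_data_file(module_dir, "longevitymap.sqlite", fake_download)
    second = ensure_data_file(module_dir, "longevitymap.sqlite", fake_download)
    print(first == second, len(calls))  # True 1
```

The stand-in downloader is used here only to keep the example self-contained; the tests themselves work against real data.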
+
+### Example: LongevityMap Validation
+
+`test_longevitymap_module.py` includes 47 tests that validate:
+
+1. **Weights Table** (1043 rows, 528 variants)
+   - Row counts match between SQLite and Parquet
+   - Weight values preserved (sum, min, max, mean)
+   - Per-rsid weight sums match
+   - Negative (risk) weights correctly identified
+
+2. **APOE Variants** (Critical longevity markers)
+   - rs7412 (APOE e2): protective weights
+   - rs429358 (APOE e4): risk weights
+
+3. **Schema Transformations**
+   - Genotype format (het → CT, hom → TT)
+   - State values (protective/risk)
+   - Module column correctness
+
+4. **Studies & Annotations**
+   - All PMIDs preserved (270 unique)
+   - Categories preserved (12 categories)
+   - Populations preserved (81 populations)
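The schema transformations checked above can be illustrated with two small helpers. Both are hypothetical sketches resting on stated assumptions: alphabetical allele ordering is inferred from the `CT`, `TT`, `AA` examples, and the sign convention (negative weight means risk, positive means protective, zero means neutral) is inferred from the "Negative (risk) weights" check. The actual conversion code may differ.

```python
def normalize_genotype(allele_a: str, allele_b: str) -> str:
    """Order two alleles alphabetically so that 'TC' and 'CT' both become 'CT'.

    Assumption: normalization means upper-casing and sorting the alleles.
    """
    return "".join(sorted((allele_a.upper(), allele_b.upper())))


def state_from_weight(weight: float) -> str:
    """Map a curated weight to a state label.

    Assumption: negative weights are risk, positive are protective, zero is neutral.
    """
    if weight < 0:
        return "risk"
    if weight > 0:
        return "protective"
    return "neutral"


print(normalize_genotype("T", "C"))  # CT (heterozygous)
print(normalize_genotype("T", "T"))  # TT (homozygous)
print(state_from_weight(-0.5))       # risk
```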
+
+Each test run automatically:
+1. Downloads the SQLite database from `dna-seq/just_longevitymap` if it is missing
+2. Converts it to Parquet if needed
+3. Validates data integrity
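The integrity checks listed above (matching row counts and matching per-rsid weight sums) might look roughly like this. It is a sketch under assumptions: the `weights` table layout, the weight values, and the `weight_sums_by_rsid` helper are made up for illustration, and a plain list stands in for rows read back from `weights.parquet`.

```python
import sqlite3
from collections import defaultdict


def weight_sums_by_rsid(rows) -> dict[str, float]:
    """Sum the weights of (rsid, weight) pairs per rsid."""
    sums: dict[str, float] = defaultdict(float)
    for rsid, weight in rows:
        sums[rsid] += weight
    return dict(sums)


# Hypothetical source table; the real SQLite layout may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weights (rsid TEXT, weight REAL)")
source_rows = [("rs7412", 0.5), ("rs7412", 1.0), ("rs429358", -0.5)]
conn.executemany("INSERT INTO weights VALUES (?, ?)", source_rows)

sqlite_rows = conn.execute("SELECT rsid, weight FROM weights").fetchall()
converted_rows = list(source_rows)  # stand-in for rows loaded from weights.parquet
conn.close()

# Row counts and per-rsid sums must survive the conversion unchanged.
assert len(sqlite_rows) == len(converted_rows)
assert weight_sums_by_rsid(sqlite_rows) == weight_sums_by_rsid(converted_rows)
print(weight_sums_by_rsid(sqlite_rows))
```

With real floating-point columns, an exact equality check may be too strict; comparing with a tolerance such as `math.isclose` is a common alternative.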
@@ -10,6 +10,7 @@ A dedicated toolkit for downloading, processing, and preparing genomic annotatio
   - **ClinVar**: Clinical variant data.
   - **dbSNP**: Single Nucleotide Polymorphism database.
   - **gnomAD**: Genome Aggregation Database.
+- **OakVar Module Management**: Download and convert data from [dna-seq](https://github.com/orgs/dna-seq/repositories) OakVar modules.
 - **VCF to Parquet**: Efficient conversion of large VCF files to columnar format.
 - **Variant Splitting**: Splitting variants by type (SNV, Indel, etc.) for optimized annotation.
 - **Hugging Face Hub Integration**: Direct upload of processed datasets with automatic dataset card generation.
@@ -26,9 +27,9 @@ uv sync
 
 ## Usage
 
-### Command Line Interface
+### Main Genomic Data Pipeline
 
-The main entry point is the `prepare-annotations` command.
+The `prepare-annotations` command handles large-scale genomic data downloads and processing.
 
 ```bash
 # Show version
@@ -39,23 +40,144 @@ uv run prepare-annotations ensembl --split --upload
 
 # Download and process ClinVar data
 uv run prepare-annotations clinvar --split --upload
+
+# Download and process dbSNP data
+uv run prepare-annotations dbsnp --build GRCh38 --split
+
+# Download and process gnomAD data
+uv run prepare-annotations gnomad --version v4 --split
 ```
 
-### Options
+#### Main Pipeline Options
 
 - `--dest-dir`: Destination directory for downloads.
 - `--split`: Split downloaded files by variant type.
 - `--upload`: Upload results to Hugging Face Hub.
 - `--repo-id`: Custom Hugging Face repository ID.
 
+### OakVar Module Management
+
+The `modules` command manages OakVar modules from the [dna-seq GitHub organization](https://github.com/orgs/dna-seq/repositories).
+
+#### Download Module Data
+
+Download data files (SQLite databases, etc.) from module repositories:
+
+```bash
+# Download longevitymap data
+uv run modules data --repo dna-seq/just_longevitymap
+
+# Download other modules
+uv run modules data --repo dna-seq/just_pathogenic
+uv run modules data --repo dna-seq/just_cancer
+uv run modules data --repo dna-seq/just_coronary
+uv run modules data --repo dna-seq/just_vo2max
+uv run modules data --repo dna-seq/just_lipidmetabolism
+
+# Download with specific extensions
+uv run modules data --ext .parquet --ext .csv
+
+# Download to custom directory
+uv run modules data --output-dir /path/to/output
+```
+
+#### Clone Full Module Repository
+
+Clone entire module repositories:
+
+```bash
+# Clone longevitymap module
+uv run modules clone --repo dna-seq/just_longevitymap
+
+# Clone to specific directory
+uv run modules clone --repo dna-seq/just_pathogenic --output-dir ./modules/
+```
+
+#### Convert Module Data
+
+Convert OakVar module data to the unified annotation schema:
+
+```bash
+# Convert LongevityMap to unified schema (3 parquet files)
+uv run modules convert-longevitymap
+
+# With custom paths
+uv run modules convert-longevitymap \
+  --db-path data/modules/just_longevitymap/longevitymap.sqlite \
+  --output-dir data/output/modules/longevitymap \
+  --curator "Olga Borysova" \
+  --method "literature_review"
+```
+
+The conversion produces three Parquet files:
+- **annotations.parquet**: Variant-level facts (rsid, module, gene, phenotype, category)
+- **studies.parquet**: Per-study evidence (rsid, module, pmid, population, p_value, conclusion, study_design)
+- **weights.parquet**: Curator-defined scoring (rsid, genotype, module, weight, state, priority, conclusion, curator, method)
+
+### Available Modules
+
+The following modules are available from the [dna-seq organization](https://github.com/orgs/dna-seq/repositories):
+
+- **just_longevitymap**: Longevity-associated variants
+- **just_pathogenic**: Pathogenic variant annotations
+- **just_cancer**: Cancer-associated genes
+- **just_coronary**: Coronary disease variants
+- **just_vo2max**: VO2max-related variants
+- **just_lipidmetabolism**: Lipid metabolism variants
+- **just_prs**: Polygenic risk score data
+- **just_drugs**: Pharmacogenomic data
+- **just_superhuman**: Elite performance genetics
+
 ## Development
 
 See [AGENTS.md](AGENTS.md) for development guidelines and repository layout.
 
 ### Running Tests
 
+The project includes comprehensive test suites with automatic data download:
+
 ```bash
-uv run python -m pytest
+# Run all tests (excluding large downloads)
+uv run pytest
+
+# Run specific test file
+uv run pytest tests/test_longevitymap_module.py -v
+
+# Run with all markers (including large downloads)
+uv run pytest -m ""
+```
+
+#### Test Features
+
+- **Auto-download**: Tests automatically download required data from GitHub if not present
+- **Integration tests**: Real data validation (no mocking unless necessary)
+- **Module validation**: Comprehensive validation of converted module data
+
+Example test modules:
+- `test_longevitymap_module.py`: 47 tests validating longevitymap conversion accuracy
+  - Validates weights table preservation (1043 rows, 528 variants)
+  - Verifies APOE variant weights (rs7412, rs429358)
+  - Tests schema transformations
+  - Validates studies and annotations tables
+
+The tests will automatically:
+1. Download SQLite data from `dna-seq/just_longevitymap` if missing
+2. Convert to unified parquet schema if needed
+3. Run comprehensive validation checks
+
+### Data Directories
+
+```
+data/
+├── modules/                    # Downloaded module data
+│   └── just_longevitymap/
+│       └── longevitymap.sqlite
+└── output/                     # Converted/processed data
+    └── modules/
+        └── longevitymap/
+            ├── annotations.parquet
+            ├── studies.parquet
+            └── weights.parquet
 ```
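Given the layout above, the expected output files for a converted module can be derived from the data root and the module name. This is a minimal sketch: `PARQUET_TABLES` and `expected_outputs` are hypothetical names, and the directory structure is assumed to match the tree shown above.

```python
from pathlib import Path

# Table names taken from the three output files in the tree above.
PARQUET_TABLES = ("annotations", "studies", "weights")


def expected_outputs(data_root: Path, module: str) -> list[Path]:
    """List the Parquet files a converted module should produce."""
    out_dir = data_root / "output" / "modules" / module
    return [out_dir / f"{name}.parquet" for name in PARQUET_TABLES]


for path in expected_outputs(Path("data"), "longevitymap"):
    print(path.as_posix())
```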
 
 ## License