
Commit bdf2a08

conversion of our modules
1 parent 9f12397 commit bdf2a08

19 files changed

Lines changed: 5169 additions & 7 deletions

AGENTS.md

Lines changed: 106 additions & 1 deletion
@@ -1,19 +1,25 @@
 # Prepare Annotations Agent Guidelines

-This repository is dedicated to the preparation of genomic annotation data (Ensembl, ClinVar, dbSNP, gnomAD, etc.).
+This repository is dedicated to the preparation of genomic annotation data (Ensembl, ClinVar, dbSNP, gnomAD, etc.) and conversion of OakVar modules from the [dna-seq GitHub organization](https://github.com/orgs/dna-seq/repositories).

 ## Repository Layout (uv package)

 - `src/prepare_annotations/`: Core logic and CLI.
 - `preparation/`: Source-specific preparation pipelines (Prefect-based).
 - `pipelines.py`: Main flow and pipeline definitions.
+- `oakvar/`: OakVar module management and conversion.
+- `modules.py`: CLI for downloading and managing OakVar modules.
+- `convert_longevitymap.py`: LongevityMap conversion to unified schema.
+- `convert_module.py`: Generic module conversion utilities.
 - `vortex/`: Vortex data conversion utilities.
 - `cli.py`: Main Typer CLI entrypoint.
 - `io.py`: VCF/Parquet I/O utilities.
 - `runtime.py`: Execution environment and profiling.
 - `models.py`: Pydantic models for results.
 - `dataset_cards/`: Markdown templates for Hugging Face dataset cards.
 - `tests/`: Unit and integration tests.
+- `conftest.py`: Shared fixtures including OakVar module download helpers.
+- `test_longevitymap_module.py`: Comprehensive validation of longevitymap conversion.

 ## Coding Standards

@@ -27,10 +33,109 @@ This repository is dedicated to the preparation of genomic annotation data (Ense

 ## Commands

+### Main Genomic Data Pipelines
+
 - `uv run prepare-annotations ensembl`: Download and prepare Ensembl variations.
 - `uv run prepare-annotations clinvar`: Download and prepare ClinVar data.
+- `uv run prepare-annotations dbsnp`: Download and prepare dbSNP data.
+- `uv run prepare-annotations gnomad`: Download and prepare gnomAD data.
+
+### OakVar Module Management
+
+- `uv run modules data --repo dna-seq/just_longevitymap`: Download module data files.
+- `uv run modules clone --repo dna-seq/just_longevitymap`: Clone full module repository.
+- `uv run modules convert-longevitymap`: Convert LongevityMap to unified annotation schema.
+
+### Unified Annotation Schema
+
+The module conversion produces three standardized parquet files:
+
+1. **annotations.parquet**: Variant-level facts
+- Schema: `rsid, module, gene, phenotype, category`
+- Links variants to genes and phenotype categories
+
+2. **studies.parquet**: Per-study evidence
+- Schema: `rsid, module, pmid, population, p_value, conclusion, study_design`
+- Scientific evidence from publications
+
+3. **weights.parquet**: Curator-defined scoring
+- Schema: `rsid, genotype, module, weight, state, priority, conclusion, curator, method`
+- Curated weight assignments for variant impact
+- State: `protective`, `risk`, or `neutral`
+- Genotype: Normalized (e.g., `CT`, `TT`, `AA`)
+
+### Available Modules
+
+Modules from https://github.com/orgs/dna-seq/repositories:
+- `just_longevitymap`: Longevity-associated variants
+- `just_pathogenic`: Pathogenic variant annotations
+- `just_cancer`: Cancer-associated genes
+- `just_coronary`: Coronary disease variants
+- `just_vo2max`: VO2max-related variants
+- `just_lipidmetabolism`: Lipid metabolism variants
+- `just_prs`: Polygenic risk score data
+- `just_drugs`: Pharmacogenomic data
+- `just_superhuman`: Elite performance genetics

 ## Deployment

 Datasets are typically uploaded to the `just-dna-seq` organization on Hugging Face Hub.

+## Testing
+
+### Test Philosophy
+
+- **Integration tests**: Use real data, no mocking unless necessary
+- **Auto-download**: Tests automatically download required data from GitHub
+- **Validation**: Comprehensive checks ensuring data integrity during conversion
+
+### Running Tests
+
+```bash
+# Run all tests (excluding large downloads)
+uv run pytest
+
+# Run specific module tests
+uv run pytest tests/test_longevitymap_module.py -v
+
+# Run with verbose output
+uv run pytest -vvv
+```
+
+### Test Fixtures
+
+The `conftest.py` provides shared fixtures for OakVar module testing:
+
+- `ensure_oakvar_module_data()`: Downloads module data if not present
+- `download_oakvar_module_data()`: Directly downloads from GitHub repositories
+
+These fixtures are automatically used by test modules to ensure data availability.
+
+### Example: LongevityMap Validation
+
+The `test_longevitymap_module.py` includes 47 tests validating:
+
+1. **Weights Table** (1043 rows, 528 variants)
+- Row counts match between SQLite and Parquet
+- Weight values preserved (sum, min, max, mean)
+- Per-rsid weight sums match
+- Negative (risk) weights correctly identified
+
+2. **APOE Variants** (Critical longevity markers)
+- rs7412 (APOE e2): protective weights
+- rs429358 (APOE e4): risk weights
+
+3. **Schema Transformations**
+- Genotype format (het → CT, hom → TT)
+- State values (protective/risk)
+- Module column correctness
+
+4. **Studies & Annotations**
+- All PMIDs preserved (270 unique)
+- Categories preserved (12 categories)
+- Populations preserved (81 populations)
+
+Tests automatically:
+1. Download SQLite from `dna-seq/just_longevitymap` if missing
+2. Convert to parquet if needed
+3. Validate data integrity
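
The validation steps this hunk lists for `test_longevitymap_module.py` can be sketched roughly as follows. This is a minimal illustration, not the repository's test code. The SQLite table and column names (`weights`, `zygosity`, `ref`, `alt`), the helper names, the toy weight values, and the `module` value are all assumptions. So are the rules that a heterozygous genotype is the sorted ref/alt pair and that positive, negative, and zero weights map to `protective`, `risk`, and `neutral`. A list of dicts stands in for the rows of `weights.parquet` so that the sketch needs only the standard library.

```python
import math
import sqlite3
from collections import defaultdict


def normalize_genotype(zygosity: str, ref: str, alt: str) -> str:
    """Assumed rule: het -> sorted ref+alt (e.g. CT), hom -> alt twice (e.g. TT)."""
    if zygosity == "het":
        return "".join(sorted(ref + alt))
    if zygosity == "hom":
        return alt + alt
    raise ValueError(f"unknown zygosity: {zygosity!r}")


def state_from_weight(weight: float) -> str:
    """Assumed rule: negative weights are risk, positive protective, zero neutral."""
    if weight < 0:
        return "risk"
    if weight > 0:
        return "protective"
    return "neutral"


def summarize(weights):
    """Return (count, sum, min, max, mean) for a non-empty list of weights."""
    total = math.fsum(weights)
    return len(weights), total, min(weights), max(weights), total / len(weights)


def per_rsid_sums(pairs):
    """Sum weights per rsid from (rsid, weight) pairs."""
    sums = defaultdict(float)
    for rsid, weight in pairs:
        sums[rsid] += weight
    return dict(sums)


def check_conversion(source_rows, converted):
    """Raise AssertionError if counts, aggregates, or per-rsid sums differ."""
    assert len(source_rows) == len(converted), "row counts differ"
    src = summarize([row[4] for row in source_rows])
    out = summarize([row["weight"] for row in converted])
    assert all(math.isclose(a, b) for a, b in zip(src, out)), "aggregates differ"
    src_sums = per_rsid_sums((row[0], row[4]) for row in source_rows)
    out_sums = per_rsid_sums((row["rsid"], row["weight"]) for row in converted)
    assert src_sums.keys() == out_sums.keys(), "rsid sets differ"
    assert all(math.isclose(src_sums[k], out_sums[k]) for k in src_sums)


# Toy source table with a hypothetical layout and illustrative weights only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weights (rsid TEXT, zygosity TEXT, ref TEXT, alt TEXT, weight REAL)"
)
conn.executemany(
    "INSERT INTO weights VALUES (?, ?, ?, ?, ?)",
    [
        ("rs7412", "het", "C", "T", 0.5),
        ("rs7412", "hom", "C", "T", 1.0),
        ("rs429358", "het", "T", "C", -0.5),
        ("rs429358", "hom", "T", "C", -1.0),
    ],
)
source_rows = conn.execute(
    "SELECT rsid, zygosity, ref, alt, weight FROM weights ORDER BY rowid"
).fetchall()

# Stand-in for the converted weights.parquet rows.
converted = [
    {
        "rsid": rsid,
        "genotype": normalize_genotype(zygosity, ref, alt),
        "module": "just_longevitymap",  # assumed module label
        "weight": weight,
        "state": state_from_weight(weight),
    }
    for rsid, zygosity, ref, alt, weight in source_rows
]

check_conversion(source_rows, converted)
```

Comparing aggregates with `math.isclose` rather than `==` leaves room for float round-off if the real conversion changes numeric types on the way to Parquet.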

README.md

Lines changed: 126 additions & 4 deletions
@@ -10,6 +10,7 @@ A dedicated toolkit for downloading, processing, and preparing genomic annotatio
 - **ClinVar**: Clinical variant data.
 - **dbSNP**: Single Nucleotide Polymorphism database.
 - **gnomAD**: Genome Aggregation Database.
+- **OakVar Module Management**: Download and convert data from [dna-seq](https://github.com/orgs/dna-seq/repositories) OakVar modules.
 - **VCF to Parquet**: Efficient conversion of large VCF files to columnar format.
 - **Variant Splitting**: Splitting variants by type (SNV, Indel, etc.) for optimized annotation.
 - **Hugging Face Hub Integration**: Direct upload of processed datasets with automatic dataset card generation.
@@ -26,9 +27,9 @@ uv sync

 ## Usage

-### Command Line Interface
+### Main Genomic Data Pipeline

-The main entry point is the `prepare-annotations` command.
+The `prepare-annotations` command handles large-scale genomic data downloads and processing.

 ```bash
 # Show version
@@ -39,23 +40,144 @@ uv run prepare-annotations ensembl --split --upload

 # Download and process ClinVar data
 uv run prepare-annotations clinvar --split --upload
+
+# Download and process dbSNP data
+uv run prepare-annotations dbsnp --build GRCh38 --split
+
+# Download and process gnomAD data
+uv run prepare-annotations gnomad --version v4 --split
 ```

-### Options
+#### Main Pipeline Options

 - `--dest-dir`: Destination directory for downloads.
 - `--split`: Split downloaded files by variant type.
 - `--upload`: Upload results to Hugging Face Hub.
 - `--repo-id`: Custom Hugging Face repository ID.

+### OakVar Module Management
+
+The `modules` command manages OakVar modules from the [dna-seq GitHub organization](https://github.com/orgs/dna-seq/repositories).
+
+#### Download Module Data
+
+Download data files (SQLite databases, etc.) from module repositories:
+
+```bash
+# Download longevitymap data
+uv run modules data --repo dna-seq/just_longevitymap
+
+# Download other modules
+uv run modules data --repo dna-seq/just_pathogenic
+uv run modules data --repo dna-seq/just_cancer
+uv run modules data --repo dna-seq/just_coronary
+uv run modules data --repo dna-seq/just_vo2max
+uv run modules data --repo dna-seq/just_lipidmetabolism
+
+# Download with specific extensions
+uv run modules data --ext .parquet --ext .csv
+
+# Download to custom directory
+uv run modules data --output-dir /path/to/output
+```
+
+#### Clone Full Module Repository
+
+Clone entire module repositories:
+
+```bash
+# Clone longevitymap module
+uv run modules clone --repo dna-seq/just_longevitymap
+
+# Clone to specific directory
+uv run modules clone --repo dna-seq/just_pathogenic --output-dir ./modules/
+```
+
+#### Convert Module Data
+
+Convert OakVar module data to unified annotation schema:
+
+```bash
+# Convert LongevityMap to unified schema (3 parquet files)
+uv run modules convert-longevitymap
+
+# With custom paths
+uv run modules convert-longevitymap \
+--db-path data/modules/just_longevitymap/longevitymap.sqlite \
+--output-dir data/output/modules/longevitymap \
+--curator "Olga Borysova" \
+--method "literature_review"
+```
+
+The conversion produces three parquet files:
+- **annotations.parquet**: Variant-level facts (rsid, module, gene, phenotype, category)
+- **studies.parquet**: Per-study evidence (rsid, module, pmid, population, conclusion, study_design)
+- **weights.parquet**: Curator-defined scoring (rsid, genotype, module, weight, state, priority, curator, method)
+
+### Available Modules
+
+The following modules are available from the [dna-seq organization](https://github.com/orgs/dna-seq/repositories):
+
+- **just_longevitymap**: Longevity-associated variants
+- **just_pathogenic**: Pathogenic variant annotations
+- **just_cancer**: Cancer-associated genes
+- **just_coronary**: Coronary disease variants
+- **just_vo2max**: VO2max-related variants
+- **just_lipidmetabolism**: Lipid metabolism variants
+- **just_prs**: Polygenic risk score data
+- **just_drugs**: Pharmacogenomic data
+- **just_superhuman**: Elite performance genetics
+
 ## Development

 See [AGENTS.md](AGENTS.md) for development guidelines and repository layout.

 ### Running Tests

+The project includes comprehensive test suites with automatic data download:
+
 ```bash
-uv run python -m pytest
+# Run all tests (excluding large downloads)
+uv run pytest
+
+# Run specific test file
+uv run pytest tests/test_longevitymap_module.py -v
+
+# Run with all markers (including large downloads)
+uv run pytest -m ""
+```
+
+#### Test Features
+
+- **Auto-download**: Tests automatically download required data from GitHub if not present
+- **Integration tests**: Real data validation (no mocking unless necessary)
+- **Module validation**: Comprehensive validation of converted module data
+
+Example test modules:
+- `test_longevitymap_module.py`: 47 tests validating longevitymap conversion accuracy
+- Validates weights table preservation (1043 rows, 528 variants)
+- Verifies APOE variant weights (rs7412, rs429358)
+- Tests schema transformations
+- Validates studies and annotations tables
+
+The tests will automatically:
+1. Download SQLite data from `dna-seq/just_longevitymap` if missing
+2. Convert to unified parquet schema if needed
+3. Run comprehensive validation checks
+
+### Data Directories
+
+```
+data/
+├── modules/              # Downloaded module data
+│   └── just_longevitymap/
+│       └── longevitymap.sqlite
+└── output/               # Converted/processed data
+    └── modules/
+        └── longevitymap/
+            ├── annotations.parquet
+            ├── studies.parquet
+            └── weights.parquet
 ```

 ## License
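
The README changes above describe three output files and a fixed directory layout; a small sketch can make that contract concrete. The column lists below follow the fuller schema given in the AGENTS.md part of this commit, which lists `p_value` for studies and `conclusion` for weights, while the README's shorter lists omit them. The constant and function names are hypothetical, and the default output directory is the one shown in the README's example layout.

```python
from pathlib import Path

# Column lists copied from the AGENTS.md "Unified Annotation Schema" section.
UNIFIED_SCHEMA = {
    "annotations.parquet": ("rsid", "module", "gene", "phenotype", "category"),
    "studies.parquet": (
        "rsid", "module", "pmid", "population",
        "p_value", "conclusion", "study_design",
    ),
    "weights.parquet": (
        "rsid", "genotype", "module", "weight", "state",
        "priority", "conclusion", "curator", "method",
    ),
}

# Output location taken from the README's example data layout.
DEFAULT_OUTPUT_DIR = Path("data/output/modules/longevitymap")


def expected_outputs(output_dir: Path = DEFAULT_OUTPUT_DIR) -> list[Path]:
    """Paths of the three parquet files a conversion is expected to write."""
    return [output_dir / name for name in UNIFIED_SCHEMA]


def missing_outputs(output_dir: Path = DEFAULT_OUTPUT_DIR) -> list[Path]:
    """Expected output files that do not exist on disk."""
    return [path for path in expected_outputs(output_dir) if not path.is_file()]


def missing_columns(file_name: str, columns) -> list[str]:
    """Schema columns absent from an observed column list, in schema order."""
    present = set(columns)
    return [col for col in UNIFIED_SCHEMA[file_name] if col not in present]


if __name__ == "__main__":
    for path in missing_outputs():
        print(f"missing: {path}")
```

Passing the README's shorter studies column list to `missing_columns` reports `p_value`, which is one way to surface the mismatch between the two documents.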
