@@ -10,6 +10,7 @@ A dedicated toolkit for downloading, processing, and preparing genomic annotatio
 - **ClinVar**: Clinical variant data.
 - **dbSNP**: Single Nucleotide Polymorphism database.
 - **gnomAD**: Genome Aggregation Database.
+- **OakVar Module Management**: Download and convert data from [dna-seq](https://github.com/orgs/dna-seq/repositories) OakVar modules.
 - **VCF to Parquet**: Efficient conversion of large VCF files to columnar format.
 - **Variant Splitting**: Splitting variants by type (SNV, Indel, etc.) for optimized annotation.
 - **Hugging Face Hub Integration**: Direct upload of processed datasets with automatic dataset card generation.
@@ -26,9 +27,9 @@ uv sync

 ## Usage

-### Command Line Interface
+### Main Genomic Data Pipeline

-The main entry point is the `prepare-annotations` command.
+The `prepare-annotations` command handles large-scale genomic data downloads and processing.

 ```bash
 # Show version
@@ -39,23 +40,144 @@ uv run prepare-annotations ensembl --split --upload

 # Download and process ClinVar data
 uv run prepare-annotations clinvar --split --upload
+
+# Download and process dbSNP data
+uv run prepare-annotations dbsnp --build GRCh38 --split
+
+# Download and process gnomAD data
+uv run prepare-annotations gnomad --version v4 --split
 ```

-### Options
+#### Main Pipeline Options

 - `--dest-dir`: Destination directory for downloads.
 - `--split`: Split downloaded files by variant type.
 - `--upload`: Upload results to Hugging Face Hub.
 - `--repo-id`: Custom Hugging Face repository ID.

+### OakVar Module Management
+
+The `modules` command manages OakVar modules from the [dna-seq GitHub organization](https://github.com/orgs/dna-seq/repositories).
+
+#### Download Module Data
+
+Download data files (SQLite databases, etc.) from module repositories:
+
+```bash
+# Download longevitymap data
+uv run modules data --repo dna-seq/just_longevitymap
+
+# Download other modules
+uv run modules data --repo dna-seq/just_pathogenic
+uv run modules data --repo dna-seq/just_cancer
+uv run modules data --repo dna-seq/just_coronary
+uv run modules data --repo dna-seq/just_vo2max
+uv run modules data --repo dna-seq/just_lipidmetabolism
+
+# Download with specific extensions
+uv run modules data --ext .parquet --ext .csv
+
+# Download to custom directory
+uv run modules data --output-dir /path/to/output
+```
+
+#### Clone Full Module Repository
+
+Clone entire module repositories:
+
+```bash
+# Clone longevitymap module
+uv run modules clone --repo dna-seq/just_longevitymap
+
+# Clone to specific directory
+uv run modules clone --repo dna-seq/just_pathogenic --output-dir ./modules/
+```
+
+#### Convert Module Data
+
+Convert OakVar module data to unified annotation schema:
+
+```bash
+# Convert LongevityMap to unified schema (3 parquet files)
+uv run modules convert-longevitymap
+
+# With custom paths
+uv run modules convert-longevitymap \
+  --db-path data/modules/just_longevitymap/longevitymap.sqlite \
+  --output-dir data/output/modules/longevitymap \
+  --curator "Olga Borysova" \
+  --method "literature_review"
+```
+
+The conversion produces three parquet files:
+- **annotations.parquet**: Variant-level facts (rsid, module, gene, phenotype, category)
+- **studies.parquet**: Per-study evidence (rsid, module, pmid, population, conclusion, study_design)
+- **weights.parquet**: Curator-defined scoring (rsid, genotype, module, weight, state, priority, curator, method)
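
The column lists above can be written down as a small lookup table, which makes it easy to see which columns the three files share and to check a file's header against the documented schema. This is only a sketch: it assumes the parenthesised lists are exhaustive, and the names `EXPECTED_COLUMNS`, `missing_columns`, and `shared_columns` are hypothetical helpers, not part of the toolkit.

```python
# Column names copied from the three bullets above; treating these lists
# as complete is an assumption.
EXPECTED_COLUMNS = {
    "annotations.parquet": ("rsid", "module", "gene", "phenotype", "category"),
    "studies.parquet": (
        "rsid", "module", "pmid", "population", "conclusion", "study_design",
    ),
    "weights.parquet": (
        "rsid", "genotype", "module", "weight", "state", "priority",
        "curator", "method",
    ),
}


def missing_columns(file_name: str, actual_columns) -> list[str]:
    """Return the documented columns that are absent from actual_columns."""
    actual = set(actual_columns)
    return [col for col in EXPECTED_COLUMNS[file_name] if col not in actual]


def shared_columns() -> list[str]:
    """Return the columns present in all three files (candidate join keys)."""
    first, *rest = EXPECTED_COLUMNS.values()
    return [col for col in first if all(col in cols for cols in rest)]
```

Under that assumption, `rsid` and `module` are the only columns common to all three files, so they are the natural keys for joining annotations, studies, and weights.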
+
+### Available Modules
+
+The following modules are available from the [dna-seq organization](https://github.com/orgs/dna-seq/repositories):
+
+- **just_longevitymap**: Longevity-associated variants
+- **just_pathogenic**: Pathogenic variant annotations
+- **just_cancer**: Cancer-associated genes
+- **just_coronary**: Coronary disease variants
+- **just_vo2max**: VO2max-related variants
+- **just_lipidmetabolism**: Lipid metabolism variants
+- **just_prs**: Polygenic risk score data
+- **just_drugs**: Pharmacogenomic data
+- **just_superhuman**: Elite performance genetics
+
 ## Development

 See [AGENTS.md](AGENTS.md) for development guidelines and repository layout.

 ### Running Tests

+The project includes comprehensive test suites with automatic data download:
+
 ```bash
-uv run python -m pytest
+# Run all tests (excluding large downloads)
+uv run pytest
+
+# Run specific test file
+uv run pytest tests/test_longevitymap_module.py -v
+
+# Run with all markers (including large downloads)
+uv run pytest -m ""
+```
+
+#### Test Features
+
+- **Auto-download**: Tests automatically download required data from GitHub if not present
+- **Integration tests**: Real data validation (no mocking unless necessary)
+- **Module validation**: Comprehensive validation of converted module data
+
+Example test modules:
+- `test_longevitymap_module.py`: 47 tests validating longevitymap conversion accuracy
+  - Validates weights table preservation (1043 rows, 528 variants)
+  - Verifies APOE variant weights (rs7412, rs429358)
+  - Tests schema transformations
+  - Validates studies and annotations tables
+
+The tests will automatically:
+1. Download SQLite data from `dna-seq/just_longevitymap` if missing
+2. Convert to unified parquet schema if needed
+3. Run comprehensive validation checks
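
The "if missing" and "if needed" steps above follow a common produce-once pattern: run the expensive step only when its output file is absent. The sketch below illustrates that pattern in plain Python. It is not the project's actual test fixture code; `ensure_file` and the `produce` callback are hypothetical names standing in for the real download and conversion steps.

```python
from pathlib import Path
from typing import Callable


def ensure_file(path: Path, produce: Callable[[Path], None]) -> bool:
    """Run produce(path) only when path is missing; return True if it ran."""
    if path.exists():
        return False
    # Create parent directories so the producer can write straight to path.
    path.parent.mkdir(parents=True, exist_ok=True)
    produce(path)
    if not path.exists():
        raise RuntimeError(f"producer did not create {path}")
    return True
```

Chaining two such calls, one for the SQLite download and one for the parquet conversion, gives the skip-if-present behaviour described in steps 1 and 2, with validation running afterwards in either case.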
+
+### Data Directories
+
+```
+data/
+├── modules/              # Downloaded module data
+│   └── just_longevitymap/
+│       └── longevitymap.sqlite
+└── output/               # Converted/processed data
+    └── modules/
+        └── longevitymap/
+            ├── annotations.parquet
+            ├── studies.parquet
+            └── weights.parquet
 ```
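
The layout above can be checked programmatically, for instance to see which files are still missing after a download or conversion step. This is a minimal sketch with `pathlib`. The relative paths are copied from the tree, while `expected_paths` and `missing_paths` are hypothetical helper names rather than toolkit functions.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
MODULE_DB = Path("modules/just_longevitymap/longevitymap.sqlite")
OUTPUT_DIR = Path("output/modules/longevitymap")
OUTPUT_FILES = ("annotations.parquet", "studies.parquet", "weights.parquet")


def expected_paths(data_root) -> list[Path]:
    """Return every file the tree above places under data_root."""
    root = Path(data_root)
    return [root / MODULE_DB] + [root / OUTPUT_DIR / name for name in OUTPUT_FILES]


def missing_paths(data_root) -> list[Path]:
    """Return the expected files that do not exist yet under data_root."""
    return [path for path in expected_paths(data_root) if not path.is_file()]
```

Calling `missing_paths("data")` from the repository root returns an empty list once both the download and the conversion have completed.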

 ## License