**README.md** (20 additions, 22 deletions)
```diff
@@ -4,15 +4,14 @@ A dedicated toolkit for downloading, processing, and preparing genomic annotations
 
 ## Features
 
-- **Prefect-based Pipelines**: robust workflows for data preparation.
+- **Dagster-based Pipelines (Primary)**: Software-Defined Assets (SDA) with full lineage tracking, parallel execution, and automated Hugging Face uploads.
 - **Support for multiple sources**:
   - **Ensembl**: Human genetic variations.
   - **ClinVar**: Clinical variant data.
   - **dbSNP**: Single Nucleotide Polymorphism database.
   - **gnomAD**: Genome Aggregation Database.
 - **OakVar Module Management**: Download and convert data from [dna-seq](https://github.com/orgs/dna-seq/repositories) OakVar modules.
-- **VCF to Parquet**: Efficient conversion of large VCF files to columnar format.
-- **Variant Splitting**: Splitting variants by type (SNV, Indel, etc.) for optimized annotation.
+- **VCF to Parquet**: Efficient conversion of large VCF files to columnar format using `polars-bio`.
 - **Hugging Face Hub Integration**: Direct upload of processed datasets with automatic dataset card generation.
 
 ## Installation
```
@@ -27,33 +26,32 @@ uv sync
27
26
28
27
## Usage
29
28
30
-
### Main Genomic Data Pipeline
29
+
### 🔷 Dagster Pipelines
31
30
32
-
The `prepare-annotations` command handles large-scale genomic data downloads and processing.
31
+
The primary way to run pipelines is using Dagster. This provides parallel execution, resumable downloads, and integrated Hugging Face uploads.
33
32
34
-
```bash
35
-
# Show version
36
-
uv run prepare-annotations version
37
-
38
-
# Download and process Ensembl variations
39
-
uv run prepare-annotations ensembl --split --upload
33
+
#### Ensembl Pipeline
40
34
41
-
# Download and process ClinVar data
42
-
uv run prepare-annotations clinvar --split --upload
35
+
```bash
36
+
# Run the full pipeline (download → convert → upload)
37
+
uv run dagster-ensembl
43
38
44
-
#Download and process dbSNP data
45
-
uv run prepare-annotations dbsnp --build GRCh38 --split
39
+
#Start the Dagster UI for monitoring and interactive execution
40
+
uv run dagster-ensembl ui
46
41
47
-
#Download and process gnomAD data
48
-
uv run prepare-annotations gnomad --version v4 --split
42
+
#Run for a specific species
43
+
uv run dagster-ensembl run --species mus_musculus
49
44
```
50
45
51
-
#### Main Pipeline Options
46
+
#### Other Dagster Commands
52
47
53
-
-`--dest-dir`: Destination directory for downloads.
54
-
-`--split`: Split downloaded files by variant type.
**docs/DAGSTER_ENSEMBL_PIPELINE.md** (55 additions, 10 deletions)
```diff
@@ -3,16 +3,27 @@
 
 This repo includes a **Dagster** implementation of the Ensembl preparation pipeline as a parallel alternative to the Prefect flows.
 
 The Dagster implementation lives under:
-- `src/prepare_annotations/pipelines_dagster/`
+- `src/prepare_annotations/pipelines/`
 
 It is intentionally **file/directory based**: each asset materializes a concrete on-disk artifact (a JSON manifest, a directory of VCFs, a directory of Parquet files, etc.). This makes lineage inspectable and keeps memory usage predictable.
 
 ---
 
+### Core principles
+
+- **Lineage-first assets**: each asset returns a concrete on-disk artifact (Path) to avoid passing large in-memory objects.
+- **Dynamic partitioning**: per-file assets are partitioned by filename for fine-grained lineage and UI progress.
+- **Memory safety**: prefer streaming (`LazyFrame.sink_parquet` with `engine="streaming"` by default) and avoid eager materialization during conversion.
+- **Scale-aware joins**: for joins that Polars would materialize in memory, prefer DuckDB or staged filtering.
+- **Resource visibility**: download/convert steps log duration and peak memory where available.
+- **Idempotent outputs**: assets skip work when target files are present and up-to-date.
```
```diff
 - `max_concurrent_downloads`: Maximum parallel downloads (default: `4`)
 - `verify_checksums`: Whether to verify checksums (default: `True`)
+- `force_download`: Re-download even if files already exist (default: `False`)
+- `http_max_pool`: HTTP pool size for downloader (default: `20`)
 - `retries`: Number of retry attempts per file (default: `10`)
 - `connect_timeout`: Connection timeout in seconds (default: `10.0`)
 - `sock_read_timeout`: Socket read timeout in seconds (default: `120.0`)
 
+**ParquetConversionConfig:**
+
+- `max_concurrent_conversions`: Maximum parallel conversions. If unset, uses the `PREPARE_ANNOTATIONS_PARQUET_WORKERS` env var; defaults to `2`.
+- `threads`: Thread count per conversion (auto-detected if not set).
+- `force_convert`: Re-convert even when parquet is up-to-date.
+
+**Environment overrides:**
+
+- `PREPARE_ANNOTATIONS_PARQUET_WORKERS`: Max concurrent parquet conversions (used when config not set).
```
### HuggingFace upload lineage

The upload asset (`ensembl_hf_upload`) depends on the parquet directory output (`ensembl_parquet_files`). In the Dagster UI, this makes it straightforward to answer:

- "Which local dataset was uploaded?"
- "When did we last upload, and what was uploaded vs skipped?"

Uploads are executed using the existing uploader implementation: