
Commit 05cb453

global refactoring to make it dagster-only and introduce dynamic partitions
1 parent 97333d4 commit 05cb453

45 files changed: 5745 additions & 5846 deletions


.gitignore

Lines changed: 1 addition & 0 deletions
```diff
@@ -35,3 +35,4 @@ uv.lock
 .vscode/
 .idea/
 .cursor/
+.dagster_home
```

AGENTS.md

Lines changed: 8 additions & 9 deletions
```diff
@@ -6,7 +6,7 @@ This repository is dedicated to the preparation of genomic annotation data (Ense
 
 - `src/prepare_annotations/`: Core logic and CLI.
 - `cli.py`: Main Typer CLI entrypoint.
-- `pipelines.py`: Main Prefect flow and pipeline definitions.
+- `pipelines/`: Primary Dagster-based pipelines.
 - `vcf_downloader.py`: VCF download utilities.
 - `genome_downloader.py`: Ensembl genome download utilities.
 - `huggingface_uploader.py`: Upload utilities for HuggingFace Hub.
@@ -21,8 +21,7 @@ This repository is dedicated to the preparation of genomic annotation data (Ense
 - `superhuman.py`: Elite performance genetics conversion.
 - `vo2max.py`: VO2max conversion.
 - `common.py`: Shared conversion utilities.
-- `vortex/`: Vortex data conversion utilities.
-- `pipelines_dagster/`: Dagster-based pipeline alternative.
+- `pipelines/`: Primary Dagster-based pipelines.
 - `io.py`: VCF/Parquet I/O utilities.
 - `runtime.py`: Execution environment and profiling.
 - `models.py`: Pydantic models for results.
@@ -36,19 +35,19 @@ This repository is dedicated to the preparation of genomic annotation data (Ense
 - **Type hints**: Mandatory for all Python code.
 - **Pathlib**: Always use for all file paths.
 - **Polars**: Prefer over Pandas for performance.
-- **Prefect**: Used for workflow orchestration and parallel execution.
+- **Dagster**: Primary tool for workflow orchestration and parallel execution.
 - **Eliot**: Used for structured logging and action tracking.
 - **Typer**: Mandatory for CLI tools.
 - **Pydantic 2**: Mandatory for data classes.
+- **Avoid `__all__`**: avoid `__init__.py` with `__all__`, as it obscures where things are located.
 
 ## Commands
 
-### Main Genomic Data Pipelines
+### Primary Dagster Pipelines (Recommended)
 
-- `uv run prepare-annotations ensembl`: Download and prepare Ensembl variations.
-- `uv run prepare-annotations clinvar`: Download and prepare ClinVar data.
-- `uv run prepare-annotations dbsnp`: Download and prepare dbSNP data.
-- `uv run prepare-annotations gnomad`: Download and prepare gnomAD data.
+- `uv run dagster-ensembl`: Run the full Ensembl pipeline (download, convert, upload).
+- `uv run dagster-ensembl ui`: Launch Dagster UI for Ensembl pipelines.
+- `uv run dagster-ui`: General Dagster development server entrypoint.
 
 ### OakVar Module Management
 
```
README.md

Lines changed: 20 additions & 22 deletions
````diff
@@ -4,15 +4,14 @@ A dedicated toolkit for downloading, processing, and preparing genomic annotatio
 
 ## Features
 
-- **Prefect-based Pipelines**: robust workflows for data preparation.
+- **Dagster-based Pipelines (Primary)**: Software-Defined Assets (SDA) with full lineage tracking, parallel execution, and automated Hugging Face uploads.
 - **Support for multiple sources**:
   - **Ensembl**: Human genetic variations.
   - **ClinVar**: Clinical variant data.
   - **dbSNP**: Single Nucleotide Polymorphism database.
   - **gnomAD**: Genome Aggregation Database.
 - **OakVar Module Management**: Download and convert data from [dna-seq](https://github.com/orgs/dna-seq/repositories) OakVar modules.
-- **VCF to Parquet**: Efficient conversion of large VCF files to columnar format.
-- **Variant Splitting**: Splitting variants by type (SNV, Indel, etc.) for optimized annotation.
+- **VCF to Parquet**: Efficient conversion of large VCF files to columnar format using `polars-bio`.
 - **Hugging Face Hub Integration**: Direct upload of processed datasets with automatic dataset card generation.
 
 ## Installation
@@ -27,33 +26,32 @@ uv sync
 
 ## Usage
 
-### Main Genomic Data Pipeline
+### 🔷 Dagster Pipelines
 
-The `prepare-annotations` command handles large-scale genomic data downloads and processing.
+The primary way to run pipelines is using Dagster. This provides parallel execution, resumable downloads, and integrated Hugging Face uploads.
 
-```bash
-# Show version
-uv run prepare-annotations version
-
-# Download and process Ensembl variations
-uv run prepare-annotations ensembl --split --upload
+#### Ensembl Pipeline
 
-# Download and process ClinVar data
-uv run prepare-annotations clinvar --split --upload
+```bash
+# Run the full pipeline (download → convert → upload)
+uv run dagster-ensembl
 
-# Download and process dbSNP data
-uv run prepare-annotations dbsnp --build GRCh38 --split
+# Start the Dagster UI for monitoring and interactive execution
+uv run dagster-ensembl ui
 
-# Download and process gnomAD data
-uv run prepare-annotations gnomad --version v4 --split
+# Run for a specific species
+uv run dagster-ensembl run --species mus_musculus
 ```
 
-#### Main Pipeline Options
+#### Other Dagster Commands
 
-- `--dest-dir`: Destination directory for downloads.
-- `--split`: Split downloaded files by variant type.
-- `--upload`: Upload results to Hugging Face Hub.
-- `--repo-id`: Custom Hugging Face repository ID.
+```bash
+# List all available assets
+uv run dagster-ui assets
+
+# Materialize specific assets
+uv run dagster-ui materialize ensembl_vcf_urls
+```
 
 ### OakVar Module Management
 
````

dagster.yaml

Lines changed: 27 additions & 0 deletions
```diff
@@ -0,0 +1,27 @@
+# Dagster instance configuration for prepare-annotations
+#
+# Concurrency Control:
+# Uses tag-based concurrency limits via run_coordinator to prevent
+# too many memory-intensive operations running in parallel.
+#
+# See: https://docs.dagster.io/deployment/dagster-instance
+
+# Run coordinator with tag-based concurrency limits
+run_coordinator:
+  module: dagster.core.run_coordinator
+  class: QueuedRunCoordinator
+  config:
+    tag_concurrency_limits:
+      # Limit concurrent VCF downloads (I/O bound)
+      - key: "dagster/concurrency_key"
+        value: "ensembl_vcf_download"
+        limit: 4
+      # Limit concurrent parquet conversions (CPU/memory intensive)
+      - key: "dagster/concurrency_key"
+        value: "ensembl_parquet_conversion"
+        limit: 2
+
+# Unified storage configuration (SQLite for local development)
+storage:
+  sqlite:
+    base_dir: .dagster_home
```
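The `tag_concurrency_limits` above are enforced queue-side: a queued run may start only while fewer than `limit` runs carrying the same concurrency tag are in flight. A minimal stdlib sketch of that admission rule (an illustration of the semantics only, not Dagster's implementation; names are made up):

```python
from collections import Counter


def startable(queued, running, limits):
    """Return the queued run ids that may start now, given per-key limits.

    queued:  list of (run_id, concurrency_key) pairs in FIFO order
    running: list of concurrency_key values for in-flight runs
    limits:  dict mapping concurrency_key -> max concurrent runs
    Keys without a configured limit are unconstrained.
    """
    in_flight = Counter(running)
    started = []
    for run_id, key in queued:
        limit = limits.get(key)
        if limit is None or in_flight[key] < limit:
            started.append(run_id)
            in_flight[key] += 1  # count it against the limit immediately
    return started
```

Under the configuration above, a run tagged `ensembl_parquet_conversion` would therefore wait whenever two such runs are already executing, regardless of how many download runs are active.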

docs/DAGSTER_ENSEMBL_PIPELINE.md

Lines changed: 55 additions & 10 deletions
```diff
@@ -3,16 +3,27 @@
 This repo includes a **Dagster** implementation of the Ensembl preparation pipeline as a parallel alternative to the Prefect flows.
 
 The Dagster implementation lives under:
-- `src/prepare_annotations/pipelines_dagster/`
+- `src/prepare_annotations/pipelines/`
 
 It is intentionally **file/directory based**: each asset materializes a concrete on-disk artifact (a JSON manifest, a directory of VCFs, a directory of Parquet files, etc.). This makes lineage inspectable and keeps memory usage predictable.
 
 ---
 
+### Core principles
+
+- **Lineage-first assets**: each asset returns a concrete on-disk artifact (Path) to avoid passing large in-memory objects.
+- **Dynamic partitioning**: per-file assets are partitioned by filename for fine-grained lineage and UI progress.
+- **Memory safety**: prefer streaming (`LazyFrame.sink_parquet` with `engine="streaming"` by default) and avoid eager materialization during conversion.
+- **Scale-aware joins**: for joins that Polars would materialize in memory, prefer DuckDB or staged filtering.
+- **Resource visibility**: download/convert steps log duration and peak memory where available.
+- **Idempotent outputs**: assets skip work when target files are present and up-to-date.
+
+---
+
 ### Key Features
 
 - **Parallel downloads**: Configurable concurrent downloads (`max_concurrent_downloads`, default: 4)
-- **Retry policies**: Exponential backoff retry policy on download failures (max 3 retries)
+- **Retry policies**: Dagster retry policy (max 3 attempts) plus downloader retries (default: 10)
 - **Checksum verification**: BSD sum checksum validation using CHECKSUMS file from Ensembl FTP
 - **Resumable downloads**: fsspec filecache-based resumption for interrupted transfers
 - Uploads directly from the non-split Parquet directory (no legacy TSA splitting in Dagster)
```
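The checksum verification mentioned above relies on Ensembl's `CHECKSUMS` files, which use the historic BSD `sum` algorithm: a 16-bit checksum that rotates right one bit per byte, plus a size in 1 KiB blocks. A self-contained sketch of the math (assuming the usual `<sum> <blocks> <filename>` line format; the pipeline's actual helper may differ):

```python
def bsd_sum(data: bytes) -> tuple[int, int]:
    """Return (checksum, block_count) as the BSD `sum` command reports them.

    The checksum rotates right by one bit, adds the next byte, and wraps
    modulo 2**16; the block count is the size in 1024-byte blocks, rounded up.
    """
    checksum = 0
    for byte in data:
        checksum = (checksum >> 1) + ((checksum & 1) << 15)  # rotate right
        checksum = (checksum + byte) & 0xFFFF                # add byte, wrap
    blocks = (len(data) + 1023) // 1024
    return checksum, blocks


# A downloaded file passes verification when its computed pair matches
# the pair parsed from the CHECKSUMS line for that filename.
checksum, blocks = bsd_sum(b"a")
```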
````diff
@@ -36,9 +47,11 @@ The default pipeline prepares Ensembl VCFs into Parquet format:
 ```mermaid
 flowchart TD
 A[ensembl_ftp_source<br/>(external)] --> B[ensembl_vcf_urls<br/>vcf_urls.json]
-B --> C[ensembl_vcf_files<br/>vcf/ directory<br/>(parallel downloads)]
-C --> D[ensembl_parquet_files<br/>species dir (*.parquet)]
-D --> F[ensembl_hf_upload<br/>(optional)]
+B --> C1[ensembl_vcf_file<br/>per-file download<br/>(partitioned)]
+C1 --> C2[ensembl_vcf_files<br/>vcf/ directory<br/>(batch downloads)]
+C2 --> D1[ensembl_parquet_file<br/>per-file conversion<br/>(partitioned)]
+D1 --> D2[ensembl_parquet_files<br/>species dir (*.parquet)]
+D2 --> F[ensembl_hf_upload<br/>(optional)]
 ```
 
 #### ASCII diagram (fallback)
@@ -50,10 +63,16 @@ ensembl_ftp_source (external)
 ensembl_vcf_urls (vcf_urls.json)
 |
 v
+ensembl_vcf_file (per-file download, partitioned)
+|
+v
 ensembl_vcf_files (vcf/ directory, parallel downloads with retries)
 |
 v
-ensembl_parquet_files (species directory with *.parquet)
+ensembl_parquet_file (per-file conversion, partitioned)
+|
+v
+ensembl_parquet_files (species directory with *.parquet)
 |
 v
 ensembl_hf_upload (optional)
````
```diff
@@ -63,7 +82,7 @@ ensembl_hf_upload (optional)
 
 ### On-disk layout (default)
 
-Paths are resolved via `src/prepare_annotations/pipelines_dagster/resources.py`.
+Paths are resolved via `src/prepare_annotations/pipelines/resources.py`.
 
 By default the pipeline writes to your user cache (same convention as other Just DNA tooling):
 - Base cache dir: `~/.cache/just-dna-pipelines/` (or `JUST_DNA_PIPELINES_CACHE_DIR`)
```
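Because every asset materializes a concrete path under this layout, the "skip work when target files are present and up-to-date" rule reduces to a freshness check on the on-disk artifact. A stdlib sketch of such a check (illustrative only; the pipeline's actual skip logic may also compare sizes or checksums):

```python
import os
import tempfile
from pathlib import Path


def is_up_to_date(source: Path, target: Path) -> bool:
    """True when `target` exists and is at least as new as `source`."""
    return target.exists() and target.stat().st_mtime >= source.stat().st_mtime


# Tiny demo with throwaway files (names are illustrative):
base = Path(tempfile.mkdtemp())
vcf = base / "homo_sapiens-chr1.vcf.gz"
parquet = base / "homo_sapiens-chr1.parquet"
vcf.write_text("fake vcf")

fresh_before = is_up_to_date(vcf, parquet)  # False: no parquet yet
parquet.write_text("fake parquet")
os.utime(parquet, times=None)               # touch: parquet is now >= vcf
fresh_after = is_up_to_date(vcf, parquet)   # True: conversion can be skipped
```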
```diff
@@ -82,7 +101,9 @@ For Ensembl:
 - **Retry policy** with exponential backoff (30s initial delay, up to 3 retries) at the Dagster asset level.
 - **Resumable downloads** via fsspec filecache (interrupted downloads resume from where they left off).
 - **Checksum verification** using BSD sum (`CHECKSUMS` file from Ensembl FTP); corrupted files are automatically re-downloaded.
-- **VCF → Parquet** uses `polars-bio` scanning and `LazyFrame.sink_parquet(...)` to stream to disk.
+- **VCF → Parquet** uses `polars-bio` scanning and `LazyFrame.sink_parquet(..., engine="streaming")` to stream to disk by default.
+- **Resource logging**: conversion and download steps record duration/peak memory when available.
+- **Join strategy**: when Polars would materialize full datasets on joins, prefer DuckDB or pre-filtered joins to limit memory pressure.
 - Dagster assets return **Paths** (manifest files / directories), not large Python lists, to avoid passing large in-memory objects between steps.
 
 ---
```
````diff
@@ -124,6 +145,14 @@ List available jobs:
 uv run dagster-ensembl jobs
 ```
 
+Run LongevityMap conversion (Dagster module assets):
+
+```bash
+uv run dagster-ensembl longevitymap
+uv run dagster-ensembl longevitymap --full
+uv run dagster-ensembl longevitymap --upload
+```
+
 #### Run via Dagster UI
 
 Start the web interface for interactive execution:
````
```diff
@@ -138,7 +167,7 @@ Then materialize assets / jobs from the UI.
 
 ### Jobs provided
 
-Jobs are defined in `src/prepare_annotations/pipelines_dagster/definitions.py`:
+Jobs are defined in `src/prepare_annotations/pipelines/definitions.py`:
 
 | Job | Description |
 |-----|-------------|
```
```diff
@@ -147,6 +176,9 @@ Jobs are defined in `src/prepare_annotations/pipelines_dagster/definitions.py`:
 | `download` | Download VCF files only (parallel with retries) |
 | `convert` | Convert VCF to Parquet (assumes VCFs downloaded) |
 | `upload` | Upload to HuggingFace Hub (assumes parquet exists) |
+| `longevitymap` | Convert LongevityMap to unified schema (with Ensembl genotype resolution) |
+| `longevitymap_full` | Convert LongevityMap and join with full Ensembl data |
+| `longevitymap_upload` | Convert LongevityMap and upload to `just-dna-seq/annotators` |
 
 ---
 
```
```diff
@@ -156,18 +188,31 @@ Key configuration parameters (set via Dagster config):
 
 **EnsemblDownloadConfig:**
 - `species`: Species name (default: `homo_sapiens`)
+- `base_url`: Ensembl FTP base URL (default: `https://ftp.ensembl.org/pub/current_variation/vcf/`)
+- `pattern`: Regex to filter remote files (default: species-aware pattern)
+- `cache_dir`: Override cache directory (default: `~/.cache/just-dna-pipelines/ensembl/{species}`)
 - `max_concurrent_downloads`: Maximum parallel downloads (default: `4`)
 - `verify_checksums`: Whether to verify checksums (default: `True`)
+- `force_download`: Re-download even if files already exist (default: `False`)
+- `http_max_pool`: HTTP pool size for downloader (default: `20`)
 - `retries`: Number of retry attempts per file (default: `10`)
 - `connect_timeout`: Connection timeout in seconds (default: `10.0`)
 - `sock_read_timeout`: Socket read timeout in seconds (default: `120.0`)
 
+**ParquetConversionConfig:**
+- `max_concurrent_conversions`: Maximum parallel conversions. If unset, uses `PREPARE_ANNOTATIONS_PARQUET_WORKERS` env var; defaults to `2`.
+- `threads`: Thread count per conversion (auto-detected if not set).
+- `force_convert`: Re-convert even when parquet is up-to-date.
+
+**Environment overrides:**
+- `PREPARE_ANNOTATIONS_PARQUET_WORKERS`: Max concurrent parquet conversions (used when config not set).
+
 ### HuggingFace upload lineage
 
 The upload asset (`ensembl_hf_upload`) depends on the parquet directory output (`ensembl_parquet_files`). In the Dagster UI, this makes it straightforward to answer:
 - "Which local dataset was uploaded?"
 - "When did we last upload, and what was uploaded vs skipped?"
 
 Uploads are executed using the existing uploader implementation:
-- `prepare_annotations.preparation.huggingface_uploader.upload_parquet_to_hf`
+- `prepare_annotations.huggingface_uploader.upload_parquet_to_hf`
 
```

notebooks/inspect_modules.ipynb

Lines changed: 2 additions & 2 deletions
```diff
@@ -97,7 +97,7 @@
 ],
 "source": [
 "from os import listdir\n",
-"from prepare_annotations.paths import (\n",
+"from prepare_annotations.resources import (\n",
 "    get_cache_dir,\n",
 "    get_ensembl_cache,\n",
 "    get_ensembl_variations_cache,\n",
@@ -628,7 +628,7 @@
 }
 ],
 "source": [
-"from prepare_annotations.paths import (\n",
+"from prepare_annotations.resources import (\n",
 "    find_ensembl_genome_fasta,\n",
 "    list_ensembl_genome_fastas,\n",
 ")\n",
```
