A cloud-native, event-driven pipeline on AWS for ingesting, transforming, and querying ClinVar genomic variant data.
This pipeline automates the monthly ingestion of ClinVar VCV release files from NCBI, transforms the nested XML into a flattened Apache Iceberg table, and makes it queryable via Amazon Athena.
NCBI FTP (ClinVar XML)
│
▼ (Lambda — 2nd Friday monthly)
S3: clinvar-raw
│
▼ (EventBridge → Glue Workflow)
S3: clinvar-transformed (Iceberg, partitioned by classification)
│
▼
Amazon Athena
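The "2nd Friday monthly" trigger in the diagram can be sanity-checked locally. The following is a minimal sketch and not part of the pipeline code; the helper names `second_friday` and `is_second_friday` are hypothetical.

```python
import calendar
from datetime import date


def second_friday(year: int, month: int) -> date:
    """Return the date of the second Friday of the given month."""
    first_weekday = date(year, month, 1).weekday()  # Monday == 0, Friday == 4
    # Days from the 1st to the first Friday (0 when the 1st is itself a Friday).
    offset = (calendar.FRIDAY - first_weekday) % 7
    return date(year, month, 1 + offset + 7)


def is_second_friday(day: date) -> bool:
    """True when `day` is the second Friday of its month."""
    return day == second_friday(day.year, day.month)


print(second_friday(2025, 1))  # 2025-01-10
```

The second Friday always falls between the 8th and the 14th, which makes it easy to verify a schedule by eye. The actual schedule expression used by EventBridge Scheduler is not shown in this README; an EventBridge cron expression using the `#` wildcard (for example `6#2` in the day-of-week field) is one way such a rule is commonly written.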
| Component | Service | Purpose |
|---|---|---|
| Ingestion | AWS Lambda (Python 3.12) | Downloads gzipped ClinVar XML from NCBI FTP and stores in S3 |
| Scheduling | EventBridge Scheduler | Triggers Lambda on the 2nd Friday of each month |
| Processing | AWS Glue (Spark, Glue 5.0) | Parses XML, flattens variant records, writes Iceberg table |
| Orchestration | EventBridge + Glue Workflow | S3 object creation triggers the Glue ETL job |
| Storage | Amazon S3 | Raw XML, transformed Iceberg data, Glue scripts, Athena results |
| Query | Amazon Athena (engine v3) | Ad-hoc SQL over the genomics.clinvar_vcv Iceberg table |
| Error handling | SQS Dead Letter Queue | Captures Lambda failures for retry/inspection |
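To illustrate the orchestration step, the sketch below shows which S3 events would start the ETL run. In the pipeline this matching is done by an EventBridge rule rather than by Python code, so this is only an illustration. The function name `raw_object_from_event`, the `.xml.gz` suffix filter, and the sample object key are assumptions; the event layout follows the S3 "Object Created" format that EventBridge delivers.

```python
from typing import Optional


def raw_object_from_event(event: dict, raw_bucket: str = "clinvar-raw") -> Optional[str]:
    """Return the object key if the event should start the ETL run, else None."""
    if event.get("source") != "aws.s3":
        return None
    if event.get("detail-type") != "Object Created":
        return None
    detail = event.get("detail", {})
    if detail.get("bucket", {}).get("name") != raw_bucket:
        return None
    key = detail.get("object", {}).get("key")
    # Assumption: only gzipped XML uploads are relevant to the Glue job.
    if not key or not key.endswith(".xml.gz"):
        return None
    return key


# Hypothetical event, trimmed to the fields the function reads.
sample_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "clinvar-raw"},
        "object": {"key": "releases/example_release.xml.gz", "size": 1024},
    },
}

print(raw_object_from_event(sample_event))  # releases/example_release.xml.gz
```

Writing the filter out this way can help when unit-testing the rule's intent before encoding it as an EventBridge event pattern in Terraform.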
Fields extracted per variant record:
- Variation: ID, name, type, VCV accession + version
- Alleles: ID, SPDI notation, protein changes
- Gene: symbol, ID, HGNC ID, relationship type
- Location: chromosome, GRCh38 position, reference/alternate alleles
- Classification: status (Pathogenic / Benign / VUS), review status, date evaluated
- Conditions: disease mappings
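The flattening idea behind these fields can be sketched in plain Python. The Glue job itself runs on Spark, so this is not the job's implementation. The XML below is a simplified, hypothetical record (real ClinVar VCV XML is more deeply nested), and the output column names are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Simplified, hypothetical record; element and attribute layout is illustrative only.
SAMPLE_XML = """
<VariationArchive VariationID="12345" VariationName="GENE1 example variant"
                  VariationType="single nucleotide variant"
                  Accession="VCV000012345" Version="3">
  <Gene Symbol="GENE1" GeneID="999" HGNC_ID="HGNC:0000"/>
  <Location Assembly="GRCh38" Chr="1" Start="1000" Ref="A" Alt="G"/>
  <Classification Status="Pathogenic"
                  ReviewStatus="criteria provided, single submitter"
                  DateEvaluated="2024-01-15"/>
  <Condition Name="Example condition A"/>
  <Condition Name="Example condition B"/>
</VariationArchive>
""".strip()


def _attr(node: Optional[ET.Element], name: str) -> Optional[str]:
    """Read an attribute, tolerating a missing element."""
    return node.get(name) if node is not None else None


def flatten_variant(record: ET.Element) -> dict:
    """Turn one nested variant record into a flat, row-like dict."""
    gene = record.find("Gene")
    location = record.find("Location")
    classification = record.find("Classification")
    start = _attr(location, "Start")
    return {
        "variation_id": record.get("VariationID"),
        "name": record.get("VariationName"),
        "variation_type": record.get("VariationType"),
        "accession": record.get("Accession"),
        "version": record.get("Version"),
        "gene_symbol": _attr(gene, "Symbol"),
        "gene_id": _attr(gene, "GeneID"),
        "hgnc_id": _attr(gene, "HGNC_ID"),
        "chromosome": _attr(location, "Chr"),
        "position": int(start) if start is not None else None,
        "ref": _attr(location, "Ref"),
        "alt": _attr(location, "Alt"),
        "classification": _attr(classification, "Status"),
        "review_status": _attr(classification, "ReviewStatus"),
        "date_evaluated": _attr(classification, "DateEvaluated"),
        # Multi-valued field kept as a list instead of being exploded into rows.
        "conditions": [c.get("Name") for c in record.findall("Condition")],
    }


row = flatten_variant(ET.fromstring(SAMPLE_XML))
print(row["accession"], row["classification"], row["conditions"])
```

Missing child elements produce `None` values rather than errors, which keeps one malformed record from failing a whole batch.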
The table is partitioned by classification.
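Because the table is partitioned by classification, queries that filter on that column let Athena skip the other partitions. The sketch below builds such a query and submits it through an Athena client passed in by the caller. The column names (`variation_id`, `gene_symbol`, `chromosome`, `position`), the function names, and the `primary` workgroup default are assumptions, not taken from the project.

```python
import re


def build_variant_query(gene_symbol: str, classification: str = "Pathogenic",
                        limit: int = 100) -> str:
    """Build a SQL query that filters on the partition column first."""
    # Reject anything that does not look like a gene symbol before interpolating it.
    if not re.fullmatch(r"[A-Za-z0-9-]+", gene_symbol):
        raise ValueError(f"unexpected gene symbol: {gene_symbol!r}")
    safe_classification = classification.replace("'", "''")  # escape single quotes
    return (
        "SELECT variation_id, gene_symbol, chromosome, position "
        "FROM genomics.clinvar_vcv "
        f"WHERE classification = '{safe_classification}' "
        f"AND gene_symbol = '{gene_symbol}' "
        f"LIMIT {int(limit)}"
    )


def start_variant_query(athena_client, gene_symbol: str,
                        classification: str = "Pathogenic",
                        workgroup: str = "primary") -> str:
    """Submit the query and return its execution ID.

    Assumes the workgroup already defines a query result location.
    """
    response = athena_client.start_query_execution(
        QueryString=build_variant_query(gene_symbol, classification),
        WorkGroup=workgroup,
    )
    return response["QueryExecutionId"]


print(build_variant_query("BRCA1"))
```

In practice the client would come from `boto3.client("athena")`; passing it in as an argument keeps the function easy to test with a stub.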
Infrastructure is managed with Terraform and Terragrunt.
| Environment | Target | Notes |
|---|---|---|
| dev | LocalStack 4.3 | Runs locally via Docker Compose |
| prod | AWS Cloud | Managed via Terraform Cloud (paty-training org) |
# Start LocalStack
docker compose up -d
# Deploy dev infrastructure
cd environments/dev
terragrunt apply

GitHub Actions workflow (.github/workflows/ci.yml) runs on every push:
- build-and-test-infrastructure: spins up LocalStack, applies dev Terraform, runs `terraform test` and `pytest`
- deploy-infrastructure: applies prod Terraform via HCP Terraform API (runs only after job 1 passes)
- Runtime: Python 3.12, managed with Poetry
- Key dependencies: `boto3`, `pre-commit`
- Test dependencies: `pytest`, `pytest-mock`, `moto[s3]`
poetry install
poetry run pytest

modules/
├── storage/ # S3 buckets (raw, transformed, Athena results, Glue scripts)
├── ingestion/ # Lambda + EventBridge Scheduler + DLQ
├── processing/ # Glue job, workflow, EventBridge trigger, IAM
└── analytics/ # Athena workgroup + IAM