genomic-variant-data-pipeline

A cloud-native, event-driven pipeline on AWS for ingesting, transforming, and querying ClinVar genomic variant data.

Overview

This pipeline automates the monthly ingestion of ClinVar VCV release files from NCBI, transforms the nested XML into a flattened Apache Iceberg table, and makes it queryable via Amazon Athena.

NCBI FTP (ClinVar XML)
        │
        ▼ (Lambda — 2nd Friday monthly)
   S3: clinvar-raw
        │
        ▼ (EventBridge → Glue Workflow)
   S3: clinvar-transformed  (Iceberg, partitioned by classification)
        │
        ▼
   Amazon Athena
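
The EventBridge step in the diagram above can be pictured as an event pattern that matches S3 "Object Created" events for the raw bucket. The following is a minimal sketch, not the repository's actual rule: the bucket name `clinvar-raw` is taken from the diagram, the `matches` helper is a hypothetical and much-simplified stand-in for EventBridge's own matching, and it assumes EventBridge notifications are enabled on the bucket.

```python
# Assumed event pattern for the rule that starts the Glue workflow.
S3_OBJECT_CREATED_PATTERN = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": ["clinvar-raw"]}},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: nested dicts recurse, lists mean 'any of these values'."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical event, shaped like an S3 notification delivered through EventBridge.
sample_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "clinvar-raw"},
        "object": {"key": "example.xml.gz"},
    },
}
is_match = matches(S3_OBJECT_CREATED_PATTERN, sample_event)
```

Only exact-value matching is modelled here; real EventBridge patterns also support prefix, suffix, and other operators that this sketch ignores.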

Architecture

| Component | Service | Purpose |
| --- | --- | --- |
| Ingestion | AWS Lambda (Python 3.12) | Downloads the gzipped ClinVar XML from NCBI FTP and stores it in S3 |
| Scheduling | EventBridge Scheduler | Triggers the Lambda on the 2nd Friday of each month |
| Processing | AWS Glue (Spark, Glue 5.0) | Parses the XML, flattens variant records, writes the Iceberg table |
| Orchestration | EventBridge + Glue Workflow | S3 object creation triggers the Glue ETL job |
| Storage | Amazon S3 | Raw XML, transformed Iceberg data, Glue scripts, Athena results |
| Query | Amazon Athena (engine v3) | Ad-hoc SQL over the `genomics.clinvar_vcv` Iceberg table |
| Error handling | SQS Dead Letter Queue | Captures Lambda failures for retry/inspection |
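
To make the "2nd Friday of each month" schedule concrete, here is a small sketch that computes that date for any month, alongside a schedule expression that would fire on it. The 12:00 UTC time and the `SCHEDULE_EXPRESSION` name are assumptions for illustration; the repository's actual expression may differ.

```python
import calendar
from datetime import date


def second_friday(year: int, month: int) -> date:
    """Return the date of the second Friday in the given month."""
    first_weekday, _ = calendar.monthrange(year, month)  # Monday == 0
    # Days from the 1st of the month to the first Friday (Friday == 4).
    offset = (calendar.FRIDAY - first_weekday) % 7
    return date(year, month, 1 + offset + 7)


# Assumed schedule: 12:00 UTC on the second Friday of every month.
# In EventBridge cron syntax, day-of-week 6 is Friday and "#2" selects
# the second occurrence within the month.
SCHEDULE_EXPRESSION = "cron(0 12 ? * 6#2 *)"

next_run = second_friday(2025, 1)
```

A helper like this can be handy in tests to check that a release file landed on the expected day.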

Iceberg Table: `glue_catalog.genomics.clinvar_vcv`

Fields extracted per variant record:

Variation: ID, name, type, VCV accession + version
Alleles: ID, SPDI notation, protein changes
Gene: symbol, ID, HGNC ID, relationship type
Location: chromosome, GRCh38 position, reference/alternate alleles
Classification: status (Pathogenic / Benign / VUS), review status, date evaluated
Conditions: disease mappings
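
The flattening step can be illustrated with a small standard-library sketch. This is not the Glue job itself (which runs on Spark): the sample XML is a heavily simplified, hypothetical record, and the child element names, attribute names, and output column names are assumptions chosen to mirror the field list above. Real ClinVar VCV XML is considerably more nested.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical record for illustration only.
SAMPLE = """
<VariationArchive VariationID="12345" VariationName="GENE1 c.1A&gt;G"
                  VariationType="single nucleotide variant"
                  Accession="VCV000012345" Version="3">
  <Gene Symbol="GENE1" GeneID="999" HGNC_ID="HGNC:0000"/>
  <Location Chr="1" Start="100" Ref="A" Alt="G"/>
  <Classification ReviewStatus="criteria provided, single submitter"
                  DateEvaluated="2024-01-15">Pathogenic</Classification>
</VariationArchive>
"""


def flatten_variant(xml_text: str) -> dict:
    """Turn one nested variant record into a single flat row (dict)."""
    root = ET.fromstring(xml_text.strip())
    gene = root.find("Gene")
    location = root.find("Location")
    classification = root.find("Classification")
    return {
        "variation_id": int(root.get("VariationID")),
        "variation_name": root.get("VariationName"),
        "variation_type": root.get("VariationType"),
        "vcv_accession": root.get("Accession"),
        "vcv_version": int(root.get("Version")),
        "gene_symbol": gene.get("Symbol"),
        "gene_id": int(gene.get("GeneID")),
        "hgnc_id": gene.get("HGNC_ID"),
        "chromosome": location.get("Chr"),
        "position_grch38": int(location.get("Start")),
        "ref_allele": location.get("Ref"),
        "alt_allele": location.get("Alt"),
        "classification": classification.text,
        "review_status": classification.get("ReviewStatus"),
        "date_evaluated": classification.get("DateEvaluated"),
    }


row = flatten_variant(SAMPLE)
```

Each nested element contributes a handful of scalar columns to the row, which is the shape a flat table needs.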

The table is partitioned by classification.
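
Because the table is partitioned by classification, queries that filter on it can skip unrelated partitions. The sketch below builds such a query as a string. The column names (`gene_symbol`, `variation_name`, `classification`) and the allowed values are assumptions based on the field list above, not the table's confirmed schema; the values actually stored may be spelled differently.

```python
# Assumed classification values, taken from the field list above.
ALLOWED_CLASSIFICATIONS = {"Pathogenic", "Benign", "VUS"}


def build_query(classification: str, limit: int = 10) -> str:
    """Build an Athena query that filters on the partition column."""
    if classification not in ALLOWED_CLASSIFICATIONS:
        # Reject unknown values rather than interpolating them into SQL.
        raise ValueError(f"unknown classification: {classification!r}")
    return (
        "SELECT variation_name, gene_symbol, classification "
        "FROM genomics.clinvar_vcv "
        f"WHERE classification = '{classification}' "
        f"LIMIT {int(limit)}"
    )


query = build_query("Pathogenic", limit=5)
```

The resulting string could then be submitted with boto3's Athena client (`start_query_execution`), using whichever workgroup the analytics module provisions.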

Infrastructure

Infrastructure is managed with Terraform and Terragrunt.

| Environment | Target | Notes |
| --- | --- | --- |
| `dev` | LocalStack 4.3 | Runs locally via Docker Compose |
| `prod` | AWS Cloud | Managed via HCP Terraform, formerly Terraform Cloud (`paty-training` org) |

Local development

# Start LocalStack
docker compose up -d

# Deploy dev infrastructure
cd environments/dev
terragrunt apply

CI/CD

The GitHub Actions workflow (`.github/workflows/ci.yml`) runs on every push and has two jobs:

1. build-and-test-infrastructure: spins up LocalStack, applies the dev Terraform, then runs `terraform test` and pytest.
2. deploy-infrastructure: applies the prod Terraform via the HCP Terraform API; runs only after job 1 passes.

Python

Runtime: Python 3.12, managed with Poetry
Key dependencies: boto3, pre-commit
Test dependencies: pytest, pytest-mock, moto[s3]

poetry install
poetry run pytest

Modules

modules/
├── storage/     # S3 buckets (raw, transformed, Athena results, Glue scripts)
├── ingestion/   # Lambda + EventBridge Scheduler + DLQ
├── processing/  # Glue job, workflow, EventBridge trigger, IAM
└── analytics/   # Athena workgroup + IAM

