A robust, modular pipeline for post-GWAS functional characterization using AlphaGenome. This tool takes summary statistics (REGENIE format), identifies significant loci, refines them using LD information and Bayesian fine-mapping, and predicts functional effects (RNA expression, chromatin accessibility, splicing) in relevant tissues.
- Robust Clumping: Identifies independent genomic loci using distance-based clumping with deterministic sorting.
- Informed Selection: Pre-calculates the 95% Credible Set size for every locus, allowing you to prioritize high-value targets before running expensive AlphaGenome queries.
- LD-Informed Fine-Mapping: Integrates with the LDlink API to filter variants based on genetic linkage ($R^2 \ge 0.1$) rather than just physical proximity.
- Two-Tier Fallback: If LD data is unavailable (e.g., rare variants missing from 1000G), the pipeline automatically:
  - Attempts to use nearby "Neighbor Proxies".
  - Falls back to a reduced physical window (±50kb) to focus on the immediate vicinity.
- QC Compliance: Respects author-recommended QC filtering by default, ensuring results are scientifically sound.
- Multimodal Scoring: Queries AlphaGenome for RNA-Seq, ATAC-Seq, and Splicing predictions.
- Tissue Specificity: Automatically filters results for:
  - Brain (UBERON:0000955)
  - Blood (UBERON:0000178)
  - T-Cells (CL:0000084)
- Performance: Uses awk for high-speed file scanning and implements aggressive caching for API calls and regional data.
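
The distance-based clumping with deterministic sorting mentioned above can be illustrated with a minimal sketch. This is not the pipeline's actual code: the `Variant` type, the `clump_by_distance` function, the default threshold of 7.3 (roughly p = 5e-8 on a -log10 scale), and the 500 kb clumping distance are all assumptions made for illustration. The idea is to sort significant variants by strength with explicit tie-breakers, then greedily keep a variant as a new lead only if no stronger lead already sits within the window.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    variant_id: str
    chrom: str
    pos: int
    log10p: float  # -log10(p-value); larger means more significant


def clump_by_distance(variants: List[Variant],
                      sig_threshold: float = 7.3,   # assumed default
                      window: int = 500_000) -> List[Variant]:  # assumed default
    """Return one lead variant per locus, strongest signal first."""
    hits = [v for v in variants if v.log10p >= sig_threshold]
    # Deterministic order: strongest first; ties broken by chrom, pos, id,
    # so the result does not depend on the row order of the input file.
    hits.sort(key=lambda v: (-v.log10p, v.chrom, v.pos, v.variant_id))
    leads: List[Variant] = []
    for v in hits:
        claimed = any(
            v.chrom == lead.chrom and abs(v.pos - lead.pos) <= window
            for lead in leads
        )
        if not claimed:
            leads.append(v)
    return leads


demo = [
    Variant("rs1", "6", 1_000_000, 9.0),
    Variant("rs2", "6", 1_200_000, 8.5),  # within 500 kb of rs1
    Variant("rs3", "6", 3_000_000, 8.5),
    Variant("rs4", "7", 1_000_000, 8.5),
    Variant("rs5", "6", 5_000_000, 5.0),  # below the threshold
]
print([v.variant_id for v in clump_by_distance(demo)])
# ['rs1', 'rs3', 'rs4']
```

Because ties are broken explicitly, shuffling the input rows yields the same set of leads, which is what makes the clumping reproducible between runs.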
- Python 3.8+
- Dependencies: pandas, numpy, requests, tqdm, scipy
- AlphaGenome Python library installed and authenticated.
You must set the following API keys in your environment:

    # Required for functional annotation
    export ALPHAGENOME_API_KEY="your_alphagenome_key"
    # Optional (highly recommended) for LD-informed fine-mapping
    export LDLINK_API_TOKEN="your_ldlink_token"

Run the pipeline with default settings (interactive mode). It will scan the GWAS file, show identified loci with their Credible Set sizes, and ask you which to analyze.

    python3 analyze_decodeme.py

The pipeline defaults to strict author-recommended QC filtering. To analyze variants that were excluded (e.g., for exploratory analysis of missing peaks such as those on Chromosome 22), use the --skip-qc flag.

    python3 analyze_decodeme.py --chrom 22 --skip-qc

Run analysis on a specific chromosome or region to save time.

    # Analyze only Chromosome 6
    python3 analyze_decodeme.py --chrom 6
    # Analyze a specific region
    python3 analyze_decodeme.py --chrom 6 --start 25000000 --end 35000000

    python3 analyze_decodeme.py \
      --sig-threshold 8.0 \
      --window 1000000 \
      --credible-set 0.99 \
      --output my_custom_analysis

    python3 analyze_decodeme.py --non-interactive --no-ldlink

- Lead Identification: Scans the input summary statistics for variants exceeding the significance threshold.
- Clumping: Groups significant variants into independent genomic loci.
- Regional Extraction: For each locus, extracts all variants within the window.
- Credible Set Pre-calculation: Defines a 95% Credible Set using Approximate Bayes Factors (ABF). The size of this set is displayed to help the user manage AlphaGenome workload.
- LD Filtering (Tiered):
  - Primary: Queries LDlink for variants in LD ($R^2 \ge 0.1$) with the lead SNP.
  - Fallback 1: If the lead is missing, tries nearby neighbors as proxies.
  - Fallback 2: If all else fails, reduces the window to ±50kb to focus on the immediate vicinity.
- Annotation: Queries AlphaGenome to predict functional impacts for every variant in the credible set.
- Filtering & Save: Filters predictions for prioritized tissues and saves results to CSV.
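
The credible set step can be sketched as follows, assuming Wakefield's approximate Bayes factor and a single causal variant per locus. With $z = \hat\beta / \mathrm{SE}$, $V = \mathrm{SE}^2$, prior variance $W$, and $r = W / (W + V)$, the log ABF in favour of association is $\tfrac{1}{2}\left[\ln(1 - r) + r z^2\right]$. Posterior probabilities are the ABFs normalised over the locus, and the credible set is the smallest set of top-ranked variants whose cumulative probability reaches the requested coverage. The function names and the prior variance of 0.04 are assumptions for illustration; the pipeline's actual prior and implementation may differ.

```python
import math
from typing import Dict, List, Tuple


def log_abf(beta: float, se: float, prior_var: float = 0.04) -> float:
    """Wakefield log approximate Bayes factor in favour of association.

    prior_var (W) = 0.04 is an assumed value, not taken from the pipeline.
    """
    v = se * se
    r = prior_var / (prior_var + v)
    z = beta / se
    return 0.5 * (math.log(1.0 - r) + r * z * z)


def credible_set(stats: Dict[str, Tuple[float, float]],
                 coverage: float = 0.95,
                 prior_var: float = 0.04) -> List[Tuple[str, float]]:
    """Return (variant_id, posterior) pairs forming the credible set.

    `stats` maps variant id -> (beta, se) for every variant in the locus.
    """
    if not stats:
        return []
    labf = {vid: log_abf(b, s, prior_var) for vid, (b, s) in stats.items()}
    # Subtract the maximum before exponentiating to avoid overflow.
    top = max(labf.values())
    weights = {vid: math.exp(x - top) for vid, x in labf.items()}
    total = sum(weights.values())
    # Rank by posterior, breaking ties by id for reproducibility.
    ranked = sorted(((vid, w / total) for vid, w in weights.items()),
                    key=lambda t: (-t[1], t[0]))
    chosen: List[Tuple[str, float]] = []
    cumulative = 0.0
    for vid, pp in ranked:
        chosen.append((vid, pp))
        cumulative += pp
        if cumulative >= coverage:
            break
    return chosen


locus = {"rsA": (0.30, 0.03), "rsB": (0.10, 0.03), "rsC": (0.02, 0.03)}
print(len(credible_set(locus)), "variant(s) in the 95% credible set")
```

A locus dominated by one strong variant yields a small set, while a locus with many equally supported variants yields a large one, which is why the set size is a useful guide to how many AlphaGenome queries a locus will cost.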
The pipeline creates two cache directories and one cache file:
- region_cache/: Stores extracted GWAS summary statistics for specific windows.
- ld_cache/: Stores LDlink API responses.
- *_leads.csv: Caches the initial scan of significant hits.
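
As a rough illustration of how an on-disk response cache such as ld_cache/ can work, the sketch below wraps any fetch function so that its result is written to a JSON file on first use and read back afterwards. The `cached_json` helper, the key format, and the one-file-per-key layout are hypothetical; the pipeline's real cache format is not documented here.

```python
import json
from pathlib import Path
from typing import Callable


def cached_json(cache_dir: Path, key: str, fetch: Callable[[], dict]) -> dict:
    """Return the cached value for `key`, calling `fetch` only on a miss.

    Hypothetical layout: one JSON file per key inside `cache_dir`.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    data = fetch()  # e.g. a function that performs the actual API request
    path.write_text(json.dumps(data))
    return data
```

A caller would pass a closure that performs the real request, for example `cached_json(Path("ld_cache"), "rs123_EUR", lambda: query_ld("rs123"))`, where `query_ld` stands in for whatever function contacts the API. Deleting the cache directory forces a fresh query on the next run.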