ATLAS: Attention-based Locus Analysis System
ATLAS (Attention-based Locus Analysis System) is a computational framework that leverages genomic language models to identify disease-associated genetic variants through differential attention analysis. By analyzing attention patterns from pretrained genomic foundation models, ATLAS pinpoints pathogenic loci at single-nucleotide and haplotype resolution.
- Haplotype-resolved Analysis: Works with phased genomic sequences from case-control cohorts
- Foundation Model Integration: Extracts attention matrices from pretrained genomic language models (Genos-10B)
- Multi-scale Analysis: Performs both gene-level and base-level differential attention analysis
- High-resolution Mapping: Identifies pathogenic regions at single-nucleotide resolution
- Statistical Rigor: Uses Mann-Whitney U tests with Benjamini-Hochberg correction
ATLAS operates through four sequential steps:
Step 1: Sequence preparation
Objective: Process VCF files and prepare haplotype-resolved sequences
ATLAS operates on haplotype-resolved genomic sequences from case and control cohorts. Starting from VCF (Variant Call Format) files, sequences are extracted for specified genomic regions and labeled by sample type.
Scripts: 1.data_vcf2csv/
- 1.vcf_extract.py - Extract regions from VCF files
- 2.vcf2csv.py - Convert per-region VCFs to labeled CSV sequences
Step 2: Attention extraction
Objective: Extract position-level attention scores from genomic language models
Each sequence is processed by a pretrained genomic language model (Genos-10B in this study), and attention matrices are extracted from its final layer. Attention scores are averaged across heads, then aggregated column-wise to give one position-level importance score per base.
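The aggregation described above can be sketched with NumPy. This is a minimal illustration, not the code in export_attention_matrix_unified.py: the function name `position_scores` is hypothetical, and the column-wise reduction is assumed to be a mean (a sum would differ only by a constant factor of L).

```python
import numpy as np

def position_scores(attn: np.ndarray) -> np.ndarray:
    """Collapse a (heads, L, L) attention tensor to one score per position."""
    if attn.ndim != 3 or attn.shape[1] != attn.shape[2]:
        raise ValueError("expected an array of shape (heads, L, L)")
    head_avg = attn.mean(axis=0)   # (L, L): rows are queries, columns are keys
    # Column-wise mean: how much attention each position receives on average.
    # Assumption: the real pipeline may use a different column reduction.
    return head_avg.mean(axis=0)   # (L,)

# Toy input: 2 heads, 3 positions, each row sums to 1 like a softmax output.
attn = np.array([
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]],
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.4, 0.1, 0.5]],
])
scores = position_scores(attn)
```

Because every row of a softmax attention matrix sums to 1, the column means in this toy example also sum to 1, which is a convenient sanity check.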
Scripts:
- 2.attn_score/export_attention_matrix_unified.py - Unified attention extraction with automatic strategy selection
- 2.attn_score/calc_flash_attention_v2.py - FlashAttention-based block computation helper
- Documentation: ../docs/attention_extraction_strategies.md
Three Strategies:
- Vanilla (L ≤ 4,096): Standard attention for short sequences
- FlashAttention-based (4,096 < L ≤ 131,072): Memory-efficient block-wise computation
- Chunked (L > 131,072): Sliding-window for ultra-long sequences
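The length thresholds above translate directly into a small dispatch function. The sketch below only mirrors the documented cut-offs; the function name and the returned labels are illustrative and are not necessarily the values accepted by `--strategy`.

```python
VANILLA_MAX_LEN = 4_096      # upper bound for standard attention
FLASH_MAX_LEN = 131_072      # upper bound for block-wise FlashAttention

def select_strategy(seq_len: int) -> str:
    """Pick an extraction strategy from the sequence length L (hypothetical helper)."""
    if seq_len <= 0:
        raise ValueError("sequence length must be positive")
    if seq_len <= VANILLA_MAX_LEN:
        return "vanilla"
    if seq_len <= FLASH_MAX_LEN:
        return "flash"
    return "chunked"
```

Both boundaries are inclusive on the lower strategy, matching the "≤" in the list above.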
Step 3: Gene-level differential analysis
Objective: Identify candidate disease-associated genes through statistical testing
ATLAS performs gene-level differential attention analysis by comparing attention distribution statistics between case and control cohorts. The hypothesis is that pathogenic genes exhibit altered attention distributions relative to background genes.
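To make the case-versus-control comparison concrete, here is a dependency-free sketch of a two-sided Mann-Whitney U test on one metric. It uses the normal approximation with tie correction and no continuity correction, which is an assumption made for brevity; in practice a library routine such as `scipy.stats.mannwhitneyu` is the usual choice, and this README does not state which implementation 1.calculate_17metrics.py uses.

```python
import math
from collections import Counter

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def mann_whitney_u(case, control):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected)."""
    n1, n2 = len(case), len(control)
    pooled = list(case) + list(control)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2      # U statistic for the case group
    n = n1 + n2
    tie_term = sum(t**3 - t for t in Counter(pooled).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:                             # all values identical
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))    # two-sided tail probability
    return u1, p_value

# Hypothetical per-sample values of one metric in each cohort.
u_stat, p_val = mann_whitney_u([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

The normal approximation is only reliable for moderately large cohorts; exact p-values are preferable when groups are this small.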
Scripts: 3.gene_analysis/
- 1.calculate_17metrics.py - Compute 17 statistical metrics and perform Mann-Whitney U tests
- 2.barplot_17metrics.py - Visualize significant genes with bar plots
- 3.boxplot_17metrics.py - Generate detailed boxplots for individual genes
17 Statistical Metrics:
- Basic statistics (mean, std, median, max, CV, IQR, mode, percentiles)
- Distribution features (skewness, kurtosis)
- Complex metrics (top5%, low5%, peak count/density, Shannon entropy)
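A few of these metrics can be computed with the standard library alone. The sketch below covers only a subset (mean, std, median, max, CV, IQR, and a histogram-based Shannon entropy); the exact definitions used by 1.calculate_17metrics.py are not given here, so the bin count, the population standard deviation, and the quantile method are all assumptions.

```python
import math
import statistics

def summarize(scores, bins=10):
    """Compute a subset of per-sample attention metrics (illustrative definitions)."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)                      # population std, assumed
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")

    # Shannon entropy (bits) of an equal-width histogram of the scores.
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / bins or 1.0                      # avoid zero width
    counts = [0] * bins
    for s in scores:
        counts[min(int((s - lo) / width), bins - 1)] += 1
    probs = [c / len(scores) for c in counts if c]
    entropy = -sum(p * math.log2(p) for p in probs)

    return {
        "mean": mean,
        "std": std,
        "median": statistics.median(scores),
        "max": hi,
        "cv": std / mean if mean else float("nan"),
        "iqr": q3 - q1,
        "entropy": entropy,
    }

stats = summarize([1.0, 2.0, 3.0, 4.0], bins=4)
```

Each sample yields one value per metric, and it is these per-sample values that are compared between cohorts.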
Step 4: Base-level differential analysis
Objective: Identify and cluster pathogenic loci within candidate genes
Within candidate genes, ATLAS conducts base-level differential attention analysis to identify loci with significant attention shifts between cohorts. High-confidence loci are clustered into contiguous genomic regions, yielding candidate pathogenic regions at single-nucleotide and haplotype resolution.
Scripts:
- 4.base_analysis/analyze_base_signal.py - Perform log2 fold-change analysis at base resolution
- 4.base_analysis/cluster_base_signal.py - Cluster significant loci into contiguous regions
Analysis Methods:
- Log2 fold-change calculation between case and control attention scores
- Statistical significance testing at each genomic position
- Clustering of significant loci into candidate pathogenic regions
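The first of these methods can be illustrated as follows. This is a sketch under stated assumptions, not the logic of analyze_base_signal.py: it compares per-position cohort means, and the `pseudocount` parameter is a hypothetical guard against division by zero.

```python
import math
from statistics import fmean

def log2_fold_change(case_rows, control_rows, pseudocount=1e-9):
    """Per-position log2(mean case score / mean control score).

    Each row is one sample's attention scores; all rows share the same length.
    """
    n_positions = len(case_rows[0])
    result = []
    for i in range(n_positions):
        case_mean = fmean(row[i] for row in case_rows)
        control_mean = fmean(row[i] for row in control_rows)
        ratio = (case_mean + pseudocount) / (control_mean + pseudocount)
        result.append(math.log2(ratio))
    return result

# Toy data: position 0 receives 4x more attention in cases, position 1 is unchanged.
lfc = log2_fold_change(
    case_rows=[[0.4, 0.1], [0.4, 0.1]],
    control_rows=[[0.1, 0.1], [0.1, 0.1]],
)
```

Positive values indicate higher attention in cases, negative values higher attention in controls.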
- Python 3.8+
- CUDA-capable GPU (for attention extraction)
- bcftools (for VCF processing)
- Perl (for sequence extraction)
Install required Python packages:
pip install -r requirements.txt # If available
Download the pretrained genomic language model:
# Example: Genos-10B or similar genomic foundation model
# Place model in /path/to/model/
From the repository root:
cd ATLAS
cd 1.data_vcf2csv
# 1.1 Extract regions from VCF
python 1.vcf_extract.py \
--region-file regions.txt \
--vcf input.vcf \
--out-dir vcf_output \
--max-workers 8
# 1.2 Convert to labeled CSV
python 2.vcf2csv.py \
--region-file regions.txt \
--vcf-dir vcf_output \
--fasta GRCh38.fa \
--perl-script extract_variant_sequences_with_ref_loci.pl \
--sample-type-file sample_type.csv \
--out-dir labeled_csv \
--max-workers 8
Region file format: chr:start-end:strand gene (0-based start coordinate)
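For reference, one line of the region file can be parsed as in the sketch below. The `Region` type and `parse_region_line` are hypothetical helpers, and the example line (including the "+" strand token) is made up for illustration; the strand value is kept verbatim because its allowed spellings are not specified here.

```python
from typing import NamedTuple

class Region(NamedTuple):
    chrom: str
    start: int    # 0-based, as stated in the region file format
    end: int
    strand: str   # kept as written in the file
    gene: str

def parse_region_line(line: str) -> Region:
    """Parse 'chr:start-end:strand<whitespace>gene' into a Region."""
    locus, gene = line.split()
    # rsplit keeps any colons inside the contig name intact.
    chrom, span, strand = locus.rsplit(":", 2)
    start, end = span.split("-")
    return Region(chrom, int(start), int(end), strand, gene)

region = parse_region_line("chr1:1000-2000:+\tGENE1")   # hypothetical entry
start_1based = region.start + 1                          # for 1-based tools
```

Tools that expect 1-based coordinates need the start shifted by one, as in the last line; whether the end coordinate is inclusive is not stated in this document.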
cd ../2.attn_score
# Automatic strategy selection based on sequence length
python export_attention_matrix_unified.py \
--input_csv ../1.data_vcf2csv/labeled_csv/chr_start_end_forward_GENE.label.csv \
--model_path /path/to/model \
--vcf_file variants.vcf \
--output_dir ../results/attention_scores/GENE/ \
--strategy auto \
--gpu_id 0
Output:
- metadata.csv - Sample IDs and labels
- hap1_attention_collapsed.csv - Attention scores for haplotype 1
- hap2_attention_collapsed.csv - Attention scores for haplotype 2
- timing_stats.json - Performance statistics
cd ../3.gene_analysis
# 3.1 Calculate 17 metrics and perform statistical tests
python 1.calculate_17metrics.py
# 3.2 Generate bar plot visualization
python 2.barplot_17metrics.py
# 3.3 Generate boxplot for specific gene
python 3.boxplot_17metrics.py
Note: The scripts in 3.gene_analysis/ currently use hard-coded root_path and save_path variables. Before running, update them to point at your ATLAS/results/attention_scores/ directory (or ../results/attention_scores/ when running inside 3.gene_analysis/) and at the desired output folder.
Required directory structure:
ATLAS/results/attention_scores/
├── gene1/
│   ├── hap1_attention_collapsed.csv
│   ├── hap2_attention_collapsed.csv
│   └── metadata.csv
├── gene2/
│   └── ...
Output:
- 17stats_pvalues_with_BH_correction.csv - Statistical test results for all genes
- significant_genes_BH_17pvalue.png/pdf - Bar plots of significant genes
- {gene}_17metrics_hap1_box.pdf - Detailed boxplots
cd ../4.base_analysis
# 4.1 Perform log2 fold-change analysis
python analyze_base_signal.py \
--input ../results/attention_scores/candidate_gene/ \
--output ../results/base_analysis/
# 4.2 Cluster significant loci
python cluster_base_signal.py \
--input ../results/base_analysis/log2fc_scores.csv \
--output ../results/clustered_regions/ \
--threshold 0.05
Output:
- Log2 fold-change scores at each genomic position
- Clustered pathogenic regions with statistical significance
- Visualization of attention patterns across loci
ATLAS/
├── readme.md                      # This file
├── requirements.txt
│
├── 1.data_vcf2csv/                # Step 1: Sequence preparation
│   ├── 1.vcf_extract.py           # Extract regions from VCF (bcftools)
│   ├── 2.vcf2csv.py               # Convert per-region VCFs to labeled CSV
│   └── extract_variant_sequences_with_ref_loci.pl
│
├── 2.attn_score/                  # Step 2: Attention extraction
│   ├── export_attention_matrix_unified.py
│   └── calc_flash_attention_v2.py
│
├── 3.gene_analysis/               # Step 3: Gene-level differential analysis
│   ├── 1.calculate_17metrics.py
│   ├── 2.barplot_17metrics.py
│   └── 3.boxplot_17metrics.py
│
└── 4.base_analysis/               # Step 4: Base-level differential analysis
    ├── analyze_base_signal.py
    └── cluster_base_signal.py
Step 1 Input:
- VCF file with phased genotypes
- Region file: chr:start-end:strand gene (tab-separated)
- Reference genome FASTA (e.g., GRCh38)
- Sample metadata CSV with columns: sample,sample_type (0=control, 2=case)
Step 2 Input:
- Labeled sequence CSV from Step 1
- Pretrained genomic language model
- Optional: VCF file for haplotype information
Step 3 Input:
- Directory of gene folders, each containing hap1_attention_collapsed.csv, hap2_attention_collapsed.csv, and metadata.csv
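Since the Step 3 scripts expect this layout, a quick pre-flight check can save a failed run. The helper below is a hypothetical convenience, not part of the repository; the demonstration builds a throwaway directory so that it runs anywhere.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = (
    "hap1_attention_collapsed.csv",
    "hap2_attention_collapsed.csv",
    "metadata.csv",
)

def find_incomplete_genes(root):
    """Map each gene folder under root to the required files it is missing."""
    missing = {}
    for gene_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        absent = [name for name in REQUIRED_FILES if not (gene_dir / name).is_file()]
        if absent:
            missing[gene_dir.name] = absent
    return missing

# Demonstration on a temporary tree: gene1 is complete, gene2 is not.
with tempfile.TemporaryDirectory() as tmp:
    for name in REQUIRED_FILES:
        (Path(tmp) / "gene1").mkdir(exist_ok=True)
        (Path(tmp) / "gene1" / name).touch()
    (Path(tmp) / "gene2").mkdir()
    (Path(tmp) / "gene2" / "hap1_attention_collapsed.csv").touch()
    report = find_incomplete_genes(tmp)
```

An empty result means every gene folder contains all three files.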
Step 4 Input:
- Attention score files from Step 2
- Statistical results from Step 3 (for candidate gene selection)
Step 2 Output:
- CSV/Parquet files with attention scores per position
- Columns: sample, pos_1, pos_2, ..., pos_N
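A file with this column layout can be loaded with the standard library as sketched below. The helper name and the sample IDs in the demo are hypothetical, and only the CSV variant is handled; Parquet output would need a separate reader.

```python
import csv
import io

def read_attention_csv(handle):
    """Read 'sample,pos_1,...,pos_N' rows into {sample_id: [scores]}."""
    reader = csv.reader(handle)
    header = next(reader)
    if not header or header[0] != "sample":
        raise ValueError("first column must be 'sample'")
    return {row[0]: [float(v) for v in row[1:]] for row in reader if row}

# In-memory stand-in for a hapN_attention_collapsed.csv file.
demo = io.StringIO(
    "sample,pos_1,pos_2,pos_3\n"
    "S1,0.1,0.2,0.7\n"
    "S2,0.3,0.3,0.4\n"
)
scores_by_sample = read_attention_csv(demo)
```

With a real file, pass `open(path, newline="")` instead of the `StringIO` object.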
Step 3 Output:
- Statistical results CSV with 52 columns:
  - genename
  - 17 × {metric}_p (original P-values)
  - 17 × {metric}_p_corrected (BH-corrected P-values)
  - 17 × {metric}_significant (Boolean significance flags)
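The relationship between the raw and BH-corrected P-value columns can be shown with a short pure-Python version of the Benjamini-Hochberg step-up adjustment. This is a generic sketch; whether the pipeline corrects across genes, across metrics, or both is not specified here, and `statsmodels.stats.multitest.multipletests` with `method="fdr_bh"` is a common library alternative.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=pvalues.__getitem__)   # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]          # hypothetical raw p-values
corrected = benjamini_hochberg(raw)
significant = [p < 0.05 for p in corrected]
```

For the four hypothetical values above, the adjusted values are approximately 0.02, 0.04, 0.04, and 0.02, so all four remain below 0.05.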
Step 4 Output:
- Log2 fold-change scores per genomic position
- Clustered pathogenic regions with coordinates and P-values
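One simple way to turn significant positions into regions is to merge positions that lie within a fixed distance of each other. The sketch below is an assumption about the general approach, not a description of cluster_base_signal.py; `max_gap` is a hypothetical parameter, and the default of 1 merges only strictly adjacent positions.

```python
def cluster_positions(positions, max_gap=1):
    """Merge sorted positions into (start, end) regions.

    Two consecutive positions join the same region when they are at most
    max_gap apart.
    """
    clusters = []
    for pos in sorted(set(positions)):
        if clusters and pos - clusters[-1][1] <= max_gap:
            clusters[-1][1] = pos          # extend the current region
        else:
            clusters.append([pos, pos])    # start a new region
    return [tuple(c) for c in clusters]

significant_positions = [10, 11, 12, 20, 21, 40]      # hypothetical loci
strict = cluster_positions(significant_positions)
relaxed = cluster_positions(significant_positions, max_gap=10)
```

A larger `max_gap` tolerates short runs of non-significant bases inside a region, at the cost of coarser boundaries.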
- Short sequences (≤4K): Use vanilla strategy for maximum speed
- Medium sequences (4K-131K): Use FlashAttention-based strategy for memory efficiency
- Long sequences (>131K): Use chunked strategy with appropriate overlap
- Memory-constrained: Reduce --block_rows or --chunk_size parameters
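To see how chunk size and overlap interact, the sketch below enumerates sliding windows over a long sequence. It illustrates the general technique only: the function and its `overlap` argument are hypothetical, and how ATLAS merges scores from overlapping windows is not described here.

```python
def chunk_windows(seq_len, chunk_size, overlap):
    """Return half-open (start, end) windows covering [0, seq_len)."""
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    windows = []
    start = 0
    while True:
        end = min(start + chunk_size, seq_len)
        windows.append((start, end))
        if end == seq_len:                 # last window reaches the sequence end
            break
        start += step
    return windows

windows = chunk_windows(seq_len=10, chunk_size=4, overlap=1)
```

Halving `chunk_size` roughly doubles the number of windows, trading runtime for lower peak memory.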
We would like to acknowledge the Human Pangenome Reference Consortium (BioProject ID: PRJNA730823) and its funder, the National Human Genome Research Institute (NHGRI).
This project is released under the MIT License.
