Version: 1.0.0
Code for the paper: Machine Learning-Based Prediction of Cell-type Resolved Brain eQTLs Enhances Discovery of Variants Explaining Alzheimer's Disease Heritability
Preprint (medRxiv): https://doi.org/10.64898/2025.12.03.25341562
Authors: Chirag M Lakhani, Giacomo Cavalca, Anjing Liu, Rohan Nidumbur, Ru Feng, Towfique Raj, Philip De Jager, The Alzheimer's Disease Functional Genomics Consortium, Gao Wang, David A. Knowles
scEEMS (single-cell Expression-mediated Effect Modeling System) is a CatBoost-based framework that predicts whether genetic variants are expression quantitative trait loci (eQTLs) in specific brain cell types. The model integrates variant-level genomic annotations (Enformer, ChromBPNet) with gene-level conservation features (GeneBayes gene conservation scores) and cell-type-specific regulatory features (ABC enhancer scores, transcription factor binding) to generate per-variant, per-gene, per-cell-type eQTL probabilities.
These predictions are used for:
- LDSC partitioned heritability analysis to quantify enrichment of Alzheimer's disease heritability
- MAGMA gene-set enrichment to identify disease-relevant gene sets
- Informed fine-mapping priors to improve SuSiE credible set resolution
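To make the third use concrete: susieR accepts per-variant prior weights through its prior_weights argument, and one plausible way to derive them from scEEMS probabilities is to normalize the scores within a locus. The following is a minimal sketch under that assumption, not the repository's implementation; the function name, the `floor` parameter, and the example scores are hypothetical.

```python
def normalize_prior_weights(scores, floor=1e-6):
    """Turn per-variant scores into prior weights that sum to 1.

    `floor` (hypothetical) keeps every variant at a small non-zero prior so
    that no variant is excluded outright.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    clipped = [max(float(s), floor) for s in scores]
    total = sum(clipped)
    return [c / total for c in clipped]


# Hypothetical eQTL probabilities for four variants in one locus.
weights = normalize_prior_weights([0.80, 0.10, 0.05, 0.0])
print([round(w, 4) for w in weights])
```

The floor is a design choice: it lets a variant with a zero predicted probability still enter a credible set if the association data strongly support it.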
- Operating system (tested): Ubuntu 22.04.5 LTS (GNU/Linux 5.15; NYGC cluster login node)
- Python: 3.9.18 (conda)
- R: 4.3.3
- Core Python packages: catboost, numpy, pandas, scipy, scikit-learn, pyarrow, dask, shap, pybedtools, pysam, pyranges, synapseclient, pyyaml
- Core R packages: susieR, tidyverse, readr, yaml, pecotmr
- External tools (configured via config.yaml):
  - bedtools (v2.31.1)
  - MAGMA (gene-set analysis binary; de Leeuw et al., 2015, https://doi.org/10.1371/journal.pcbi.1004219; E-MAGMA extension: Gerring et al., 2021, https://doi.org/10.1093/bioinformatics/btab115)
  - PolyFun/LDSC (LD score regression; Weissbrod et al., 2020, https://doi.org/10.1038/s41588-020-00735-5)
  - PLINK reference panel (for LD score computation)
- External data (configured via config.yaml): Enformer/ChromBPNet variant effect predictions, ABC enhancer scores, baseline LD annotations, gnomAD MAF data, TF binding files, GeneBayes gene conservation scores, target lists
- Synapse data for model training: download required training/test inputs from the Synapse project and use those paths in config.yaml (see Additional Information).
The pipeline was run on the NYGC cluster with Ubuntu 22.04.5 LTS, Python 3.9.18, and R 4.3.3. The exact conda environment is recorded in conda_environment_full.txt; a minimal dependency list is provided in environment.yml and requirements.txt.
- No GPU required for the main pipeline.
- Full pipeline: requires a high-memory multi-core environment. Typical runs use 5–20 CPU cores and 30–200 GB RAM depending on step.
- Small-scale/demo runs: can be executed on a standard desktop (8–32 GB RAM) by restricting to a single cell type and chromosome.
| Step | Directory | Description |
|---|---|---|
| 1 | 1_process_datasets/ | Internal preprocessing of raw fine-mapping RDS files (not publicly distributed) |
| 2 | 2_annotate_variants/ | Annotate variants with genomic features (Enformer, ChromBPNet) |
| 3 | 3_cell_featurization/ | Create per-gene feature matrices with cell-type-specific annotations |
| 4 | 4_create_training_data/ | Sample positive/negative training examples using PIP thresholds |
| 5 | 5_model_training/ | Train CatBoost classifiers with leave-one-chromosome-out CV |
| 6 | 6_model_inference/ | Score all variants genome-wide using trained models |
| 7 | 7_aggregate_predictions/ | Combine per-gene predictions into chromosome-level files |
| 8 | 8_shap_analysis/ | Compute SHAP values for model interpretability |
| 9 | 9_create_annotations/ | Create LDSC annotations and run partitioned heritability analysis |
| 10 | 10_magma_analysis/ | Run MAGMA gene-set enrichment analysis |
| 11 | 11_fine_mapping_new_priors/ | Re-run SuSiE fine-mapping with scEEMS-informed priors |
Each step directory contains a README.md with detailed documentation and a run_pipeline.sh example script.
For provenance of how preprocessing files were created (and which upstream inputs are non-public), see DATA_PROCESSING_NOTES.md.
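Step 4 in the table above samples positive and negative training examples using PIP (posterior inclusion probability) thresholds. The sketch below illustrates the general idea only: the thresholds (0.5 and 0.01), the negative-to-positive ratio, and the function name are illustrative placeholders, not values taken from the repository. See 4_create_training_data/README.md for the actual procedure.

```python
import random


def sample_training_examples(pips, pos_threshold=0.5, neg_threshold=0.01,
                             neg_per_pos=10, seed=0):
    """Label variants by PIP and subsample negatives.

    pips: dict mapping variant ID -> posterior inclusion probability.
    All threshold and ratio defaults are hypothetical placeholders.
    Returns (positives, negatives) as sorted lists of variant IDs.
    """
    positives = sorted(v for v, p in pips.items() if p >= pos_threshold)
    candidates = sorted(v for v, p in pips.items() if p <= neg_threshold)
    # Variants between the two thresholds are ambiguous and left out.
    n_neg = min(len(candidates), neg_per_pos * len(positives))
    rng = random.Random(seed)
    negatives = sorted(rng.sample(candidates, n_neg))
    return positives, negatives


# Toy PIPs: one confident eQTL, one ambiguous variant, five likely nulls.
toy_pips = {"v1": 0.95, "v2": 0.20, "v3": 0.001, "v4": 0.0, "v5": 0.005,
            "v6": 0.002, "v7": 0.009}
pos, neg = sample_training_examples(toy_pips, neg_per_pos=3)
print(pos, len(neg))
```

Fixing the random seed keeps the sampled negatives reproducible across runs.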
git clone https://github.com/daklab/scEEMS.git
cd scEEMS
cp config.yaml.example config.yaml
# Edit config.yaml with your paths and credentials
conda env create -f environment.yml
conda activate scEEMS
Alternatively, for pip users:
pip install -r requirements.txt
The file conda_environment_full.txt records the exact conda environment used during development.
R packages (for Steps 1 and 11):
install.packages(c("susieR", "pecotmr", "tidyverse", "readr", "yaml"))
Typical install time: ~15–30 minutes on a normal desktop (longer if conda needs to download many packages).
Publicly downloadable Synapse resources in this repository:
- model_training (Project: syn72248754): train/test parquet data and supporting files (columns_dict, gene_lof, gnomad_MAF)
- predictions (Folder: syn71338354): per-cell-type prediction outputs
Download with the helper script:
cp config.yaml.example config.yaml
python download_synapse_data.py --resource all
Or selectively:
python download_synapse_data.py --resource model_training
python download_synapse_data.py --resource predictions
By default, files are downloaded to {paths.data_dir}/synapse_public/.
This public demo runs one cell type (Microglia) and one chromosome (chr2 only) for model training (Step 5).
Use 5_model_training/README_minimal_chr2_demo.md for:
- exact required files
- exact download URLs
- parquet-directory setup (part.*.parquet files)
cp config.yaml.example config.yaml
# Set:
# - paths.data_dir to your demo root (with chr2 train/test parquet directories)
# - paths.gnomad_maf_dir to where gnomad_MAF_chr2.tsv is stored
# - paths.columns_dict_file to columns_dict.pkl
cd 5_model_training
bash run_minimal_training.sh \
Mic_mega_eQTL \
2 \
"<AUX_ROOT>/41588_2024_1820_MOESM4_ESM.xlsx"
Equivalent direct command:
cd 5_model_training
python train_model.py Mic_mega_eQTL 2 \
--gene_lof_file "<AUX_ROOT>/41588_2024_1820_MOESM4_ESM.xlsx" \
--yaml_path data_params.yaml \
--single_chromosome_demo
- Model file: {data_dir}/training_data/Mic_mega_eQTL/model_results/model_standard_subset_conservative_weighted_chr_chr2_NPR_10.joblib
- Feature importance CSVs
- Test set predictions
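To confirm that the demo produced its model file, the expected path can be assembled and checked from Python. This is a small convenience sketch, not part of the repository: `demo_model_path` is a hypothetical helper, and "/path/to/data" stands in for whatever paths.data_dir is set to in config.yaml.

```python
from pathlib import Path


def demo_model_path(data_dir, dataset="Mic_mega_eQTL"):
    """Build the expected model path for the chr2 demo run."""
    filename = "model_standard_subset_conservative_weighted_chr_chr2_NPR_10.joblib"
    return Path(data_dir) / "training_data" / dataset / "model_results" / filename


# "/path/to/data" is a placeholder for your paths.data_dir value.
model_path = demo_model_path("/path/to/data")
print(model_path.name, "exists" if model_path.exists() else "not found yet")
```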
- Normal desktop (8–16 GB RAM, 4–8 cores): ~1–3 hours for a single chromosome (varies by data size).
- Cluster: a typical run for this step uses about 5 CPU cores, 50 GB RAM, and up to 5 hours.
For full multi-chromosome training, download the full Synapse model_training resource (syn72248754) and use the standard training workflow.
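Step 5 is described as training with leave-one-chromosome-out cross-validation. As a generic illustration of that scheme (not the code in train_model.py), each fold holds out every variant on one chromosome and trains on the rest, so that variants in linkage disequilibrium never straddle the train/test split. The function name and toy rows below are hypothetical.

```python
def leave_one_chromosome_out(chromosomes):
    """Yield (held_out, training_chromosomes) pairs, one fold per chromosome."""
    unique = sorted(set(chromosomes))
    for held_out in unique:
        train = [c for c in unique if c != held_out]
        yield held_out, train


# Hypothetical rows of (variant ID, chromosome); real data would come from
# the train/test parquet files.
rows = [("v1", 1), ("v2", 1), ("v3", 2), ("v4", 3)]
for held_out, train_chroms in leave_one_chromosome_out(c for _, c in rows):
    train_ids = [v for v, c in rows if c in train_chroms]
    test_ids = [v for v, c in rows if c == held_out]
    print(f"hold out chr{held_out}: train={train_ids} test={test_ids}")
```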
Each step can be run independently. See the README.md in each directory for specific instructions.
Note: Step 1 uses internal raw inputs that are not publicly distributed.
For public workflows, start from Synapse model_training / predictions data and run downstream steps.
cd 5_model_training && bash run_pipeline.sh
cd ../6_model_inference && bash run_pipeline.sh
# ... and so on
If you are running on a SLURM cluster, use the run_jobs.sh scripts provided in step directories where available and adapt resource requests (paths, partitions, and time limits) to your environment.
To reproduce all manuscript results exactly, run the full pipeline in order (Steps 1–11) using the same external datasets and annotations described in each step’s README.md.
Note: Step 1 raw fine-mapping RDS inputs are not publicly distributed in this repository. Public users can run training/inference workflows from the Synapse model_training and predictions resources.
All paths and credentials are managed through config.yaml. Copy config.yaml.example and fill in your values:
paths:
  data_dir: "/path/to/data"
  output_dir: "/path/to/output"
  fine_mapping_dir: "/path/to/fine_mapping"
  scratch_dir: "/tmp"
credentials:
  synapse_token: ""
data_sources:
  model_training_synapse_id: "syn72248754"
  predictions_synapse_id: "syn71338354"
Step 2 requires pre-computed variant effect predictions. See 2_annotate_variants/README.md for details on obtaining:
- Enformer variant effect predictions (Avsec et al., 2021)
- ChromBPNet cell-type-specific accessibility predictions (Pampari et al., 2025; https://doi.org/10.1101/2024.12.25.630221)
- GeneBayes gene conservation scores (Zeng et al., 2024; https://doi.org/10.1038/s41588-024-01820-9)
- Baseline LD annotations (Gazal et al., 2017; https://doi.org/10.1038/ng.3954)
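Before running any step, it can help to confirm that config.yaml defines the path keys shown in the configuration example above. The check below is a sketch that operates on an already-parsed configuration dictionary. Treating all four keys as required is an assumption, and `missing_path_keys` is a hypothetical helper, not a function from the repository.

```python
# Keys taken from the config.yaml example; "required" is an assumption.
REQUIRED_PATH_KEYS = ("data_dir", "output_dir", "fine_mapping_dir", "scratch_dir")


def missing_path_keys(config):
    """Return the required paths.* keys that are absent or empty in `config`."""
    paths = config.get("paths") or {}
    return [key for key in REQUIRED_PATH_KEYS if not paths.get(key)]


# A partially filled config, as it might look after parsing config.yaml.
partial_config = {"paths": {"data_dir": "/path/to/data", "scratch_dir": "/tmp"}}
print(missing_path_keys(partial_config))
```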
The pipeline supports the following brain cell types:
| Abbreviation | Cell Type |
|---|---|
| Ast | Astrocytes |
| Exc | Excitatory neurons |
| Inh | Inhibitory neurons |
| Mic | Microglia |
| Oli | Oligodendrocytes |
| OPC | Oligodendrocyte precursor cells |
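The dataset names used elsewhere in this README (for example Mic_mega_eQTL) combine a cell-type abbreviation from the table above with the suffix _mega_eQTL. A small lookup like the following can guard against typos when scripting over cell types. It is an illustrative sketch, and `dataset_name` is a hypothetical helper rather than a repository function.

```python
CELL_TYPES = {
    "Ast": "Astrocytes",
    "Exc": "Excitatory neurons",
    "Inh": "Inhibitory neurons",
    "Mic": "Microglia",
    "Oli": "Oligodendrocytes",
    "OPC": "Oligodendrocyte precursor cells",
}


def dataset_name(abbreviation):
    """Map a cell-type abbreviation to its '<abbr>_mega_eQTL' dataset name."""
    if abbreviation not in CELL_TYPES:
        raise ValueError(f"unknown cell type: {abbreviation!r}")
    return f"{abbreviation}_mega_eQTL"


for abbr, full_name in CELL_TYPES.items():
    print(f"{dataset_name(abbr)}: {full_name}")
```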
- Data access: Publicly distributed inputs are in Synapse model_training (syn72248754) and predictions (syn71338354). Step 1 raw fine-mapping RDS inputs are not publicly distributed in this repo. Step 2 requires large annotation datasets that are not distributed with this repo.
- Synapse data access: Data access requires a Synapse account and may require approval for controlled-access datasets.
- Synapse model training data (Project: syn72248754):

  Training data (per cell type):

  | Cell Type | Synapse ID |
  |---|---|
  | Astrocytes (Ast) | syn72248805 |
  | Excitatory neurons (Exc) | syn72248803 |
  | Inhibitory neurons (Inh) | syn72248804 |
  | Microglia (Mic) | syn72248802 |
  | OPC | syn72248807 |
  | Oligodendrocytes (Oli) | syn72248806 |

  Test data (per cell type):

  | Cell Type | Synapse ID |
  |---|---|
  | Astrocytes (Ast) | syn72248942 |
  | Excitatory neurons (Exc) | syn72248940 |
  | Inhibitory neurons (Inh) | syn72248941 |
  | Microglia (Mic) | syn72248939 |
  | OPC | syn72248944 |
  | Oligodendrocytes (Oli) | syn72248943 |

  Supporting files:

  | File | Synapse ID |
  |---|---|
  | README_model_training.txt | syn72249887 |
  | columns_dict | syn72248797 |
  | gene_lof | syn72248800 |
  | gnomad_MAF | syn72248799 |

  Predictions root folder: syn71338354

  Predictions by cell type:

  | Resource | Synapse ID |
  |---|---|
  | Ast_mega_eQTL | syn71338358 |
  | Exc_mega_eQTL | syn71338356 |
  | Inh_mega_eQTL | syn71338357 |
  | Mic_mega_eQTL | syn71338355 |
  | OPC_mega_eQTL | syn71338360 |
  | Oli_mega_eQTL | syn71338359 |
  | README_predictions.txt | syn71338459 |
- Storage: Full pipeline outputs are large (multi-terabyte scale depending on cohorts and annotations). Ensure adequate disk space in paths.output_dir.
- Compute: The pipeline is designed for HPC. For small tests or demos, limit to a single cohort and chromosome to reduce runtime and memory usage.
- Install and authenticate the Synapse client.
- Download public model_training and/or predictions resources by Synapse ID.
Project URLs:
https://www.synapse.org/Synapse:syn72248754
https://www.synapse.org/Synapse:syn71338354
Example commands:
pip install synapseclient
synapse login
# Download model_training project (-r recurses into the project's folders)
synapse get -r syn72248754
# Download predictions folder (-r recurses into its subfolders)
synapse get -r syn71338354
# Or download specific model_training folders
synapse get syn72248797 # columns_dict
synapse get syn72248800 # gene_lof
synapse get syn72248799 # gnomad_MAF
synapse get syn72248798 # train
synapse get syn72248801 # test
Raw Step 1 fine-mapping RDS files are not publicly distributed in this repository.
If you use this code, please cite:
@article{lakhani2025sceems,
title={Machine Learning-Based Prediction of Cell-type Resolved Brain eQTLs Enhances Discovery of Variants Explaining Alzheimer's Disease Heritability},
author={Lakhani, Chirag M and Cavalca, Giacomo and Liu, Anjing and Nidumbur, Rohan and Feng, Ru and Raj, Towfique and De Jager, Philip and The Alzheimer's Disease Functional Genomics Consortium and Wang, Gao and Knowles, David A.},
journal={medRxiv},
year={2025},
doi={10.64898/2025.12.03.25341562},
url={https://doi.org/10.64898/2025.12.03.25341562}
}
This project is licensed under the MIT License - see the LICENSE file for details.