scEEMS: Machine Learning-Based Prediction of Cell-Type-Resolved Brain eQTLs

Version: 1.0.0

Code for the paper: Machine Learning-Based Prediction of Cell-type Resolved Brain eQTLs Enhances Discovery of Variants Explaining Alzheimer's Disease Heritability

Preprint (medRxiv): https://doi.org/10.64898/2025.12.03.25341562

Authors: Chirag M Lakhani, Giacomo Cavalca, Anjing Liu, Rohan Nidumbur, Ru Feng, Towfique Raj, Philip De Jager, The Alzheimer's Disease Functional Genomics Consortium, Gao Wang, David A. Knowles

Overview

scEEMS (single-cell Expression-mediated Effect Modeling System) is a CatBoost-based framework that predicts whether genetic variants are expression quantitative trait loci (eQTLs) in specific brain cell types. The model integrates variant-level genomic annotations (Enformer, ChromBPNet), gene-level conservation scores (GeneBayes), and cell-type-specific regulatory features (ABC enhancer scores, transcription factor binding) to produce an eQTL probability for each variant, gene, and cell type.

These predictions are used for:

LDSC partitioned heritability analysis to quantify enrichment of Alzheimer's disease heritability
MAGMA gene-set enrichment to identify disease-relevant gene sets
Informed fine-mapping priors to improve SuSiE credible set resolution
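
As a rough illustration of the third use, the sketch below turns per-variant eQTL probabilities for one locus into prior weights that sum to 1, which is the form taken by the `prior_weights` argument of susieR's `susie` functions. The proportional normalization, the `floor` parameter, and the function name are assumptions made for illustration; they are not necessarily the procedure used in Step 11.

```python
from typing import List, Sequence


def probabilities_to_prior_weights(
    probs: Sequence[float], floor: float = 1e-6
) -> List[float]:
    """Normalize per-variant probabilities into prior weights summing to 1.

    `floor` (hypothetical) keeps every variant at a small non-zero prior so
    that no variant is excluded from fine-mapping outright.
    """
    if not probs:
        raise ValueError("need at least one variant")
    if any(p < 0.0 or p > 1.0 for p in probs):
        raise ValueError("probabilities must lie in [0, 1]")
    floored = [max(p, floor) for p in probs]
    total = sum(floored)
    return [p / total for p in floored]


if __name__ == "__main__":
    # Three variants in one locus; the first has a high predicted probability.
    print(probabilities_to_prior_weights([0.9, 0.05, 0.05]))
```

The resulting vector could then be written out and passed to SuSiE from R.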

System Requirements

Software Dependencies

Operating system (tested): Ubuntu 22.04.5 LTS (GNU/Linux 5.15; NYGC cluster login node)
Python: 3.9.18 (conda)
R: 4.3.3
Core Python packages: catboost, numpy, pandas, scipy, scikit-learn, pyarrow, dask, shap, pybedtools, pysam, pyranges, synapseclient, pyyaml
Core R packages: susieR, tidyverse, readr, yaml, pecotmr
External tools (configured via config.yaml):
- bedtools (v2.31.1)
- MAGMA (gene-set analysis binary; de Leeuw et al., 2015, https://doi.org/10.1371/journal.pcbi.1004219; E-MAGMA extension: Gerring et al., 2021, https://doi.org/10.1093/bioinformatics/btab115)
- PolyFun/LDSC (LD score regression; Weissbrod et al., 2020, https://doi.org/10.1038/s41588-020-00735-5)
- PLINK reference panel (for LD score computation)
External data (configured via config.yaml): Enformer/ChromBPNet variant effect predictions, ABC enhancer scores, baseline LD annotations, gnomAD MAF data, TF binding files, GeneBayes gene conservation scores, target lists
Synapse data for model training: download required training/test inputs from the Synapse project and use those paths in config.yaml (see Additional Information).

Versions Tested

The pipeline was run on the NYGC cluster with Ubuntu 22.04.5 LTS, Python 3.9.18, and R 4.3.3. The exact conda environment is recorded in conda_environment_full.txt; a minimal dependency list is provided in environment.yml and requirements.txt.

Hardware

No GPU required for the main pipeline.
Full pipeline: requires a high-memory multi-core environment. Typical runs use 5–20 CPU cores and 30–200 GB RAM, depending on the step.
Small-scale/demo runs: can be executed on a standard desktop (8–32 GB RAM) by restricting to a single cell type and chromosome.

Pipeline

Step	Directory	Description
1	`1_process_datasets/`	Internal preprocessing of raw fine-mapping RDS files (not publicly distributed)
2	`2_annotate_variants/`	Annotate variants with genomic features (Enformer, ChromBPNet)
3	`3_cell_featurization/`	Create per-gene feature matrices with cell-type-specific annotations
4	`4_create_training_data/`	Sample positive/negative training examples using PIP thresholds
5	`5_model_training/`	Train CatBoost classifiers with leave-one-chromosome-out CV
6	`6_model_inference/`	Score all variants genome-wide using trained models
7	`7_aggregate_predictions/`	Combine per-gene predictions into chromosome-level files
8	`8_shap_analysis/`	Compute SHAP values for model interpretability
9	`9_create_annotations/`	Create LDSC annotations and run partitioned heritability analysis
10	`10_magma_analysis/`	Run MAGMA gene-set enrichment analysis
11	`11_fine_mapping_new_priors/`	Re-run SuSiE fine-mapping with scEEMS-informed priors

Each step directory contains a README.md with detailed documentation and a run_pipeline.sh example script. For provenance of how preprocessing files were created (and which upstream inputs are non-public), see DATA_PROCESSING_NOTES.md.
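
Step 4 selects positive and negative training examples by thresholding fine-mapping posterior inclusion probabilities (PIPs). The sketch below shows the general idea only. The threshold values, the negative-to-positive `ratio`, and both function names are hypothetical; the values actually used are documented in `4_create_training_data/README.md`.

```python
import random
from typing import Dict, List, Tuple


def label_by_pip(
    pips: Dict[str, float],
    pos_threshold: float = 0.5,   # hypothetical value
    neg_threshold: float = 0.01,  # hypothetical value
) -> Tuple[List[str], List[str]]:
    """Split variants into positives (high PIP) and candidate negatives (low PIP).

    Variants whose PIP falls between the two thresholds are left unlabeled.
    """
    positives = sorted(v for v, p in pips.items() if p >= pos_threshold)
    negatives = sorted(v for v, p in pips.items() if p <= neg_threshold)
    return positives, negatives


def sample_negatives(
    negatives: List[str], n_positives: int, ratio: int = 10, seed: int = 0
) -> List[str]:
    """Draw up to `ratio` negatives per positive, reproducibly."""
    rng = random.Random(seed)
    k = min(len(negatives), n_positives * ratio)
    return sorted(rng.sample(negatives, k))


if __name__ == "__main__":
    demo = {"v1": 0.9, "v2": 0.3, "v3": 0.001, "v4": 0.0, "v5": 0.6}
    pos, neg = label_by_pip(demo)
    print(pos, sample_negatives(neg, len(pos)))
```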

Installation Guide

1. Clone and configure

git clone https://github.com/daklab/scEEMS.git
cd scEEMS
cp config.yaml.example config.yaml
# Edit config.yaml with your paths and credentials

2. Install dependencies (conda recommended)

conda env create -f environment.yml
conda activate scEEMS

Alternatively, for pip users:

pip install -r requirements.txt

The file conda_environment_full.txt records the exact conda environment used during development.

R packages (for Steps 1 and 11):

install.packages(c("susieR", "pecotmr", "tidyverse", "readr", "yaml"))

Typical install time: ~15–30 minutes on a normal desktop (longer if conda needs to download many packages).

Public Data Download

Publicly downloadable Synapse resources for this repository:

model_training (Project: syn72248754): train/test parquet data and supporting files (columns_dict, gene_lof, gnomad_MAF)
predictions (Folder: syn71338354): per-cell-type prediction outputs

Download with the helper script:

cp config.yaml.example config.yaml  # skip if you already created config.yaml during installation
python download_synapse_data.py --resource all

Or selectively:

python download_synapse_data.py --resource model_training
python download_synapse_data.py --resource predictions

By default, files are downloaded to {paths.data_dir}/synapse_public/.

Minimal Training (Chromosome 2 Only)

This public demo runs model training (Step 5) for one cell type (microglia) on a single chromosome (chr2).

1. Download chr2 demo files

Use 5_model_training/README_minimal_chr2_demo.md for:

exact required files
exact download URLs
parquet-directory setup (part.*.parquet files)
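
Before training, it can help to confirm that each parquet directory actually contains `part.*.parquet` files. The helper below is a minimal sketch using only the standard library; the function name is hypothetical, and the directory layout itself is defined in `README_minimal_chr2_demo.md`.

```python
from pathlib import Path
from typing import List, Union


def list_parquet_parts(directory: Union[str, Path]) -> List[Path]:
    """Return the sorted part.*.parquet files in `directory`, or raise if none."""
    d = Path(directory)
    if not d.is_dir():
        raise NotADirectoryError(f"{d} is not a directory")
    parts = sorted(d.glob("part.*.parquet"))
    if not parts:
        raise FileNotFoundError(f"no part.*.parquet files found in {d}")
    return parts
```

Calling `list_parquet_parts` on the chr2 train and test directories before launching the demo surfaces a misplaced download early rather than partway through training.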

2. Configure paths

cp config.yaml.example config.yaml  # skip if config.yaml already exists
# Set:
# - paths.data_dir to your demo root (with chr2 train/test parquet directories)
# - paths.gnomad_maf_dir to where gnomad_MAF_chr2.tsv is stored
# - paths.columns_dict_file to columns_dict.pkl

3. Run the chr2 demo

cd 5_model_training
bash run_minimal_training.sh \
  Mic_mega_eQTL \
  2 \
  "<AUX_ROOT>/41588_2024_1820_MOESM4_ESM.xlsx"

Equivalent direct command (in both forms, <AUX_ROOT> is a placeholder for the directory that holds the gene_lof spreadsheet):

cd 5_model_training
python train_model.py Mic_mega_eQTL 2 \
  --gene_lof_file "<AUX_ROOT>/41588_2024_1820_MOESM4_ESM.xlsx" \
  --yaml_path data_params.yaml \
  --single_chromosome_demo

Expected output

Model file: {data_dir}/training_data/Mic_mega_eQTL/model_results/model_standard_subset_conservative_weighted_chr_chr2_NPR_10.joblib
Feature importance CSVs
Test set predictions
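
To check that the demo produced its model file, the path above can be rebuilt and tested for existence. This is a small sketch: the function name is hypothetical, and treating the cell type and chromosome as the only variable parts of the file name is an assumption based on the single example path shown here.

```python
from pathlib import Path
from typing import Union

# File name copied from the expected output above; only the chromosome varies
# in this sketch (an assumption, not a documented naming rule).
MODEL_NAME_TEMPLATE = (
    "model_standard_subset_conservative_weighted_chr_chr{chrom}_NPR_10.joblib"
)


def expected_model_path(
    data_dir: Union[str, Path], resource: str = "Mic_mega_eQTL", chrom: int = 2
) -> Path:
    """Build the expected model path under {data_dir}/training_data/."""
    return (
        Path(data_dir)
        / "training_data"
        / resource
        / "model_results"
        / MODEL_NAME_TEMPLATE.format(chrom=chrom)
    )


if __name__ == "__main__":
    path = expected_model_path("/path/to/data")
    print(path, "exists" if path.exists() else "is missing")
```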

Expected runtime

Normal desktop (8–16 GB RAM, 4–8 cores): ~1–3 hours for a single chromosome (varies by data size).
Cluster: a typical run for this step uses about 5 CPU cores, 50 GB RAM, and up to 5 hours.

For full multi-chromosome training, download the full Synapse model_training resource (syn72248754) and use the standard training workflow.
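
Full training uses leave-one-chromosome-out cross-validation (Step 5): each fold holds out every variant on one chromosome and trains on the rest. The generator below sketches that splitting scheme with the standard library only. The function names are hypothetical, and the real implementation in `5_model_training/` may differ.

```python
from typing import Iterator, List, Sequence, Tuple


def _chrom_key(chrom: str) -> Tuple[int, int, str]:
    """Sort chr1, chr2, ..., chr10 numerically, then non-numeric labels."""
    label = chrom[3:] if chrom.startswith("chr") else chrom
    return (0, int(label), "") if label.isdigit() else (1, 0, label)


def leave_one_chromosome_out(
    chroms: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_chromosome, train_indices, test_indices) per fold.

    `chroms[i]` is the chromosome of the i-th training example.
    """
    for held_out in sorted(set(chroms), key=_chrom_key):
        test_idx = [i for i, c in enumerate(chroms) if c == held_out]
        train_idx = [i for i, c in enumerate(chroms) if c != held_out]
        yield held_out, train_idx, test_idx


if __name__ == "__main__":
    for fold in leave_one_chromosome_out(["chr1", "chr2", "chr1", "chr10"]):
        print(fold)
```

Splitting by chromosome rather than at random keeps variants in linkage disequilibrium with each other on the same side of the train/test boundary.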

Instructions For Use

Run the pipeline

Each step can be run independently. See the README.md in each directory for specific instructions.

Note: Step 1 uses internal raw inputs that are not publicly distributed. For public workflows, start from Synapse model_training / predictions data and run downstream steps.

cd 5_model_training && bash run_pipeline.sh
cd ../6_model_inference && bash run_pipeline.sh
# ... and so on
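
The same sequence can be scripted. One detail to watch is that the step directories sort incorrectly as plain strings (`10_magma_analysis` comes before `1_process_datasets`), so the sketch below orders them by their numeric prefix before calling each `run_pipeline.sh`. The function names and the `first_step` parameter are hypothetical.

```python
import re
import subprocess
from pathlib import Path
from typing import Iterable, List, Union

STEP_DIR = re.compile(r"^(\d+)_\w+$")


def ordered_step_dirs(names: Iterable[str]) -> List[str]:
    """Keep names of the form '<number>_<label>' and sort by the number."""
    steps = []
    for name in names:
        match = STEP_DIR.match(name)
        if match:
            steps.append((int(match.group(1)), name))
    return [name for _, name in sorted(steps)]


def run_steps(repo_root: Union[str, Path], first_step: int = 5) -> None:
    """Run run_pipeline.sh in each step directory from `first_step` onward."""
    root = Path(repo_root)
    names = [p.name for p in root.iterdir() if p.is_dir()]
    for name in ordered_step_dirs(names):
        if int(name.split("_", 1)[0]) < first_step:
            continue
        # check=True stops at the first failing step.
        subprocess.run(["bash", "run_pipeline.sh"], cwd=root / name, check=True)
```

Starting at Step 5 matches the public workflow, since the earlier steps depend on inputs that are not distributed.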

Cluster/HPC execution

If you are running on a SLURM cluster, use the run_jobs.sh scripts provided in step directories where available, and adapt the paths and resource requests (partitions and time limits) to your environment.
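
Where a step has no run_jobs.sh, a submission command can be assembled by hand. The sketch below builds an `sbatch` argument list; the defaults mirror the cluster figures quoted for Step 5 (5 CPU cores, 50 GB RAM, up to 5 hours), while the function name and the choice to submit `run_pipeline.sh` directly are assumptions.

```python
from typing import List, Optional


def sbatch_command(
    script: str,
    cpus: int = 5,
    mem_gb: int = 50,
    hours: int = 5,
    partition: Optional[str] = None,
) -> List[str]:
    """Return an sbatch command line as a list suitable for subprocess.run."""
    cmd = [
        "sbatch",
        f"--cpus-per-task={cpus}",
        f"--mem={mem_gb}G",
        f"--time={hours:02d}:00:00",
    ]
    if partition:
        cmd.append(f"--partition={partition}")
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    print(" ".join(sbatch_command("run_pipeline.sh")))
```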

(Optional) Reproduction Instructions

To reproduce all manuscript results exactly, run the full pipeline in order (Steps 1–11) using the same external datasets and annotations described in each step’s README.md.

Note: Step 1 raw fine-mapping RDS inputs are not publicly distributed in this repository. Public users can run training/inference workflows from the Synapse model_training and predictions resources.

Configuration

All paths and credentials are managed through config.yaml. Copy config.yaml.example and fill in your values:

paths:
  data_dir: "/path/to/data"
  output_dir: "/path/to/output"
  fine_mapping_dir: "/path/to/fine_mapping"
  scratch_dir: "/tmp"

credentials:
  synapse_token: ""

data_sources:
  model_training_synapse_id: "syn72248754"
  predictions_synapse_id: "syn71338354"
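
A quick sanity check can catch placeholder values left over from config.yaml.example. The sketch below works on an already-parsed config (for instance the dictionary returned by `yaml.safe_load`). The function name is hypothetical, and treating exactly these four `paths` keys and the Synapse token as required is an assumption.

```python
from typing import Any, Dict, List

REQUIRED_PATHS = ("data_dir", "output_dir", "fine_mapping_dir", "scratch_dir")


def check_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    paths = config.get("paths") or {}
    for key in REQUIRED_PATHS:
        value = paths.get(key)
        if not value:
            problems.append(f"paths.{key} is missing")
        elif str(value).startswith("/path/to/"):
            problems.append(f"paths.{key} still has the example placeholder")
    credentials = config.get("credentials") or {}
    if not credentials.get("synapse_token"):
        problems.append("credentials.synapse_token is empty")
    return problems


if __name__ == "__main__":
    example = {
        "paths": {
            "data_dir": "/path/to/data",
            "output_dir": "/path/to/output",
            "fine_mapping_dir": "/path/to/fine_mapping",
            "scratch_dir": "/tmp",
        },
        "credentials": {"synapse_token": ""},
    }
    for problem in check_config(example):
        print(problem)
```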

Pre-computed Scores

Step 2 requires pre-computed variant effect predictions and supporting annotation resources. See 2_annotate_variants/README.md for details on obtaining:

Enformer variant effect predictions (Avsec et al., 2021; https://doi.org/10.1038/s41592-021-01252-x)
ChromBPNet cell-type-specific accessibility predictions (Pampari et al., 2025; https://doi.org/10.1101/2024.12.25.630221)
GeneBayes gene conservation scores (Zeng et al., 2024; https://doi.org/10.1038/s41588-024-01820-9)
Baseline LD annotations (Gazal et al., 2017; https://doi.org/10.1038/ng.3954)

Cell Types

The pipeline supports the following brain cell types:

Abbreviation	Cell Type
Ast	Astrocytes
Exc	Excitatory neurons
Inh	Inhibitory neurons
Mic	Microglia
Oli	Oligodendrocytes
OPC	Oligodendrocyte precursor cells
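
Scripts typically refer to each cell type by a resource name such as `Mic_mega_eQTL`, as used in the demo command and the Synapse prediction folders. The mapping below is a small sketch of that convention; the function name is hypothetical.

```python
# Abbreviations and names copied from the table above.
CELL_TYPES = {
    "Ast": "Astrocytes",
    "Exc": "Excitatory neurons",
    "Inh": "Inhibitory neurons",
    "Mic": "Microglia",
    "Oli": "Oligodendrocytes",
    "OPC": "Oligodendrocyte precursor cells",
}


def resource_name(abbrev: str) -> str:
    """Map a cell-type abbreviation to its '<abbrev>_mega_eQTL' resource name."""
    if abbrev not in CELL_TYPES:
        raise KeyError(f"unknown cell type abbreviation: {abbrev!r}")
    return f"{abbrev}_mega_eQTL"


if __name__ == "__main__":
    for abbrev, name in CELL_TYPES.items():
        print(f"{resource_name(abbrev)}\t{name}")
```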

Additional Information

Data access: Publicly distributed inputs are in Synapse model_training (syn72248754) and predictions (syn71338354). Step 1 raw fine-mapping RDS inputs are not publicly distributed in this repo. Step 2 requires large annotation datasets that are not distributed with this repo.
Synapse data access: Data access requires a Synapse account and may require approval for controlled-access datasets.

Synapse model training data (Project: syn72248754):

Training data (per cell type):

Cell Type	Synapse ID
Astrocytes (Ast)	syn72248805
Excitatory neurons (Exc)	syn72248803
Inhibitory neurons (Inh)	syn72248804
Microglia (Mic)	syn72248802
OPC	syn72248807
Oligodendrocytes (Oli)	syn72248806

Test data (per cell type):

Cell Type	Synapse ID
Astrocytes (Ast)	syn72248942
Excitatory neurons (Exc)	syn72248940
Inhibitory neurons (Inh)	syn72248941
Microglia (Mic)	syn72248939
OPC	syn72248944
Oligodendrocytes (Oli)	syn72248943

Supporting files:

File	Synapse ID
README_model_training.txt	syn72249887
columns_dict	syn72248797
gene_lof	syn72248800
gnomad_MAF	syn72248799

Predictions root folder: syn71338354

Predictions by cell type:

Resource	Synapse ID
Ast_mega_eQTL	syn71338358
Exc_mega_eQTL	syn71338356
Inh_mega_eQTL	syn71338357
Mic_mega_eQTL	syn71338355
OPC_mega_eQTL	syn71338360
Oli_mega_eQTL	syn71338359
README_predictions.txt	syn71338459

Storage: Full pipeline outputs are large (multi-terabyte scale depending on cohorts and annotations). Ensure adequate disk space in paths.output_dir.
Compute: The pipeline is designed for HPC. For small tests or demos, limit to a single cohort and chromosome to reduce runtime and memory usage.

Download From Synapse

1. Install and authenticate the Synapse client.
2. Download public model_training and/or predictions resources by Synapse ID.

Project URLs:

https://www.synapse.org/Synapse:syn72248754
https://www.synapse.org/Synapse:syn71338354

Example commands:

pip install synapseclient
synapse login

# Download the model_training project (-r recurses into projects and folders)
synapse get -r syn72248754

# Download the predictions folder
synapse get -r syn71338354

# Or download specific model_training folders
synapse get -r syn72248797  # columns_dict
synapse get -r syn72248800  # gene_lof
synapse get -r syn72248799  # gnomad_MAF
synapse get -r syn72248798  # train
synapse get -r syn72248801  # test
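
The same downloads can be done from Python. The sketch below uses `synapseclient` and `synapseutils.syncFromSynapse`, which recursively fetches a project or folder. The `RESOURCES` dictionary and the function name are hypothetical, and the repository's own `download_synapse_data.py` may work differently.

```python
import re

SYNAPSE_ID = re.compile(r"^syn\d+$")

# Synapse IDs copied from this README.
RESOURCES = {
    "model_training": "syn72248754",
    "predictions": "syn71338354",
}


def download_resource(name: str, dest_dir: str, auth_token: str):
    """Recursively download one public resource into `dest_dir`."""
    # Imported here so the module loads even without synapseclient installed.
    import synapseclient
    import synapseutils

    syn_id = RESOURCES[name]
    if not SYNAPSE_ID.match(syn_id):
        raise ValueError(f"not a Synapse ID: {syn_id!r}")
    syn = synapseclient.Synapse()
    syn.login(authToken=auth_token)
    return synapseutils.syncFromSynapse(syn, syn_id, path=dest_dir)
```

For example, `download_resource("predictions", "/path/to/data/synapse_public", token)` would fetch the predictions folder, given a personal access token with download permission.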

Raw Step 1 fine-mapping RDS files are not publicly distributed in this repository.

Citation

If you use this code, please cite:

@article{lakhani2025sceems,
  title={Machine Learning-Based Prediction of Cell-type Resolved Brain eQTLs Enhances Discovery of Variants Explaining Alzheimer's Disease Heritability},
  author={Lakhani, Chirag M and Cavalca, Giacomo and Liu, Anjing and Nidumbur, Rohan and Feng, Ru and Raj, Towfique and De Jager, Philip and The Alzheimer's Disease Functional Genomics Consortium and Wang, Gao and Knowles, David A.},
  journal={medRxiv},
  year={2025},
  doi={10.64898/2025.12.03.25341562},
  url={https://doi.org/10.64898/2025.12.03.25341562}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
