Focus on single gene effects limits discovery and interpretation of complex trait-associated variants
This repository contains the code associated with our work developing a multi-gene eQTL mapping framework, termed cis-principal component expression QTL (cis-pc eQTL or pcQTL). pcQTLs leverage allelic expression "proxitropy", the phenomenon by which one variant changes the expression of multiple nearby genes, to map QTL effects missed by a standard single-gene eQTL approach.
Our preprint is available here
This repository contains a Snakemake workflow for performing pcQTL analysis. To run the complete workflow:
- Create a conda environment according to Environment Setup.
- Download the necessary data as outlined in Required Datasets.
- Edit the example configuration file `config/config_example.yaml` to specify the paths to your input data and desired output directories.
- Run the workflow with your configuration:

      snakemake --configfile config/config_example.yaml

The repository is organized as follows:

- `workflow/rules`: Snakemake workflow for the pcQTL mapping framework.
- `workflow/scripts`: Python and R scripts used by the workflow.
- `workflow/main_figures`: Jupyter notebooks for data analysis and visualization for main figure panels.
- `workflow/supplemental_figures`: Jupyter notebooks for data analysis and visualization for supplemental figure panels.
- `config`: Example configuration file for the workflow.
- `references`: Small file-size references. Details on how to download the other references needed are in Required Datasets.
- `Snakefile`: Main Snakemake workflow file that orchestrates the pcQTL analysis pipeline.

This workflow can be run from a specific conda environment with all dependencies installed.

    # Clone this repository
    git clone https://github.com/kal26/pcqtls.git
    cd pcqtls
    # Create the conda environment
    conda env create -f environment.yml
    # Activate the environment
    conda activate pcqtl

The workflow performs the following analyses:
- Gene clustering: Identifies co-expressed gene clusters
- eQTL analysis: Maps expression quantitative trait loci for individual genes
- pcQTL analysis: Maps QTLs for principal components derived from gene clusters
- Functional annotation: Annotates variants and clusters with functional information
- Co-localization: Performs co-localization analysis with GWAS summary statistics
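
To make the pcQTL step more concrete, the sketch below computes the first principal component of a small, made-up expression matrix for a two-gene cluster, using only the Python standard library. It is an illustration under stated assumptions, not the workflow's implementation: the function name `first_pc_scores`, the toy data, and the choice of power iteration are all hypothetical. In a pcQTL analysis, per-sample scores like these would serve as the phenotype tested against nearby variants.

```python
import math


def first_pc_scores(expression, n_iter=200):
    """Return (loadings, scores) for the first principal component.

    `expression` is a list of samples, each a list of per-gene values
    (hypothetical toy layout: rows are samples, columns are genes).
    Assumes every gene has non-zero variance across samples.
    """
    n_samples = len(expression)
    n_genes = len(expression[0])

    # Standardize each gene to mean 0 and unit variance.
    columns = list(zip(*expression))
    means = [sum(col) / n_samples for col in columns]
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / (n_samples - 1))
        for col, m in zip(columns, means)
    ]
    z = [
        [(row[j] - means[j]) / sds[j] for j in range(n_genes)]
        for row in expression
    ]

    # Gene-by-gene correlation matrix of the standardized data.
    corr = [
        [sum(r[a] * r[b] for r in z) / (n_samples - 1) for b in range(n_genes)]
        for a in range(n_genes)
    ]

    # Power iteration converges to the leading eigenvector (PC1 loadings),
    # provided the start vector is not orthogonal to it.
    v = [1.0] + [0.0] * (n_genes - 1)
    for _ in range(n_iter):
        w = [sum(corr[a][b] * v[b] for b in range(n_genes)) for a in range(n_genes)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Project every sample onto the loadings to get its PC1 score.
    scores = [sum(r[j] * v[j] for j in range(n_genes)) for r in z]
    return v, scores


# Toy data (made up): 5 samples x 2 positively correlated genes.
toy = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 8.1], [5.0, 9.8]]
loadings, scores = first_pc_scores(toy)
print(loadings)
print(scores)
```

Because the two toy genes are strongly co-expressed, both receive near-equal loadings, so a single score per sample captures their shared variation.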

| Dataset | Description | Source | Expected Location in Config |
|---|---|---|---|
| GTEx v8 Expression Data | Normalized gene expression by tissue | GTEx Portal | `expression_dir` |
| GTEx v8 Genotypes | Genotype data: .bed, .bim, and .fam | Protected access on dbGaP, access available via request at this link | `genotype_stem` |
| GTEx v8 Covariates | Technical and biological covariates | GTEx Portal | `covariates_dir` |
| GTEx v8 Sample Sizes | Sample sizes for each tissue | `references/gtex_sample_sizes.txt` | `gtex_meta` |
| Tissue IDs | List of tissues to analyze | User generated, see `references/selected_tissue_ids.txt` as an example | `tissue_id_path` |
| Chromosome List | Chromosomes to analyze | `references/chrs.txt` | `chr_list_path` |
| GWAS Metadata | Metadata from GWAS studies | Zenodo | `gwas_meta` |
| GWAS Summary Stats | GWAS summary statistics files | Zenodo | `gwas_folder` |

| Dataset | Description | Source | Expected Location in Config |
|---|---|---|---|
| GENCODE v26 | Gene annotations | GENCODE | `gencode_path` |
| ABC Enhancer Predictions | Activity-by-Contact enhancer predictions (hg19, requires liftOver to hg38) | Engreitz Lab | `full_abc_path` |
| CTCF ID-tissue | Matching between tissues and CTCF experiments | `references/ctcf_matched_gtex.txt` | `ctcf_match_path` |
| CTCF Peaks | CTCF binding data from ENTEx for the experiments listed in `references/ctcf_matched_gtex.txt` | ENTEx | `ctcf_dir` |
| Cross-mappability | Cross-mappability from Saha and Battle (2019) | figshare | `cross_map_path` |
| Paralog Relationships | Gene paralog information | Ensembl Biomart | `paralog_path` |
| Gene Ontology Terms | GO term annotations | Ensembl Biomart | `go_path` |
| TAD Boundaries | Topologically Associating Domain boundaries (hg19, requires liftOver to hg38) | TADKB | `tad_path` |
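
Before launching the workflow, it can help to confirm that every config entry listed in the tables above points at something that exists. The following is a minimal sketch, not part of the repository: the function name `check_config` is hypothetical, it assumes the config has already been loaded into a plain dict, it assumes every key in the tables is required, and it assumes `genotype_stem` is a path prefix shared by the `.bed`, `.bim`, and `.fam` files.

```python
from pathlib import Path

# Config keys taken from the "Expected Location in Config" columns.
EXPECTED_KEYS = [
    "expression_dir", "genotype_stem", "covariates_dir", "gtex_meta",
    "tissue_id_path", "chr_list_path", "gwas_meta", "gwas_folder",
    "gencode_path", "full_abc_path", "ctcf_match_path", "ctcf_dir",
    "cross_map_path", "paralog_path", "go_path", "tad_path",
]


def check_config(config):
    """Return a list of human-readable problems found in `config` (a dict)."""
    problems = []
    for key in EXPECTED_KEYS:
        value = config.get(key)
        if not value:
            problems.append(f"{key}: not set")
        elif key == "genotype_stem":
            # Assumption: the stem is a prefix for the .bed/.bim/.fam files.
            for ext in (".bed", ".bim", ".fam"):
                if not Path(f"{value}{ext}").exists():
                    problems.append(f"{key}: missing {value}{ext}")
        elif not Path(value).exists():
            problems.append(f"{key}: path does not exist ({value})")
    return problems


# Example with a deliberately incomplete, made-up config.
example = {"expression_dir": "/nonexistent/expression"}
for problem in check_config(example):
    print(problem)
```

A dict like `example` could be obtained from the YAML configuration with `yaml.safe_load` if PyYAML is available in your environment.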

- GENCODE Processing: The GENCODE annotation must be processed to include only "gene" level features with columns (`chr`, `start`, `end`, `strand`, `gene_id`, `gene_name`, `tss`, `alternative_tss`). The `tss` column should contain a list of the transcription start site positions for all "basic" tagged transcripts: use the `start` coordinate for positive-stranded genes and the `end` coordinate for negative-stranded genes. For this analysis, only protein-coding genes were considered.
- Genome Assembly Conversion: The ABC enhancer predictions and TADKB boundary databases are provided in hg19 coordinates. These must be converted to hg38 using liftOver before use in the workflow.

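
The strand rule for transcription start sites described above can be sketched as follows. This is an illustrative example rather than the script used in the repository: the function name `basic_tss_by_gene` and the example records are made up, it assumes standard nine-column GTF lines with GENCODE-style `tag "basic"` attributes, and it does not apply the protein-coding filter or decide how positions are split between the `tss` and `alternative_tss` columns.

```python
import re
from collections import defaultdict


def basic_tss_by_gene(gtf_lines):
    """Map gene_id -> sorted list of TSS positions from "basic" transcripts."""
    tss = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip GTF header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        start, end = int(fields[3]), int(fields[4])
        strand, attributes = fields[6], fields[8]
        if 'tag "basic"' not in attributes:
            continue
        match = re.search(r'gene_id "([^"]+)"', attributes)
        if match is None:
            continue
        # TSS is the start on the + strand and the end on the - strand.
        tss[match.group(1)].add(start if strand == "+" else end)
    return {gene: sorted(positions) for gene, positions in tss.items()}


# Made-up records: GENE_A has two basic transcripts and one non-basic one.
example_gtf = [
    'chr1\tHAVANA\ttranscript\t100\t500\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1"; tag "basic";',
    'chr1\tHAVANA\ttranscript\t150\t500\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A2"; tag "basic";',
    'chr1\tHAVANA\ttranscript\t120\t500\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A3";',
    'chr1\tHAVANA\ttranscript\t900\t1400\t.\t-\t.\tgene_id "GENE_B"; transcript_id "TX_B1"; tag "basic";',
]
print(basic_tss_by_gene(example_gtf))
```

Collecting positions in a set removes duplicates when several transcripts of a gene share the same start site.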
The workflow generates results in the following directories (as specified in your config):
- `clusters_dir`: Gene expression clusters
- `eqtl_output_dir`: eQTL analysis results
- `pcqtl_output_dir`: pcQTL analysis results
- `annotations_output_dir`: Functional annotations
- `coloc_output_dir`: Co-localization results

For detailed descriptions of the output file formats and their contents, refer to output.md.
- Clusters of neighboring correlated genes
- Summary stats for pcQTL mapping
- pcQTL-GWAS colocalizations
We thank the donors and their families for their generous gifts of biospecimens to the GTEx research project. The Genotype-Tissue Expression (GTEx) project was supported by the Common Fund of the Office of the Director of the National Institutes of Health (http://commonfund.nih.gov/GTEx). Additional funds were provided by the National Cancer Institute (NCI), National Human Genome Research Institute (NHGRI), National Heart, Lung, and Blood Institute (NHLBI), National Institute on Drug Abuse (NIDA), National Institute of Mental Health (NIMH), and National Institute of Neurological Disorders and Stroke (NINDS). This research was supported by National Institutes of Health grants R01MH12524, U01AG072573, U01HG012069 to S.B.M. K.L. is supported by the Stanford Genome Training Program (SGTP; NIH/NHGRI T32HG000044). T.G. is supported by the Knight-Hennessy Scholars fellowship.

