Focus on single gene effects limits discovery and interpretation of complex trait-associated variants

Overview

This repository contains the code associated with our work developing a multi-gene eQTL mapping framework, termed cis-principal component expression QTL (cis-pc eQTL or pcQTL). pcQTLs leverage allelic expression "proxitropy", the phenomenon by which one variant changes the expression of multiple nearby genes, to map QTL effects missed by a standard single-gene eQTL approach.

Preprint

Our preprint is available here

Usage

This repository contains a Snakemake workflow for performing pcQTL analysis. To run the complete workflow:

1. Create a conda environment as described in Environment Setup.
2. Download the necessary data as outlined in Required Datasets.
3. Edit the example configuration file config/config_example.yaml to specify the paths to your input data and desired output directories.
4. Run the workflow with your configuration:

snakemake --configfile config/config_example.yaml

Repository Structure

workflow/rules: Snakemake workflow for the pcQTL mapping framework.
workflow/scripts: Python and R scripts used by the workflow.
workflow/main_figures: Jupyter notebooks for data analysis and visualization for main figure panels.
workflow/supplemental_figures: Jupyter notebooks for data analysis and visualization for supplemental figure panels.
config: Example configuration file for the workflow.
references: Small reference files. Details on how to download the other required references are in Required Datasets.
Snakefile: Main Snakemake workflow file that orchestrates the pcQTL analysis pipeline.

Environment Setup

This workflow can be run from a specific conda environment with all dependencies installed.

# Clone this repository
git clone https://github.com/kal26/pcqtls.git
cd pcqtls

# Create the conda environment 
conda env create -f environment.yml

# Activate the environment
conda activate pcqtl

Workflow Components

The workflow performs the following analyses:

Gene clustering: Identifies co-expressed gene clusters
eQTL analysis: Maps expression quantitative trait loci for individual genes
pcQTL analysis: Maps QTLs for principal components derived from gene clusters
Functional annotation: Annotates variants and clusters with functional information
Co-localization: Performs co-localization analysis with GWAS summary statistics
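
To make the "principal components derived from gene clusters" step concrete, the following is a minimal, dependency-free sketch of how a first principal component could be computed for one cluster. It is an illustration of the concept only and is not taken from the workflow's scripts; the function names `standardize` and `first_pc`, the input layout (one row per sample, one column per gene), and the use of power iteration are all assumptions made for this example.

```python
import math


def standardize(values):
    """Center one gene's expression values and scale them to unit variance.

    Assumes the gene is not constant across samples (non-zero variance).
    """
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return [(x - mean) / sd for x in values]


def first_pc(expression, n_iter=200):
    """Return (loadings, scores) of the first principal component.

    expression: list of samples, each a list of per-gene values for the
    genes in one cluster (hypothetical layout chosen for this sketch).
    """
    n_samples = len(expression)
    n_genes = len(expression[0])

    # Standardize each gene (column) so the PCA runs on the correlation matrix.
    cols = [standardize([row[g] for row in expression]) for g in range(n_genes)]

    # Gene-by-gene correlation matrix.
    corr = [
        [
            sum(cols[i][s] * cols[j][s] for s in range(n_samples)) / (n_samples - 1)
            for j in range(n_genes)
        ]
        for i in range(n_genes)
    ]

    # Power iteration from a fixed, non-uniform start vector. A start vector
    # exactly orthogonal to the leading eigenvector would fail to converge.
    v = [1.0 / (i + 1) for i in range(n_genes)]
    for _ in range(n_iter):
        w = [sum(corr[i][j] * v[j] for j in range(n_genes)) for i in range(n_genes)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Project each sample onto the leading eigenvector to get per-sample scores.
    scores = [sum(cols[g][s] * v[g] for g in range(n_genes)) for s in range(n_samples)]
    return v, scores


# Toy cluster: 5 samples x 3 genes, with the first two genes co-expressed.
toy = [
    [1.0, 2.1, 5.0],
    [2.0, 3.9, 3.0],
    [3.0, 6.2, 4.0],
    [4.0, 7.8, 6.0],
    [5.0, 10.1, 2.0],
]
loadings, scores = first_pc(toy)
```

In this picture, the per-sample `scores` play the role that a single gene's expression plays in standard eQTL mapping: they are the phenotype tested against each cis variant. Under that assumption, a variant affecting several genes in a cluster can be detected through the shared component.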

Required Datasets

QTL Analysis Data

| Dataset | Description | Source | Expected Location in Config |
| --- | --- | --- | --- |
| GTEx v8 Expression Data | Normalized gene expression by tissue | GTEx Portal | `expression_dir` |
| GTEx v8 Genotypes | Genotype data: .bed, .bim, and .fam | Protected access on dbGaP; available by request | `genotype_stem` |
| GTEx v8 Covariates | Technical and biological covariates | GTEx Portal | `covariates_dir` |
| GTEx v8 Sample Sizes | Sample sizes for each tissue | `references/gtex_sample_sizes.txt` | `gtex_meta` |
| Tissue IDs | List of tissues to analyze | User generated; see `references/selected_tissue_ids.txt` for an example | `tissue_id_path` |
| Chromosome List | Chromosomes to analyze | `references/chrs.txt` | `chr_list_path` |
| GWAS Metadata | Metadata from GWAS studies | Zenodo | `gwas_meta` |
| GWAS Summary Stats | GWAS summary statistics files | Zenodo | `gwas_folder` |

Annotation Data

| Dataset | Description | Source | Expected Location in Config |
| --- | --- | --- | --- |
| GENCODE v26 | Gene annotations | GENCODE | `gencode_path` |
| ABC Enhancer Predictions | Activity-by-Contact enhancer predictions (hg19, requires liftOver to hg38) | Engreitz Lab | `full_abc_path` |
| CTCF ID-tissue | Matching between tissues and CTCF experiments | `references/ctcf_matched_gtex.txt` | `ctcf_match_path` |
| CTCF Peaks | CTCF binding data from ENTEx for the experiments listed in `references/ctcf_matched_gtex.txt` | ENTEx | `ctcf_dir` |
| Cross-mappability | Cross-mappability from Saha and Battle (2019) | figshare | `cross_map_path` |
| Paralog Relationships | Gene paralog information | Ensembl Biomart | `paralog_path` |
| Gene Ontology Terms | GO term annotations | Ensembl Biomart | `go_path` |
| TAD Boundaries | Topologically Associating Domain boundaries (hg19, requires liftOver to hg38) | TADKB | `tad_path` |
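
Two of the datasets above are distributed in hg19 coordinates and need a liftOver to hg38. As a rough sketch, the UCSC `liftOver` command-line tool takes four positional arguments (input file, chain file, output file, unmapped-records file), and the call can be assembled from Python as below. The helper names and all file names are hypothetical, and the sketch assumes the intervals have already been written in BED format and that the `hg19ToHg38.over.chain.gz` chain file has been downloaded from UCSC.

```python
import subprocess


def liftover_command(bed_in, chain_file, bed_out, unmapped_out):
    """Build the argument list for: liftOver oldFile map.chain newFile unMapped."""
    return ["liftOver", str(bed_in), str(chain_file), str(bed_out), str(unmapped_out)]


def run_liftover(bed_in, chain_file, bed_out, unmapped_out):
    """Run liftOver and raise CalledProcessError if it exits with an error."""
    subprocess.run(
        liftover_command(bed_in, chain_file, bed_out, unmapped_out), check=True
    )


# Hypothetical file names, for illustration only.
cmd = liftover_command(
    "tad_boundaries_hg19.bed",
    "hg19ToHg38.over.chain.gz",
    "tad_boundaries_hg38.bed",
    "tad_boundaries_unmapped.bed",
)
```

Records that cannot be converted are written to the unmapped file, so it is worth checking how many intervals were dropped before pointing `full_abc_path` or `tad_path` at the converted output.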

Data Preparation Notes

GENCODE Processing: The GENCODE annotation must be processed to include only "gene"-level features with the columns (chr, start, end, strand, gene_id, gene_name, tss, alternative_tss). The tss column should contain a list of the transcription start site positions for all "basic"-tagged transcripts: use the start coordinate for positive-stranded genes and the end coordinate for negative-stranded genes. For this analysis, only protein-coding genes were considered.
Genome Assembly Conversion: The ABC enhancer predictions and TADKB boundary databases are provided in hg19 coordinates. These must be converted to hg38 using liftOver before use in the workflow.
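
The strand rule from the GENCODE processing note can be sketched as follows. This is one possible way to collect TSS positions from GTF lines, not the script used to prepare the files for this analysis; the function names are hypothetical, and the sketch assumes the standard nine-column GTF layout with `gene_id`, `gene_type`, and `tag` attributes as found in GENCODE releases.

```python
import re
from collections import defaultdict

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Parse the GTF attribute column; repeated keys (such as tag) become lists."""
    attrs = defaultdict(list)
    for key, value in ATTR_RE.findall(field):
        attrs[key].append(value)
    return attrs


def tss_position(start, end, strand):
    """Start coordinate on the + strand, end coordinate on the - strand."""
    return start if strand == "+" else end


def basic_tss_by_gene(gtf_lines):
    """Map gene_id to the sorted TSS positions of its "basic" transcripts.

    Only transcripts of protein-coding genes are kept, following the note above.
    """
    tss = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = parse_attributes(fields[8])
        if "basic" not in attrs.get("tag", []):
            continue
        if attrs.get("gene_type") != ["protein_coding"]:
            continue
        gene_id = attrs["gene_id"][0]
        tss[gene_id].add(tss_position(int(fields[3]), int(fields[4]), fields[6]))
    return {gene: sorted(positions) for gene, positions in tss.items()}
```

How these positions are split between the `tss` and `alternative_tss` columns is not specified here, so the sketch returns all of them per gene and leaves that choice to the user.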

Output Files

The workflow generates results in the following directories (as specified in your config):

clusters_dir: Gene expression clusters
eqtl_output_dir: eQTL analysis results
pcqtl_output_dir: pcQTL analysis results
annotations_output_dir: Functional annotations
coloc_output_dir: Co-localization results
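
Because the workflow depends on many configured paths, a quick pre-flight check can catch missing entries before a long run. The sketch below assumes the configuration has been loaded into a flat dictionary whose keys match the names listed in Required Datasets and in the output directory list above; the real config/config_example.yaml may nest keys differently or contain additional ones, and `missing_config_keys` is a hypothetical helper, not part of the repository.

```python
# Key names copied from the Required Datasets tables.
INPUT_KEYS = [
    "expression_dir", "genotype_stem", "covariates_dir", "gtex_meta",
    "tissue_id_path", "chr_list_path", "gwas_meta", "gwas_folder",
    "gencode_path", "full_abc_path", "ctcf_match_path", "ctcf_dir",
    "cross_map_path", "paralog_path", "go_path", "tad_path",
]

# Key names copied from the output directory list.
OUTPUT_KEYS = [
    "clusters_dir", "eqtl_output_dir", "pcqtl_output_dir",
    "annotations_output_dir", "coloc_output_dir",
]


def missing_config_keys(config):
    """Return expected keys that are absent or empty in a flat config dict."""
    return [key for key in INPUT_KEYS + OUTPUT_KEYS if not config.get(key)]


# Example with placeholder values; one output directory left unset.
example_config = {key: "/path/to/" + key for key in INPUT_KEYS + OUTPUT_KEYS}
example_config["coloc_output_dir"] = ""
problems = missing_config_keys(example_config)
```

In practice the dictionary would come from parsing the YAML file passed to `--configfile`, and an empty result from `missing_config_keys` indicates that every expected entry has a value.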

Output File Details

For detailed descriptions of the output file formats and their contents, refer to output.md.

Results Availability

Clusters of neighboring correlated genes
Summary stats for pcQTL mapping
pcQTL-GWAS colocalizations

These results are available on Zenodo.

Acknowledgments

We thank the donors and their families for their generous gifts of biospecimens to the GTEx research project. The Genotype-Tissue Expression (GTEx) project was supported by the Common Fund of the Office of the Director of the National Institutes of Health (http://commonfund.nih.gov/GTEx). Additional funds were provided by the National Cancer Institute (NCI), National Human Genome Research Institute (NHGRI), National Heart, Lung, and Blood Institute (NHLBI), National Institute on Drug Abuse (NIDA), National Institute of Mental Health (NIMH), and National Institute of Neurological Disorders and Stroke (NINDS). This research was supported by National Institutes of Health grants R01MH12524, U01AG072573, and U01HG012069 to S.B.M. K.L. is supported by the Stanford Genome Training Program (SGTP; NIH/NHGRI T32HG000044). T.G. is supported by the Knight-Hennessy Scholars fellowship.
