kal26/pcqtls


Focus on single gene effects limits discovery and interpretation of complex trait-associated variants

Overview

This repository contains the code associated with our work developing a multi-gene eQTL mapping framework, termed cis-principal component expression QTL (cis-pc eQTL or pcQTL). pcQTLs leverage allelic expression "proxitropy", the phenomenon by which one variant changes the expression of multiple nearby genes, to map QTL effects missed by a standard single-gene eQTL approach.
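To make the idea of a principal component phenotype concrete, here is a minimal sketch that summarizes a small cluster of neighboring genes by its first principal component. It is not the workflow's implementation: the gene names, the expression values, and the use of power iteration on the correlation matrix are all assumptions made for illustration.

```python
import math

def standardize(values):
    """Center values to mean 0 and scale to unit (population) variance."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

def first_pc(expression, n_iter=200):
    """Return (loadings, scores) of the first principal component.

    expression: dict mapping gene name -> list of per-sample values.
    Power iteration on the gene-by-gene correlation matrix (illustrative only).
    """
    genes = sorted(expression)
    z = [standardize(expression[g]) for g in genes]
    n_samples = len(z[0])
    corr = [[sum(a * b for a, b in zip(zi, zj)) / n_samples for zj in z] for zi in z]
    # Non-uniform start so the vector is unlikely to be orthogonal to the top eigenvector.
    vec = [1.0 / (k + 1) for k in range(len(genes))]
    for _ in range(n_iter):
        nxt = [sum(c * v for c, v in zip(row, vec)) for row in corr]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Per-sample score: weighted sum of standardized expression across the cluster.
    scores = [sum(vec[k] * z[k][s] for k in range(len(genes))) for s in range(n_samples)]
    return dict(zip(genes, vec)), scores

# Hypothetical expression values for three neighboring genes across five samples.
expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "gene_c": [1.0, 3.0, 2.0, 5.0, 4.0],
}
loadings, scores = first_pc(expression)
print(loadings)
print(scores)
```

The per-sample scores play the role that a single gene's expression plays in standard eQTL mapping. The sign of a principal component is arbitrary, so only the relative pattern of the loadings is meaningful.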

pcQTL vs eQTL methods

Preprint

Our preprint is available here

Usage

This repository contains a Snakemake workflow for performing pcQTL analysis. To run the complete workflow:

  1. Create a conda environment according to Environment Setup

  2. Download the necessary data as outlined in Required Datasets

  3. Edit the example configuration file config/config_example.yaml to specify the paths to your input data and desired output directories.

  4. Run the workflow with your configuration:

snakemake --configfile config/config_example.yaml

Repository Structure

  • workflow/rules: Snakemake workflow for the pcQTL mapping framework.
  • workflow/scripts: Python and R scripts used by the workflow.
  • workflow/main_figures: Jupyter notebooks for data analysis and visualization for main figure panels.
  • workflow/supplemental_figures: Jupyter notebooks for data analysis and visualization for supplemental figure panels.
  • config: Example configuration file for the workflow.
  • references: Small reference files. Details on how to download the other required references are in Required Datasets.
  • Snakefile: Main Snakemake workflow file that orchestrates the pcQTL analysis pipeline.

Environment Setup

This workflow can be run from a specific conda environment with all dependencies installed.

# Clone this repository
git clone https://github.com/kal26/pcqtls.git
cd pcqtls

# Create the conda environment 
conda env create -f environment.yml

# Activate the environment
conda activate pcqtl

Workflow Components

The workflow performs the following analyses:

  • Gene clustering: Identifies co-expressed gene clusters
  • eQTL analysis: Maps expression quantitative trait loci for individual genes
  • pcQTL analysis: Maps QTLs for principal components derived from gene clusters
  • Functional annotation: Annotates variants and clusters with functional information
  • Co-localization: Performs co-localization analysis with GWAS summary statistics
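As a rough illustration of what "mapping a QTL" means for either phenotype type, the sketch below regresses a phenotype (a gene's expression or a cluster's PC score) on allele dosage at one variant. This is a simplified assumption, not the workflow's method: it omits the covariates listed under Required Datasets, and the function name and data are hypothetical.

```python
def dosage_slope(dosages, phenotype):
    """Least-squares slope of phenotype on allele dosage (0, 1, or 2 alt alleles)."""
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(phenotype) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotype))
    return sxy / sxx

# Hypothetical data: six samples, one variant, one phenotype (e.g. a PC score).
dosages = [0, 1, 2, 0, 1, 2]
pc_scores = [0.1, 1.1, 2.1, -0.1, 0.9, 1.9]
slope = dosage_slope(dosages, pc_scores)
print(slope)
```

The same test applies whether the phenotype is a single gene's expression (eQTL) or a principal component of a gene cluster (pcQTL); only the phenotype vector changes.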

Required Datasets

QTL Analysis Data

| Dataset | Description | Source | Expected Location in Config |
| --- | --- | --- | --- |
| GTEx v8 Expression Data | Normalized gene expression by tissue | GTEx Portal | expression_dir |
| GTEx v8 Genotypes | Genotype data: .bed, .bim, and .fam | Protected access on dbGaP, access available via request at this link | genotype_stem |
| GTEx v8 Covariates | Technical and biological covariates | GTEx Portal | covariates_dir |
| GTEx v8 Sample Sizes | Sample sizes for each tissue | references/gtex_sample_sizes.txt | gtex_meta |
| Tissue IDs | List of tissues to analyze | User generated, see references/selected_tissue_ids.txt as an example | tissue_id_path |
| Chromosome List | Chromosomes to analyze | references/chrs.txt | chr_list_path |
| GWAS Metadata | Metadata from GWAS studies | Zenodo | gwas_meta |
| GWAS Summary Stats | GWAS summary statistics files | Zenodo | gwas_folder |

Annotation Data

| Dataset | Description | Source | Expected Location in Config |
| --- | --- | --- | --- |
| GENCODE v26 | Gene annotations | GENCODE | gencode_path |
| ABC Enhancer Predictions | Activity-by-Contact enhancer predictions (hg19, requires liftOver to hg38) | Engreitz Lab | full_abc_path |
| CTCF ID-tissue | Matching between tissues and CTCF experiments | references/ctcf_matched_gtex.txt | ctcf_match_path |
| CTCF Peaks | CTCF binding data from ENTEx, the experiments from references/ctcf_matched_gtex.txt | ENTEx | ctcf_dir |
| Cross-mappability | Cross-mappability from Saha and Battle (2019) | figshare | cross_map_path |
| Paralog Relationships | Gene paralog information | Ensembl Biomart | paralog_path |
| Gene Ontology Terms | GO term annotations | Ensembl Biomart | go_path |
| TAD Boundaries | Topologically Associating Domain boundaries (hg19, requires liftOver to hg38) | TADKB | tad_path |
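Before launching the workflow, it can help to confirm that every input key named in the two tables above is set in your configuration. The sketch below is a hypothetical helper, not part of this repository. It assumes the config has already been parsed into a Python dict (for example by a YAML parser) and that the keys sit at the top level, which may not match the real layout of config/config_example.yaml. The example paths are placeholders.

```python
# Config keys taken from the "Expected Location in Config" columns above.
REQUIRED_INPUT_KEYS = [
    "expression_dir", "genotype_stem", "covariates_dir", "gtex_meta",
    "tissue_id_path", "chr_list_path", "gwas_meta", "gwas_folder",
    "gencode_path", "full_abc_path", "ctcf_match_path", "ctcf_dir",
    "cross_map_path", "paralog_path", "go_path", "tad_path",
]

def missing_config_keys(config, required=REQUIRED_INPUT_KEYS):
    """Return the required keys that are absent or empty in the config dict."""
    return [key for key in required if not config.get(key)]

# Hypothetical, partially filled config for illustration.
example_config = {
    "expression_dir": "data/expression",
    "genotype_stem": "data/genotypes/gtex_v8",
    "covariates_dir": "data/covariates",
}
missing = missing_config_keys(example_config)
print("Missing keys:", ", ".join(missing))
```

This only checks that keys are present and non-empty. It does not verify that the files exist, since genotype_stem is a file prefix rather than a single path.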

Data Preparation Notes

  1. GENCODE Processing: The GENCODE annotation must be processed to include only "gene" level features with columns (chr, start, end, strand, gene_id, gene_name, tss, alternative_tss). The tss column should contain a list of the transcription start site positions for all "basic"-tagged transcripts: use the start coordinate for positive-stranded genes and the end coordinate for negative-stranded genes. For this analysis, only protein-coding genes were considered.

  2. Genome Assembly Conversion: The ABC enhancer predictions and TADKB boundary databases are provided in hg19 coordinates. These must be converted to hg38 using liftOver before use in the workflow.

Output Files

The workflow generates results in the following directories (as specified in your config):

  • clusters_dir: Gene expression clusters
  • eqtl_output_dir: eQTL analysis results
  • pcqtl_output_dir: pcQTL analysis results
  • annotations_output_dir: Functional annotations
  • coloc_output_dir: Co-localization results

Output File Details

For detailed descriptions of the output file formats and their contents, refer to output.md.

Results Availability

  • Clusters of neighboring correlated genes
  • Summary stats for pcQTL mapping
  • pcQTL-GWAS colocalizations

ZENODO

Acknowledgments

We thank the donors and their families for their generous gifts of biospecimens to the GTEx research project. The Genotype-Tissue Expression (GTEx) project was supported by the Common Fund of the Office of the Director of the National Institutes of Health (http://commonfund.nih.gov/GTEx). Additional funds were provided by the National Cancer Institute (NCI), National Human Genome Research Institute (NHGRI), National Heart, Lung, and Blood Institute (NHLBI), National Institute on Drug Abuse (NIDA), National Institute of Mental Health (NIMH), and National Institute of Neurological Disorders and Stroke (NINDS). This research was supported by National Institutes of Health grants R01MH12524, U01AG072573, and U01HG012069 to S.B.M. K.L. is supported by the Stanford Genome Training Program (SGTP; NIH/NHGRI T32HG000044). T.G. is supported by the Knight-Hennessy Scholars fellowship.

make allelic proxitropy work for you
