RMBrouwer/HAPNEST_adjusted_phenotype

Author: Rachel Brouwer
Updated 2/6/23


HAPNEST_adjusted_phenotype

Adjusted version of phenotype simulation of HAPNEST

This document describes an adjusted version of the HAPNEST phenotype creation tool (based on https://github.com/intervene-EU-H2020/synthetic_data/tree/main/algorithms/phenotype). The adjustments were made to the original HAPNEST tool as downloaded from GitHub before a major update that took place in May 2023. HAPNEST is a program for synthetic genotype creation (https://www.biorxiv.org/node/2912508) that includes simulation of phenotypes. See the Installation directory for installation instructions.

We adjusted the phenotype tool to incorporate the following changes:

  • Allow for covariate x SNP interactions, explained variance per covariate
  • Allow for explained variance PER COVARIATE separately
  • Allow for explained variance of PC effects (in case of multiple populations) separately; this used to be included in the covariate component
  • Print an error message when input parameters are not recognised
  • Allow for user-defined extent of overlap between main effects and interaction effects (when using polygenic model)
  • Allow for different causal SNPs for different covariate interactions (when using causal-list)

The phenotype creation tool can be directly run on the synthetic data created by HAPNEST. In that case, the SampleList parameter below should be set to the sample file. Make sure the nPopulation parameter matches that of the SampleList.

See the test_runs directory for several tests using one population and one trait. NOTE: this program should extend to multiple populations and multiple traits, but this has not been tested yet.

To use, run

./phenoalg_adjusted_CovxSNP ParFile seed

where ParFile is a parameter file in plain text (see below) and seed is a random seed. Use the same seed across testing scenarios.

nPopulation 1
PopulationCorr 1

nTrait 1
TraitCorr 1
Prevalence 0.5

nCovar 2

ProportionGeno 0.1
ProportionPC 0
ProportionCovar 0.1,0.1
ProportionCovarxSNP 0.2,0.2
Polygenicity 0.001
Polygenicity_interactions 0.001,0.004
Pleiotropy 1
Overlap_interactions 0.6,0

SampleList data/outputs/test_1pop_1phe_2cov/test_chr-1.sample
Reference data/inputs/processed/1KG+HGDP/Africa.Annot
GenoFile data/outputs/test_1pop_1phe_2cov/test_chr
Output data/outputs/test_1pop_1phe_2cov/test_1pop_1phe_2cov
CausalList data/inputs/causal_SNPS/test_1pop_1phe_causal
CausalList_interaction data/inputs/causal_SNPS/test_1pop_1phe_causal_int

a -0.4
b -1
c 0.5
nComponent 3
CompWeight 1,5,10
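When several test scenarios are run with the same seed, it can be convenient to generate the parameter file from a script. The following is a minimal Python sketch of one way to do that. The helper names `render_parfile` and `build_command` are hypothetical and not part of this tool; the sketch only assumes the `key value` layout shown above and the command line `./phenoalg_adjusted_CovxSNP ParFile seed`.

```python
def render_parfile(params):
    """Render a dict as 'key value' lines.

    Lists and tuples become comma-separated values WITHOUT spaces,
    because the tool does not accept spaces inside such values.
    """
    lines = []
    for key, value in params.items():
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        lines.append(f"{key} {value}")
    return "\n".join(lines) + "\n"


def build_command(par_file, seed, binary="./phenoalg_adjusted_CovxSNP"):
    """Build the argument list for one run (hypothetical helper)."""
    return [binary, par_file, str(seed)]


# A subset of the example parameters; the remaining parameters from the
# example above would be added to the dict in the same way.
scenario = {
    "nPopulation": 1,
    "nTrait": 1,
    "nCovar": 2,
    "ProportionGeno": 0.1,
    "ProportionCovar": [0.1, 0.1],
    "ProportionCovarxSNP": [0.2, 0.2],
}

par_text = render_parfile(scenario)
command = build_command("scenario.par", 42)
# To actually run the tool, write par_text to "scenario.par" and call
# subprocess.run(command, check=True).
```

Keeping the seed as an explicit argument makes it easy to reuse the same value across scenarios.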

Parameter description

NOTE: in comma-separated input, make sure there are no additional spaces (this is a known limitation that still needs fixing).
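Because stray spaces in comma-separated values are easy to miss, a parameter file can be checked before running the tool. The sketch below is an illustrative Python check, not the tool's own parser; the function name `parse_parfile` is hypothetical, and it assumes every non-empty line consists of exactly one key and one value separated by whitespace.

```python
def parse_parfile(text):
    """Parse 'key value' lines into a dict of string lists.

    Raises ValueError when a line does not split into exactly two
    tokens, which is what happens when a comma-separated value
    contains a space (e.g. '0.1, 0.1').
    """
    params = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines between parameter groups
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(
                f"line {line_no}: expected 'key value' without extra spaces, got {line!r}"
            )
        key, value = parts
        values = value.split(",")
        if any(v == "" for v in values):
            raise ValueError(f"line {line_no}: empty entry in {value!r}")
        params[key] = values
    return params
```

For example, `parse_parfile("ProportionCovar 0.1, 0.1")` raises an error, while `parse_parfile("ProportionCovar 0.1,0.1")` returns the two entries as strings.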

Population related flags

  • nPopulation: the number of populations (nPop), integer, matches the number of populations in the synthetic data

  • PopulationCorr: a flattened correlation matrix for population genetic correlation (symmetric positive definite). nPop x nPop entries separated by comma. Here population genetic correlation is defined as the correlation of SNP effects on the same trait across different populations. For the same trait, we assume the same set of causal variants is shared across all populations, but each population can have its own, overall correlated, effect sizes. Not used if there is only one population.

  • ProportionPC: observed proportion of variance contributed by PC (general effect of ancestry - only useful if nPop > 1; if nPop = 1 this component will not contribute to the phenotype).

Trait related flags

  • nTrait: the number of traits (nTrait), integer.

  • TraitCorr: a flattened correlation matrix for trait correlation (symmetric positive definite). nTrait x nTrait entries separated by comma. Here the trait correlation parameter is the observed correlation, which includes shared effects of genetics, covariates and noise.

  • Prevalence: disease prevalence in each population, each trait. Flattened nPop * nTrait matrix, entries separated by comma. If prevalence is specified, the output will include a column for binary case/control status.
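Both PopulationCorr and TraitCorr are given as flattened square matrices that must be symmetric positive definite. The following Python sketch shows one way to check such an input before running the tool. The function names are hypothetical, and the check (unit diagonal, symmetry, and a Cholesky factorisation) is an illustration rather than what the tool itself does. Because the matrix is symmetric, row-wise and column-wise flattening give the same result.

```python
import math


def unflatten_square(values, n):
    """Turn n*n flattened entries into a list of n rows."""
    if len(values) != n * n:
        raise ValueError(f"expected {n * n} entries, got {len(values)}")
    return [list(values[i * n:(i + 1) * n]) for i in range(n)]


def is_valid_correlation(matrix, tol=1e-9):
    """Return True if matrix has a unit diagonal, is symmetric and positive definite."""
    n = len(matrix)
    for i in range(n):
        if abs(matrix[i][i] - 1.0) > tol:
            return False
        for j in range(i):
            if abs(matrix[i][j] - matrix[j][i]) > tol:
                return False
    # Cholesky factorisation succeeds only for positive definite matrices.
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                d = matrix[i][i] - s
                if d <= tol:
                    return False
                lower[i][j] = math.sqrt(d)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return True
```

For two traits, an input such as `TraitCorr 1,0.5,0.5,1` passes this check, while `1,1.2,1.2,1` does not.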

Covariate related flags

  • nCovar: the number of covariates (nCov), integer - total number of covariates that influence one (or more) of the traits.

  • ProportionGeno: observed causal SNP heritability in each population, each trait, should be in [0,1]. Flattened nPop * nTrait matrix, entries separated by comma (Pop1-SNP_h2_Trait1,Pop1-SNP_h2_Trait2,Pop2-SNP_h2_Trait1,Pop2-SNP_h2_Trait2,...).

  • ProportionCovar: observed proportion of variance contributed by each covariate (input in SampleList file), in each population, each trait, should be in [0,1]. Flattened nPop * nTrait * nCov matrix, entries separated by comma (Pop1-Trait1-Covar1,Pop1-Trait1-Covar2,Pop1-Trait2-Covar1,Pop1-Trait2-Covar2,Pop2...). It is assumed that the same covariates influence multiple traits, but their effect can be set to zero. It is also assumed that the covariates are independent of each other and of the genotype (if not, the total variance will be smaller than one).

  • ProportionCovarxSNP: observed proportion of variance contributed by covariate x SNP interactions (covariates are input in the SampleList file), in each population, each trait, should be in [0,1]. Flattened nPop * nTrait * nCov matrix, entries separated by comma (Pop1-Trait1-Covar1,Pop1-Trait1-Covar2,Pop1-Trait2-Covar1,Pop1-Trait2-Covar2,Pop2...). It is assumed that the same covariates influence multiple traits, but their effect can be set to zero. Please note: the proportion of variance explained that is entered here is additional to the main effects of genotype, PC and that particular covariate, and is specified per covariate separately.
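The flattened ordering Pop1-Trait1-Covar1,Pop1-Trait1-Covar2,Pop1-Trait2-Covar1,... means that the covariate index varies fastest, then the trait, then the population. The Python sketch below illustrates that ordering and a simple check on the variance budget. Both function names are hypothetical. The budget check assumes that the variance components are independent and add up, so that whatever remains out of a total of one is left for noise; this follows the description above but is not taken from the tool's code.

```python
def flat_index(pop, trait, cov, n_trait, n_cov):
    """0-based position of (pop, trait, cov) in the flattened nPop * nTrait * nCov input.

    pop, trait and cov are 0-based, so Pop1-Trait1-Covar1 is (0, 0, 0).
    """
    return (pop * n_trait + trait) * n_cov + cov


def residual_variance(prop_geno, prop_pc, prop_covar, prop_covar_x_snp):
    """Variance left for noise for one population and one trait.

    prop_covar and prop_covar_x_snp are lists with one entry per covariate.
    Assumes independent, additive components (an assumption of this sketch).
    """
    explained = prop_geno + prop_pc + sum(prop_covar) + sum(prop_covar_x_snp)
    if explained > 1:
        raise ValueError(f"explained variance {explained} exceeds 1")
    return 1 - explained


# Values from the example parameter file (one population, one trait, two covariates).
noise = residual_variance(0.1, 0, [0.1, 0.1], [0.2, 0.2])
```

With the example parameter file, the explained proportions are 0.1 (genotype) + 0 (PC) + 0.1 + 0.1 (covariates) + 0.2 + 0.2 (interactions) = 0.7, leaving about 0.3 for noise under these assumptions.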

Parameters related to polygenicity

  • Polygenicity: nTrait vector of trait polygenicity, measured by the proportion of total SNPs being causal. E.g. Polygenicity = 0.05 means that around 5% of SNPs will be causal. Only used when no causal SNP lists are provided.

  • Polygenicity_interactions: vector of interaction polygenicity, measured by the proportion of total SNPs having a causal interaction with each covariate. The length of this vector should be the number of covariates. An entry can be set to zero if a covariate does not have an interaction effect. Only used when no causal SNP lists are provided.

Parameters related to overlap between traits, main effects and interactions

  • Pleiotropy: nTrait vector of each trait's pleiotropy relationship compared to trait 1, i.e. if trait 2 has Pleiotropy = 0.9, 90% of the causal SNPs in trait 1 are also causal in trait 2. The first entry of the Pleiotropy vector is therefore always 1. Entries separated by comma. This is PURE pleiotropy (i.e. no causal SNPs related through LD). Only used when no causal SNP lists are provided.

  • Overlap_interactions: vector indicating, for each trait and each covariate, the amount of overlap between the main and interaction effects (Pop1-Trait1-Covar1,Pop1-Trait1-Covar2,Pop1-Trait2-Covar1,Pop1-Trait2-Covar2,Pop2...). A value of e.g. 0.3 for a particular trait and covariate implies that 30% of the causal SNPs that have a main effect on the trait will also have an interaction effect with that covariate. As with Pleiotropy, this is PURE overlap (i.e. no causal SNPs related through LD). Only used when no causal SNP lists are provided.
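To make the relation between Polygenicity, Polygenicity_interactions and Overlap_interactions concrete, the Python sketch below draws a set of main-effect SNPs and a set of interaction SNPs for one covariate. It is one possible reading of the descriptions above, not the tool's implementation: the function name, the rounding of counts, and the choice to draw the non-overlapping interaction SNPs only from SNPs without a main effect are all assumptions.

```python
import random


def pick_causal_sets(n_snps, polygenicity, polygenicity_int, overlap, seed):
    """Draw main-effect and interaction SNP indices for one trait and one covariate.

    overlap is the fraction of main-effect SNPs that also get an
    interaction effect (as described for Overlap_interactions).
    """
    rng = random.Random(seed)
    n_main = round(polygenicity * n_snps)
    n_int = round(polygenicity_int * n_snps)
    n_shared = round(overlap * n_main)
    if n_shared > n_int:
        raise ValueError("overlap implies more shared SNPs than interaction SNPs")

    main = rng.sample(range(n_snps), n_main)
    shared = rng.sample(main, n_shared)

    # Remaining interaction SNPs come from SNPs without a main effect
    # (an assumption of this sketch).
    main_set = set(main)
    others = [s for s in range(n_snps) if s not in main_set]
    extra = rng.sample(others, n_int - n_shared)
    return main_set, set(shared) | set(extra)


main_snps, interaction_snps = pick_causal_sets(
    n_snps=10000, polygenicity=0.01, polygenicity_int=0.01, overlap=0.6, seed=1
)
```

In this illustration 100 SNPs have a main effect, 100 have an interaction effect, and 60 of them are in both sets. With an overlap of 0, as for the second covariate in the example parameter file, the two sets would be disjoint.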

Input and output paths

  • SampleList: full path to the sample list that was created in genotype simulation. Add covariates per individual to this comma-separated file if covariates are specified above.

  • CausalList: the prefix for lists of predefined SNPs to be used as causal; overrides the polygenicity parameter if specified. Each column contains causal SNPs for one trait, columns separated by comma. CausalList is specified per phenotype. For each trait, the causal list file should be named prefixn, where n is the trait index.

  • CausalList_interaction: the prefix for lists of predefined SNPs to be used as causal for interactions; overrides the polygenicity parameter if specified. Each column contains causal SNPs for one trait, columns separated by comma. CausalList_interaction is specified per phenotype. For each trait, the causal list file should be named prefixn, where n is the trait index. When there is more than one covariate for which an interaction exists, the lists of SNPs should be concatenated, using a row "###" to separate the causal SNP lists per covariate. If there is only one covariate that has an interaction, add "###" to indicate a new (empty) SNP list.

  • Reference: reference file for LD; currently not working because the synthetic data is in chr:bp format instead of rsID.

  • GenoFile: input genotype file, in .traw format by chromosome, i.e. it takes plink files by chromosome, named GenoFile-chr.bim/bed/fam, where chr is the chromosome number (1-22). Can be generated from other formats using plink --make-bed.
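The "###" separator in the CausalList_interaction file can be illustrated with a short Python sketch that splits such a file into one SNP list per covariate. The function name is hypothetical, and the sketch assumes a single-trait file with one SNP identifier per row (in chr:bp format, as used by the synthetic data); it is not the tool's own reader.

```python
def split_interaction_lists(text, separator="###"):
    """Split the contents of a causal interaction list into one list per covariate.

    A row consisting of the separator starts the list for the next covariate,
    so a trailing separator yields an empty list for that covariate.
    """
    lists = [[]]
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # ignore blank rows
        if entry == separator:
            lists.append([])
        else:
            lists[-1].append(entry)
    return lists


two_covariates = split_interaction_lists("1:1000\n1:2000\n###\n1:3000\n")
one_covariate = split_interaction_lists("1:1000\n1:2000\n###\n")
```

Here `two_covariates` holds two SNPs for the first covariate and one for the second, while `one_covariate` holds two SNPs for the first covariate and an empty list for the second, matching the rule that "###" is added even when only one covariate has an interaction.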

Distribution of genetic effects parameters

  • a,b,c: weighting parameters for the effects of SNPs, with -0.4, -1 and 0.5 as suggested values (see the original paper for details).

  • nComponent and CompWeight: the number of Gaussian mixture components for SNP effects and their weights (there is currently no check that the number of weights equals nComponent).
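Since the tool does not check that nComponent and CompWeight agree, this can be verified beforehand. The Python sketch below performs that check; the function name is hypothetical. It also rescales the weights so that they sum to one, which is an assumption about how the weights are interpreted and is not confirmed by this document.

```python
def mixture_proportions(n_component, comp_weight):
    """Check that there is one weight per component and rescale the weights to sum to 1.

    The rescaling is an assumption of this sketch, not documented tool behaviour.
    """
    if len(comp_weight) != n_component:
        raise ValueError(
            f"nComponent is {n_component} but CompWeight has {len(comp_weight)} entries"
        )
    if any(w < 0 for w in comp_weight):
        raise ValueError("weights must be non-negative")
    total = sum(comp_weight)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in comp_weight]


# Values from the example parameter file: nComponent 3, CompWeight 1,5,10
proportions = mixture_proportions(3, [1, 5, 10])
```

For the example values, the three weights sum to 16, so the rescaled proportions are 1/16, 5/16 and 10/16.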


For each trait, the program outputs two or three files:

  • .pheno includes the main genetic effects, PCs, covariates, interactions and synthetic phenotype for each individual.

  • .causal_n_ includes the causal SNPs and their effect sizes for trait n in each population.

  • .causal__intn_ includes the causal SNPs that have an interaction for trait n. This file also includes a "covariate" column, indicating which SNP is causal for which covariate.
