chrismckennan/FALCO

FALCO: Factor AnaLysis in COrrelated data

The functions in this package implement the methods proposed in McKennan, 2020 (https://arxiv.org/abs/2009.11134) and perform factor analysis in high-dimensional biological data. Let y_g be an n-vector containing the expression or methylation of genomic unit g across the n samples. To use this package, you must be able to express the variance of y_g as Var(y_g) = v_{g1}B_1 + ... + v_{gb}B_b, where v_{g1},...,v_{gb} are unknown scalars and B_1,...,B_b are known n x n matrices that parametrize the relationship between samples. This encompasses nearly all modern gene expression and methylation data. Some examples include:

  1. Unrelated samples. In this case, b=1 and B_1=I_n is the identity matrix.

  2. Individuals related through a kinship matrix. In this case, b=2, B_1=I_n, and B_2=U, where U is the kinship matrix.

  3. Multi-tissue data from unrelated individuals. For data with T tissues and arbitrary correlation structure, one can express Var(y_g) = sum_{i=1}^{T(T+1)/2} v_{gi}B_i. One can simplify the correlation structure depending on the similarity between tissues.

  4. Longitudinal data. For general longitudinal data with T time points, Var(y_g) can be expressed exactly as in example 3. If one assumes the marginal variance is the same for every sample, this simplifies to Var(y_g) = v_{g1}I_n + sum_{i=2}^{T(T-1)/2 + 1} v_{gi}B_i.
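
To make the examples above concrete, here is a minimal NumPy sketch of how the known matrices B_1,...,B_b could be built. It is not part of this package, and the function names are hypothetical. For the multi-tissue and longitudinal cases it assumes that individuals are independent, that every individual contributes all T tissues (or time points), and that samples are ordered individual by individual.

```python
import numpy as np


def unrelated_basis(n):
    """Example 1: unrelated samples, b = 1 and B_1 = I_n."""
    return [np.eye(n)]


def kinship_basis(kinship):
    """Example 2: b = 2, B_1 = I_n and B_2 = U (the kinship matrix)."""
    kinship = np.asarray(kinship, dtype=float)
    return [np.eye(kinship.shape[0]), kinship]


def repeated_measures_basis(n_individuals, n_levels, equal_marginal_variance=False):
    """Examples 3 and 4: T tissues or time points per individual.

    Assumes samples are ordered individual by individual, so Var(y_g) is
    block diagonal with one T x T block per individual.
    """
    eye_ind = np.eye(n_individuals)
    basis = []
    if equal_marginal_variance:
        # One shared variance term v_{g1} I_n replaces the T diagonal terms.
        basis.append(np.eye(n_individuals * n_levels))
    for s in range(n_levels):
        for t in range(s, n_levels):
            if s == t and equal_marginal_variance:
                continue
            # Symmetric indicator of the (s, t) entry of the T x T block.
            block = np.zeros((n_levels, n_levels))
            block[s, t] = 1.0
            block[t, s] = 1.0
            basis.append(np.kron(eye_ind, block))
    return basis


def compose_variance(coefs, basis):
    """Return v_{g1} B_1 + ... + v_{gb} B_b for given coefficients."""
    return sum(v * B for v, B in zip(coefs, basis))
```

With T = 3 this yields T(T+1)/2 = 6 matrices in the general case and T(T-1)/2 + 1 = 4 matrices under equal marginal variance, matching the counts in examples 3 and 4.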

The two primary functions are given below.

FALCO

This implements FALCO (Algorithm 1 of McKennan, 2020), which estimates latent factors and loadings. Like PCA in data with unrelated samples, the factors are orthogonal to one another and are arranged in order of decreasing importance (i.e., decreasing variance explained). As demonstrated in McKennan, 2020, these factors behave like principal components and can be used to perform quality control, identify latent patterns in the data, and de-noise the expression/methylation data matrix in eQTL and meQTL studies.
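
For reference, the PCA analogy can be illustrated with ordinary PCA on data with unrelated samples. The sketch below is plain SVD-based PCA in NumPy, not FALCO (Algorithm 1 of the paper) and not this package's interface; the function name `pca_factors` and the p x n layout of the data matrix are assumptions made for illustration. It shows the two properties mentioned above, orthogonal factors and decreasing variance explained, along with a simple de-noising step.

```python
import numpy as np


def pca_factors(Y, K):
    """Ordinary PCA of a p x n matrix Y (genomic units by samples).

    Returns factors C (n x K), loadings L (p x K) and the fraction of
    variance explained by each factor, in decreasing order.
    """
    # Center each genomic unit across samples.
    Yc = Y - Y.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    n = Y.shape[1]
    C = np.sqrt(n) * Vt[:K].T            # scaled so that C^T C = n I_K
    L = U[:, :K] * s[:K] / np.sqrt(n)    # so that L C^T is the rank-K fit
    var_explained = s[:K] ** 2 / np.sum(s ** 2)
    return C, L, var_explained


def remove_factors(Y, C, L):
    """De-noise Y by subtracting the fitted factor component L C^T."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    return Yc - L @ C.T
```

The residual returned by `remove_factors` is orthogonal to every estimated factor, which is the sense in which the latent variation has been removed.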

CBCV_plus

This implements CBCV+ (Algorithm 2 of McKennan, 2020), which estimates K, the number of latent factors. If K is unspecified in FALCO, it is estimated using CBCV_plus.
