Nearest-neighbor Projected-Distance Regression (NPDR) detects network interactions and controls for confounding and multiple testing
This repository provides simulation and analysis code to reproduce the results in our NPDR paper. The repository is structured as followed:
- The
rdirectory contains the.Rmdfiles to perform the analysis and produce the figures. Each file can be run separately in the order indicated in the file name.*_visualize_*.Rmdfiles will have notes on which files need to be run prior to executing chunks.2_*_100_*.Rmdfiles produces results from 100 replications. - The
datadirectory contains real and simulated data from running said.Rmdfiles. - The
resultsdirectory contains.Rdataresults from running code in therdirectory.*.Rdataresult files can be loaded in toRusing theload()function.*.csvresult files should be straightforward. - The
figsdirectory contains generated figures from running the*_visualize_*.Rmdcode in therdirectory. These figures are included in the main textms/main_npdr.pdfor supplementms/sup_npdr.pdf. - The
msdirectory contains the text files to generate the manuscript and supplement (withSweave).
We develop a new feature selection technique that uses the generalized linear model (GLM) to perform regression between nearest-neighbor pair distances projected onto attributes to address a broad spectrum of statistical challenges in high-dimensional data, including interactions and confounding variables. Recently we developed STatistical Inference Relief (STIR), a pseudo t-test approach to estimate the statistical significance of Relief-based attribute scores for case-control (classification) problems, where the data may involve main effects and complex statistical interactions (network epistasis). However, efficient statistical inference methods are needed to detect complex network effects in more complicated modeling situations, including continuous outcomes, mixtures of categorical and continuous predictors, and correcting for potential confounding variables. Our new Nearest-neighbors Projected-Distance Regression (NPDR) encompasses STIR for case-control data and extends its capabilities to statistically correct for covariates, which previously was not feasible for Relief-based methods. NPDR provides a novel and improved way to compute regression-based Relief scores (RRelief) for quantitative outcomes that also allows statistical corrections and various combinations of predictor data types such as genetic variants and gene expression. In addition, we implement a penalized version of NPDR and derive the theoretical constant-$k$ approximation to the expected number of neighbors for spatially uniform radial neighborhoods. Using realistic simulations that include main effects and gene-network interactions, we show for that NPDR improves attribute estimates compared to standard Relief while also being able to compute statistical significance of attributes and adjust for covariates. We demonstrate that the NPDR framework has similar statistical properties to the pseudo t-test based method for case-control data with the added flexibility of regression modeling. We also compare glmnet with the penalized version of NPDR. We use RNA-Seq data from a study of major depressive disorder to show that NPDR with covariate adjustment removes spurious associations due to confounding by sex.
Installing:
devtools::install_github("insilico/npdr")
library(npdr)
Github page: https://github.com/insilico/npdr