The qcdrift package is an R-based analytical framework designed to
automate the detection, correction, and validation of instrumental drift
in high-dimensional mass spectrometry data. It replaces labor-intensive
manual spreadsheet workflows with a reproducible, modular pipeline.
If you have never used R or RStudio before, follow these steps to get up and running.
Before running the analysis, install the R packages the pipeline depends on for data handling and plotting.
- Open RStudio
- Copy and paste the following line into the Console (the bottom-left window) and press Enter:
install.packages(c("devtools", "dplyr", "ggplot2", "patchwork", "ggrepel", "GGally", "readxl", "tidyr"))

qcdrift can be installed directly from GitHub using devtools:
devtools::install_github('Hood-BIFX/qcdrift')

The pipeline is engineered with a modular architecture, allowing users
to call individual functions for specific tasks or use
process_runs() for a complete end-to-end analysis.
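Before loading the package, it can help to confirm that the dependencies listed above are actually available. The following is a minimal sketch in base R; `missing_packages()` is a hypothetical helper written for this example and is not part of qcdrift.

```r
# Hypothetical helper (not part of qcdrift): return the packages from `pkgs`
# that cannot be loaded in the current R library.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

deps <- c("devtools", "dplyr", "ggplot2", "patchwork",
          "ggrepel", "GGally", "readxl", "tidyr")
to_install <- missing_packages(deps)

if (length(to_install) > 0) {
  message("Still missing: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install only what is missing
} else {
  message("All dependencies are installed.")
}
```

Installing only the missing packages avoids re-downloading ones that are already present.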
Once the package is installed and loaded, you can run the entire
pipeline with a single command. For this example, we’ll use the provided
example dataset located at inst/extdata/RawMassSpec.xlsx.
library(qcdrift)
results <- system.file('extdata/RawMassSpec.xlsx', package = 'qcdrift') |>
  process_runs()

The results object is a list containing the
cleaned and corrected data, as well as all generated diagnostic plots. A
multi-page PDF report can also be saved in your working directory,
summarizing the key findings as follows:
# Placeholder: the final report rendering function appears to have been lost at some point.
render_report(results, output_file = "Final_QC_Report.pdf")

The pipeline is built on modular functions that can be called independently. This section walks through the key functions, using the example dataset to illustrate their use and outputs.
read_and_clean_data() standardizes the raw Excel file into a “tidy”
long-format data frame.
- Automated Parsing: It extracts sample names and numerical injection orders from the file header.
- QC Identification: It uses a regular expression to classify samples based on the qc_starts_with parameter, which defaults to “QC”.
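To illustrate the QC identification step, here is a minimal base R sketch of prefix matching with a regular expression. The helper `flag_qc()` and the sample names are hypothetical, and the package's real matching rules (for example, case sensitivity) may differ.

```r
# Hypothetical helper (not part of qcdrift): flag samples whose name starts
# with the given prefix, returning 1 for QC and 0 for non-QC.
flag_qc <- function(sample_names, qc_starts_with = "QC") {
  # "^" anchors the pattern at the start, so "QC_01" matches
  # but "Blank_QC" does not.
  as.integer(grepl(paste0("^", qc_starts_with), sample_names))
}

samples <- c("QC_01", "Plasma_A", "QC_02", "Blank_QC", "Plasma_B")
data.frame(sample = samples, qc = flag_qc(samples))
```

Anchoring the pattern at the start of the name keeps samples that only mention the prefix elsewhere in their name from being treated as QCs.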
raw_data <- system.file('extdata/RawMassSpec.xlsx', package = 'qcdrift') |>
  read_and_clean_data()

qc_drift_correction() applies the corret_linear() function
(piecewise linear interpolation between QC samples) to correct for QC
drift (other options may be added in the future). The function expects
the input data frame to have the following columns, which are
automatically generated by read_and_clean_data():
- io: The injection order (numerical sequence of samples)
- abundance: The raw intensity values for each metabolite
- qc: A binary vector indicating which samples are QCs (1 for QC, 0 for non-QC)
- metabolites: The name or ID of the molecule being measured
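The idea of piecewise linear interpolation between QC samples can be sketched for a single metabolite in base R. This is an illustration under stated assumptions, not the package's implementation: `linear_drift_sketch()` is a hypothetical name, the toy values are invented, and rescaling to the QC median is one common choice that the package may not share.

```r
# Hypothetical sketch (not the qcdrift implementation): estimate drift by
# linearly interpolating QC intensities over injection order, then divide
# it out and rescale to the median QC intensity.
linear_drift_sketch <- function(io, abundance, qc) {
  is_qc <- qc == 1
  # approx() interpolates linearly between QC points; rule = 2 reuses the
  # nearest QC value for injections before the first or after the last QC.
  drift <- approx(x = io[is_qc], y = abundance[is_qc], xout = io, rule = 2)$y
  abundance / drift * median(abundance[is_qc])
}

# Toy data for one metabolite: signal decays steadily across the run.
toy <- data.frame(
  io        = 1:7,
  qc        = c(1, 0, 0, 1, 0, 0, 1),
  abundance = c(100, 95, 90, 80, 75, 70, 60)
)
toy$corrected <- linear_drift_sketch(toy$io, toy$abundance, toy$qc)
toy
```

In this toy run the three QC injections all end up at the same corrected value, which is the behavior the correction aims for.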
corrected_data <- qc_drift_correction(raw_data)

normalize_data() applies Total Sum Normalization (TSN) to account for
injection-level variability, such as differences in sample volume or
dilution. Alternatively, specifying method = 'auto' applies
autoscaling (scaling to mean = 0 and sd = 1).
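The two normalization options can be illustrated with base R on a toy long-format table. This is a sketch under assumptions: the column names and values are invented, TSN is taken to divide each abundance by its sample total, and autoscaling is taken to be applied per metabolite. The package may differ in both respects.

```r
# Toy long-format data: two samples, three metabolites each (invented values).
toy <- data.frame(
  sample      = rep(c("S1", "S2"), each = 3),
  metabolites = rep(c("A", "B", "C"), times = 2),
  abundance   = c(10, 30, 60, 20, 60, 120)
)

# Total Sum Normalization: each abundance as a fraction of its sample total.
toy$tsn <- toy$abundance / ave(toy$abundance, toy$sample, FUN = sum)

# Autoscaling: centre and scale each metabolite to mean 0 and sd 1.
toy$auto <- ave(toy$abundance, toy$metabolites,
                FUN = function(v) (v - mean(v)) / sd(v))
toy
```

Here S2 is simply S1 injected at twice the amount, so the TSN values of the two samples coincide once the total signal is divided out.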
normalized_data <- normalize_data(corrected_data)

The package includes several ggplot2-based functions to validate data
integrity.
- Function: Visualizes the global variance structure
- Interpretation: Tight clustering of QC samples (blue diamonds) in the “Corrected” plot indicates successful removal of technical noise.
- Function: Creates a waterfall plot of Coefficient of Variation (CV) percentages
- Interpretation: A horizontal dashed line at 20% marks the acceptable industry threshold for reproducibility.
- Function: Displays log10-transformed abundance distributions across the run.
- Interpretation: Consistent median values and interquartile range across samples indicate successful Total Sum Normalization (TSN).
- Function: Displays Z-score normalized intensities across injection orders.
- Interpretation: The removal of “striping” (vertical color gradients) confirms that temporal decay has been eliminated.
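The 20% CV threshold mentioned above can be checked numerically as well as visually. The sketch below computes the CV percentage per metabolite on invented QC data. `cv_percent()` is a hypothetical helper, and computing CV on QC samples only, and treating exactly 20% as passing, are assumptions of this example.

```r
# Hypothetical helper: coefficient of variation as a percentage.
cv_percent <- function(x) sd(x) / mean(x) * 100

# Invented QC intensities for two metabolites, four QC injections each.
qc_data <- data.frame(
  metabolites = rep(c("A", "B"), each = 4),
  abundance   = c(100, 102, 98, 100, 50, 80, 20, 50)
)

# One CV value per metabolite, then a flag against the 20% threshold.
cv_table <- aggregate(abundance ~ metabolites, data = qc_data, FUN = cv_percent)
names(cv_table)[2] <- "cv"
cv_table$passes <- cv_table$cv <= 20
cv_table
```

Metabolite A varies by only a few units around its mean and passes, while metabolite B swings widely between injections and is flagged.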