
🧪 labtools

⚗️ Analytical Chemistry Laboratory Data Processing Toolkit

A comprehensive R package for streamlining analytical chemistry workflows


🎯 About This Package

labtools is specifically designed for our lab’s analytical chemistry workflows and daily data processing needs. This package prioritizes ease of use for researchers who may not be R experts over maximum flexibility.

Design Philosophy:

  • Laboratory-focused: Built around common analytical chemistry tasks and workflows
  • User-friendly: Simple function calls with sensible defaults for non-R experts
  • Workflow-oriented: Functions designed to work together in typical lab data processing pipelines
  • Trade-offs: Some flexibility is sacrificed for simplicity and ease of use

Use the package as your needs dictate. If you require maximum flexibility, consider using the underlying packages (webchem, rcdk, etc.) directly.


🌟 Key Features

| Feature | Description |
| --- | --- |
| 🔍 Chemical Metadata Extraction | Enhanced automated retrieval from PubChem with robust error handling and CAS parsing |
| 🧬 Structure Database Management | Build and visualize chemical structure databases with advanced filtering |
| 📊 MS Data Processing | Export optimized databases for MS-FINDER and MS-DIAL software |
| 🔬 2D GC-MS Analysis | Process Canvas exports and combine multi-sample data with precision |
| ⚗️ Chemical Structure Conversion | SMILES to MOL file conversion with 2D coordinate generation |
| 📈 Semi-quantification Tools | Advanced analytical quantification with machine learning integration |
| 🎯 Interactive Visualization | Shiny-based chemical structure navigation and spectrum plotting |

💡 New to R? For detailed help on any function, type ?function_name in the R console (e.g., ?extract_cid).

🆕 Latest Updates (v0.3.00)

  • Enhanced metadata extraction: Improved parse_cas_clean(), extract_cid(), and extract_meta() functions
  • Better dependency management: mspcompiler is now optional - install only when needed
  • Robust error handling: Comprehensive input validation and informative error messages
  • Improved documentation: Detailed examples and parameter explanations for all functions
  • Enhanced synonyms extraction: Fixed and optimized PubChem synonyms retrieval
  • ASCII compliance: All code now uses ASCII characters for better portability
  • Comprehensive testing: All functions pass R CMD check with zero errors and warnings

🚀 Installation

Development Version (Recommended)

# Install devtools if you haven't already
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Install labtools from GitHub
devtools::install_github("QizhiSu/labtools")

System Requirements

  • R ≥ 4.0.0
  • Java ≥ 8 (for rcdk package)
  • Internet connection (for PubChem API access)

Optional Dependencies

Some functions require additional packages that are not automatically installed:

  • mspcompiler: Required for MSP library filtering (filter_msp() function)

    # Install when needed
    devtools::install_github("QizhiSu/mspcompiler")

Getting Help

  • Function help: Use ?function_name (e.g., ?extract_cid)
  • Package help: Use help(package = "labtools")
  • Examples: Use example(function_name) to run examples

Quick Start

library(labtools)
library(dplyr)

# 🔍 1. Extract chemical metadata from PubChem
data <- data.frame(
  Name = c("Caffeine", "Aspirin", "Glucose"),
  CAS = c("58-08-2", "50-78-2", "50-99-7")
)

# Extract CIDs and comprehensive metadata
data_with_cid <- extract_cid(data, name_col = "Name", cas_col = "CAS")
data_complete <- extract_meta(data_with_cid, cas = TRUE, flavornet = TRUE)

# 📊 2. Export for MS software
export4msdial(data_complete, polarity = "pos", output = "database_msdial.txt")
export4msfinder(data_complete, output = "database_msfinder.txt")

# ⚗️ 3. Convert SMILES to MOL files
smiles_data <- data.frame(
  ID = c("Caffeine", "Aspirin"),
  SMILES = c("Cn1cnc2c1c(=O)n(c(=O)n2C)C", "CC(=O)OC1=CC=CC=C1C(=O)O")
)
export_smiles_to_mol(smiles_data, output_dir = "mol_files")

# 🎯 4. Interactive chemical structure browser
navigate_chem(data_complete)  # Opens Shiny app

💡 Tip: Use ?extract_cid or ?export_smiles_to_mol to see detailed parameter descriptions and more examples!


📚 Detailed Function Examples

🔍 Chemical Metadata Extraction

library(labtools)

# Extract CIDs from chemical identifiers
compounds <- data.frame(
  Name = c("Caffeine", "Theobromine"),
  CAS = c("58-08-2", "83-67-0")
)

# Extract CIDs with proper error handling
# Use ?extract_cid for full parameter details
compounds_cid <- extract_cid(
  data = compounds,           # Input data frame
  name_col = "Name",         # Column with chemical names (default: "Name")
  cas_col = "CAS",           # Column with CAS numbers (default: "CAS")
  inchikey_col = "InChIKey", # Column with InChIKeys (default: "InChIKey")
  verbose = TRUE             # Show progress messages (default: TRUE)
)

# Extract comprehensive metadata
# Use ?extract_meta for all options
compounds_meta <- extract_meta(
  data = compounds_cid,      # Data frame with CID column (required)
  cas = TRUE,                # Extract CAS numbers (default: FALSE)
  flavornet = TRUE,          # Extract Flavornet data (default: FALSE)
  synonyms = TRUE,           # Extract synonyms (default: FALSE)
  uses = TRUE,               # Extract compound uses (default: FALSE)
  verbose = TRUE             # Show progress (default: TRUE)
)

# Add chemical classification using ClassyFire
# Use ?extract_classyfire for details
compounds_classified <- extract_classyfire(
  data = compounds_meta,     # Data frame with InChIKey column
  inchikey_col = "InChIKey", # InChIKey column name (default)
  name_col = "Name"          # Name column for progress (default)
)

Key Parameters Explained:

  • name_col, cas_col, inchikey_col: Specify which columns contain identifiers (defaults: "Name", "CAS", "InChIKey")
  • verbose: Set to FALSE to suppress progress messages (default: TRUE)
  • cas, flavornet, synonyms, uses: Enable extraction of specific metadata types (all default: FALSE)
  • Functions automatically handle network errors and provide informative progress messages
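Because lookups depend on the identifiers in the CAS column, it can help to screen out malformed CAS numbers before calling extract_cid(). The sketch below is a minimal base-R check of the CAS format and check digit. The helper name is_valid_cas() is hypothetical and is not part of labtools; labtools' own CAS parsing may work differently.

```r
# Hypothetical helper (not part of labtools): validate a CAS Registry Number
# by its format and check digit before sending it to PubChem.
is_valid_cas <- function(cas) {
  vapply(cas, function(x) {
    # Reject NA and anything not shaped like 2-7 digits, 2 digits, 1 digit
    if (is.na(x) || !grepl("^[0-9]{2,7}-[0-9]{2}-[0-9]$", x)) {
      return(FALSE)
    }
    digits <- as.integer(strsplit(gsub("-", "", x), "")[[1]])
    check <- digits[length(digits)]
    body <- rev(digits[-length(digits)])
    # Weighted sum: rightmost body digit x 1, next x 2, and so on;
    # the sum modulo 10 must equal the check digit
    sum(body * seq_along(body)) %% 10 == check
  }, logical(1), USE.NAMES = FALSE)
}

cas_numbers <- c("58-08-2", "50-78-2", "83-67-0", "58-08-3", NA)
print(is_valid_cas(cas_numbers))
```

Rows that fail the check can be corrected or dropped before the PubChem query, which avoids spending network requests on identifiers that cannot match.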

📤 Database Export

# Export for MS-FINDER software
# Use ?export4msfinder for details
export4msfinder(
  data = compounds_classified,              # Data with required columns
  output = "msfinder_database.txt"          # Output file path (default)
)

# Export for MS-DIAL (positive mode)
# Use ?export4msdial for all options
export4msdial(
  data = compounds_classified,              # Data with required columns
  polarity = "pos",                         # ESI polarity: "pos" or "neg" (default: "pos")
  output = "msdial_pos.txt"                # Output file path (default)
)

# Export for MS-DIAL (negative mode)
export4msdial(
  data = compounds_classified,
  polarity = "neg",                         # Negative mode for different adducts
  output = "msdial_neg.txt"
)

Required Columns:

  • MS-FINDER: Name, InChIKey, CID, ExactMass, Formula, SMILES
  • MS-DIAL: Name, ExactMass, SMILES, InChIKey

Polarity Options:

  • "pos": Generates [M+H]+, [M+Na]+, [M+K]+ adducts
  • "neg": Generates [M-H]-, [M+Cl]-, [M+HCOO]- adducts
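To make the adduct lists concrete, the sketch below computes the m/z of each listed adduct from a neutral monoisotopic mass. The mass shifts are standard literature values for singly charged adducts, not values read from the labtools source, and adduct_mz() is a hypothetical helper; export4msdial() may compute or format these differently.

```r
# Monoisotopic mass shifts (Da) for common singly charged adducts.
# Standard literature values; NOT taken from the labtools source code.
adduct_shift <- list(
  pos = c("[M+H]+" = 1.007276, "[M+Na]+" = 22.989218, "[M+K]+" = 38.963158),
  neg = c("[M-H]-" = -1.007276, "[M+Cl]-" = 34.969402, "[M+HCOO]-" = 44.998201)
)

# Hypothetical helper: m/z of each adduct for one neutral exact mass
adduct_mz <- function(exact_mass, polarity = c("pos", "neg")) {
  polarity <- match.arg(polarity)
  exact_mass + adduct_shift[[polarity]]
}

caffeine_mass <- 194.080376  # monoisotopic mass of C8H10N4O2
print(round(adduct_mz(caffeine_mass, "pos"), 4))
print(round(adduct_mz(caffeine_mass, "neg"), 4))
```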

⚗️ SMILES to MOL Conversion

# Prepare SMILES data
smiles_data <- data.frame(
  Compound_ID = c("CAFF_001", "THEO_002"),
  SMILES_String = c(
    "Cn1cnc2c1c(=O)n(c(=O)n2C)C",      # Caffeine
    "Cn1cnc2c1c(=O)[nH]c(=O)n2C"       # Theobromine
  ),
  stringsAsFactors = FALSE
)

# Convert to MOL files with 2D coordinates
# Use ?export_smiles_to_mol for all parameters
result <- export_smiles_to_mol(
  df = smiles_data,                     # Data frame with ID and SMILES columns
  id_col = "Compound_ID",               # Column with compound IDs (default: "ID")
  smiles_col = "SMILES_String",         # Column with SMILES strings (default: "SMILES")
  output_dir = "molecular_structures"   # Output directory (default: "mol_files")
)

# Check conversion results
print(result)
# Returns: list(success = 2, failed = 0, skipped = 0)

Function Features:

  • Input validation: Checks for valid data frame and required columns
  • Error handling: Continues processing even if some SMILES fail
  • 2D coordinates: Automatically generates 2D coordinates for visualization
  • Progress tracking: Returns summary of successful/failed conversions
  • Robust processing: Handles NA/NULL values gracefully

🔬 Canvas Data Processing

# Process Canvas 2D GC-MS data with detailed parameters
# Use ?read_canvas for complete parameter documentation
canvas_data <- read_canvas(
  path = "path/to/canvas/files",        # Directory containing Canvas .txt files
  ri_iden_tol = 30,                     # RI tolerance for identification (default: 30)
  ri_p_iden_tol = 100,                  # Predicted RI tolerance for identification (default: 100)
  ri_align_tol = 50,                    # RI tolerance for alignment (default: 50)
  rt_2d_align_tol = 0.1,                # 2D RT tolerance for alignment (default: 0.1)
  keep = "area"                         # Data to keep: "area", "height", or "both" (default: "area")
)

# Normalize by internal standard (D8)
# Use ?normalize_area for details
normalized_data <- normalize_area(
  df = canvas_data,                     # Data frame with peak area data
  start_col = 12                        # Starting column index of area data (default: 12)
)

# Filter areas based on sample codes
# Use ?keep_area for filtering options
filtered_data <- keep_area(
  df = normalized_data,                 # Data frame with peak area data
  sam_code = sample_codes,              # Data frame mapping sample codes to names
  start_col = 12,                       # Starting column index (default: 12)
  keep_bk = TRUE,                       # Keep blank samples (default: TRUE)
  keep_d8 = TRUE                        # Keep D8 internal standard (default: TRUE)
)

Key Parameters Explained:

  • ri_iden_tol, ri_p_iden_tol: Retention index tolerances for compound identification
  • ri_align_tol, rt_2d_align_tol: Tolerances for aligning peaks across samples
  • keep: Determines which peak data to retain (area is most common for quantification)
  • start_col: Column index where peak area data begins (typically after metadata columns)
  • keep_bk, keep_d8: Control whether to retain blank samples and internal standards
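The role of start_col and of the D8 internal standard can be illustrated with a toy table. The sketch below divides each sample's peak areas by the internal standard's area in that same sample. The table layout, the row label "Naphthalene-D8", and the helper normalize_by_is() are assumptions made for illustration; normalize_area() may locate the standard and handle edge cases differently.

```r
# Toy peak table: metadata columns first, then one area column per sample.
# The layout and the row label "Naphthalene-D8" are assumptions for this sketch.
peaks <- data.frame(
  Name = c("Naphthalene-D8", "Limonene", "Hexanal"),
  RI = c(1180, 1030, 800),
  S1 = c(2000, 500, NA),
  S2 = c(1000, 400, 100)
)

# Hypothetical helper: divide every area by the internal standard area
# measured in the same sample (column).
normalize_by_is <- function(df, start_col, is_name = "Naphthalene-D8") {
  area_cols <- start_col:ncol(df)
  is_row <- which(df$Name == is_name)
  stopifnot(length(is_row) == 1)           # exactly one internal standard row
  is_area <- unlist(df[is_row, area_cols])  # one IS area per sample
  df[area_cols] <- as.data.frame(
    sweep(as.matrix(df[area_cols]), 2, is_area, "/")
  )
  df
}

normalized <- normalize_by_is(peaks, start_col = 3)
print(normalized)
```

Here start_col = 3 because the toy table has only two metadata columns; the package default of 12 reflects the wider metadata block of a real Canvas export.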

📈 Semi-quantification Analysis

# Select appropriate standards using molecular similarity
# Use ?select_std for detailed parameter explanation
standards_assigned <- select_std(
  std_md = reference_standards,         # Standards data frame with response and descriptors
  std_res_col = 3,                      # Column index of response variable (or FALSE for all descriptors)
  std_md1_col = 14,                     # Starting column index of molecular descriptors in standards
  data_md = target_compounds,           # Target compounds data frame with descriptors
  data_md1_col = 5,                     # Starting column index of descriptors in targets
  top_npct_md = 30                      # Percentage of top molecular descriptors to use (default: 20)
)

# Calculate concentrations using assigned standards
# Use ?calculate_con for concentration calculation details
concentrations <- calculate_con(
  df = standards_assigned,              # Data frame after select_std() processing
  sam_weight = sample_weights,          # Data frame with sample weights and volumes (can be NULL)
  start_col = 12                        # Starting column index of peak area data (default: 12)
)

# Organize and summarize concentration data
# Use ?organize_con for result organization options
final_results <- organize_con(
  df = concentrations,                  # Data frame after calculate_con() processing
  sam_code = sample_codes,              # Data frame with sample codes and names
  start_col = 12,                       # Starting column index of concentration data (default: 12)
  digits = 3,                           # Decimal places for rounding (default: 2)
  na2zero = TRUE,                       # Replace NA with zero for calculations (default: TRUE)
  bind_mean_sd = TRUE                   # Combine mean and SD in single columns (default: TRUE)
)

Semi-quantification Workflow:

  1. Standard Selection: Uses molecular similarity to assign appropriate calibration standards
  2. Concentration Calculation: Applies standard curves to calculate concentrations
  3. Result Organization: Summarizes data with statistics and proper formatting

Key Parameters:

  • std_res_col: Set to FALSE to use all molecular descriptors without feature selection
  • top_npct_md: Controls how many molecular descriptors to use (higher = more descriptors)
  • sam_weight: Include sample weights for concentration per gram calculations
  • bind_mean_sd: Combines mean±SD into single readable columns
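The standard-selection step can be pictured as a nearest-neighbour search in descriptor space. The sketch below is a simplified stand-in, not the select_std() algorithm: it z-scores two made-up descriptors and assigns each target the standard with the smallest Euclidean distance. All table contents and the helper assign_nearest_std() are hypothetical, and the sketch omits the descriptor ranking that top_npct_md controls.

```r
# Toy descriptor tables; column names and values are made up for illustration.
standards <- data.frame(
  Name = c("Std_A", "Std_B"),
  logP = c(1.0, 4.0),
  MW = c(100, 250)
)
targets <- data.frame(
  Name = c("Target_1", "Target_2"),
  logP = c(1.2, 3.5),
  MW = c(110, 240)
)

# Hypothetical helper: for each target, pick the standard with the smallest
# Euclidean distance in z-scored descriptor space.
assign_nearest_std <- function(std, tgt, desc_cols) {
  # Scale standards and targets together so both share one scale
  all_desc <- scale(rbind(std[desc_cols], tgt[desc_cols]))
  n_std <- nrow(std)
  std_m <- all_desc[seq_len(n_std), , drop = FALSE]
  tgt_m <- all_desc[-seq_len(n_std), , drop = FALSE]
  nearest <- apply(tgt_m, 1, function(row) {
    # Columns of t(std_m) are standards; subtract the target from each
    which.min(sqrt(colSums((t(std_m) - row)^2)))
  })
  data.frame(Target = tgt$Name, Standard = std$Name[nearest])
}

assignment <- assign_nearest_std(standards, targets, c("logP", "MW"))
print(assignment)
```

Scaling matters here: without it, a descriptor with a large numeric range (such as molecular weight) would dominate the distance.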

📊 Spectral Analysis

# Convert spectrum string to numeric vector
# Use ?update_spectrum for format details
spectrum_str <- "100:30 120:80 145:60 170:90 200:20"
spectrum <- update_spectrum(
  spectrum_str = spectrum_str,          # Spectrum in "mz:intensity mz:intensity" format
  start_mz = 50,                        # Starting m/z value (default: 50)
  end_mz = 500,                         # Ending m/z value (default: 500)
  mz_step = 1,                          # m/z step size (default: 1)
  digits = 0                            # Decimal places for m/z rounding (default: 0)
)

# Create interactive spectrum plot
# Use ?plot_spectrum for plotting options
plot_spectrum(
  spectrum = spectrum,                  # Named numeric vector (m/z as names, intensity as values)
  range = 10,                           # Bin width for peak labeling (default: 10)
  threshold = 5,                        # Minimum intensity (%) for peak labeling (default: 1)
  max_ticks = 20                        # Maximum number of x-axis ticks (default: 20)
)

# Calculate spectral similarity using cosine similarity
# Use ?cosine_similarity for similarity calculation details
spec1 <- update_spectrum("100:30 120:80 145:60")
spec2 <- update_spectrum("100:25 120:85 145:55")
similarity <- cosine_similarity(
  spectrum1 = spec1,                    # First spectrum (named numeric vector)
  spectrum2 = spec2                     # Second spectrum (named numeric vector)
)

# Convert between MSP and data frame formats
# Use ?msp2df and ?df2msp for format conversion
msp_data <- msp2df(msp_file_path)      # Convert MSP file to data frame
df2msp(df_data, "output.msp")          # Convert data frame to MSP file

Spectral Data Formats:

  • Input: Spectrum strings in "mz:intensity mz:intensity" format
  • Processing: Converts to named numeric vectors for analysis
  • Output: Interactive plots and similarity scores

Key Functions:

  • update_spectrum(): Converts text format to numeric vectors
  • plot_spectrum(): Creates interactive mass spectrum plots
  • cosine_similarity(): Calculates spectral similarity (0-1 scale)
  • msp2df() / df2msp(): Convert between MSP files and data frames
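For readers who want to see what the conversion and similarity steps amount to, here is a minimal base-R sketch under stated assumptions. parse_spectrum() and cosine_sim() are hypothetical stand-ins written for this example; they mirror the documented input format and the textbook cosine formula, but update_spectrum() and cosine_similarity() may bin, weight, or normalize differently.

```r
# Hypothetical helper: parse "mz:intensity mz:intensity" into a named numeric
# vector on a fixed integer m/z grid (m/z values without a peak get 0).
# If two peaks round to the same m/z, the later one overwrites the earlier.
parse_spectrum <- function(spectrum_str, start_mz = 50, end_mz = 500) {
  pairs <- strsplit(strsplit(trimws(spectrum_str), "\\s+")[[1]], ":")
  mz <- round(as.numeric(vapply(pairs, `[`, character(1), 1)))
  intensity <- as.numeric(vapply(pairs, `[`, character(1), 2))
  grid <- start_mz:end_mz
  out <- setNames(numeric(length(grid)), grid)
  keep <- mz >= start_mz & mz <= end_mz   # drop peaks outside the grid
  out[as.character(mz[keep])] <- intensity[keep]
  out
}

# Plain cosine similarity: dot product divided by the product of the norms
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

spec1 <- parse_spectrum("100:30 120:80 145:60")
spec2 <- parse_spectrum("100:25 120:85 145:55")
print(cosine_sim(spec1, spec1))  # identical spectra
print(cosine_sim(spec1, spec2))  # similar spectra
```

Placing both spectra on the same m/z grid is what makes the element-wise product meaningful: each position in the two vectors refers to the same nominal mass.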

🔍 MSP Library Filtering

# Note: MSP filtering requires the mspcompiler package
# Install if needed: devtools::install_github("QizhiSu/mspcompiler")

# Prepare compound list for filtering
compounds_to_keep <- data.frame(
  Name = c("Caffeine", "Aspirin", "Glucose"),
  InChIKey = c("RYYVLZVUVIJVGH-UHFFFAOYSA-N",
               "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
               "WQZGKKKJIJFFOK-GASJEMHNSA-N"),
  stringsAsFactors = FALSE
)

# Filter MSP spectral library based on compound list
# Use ?filter_msp for filtering options
filter_msp(
  msp = "NIST_library.msp",             # Path to the MSP library file
  cmp_list = compounds_to_keep,         # Data frame with compounds to keep (Name and InChIKey columns required)
  keep_napd8 = TRUE,                    # Add Naphthalene-D8 to the filter list (default: TRUE)
  output = "filtered_library.msp"       # Output path for filtered MSP file
)

# Read MS-DIAL output files with filtering options
# Use ?read_msdial for reading options
msdial_data <- read_msdial(
  file_path = "msdial_results.txt",     # Path to MS-DIAL output file
  keep_unknown = FALSE,                 # Keep unknown compounds (default: FALSE)
  keep_spectrum = TRUE,                 # Keep spectrum data (default: FALSE)
  keep_mean_sd = FALSE                  # Keep mean and SD columns (default: FALSE)
)

MSP Library Management:

  • Purpose: Filter large spectral libraries to contain only compounds of interest
  • Requirements: Compound list must have Name and InChIKey columns
  • Features: Automatically includes Naphthalene-D8 internal standard
  • Output: Filtered MSP file ready for MS software

MS-DIAL Integration:

  • Input: MS-DIAL alignment result files
  • Options: Control retention of unknown compounds and spectral data
  • Output: Clean data frame ready for further analysis
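The idea behind MSP library filtering can be sketched without any extra packages. The example below treats an MSP library as blank-line-separated text records and keeps only those whose InChIKey appears in a list. It is an illustration of the concept, not the filter_msp() or mspcompiler implementation; filter_msp_lines() is a hypothetical helper, and the second record's InChIKey is a made-up placeholder.

```r
# Two minimal MSP records as text. Field names follow the common NIST-style
# layout; real libraries usually carry more fields per record.
msp_lines <- c(
  "Name: Caffeine",
  "InChIKey: RYYVLZVUVIJVGH-UHFFFAOYSA-N",
  "Num Peaks: 2",
  "109 50",
  "194 100",
  "",
  "Name: Unknown compound",
  "InChIKey: AAAAAAAAAAAAAA-UHFFFAOYSA-N",  # placeholder, not a real key
  "Num Peaks: 1",
  "57 100",
  ""
)

# Hypothetical helper: split on blank lines, keep records whose InChIKey
# is in `keys`, and return the surviving lines in their original order.
filter_msp_lines <- function(lines, keys) {
  is_blank <- trimws(lines) == ""
  # A blank line closes the record it follows, so it keeps that record's id
  record_id <- cumsum(is_blank) - is_blank
  records <- split(lines, record_id)
  keep <- vapply(records, function(rec) {
    key <- sub("^InChIKey:\\s*", "", grep("^InChIKey:", rec, value = TRUE))
    length(key) == 1 && key %in% keys
  }, logical(1))
  unlist(records[keep], use.names = FALSE)
}

filtered <- filter_msp_lines(msp_lines, "RYYVLZVUVIJVGH-UHFFFAOYSA-N")
writeLines(filtered)
```

Matching on InChIKey rather than on Name avoids misses caused by synonyms or spelling variants in the library.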


🔬 Advanced Workflows

Complete Database Creation Workflow

# Step 1: Prepare data
chemicals <- data.frame(
  Name = c("Benzene", "Toluene", "Xylene"),
  CAS = c("71-43-2", "108-88-3", "1330-20-7")
)

# Step 2: Extract comprehensive metadata
chemicals_cid <- extract_cid(chemicals, name_col = "Name", cas_col = "CAS")
chemicals_meta <- extract_meta(chemicals_cid, cas = TRUE, synonyms = TRUE)
chemicals_class <- extract_classyfire(chemicals_meta)

# Step 3: Export for different platforms
export4msdial(chemicals_class, polarity = "pos", output = "pos_database.txt")
export4msdial(chemicals_class, polarity = "neg", output = "neg_database.txt")
export4msfinder(chemicals_class, output = "msfinder_database.txt")

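# Step 4: Generate structure files
# id_col is set explicitly because this data frame has a Name column, not the default "ID"
export_smiles_to_mol(chemicals_class, id_col = "Name", output_dir = "structures")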

2D GC-MS Processing Pipeline

# Process Canvas data
canvas_data <- read_canvas("canvas_files", keep = "area")

# Extract metadata
canvas_meta <- extract_cid(canvas_data, name_col = "Name", cas_col = "CAS")
canvas_complete <- extract_meta(canvas_meta, cas = TRUE)

# Normalize and analyze
canvas_normalized <- normalize_area(canvas_complete)
canvas_filtered <- keep_area(canvas_normalized, sample_codes)

# Semi-quantification
standards <- select_std(ref_standards, 3, 14, canvas_filtered, 5)
concentrations <- calculate_con(standards, sample_weights)
results <- organize_con(concentrations, sample_codes)

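🧪 labtools: Analytical Chemistry Laboratory Data Processing Toolkit

A comprehensive R package designed for analytical chemistry workflows

labtools is designed specifically for analytical chemistry laboratory workflows and day-to-day data processing needs. The package prioritizes ease of use for researchers who are less familiar with R over maximum flexibility.

Design Philosophy:

  • Laboratory-focused: Built around common analytical chemistry tasks and workflows
  • User-friendly: Simple function calls with sensible defaults for non-R experts
  • Workflow-oriented: Functions designed to work together in typical lab data processing pipelines
  • Trade-offs: Some flexibility is sacrificed for simplicity and ease of use

Use the package as your needs dictate. If you require maximum flexibility, consider using the underlying packages (webchem, rcdk, etc.) directly.

🌟 Key Features

| Feature | Description |
| --- | --- |
| 🔍 Chemical Metadata Extraction | Automated retrieval from the PubChem database with a smart retry mechanism |
| 🧬 Structure Database Management | Build and visualize chemical structure databases with advanced filtering |
| 📊 MS Data Processing | Export optimized databases for MS-FINDER and MS-DIAL software |
| 🔬 2D GC-MS Analysis | Process Canvas exports precisely and combine multi-sample data |
| ⚗️ Chemical Structure Conversion | SMILES to MOL file conversion with 2D coordinate generation |
| 📈 Semi-quantification Tools | Advanced analytical quantification with machine learning integration |
| 🎯 Interactive Visualization | Shiny-based chemical structure navigation and spectrum plotting |

💡 New to R? For detailed help on any function, type ?function_name in the R console (e.g., ?extract_cid).

🚀 Installation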

# Install devtools if you haven't already
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Install labtools from GitHub
devtools::install_github("QizhiSu/labtools")

System Requirements

  • R ≥ 4.0.0
  • Java ≥ 8 (required by the rcdk package)
  • Internet connection (for PubChem API access)

Getting Help

  • Function help: Use ?function_name (e.g., ?extract_cid)
  • Package help: Use help(package = "labtools")
  • Examples: Use example(function_name) to run examples

Quick Start

library(labtools)
library(dplyr)

# 🔍 1. Extract chemical metadata from PubChem
data <- data.frame(
  Name = c("Caffeine", "Aspirin", "Glucose"),
  CAS = c("58-08-2", "50-78-2", "50-99-7")
)

# Extract CIDs and comprehensive metadata
data_with_cid <- extract_cid(data, name_col = "Name", cas_col = "CAS")
data_complete <- extract_meta(data_with_cid, cas = TRUE, flavornet = TRUE)

# 📊 2. Export for MS software
export4msdial(data_complete, polarity = "pos", output = "database_msdial.txt")
export4msfinder(data_complete, output = "database_msfinder.txt")

# ⚗️ 3. Convert SMILES to MOL files
smiles_data <- data.frame(
  ID = c("Caffeine", "Aspirin"),
  SMILES = c("Cn1cnc2c1c(=O)n(c(=O)n2C)C", "CC(=O)OC1=CC=CC=C1C(=O)O")
)
export_smiles_to_mol(smiles_data, output_dir = "mol_files")

# 🎯 4. Interactive chemical structure browser
navigate_chem(data_complete)  # Opens Shiny app

💡 Tip: Use ?extract_cid or ?export_smiles_to_mol to see detailed parameter descriptions and more examples!

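📚 Detailed Function Examples

🔍 Chemical Metadata Extraction

library(labtools)

# Extract CIDs from chemical identifiers
compounds <- data.frame(
  Name = c("Caffeine", "Theobromine"),
  CAS = c("58-08-2", "83-67-0")
)

# Extract CIDs with proper error handling
# Use ?extract_cid for full parameter details
compounds_cid <- extract_cid(
  data = compounds,           # Input data frame
  name_col = "Name",         # Column with chemical names (default: "Name")
  cas_col = "CAS",           # Column with CAS numbers (default: "CAS")
  inchikey_col = "InChIKey", # Column with InChIKeys (default: "InChIKey")
  verbose = TRUE             # Show progress messages (default: TRUE)
)

# Extract comprehensive metadata
# Use ?extract_meta for all options
compounds_meta <- extract_meta(
  data = compounds_cid,      # Data frame with CID column (required)
  cas = TRUE,                # Extract CAS numbers (default: FALSE)
  flavornet = TRUE,          # Extract Flavornet data (default: FALSE)
  synonyms = TRUE,           # Extract synonyms (default: FALSE)
  uses = TRUE,               # Extract compound uses (default: FALSE)
  verbose = TRUE             # Show progress (default: TRUE)
)

# Add chemical classification using ClassyFire
# Use ?extract_classyfire for details
compounds_classified <- extract_classyfire(
  data = compounds_meta,     # Data frame with InChIKey column
  inchikey_col = "InChIKey", # InChIKey column name (default)
  name_col = "Name"          # Name column for progress display (default)
)

Key Parameters Explained:

  • name_col, cas_col, inchikey_col: Specify which columns contain identifiers (defaults: "Name", "CAS", "InChIKey")
  • verbose: Set to FALSE to suppress progress messages (default: TRUE)
  • cas, flavornet, synonyms, uses: Enable extraction of specific metadata types (all default: FALSE)
  • Functions automatically handle network errors and provide detailed progress messages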


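📄 License

This project is released under the MIT License. See the LICENSE file for details.


👨‍🔬 Author

Qizhi Su (苏启枝), package developer

  • 📧 Email: sukissqz@gmail.com
  • 🆔 ORCID: 0000-0002-8124-997X
  • 🐙 GitHub: @QizhiSu


⭐ If you find labtools useful, please consider giving it a star! ⭐


🔬 Making analytical chemistry data processing simple and efficient 🔬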

Streamlining analytical chemistry workflows with precision and elegance
