Smutin Daniil edited this page May 15, 2025 · 3 revisions

samova.R – Metagenomic Data Generation and Analysis

A tool for generating synthetic metagenomic samples with realistic taxonomic profiles, designed for benchmarking, machine learning augmentation, and ecological modeling.


Installation

From GitHub

# Install devtools only if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("ctlab/samovar")  
library(samovaR)  

Basic Usage

1. Downloading Data from GMrepo

Fetch metagenomic data using GMrepo_type2data(). Specify filters to refine your dataset.

data <- GMrepo_type2data(
  mesh_ids = "D006262",       # Disease MeSH ID (optional)  
  number_to_process = 1500,   # Max samples to retrieve  
  threshold_amount = 0.0001,  # Min abundance threshold  
  threshold_samples = 5,      # Min samples per species  
  threshold_species = 5       # Min species per sample  
)

2. Preprocessing

Trim low-abundance species and sparse samples before normalization.

filtered_data <- teatree_trim(
  data,  
  threshold_amount = 0.0001,  
  threshold_samples = 5,  
  threshold_species = 5  
)
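
To make the three thresholds concrete, here is a minimal base-R sketch of this kind of filtering on a toy abundance matrix. It is not the teatree_trim() implementation: the function name trim_sketch, the species-by-sample orientation, and the order in which the filters run are all assumptions made for illustration.

```r
# Hypothetical illustration only: not samovaR's teatree_trim().
# Assumes rows are species and columns are samples.
trim_sketch <- function(abund, threshold_amount = 0.0001,
                        threshold_samples = 5, threshold_species = 5) {
  # Treat abundances below the threshold as absent
  abund[abund < threshold_amount] <- 0
  # Keep species present in at least `threshold_samples` samples
  keep_species <- rowSums(abund > 0) >= threshold_samples
  abund <- abund[keep_species, , drop = FALSE]
  # Keep samples that still contain at least `threshold_species` species
  keep_samples <- colSums(abund > 0) >= threshold_species
  abund[, keep_samples, drop = FALSE]
}

# Toy data: 20 species x 10 samples, with roughly 60% zeros
set.seed(1)
toy <- matrix(rexp(200) * rbinom(200, 1, 0.4), nrow = 20,
              dimnames = list(paste0("sp", 1:20), paste0("s", 1:10)))
trimmed <- trim_sketch(toy, threshold_samples = 3, threshold_species = 3)
dim(trimmed)
```

Stricter thresholds shrink the table in both dimensions, so it is worth checking dim() after trimming before moving on.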

3. Normalization & Clustering

Apply log-transform normalization and cluster species into functional groups.

normalized_data <- tealeaves_pack(filtered_data)  
clustered_data <- teabag_brew(
  normalized_data,  
  min_cluster_size = 30,  # Minimum species per cluster  
  max_cluster_size = 150  # Maximum species per cluster  
)
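
For intuition, the two ideas in this step (log-transforming abundances, then grouping species that vary together) can be sketched in base R. This is not how tealeaves_pack() or teabag_brew() are implemented: the pseudocount, the correlation distance, the average-linkage method, and the fixed number of clusters are assumptions chosen for a small runnable example, and the sketch does not enforce minimum or maximum cluster sizes.

```r
# Hypothetical illustration only: not tealeaves_pack() or teabag_brew().
# Assumes rows are species and columns are samples.
set.seed(42)
abund <- matrix(rexp(30 * 12), nrow = 30,
                dimnames = list(paste0("sp", 1:30), paste0("s", 1:12)))

# Log-transform with a small pseudocount so zeros stay finite
pseudocount <- 1e-6
log_abund <- log10(abund + pseudocount)

# Cluster species by how similarly they vary across samples
species_dist <- as.dist(1 - cor(t(log_abund)))
tree <- hclust(species_dist, method = "average")
clusters <- cutree(tree, k = 4)

# Cluster sizes, to compare against chosen minimum and maximum bounds
table(clusters)
```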

4. Generate Synthetic Samples

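Pass the clustered data through concotion_pour(), then use samovar_boil() to create artificial metagenomic communities.

# %>% is the magrittr pipe; load magrittr or dplyr if it is not found
synthetic_data <- concotion_pour(clustered_data) %>%
  samovar_boil(N = 500)  # Generate 500 synthetic samples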

5. Visualization

Plot taxonomic composition with viz_composition().

viz_composition(
  synthetic_data,  
  type = "tile",       # "tile", "bar", or "heatmap"  
  interactive = TRUE,  # Interactive plot (if supported)  
  top = 20            # Display top 20 species  
)

Machine Learning Integration

Augment Training Data

Generate synthetic samples to improve classifier performance.

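library(randomForest)  # provides randomForest()

# Combine real and synthetic data (both must share the same columns)
training_data <- rbind(real_data, synthetic_data$data)

# Train a Random Forest model; a factor response selects classification
feature_cols <- setdiff(names(training_data), "label")
model <- randomForest(
  x = training_data[, feature_cols],
  y = as.factor(training_data$label),
  importance = TRUE
)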
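
Note that rbind() on data frames fails when the two tables do not have the same column names, which can happen if the real and synthetic tables cover different species. The sketch below shows one way to align them first. The helper align_columns and the toy tables are hypothetical, and the layout (samples in rows, species in columns, plus a label column) is an assumption; check the actual structure of synthetic_data$data before applying it.

```r
# Hypothetical helper, not part of samovaR: give two sample-by-species
# data frames the same columns, filling absent species with zero abundance.
align_columns <- function(a, b) {
  all_cols <- union(names(a), names(b))
  fill <- function(df) {
    for (col in setdiff(all_cols, names(df))) df[[col]] <- 0
    df[, all_cols, drop = FALSE]
  }
  list(a = fill(a), b = fill(b))
}

# Toy tables with partly different species
real_toy <- data.frame(sp1 = c(0.6, 0.2), sp2 = c(0.4, 0.8),
                       label = c("case", "control"))
synth_toy <- data.frame(sp1 = c(0.5, 0.1), sp3 = c(0.5, 0.9),
                        label = c("case", "control"))

aligned <- align_columns(real_toy, synth_toy)
combined <- rbind(aligned$a, aligned$b)
combined
```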

Evaluate Performance

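Evaluate the model on held-out real data. To compare accuracy with and without synthetic data, repeat the evaluation with a model trained on real_data only.

library(caret)  # provides confusionMatrix()
predictions <- predict(model, test_data)
# confusionMatrix() expects two factors with the same levels
confusionMatrix(predictions, factor(test_data$label, levels = levels(predictions)))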
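
A with/without comparison needs two sets of predictions on the same test set: one from a model trained on real data only, one from a model trained on the augmented data. The base-R sketch below shows only the bookkeeping. The label vectors are made-up placeholders, not results from samovaR, and the accuracy() helper is a hypothetical convenience function.

```r
# Illustrative helper: overall accuracy from predicted and true labels
accuracy <- function(predicted, truth) {
  mean(as.character(predicted) == as.character(truth))
}

# Made-up labels standing in for two models' predictions on one test set
truth          <- c("case", "case", "control", "control", "case")
pred_real_only <- c("case", "control", "control", "case", "case")
pred_augmented <- c("case", "case", "control", "case", "case")

acc_real_only <- accuracy(pred_real_only, truth)
acc_augmented <- accuracy(pred_augmented, truth)
c(real_only = acc_real_only, augmented = acc_augmented)

# Per-class breakdown for one of the models
table(predicted = pred_augmented, actual = truth)
```

Keep the test set free of synthetic samples in both runs, so the comparison reflects performance on real data.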

Advanced Features

  • Custom Normalization: Replace tealeaves_pack() with a user-defined function.
  • Cluster Optimization: Adjust min_cluster_size and max_cluster_size for finer grouping.
  • Web Interface: Launch the Shiny app with samovar_browse().
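
As an example of a user-defined normalization, here is a minimal centered log-ratio (CLR) transform in base R. The function name clr_normalize, the pseudocount, and the species-by-sample orientation are assumptions, and the sketch does not show how a custom function plugs into the rest of the pipeline: the object structure that teabag_brew() expects should be checked against the package documentation.

```r
# Hypothetical user-defined normalization: centered log-ratio (CLR).
# Assumes rows are species and columns are samples.
clr_normalize <- function(abund, pseudocount = 1e-6) {
  log_abund <- log(abund + pseudocount)
  # Subtract each sample's mean log abundance (log of its geometric mean)
  sweep(log_abund, 2, colMeans(log_abund), "-")
}

toy <- matrix(c(0.5, 0.3, 0.2,
                0.1, 0.0, 0.9), nrow = 3,
              dimnames = list(c("sp1", "sp2", "sp3"), c("s1", "s2")))
clr <- clr_normalize(toy)
colSums(clr)  # each column sums to (numerically) zero
```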

Troubleshooting

  • Missing species or samples? → Lower the threshold_* parameters in teatree_trim().
  • Slow clustering? → Reduce max_cluster_size.
  • Poor ML performance? → Increase synthetic sample count (N in samovar_boil()).

Contribute & Cite


📌 Tip: For detailed examples, see the examples/ folder in the repository.

🚀 Happy metagenomic modeling!