Home
Smutin Daniil edited this page May 15, 2025
A tool for generating synthetic metagenomic samples with realistic taxonomic profiles, designed for benchmarking, machine learning augmentation, and ecological modeling.
if (!require("devtools")) install.packages("devtools")
devtools::install_github("ctlab/samovar")
library(samovaR)

Fetch metagenomic data using GMrepo_type2data(). Specify filters to refine your dataset.

data <- GMrepo_type2data(
  mesh_ids = "D006262",       # Disease MeSH ID (optional)
  number_to_process = 1500,   # Max samples to retrieve
  threshold_amount = 0.0001,  # Min abundance threshold
  threshold_samples = 5,      # Min samples per species
  threshold_species = 5       # Min species per sample
)

Trim low-abundance species and normalize data for analysis.
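To make the three thresholds concrete, here is a minimal base-R sketch of one plausible reading of them. It is not samovaR's implementation: the function name trim_abundance, the species-by-sample matrix layout, and the use of >= comparisons are all assumptions made for illustration.

```r
# Illustrative only: a base-R reading of the three thresholds, not samovaR code.
# Assumed layout: rows are species, columns are samples, values are relative abundances.
trim_abundance <- function(abund, threshold_amount = 1e-4,
                           threshold_samples = 5, threshold_species = 5) {
  # Treat abundances below the threshold as absent
  abund[abund < threshold_amount] <- 0
  # Keep species detected in at least `threshold_samples` samples
  keep_species <- rowSums(abund > 0) >= threshold_samples
  abund <- abund[keep_species, , drop = FALSE]
  # Keep samples that still contain at least `threshold_species` species
  keep_samples <- colSums(abund > 0) >= threshold_species
  abund[, keep_samples, drop = FALSE]
}

# Toy example: 4 species x 3 samples
toy <- matrix(
  c(0.50,    0.40, 0.00,
    0.30,    0.00, 0.00,
    0.20,    0.60, 1.00,
    0.00005, 0.00, 0.00),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("sp", 1:4), paste0("s", 1:3))
)
trimmed <- trim_abundance(toy, threshold_samples = 2, threshold_species = 2)
trimmed
```

In the toy data, sp2 and sp4 are dropped because each is seen in fewer than two samples, and s3 is then dropped because only one species remains in it.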
filtered_data <- teatree_trim(
  data,
  threshold_amount = 0.0001,
  threshold_samples = 5,
  threshold_species = 5
)

Apply log-transform normalization and cluster species into functional groups.
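As a rough illustration of these two ideas, the sketch below log-transforms a toy abundance matrix and clusters species with base R's hclust() and cutree(). It is a generic example under stated assumptions, not a description of what tealeaves_pack() or teabag_brew() do internally: the pseudocount, the distance measure, the linkage method, and the number of clusters are arbitrary choices.

```r
# Illustrative only: generic log-transform and clustering in base R.
set.seed(1)

# Toy matrix: 6 species (rows) x 8 samples (columns) of relative abundances
abund <- matrix(runif(48), nrow = 6,
                dimnames = list(paste0("sp", 1:6), paste0("s", 1:8)))
abund <- sweep(abund, 2, colSums(abund), "/")  # make each column sum to 1

# Log-transform with a small pseudocount so zeros stay finite
pseudocount <- 1e-6  # assumed value
log_abund <- log10(abund + pseudocount)

# Cluster species by the similarity of their profiles across samples
tree <- hclust(dist(log_abund), method = "average")
clusters <- cutree(tree, k = 2)  # k = 2 chosen arbitrarily for the toy data
table(clusters)                  # cluster sizes
```

The table of cluster sizes is the quantity that size limits such as min_cluster_size and max_cluster_size would constrain.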
normalized_data <- tealeaves_pack(filtered_data)
clustered_data <- teabag_brew(
  normalized_data,
  min_cluster_size = 30,  # Minimum species per cluster
  max_cluster_size = 150  # Maximum species per cluster
)

Use samovar_boil() to create artificial metagenomic communities.
synthetic_data <- concotion_pour(clustered_data) %>%
  samovar_boil(N = 500)  # Generate 500 synthetic samples

Plot taxonomic composition with viz_composition().
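A composition plot limited to the top species usually needs the remaining species handled somehow. The sketch below shows one common approach in base R: rank species by mean abundance and sum the rest into an "Other" row. The function top_species and the ranking criterion are assumptions for illustration. How viz_composition() selects its top species is not stated here.

```r
# Illustrative only: one plausible way to pick "top" species before plotting.
# Assumed layout: rows are species, columns are samples.
top_species <- function(abund, top = 20) {
  # Rank species by mean abundance across samples
  ranked <- order(rowMeans(abund), decreasing = TRUE)
  keep <- ranked[seq_len(min(top, nrow(abund)))]
  kept <- abund[keep, , drop = FALSE]
  # Sum everything else into a single "Other" row so columns keep their totals
  rest <- abund[setdiff(seq_len(nrow(abund)), keep), , drop = FALSE]
  if (nrow(rest) > 0) {
    kept <- rbind(kept, Other = colSums(rest))
  }
  kept
}

toy <- matrix(c(0.5, 0.4,
                0.3, 0.3,
                0.1, 0.2,
                0.1, 0.1),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("sp", 1:4), c("s1", "s2")))
top2 <- top_species(toy, top = 2)
top2
```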
viz_composition(
  synthetic_data,
  type = "tile",       # "tile", "bar", or "heatmap"
  interactive = TRUE,  # Interactive plot (if supported)
  top = 20             # Display top 20 species
)

Generate synthetic samples to improve classifier performance.
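Base R's rbind() on data frames requires both tables to have the same column names, so real and synthetic tables that cover different species need aligning first. The helper below is a hypothetical sketch (align_columns is not part of samovaR) that assumes samples in rows and species in columns, and fills species missing from one table with zeros.

```r
# Illustrative only: align two sample-by-species tables before rbind().
align_columns <- function(a, b, fill = 0) {
  all_cols <- union(names(a), names(b))
  add_missing <- function(df) {
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- rep(fill, nrow(df))  # species absent from this table
    }
    df[, all_cols, drop = FALSE]        # same column order in both tables
  }
  list(a = add_missing(a), b = add_missing(b))
}

real_toy  <- data.frame(sp1 = c(0.6, 0.7), sp2 = c(0.4, 0.3))
synth_toy <- data.frame(sp2 = 0.5, sp3 = 0.5)

aligned  <- align_columns(real_toy, synth_toy)
combined <- rbind(aligned$a, aligned$b)
combined
```

A label column should be present in both tables before aligning, since zero-filling it would be meaningless. Alignment can also change column order, so selecting predictors by name is safer than by position.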
library(randomForest)

# Combine real and synthetic data
training_data <- rbind(real_data, synthetic_data$data)
# Train a Random Forest model (the label is assumed to be the last column)
model <- randomForest(
  y = as.factor(training_data$label),  # classification needs a factor response
  x = training_data[, -ncol(training_data)],
  importance = TRUE
)

Compare model accuracy with and without synthetic data.
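The comparison can be sketched as training the same model twice, once on real data only and once on real plus synthetic data, and scoring both on the same held-out test set. The example below is a self-contained toy under heavy assumptions: a nearest-centroid classifier stands in for the random forest, all data are simulated Gaussians, and the "synthetic" samples are drawn from the true distribution, which is more favorable than any real generator.

```r
# Illustrative only: compare held-out accuracy with and without augmentation.
fit_centroids <- function(x, y) {
  # One mean vector per class (rows = classes, columns = features)
  t(sapply(split(as.data.frame(x), y), colMeans))
}
predict_centroids <- function(centroids, x) {
  x <- as.matrix(x)
  # Squared distance from every sample to every class centroid
  d <- sapply(rownames(centroids), function(cl) {
    rowSums(sweep(x, 2, centroids[cl, ])^2)
  })
  d <- matrix(d, nrow = nrow(x), dimnames = list(NULL, rownames(centroids)))
  colnames(d)[apply(d, 1, which.min)]
}
accuracy <- function(pred, truth) mean(pred == as.character(truth))

set.seed(42)
make_samples <- function(n, centre, label) {
  data.frame(f1 = rnorm(n, centre), f2 = rnorm(n, centre), label = label)
}
real      <- rbind(make_samples(5, 0, "healthy"),   make_samples(5, 2, "disease"))
synthetic <- rbind(make_samples(50, 0, "healthy"),  make_samples(50, 2, "disease"))
test      <- rbind(make_samples(100, 0, "healthy"), make_samples(100, 2, "disease"))

evaluate <- function(train) {
  model <- fit_centroids(train[, c("f1", "f2")], train$label)
  accuracy(predict_centroids(model, test[, c("f1", "f2")]), test$label)
}
acc_real_only <- evaluate(real)
acc_augmented <- evaluate(rbind(real, synthetic))
c(real_only = acc_real_only, augmented = acc_augmented)
```

The test set must contain only real samples that were not used for training or for generating the synthetic data. Otherwise the comparison overstates the benefit of augmentation.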
library(caret)  # provides confusionMatrix()

predictions <- predict(model, test_data)
confusionMatrix(predictions, test_data$label)

- Custom Normalization: Replace tealeaves_pack() with a user-defined function.
- Cluster Optimization: Adjust min_cluster_size and max_cluster_size for finer grouping.
- Web Interface: Launch the Shiny app with samovar_browse().
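As an example of what a user-defined normalization might look like, the sketch below implements a centered log-ratio (CLR) transform in base R. The function clr_normalize, its pseudocount, and the species-by-sample layout are assumptions. How a custom function is wired into the samovaR pipeline is not described on this page, so that step is left out.

```r
# Illustrative only: a user-defined normalization (centered log-ratio, CLR).
clr_normalize <- function(abund, pseudocount = 1e-6) {
  # abund: species in rows, samples in columns (assumed layout)
  log_abund <- log(abund + pseudocount)
  # Subtract each sample's mean log-abundance (the log of its geometric mean)
  sweep(log_abund, 2, colMeans(log_abund), "-")
}

toy <- matrix(c(0.7, 0.2, 0.1,
                0.1, 0.1, 0.8),
              nrow = 3,
              dimnames = list(paste0("sp", 1:3), c("s1", "s2")))
clr <- clr_normalize(toy)
colSums(clr)  # each column sums to zero, up to floating-point error
```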

- Error: Missing species/samples? → Adjust the threshold_ parameters in teatree_trim().
- Slow clustering? → Reduce max_cluster_size.
- Poor ML performance? → Increase the synthetic sample count (N in samovar_boil()).
- GitHub: github.com/dsmutin/samovar
- Citation: If used in research, cite the repository and related papers.
📌 Tip: For detailed examples, see the examples/ folder in the repository.
🚀 Happy metagenomic modeling!