diff --git a/inst/pages/alpha_diversity.qmd b/inst/pages/alpha_diversity.qmd index 599b38c6..4ceb23ea 100644 --- a/inst/pages/alpha_diversity.qmd +++ b/inst/pages/alpha_diversity.qmd @@ -34,8 +34,8 @@ evident from their names. @bastiaanssen2023bugs1 lay out this relationship across two factors (See table below); First, alpha diversity metrics can be defined as special cases of a unifying equation of **diversity**, where the **Hill number** determines the specific index captured. Lower Hill numbers -favour **richness**, the number of distinct taxonomic features, whereas higher -numbers favour **evenness**, how the taxonomic features are distributed over +favor **richness**, the number of distinct taxonomic features, whereas higher +numbers favor **evenness**, how the taxonomic features are distributed over the sample [@Hill1973]. Second, some alpha diversity metrics are weighted based on phylogeny, like Faith's PD [-@Faith1992] and PhILR [@Silverman2017]. @@ -192,7 +192,7 @@ barcode). ```{r} #| label: plot_richness #| message: false -#| fig-cap: "Observed richness plotted grouped by sample type with colour-labelled barcode." +#| fig-cap: "Observed richness plotted grouped by sample type with color-labeled barcode." library(scater) plotColData( diff --git a/inst/pages/clustering.qmd b/inst/pages/clustering.qmd index 32096105..514a7ea4 100644 --- a/inst/pages/clustering.qmd +++ b/inst/pages/clustering.qmd @@ -118,7 +118,7 @@ clusters to describe the data. Now, we visualize the hierarchical structure of the clusters with a dendrogram tree. In dendrograms, the tree is split where the branch length is the largest. In each splitting point, the tree is divided into two clusters leading to the -hierarchy. In this example, each sample is labelled by their dominant taxon +hierarchy. In this example, each sample is labeled by their dominant taxon to visualize ecological differences between the clusters. ```{r} diff --git a/inst/pages/community_similarity.qmd b/inst/pages/community_similarity.qmd index e6f9d4bd..e958a087 100644 --- a/inst/pages/community_similarity.qmd +++ b/inst/pages/community_similarity.qmd @@ -786,7 +786,7 @@ the `plotRDA()` function from the `r BiocStyle::Biocpkg("miaViz")` package. # Load packages for plotting function library(miaViz) -# Generate RDA plot coloured by clinical status +# Generate RDA plot colored by clinical status plotRDA(tse2, "RDA", colour.by = "ClinicalStatus") ``` @@ -1101,7 +1101,7 @@ eigenvalues? 6. Visualize the first two principal components. 7. Explore `colData` and visualize the first two principal components again, -now with samples coloured based on a variable from the sample metadata. Can you +now with samples colored based on a variable from the sample metadata. Can you observe any patterns? 8. Visualize the PCA loadings for the two first components. Which features have diff --git a/inst/pages/containers.qmd b/inst/pages/containers.qmd index 9871ea3c..4f636d85 100644 --- a/inst/pages/containers.qmd +++ b/inst/pages/containers.qmd @@ -188,7 +188,7 @@ assay(tse, "counts") |> head() In summary, in the world of microbiome analysis, an assay is essentially a way to describe the composition of microbes in a given -sample. This way we can summarise the microbiome profile of a human gut +sample. This way we can summarize the microbiome profile of a human gut or a sample of soil. Furthermore, to illustrate the use of multiple assays, we can create an diff --git a/inst/pages/contributions.qmd b/inst/pages/contributions.qmd index 9e378d7d..d403e481 100644 --- a/inst/pages/contributions.qmd +++ b/inst/pages/contributions.qmd @@ -249,7 +249,7 @@ This work has been supported by: * [Research Council of Finland](https://www.aka.fi/) * [FindingPheno](https://www.findingpheno.eu/) European Union’s Horizon 2020 -research and innovation programme under grant agreement No 952914 +research and innovation program under grant agreement No 952914 * COST Action network on Statistical and Machine Learning Techniques for Human Microbiome Studies diff --git a/inst/pages/correlation.qmd b/inst/pages/correlation.qmd index 9601c45e..7773e9dc 100644 --- a/inst/pages/correlation.qmd +++ b/inst/pages/correlation.qmd @@ -19,7 +19,7 @@ we will demonstrate how to perform correlation analysis with ## Association between taxonomic features -Here we demonstrate how to analyse which bacteria co-exists in the dataset. +Here we demonstrate how to analyze which bacteria co-exists in the dataset. ```{r} #| label: association1 diff --git a/inst/pages/extra_material/add-comm-typing.Rmd b/inst/pages/extra_material/add-comm-typing.Rmd index 5986173d..86ece7b6 100644 --- a/inst/pages/extra_material/add-comm-typing.Rmd +++ b/inst/pages/extra_material/add-comm-typing.Rmd @@ -134,7 +134,7 @@ res <- lapply(k, ClustDiagPlot) ### Composition barplot -A typical way to visualise microbiome composition is by using a composition barplot. +A typical way to visualize microbiome composition is by using a composition barplot. In the following, we agglomerate to the phylum level and subset by the country "Finland" to avoid long computation times. The samples in the barplot are ordered by "Firmicutes": ```{r, message=FALSE, warning=FALSE} @@ -155,7 +155,7 @@ plotAbundance(tse, rank = "Phylum", order.row.by = "abund", order.col.by = "Firm ### Composition heatmap -The community composition can be visualised with a heatmap where one axis represents the samples and the other taxa. The colour of each line represents the abundance of a taxon in a specific sample. +The community composition can be visualized with a heatmap where one axis represents the samples and the other taxa. The color of each line represents the abundance of a taxon in a specific sample. Here, the CLR + Z-transformed abundances are shown. @@ -196,7 +196,7 @@ grid.text("Phylum", x = -0.04, y = 0.47, rot = 90, gp = gpar(fontsize = 16)) ## Cluster into CSTs -The burden of specifying the number of clusters falls on the researcher. To help make an informed decision, we turn to previously established methods for doing so. In this section we introduce three such methods (aside from DMM analysis) to cluster similar samples. They include the [Elbow Method, Silhouette Method, and Gap Statistic Method](https://uc-r.github.io/kmeans_clustering). All of them will utilise the [`kmeans'](https://uc-r.github.io/kmeans_clustering) algorithm which essentially assigns clusters and minimises the distance within clusters (a sum of squares calculation). The default distance metric used is the Euclidean metric. +The burden of specifying the number of clusters falls on the researcher. To help make an informed decision, we turn to previously established methods for doing so. In this section we introduce three such methods (aside from DMM analysis) to cluster similar samples. They include the [Elbow Method, Silhouette Method, and Gap Statistic Method](https://uc-r.github.io/kmeans_clustering). All of them will utilize the [`kmeans'](https://uc-r.github.io/kmeans_clustering) algorithm which essentially assigns clusters and minimizes the distance within clusters (a sum of squares calculation). The default distance metric used is the Euclidean metric. The scree plot allows us to see how much of the variance is captured by each dimension in the MDS ordination. @@ -260,7 +260,7 @@ The function says that the bend occurs at $k=3$, however it is hard to tell that ### Silhouette Method -This method on the otherhand returns a width for each $k$. In this case, we want the $k$ that maximises the width. +This method on the otherhand returns a width for each $k$. In this case, we want the $k$ that maximizes the width. ```{r silhouette} # Silhouette method @@ -272,7 +272,7 @@ The graph shows the maximum occurring at $k=6$. At the very least, there is stro ### Gap-Statistic Method -The Gap-Statistic Method is the most complicated among the methods discussed here. With the gap statistic method, we typically want the $k$ value that maximises the output (local and global maxima), but we also want to pay attention to where the plot jumps if the maximum value doesn't turn out to be helpful. +The Gap-Statistic Method is the most complicated among the methods discussed here. With the gap statistic method, we typically want the $k$ value that maximizes the output (local and global maxima), but we also want to pay attention to where the plot jumps if the maximum value doesn't turn out to be helpful. ```{r gap-statistic} # Gap Statistic Method @@ -282,7 +282,7 @@ factoextra::fviz_nbclust(x, kmeans, method = "gap_stat", nboot = 50)+ The peak suggests $k=6$ clusters. If we also look to the points where the graph jumps, we can see there is evidence for $k=2$, $k=6$, and $k=8$. The output indicates that there should be at least three clusters present. Since we have previous evidence for the existence of six clusters from the silhouette and elbow methods, we will go with $k=6$. -At this point it helps to visualise the clustering in an MDS or NMDS plot. +At this point it helps to visualize the clustering in an MDS or NMDS plot. Now, let's divide the subjects into their respective clusters. @@ -307,7 +307,7 @@ library(scater) library(RColorBrewer) library(patchwork) -# set up colours +# set up colors CSTColors <- brewer.pal(6, "Paired")[c(2, 5, 3, 4, 1, 6)] names(CSTColors) <- CSTs diff --git a/inst/pages/extra_material/extra_material.qmd b/inst/pages/extra_material/extra_material.qmd index a5dbd76f..4229a3a3 100644 --- a/inst/pages/extra_material/extra_material.qmd +++ b/inst/pages/extra_material/extra_material.qmd @@ -189,7 +189,7 @@ Here we'll show an example of how to add relative abundances and CLR normalized OTU tables to your tse assays. With phyloseq you would need three different phyloseq objects, each taking up -7.7 MB of memory, whilst the tse with the three assays takes up only 18.3 MB. +7.7 MB of memory, while the tse with the three assays takes up only 18.3 MB. ```{r} #| label: transform_assay @@ -407,7 +407,7 @@ under `altExp`. `tax_glom()` removes the taxa which have not been assigned to the level given in taxrank by default (NArm = TRUE). So we will add the na.rm = TRUE to `agglomerateByRank()` function which is -equivalent to the default behaviour of `tax_glom()`. +equivalent to the default behavior of `tax_glom()`. ```{r} #| label: agglomerateByRank diff --git a/inst/pages/extra_material/visualization.qmd b/inst/pages/extra_material/visualization.qmd index a14355ae..d58ae5a3 100644 --- a/inst/pages/extra_material/visualization.qmd +++ b/inst/pages/extra_material/visualization.qmd @@ -222,7 +222,7 @@ which is explained in chapter [@sec-extras]. # perform NMDS coordination method tse <- runNMDS(tse, FUN = vegan::vegdist, name = "NMDS") # plot results of a 2-component NMDS on tse, -# coloured-scaled by shannon diversity index +# colored-scaled by shannon diversity index plotReducedDim(tse, "NMDS", colour_by = "shannon") ``` @@ -241,7 +241,7 @@ tse <- addMDS( ncomponents = 3 ) # plot results of a 3-component MDS on tse, -# coloured-scaled by faith diversity index +# colored-scaled by faith diversity index plotReducedDim(tse, "MDS", ncomponents = c(1:3), colour_by = "faith") ``` diff --git a/inst/pages/machine_learning.qmd b/inst/pages/machine_learning.qmd index 668fe6cb..94743c88 100644 --- a/inst/pages/machine_learning.qmd +++ b/inst/pages/machine_learning.qmd @@ -125,7 +125,7 @@ table(tse[["disease"]]) |> Before applying any ML algorithm, the data must be preprocessed. This speeds up the training of the models by reducing the amount of -features analysed, a desirable outcome when working with +features analyzed, a desirable outcome when working with high-dimensional microbiome data. In addition to faster performance, common pre-processing steps have biological justifications. For instance: @@ -133,7 +133,8 @@ For instance: * **Collapse highly correlated features:** In a microbial community, it's common for the abundance of two or more taxonomic features to be highly correlated due to ecological interactions. Thus, removing or collapsing -correlated features allows the model to analyse them as one group. +correlated features allows the model to analyze them as one group. + * **Remove features with near-zero variance:** Features that don't vary enough across groups can hardly help in discerning between them, as they don't hold any biologically relevant information. Additionally, @@ -679,14 +680,14 @@ roc_p + prc_p + plot_layout(guides = "collect") Before describing the plots and their meaning, it is worth noting that the ROC curves of both models resembles the curve presented in -the article where this dataset was first analysed [@qin2012_t2d] +the article where this dataset was first analyzed [@qin2012_t2d] (see Figure 4B). Interestingly, authors used other supervised ML algorithm, and it was trained in a set of 50 microbiome genes (instead of taxonomic features and alpha diversity metrics, as we did). However, it is interesting that concordant AUCs and ROC curves shapes were obtained using different microbiome-derived information. -Regarding our figures, note the dashed grey lines in both plots +Regarding our figures, note the dashed gray lines in both plots representing the expected performance of a model that is classifying samples randomly. Therefore, the greater the distance between that reference and the line representing our model's performance, the better. @@ -774,7 +775,7 @@ obs_vs_pred <- obs_vs_pred + labs(x = "Predicted BMI", y = "Observed BMI") obs_vs_pred + theme_bw() ``` -The dashed grey line in the plot above represents a perfect correlation +The dashed gray line in the plot above represents a perfect correlation between the observed and the model-predicted BMI values of each participant. Thus, the line indicates perfect performance of the model. We can see that while the predictions are around the mean BMI (close diff --git a/inst/pages/mediation.qmd b/inst/pages/mediation.qmd index 26e5fd16..0fcbdf21 100644 --- a/inst/pages/mediation.qmd +++ b/inst/pages/mediation.qmd @@ -29,7 +29,7 @@ $$ The microbiome can mediate the effects of multiple environmental stimuli on human health. However, the importance of its role as a mediator depends on the nature of the stimulus. For example, the effect of dietary fiber intake on host -behaviour is largely mediated by the gut microbiome [@Logan2014nutritional]. In +behavior is largely mediated by the gut microbiome [@Logan2014nutritional]. In contrast, the indirect impact of antibiotic use on mental health through an altered microbiome represents a more subtle process [@Dinan2022antibiotics]. diff --git a/inst/pages/miaverse.qmd b/inst/pages/miaverse.qmd index 9b80b5dd..b1ba9a5f 100644 --- a/inst/pages/miaverse.qmd +++ b/inst/pages/miaverse.qmd @@ -143,7 +143,7 @@ analysis - `r BiocStyle::Githubpkg("himelmallick/IntegratedLearner")` for multiomics classification and prediction - `r BiocStyle::Biocpkg("iSEEtree")` [@Benedetti2025iseetree] for interactive -visualisation of hierarchical data +visualization of hierarchical data - `r BiocStyle::Biocpkg("lefser")` [@Asya2024] for metagenomic biomarker discovery - `r BiocStyle::Biocpkg("LimROTS")` for differential expression analysis for diff --git a/inst/pages/phyloseq_cheatsheet.qmd b/inst/pages/phyloseq_cheatsheet.qmd index 4ddfd4d2..94fef877 100644 --- a/inst/pages/phyloseq_cheatsheet.qmd +++ b/inst/pages/phyloseq_cheatsheet.qmd @@ -194,7 +194,7 @@ OTU tables to your `tse` assays. With `r BiocStyle::Biocpkg("phyloseq")` you would need three different `r BiocStyle::Biocpkg("phyloseq")` objects, each taking up 7.7 MB of memory, -whilst the tse with the three assays takes up only 18.3 MB. +while the tse with the three assays takes up only 18.3 MB. ```{r} #| label: transform_assay @@ -418,7 +418,7 @@ object under `altExp`. `tax_glom()` removes the taxa which have not been assigned to the level given in taxrank by default (NArm = TRUE). So we will add the na.rm = TRUE to -`agglomerateByRank()` function which is equivalent to the default behaviour +`agglomerateByRank()` function which is equivalent to the default behavior of `tax_glom()`. ```{r} diff --git a/inst/pages/subsetting.qmd b/inst/pages/subsetting.qmd index 0040af22..c790b774 100644 --- a/inst/pages/subsetting.qmd +++ b/inst/pages/subsetting.qmd @@ -285,7 +285,7 @@ we opted for a rather conservative threshold that retains most features. We can subset the data based on prevalence using `subsetByPrevalent()`, which filters features that exceed a specified prevalence threshold, -helping to remove rare features that may be artefacts. Conversely, +helping to remove rare features that may be artifacts. Conversely, `subsetByRare()` allows us to retain only features below the threshold, enabling a focus on rare features within the dataset. diff --git a/inst/pages/support.qmd b/inst/pages/support.qmd index 2a4cc537..2f57e39b 100644 --- a/inst/pages/support.qmd +++ b/inst/pages/support.qmd @@ -2,7 +2,7 @@ ## FindingPheno -This project received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 952914 ([FindingPheno](https://findingpheno.eu/)). +This project received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 952914 ([FindingPheno](https://findingpheno.eu/)). ## Online support diff --git a/inst/pages/transformation.qmd b/inst/pages/transformation.qmd index c39fc546..74b9ddff 100644 --- a/inst/pages/transformation.qmd +++ b/inst/pages/transformation.qmd @@ -15,7 +15,7 @@ interpretable values, to enhance the comparability of samples/features or to make data compatible with the assumptions of certain statistical methods. Examples include transforming feature counts into relative abundances -(i.e., "normalising as proportions"), or with compositionality-aware +(i.e., "normalizing as proportions"), or with compositionality-aware transformations such as the centered log-ratio transformation (clr). ## Characteristics of microbiome data to inform data transformations {#sec-stat-challenges} @@ -99,7 +99,7 @@ ranks. This has use, for instance, in non-parametric statistics. allows data with zeroes and avoids the need to add pseudocount [@Keshavan2010; @Martino2019]. -- **relabundance**: Relative transformation, also known as normalising as +- **relabundance**: Relative transformation, also known as normalizing as proportions, total sum scaling (TSS) and compositional transformation. This converts counts into proportions (at the scale [0, 1]) that sum up to 1. Much of the currently available taxonomic abundance data from