This repository contains a set of analyses that use supervised and unsupervised machine learning to process biological data.
CAP (Canonical Analysis of Principal Coordinates): a constrained ordination method for species abundance data that can be based on any chosen dissimilarity measure. (Performed with the abundance_df.csv file.)
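As a rough illustration of the idea, the sketch below runs a CAP-style analysis in Python: a principal coordinates analysis (PCoA) on Bray-Curtis dissimilarities, followed by a discriminant analysis on the first few PCoA axes. It is not necessarily how the analysis in this repository is implemented. The layout of abundance_df.csv is not described here, so the example uses synthetic abundances; the `pcoa` helper, the two-group design, and the choice of three axes are all assumptions made for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def pcoa(dist, n_axes):
    """Hypothetical helper: principal coordinates from a square dissimilarity matrix."""
    n = dist.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    # Gower-centred matrix of squared dissimilarities
    gower = -0.5 * centring @ (dist ** 2) @ centring
    eigvals, eigvecs = np.linalg.eigh(gower)
    # Take the n_axes largest eigenvalues, then drop any that are not positive
    order = np.argsort(eigvals)[::-1][:n_axes]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1e-12
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])


# Synthetic stand-in for the abundance table (assumption):
# 20 samples x 6 species, two groups with opposite abundance gradients.
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 10)
means = np.where(group[:, None] == 0,
                 [20, 15, 10, 5, 2, 1],
                 [1, 2, 5, 10, 15, 20])
abundance = rng.poisson(means)

# Step 1: dissimilarity measure of choice (Bray-Curtis here)
dist = squareform(pdist(abundance, metric="braycurtis"))

# Step 2: unconstrained ordination, keeping the first m = 3 axes
coords = pcoa(dist, n_axes=3)

# Step 3: constrained step, a discriminant analysis on the PCoA axes
lda = LinearDiscriminantAnalysis()
canonical_axes = lda.fit_transform(coords, group)

print("canonical axes shape:", canonical_axes.shape)
print("training accuracy:", lda.score(coords, group))
```

The number of PCoA axes retained affects the result; in practice it is usually chosen with a diagnostic such as leave-one-out misclassification error rather than fixed in advance.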
UPGMA (Unweighted Pair Group Method with Arithmetic Mean): one of the most common hierarchical clustering algorithms used in computational biology. It defines the dissimilarity between two clusters as the average of the pairwise dissimilarities between their members (hence its name).
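A minimal sketch of UPGMA in Python, assuming SciPy is available: `scipy.cluster.hierarchy.linkage` with `method="average"` implements average linkage, which is the UPGMA rule. The four toy samples and the Euclidean metric are assumptions chosen so that the averaging is easy to check by hand.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Four toy samples (assumption): two tight pairs that are far apart
samples = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [10.0, 0.0],
    [10.0, 1.0],
])

# Condensed pairwise dissimilarities between samples
condensed = pdist(samples, metric="euclidean")

# Average linkage: cluster-to-cluster dissimilarity is the mean of all
# pairwise dissimilarities between their members
tree = linkage(condensed, method="average")

# Cut the tree into two clusters
labels = fcluster(tree, t=2, criterion="maxclust")

# Height of the final merge, compared with the hand-computed average of
# the four between-pair distances (10, 10, sqrt(101), sqrt(101))
final_height = tree[-1, 2]
expected = (10 + 10 + 2 * np.sqrt(101)) / 4
print(labels, final_height, expected)
```

Each row of `tree` records one merge: the two clusters joined, the dissimilarity at which they were joined, and the size of the new cluster.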
Random Forest: an ML algorithm used for classification and regression tasks. It builds multiple decision trees during training and combines their predictions into a final prediction (ensemble modelling). Each tree is trained on a random bootstrap sample of the training data and considers only a random subset of the features at each split, which helps to reduce overfitting and improve generalization. For classification, each tree "votes" on the outcome and the most popular class is chosen as the final prediction; for regression, the trees' predictions are averaged. This ensemble approach often gives more accurate and robust predictions than an individual decision tree.
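The sketch below shows the ensemble idea with scikit-learn's `RandomForestClassifier`, assuming scikit-learn is installed. The data are synthetic (generated with `make_classification`) and the hyperparameters are illustrative rather than taken from this repository. Note that scikit-learn combines trees by averaging their predicted class probabilities rather than by a strict majority vote; the explicit vote count at the end is there only to illustrate the voting description above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption, not the repository's data)
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 100 trees, each fitted on a bootstrap sample of the training rows and
# considering sqrt(n_features) candidate features at every split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X_train, y_train)
accuracy = forest.score(X_test, y_test)

# Illustrative majority vote: collect each tree's prediction (0 or 1)
# and take the class chosen by more than half of the trees
tree_predictions = np.array([tree.predict(X_test)
                             for tree in forest.estimators_])
vote_share = tree_predictions.mean(axis=0)
majority_vote = (vote_share > 0.5).astype(int)

# Fraction of test samples where the vote matches the forest's own output
agreement = np.mean(majority_vote == forest.predict(X_test))
print("test accuracy:", accuracy)
print("agreement between vote and forest.predict:", agreement)
```

For regression tasks, `RandomForestRegressor` follows the same pattern and averages the trees' numeric predictions.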