
README

This example dataset comes from the MicroMass dataset, published at the UCI Machine Learning Repository.

The data used here are the reference panel of 20 Gram-positive and Gram-negative bacterial species. Samples are grouped by species, and each sample of a species is treated as a replicate.

The Gram types are used as phenotypes.

Figure: Taxonomy of species in the reference panel. Source: Mahé & Veyrieras (2013).

species	replicates	Gram type
Bacillus cereus	26	positive
Bacillus thuringiensis	11	positive
Citrobacter braakii	26	negative
Citrobacter freundii	28	negative
Clostridium difficile	14	positive
Clostridium glycolicum	16	positive
Enterobacter asburiae	29	negative
Enterobacter cloacae	52	negative
Escherichia coli	60	negative
Haemophilus influenzae	50	negative
Haemophilus parainfluenzae	21	negative
Listeria ivanovii	29	positive
Listeria monocytogenes	31	positive
Shigella boydii	18	negative
Shigella flexneri	32	negative
Shigella sonnei	31	negative
Streptococcus mitis	26	positive
Streptococcus oralis	24	positive
Yersinia enterocolitica	27	negative
Yersinia frederiksenii	20	negative

Workflow

This example starts with two datasets: 'feature_data_unfiltered.csv', containing the unfiltered mass spectrometry data, and 'phenotype.csv', containing the Gram type of each sample as described above.

In 'feature_filtering.ipynb', the mass spectrometry dataset is filtered so that a feature is kept only if it is present in at least 11 replicates of at least one species. This threshold is chosen because 11 is the smallest number of replicates of any species in the data (the sample size of Bacillus thuringiensis). There are no blanks in the dataset, so the blank filtering step is omitted. The filtered dataset is exported as 'filtered_features.csv'.
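
The filtering rule described above can be sketched in a few lines of pandas. This is a minimal illustration and not the code from the notebook: the function name `filter_features`, the toy column names (`mz_1`, ...), and the assumption that a feature counts as "present" when its intensity is greater than zero are all hypothetical, and the actual layout of 'feature_data_unfiltered.csv' may differ.

```python
import pandas as pd


def filter_features(features: pd.DataFrame, species: pd.Series,
                    min_replicates: int = 11) -> pd.DataFrame:
    """Keep features detected in at least `min_replicates` samples of at least one species.

    `features` has one row per sample and one column per feature;
    `species` gives the species label of each sample (same index).
    """
    # Assumption: a zero or missing intensity means "feature not detected".
    present = features.fillna(0) > 0
    # Count, per species, in how many replicates each feature is present.
    counts = present.groupby(species).sum()
    # A feature passes if any single species reaches the threshold.
    keep = (counts >= min_replicates).any(axis=0)
    return features.loc[:, keep]


# Toy data: 2 species with 2 replicates each, threshold lowered to 2.
toy = pd.DataFrame(
    {
        "mz_1": [5.0, 3.0, 0.0, 0.0],  # present in both replicates of species A
        "mz_2": [1.0, 0.0, 0.0, 2.0],  # present in only 1 replicate per species
        "mz_3": [0.0, 0.0, 0.0, 0.0],  # never detected
    },
    index=["s1", "s2", "s3", "s4"],
)
toy_species = pd.Series(["A", "A", "B", "B"], index=toy.index)

filtered = filter_features(toy, toy_species, min_replicates=2)
print(list(filtered.columns))  # ['mz_1']
```

With the real data, the default `min_replicates=11` corresponds to the threshold given above, so that a feature consistently present in the smallest species group (Bacillus thuringiensis) can still pass the filter.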

In 'feature_selection.ipynb', the filtered dataset and 'phenotype.csv' are used as input to build a machine learning pipeline that classifies bacteria as Gram-positive or Gram-negative. This yields the best-fitting pipeline, exported as 'best_fitting_pipeline.py', and the option to extract the features most important for the classification.
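
To illustrate the kind of classification pipeline and feature-importance extraction described above, here is a small scikit-learn sketch on synthetic data. It is an assumption-laden example, not the contents of 'best_fitting_pipeline.py': this README does not state which tool or model the notebook uses, so the scaler, the random forest, the synthetic features (`mz_0` to `mz_4`), and all parameter values are chosen purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the filtered features and phenotypes (hypothetical data).
rng = np.random.default_rng(0)
n_samples = 100
y = np.array(["positive", "negative"] * (n_samples // 2))
X = pd.DataFrame(
    rng.normal(size=(n_samples, 5)),
    columns=[f"mz_{i}" for i in range(5)],
)
# Make one feature strongly informative about the Gram type.
X["mz_0"] += np.where(y == "positive", 10.0, 0.0)

# Illustrative pipeline: scaling followed by a random forest classifier.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# 5-fold cross-validated accuracy (stratified by class for classifiers).
scores = cross_val_score(pipeline, X, y, cv=5)

# Fit on all data and rank features by the forest's impurity-based importance.
pipeline.fit(X, y)
importances = pd.Series(
    pipeline.named_steps["clf"].feature_importances_,
    index=X.columns,
).sort_values(ascending=False)
top_feature = importances.index[0]

print(f"mean CV accuracy: {scores.mean():.2f}")
print(importances)
```

On the real data, `X` would be loaded from 'filtered_features.csv' and `y` from 'phenotype.csv'. Ranking the importances is one way to identify which mass spectrometry features drive the Gram-type classification.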

Reference

Mahé, Pierre and Veyrieras, Jean-Baptiste. (2013). MicroMass. UCI Machine Learning Repository. https://doi.org/10.24432/C5T61S.

Link: https://doi.org/10.24432/C5T61S
