ShroomSafe

A machine learning project for classifying mushroom edibility from morphological features. A Random Forest reaches up to 99.87% accuracy on a dataset of 61,069 mushroom samples covering 173 species.

Overview

Purpose: Explore end-to-end ML workflows for binary classification (edible vs. poisonous) using the Secondary Mushroom dataset.

Dataset: 61,069 semi-synthetic mushroom samples with 20 features (17 categorical, 3 numerical) representing morphological characteristics.

Key Challenge: Handle missing data (up to 94.8% in some features) and non-linear feature relationships.
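To make the missing-data challenge concrete, the following is a minimal sketch of how the share of missing values per feature could be measured with pandas. The helper name `missing_fraction` and the toy values are assumptions for illustration and are not taken from the repository; only the column names come from the feature list in this README.

```python
import numpy as np
import pandas as pd


def missing_fraction(df: pd.DataFrame) -> pd.Series:
    """Return the share of missing values per column, highest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "stem-width": [10.0, 12.5, 8.1, 15.3],
    "cap-surface": [np.nan, np.nan, np.nan, "s"],
    "gill-attachment": ["a", np.nan, "x", "a"],
})

print(missing_fraction(toy))
```

Run against the real data, a ranking like this shows at a glance which features are too sparse to keep.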

Model Performance

Model           Accuracy
Random Forest   99.87%
Decision Tree   99.67%
Naive Bayes     77.93%

Most Important Features: stem-width, cap-surface, gill-attachment, stem-height, stem-color
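A ranking like this is typically read from a fitted forest's impurity-based importances. The sketch below uses scikit-learn's `RandomForestClassifier` on synthetic data, where only one column carries signal. The data, feature names, and hyperparameters are illustrative assumptions, not the repository's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data: the label depends only on the first column.
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)
feature_names = ["informative", "noise_a", "noise_b"]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ holds the normalized mean impurity decrease per feature.
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can favor high-cardinality or continuous features, so they are best treated as a rough guide.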

What's Included

Core Scripts

  • Mushroom.py — Domain-specific utilities
  • data.py — Data loading utilities
  • data_processing.py — Preprocessing (feature removal, imputation)
  • visualize_data.py — Visualization scripts
  • feature_frequencies.py — Feature analysis
  • naive_bayes_model.py — Naive Bayes implementation

Notebooks

  • shroom_safe_demo.ipynb — End-to-end demo
  • load_data.ipynb — Dataset loading and inspection
  • substitute_missing_values.ipynb — Missing value strategies
  • simple_decision_tree.ipynb — Decision tree experiments
  • random_forest_model.ipynb — Random forest implementation
  • features_summary.ipynb — Feature analysis
  • extract_categoric_features.ipynb — Categorical processing

Data Files

  • feature_summary.csv — Feature statistics
  • feature_value_distribution.csv — Distribution summaries
  • secondary_data_meta.txt — Dataset metadata

Getting Started

1. Clone Repository

git clone https://github.com/GabrielMSilva04/ShroomSafe.git
cd ShroomSafe

2. Install Dependencies

pip install -r requirements.txt
# Or: pip install pandas numpy scikit-learn matplotlib seaborn jupyter

3. Get Dataset

Download the Secondary Mushroom Dataset from the UCI Machine Learning Repository.

4. Run Analysis

# Start with demo
jupyter notebook shroom_safe_demo.ipynb

# Or run individual models
python naive_bayes_model.py
jupyter notebook random_forest_model.ipynb

Workflow

  1. Load data (load_data.ipynb) — Inspect structure and missing values
  2. Preprocess (data_processing.py) — Remove features with >40% missing data, apply KNN/mode imputation
  3. Analyze features (features_summary.ipynb) — Visualize distributions and relationships
  4. Train models — Compare Naive Bayes, Decision Tree, and Random Forest
  5. Evaluate — Assess accuracy, F2-score, and feature importance
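The preprocessing step above can be sketched as follows, assuming pandas. The function names `drop_sparse_features` and `impute_mode` are hypothetical and may not match what `data_processing.py` actually defines. Only the 40% threshold and the idea of mode imputation come from this README.

```python
import numpy as np
import pandas as pd


def drop_sparse_features(df: pd.DataFrame, max_missing: float = 0.40) -> pd.DataFrame:
    """Keep only columns whose share of missing values is at most max_missing."""
    keep = df.columns[df.isna().mean() <= max_missing]
    return df[keep]


def impute_mode(df: pd.DataFrame) -> pd.DataFrame:
    """Fill remaining gaps with each column's most frequent value."""
    modes = df.mode(dropna=True).iloc[0]
    return df.fillna(modes)


# Toy frame (values are made up): cap-surface is 60% missing and gets dropped.
toy = pd.DataFrame({
    "stem-width": [10.0, 12.5, 8.1, 15.3, 9.9],
    "cap-surface": ["s", np.nan, np.nan, np.nan, "h"],
    "gill-attachment": ["a", "a", np.nan, "x", "a"],
})

clean = impute_mode(drop_sparse_features(toy))
print(clean)
```

Dropping sparse columns first keeps the imputation from inventing most of a feature's values.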

Key Findings

  • KNN imputation (k=11) preserves distributions better than mode imputation
  • Random Forest achieves the highest accuracy (99.87%), which the project attributes to ensemble learning with minimal overfitting
  • Continuous features (stem dimensions) combined with categorical features (surface, color) are most predictive
  • Naive Bayes provides useful probabilistic insights despite lower accuracy
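As an illustration of the KNN imputation finding, here is one possible way to fill a categorical column from its nearest neighbors, using a majority vote among the k = 11 closest rows with a known value. This sketch relies on scikit-learn's `KNeighborsClassifier`. The repository may use a different mechanism, and the helper name `knn_impute_column` and the toy data are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier


def knn_impute_column(df, target, predictors, k=11):
    """Fill missing values of `target` by majority vote of the k nearest rows."""
    known = df[target].notna()
    out = df.copy()
    if known.all():
        return out
    # Never ask for more neighbors than there are rows with a known value.
    model = KNeighborsClassifier(n_neighbors=min(k, int(known.sum())))
    model.fit(df.loc[known, predictors], df.loc[known, target])
    out.loc[~known, target] = model.predict(df.loc[~known, predictors])
    return out


# Toy data: narrow stems are labeled "a", wide stems "x" (made-up values).
rng = np.random.default_rng(0)
width = np.concatenate([rng.normal(5, 0.5, 30), rng.normal(20, 0.5, 30)])
attachment = np.array(["a"] * 30 + ["x"] * 30, dtype=object)
toy = pd.DataFrame({"stem-width": width, "gill-attachment": attachment})
toy.loc[[3, 40], "gill-attachment"] = np.nan

filled = knn_impute_column(toy, "gill-attachment", ["stem-width"])
print(filled.loc[[3, 40]])
```

Unlike mode imputation, the filled value depends on each row's other features, which is why the original distribution tends to be better preserved.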

Technical Details

Preprocessing: 6 features removed (>40% missing), KNN imputation for remaining gaps

Models: Naive Bayes (baseline), Decision Tree (interpretable), Random Forest (robust)
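A comparison of the three model families could look like the sketch below, which uses scikit-learn on synthetic data from `make_classification`. The choice of `GaussianNB`, the hyperparameters, and the 80/20 split are assumptions. The repository has its own Naive Bayes implementation in `naive_bayes_model.py`, which this does not reproduce.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed mushroom features.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.4f}")
```

Scores on this synthetic data will not match the figures reported in this README; the point is only the shape of the comparison loop.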

Evaluation: Accuracy, F2-score, Gini-based feature importance
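The F2-score weights recall more heavily than precision via F_beta = (1 + beta^2) * P * R / (beta^2 * P + R) with beta = 2. The sketch below computes it from confusion-matrix counts. It assumes that "poisonous" is treated as the positive class, so a false negative is a poisonous mushroom labeled edible. The README does not state which class is positive, and the counts are made up.

```python
def fbeta_from_counts(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """Compute the F-beta score from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up counts: precision = 0.75, recall = 0.90.
f2 = fbeta_from_counts(tp=90, fp=30, fn=10, beta=2.0)
f1 = fbeta_from_counts(tp=90, fp=30, fn=10, beta=1.0)
print(f"F2 = {f2:.4f}, F1 = {f1:.4f}")
```

Because recall is higher than precision in this example, F2 comes out above F1, reflecting the extra weight given to catching positives.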

Documentation

Full methodology and results are detailed in ShroomSafe_report_TAA.pdf.

Authors

Sebastião Teixeira, Gabriel Silva, Maria Linhares (DETI, UA)

References

  • Wagner, D., Heider, D., & Hattab, G. (2021). "Mushroom data creation, curation, and simulation to support classification tasks." Scientific Reports, 11.
  • UCI Secondary Mushroom Dataset (2021)
