A machine learning project for classifying mushroom edibility based on morphological features. Achieves up to 99.87% accuracy using Random Forest on a dataset of 61,069 mushroom samples from 173 species.
Purpose: Explore end-to-end ML workflows for binary classification (edible vs. poisonous) using the Secondary Mushroom dataset.
Dataset: 61,069 semi-synthetic mushroom samples with 20 features (17 categorical, 3 numerical) representing morphological characteristics.
Key Challenge: Handle missing data (up to 94.8% in some features) and non-linear feature relationships.
| Model | Accuracy |
|---|---|
| Random Forest | 99.87% |
| Decision Tree | 99.67% |
| Naive Bayes | 77.93% |
Most Important Features: `stem-width`, `cap-surface`, `gill-attachment`, `stem-height`, `stem-color`
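As a hedged illustration of how Gini-based feature importances like these can be extracted from a trained Random Forest (the project's actual pipeline lives in `random_forest_model.ipynb`), here is a minimal sketch on synthetic data. The feature names and the toy label rule are assumptions for demonstration, not the real dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 1000
# Synthetic stand-ins for two of the reportedly important features
stem_width = rng.normal(10, 3, n)
stem_height = rng.normal(6, 2, n)
noise = rng.normal(0, 1, n)
X = np.column_stack([stem_width, stem_height, noise])
y = (stem_width + 0.5 * stem_height > 13).astype(int)  # toy edibility label

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Gini importances sum to 1; higher means the feature splits classes better
for name, imp in sorted(
    zip(["stem-width", "stem-height", "noise"], clf.feature_importances_),
    key=lambda t: -t[1],
):
    print(f"{name}: {imp:.3f}")
```

Because the toy label depends mostly on `stem_width`, that feature dominates the importance ranking, mirroring how `stem-width` tops the list above.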
- `Mushroom.py` — Domain-specific utilities
- `data.py` — Data loading utilities
- `data_processing.py` — Preprocessing (feature removal, imputation)
- `visualize_data.py` — Visualization scripts
- `feature_frequencies.py` — Feature analysis
- `naive_bayes_model.py` — Naive Bayes implementation
- `shroom_safe_demo.ipynb` — End-to-end demo
- `load_data.ipynb` — Dataset loading and inspection
- `substitute_missing_values.ipynb` — Missing value strategies
- `simple_decision_tree.ipynb` — Decision tree experiments
- `random_forest_model.ipynb` — Random forest implementation
- `features_summary.ipynb` — Feature analysis
- `extract_categoric_features.ipynb` — Categorical processing
- `feature_summary.csv` — Feature statistics
- `feature_value_distribution.csv` — Distribution summaries
- `secondary_data_meta.txt` — Dataset metadata
```bash
git clone https://github.com/GabrielMSilva04/ShroomSafe.git
cd ShroomSafe
pip install -r requirements.txt
# Or: pip install pandas numpy scikit-learn matplotlib seaborn jupyter
```

Download the Secondary Mushroom Dataset from the UCI ML Repository:
- https://doi.org/10.24432/C5FP5Q
- Place it in the expected location (see `data.py` for the path)
```bash
# Start with the demo
jupyter notebook shroom_safe_demo.ipynb

# Or run individual models
python naive_bayes_model.py
jupyter notebook random_forest_model.ipynb
```

- Load data (`load_data.ipynb`) — Inspect structure and missing values
- Preprocess (`data_processing.py`) — Remove features with >40% missing data, apply KNN/mode imputation
- Analyze features (`features_summary.ipynb`) — Visualize distributions and relationships
- Train models — Compare Naive Bayes, Decision Tree, and Random Forest
- Evaluate — Assess accuracy, F2-score, and feature importance
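The evaluation step can be sketched as follows. The F2-score weights recall over precision, which suits this task: failing to flag a poisonous mushroom (a false negative) is the costly error. The labels and predictions below are toy values, not project results:

```python
from sklearn.metrics import accuracy_score, fbeta_score

# Toy labels: 1 = poisonous, 0 = edible (hypothetical predictions)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
# beta=2 counts recall roughly four times as heavily as precision
f2 = fbeta_score(y_true, y_pred, beta=2)
print(f"accuracy={acc:.3f}, F2={f2:.3f}")  # → accuracy=0.750, F2=0.750
```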
- KNN imputation (k=11) preserves distributions better than mode imputation
- Random Forest excels through ensemble learning with minimal overfitting
- Continuous features (stem dimensions) combined with categorical features (surface, color) are most predictive
- Naive Bayes provides useful probabilistic insights despite lower accuracy
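The imputation comparison in the first finding can be sketched roughly as below, assuming the categorical column is ordinal-encoded first (column names and data are illustrative, not the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "stem-width": rng.normal(10, 3, 200),
    "cap-surface": rng.integers(0, 4, 200).astype(float),  # ordinal-encoded
})
# Knock out 20% of the cap-surface values
mask = rng.random(200) < 0.2
df.loc[mask, "cap-surface"] = np.nan

# Mode imputation: fill every gap with the single most frequent value
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(df)
# KNN imputation (k=11): fill each gap from the 11 most similar rows
knn_filled = KNNImputer(n_neighbors=11).fit_transform(df)
# Note: KNN-imputed codes may be fractional; round before decoding
# them back to categories.
print("remaining NaNs:", np.isnan(knn_filled).sum())
```

Mode imputation collapses all gaps onto one value, while KNN fills each gap from similar rows, which is why it tends to preserve the original distribution better.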
Preprocessing: 6 features removed (>40% missing), KNN imputation for remaining gaps
Models: Naive Bayes (baseline), Decision Tree (interpretable), Random Forest (robust)
Evaluation: Accuracy, F2-score, Gini-based feature importance
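The feature-removal step above can be sketched as a simple missing-fraction threshold; the toy frame and column names below are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "stem-width": [1.0, 2.0, 3.0, 4.0, 5.0],
    "veil-type": [np.nan, np.nan, np.nan, "u", np.nan],  # 80% missing
    "cap-color": ["n", "y", np.nan, "w", "n"],           # 20% missing
})

# Keep only columns whose missing-value fraction is at most 40%
threshold = 0.4
keep = df.columns[df.isna().mean() <= threshold]
df_clean = df[keep]
print(list(df_clean.columns))  # → ['stem-width', 'cap-color']
```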
Full methodology and results are detailed in `ShroomSafe_report_TAA.pdf`.
Sebastião Teixeira, Gabriel Silva, Maria Linhares (DETI, UA)
- Wagner, D., Heider, D., & Hattab, G. (2021). "Mushroom data creation, curation, and simulation to support classification tasks." Scientific Reports, 11.
- UCI Secondary Mushroom Dataset (2021)