AquaRisk is a data-driven framework for analyzing groundwater fluoride contamination across India using machine-learning classifiers, regression models, and a Mamdani Fuzzy Inference System (FIS). It supports early detection of fluoride-vulnerable regions and helps government agencies and water-resource managers make informed decisions.
- Large-Scale Analysis: Evaluates 16,776 groundwater samples across Indian states and districts.
- Predictive Modeling: Uses regression models to estimate fluoride concentration (mg/L).
- Automated Classification: Categorizes water quality into Safe, Moderate, and High-risk zones using optimized classifiers.
- Interpretability: Employs Mamdani Fuzzy Logic to convert technical data into human-readable risk scores.
- Spatial Visualization: Generates state-level heatmaps and regional analysis for spatial awareness.
The dataset comprises physicochemical parameters that significantly influence fluoride mobility within aquifers.
| Feature Category | Parameters Included |
|---|---|
| Physicochemical | pH, EC, TDS, Na⁺, Ca²⁺, Mg²⁺, K⁺, Cl⁻, SO₄²⁻, NO₃⁻, HCO₃⁻ |
| Target Variable | Fluoride concentration (mg/L) |
| Geospatial | State and District identifiers |
- Standardization: Normalization of column nomenclature (e.g., standardizing “EC µS/cm” to “EC”).
- Imputation: Conversion of invalid entries to null values followed by Median Imputation to maintain numeric stability.
- Risk Labeling: Samples are labeled against the WHO drinking-water guideline value for fluoride (1.5 mg/L), with a second cut-off at 2.5 mg/L:
  - Class 0 (< 1.5 mg/L): Safe
  - Class 1 (1.5–2.5 mg/L): Moderate Risk
  - Class 2 (> 2.5 mg/L): High Risk
- Scaling: Application of Min-Max scaling to a standard 0–1 range.
- Class Balancing: SMOTE (Synthetic Minority Over-sampling Technique) oversamples the minority classes so that all three risk classes have equal sample counts.
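
The cleaning, labeling, and scaling steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual pipeline: the sample rows, the `RENAME` mapping, and all function names are assumptions made up for the example. SMOTE is left out; an implementation is available in the `imbalanced-learn` package.

```python
import statistics

# Hypothetical raw rows; column names and values are invented for illustration.
RAW_ROWS = [
    {"pH": "7.2", "EC µS/cm": "850", "F (mg/L)": "0.6"},
    {"pH": "8.1", "EC µS/cm": "ND", "F (mg/L)": "1.9"},
    {"pH": "-", "EC µS/cm": "2100", "F (mg/L)": "3.4"},
]

# Assumed mapping from raw headers to standardized column names.
RENAME = {"EC µS/cm": "EC", "F (mg/L)": "Fluoride"}


def to_float(value):
    """Return value as a float, or None when it is not a valid number."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def clean(rows):
    """Standardize column names and convert every entry to float or None."""
    return [{RENAME.get(k, k): to_float(v) for k, v in row.items()} for row in rows]


def impute_median(rows, column):
    """Replace None in one column with the median of its valid values."""
    median = statistics.median(r[column] for r in rows if r[column] is not None)
    for r in rows:
        if r[column] is None:
            r[column] = median


def risk_class(fluoride_mg_l):
    """Map a fluoride concentration (mg/L) to the three classes listed above."""
    if fluoride_mg_l < 1.5:
        return 0  # Safe
    if fluoride_mg_l <= 2.5:
        return 1  # Moderate Risk
    return 2  # High Risk


def min_max_scale(rows, column):
    """Rescale one column to the 0-1 range in place."""
    values = [r[column] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    for r in rows:
        r[column] = (r[column] - lo) / span


rows = clean(RAW_ROWS)
for col in ("pH", "EC"):
    impute_median(rows, col)
labels = [risk_class(r["Fluoride"]) for r in rows]
for col in ("pH", "EC"):
    min_max_scale(rows, col)
print(labels)
```

Labels are computed from the unscaled fluoride values, so the mg/L thresholds keep their meaning; only the feature columns are rescaled.
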
Seven distinct models were evaluated to determine the most effective classifier for fluoride risk; five of them are summarized below.
| Model | Classification Type | Accuracy / Notes |
|---|---|---|
| Random Forest | Ensemble Learning | 93% (Top Performer) |
| XGBoost | Gradient Boosting | High Accuracy |
| LightGBM | Boosting | Efficiency at Scale |
| ANN | Neural Network | Pattern Recognition |
| SVM (RBF) | Kernel-based | Nonlinear Mapping |

Regression models for estimating fluoride concentration were compared on R² and RMSE.

| Model | R² Score | RMSE |
|---|---|---|
| Random Forest Regressor | 0.273 | 0.684 |
| Linear Regression | 0.218 | 0.709 |
| SVR | 0.174 | 0.729 |
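
For readers less familiar with the two metrics in the table, the sketch below computes them from their standard definitions. The sample values are invented for illustration and are unrelated to the project's data. As a reading aid, an R² of 0.273 means the model explains roughly 27% of the variance in the target.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical fluoride values (mg/L) and model predictions.
y_true = [0.5, 1.0, 2.0, 3.5]
y_pred = [0.7, 1.1, 1.6, 2.8]

print(f"RMSE = {rmse(y_true, y_pred):.3f}")
print(f"R2   = {r2_score(y_true, y_pred):.3f}")
```
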
The system utilizes a Mamdani-type FIS to handle environmental uncertainty and provide interpretable results.
- Input Memberships: Very Low, Low, Normal, High, Very High.
- Output Risk Scores: Low Risk (< 33), Medium Risk (33–66), High Risk (>= 66).
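
A Mamdani system of this shape can be sketched in plain Python: fuzzify the inputs, fire rules with `min`, clip and aggregate the output sets with `max`, then defuzzify by centroid. Everything specific here is an assumption for illustration: the choice of pH and calcium as inputs, the triangular membership shapes, and the rule base are hypothetical and not taken from the project. Only the five input terms and the 33/66 score bands come from the description above.

```python
def tri(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x == b:
        return 1.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)


# Five input terms over a 0-1 (Min-Max scaled) input; shapes are assumed.
INPUT_TERMS = {
    "very_low": (-0.25, 0.0, 0.25),
    "low": (0.0, 0.25, 0.5),
    "normal": (0.25, 0.5, 0.75),
    "high": (0.5, 0.75, 1.0),
    "very_high": (0.75, 1.0, 1.25),
}
TERM_ORDER = list(INPUT_TERMS)

# Three output terms over a 0-100 risk score; shapes are assumed.
OUTPUT_TERMS = {
    "low": (-50.0, 0.0, 50.0),
    "medium": (0.0, 50.0, 100.0),
    "high": (50.0, 100.0, 150.0),
}


def consequent(ph_idx, ca_idx):
    """Hypothetical rule base: higher pH and lower calcium raise the risk."""
    score = ph_idx + (4 - ca_idx)  # ranges from 0 to 8
    if score <= 2:
        return "low"
    return "medium" if score <= 5 else "high"


def fuzzify(x):
    """Membership degree of a scaled input in each of the five terms."""
    x = min(max(x, 0.0), 1.0)  # clamp to the scaled range
    return {name: tri(x, *abc) for name, abc in INPUT_TERMS.items()}


def risk_score(ph, ca, steps=1000):
    """Mamdani inference: min for AND, max aggregation, centroid output."""
    mu_ph, mu_ca = fuzzify(ph), fuzzify(ca)
    strength = {name: 0.0 for name in OUTPUT_TERMS}
    for i, p in enumerate(TERM_ORDER):
        for j, c in enumerate(TERM_ORDER):
            out = consequent(i, j)
            strength[out] = max(strength[out], min(mu_ph[p], mu_ca[c]))
    num = den = 0.0
    for k in range(steps + 1):
        z = 100.0 * k / steps
        # Clip each output set at its rule strength, then take the maximum.
        mu = max(min(strength[n], tri(z, *abc)) for n, abc in OUTPUT_TERMS.items())
        num += z * mu
        den += mu
    return num / den


def risk_label(score):
    """Apply the score bands listed above."""
    if score < 33:
        return "Low Risk"
    return "Medium Risk" if score < 66 else "High Risk"


print(risk_label(risk_score(ph=1.0, ca=0.0)))
```

Because neighboring input terms overlap, every scaled input activates at least one term, so at least one rule always fires and the centroid is always defined.
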
Current Constraints:
- Absence of seasonal temporal data.
- Limited to fluoride without accounting for heavy metal or nitrate interactions.
- Exclusion of complex spatial hydrogeological layers.
Future Directions:
- Implementation of GIS-based real-time heatmaps.
- Integration of Deep Learning for enhanced predictive precision.
- Incorporation of SHAP/LIME for model explainability and transparency.

```bash
# Clone the repository
git clone https://github.com/codemuggle09/AquaRisk

# Navigate into the project folder
cd AquaRisk

# Install dependencies
pip install -r requirements.txt

# Launch the dashboard
python -m streamlit run webapp.py
```