🤖 Advanced ML pipeline predicting suicide rates using WHO/World Bank data. Features automated processing, multiple algorithms with cross-validation, and novel interpretability approaches. Demonstrates enterprise-level data science capabilities.

pinheiro-lu/machine-learning-suicide

Socioeconomic Determinants of Suicide Rates - ML Analysis

Python scikit-learn Code style: black

Machine learning analysis to identify the top 5 socioeconomic determinants of suicide rates using World Bank (WDI) and WHO data. Systematic reduction from 1509 → 5 interpretable variables focused on mental health public policy.

📄 Read the full paper | 📊 View results

🎯 Main Results

Decision Tree (best model): R² = 0.82, MSE = 8.56
Lasso Regression: R² = 0.24, MSE = 36.82
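
As a rough illustration of how figures like these can be produced, the sketch below runs 5-fold cross-validation for a decision tree and a Lasso model and reports mean R² and MSE. It uses synthetic data rather than the project's dataset, and the hyperparameters (`max_depth=4`, `alpha=0.1`) are assumptions chosen for the example, not the values used in this repository.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the processed dataset (NOT the project's data):
# 300 rows, 5 features, and a step-shaped target that favours trees.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (
    np.where(X[:, 0] > 0, 10.0, 2.0)
    + 3.0 * (X[:, 1] > 0.5)
    + rng.normal(scale=0.5, size=300)
)

# Hyperparameters are illustrative assumptions.
models = {
    "decision_tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "lasso": Lasso(alpha=0.1),
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

scores = {}
for name, model in models.items():
    result = cross_validate(
        model, X, y, cv=cv, scoring=("r2", "neg_mean_squared_error")
    )
    scores[name] = {
        "r2": result["test_r2"].mean(),
        # scikit-learn reports negated MSE, so flip the sign back.
        "mse": -result["test_neg_mean_squared_error"].mean(),
    }
    print(f"{name}: R2 = {scores[name]['r2']:.2f}, MSE = {scores[name]['mse']:.2f}")
```

On a non-linear target like this one, the tree is expected to score well above the linear model, which mirrors the gap reported above without reproducing the actual numbers.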

Top 5 Identified Determinants

| Variable | Importance | Effect | Interpretation |
|---|---|---|---|
| 🚺 Female labor force participation | 0.296 | ⬆️ | "Double burden" stress |
| 🏙️ Population density | 0.284 | ⬇️ | Access to services/social networks |
| 🏭 Industrial employment | 0.169 | ⬆️ | Adverse working conditions |
| ⚡ Access to electricity | 0.137 | ⬇️ | Development indicator |
| 🏥 Private health spending | 0.115 | ⬇️ | Access to mental health |
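
One way to obtain an importance score and an effect direction per variable is sketched below. The importance comes from scikit-learn's `feature_importances_`; the direction is estimated by moving one feature from its 10th to its 90th percentile and checking how the mean prediction changes (a two-point partial dependence). This is an assumed approach for illustration and may differ from the analysis in this repository; the data and feature names are synthetic and hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with hypothetical feature names (not the project's variables).
rng = np.random.default_rng(42)
names = ["feature_up", "feature_down", "feature_noise"]
X = rng.normal(size=(500, 3))
# The target rises with the first feature and falls with the second.
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.3, size=500)

tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)


def direction(model, X, col):
    """Sign of the change in mean prediction when one column moves
    from its 10th to its 90th percentile, all other columns unchanged."""
    low, high = np.percentile(X[:, col], [10, 90])
    X_low, X_high = X.copy(), X.copy()
    X_low[:, col] = low
    X_high[:, col] = high
    delta = model.predict(X_high).mean() - model.predict(X_low).mean()
    return "up" if delta > 0 else "down" if delta < 0 else "flat"


# Map each feature to (importance, direction).
report = {
    name: (round(float(importance), 3), direction(tree, X, i))
    for i, (name, importance) in enumerate(zip(names, tree.feature_importances_))
}
for name, (importance, effect) in report.items():
    print(f"{name}: importance = {importance}, effect = {effect}")
```

Impurity-based importances are normalized to sum to 1, which is consistent with the Importance column above. The direction estimate is an association in the fitted model, not a causal effect.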

Figures: R² comparison and MSE comparison across models.

πŸ“Š About the Project

Analysis of 185 countries (2000-2021) integrating World Bank and WHO data to identify the socioeconomic factors that most influence suicide rates. Complete pipeline: data collection → variable selection → modeling → interpretation.

Why does this matter? Suicide claims more than 720,000 lives per year (WHO). Identifying actionable determinants supports effective public policy, in line with the UN SDGs.

Methodology Summary

  1. Data: 1509 WDI variables + age-standardized suicide rates (WHO)
  2. Selection: Correlation filters + Maximum Independent Set → 74 variables
  3. Modeling: Decision Trees + Lasso Regression (5-fold CV)
  4. Interpretation: Importance analysis + direction of associations
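
The selection step can be pictured as a graph problem: link every pair of variables whose absolute correlation exceeds a threshold, then keep a set of variables in which no two are linked. The sketch below is one possible reading of that step, not the code in `src/`. The 0.9 threshold, the function names, and the toy variables are all assumptions, and the greedy rule returns a maximal independent set, which is not guaranteed to be the maximum one (the exact problem is NP-hard).

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def select_uncorrelated(columns, threshold=0.9):
    """Greedy independent set on the 'highly correlated' graph.

    columns: dict mapping variable name -> list of values.
    Two variables are linked when |r| > threshold; the returned list
    contains no linked pair.
    """
    neighbours = {name: set() for name in columns}
    for a, b in combinations(columns, 2):
        if abs(pearson(columns[a], columns[b])) > threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    selected, remaining = [], set(columns)
    while remaining:
        # Pick the variable with the fewest remaining correlated neighbours;
        # sorting first makes tie-breaking deterministic.
        pick = min(sorted(remaining), key=lambda v: len(neighbours[v] & remaining))
        selected.append(pick)
        remaining -= neighbours[pick] | {pick}
    return selected


# Toy data with hypothetical names: the gdp_* columns are near-duplicates.
toy = {
    "gdp_usd": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gdp_ppp": [1.1, 2.0, 3.2, 3.9, 5.1],
    "gdp_lcu": [2.0, 4.1, 6.0, 8.2, 9.9],
    "density": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = select_uncorrelated(toy)
print(kept)
```

In the toy example the three near-duplicate columns collapse to a single representative while the unrelated column is kept, which is the behaviour a redundancy filter of this kind is meant to provide.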

Quick Start

# Install requirements
pip install -r requirements.txt

# Run best model (Decision Tree with 5 key variables)
python scripts/modeling/generic_regression_crossval.py --model decision_tree --mode interpretable

# Generate comparison plots
python scripts/plot_comparacao_modelos.py

📁 Structure

├── data/
│   ├── raw/          # Original API data
│   ├── interim/      # Intermediate processing
│   └── processed/    # Final datasets
├── scripts/
│   ├── data_processing/  # Extraction and processing
│   └── modeling/         # ML models
├── results/          # Model outputs
└── src/              # Utilities and feature selection

🔧 Requirements

pip install -r requirements.txt

Main dependencies: Python 3.8+, scikit-learn, pandas, numpy, matplotlib, seaborn, requests

🎯 Contributions

  • Systematic reduction from 1509 → 5 interpretable variables
  • Directionality analysis (decision trees)
  • Reproducible pipeline for epidemiological studies
  • Actionable insights for suicide prevention

⚠️ Limitations

  • Aggregated data (ecological fallacy)
  • Imputation required for all observations
  • Directionality analysis could be improved (SHAP/LIME)

👥 Authors

Luan Pereira Pinheiro and Sofia Leopoldo - University of São Paulo

📄 Full Article: PDF | 📁 Data & Code: Available in this repository


This research contributes to understanding socioeconomic factors in suicide prevention through interpretable ML, aligned with UN SDGs.
