A comprehensive Python toolkit for evaluating machine learning model fairness across different demographic groups.
- Interactive HTML Reports: Generate comprehensive, self-contained HTML reports with embedded visualizations
- Multiple Fairness Metrics: Evaluate demographic parity, equalized odds, calibration parity, and disparate impact
- Metric Selection: See the section below on choosing an appropriate fairness metric, with healthcare examples.
```bash
pip install -e .
```

```python
import numpy as np
from fairness_toolkit.reporting.interactive_fairness_report import generate_interactive_fairness_report

# Your model predictions
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.8, 0.3, 0.2, 0.9, 0.6, 0.7, 0.4])

# Sensitive features
sensitive_features = {
    'race': np.array(['White', 'Black', 'White', 'Black', 'White', 'Black', 'White', 'Black']),
    'sex': np.array(['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'])
}

# Generate interactive report
report = generate_interactive_fairness_report(
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features_dict=sensitive_features,
    y_prob=y_prob,
    output_path="fairness_report.html"
)

print(f"Report saved to: {report['output_path']}")
```
```python
# Define which groups to use as baseline for comparisons
privileged_groups = {
    'race': 'White',
    'sex': 'Male',
    'age': 'Middle'
}

report = generate_interactive_fairness_report(
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features_dict=sensitive_features,
    y_prob=y_prob,
    privileged_groups=privileged_groups,
    output_path="custom_fairness_report.html"
)
```
```python
from fairness_toolkit.metrics import calculate_all_metrics

# Calculate metrics for a single feature
race_groups = sensitive_features['race']  # reuse the race array from the quick-start example
metrics = calculate_all_metrics(y_true, y_pred, race_groups, y_prob)

print(f"Demographic parity: {metrics.demographic_parity}")
print(f"Equalized odds: {metrics.equalized_odds}")
print(f"Disparate impact: {metrics.disparate_impact}")
```
```python
from fairness_toolkit.visualization.fairness_plots import (
    plot_group_distributions,
    create_fairness_dashboard,
    plot_heatmap
)

# Create individual plots (returns base64-encoded images)
distributions_plot = plot_group_distributions(
    y_true, y_pred, race_groups, y_prob, feature_name='Race'
)
dashboard_plot = create_fairness_dashboard(
    y_true, y_pred, race_groups, y_prob, feature_name='Race'
)
```

The interactive HTML reports include:
- Overall performance metrics (accuracy, precision, recall, F1-score, and AUC), prominently displayed for immediate assessment
- Dropdown selector for different sensitive features
- Group-by-group performance metrics
- Visual comparisons across demographic groups
- Demographic Parity: Equal positive prediction rates across groups
- Equalized Odds: Equal true positive and false positive rates
- Calibration Parity: Equal predictive values across groups
- Disparate Impact: 80% rule compliance testing
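
To make these definitions concrete, here is a minimal NumPy sketch (independent of the toolkit's internals) that computes the per-group rates behind each metric for the quick-start arrays; the variable names are illustrative only, not part of the toolkit's API.

```python
import numpy as np

# Same toy arrays as the quick-start example above
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 0])
race = np.array(['White', 'Black', 'White', 'Black', 'White', 'Black', 'White', 'Black'])

# Per-group rates underlying each metric
per_group = {}
for g in np.unique(race):
    m = race == g
    per_group[g] = {
        'selection_rate': y_pred[m].mean(),       # P(prediction = 1 | group)
        'tpr': y_pred[m][y_true[m] == 1].mean(),  # true positive rate
        'fpr': y_pred[m][y_true[m] == 0].mean(),  # false positive rate
        'ppv': y_true[m][y_pred[m] == 1].mean(),  # positive predictive value
    }

def gap(key):
    vals = [r[key] for r in per_group.values()]
    return max(vals) - min(vals)

demographic_parity_gap = gap('selection_rate')    # 0 means equal positive prediction rates
equalized_odds_gap = max(gap('tpr'), gap('fpr'))  # 0 means equal TPR and FPR
calibration_gap = gap('ppv')                      # 0 means equal predictive values

sel = [r['selection_rate'] for r in per_group.values()]
disparate_impact = min(sel) / max(sel)            # 80% rule: flag if below 0.8
```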
Selecting an appropriate fairness metric depends on your specific healthcare application and ethical priorities. Here's a decision flow to guide your choice:
Start here: What is your model predicting?
- Resource allocation (e.g., ICU bed assignment, organ transplant priority) → Consider Demographic Parity to ensure equal access rates across groups. This prevents systematic exclusion of certain populations from critical resources.
- Risk prediction requiring accuracy (e.g., disease diagnosis, mortality risk) → Consider Equalized Odds to maintain both equal true positive rates (sensitivity) and false positive rates across groups. This ensures no group systematically misses diagnoses or receives unnecessary interventions.
- Probability-based decisions (e.g., readmission risk scores, treatment response likelihood) → Consider Calibration Parity to ensure predicted probabilities mean the same thing across groups. A 70% risk should represent the same actual risk regardless of patient demographics.
- Legal compliance or screening (e.g., insurance eligibility, clinical trial enrollment) → Consider Disparate Impact (80% rule) as it aligns with legal standards and ensures selection rates don't disproportionately exclude protected groups.
Additional considerations:
- If false negatives are catastrophic (e.g., a missed cancer diagnosis) → Consider prioritizing Equalized Odds with a focus on TPR equality
- If false positives are harmful (e.g., unnecessary surgery recommendations) → Consider prioritizing Equalized Odds with a focus on FPR equality
- If maintaining trust across communities is critical → Consider Demographic Parity to show equal treatment at face value
- If decisions involve thresholds or cutoffs → Consider Calibration Parity to ensure fair threshold application
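
As a rough illustration, the decision flow above can be encoded as a small lookup. The helper below is a hypothetical sketch: its name, application categories, and return strings are chosen for this example and are not part of the fairness_toolkit API.

```python
def suggest_primary_metric(application: str) -> str:
    """Map a healthcare application type to a suggested primary fairness metric.

    Hypothetical helper for illustration only; the categories mirror the
    decision flow above.
    """
    suggestions = {
        'resource_allocation': 'demographic_parity',  # equal access rates
        'risk_prediction': 'equalized_odds',          # equal TPR and FPR
        'probability_based': 'calibration_parity',    # comparable risk scores
        'legal_compliance': 'disparate_impact',       # 80% rule
    }
    return suggestions.get(application, 'equalized_odds')  # conservative default

# Example: a mortality-risk model is a risk prediction task
print(suggest_primary_metric('risk_prediction'))  # -> 'equalized_odds'
```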
Each report also includes a recommendations section:
- Actionable improvement suggestions
- Metric-specific recommendations
- Technical and policy-based solutions
```
fairness_toolkit/
├── metrics/
│   ├── __init__.py
│   └── fairness_metrics.py              # Core fairness calculations
├── reporting/
│   ├── __init__.py
│   ├── fairness_report.py               # Basic reporting functions
│   ├── interactive_fairness_report.py   # Interactive HTML reports
│   └── interactive_template.html        # HTML template
├── visualization/
│   ├── __init__.py
│   └── fairness_plots.py                # All visualization functions
├── utils/
│   └── __init__.py
├── README.md
├── setup.py
└── requirements.txt
```
- Python 3.7+
- NumPy
- Pandas
- Scikit-learn
- Matplotlib
- Seaborn
- Jinja2
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
If you use this toolkit in your research, please cite:
```bibtex
@software{fairness_toolkit,
  title={Fairness Toolkit: Interactive ML Fairness Analysis},
  author={Jonathan Tsai},
  year={2025},
  url={https://github.com/cctsairush/fairness_toolkit}
}
```