From 794a07d00c1b6ca0ddfce37abd41e00ce9153051 Mon Sep 17 00:00:00 2001
From: Miro Dudik
Date: Fri, 28 Feb 2020 09:47:04 -0500
Subject: [PATCH 1/9] add metrics API proposal

Signed-off-by: Miro Dudik
---
 api/METRICS.md | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 91 insertions(+)
 create mode 100644 api/METRICS.md

diff --git a/api/METRICS.md b/api/METRICS.md
new file mode 100644
index 0000000..f5c77b0
--- /dev/null
+++ b/api/METRICS.md
@@ -0,0 +1,91 @@
+# API proposal for metrics
+
+## Example
+
+```python
+# For most sklearn metrics, we would have their group version that returns a Bunch with fields
+# * overall: overall metric value
+# * by_group: a dictionary that maps sensitive feature values to metric values
+
+summary = accuracy_score_by_group(y_true, y_pred, sensitive_features=sf, **other_kwargs)
+
+# Exporting into pd.Series or pd.DataFrame is not too complicated
+
+series = pd.Series({**summary.by_group, 'overall': summary.overall})
+df = pd.DataFrame({"model accuracy": {**summary.by_group, 'overall': summary.overall}})
+
+# Several types of scalar metrics for group fairness can be obtained from `summary` via transformation functions
+
+acc_difference = difference_from_summary(summary)
+acc_ratio = ratio_from_summary(summary)
+acc_group_min = group_min_from_summary(summary)
+
+# Most common disparity metrics should be predefined
+
+demo_parity_difference = demographic_parity_difference(y_true, y_pred, sensitive_features=sf, **other_kwargs)
+demo_parity_ratio = demographic_parity_ratio(y_true, y_pred, sensitive_features=sf, **other_kwargs)
+eq_odds_difference = equalized_odds_difference(y_true, y_pred, sensitive_features=sf, **other_kwargs)
+
+# For predefined disparities based on sklearn metrics, we adopt a consistent naming convention
+
+acc_difference = accuracy_score_difference(y_true, y_pred, sensitive_features=sf, **other_kwargs)
+acc_ratio = accuracy_score_ratio(y_true, y_pred, sensitive_features=sf, **other_kwargs)
+acc_group_min = accuracy_score_group_min(y_true, y_pred, sensitive_features=sf, **other_kwargs)
+```
+
+## Functions
+
+*Function signatures*
+
+```python
+metric_by_group(metric, y_true, y_pred, *, sensitive_features, **other_kwargs)
+# return the summary for the provided metrics
+
+make_metric_by_group(metric)
+# return a callable object <metric>_by_group:
+# <metric>_by_group(...) = metric_by_group(<metric>, ...)
+
+# Transformation functions returning scalars
+difference_from_summary(summary)
+ratio_from_summary(summary)
+group_min_from_summary(summary)
+group_max_from_summary(summary)
+
+# Metric-specific functions returing summary and scalars
+<metric>_by_group(y_true, y_pred, *, sensitive_features, **other_kwargs)
+<metric>_difference(y_true, y_pred, *, sensitive_features, **other_kwargs)
+<metric>_ratio(y_true, y_pred, *, sensitive_features, **other_kwargs)
+<metric>_group_min(y_true, y_pred, *, sensitive_features, **other_kwargs)
+<metric>_group_max(y_true, y_pred, *, sensitive_features, **other_kwargs)
+```
+
+*Summary of transformations*
+
+|transformation function|output|metric-specific function|code|aif360|
+|-----------------------|------|------------------------|----|------|
+|`difference_from_summary`|max - min|`<metric>_difference`|D|unprivileged - privileged|
+|`ratio_from_summary`|min / max|`<metric>_ratio`|R| unprivileged / privileged|
+|`group_min_from_summary`|min|`<metric>_group_min`|Min| N/A |
+|`group_max_from_summary`|max|`<metric>_group_max`|Max| N/A |
+
+*Supported metric-specific functions*
+
+|metric|variants|task|notes|aif360|
+|------|--------|-----|----|------|
+|`selection_rate`| G,D,R,Min | class | | ✓ |
+|`demographic_parity`| D,R | class | `selection_rate_difference`, `selection_rate_ratio` | `statistical_parity_difference`, `disparate_impact`|
+|`accuracy_score`| G,D,R,Min | class | sklearn | `accuracy` |
+|`balanced_accuracy_score` | G | class | sklearn | - |
+|`mean_absolute_error` | G,D,R,Max | class,reg | sklearn | class only: `error_rate`
+|`false_positive_rate` | G,D,R | class | | ✓ |
+|`false_negative_rate` | G | 
class | | ✓ |
+|`true_positive_rate` | G,D,R | class | | ✓ |
+|`true_negative_rate` | G | class | | ✓ |
+|`equalized_odds` | D,R | class | max of difference or ratio under `true_positive_rate`, `false_positive_rate` | - |
+|`precision_score`| G | class | sklearn | ✓ |
+|`recall_score`| G | class | sklearn | ✓ |
+|`f1_score`| G | class | sklearn | - |
+|`roc_auc_score`| G | prob | sklearn | - |
+|`log_loss`| G | prob | sklearn | - |
+|`mean_squared_error`| G | prob,reg | sklearn | - |
+|`r2_score`| G | reg | sklearn | - |

From 3b93629f7088fe0f2f7be254411a908a1c8b72ad Mon Sep 17 00:00:00 2001
From: Miro Dudik
Date: Tue, 3 Mar 2020 18:14:58 -0500
Subject: [PATCH 2/9] add clarifications and confusion_matrix

Signed-off-by: Miro Dudik
---
 api/METRICS.md | 35 ++++++++++++++++++++++++++++-------
 1 file changed, 28 insertions(+), 7 deletions(-)

diff --git a/api/METRICS.md b/api/METRICS.md
index f5c77b0..391a660 100644
--- a/api/METRICS.md
+++ b/api/METRICS.md
@@ -33,13 +33,14 @@ acc_ratio = accuracy_score_ratio(y_true, y_pred, sensitive_features=sf, **other_
 acc_group_min = accuracy_score_group_min(y_true, y_pred, sensitive_features=sf, **other_kwargs)
 ```
 
-## Functions
+## Proposal
 
 *Function signatures*
 
 ```python
 metric_by_group(metric, y_true, y_pred, *, sensitive_features, **other_kwargs)
-# return the summary for the provided metrics
+# return the summary for the provided `metric`, where `metric` has the signature
+# metric(y_true, y_pred, **other_kwargs)
 
 make_metric_by_group(metric)
 # return a callable object <metric>_by_group:
 # <metric>_by_group(...) = metric_by_group(<metric>, ...)
@@ -51,7 +52,7 @@ ratio_from_summary(summary)
 group_min_from_summary(summary)
 group_max_from_summary(summary)
 
-# Metric-specific functions returing summary and scalars
+# Metric-specific functions returning summary and scalars
 <metric>_by_group(y_true, y_pred, *, sensitive_features, **other_kwargs)
 <metric>_difference(y_true, y_pred, *, sensitive_features, **other_kwargs)
 <metric>_ratio(y_true, y_pred, *, sensitive_features, **other_kwargs)
@@ -59,7 +60,7 @@
group_max_from_summary(summary)
 <metric>_group_max(y_true, y_pred, *, sensitive_features, **other_kwargs)
 ```
 
-*Summary of transformations*
+*Summary of transformations and transformation codes*
 
 |transformation function|output|metric-specific function|code|aif360|
 |-----------------------|------|------------------------|----|------|
@@ -68,7 +69,18 @@ group_max_from_summary(summary)
 |`group_min_from_summary`|min|`<metric>_group_min`|Min| N/A |
 |`group_max_from_summary`|max|`<metric>_group_max`|Max| N/A |
 
-*Supported metric-specific functions*
+*Summary of tasks and task codes*
+
+|task|definition|code|
+|----|----------|----|
+|binary classification|labels and predictions are in {0,1}|class|
+|probabilistic binary classification|labels are in {0,1}, predictions are in [0,1] and correspond to estimates of P(y\|x)|prob|
+|randomized binary classification|labels are in {0,1}, predictions are in [0,1] and represent the probability of drawing y=1 in a randomized decision|class-rand|
+|regression|labels and predictions are real-valued|reg|
+
+*Predefined metric-specific functions*
+
+* variants: D, R, Min, Max refer to the transformations from the table above; G refers to `<metric>_by_group`.
|metric|variants|task|notes|aif360|
 |------|--------|-----|----|------|
 |`selection_rate`| G,D,R,Min | class | | ✓ |
 |`demographic_parity`| D,R | class | `selection_rate_difference`, `selection_rate_ratio` | `statistical_parity_difference`, `disparate_impact`|
 |`accuracy_score`| G,D,R,Min | class | sklearn | `accuracy` |
 |`balanced_accuracy_score` | G | class | sklearn | - |
-|`mean_absolute_error` | G,D,R,Max | class,reg | sklearn | class only: `error_rate`
+|`mean_absolute_error` | G,D,R,Max | class, reg | sklearn | class only: `error_rate` |
+|`confusion_matrix` | G | class | sklearn | `binary_confusion_matrix` |
 |`false_positive_rate` | G,D,R | class | | ✓ |
 |`false_negative_rate` | G | class | | ✓ |
 |`true_positive_rate` | G,D,R | class | | ✓ |
 |`true_negative_rate` | G | class | | ✓ |
@@ -87,5 +100,13 @@ group_max_from_summary(summary)
 |`f1_score`| G | class | sklearn | - |
 |`roc_auc_score`| G | prob | sklearn | - |
 |`log_loss`| G | prob | sklearn | - |
-|`mean_squared_error`| G | prob,reg | sklearn | - |
+|`mean_squared_error`| G | prob, reg | sklearn | - |
 |`r2_score`| G | reg | sklearn | - |
+
+## Dashboard questions
+
+1. Should we enable regression metrics for probabilistic classification?
+   * `mean_absolute_error`, `mean_squared_error`, `mean_squared_error(...,squared=False)`
+1. Should we introduce balanced error metrics for probabilistic classification?
+   * `balanced_mean_{squared,absolute}_error`, `balanced_log_loss`
+1. Do we keep `mean_prediction` and `mean_{over,under}prediction`?

From 9359f135a372c90d093f36bc8a7c76ca144ddfe8 Mon Sep 17 00:00:00 2001
From: Miro Dudik
Date: Tue, 3 Mar 2020 18:21:43 -0500
Subject: [PATCH 3/9] fix list markdown

Signed-off-by: Miro Dudik
---
 api/METRICS.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/api/METRICS.md b/api/METRICS.md
index 391a660..ec11657 100644
--- a/api/METRICS.md
+++ b/api/METRICS.md
@@ -106,7 +106,7 @@ group_max_from_summary(summary)
 ## Dashboard questions
 
 1.
Should we enable regression metrics for probabilistic classification?
-   * `mean_absolute_error`, `mean_squared_error`, `mean_squared_error(...,squared=False)`
+    * `mean_absolute_error`, `mean_squared_error`, `mean_squared_error(...,squared=False)`
 1. Should we introduce balanced error metrics for probabilistic classification?
-   * `balanced_mean_{squared,absolute}_error`, `balanced_log_loss`
+    * `balanced_mean_{squared,absolute}_error`, `balanced_log_loss`
 1. Do we keep `mean_prediction` and `mean_{over,under}prediction`?

From ddde2ff751d2a9aee17ba42d0090a9e4c182283e Mon Sep 17 00:00:00 2001
From: Miro Dudik
Date: Thu, 12 Mar 2020 11:48:43 -0400
Subject: [PATCH 4/9] rename _by_group to _group_summary for consistency

Signed-off-by: Miro Dudik
---
 api/METRICS.md | 28 +++++++++++++++------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/api/METRICS.md b/api/METRICS.md
index ec11657..e497e1d 100644
--- a/api/METRICS.md
+++ b/api/METRICS.md
@@ -3,18 +3,20 @@
 ## Example
 
 ```python
-# For most sklearn metrics, we would have their group version that returns a Bunch with fields
+# For most sklearn metrics, we will have their group version that returns
+# the summary of its performance across groups as well as the overall
+# performance, represented as a Bunch object with fields
 # * overall: overall metric value
 # * by_group: a dictionary that maps sensitive feature values to metric values
 
-summary = accuracy_score_by_group(y_true, y_pred, sensitive_features=sf, **other_kwargs)
+summary = accuracy_score_group_summary(y_true, y_pred, sensitive_features=sf, **other_kwargs)
 
 # Exporting into pd.Series or pd.DataFrame is not too complicated
 
 series = pd.Series({**summary.by_group, 'overall': summary.overall})
 df = pd.DataFrame({"model accuracy": {**summary.by_group, 'overall': summary.overall}})
 
-# Several types of scalar metrics for group fairness can be obtained from `summary` via transformation functions
+# Several types of scalar metrics for group fairness 
can be obtained from the group summary via transformation functions
 
 acc_difference = difference_from_summary(summary)
 acc_ratio = ratio_from_summary(summary)
@@ -38,13 +40,13 @@ acc_group_min = accuracy_score_group_min(y_true, y_pred, sensitive_features=sf,
 *Function signatures*
 
 ```python
-metric_by_group(metric, y_true, y_pred, *, sensitive_features, **other_kwargs)
-# return the summary for the provided `metric`, where `metric` has the signature
+group_summary(metric, y_true, y_pred, *, sensitive_features, **other_kwargs)
+# return the group summary for the provided `metric`, where `metric` has the signature
 # metric(y_true, y_pred, **other_kwargs)
 
-make_metric_by_group(metric)
-# return a callable object <metric>_by_group:
-# <metric>_by_group(...) = metric_by_group(<metric>, ...)
+make_metric_group_summary(metric)
+# return a callable object <metric>_group_summary:
+# <metric>_group_summary(...) = group_summary(<metric>, ...)
 
 # Transformation functions returning scalars
 difference_from_summary(summary)
@@ -52,15 +54,15 @@ ratio_from_summary(summary)
 group_min_from_summary(summary)
 group_max_from_summary(summary)
 
-# Metric-specific functions returning summary and scalars
-<metric>_by_group(y_true, y_pred, *, sensitive_features, **other_kwargs)
+# Metric-specific functions returning group summary and scalars
+<metric>_group_summary(y_true, y_pred, *, sensitive_features, **other_kwargs)
 <metric>_difference(y_true, y_pred, *, sensitive_features, **other_kwargs)
 <metric>_ratio(y_true, y_pred, *, sensitive_features, **other_kwargs)
 <metric>_group_min(y_true, y_pred, *, sensitive_features, **other_kwargs)
 <metric>_group_max(y_true, y_pred, *, sensitive_features, **other_kwargs)
 ```
 
-*Summary of transformations and transformation codes*
+*Transformations and transformation codes*
 
 |transformation function|output|metric-specific function|code|aif360|
 |-----------------------|------|------------------------|----|------|
 |`difference_from_summary`|max - min|`<metric>_difference`|D|unprivileged - privileged|
 |`ratio_from_summary`|min / max|`<metric>_ratio`|R| unprivileged / privileged|
 |`group_min_from_summary`|min|`<metric>_group_min`|Min| N/A |
|`group_max_from_summary`|max|`<metric>_group_max`|Max| N/A |
 
-*Summary of tasks and task codes*
+*Tasks and task codes*
 
 |task|definition|code|
 |----|----------|----|
 |binary classification|labels and predictions are in {0,1}|class|
 |probabilistic binary classification|labels are in {0,1}, predictions are in [0,1] and correspond to estimates of P(y\|x)|prob|
 |randomized binary classification|labels are in {0,1}, predictions are in [0,1] and represent the probability of drawing y=1 in a randomized decision|class-rand|
 |regression|labels and predictions are real-valued|reg|
 
 *Predefined metric-specific functions*
 
-* variants: D, R, Min, Max refer to the transformations from the table above; G refers to `<metric>_by_group`.
+* variants: D, R, Min, Max refer to the transformations from the table above; G refers to `<metric>_group_summary`.

From 0b86e6d333bcc76d04c6cdce51c3256684e47944 Mon Sep 17 00:00:00 2001
From: Miro Dudik
Date: Mon, 16 Mar 2020 11:36:06 -0400
Subject: [PATCH 5/9] remove dashboard questions

Signed-off-by: Miro Dudik
---
 api/METRICS.md | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/api/METRICS.md b/api/METRICS.md
index e497e1d..98a2be2 100644
--- a/api/METRICS.md
+++ b/api/METRICS.md
@@ -104,11 +104,3 @@ group_max_from_summary(summary)
 |`log_loss`| G | prob | sklearn | - |
 |`mean_squared_error`| G | prob, reg | sklearn | - |
 |`r2_score`| G | reg | sklearn | - |
-
-## Dashboard questions
-
-1. Should we enable regression metrics for probabilistic classification?
-    * `mean_absolute_error`, `mean_squared_error`, `mean_squared_error(...,squared=False)`
-1. Should we introduce balanced error metrics for probabilistic classification?
-    * `balanced_mean_{squared,absolute}_error`, `balanced_log_loss`
-1. Do we keep `mean_prediction` and `mean_{over,under}prediction`?
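The engine sketched in the patches above can also be seen in running form. The following is an illustrative stand-in rather than the Fairlearn implementation: `SimpleNamespace` substitutes for `sklearn.utils.Bunch` and `accuracy_score` is hand-rolled, so the sketch runs with the standard library alone.

```python
from types import SimpleNamespace  # stand-in for sklearn.utils.Bunch


def accuracy_score(y_true, y_pred):
    # stand-in for sklearn.metrics.accuracy_score
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def group_summary(metric, y_true, y_pred, *, sensitive_features, **other_kwargs):
    # evaluate `metric` overall and separately on each sensitive-feature group
    by_group = {}
    for a in sorted(set(sensitive_features)):
        idx = [i for i, s in enumerate(sensitive_features) if s == a]
        by_group[a] = metric([y_true[i] for i in idx],
                             [y_pred[i] for i in idx], **other_kwargs)
    return SimpleNamespace(overall=metric(y_true, y_pred, **other_kwargs),
                           by_group=by_group)


def difference_from_summary(summary):
    # max over groups minus min over groups
    return max(summary.by_group.values()) - min(summary.by_group.values())


def ratio_from_summary(summary):
    # min over groups divided by max over groups
    return min(summary.by_group.values()) / max(summary.by_group.values())


y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]
sf = ['a', 'a', 'a', 'b', 'b', 'b']
summary = group_summary(accuracy_score, y_true, y_pred, sensitive_features=sf)
# by_group: {'a': 2/3, 'b': 1.0}; overall: 5/6; difference: 1/3
```

The same `summary` object can then be exported to pandas exactly as in the example at the top of the proposal.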
From 37a2c14540ebd2fa93ca61abdab0d0822dcf3a30 Mon Sep 17 00:00:00 2001
From: Miro Dudik
Date: Mon, 4 May 2020 14:22:32 -0400
Subject: [PATCH 6/9] add harmonization section

Signed-off-by: Miro Dudik
---
 api/METRICS.md | 191 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 178 insertions(+), 13 deletions(-)

diff --git a/api/METRICS.md b/api/METRICS.md
index 98a2be2..05c789e 100644
--- a/api/METRICS.md
+++ b/api/METRICS.md
@@ -86,21 +86,186 @@ group_max_from_summary(summary)
 
 |metric|variants|task|notes|aif360|
 |------|--------|-----|----|------|
-|`selection_rate`| G,D,R,Min | class | | ✓ |
-|`demographic_parity`| D,R | class | `selection_rate_difference`, `selection_rate_ratio` | `statistical_parity_difference`, `disparate_impact`|
-|`accuracy_score`| G,D,R,Min | class | sklearn | `accuracy` |
-|`balanced_accuracy_score` | G | class | sklearn | - |
-|`mean_absolute_error` | G,D,R,Max | class, reg | sklearn | class only: `error_rate` |
 |`confusion_matrix` | G | class | sklearn | `binary_confusion_matrix` |
 |`false_positive_rate` | G,D,R | class | | ✓ |
 |`false_negative_rate` | G | class | | ✓ |
 |`true_positive_rate` | G,D,R | class | | ✓ |
 |`true_negative_rate` | G | class | | ✓ |
+|`selection_rate`| G,D,R | class | | ✓ |
+|`demographic_parity`| D,R | class | `selection_rate_difference`, `selection_rate_ratio` | `statistical_parity_difference`, `disparate_impact`|
+|`equalized_odds` | D,R | class | max of difference or min ratio under `true_positive_rate`, `false_positive_rate` | - |
+|`accuracy_score`| G,D,R,Min | class | sklearn | 
`accuracy` |
+|`balanced_accuracy_score` | G,Min | class | sklearn | - |
+|`precision_score`| G,Min | class | sklearn | ✓ |
+|`recall_score`| G,Min | class | sklearn | ✓ |
+|`f1_score`| G,Min | class | sklearn | - |
+|`roc_auc_score`| G,Min | prob | sklearn | - |
+|`log_loss`| G,Max | prob | sklearn | - |
+|`mean_prediction`| G | prob, reg | | - |
+|`mean_absolute_error` | G,D,R,Max | class, reg | sklearn | class only: `error_rate` |
+|`mean_squared_error`| G,Max | prob, reg | sklearn | - |
+|`r2_score`| G,Min | reg | sklearn | - |
+|`_mean_overprediction` | G | class, reg | | - |
+|`_mean_underprediction` | G | class, reg | | - |
+|`_root_mean_squared_error`| G | prob, reg | | - |
+|`_balanced_root_mean_squared_error`| G | prob | | - |
+
+# Harmonizing metrics across sub-packages
+
+Various sub-packages of Fairlearn refer to related fairness concepts in many different ways. The goal of this part of the proposal is to harmonize their API with the one used in `fairlearn.metrics`.
+
+## Notation
+
+A _metric_ refers to any function that can be evaluated on subsets of the data. We use the notation _metric_(\*) for its value on the whole data set and _metric_(_a_) for its value on the subset of examples with sensitive feature value _a_. We write `<metric>` and `<Metric>` for literals representing the metric name.
+
+For example, if _metric_ is _accuracy_score_, then
+
+* _metric_(\*) = P[_Y=h(X)_]
+* _metric_(_a_) = P[_Y=h(X)_ | _A=a_]
+* `<metric>` = `accuracy_score`
+* `<Metric>` = `AccuracyScore`
+
+## Fairness metrics and fairness constraints
+
+Fairness metrics are expressed in terms of _metric_(\*) and _metric_(_a_), and they are used to define fairness constraints. Mitigation algorithms also have a notion of an objective, which is typically equal to _metric_(\*).
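To make the _metric_(\*) versus _metric_(_a_) notation concrete, here is a small hypothetical sketch for _selection_rate_; the helper names are illustrative only and are not part of the proposal.

```python
def selection_rate(y_pred):
    # fraction of positive predictions
    return sum(y_pred) / len(y_pred)


def selection_rate_by_group(y_pred, sensitive_features):
    # metric(a) for each sensitive-feature value a
    return {a: selection_rate([p for p, s in zip(y_pred, sensitive_features) if s == a])
            for a in set(sensitive_features)}


y_pred = [1, 0, 1, 1]
sf = ['a', 'a', 'b', 'b']
overall = selection_rate(y_pred)                 # metric(*)  -> 0.75
per_group = selection_rate_by_group(y_pred, sf)  # metric(a)  -> {'a': 0.5, 'b': 1.0}
```

A demographic parity difference of 0.5 then falls out as max over _a_ minus min over _a_ of the per-group values.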
+
+### fairlearn.metrics
+
+There are two kinds of fairness metrics in `fairlearn.metrics`:
+
+* Those systematically derived from some base metrics:
+
+  * `<metric>_difference` =
+  [max_a_ _metric_(_a_)] - [min_a_ _metric_(_a_)]
+  * `<metric>_ratio` =
+  [min_a_ _metric_(_a_)] / [max_a_ _metric_(_a_)]
+  * `<metric>_group_min` =
+  [min_a_ _metric_(_a_)]
+  * `<metric>_group_max` =
+  [max_a_ _metric_(_a_)]
+
+* Additional metrics that are defined in terms of the systematic functions:
+
+  * `demographic_parity_difference`
+  = `selection_rate_difference`
+
+  * `demographic_parity_ratio`
+  = `selection_rate_ratio`
+
+  * `equalized_odds_difference`
+  = max(`true_positive_rate_difference`, `false_positive_rate_difference`)
+
+  * `equalized_odds_ratio`
+  = min(`true_positive_rate_ratio`, `false_positive_rate_ratio`)
+
+### fairlearn.postprocessing
+
+**Status quo.** Our postprocessing algorithms use the following calling conventions:
+
+* `constraints` is a string that represents fairness constraints as follows:
+
+  * `"demographic_parity"`: for all _a_,
_selection_rate_(_a_) = _selection_rate_(\*).
+
+  * `"equalized_odds"`: for all _a_,
+ _true_positive_rate_(_a_) = _true_positive_rate_(\*),
_false_positive_rate_(_a_) = _false_positive_rate_(\*).
+
+* `objective` is always _accuracy_score_(\*).
+
+**Proposal.** The following proposal is an extension of the status quo; no breaking changes are introduced:
+
+* `constraints`: in addition to `"demographic_parity"` and `"equalized_odds"`, also allow:
+
+  * `"<metric>_parity"`: for all _a_,
_metric_(_a_) = _metric_(\*).
+
+* `objective` is a string of the form:
+
+  * `"<metric>"`:
goal is then to maximize _metric_(\*) subject to constraints
+
+### fairlearn.reductions
+
+We support two algorithms: `ExponentiatedGradient` and `GridSearch`. They both represent `constraints` and `objective` via objects of type `Moment`.
+
+**Status quo:**
+
+* In both cases, `objective` is automatically inferred from the provided `constraints`.
+
+* `ExponentiatedGradient(estimator, constraints, eps=`ε`)` considers constraints that are specified jointly by the provided `Moment` object and the numeric value ε as follows:
+
+  * `DemographicParity()`: for all _a_,
|_selection_rate_(_a_) - _selection_rate_(\*)| ≤ ε.
+
+  * `DemographicParity(ratio=`_r_`)`: for all _a_,
+ _r_ ⋅ _selection_rate_(_a_) - _selection_rate_(\*) ≤ ε,
_r_ ⋅ _selection_rate_(\*) - _selection_rate_(_a_) ≤ ε.
+
+  * `TruePositiveRateDifference()`: for all _a_,
|_true_positive_rate_(_a_) - _true_positive_rate_(\*)| ≤ ε.
+
+  * `TruePositiveRateDifference(ratio=`_r_`)`: analogous
+
+  * `EqualizedOdds()`: for all _a_,
+ |_true_positive_rate_(_a_) - _true_positive_rate_(\*)| ≤ ε,
|_false_positive_rate_(_a_) - _false_positive_rate_(\*)| ≤ ε.
+
+  * `EqualizedOdds(ratio=`_r_`)`: analogous
+
+  * `ErrorRateRatio()`: for all _a_,
|_error_rate_(_a_) - _error_rate_(\*)| ≤ ε.
+
+  * `ErrorRateDifference(ratio=`_r_`)`: analogous
+
+  * all of the above constraints are descendants of `ConditionalSelectionRate`
+
+  * the `objective` for all of the above constraints is `ErrorRate()`
+
+* `GridSearch(estimator, constraints)` considers constraints represented by the provided `Moment` object. The behavior of the `GridSearch` algorithm does not depend on the value of the right-hand side of constraints, so it is not provided to the constructor. In addition to the `Moment` objects above, `GridSearch` also supports the following:
+
+  * `GroupLossMoment(<loss>)`: for all _a_,
_loss_(_a_) ≤ ζ
+
+  * where `<loss>` is a _loss evaluation_ object; it supports an interface that takes `y_true` and `y_pred` as the input and returns the vector of losses evaluated on individual examples
+
+  * the `objective` for `GroupLossMoment(<loss>)` is `AverageLossMoment(<loss>)`
+
+  * both `GroupLossMoment` and `AverageLossMoment` are descendants of `ConditionalLossMoment`
+
+**Proposal.** The proposal introduces some breaking changes:
+
+* `constraints`:
+
+  * `Parity(difference_bound=`ε`)`: for all _a_,
|_metric_(_a_) - _metric_(\*)| ≤ ε.
+
+  * `Parity(ratio_bound=`_r_`, ratio_bound_slack=`ε`)`: for all _a_,
+ _r_ ⋅ _metric_(_a_) - _metric_(\*) ≤ ε,
_r_ ⋅ _metric_(\*) - _metric_(_a_) ≤ ε.
+
+  * `DemographicParity` and `EqualizedOdds` have the same calling convention as `Parity`
+
+  * `BoundedGroupLoss(<loss>, bound=`ζ`)`: for all _a_,
_loss_(_a_) ≤ ζ
+
+  _Alternative proposals_:
+  * `LossParity(<loss>, group_max_bound=`ζ`)`
+  * `Parity(group_max_bound=`ζ`)`
+
+
+* `objective`:
+  * `()` (for classification moments)
+
+  * `AverageLoss()` (for loss minimization moments)
+
+  _Alternative proposals_:
+  * `MeanLoss()`
+  * `OverallLoss()`
+  * the object `<loss>` doubles as (1) the loss evaluator and (2) the moment implementing the objective
+
+* the loss evaluation object `<loss>` needs to support the following API:
+
+  * `<loss>(y_true, y_pred)` returning the vector of losses on each example
+  * `<loss>.min_loss` the minimum value the loss evaluates to
+  * `<loss>.max_loss` the maximum value the loss evaluates to

From 51c101c14e83f6e085aae16ef3e0b4408b1fe8d0 Mon Sep 17 00:00:00 2001
From: Miro Dudik
Date: Mon, 25 May 2020 21:54:30 -0400
Subject: [PATCH 7/9] finalize the proposal

Signed-off-by: Miro Dudik
---
 api/METRICS.md | 129 +++++++++++++++++++++++++++++++------------------
 1 file changed, 81 insertions(+), 48 deletions(-)

diff --git a/api/METRICS.md b/api/METRICS.md
index 05c789e..4f3e186 100644
--- a/api/METRICS.md
+++ b/api/METRICS.md
@@ -1,6 +1,16 @@
 # API proposal for metrics
 
-## Example
+## Goals
+
+1. Metrics should have simple functional signatures that can be passed to `make_scorer`.
+
+1. Metrics that return structured objects—such as evaluating accuracy in each group—should be easy to turn into a `DataFrame` or `Series`.
+
+1. It should be possible to implement caching of intermediate results (at some point in the future). For example, many metrics can be derived from the set of group-level metric values across all groups, so it would be nice to avoid re-calculating the group summaries.
+
+1. Metrics that are derived from existing `sklearn` metrics should be recognizable.
+
+## Proposal in the form of an example
 
 ```python
 # For most sklearn metrics, we will have their group version that returns
@@ -35,26 +45,40 @@ acc_ratio = accuracy_score_ratio(y_true, y_pred, sensitive_features=sf, **other_
 acc_group_min = accuracy_score_group_min(y_true, y_pred, sensitive_features=sf, **other_kwargs)
 ```
 
-## Proposal
+## Proposal details
+
+The items that are not implemented yet are marked as `[TODO]`.
-*Function signatures*
+### Metrics engine
 
 ```python
 group_summary(metric, y_true, y_pred, *, sensitive_features, **other_kwargs)
 # return the group summary for the provided `metric`, where `metric` has the signature
 # metric(y_true, y_pred, **other_kwargs)
 
-make_metric_group_summary(metric)
+make_group_summary(metric) [TODO]
 # return a callable object <metric>_group_summary:
-# <metric>_group_summary(...) = group_summary(<metric>, ...)
+# <metric>_group_summary(...) = group_summary(metric, ...)
 
-# Transformation functions returning scalars
+# Transformation functions returning scalars (the definitions are below)
 difference_from_summary(summary)
 ratio_from_summary(summary)
 group_min_from_summary(summary)
 group_max_from_summary(summary)
 
+derived_metric(metric, transformation, y_true, y_pred, *, sensitive_features, **other_kwargs) [TODO]
+# return a metric derived from the provided `metric`, where `metric` has the signature
+# * metric(y_true, y_pred, **other_kwargs)
+# and `transformation` is a string 'difference', 'ratio', 'group_min' or 'group_max'.
+#
+# Alternatively, `metric` can be a metric group summary function, and `transformation`
+# can be a function, one of the functions <transformation>_from_summary.
+
+make_derived_metric(metric, transformation) [TODO]
+# return a callable object <metric>_<transformation>:
+# <metric>_<transformation>(...) = derived_metric(metric, transformation, ...)
+
+# Predefined metrics are named according to the following patterns:
 <metric>_group_summary(y_true, y_pred, *, sensitive_features, **other_kwargs)
 <metric>_difference(y_true, y_pred, *, sensitive_features, **other_kwargs)
 <metric>_ratio(y_true, y_pred, *, sensitive_features, **other_kwargs)
 <metric>_group_min(y_true, y_pred, *, sensitive_features, **other_kwargs)
 <metric>_group_max(y_true, y_pred, *, sensitive_features, **other_kwargs)
 ```
 
-*Transformations and transformation codes*
+### Definitions of transformation functions
 
-|transformation function|output|metric-specific function|code|aif360|
+|transformation function|output|derived metric name|code|aif360|
 |-----------------------|------|------------------------|----|------|
 |`difference_from_summary`|max - min|`<metric>_difference`|D|unprivileged - privileged|
 |`ratio_from_summary`|min / max|`<metric>_ratio`|R| unprivileged / privileged|
 |`group_min_from_summary`|min|`<metric>_group_min`|Min| N/A |
 |`group_max_from_summary`|max|`<metric>_group_max`|Max| N/A |
 
-*Tasks and task codes*
+### List of predefined metrics
+
+* In the list of predefined metrics, we refer to the following machine learning tasks:
 
+  |task|definition|code|
+  |----|----------|----|
+  |binary classification|labels and predictions are in {0,1}|class|
+  |probabilistic binary classification|labels are in {0,1}, predictions are in [0,1] and correspond to estimates of P(y\|x)|prob|
+  |randomized binary classification|labels are in {0,1}, predictions are in [0,1] and represent the probability of drawing y=1 in a randomized decision|class-rand|
+  |regression|labels and 
predictions are real-valued|reg|
+
+* For each _base metric_, we provide the list of predefined derived metrics, using D, R, Min, Max to refer to the transformations from the table above, and G to refer to `<metric>_group_summary`. We follow these rules:
+  * always provide G (except for demographic parity and equalized odds, which do not make sense as group-level metrics)
+  * provide D and R for confusion-matrix metrics
+  * provide Min for score functions (worst-case score)
+  * provide Max for error/loss functions (worst-case error)
+  * for internal API metrics starting with `_`, only provide G
 
 |metric|variants|task|notes|aif360|
 |------|--------|-----|----|------|
 |`confusion_matrix` | G | class | sklearn | `binary_confusion_matrix` |
 |`false_positive_rate` | G,D,R | class | | ✓ |
-|`false_negative_rate` | G | class | | ✓ |
+|`false_negative_rate` | G,D,R | class | | ✓ |
 |`true_positive_rate` | G,D,R | class | | ✓ |
-|`true_negative_rate` | G | class | | ✓ |
+|`true_negative_rate` | G,D,R | class | | ✓ |
 |`selection_rate`| G,D,R | class | | ✓ |
 |`demographic_parity`| D,R | class | `selection_rate_difference`, `selection_rate_ratio` | `statistical_parity_difference`, `disparate_impact`|
-|`equalized_odds` | D,R | class | max of difference or min ratio under `true_positive_rate`, `false_positive_rate` | - |
+|`equalized_odds` | D,R | class | max difference or min ratio under `true_positive_rate`, `false_positive_rate` | - |
 |`accuracy_score`| G,D,R,Min | class | sklearn | `accuracy` |
+|`zero_one_loss` | G,D,R,Max | class | sklearn | `error_rate` |
 |`balanced_accuracy_score` | G,Min | class | sklearn | - |
 |`precision_score`| G,Min | class | sklearn | ✓ |
 |`recall_score`| G,Min | class | sklearn | ✓ |
-|`f1_score`| G,Min | class | sklearn | - |
+|`f1_score` [TODO]| G,Min | class | sklearn | - |
 |`roc_auc_score`| G,Min | prob | 
sklearn | - |
-|`log_loss`| G,Max | prob | sklearn | - |
+|`log_loss` [TODO]| G,Max | prob | sklearn | - |
-|`mean_absolute_error` | G,D,R,Max | class, reg | sklearn | class only: `error_rate` |
+|`mean_absolute_error` | G,Max | reg | sklearn | - |
 |`mean_squared_error`| G,Max | prob, reg | sklearn | - |
 |`r2_score`| G,Min | reg | sklearn | - |
+|`mean_prediction`| G | prob, reg | | - |
 |`_mean_overprediction` | G | class, reg | | - |
 |`_mean_underprediction` | G | class, reg | | - |
 |`_root_mean_squared_error`| G | prob, reg | | - |
 |`_balanced_root_mean_squared_error`| G | prob | | - |
 
-# Harmonizing metrics across sub-packages
+# Harmonizing metrics across modules
 
-Various sub-packages of Fairlearn refer to related fairness concepts in many different ways. The goal of this part of the proposal is to harmonize their API with the one used in `fairlearn.metrics`.
+Various modules of Fairlearn refer to related fairness concepts in many different ways. The goal of this part of the proposal is to harmonize their API with the one used in `fairlearn.metrics`.
 
 ## Notation
 
 A _metric_ refers to any function that can be evaluated on subsets of the data. We use the notation _metric_(\*) for its value on the whole data set and _metric_(_a_) for its value on the subset of examples with sensitive feature value _a_. We write `<metric>` and `<Metric>` for literals representing the metric name.
 
 For example, if _metric_ is _accuracy_score_, then
 
 * _metric_(\*) = P[_Y=h(X)_]
 * _metric_(_a_) = P[_Y=h(X)_ | _A=a_]
 * `<metric>` = `accuracy_score`
 * `<Metric>` = `AccuracyScore`
 
 ## Fairness metrics and fairness constraints
 
-Fairness metrics are expressed in terms of _metric_(\*) and _metric_(_a_), and they are used to define fairness constraints. Mitigation algorithms also have a notion of an objective, which is typically equal to _metric_(\*).
+Fairness metrics are expressed in terms of _metric_(\*) and _metric_(_a_). Mitigation algorithms have a notion of _fairness constraints_, expressed in terms of fairness metrics, and a notion of an _objective_, typically equal to _metric_(\*).
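One plausible shape for the `make_derived_metric` factory marked `[TODO]` in the metrics engine is sketched below. The names mirror the proposal, but the body is purely illustrative and uses a hand-rolled stand-in for the sklearn metric; it is not the eventual Fairlearn code.

```python
def make_derived_metric(metric, transformation):
    # return a callable implementing the <metric>_<transformation> pattern;
    # `transformation` is 'difference', 'ratio', 'group_min' or 'group_max'
    def derived(y_true, y_pred, *, sensitive_features, **other_kwargs):
        values = {}
        for a in set(sensitive_features):
            idx = [i for i, s in enumerate(sensitive_features) if s == a]
            values[a] = metric([y_true[i] for i in idx],
                               [y_pred[i] for i in idx], **other_kwargs)
        lo, hi = min(values.values()), max(values.values())
        return {'difference': hi - lo, 'ratio': lo / hi,
                'group_min': lo, 'group_max': hi}[transformation]
    return derived


def accuracy_score(y_true, y_pred):
    # stand-in for sklearn.metrics.accuracy_score
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


accuracy_score_group_min = make_derived_metric(accuracy_score, 'group_min')
```

Caching group-level values inside such a factory is what goal 3 of the proposal alludes to: the per-group dictionary computed once could serve all four transformations.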
### fairlearn.metrics
 
 There are two kinds of fairness metrics in `fairlearn.metrics`:
 
-* Those systematically derived from some base metrics:
+* Fairness metrics derived from base metrics:
 
   * `<metric>_difference` =
-  [max_a_ _metric_(_a_)] - [min_a_ _metric_(_a_)]
+  [max_a_ _metric_(_a_)] - [min_a'_ _metric_(_a'_)]
   * `<metric>_ratio` =
-  [min_a_ _metric_(_a_)] / [max_a_ _metric_(_a_)]
+  [min_a_ _metric_(_a_)] / [max_a'_ _metric_(_a'_)]
   * `<metric>_group_min` =
   [min_a_ _metric_(_a_)]
   * `<metric>_group_max` =
   [max_a_ _metric_(_a_)]
 
-* Additional metrics that are defined in terms of the systematic functions:
+* Other fairness metrics:
 
   * `demographic_parity_difference`
    = `selection_rate_difference`

@@ -185,7 +216,7 @@ There are two kinds of fairness metrics in `fairlearn.metrics`:

  * `""`:
    goal is then to maximize _metric_(\*) subject to constraints

-### fairlearn.reductions
+### fairlearn.reductions [TODO]

We support two algorithms: `ExponentiatedGradient` and `GridSearch`. They both represent `constraints` and `objective` via objects of type `Moment`.

@@ -216,13 +247,13 @@ We support two algorithms: `ExponentiatedGradient` and `GridSearch`. They both r

  * `ErrorRateRatio()`: for all _a_,
    |_error_rate_(_a_) - _error_rate_(\*)| ≤ ε.

-  * `ErrorRateDifference(ratio=`_r_`)`: analogous
+  * `ErrorRateRatio(ratio=`_r_`)`: analogous

  * all of the above constraints are descendants of `ConditionalSelectionRate`

  * the `objective` for all of the above constraints is `ErrorRate()`

-* `GridSearch(estimator, constraints)` considers constraints represented by the provided `Moment` object. The behavior of the `GridSearch` algorithm does not depend on the value of the right-hand side of constraints, so it is not provided to the constructor. In addition to the `Moment` objects above, `GridSearch` also supports the following:
+* `GridSearch(estimator, constraints)` considers constraints represented by the provided `Moment` object. The behavior of the `GridSearch` algorithm does not depend on the value of the right-hand side of the constraints, so it is not provided to the constructor. In addition to the `Moment` objects above, `GridSearch` also supports the following:

  * `GroupLossMoment()`: for all _a_,
    _loss_(_a_) ≤ ζ

@@ -233,7 +264,7 @@ We support two algorithms: `ExponentiatedGradient` and `GridSearch`. They both r

  * both `GroupLossMoment` and `AverageLossMoment` are descendants of `ConditionalLossMoment`

-**Proposal.** The proposal introduces some breaking changes:
+**Proposal.** The proposal introduces many breaking changes:

* `constraints`:

@@ -246,26 +277,28 @@ We support two algorithms: `ExponentiatedGradient` and `GridSearch`. They both r

    * `DemographicParity` and `EqualizedOdds` have the same calling convention as `Parity`

+  * rename `ConditionalSelectionRate` to `UtilityParity`
+
  * `BoundedGroupLoss(<loss>, bound=`ζ`)`: for all _a_,
    _loss_(_a_) ≤ ζ

-  _Alternative proposals_:
-  * `LossParity(<loss>, group_max_bound=`ζ`)`
-  * `Parity(group_max_bound=`ζ`)`
+  * remove `ConditionalLossMoment`

* `objective`:

  * `()` (for classification moments)

-  * `AverageLoss()` (for loss minimization moments)
+ * `MeanLoss()` (for loss minimization moments)
+
+* the loss evaluator object `<loss>` needs to support the following API:

-  _Alternative proposals_:
-  * `MeanLoss()`
-  * `OverallLoss()`
-  * the object `<loss>` doubles as (1) the loss evaluator and (2) the moment implementing the objective
+  * `<loss>(y_true, y_pred)` returning the vector of losses on each example
+  * `<loss>.min_loss` the minimum value the loss evaluates to
+  * `<loss>.max_loss` the maximum value the loss evaluates to

-* the loss evaluation object `<loss>` needs to support the following API:
+  _Constructors_:

-  * `<loss>(y_true, y_pred)` returning the vector of losses on each example
-  * `<loss>.min_loss` the minimum value the loss evaluates to
-  * `<loss>.max_loss` the maximum value the loss evaluates to
+  * `SquareLoss(max_loss=...)`
+  * `AbsoluteLoss(max_loss=...)`
+  * `LogLoss(max_loss=...)`
+  * `ZeroOneLoss()`

From 82740e6ce792bf276406838434ad8ff5e30a79d0 Mon Sep 17 00:00:00 2001
From: Miro Dudik
Date: Mon, 25 May 2020 22:11:51 -0400
Subject: [PATCH 8/9] add upper bound to BGL

Signed-off-by: Miro Dudik
---
 api/METRICS.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/api/METRICS.md b/api/METRICS.md
index 4f3e186..a2a187d 100644
--- a/api/METRICS.md
+++ b/api/METRICS.md
@@ -279,7 +279,7 @@ We support two algorithms: `ExponentiatedGradient` and `GridSearch`. They both r

    * rename `ConditionalSelectionRate` to `UtilityParity`

-  * `BoundedGroupLoss(<loss>, bound=`ζ`)`: for all _a_,
+  * `BoundedGroupLoss(<loss>, upper_bound=`ζ`)`: for all _a_,
    _loss_(_a_) ≤ ζ

  * remove `ConditionalLossMoment`

From b1fc2b6841a1ce98f080192c605d112721d79520 Mon Sep 17 00:00:00 2001
From: Miro Dudik
Date: Tue, 26 May 2020 22:21:09 -0400
Subject: [PATCH 9/9] address some comments

Signed-off-by: Miro Dudik
---
 api/METRICS.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/api/METRICS.md b/api/METRICS.md
index a2a187d..8178b6d 100644
--- a/api/METRICS.md
+++ b/api/METRICS.md
@@ -67,7 +67,8 @@ group_min_from_summary(summary)
group_max_from_summary(summary)

derived_metric(metric, transformation, y_true, y_pred, *, sensitive_features, **other_kwargs) [TODO]
-# return a metric derived from the provided `metric`, where `metric` has the signature
+# return the value of a metric derived from the provided `metric`, where `metric`
+# has the signature
# * metric(y_true, y_pred, **other_kwargs)
# and `transformation` is a string 'difference', 'ratio', 'group_min' or 'group_max'.
#
@@ -95,6 +96,8 @@ make_derived_metric(metric, transformation) [TODO]
|`group_min_from_summary`|min|`<metric>_group_min`|Min| N/A |
|`group_max_from_summary`|max|`<metric>_group_max`|Max| N/A |

+> _Note_: The ratio min/max should evaluate to `np.nan` if min<0.0, and to 1.0 if min=max=0.0.
+
### List of predefined metrics

* In the list of predefined metrics, we refer to the following machine learning tasks:

@@ -108,7 +111,7 @@ make_derived_metric(metric, transformation) [TODO]

* For each _base metric_, we provide the list of predefined derived metrics, using D, R, Min, Max to refer to the transformations from the table above, and G to refer to `<metric>_group_summary`.
  We follow these rules:
  * always provide G (except for demographic parity and equalized odds, which do not make sense as group-level metrics)
-  * provide D and R for confusion-matrix metrics
+  * provide D and R for confusion-matrix-derived metrics
  * provide Min for score functions (worst-case score)
  * provide Max for error/loss functions (worst-case error)
  * for internal API metrics starting with `_`, only provide G
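To make the `make_derived_metric` semantics and the ratio edge cases from the note above concrete, here is a minimal plain-Python sketch. The factory name and the transformation strings come from the proposal; the implementation and the toy `selection_rate` metric are illustrative only:

```python
import math

def make_derived_metric(metric, transformation):
    """Return a scalar metric derived from `metric` (sketch of the proposed factory).

    `transformation` is one of 'difference', 'ratio', 'group_min', 'group_max'.
    """
    def derived(y_true, y_pred, *, sensitive_features):
        by_group = {}
        for a in set(sensitive_features):
            idx = [i for i, s in enumerate(sensitive_features) if s == a]
            by_group[a] = metric([y_true[i] for i in idx], [y_pred[i] for i in idx])
        lo, hi = min(by_group.values()), max(by_group.values())
        if transformation == 'difference':
            return hi - lo
        if transformation == 'ratio':
            if lo < 0.0:
                return math.nan      # ratio undefined when the minimum is negative
            if lo == 0.0 and hi == 0.0:
                return 1.0           # 0/0 treated as perfect parity
            return lo / hi
        if transformation == 'group_min':
            return lo
        if transformation == 'group_max':
            return hi
        raise ValueError(f"unknown transformation: {transformation!r}")
    return derived

def selection_rate(y_true, y_pred):  # toy base metric; ignores y_true
    return sum(y_pred) / len(y_pred)

dp_ratio = make_derived_metric(selection_rate, 'ratio')
r = dp_ratio([0, 1, 1, 0], [1, 1, 0, 0], sensitive_features=['a', 'a', 'b', 'b'])
# group 'a' selection rate = 1.0, group 'b' = 0.0, so the ratio is 0.0
```

Under this sketch, `demographic_parity_ratio` is simply `make_derived_metric(selection_rate, 'ratio')`, matching the naming convention in the tables above.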