# API proposal for metrics

## Goals

1. Metrics should have simple functional signatures that can be passed to `make_scorer`.

1. Metrics that return structured objects—such as evaluating accuracy in each group—should be easy to turn into a `DataFrame` or `Series`.

1. It should be possible to implement caching of intermediate results (at some point in future). For example, many metrics can be derived from the set of group-level metric values across all groups, so it would be nice to avoid re-calculating the group summaries.

1. Metrics that are derived from existing `sklearn` metrics should be recognizable.

## Proposal in the form of an example

```python
# For most sklearn metrics, we will have their group version that returns
acc_ratio = accuracy_score_ratio(y_true, y_pred, sensitive_features=sf, **other_kwargs)
acc_group_min = accuracy_score_group_min(y_true, y_pred, sensitive_features=sf, **other_kwargs)
```

## Proposal details

The items that are not implemented yet are marked as `[TODO]`.

### Metrics engine

```python
group_summary(metric, y_true, y_pred, *, sensitive_features, **other_kwargs)
# return the group summary for the provided `metric`, where `metric` has the signature
# metric(y_true, y_pred, **other_kwargs)

make_group_summary(metric) [TODO]
# return a callable object <metric>_group_summary:
# <metric>_group_summary(...) = group_summary(metric, ...)

# Transformation functions returning scalars (the definitions are below)
difference_from_summary(summary)
ratio_from_summary(summary)
group_min_from_summary(summary)
group_max_from_summary(summary)

# Metric-specific functions returning group summary and scalars
derived_metric(metric, transformation, y_true, y_pred, *, sensitive_features, **other_kwargs) [TODO]
# return the value of a metric derived from the provided `metric`, where `metric`
# has the signature
# * metric(y_true, y_pred, **other_kwargs)
# and `transformation` is a string 'difference', 'ratio', 'group_min' or 'group_max'.
#
# Alternatively, `metric` can be a metric group summary function, and `transformation`
# can be one of the functions <transformation>_from_summary.

make_derived_metric(metric, transformation) [TODO]
# return a callable object <metric>_<transformation>:
# <metric>_<transformation>(...) = derived_metric(metric, transformation, ...)

# Predefined metrics are named according to the following patterns:
<metric>_group_summary(y_true, y_pred, *, sensitive_features, **other_kwargs)
<metric>_difference(y_true, y_pred, *, sensitive_features, **other_kwargs)
<metric>_ratio(y_true, y_pred, *, sensitive_features, **other_kwargs)
<metric>_group_min(y_true, y_pred, *, sensitive_features, **other_kwargs)
<metric>_group_max(y_true, y_pred, *, sensitive_features, **other_kwargs)
```
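To make the intended semantics concrete, here is a minimal pure-Python sketch of `group_summary`, two of the transformation functions, and `make_derived_metric`. This is an illustration, not the actual implementation; in particular, the summary is represented as a plain dict.

```python
def group_summary(metric, y_true, y_pred, *, sensitive_features, **other_kwargs):
    """Evaluate `metric` on the whole data set and on each group."""
    groups = {}
    for yt, yp, sf in zip(y_true, y_pred, sensitive_features):
        groups.setdefault(sf, ([], []))
        groups[sf][0].append(yt)
        groups[sf][1].append(yp)
    return {
        "overall": metric(y_true, y_pred, **other_kwargs),
        "by_group": {sf: metric(yt, yp, **other_kwargs)
                     for sf, (yt, yp) in groups.items()},
    }

def difference_from_summary(summary):
    # max - min over the group-level metric values
    return max(summary["by_group"].values()) - min(summary["by_group"].values())

def group_min_from_summary(summary):
    return min(summary["by_group"].values())

def make_derived_metric(metric, transformation):
    transform = {"difference": difference_from_summary,
                 "group_min": group_min_from_summary}[transformation]
    def derived(y_true, y_pred, *, sensitive_features, **other_kwargs):
        return transform(group_summary(metric, y_true, y_pred,
                                       sensitive_features=sensitive_features,
                                       **other_kwargs))
    return derived
```

Because the derived callable keeps the plain `(y_true, y_pred, *, sensitive_features, **other_kwargs)` signature, it composes naturally with `make_scorer`-style wrappers (Goal 1).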

### Definitions of transformation functions

|transformation function|output|derived metric name|code|aif360|
|-----------------------|------|------------------------|----|------|
|`difference_from_summary`|max - min|`<metric>_difference`|D|unprivileged - privileged|
|`ratio_from_summary`|min / max|`<metric>_ratio`|R| unprivileged / privileged|
|`group_min_from_summary`|min|`<metric>_group_min`|Min| N/A |
|`group_max_from_summary`|max|`<metric>_group_max`|Max| N/A |

> _Note_: The ratio min/max should evaluate to `np.nan` if min &lt; 0.0, and to 1.0 if min = max = 0.0.
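A sketch of `ratio_from_summary` honoring these edge cases (illustrative only; it assumes the summary is a dict with a `"by_group"` mapping, as in the engine sketch):

```python
import math

def ratio_from_summary(summary):
    # summary["by_group"] maps each sensitive-feature value to the metric value
    values = summary["by_group"].values()
    lo, hi = min(values), max(values)
    if lo < 0.0:
        return math.nan   # ratios are not meaningful for negative metric values
    if lo == 0.0 and hi == 0.0:
        return 1.0        # all groups agree at zero, so report perfect parity
    return lo / hi
```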

### List of predefined metrics

* In the list of predefined metrics, we refer to the following machine learning tasks:

|task|definition|code|
|----|----------|----|
|binary classification|labels and predictions are in {0,1}|class|
|probabilistic binary classification|labels are in {0,1}, predictions are in [0,1] and correspond to estimates of P(y\|x)|prob|
|randomized binary classification|labels are in {0,1}, predictions are in [0,1] and represent the probability of drawing y=1 in a randomized decision|class-rand|
|regression|labels and predictions are real-valued|reg|

* For each _base metric_, we provide the list of predefined derived metrics, using D, R, Min, Max to refer to the transformations from the table above, and G to refer to `<metric>_group_summary`. We follow these rules:
* always provide G (except for demographic parity and equalized odds, which do not make sense as group-level metrics)
* provide D and R for confusion-matrix-derived metrics
* provide Min for score functions (worst-case score)
* provide Max for error/loss functions (worst-case error)
* for internal API metrics starting with `_`, only provide G


|metric|variants|task|notes|aif360|
|------|--------|-----|----|------|
|`confusion_matrix` | G | class | sklearn | `binary_confusion_matrix` |
|`false_positive_rate` | G,D,R | class | | &#x2713; |
|`false_negative_rate` | G,D,R | class | | &#x2713; |
|`true_positive_rate` | G,D,R | class | | &#x2713; |
|`true_negative_rate` | G,D,R | class | | &#x2713; |
|`selection_rate`| G,D,R | class | | &#x2713; |
|`demographic_parity`| D,R | class | `selection_rate_difference`, `selection_rate_ratio` | `statistical_parity_difference`, `disparate_impact`|
|`equalized_odds` | D,R | class | max difference or min ratio under `true_positive_rate`, `false_positive_rate` | - |
|`accuracy_score`| G,D,R,Min | class | sklearn | `accuracy` |
|`zero_one_loss` | G,D,R,Max | class | sklearn | `error_rate` |
|`balanced_accuracy_score` | G,Min | class | sklearn | - |
|`precision_score`| G,Min | class | sklearn | &#x2713; |
|`recall_score`| G,Min | class | sklearn | &#x2713; |
|`f1_score` [TODO]| G,Min | class | sklearn | - |
|`roc_auc_score`| G,Min | prob | sklearn | - |
|`log_loss` [TODO]| G,Max | prob | sklearn | - |
|`mean_absolute_error` | G,Max | reg | sklearn | - |
|`mean_squared_error`| G,Max | prob, reg | sklearn | - |
|`r2_score`| G,Min | reg | sklearn | - |
|`mean_prediction`| G | prob, reg | | - |
|`_mean_overprediction` | G | class, reg | | - |
|`_mean_underprediction` | G | class, reg | | - |
|`_root_mean_squared_error`| G | prob, reg | | - |
|`_balanced_root_mean_squared_error`| G | prob | | - |
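As an illustration of the naming pattern, a hand-written stand-in for `selection_rate` and its D variant `selection_rate_difference` might look like this. The real functions are generated by the metrics engine, so this is only a sketch of their behavior:

```python
def selection_rate(y_true, y_pred):
    # fraction of examples predicted as the positive class
    return sum(1 for yp in y_pred if yp == 1) / len(y_pred)

def selection_rate_difference(y_true, y_pred, *, sensitive_features):
    # max - min of the per-group selection rates
    by_group = {}
    for yp, sf in zip(y_pred, sensitive_features):
        by_group.setdefault(sf, []).append(yp)
    rates = [sum(1 for v in yp if v == 1) / len(yp) for yp in by_group.values()]
    return max(rates) - min(rates)
```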

# Harmonizing metrics across modules

Various modules of Fairlearn refer to related fairness concepts in many different ways. The goal of this part of the proposal is to harmonize their APIs with the one used in `fairlearn.metrics`.

## Notation

A _metric_ refers to any function that can be evaluated on subsets of the data. We use the notation _metric_(\*) for its value on the whole data set and _metric_(_a_) for its value on the subset of examples with sensitive feature value _a_. We write `<metric>` and `<Metric>` for literals representing the metric name.

For example, if _metric_ is _accuracy_score_, then

* _metric_(\*) = P[_Y=h(X)_]
* _metric_(_a_) = P[_Y=h(X)_ | _A=a_]
* `<metric>` = `accuracy_score`
* `<Metric>` = `AccuracyScore`
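The `<metric>` to `<Metric>` mapping is a plain snake_case to CamelCase conversion; for illustration (the helper name is made up):

```python
def to_class_name(metric_name):
    # "accuracy_score" -> "AccuracyScore"
    return "".join(part.capitalize() for part in metric_name.split("_"))
```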

## Fairness metrics and fairness constraints

Fairness metrics are expressed in terms of _metric_(\*) and _metric_(_a_). Mitigation algorithms have a notion of _fairness constraints_, expressed in terms of fairness metrics, and a notion of an _objective_, typically equal to _metric_(\*).

### fairlearn.metrics

There are two kinds of fairness metrics in `fairlearn.metrics`:

* Fairness metrics derived from base metrics:

* `<metric>_difference` =
[max<sub>_a_</sub> _metric_(_a_)] - [min<sub>_a'_</sub> _metric_(_a'_) ]
* `<metric>_ratio` =
[min<sub>_a_</sub> _metric_(_a_)] / [max<sub>_a'_</sub> _metric_(_a'_) ]
* `<metric>_group_min` =
[min<sub>_a_</sub> _metric_(_a_)]
* `<metric>_group_max` =
[max<sub>_a_</sub> _metric_(_a_)]

* Other fairness metrics:

* `demographic_parity_difference` <br>
= `selection_rate_difference`

* `demographic_parity_ratio` <br>
= `selection_rate_ratio`

* `equalized_odds_difference` <br>
= max(`true_positive_rate_difference`, `false_positive_rate_difference`)

* `equalized_odds_ratio` <br>
= min(`true_positive_rate_ratio`, `false_positive_rate_ratio`)
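To make the equalized-odds definitions concrete, here is an illustrative pure-Python computation of `equalized_odds_difference`; it is a sketch of the definition above, not the library implementation:

```python
def _rate(y_true, y_pred, label):
    # fraction of examples with y_true == label that are predicted positive;
    # label=1 gives the true positive rate, label=0 the false positive rate
    preds = [yp for yt, yp in zip(y_true, y_pred) if yt == label]
    return sum(preds) / len(preds)

def _rate_difference(y_true, y_pred, sensitive_features, label):
    rates = []
    for g in set(sensitive_features):
        yt = [a for a, s in zip(y_true, sensitive_features) if s == g]
        yp = [b for b, s in zip(y_pred, sensitive_features) if s == g]
        rates.append(_rate(yt, yp, label))
    return max(rates) - min(rates)

def equalized_odds_difference(y_true, y_pred, *, sensitive_features):
    # max of the TPR difference and the FPR difference across groups
    return max(_rate_difference(y_true, y_pred, sensitive_features, label=1),
               _rate_difference(y_true, y_pred, sensitive_features, label=0))
```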

### fairlearn.postprocessing

**Status quo.** Our postprocessing algorithms use the following calling conventions:

* `constraints` is a string that represents fairness constraints as follows:

* `"demographic_parity"`: for all _a_, <br>
_selection_rate_(_a_) = _selection_rate_(\*).

* `"equalized_odds"`: for all _a_,<br>
_true_positive_rate_(_a_) = _true_positive_rate_(\*), <br>
_false_positive_rate_(_a_) = _false_positive_rate_(\*).

* `objective` is always _accuracy_score_(\*).

**Proposal.** The following proposal is an extension of the status quo; no breaking changes are introduced:

* `constraints`: in addition to `"demographic_parity"` and `"equalized_odds"`, also allow:

* `"<metric>_parity"`: for all _a_,<br>
_metric_(_a_) = _metric_(\*).

* `objective` is a string of the form:

* `"<metric>"`: <br>
goal is then to maximize _metric_(\*) subject to constraints

### fairlearn.reductions [TODO]

We support two algorithms: `ExponentiatedGradient` and `GridSearch`. They both represent `constraints` and `objective` via objects of type `Moment`.

**Status quo:**

* In both cases, `objective` is automatically inferred from the provided `constraints`.

* `ExponentiatedGradient(estimator, constraints, eps=`&epsilon;`)` considers constraints that are specified jointly by the provided `Moment` object and the numeric value &epsilon; as follows:

* `DemographicParity()`: for all _a_, <br>
|_selection_rate_(_a_) - _selection_rate_(\*)| &le; &epsilon;.

* `DemographicParity(ratio=`_r_`)`: for all _a_, <br>
_r_ &sdot; _selection_rate_(_a_) - _selection_rate_(\*) &le; &epsilon;, <br>
_r_ &sdot; _selection_rate_(\*) - _selection_rate_(_a_) &le; &epsilon;.

* `TruePositiveRateDifference()`: for all _a_, <br>
|_true_positive_rate_(_a_) - _true_positive_rate_(\*)| &le; &epsilon;.

* `TruePositiveRateDifference(ratio=`_r_`)`: analogous

* `EqualizedOdds()`: for all _a_, <br>
|_true_positive_rate_(_a_) - _true_positive_rate_(\*)| &le; &epsilon;, <br>
|_false_positive_rate_(_a_) - _false_positive_rate_(\*)| &le; &epsilon;.

* `EqualizedOdds(ratio=`_r_`)`: analogous

* `ErrorRateRatio()`: for all _a_, <br>
|_error_rate_(_a_) - _error_rate_(\*)| &le; &epsilon;.

* `ErrorRateRatio(ratio=`_r_`)`: analogous

* all of the above constraints are descendants of `ConditionalSelectionRate`

* the `objective` for all of the above constraints is `ErrorRate()`

* `GridSearch(estimator, constraints)` considers constraints represented by the provided `Moment` object. The behavior of the `GridSearch` algorithm does not depend on the value of the right-hand side of the constraints, so it is not provided to the constructor. In addition to the `Moment` objects above, `GridSearch` also supports the following:

* `GroupLossMoment(<loss>)`: for all _a_, <br>
_loss_(_a_) &le; &zeta;

* where `<loss>` is a _loss evaluation_ object; it supports an interface that takes `y_true` and `y_pred` as input and returns the vector of losses evaluated on individual examples

* the `objective` for `GroupLossMoment(<loss>)` is `AverageLossMoment(<loss>)`

* both `GroupLossMoment` and `AverageLossMoment` are descendants of `ConditionalLossMoment`
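For intuition, the demographic-parity constraint family described above can be checked with a few lines of illustrative code (the helper name is invented for this sketch):

```python
def max_demographic_parity_violation(y_pred, sensitive_features):
    # largest |selection_rate(a) - selection_rate(*)| over groups a;
    # the DemographicParity() constraint holds iff this value is <= eps
    overall = sum(y_pred) / len(y_pred)
    violations = []
    for g in set(sensitive_features):
        group_preds = [p for p, s in zip(y_pred, sensitive_features) if s == g]
        violations.append(abs(sum(group_preds) / len(group_preds) - overall))
    return max(violations)
```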

**Proposal.** The proposal introduces many breaking changes:

* `constraints`:

* `<Metric>Parity(difference_bound=`&epsilon;`)`: for all _a_, <br>
|_metric_(_a_) - _metric_(\*)| &le; &epsilon;.

* `<Metric>Parity(ratio_bound=`_r_`, ratio_bound_slack=`&epsilon;`)`: for all _a_, <br>
_r_ &sdot; _metric_(_a_) - _metric_(\*) &le; &epsilon;, <br>
_r_ &sdot; _metric_(\*) - _metric_(_a_) &le; &epsilon;.

* `DemographicParity` and `EqualizedOdds` have the same calling convention as `<Metric>Parity`

* rename `ConditionalSelectionRate` to `UtilityParity`

* `BoundedGroupLoss(<loss>, upper_bound=`&zeta;`)`: for all _a_, <br>
_loss_(_a_) &le; &zeta;

* remove `ConditionalLossMoment`
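An illustrative check of the ratio-bound constraint form proposed above (the helper and its argument names are invented for this sketch):

```python
def satisfies_ratio_bound(metric_by_group, metric_overall, ratio_bound, slack):
    # for all groups a:
    #   r * metric(a) - metric(*) <= eps  and  r * metric(*) - metric(a) <= eps
    return all(
        ratio_bound * value - metric_overall <= slack
        and ratio_bound * metric_overall - value <= slack
        for value in metric_by_group.values()
    )
```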


* `objective`:
* `<Metric>()` (for classification moments)

* `MeanLoss(<loss>)` (for loss minimization moments) <br>

* the loss evaluator object `<loss>` needs to support the following API:

* `<loss>(y_true, y_pred)` returning the vector of losses on each example
* `<loss>.min_loss` the minimum value the loss evaluates to
* `<loss>.max_loss` the maximum value the loss evaluates to
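A minimal evaluator satisfying this API might look as follows; the body is illustrative, not the shipped class:

```python
class ZeroOneLoss:
    # classification loss: 0 when the prediction matches the label, 1 otherwise
    min_loss = 0.0
    max_loss = 1.0

    def __call__(self, y_true, y_pred):
        return [0.0 if yt == yp else 1.0 for yt, yp in zip(y_true, y_pred)]
```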

_Constructors_:

* `SquareLoss(max_loss=...)`
* `AbsoluteLoss(max_loss=...)`
* `LogLoss(max_loss=...)`
* `ZeroOneLoss()`