Add harmonization of metrics to Metric API #9
# API proposal for metrics

## Goals

1. Metrics should have simple functional signatures that can be passed to `make_scorer`.

1. Metrics that return structured objects—such as evaluating accuracy in each group—should be easy to turn into a `DataFrame` or `Series`.

1. It should be possible to implement caching of intermediate results (at some point in the future). For example, many metrics can be derived from the set of group-level metric values across all groups, so it would be nice to avoid re-calculating the group summaries.

1. Metrics that are derived from existing `sklearn` metrics should be recognizable.

## Proposal in the form of an example

```python
# For most sklearn metrics, we will have their group version that returns
# ...
acc_ratio = accuracy_score_ratio(y_true, y_pred, sensitive_features=sf, **other_kwargs)
acc_group_min = accuracy_score_group_min(y_true, y_pred, sensitive_features=sf, **other_kwargs)
```
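To make the example concrete, here is a minimal sketch of how such functions could behave. This is illustrative only; the return shape of `group_summary`, the helper below, and the toy data are assumptions, not the settled API.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def group_summary(metric, y_true, y_pred, *, sensitive_features, **other_kwargs):
    # Evaluate `metric` overall and on each group defined by the sensitive feature.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    sf = np.asarray(sensitive_features)
    by_group = {g: metric(y_true[sf == g], y_pred[sf == g], **other_kwargs)
                for g in np.unique(sf)}
    return {"overall": metric(y_true, y_pred, **other_kwargs),
            "by_group": by_group}

def accuracy_score_group_min(y_true, y_pred, *, sensitive_features, **other_kwargs):
    # Worst-case accuracy across groups.
    summary = group_summary(accuracy_score, y_true, y_pred,
                            sensitive_features=sensitive_features, **other_kwargs)
    return min(summary["by_group"].values())

y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]
sf = ["a", "a", "a", "b", "b", "b"]
acc_group_min = accuracy_score_group_min(y_true, y_pred, sensitive_features=sf)
# group "a" has accuracy 2/3, group "b" has accuracy 1.0, so the minimum is 2/3
```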
## Proposal details

The items that are not implemented yet are marked as `[TODO]`.

### Metrics engine

```python
group_summary(metric, y_true, y_pred, *, sensitive_features, **other_kwargs)
# return the group summary for the provided `metric`, where `metric` has the signature
#   metric(y_true, y_pred, **other_kwargs)

make_group_summary(metric) [TODO]
# return a callable object <metric>_group_summary:
#   <metric>_group_summary(...) = group_summary(metric, ...)

# Transformation functions returning scalars (the definitions are below)
difference_from_summary(summary)
ratio_from_summary(summary)
group_min_from_summary(summary)
group_max_from_summary(summary)

# Metric-specific functions returning group summary and scalars
derived_metric(metric, transformation, y_true, y_pred, *, sensitive_features, **other_kwargs) [TODO]
# return the value of a metric derived from the provided `metric`, where `metric`
# has the signature
#   metric(y_true, y_pred, **other_kwargs)
# and `transformation` is one of the strings 'difference', 'ratio', 'group_min' or 'group_max'.
#
# Alternatively, `metric` can be a metric group summary function, and `transformation`
# can be one of the functions <transformation>_from_summary.

make_derived_metric(metric, transformation) [TODO]
# return a callable object <metric>_<transformation>:
#   <metric>_<transformation>(...) = derived_metric(metric, transformation, ...)

# Predefined metrics are named according to the following patterns:
<metric>_group_summary(y_true, y_pred, *, sensitive_features, **other_kwargs)
<metric>_difference(y_true, y_pred, *, sensitive_features, **other_kwargs)
<metric>_ratio(y_true, y_pred, *, sensitive_features, **other_kwargs)
<metric>_group_min(y_true, y_pred, *, sensitive_features, **other_kwargs)
<metric>_group_max(y_true, y_pred, *, sensitive_features, **other_kwargs)
```
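Since `make_derived_metric` is still `[TODO]`, one possible sketch of the factory is shown below. The per-group helper and the transformation lambdas are illustrative assumptions, not a settled implementation.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def _by_group(metric, y_true, y_pred, *, sensitive_features, **kw):
    # Minimal per-group evaluation; stands in for the full group summary.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    sf = np.asarray(sensitive_features)
    return {g: metric(y_true[sf == g], y_pred[sf == g], **kw) for g in np.unique(sf)}

_TRANSFORMATIONS = {
    "difference": lambda vals: max(vals) - min(vals),
    "ratio": lambda vals: min(vals) / max(vals),
    "group_min": min,
    "group_max": max,
}

def make_derived_metric(metric, transformation):
    transform = _TRANSFORMATIONS[transformation]

    def derived(y_true, y_pred, *, sensitive_features, **kw):
        values = _by_group(metric, y_true, y_pred,
                           sensitive_features=sensitive_features, **kw).values()
        return transform(list(values))

    derived.__name__ = f"{metric.__name__}_{transformation}"
    return derived

accuracy_score_difference = make_derived_metric(accuracy_score, "difference")
```

Giving the returned callable a name of the form `<metric>_<transformation>` keeps derived metrics recognizable, per the goals above.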
### Definitions of transformation functions

|transformation function|output|derived metric name|code|aif360|
|-----------------------|------|-------------------|----|------|
|`difference_from_summary`|max - min|`<metric>_difference`|D|unprivileged - privileged|
|`ratio_from_summary`|min / max|`<metric>_ratio`|R|unprivileged / privileged|
|`group_min_from_summary`|min|`<metric>_group_min`|Min|N/A|
|`group_max_from_summary`|max|`<metric>_group_max`|Max|N/A|

> _Note_: The ratio min/max should evaluate to `np.nan` if min < 0.0, and to 1.0 if min = max = 0.0.
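The edge cases in the note could be handled as follows. This is a sketch; it assumes the per-group metric values are available as a plain list.

```python
import math

def ratio_from_summary(by_group_values):
    # min/max ratio with the edge cases from the note above
    lo, hi = min(by_group_values), max(by_group_values)
    if lo < 0.0:
        return math.nan  # ratio is not meaningful for negative metric values
    if lo == 0.0 and hi == 0.0:
        return 1.0       # all groups equal at zero
    return lo / hi

ratio_from_summary([0.25, 0.5])   # 0.5
ratio_from_summary([0.0, 0.0])    # 1.0
ratio_from_summary([-0.1, 0.2])   # nan
```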
### List of predefined metrics

In the list of predefined metrics, we refer to the following machine learning tasks:

|task|definition|code|
|----|----------|----|
|binary classification|labels and predictions are in {0,1}|class|
|probabilistic binary classification|labels are in {0,1}, predictions are in [0,1] and correspond to estimates of P(y\|x)|prob|
|randomized binary classification|labels are in {0,1}, predictions are in [0,1] and represent the probability of drawing y=1 in a randomized decision|class-rand|
|regression|labels and predictions are real-valued|reg|

For each _base metric_, we provide the list of predefined derived metrics, using D, R, Min, Max to refer to the transformations from the table above, and G to refer to `<metric>_group_summary`. We follow these rules:

* always provide G (except for demographic parity and equalized odds, which do not make sense as group-level metrics)
* provide D and R for confusion-matrix-derived metrics
* provide Min for score functions (worst-case score)
* provide Max for error/loss functions (worst-case error)
* for internal API metrics starting with `_`, only provide G
|metric|variants|task|notes|aif360|
|------|--------|----|-----|------|
|`confusion_matrix` | G | class | sklearn | `binary_confusion_matrix` |
|`false_positive_rate` | G,D,R | class | | ✓ |
|`false_negative_rate` | G,D,R | class | | ✓ |
|`true_positive_rate` | G,D,R | class | | ✓ |
|`true_negative_rate` | G,D,R | class | | ✓ |
|`selection_rate`| G,D,R | class | | ✓ |
|`demographic_parity`| D,R | class | `selection_rate_difference`, `selection_rate_ratio` | `statistical_parity_difference`, `disparate_impact`|
|`equalized_odds` | D,R | class | max difference or min ratio under `true_positive_rate`, `false_positive_rate` | - |
|`accuracy_score`| G,D,R,Min | class | sklearn | `accuracy` |
|`zero_one_loss` | G,D,R,Max | class | sklearn | `error_rate` |
|`balanced_accuracy_score` | G,Min | class | sklearn | - |
|`precision_score`| G,Min | class | sklearn | ✓ |
|`recall_score`| G,Min | class | sklearn | ✓ |
|`f1_score` [TODO]| G,Min | class | sklearn | - |
|`roc_auc_score`| G,Min | prob | sklearn | - |
|`log_loss` [TODO]| G,Max | prob | sklearn | - |
|`mean_absolute_error` | G,Max | reg | sklearn | - |
|`mean_squared_error`| G,Max | prob, reg | sklearn | - |
|`r2_score`| G,Min | reg | sklearn | - |
|`mean_prediction`| G | prob, reg | | - |
|`_mean_overprediction` | G | class, reg | | - |
|`_mean_underprediction` | G | class, reg | | - |
|`_root_mean_squared_error`| G | prob, reg | | - |
|`_balanced_root_mean_squared_error`| G | prob | | - |
# Harmonizing metrics across modules

Various modules of Fairlearn refer to related fairness concepts in many different ways. The goal of this part of the proposal is to harmonize their API with the one used in `fairlearn.metrics`.

## Notation

A _metric_ refers to any function that can be evaluated on subsets of the data. We use the notation _metric_(\*) for its value on the whole data set and _metric_(_a_) for its value on the subset of examples with sensitive feature value _a_. We write `<metric>` and `<Metric>` for literals representing the metric name.

For example, if _metric_ is _accuracy_score_, then

* _metric_(\*) = P[_Y=h(X)_]
* _metric_(_a_) = P[_Y=h(X)_ | _A=a_]
* `<metric>` = `accuracy_score`
* `<Metric>` = `AccuracyScore`
## Fairness metrics and fairness constraints

Fairness metrics are expressed in terms of _metric_(\*) and _metric_(_a_). Mitigation algorithms have a notion of _fairness constraints_, expressed in terms of fairness metrics, and a notion of an _objective_, typically equal to _metric_(\*).

### fairlearn.metrics

There are two kinds of fairness metrics in `fairlearn.metrics`:

* Fairness metrics derived from base metrics:

  * `<metric>_difference` = [max<sub>_a_</sub> _metric_(_a_)] - [min<sub>_a'_</sub> _metric_(_a'_)]
  * `<metric>_ratio` = [min<sub>_a_</sub> _metric_(_a_)] / [max<sub>_a'_</sub> _metric_(_a'_)]
  * `<metric>_group_min` = min<sub>_a_</sub> _metric_(_a_)
  * `<metric>_group_max` = max<sub>_a_</sub> _metric_(_a_)

* Other fairness metrics:

  * `demographic_parity_difference` = `selection_rate_difference`
  * `demographic_parity_ratio` = `selection_rate_ratio`
  * `equalized_odds_difference` = max(`true_positive_rate_difference`, `false_positive_rate_difference`)
  * `equalized_odds_ratio` = min(`true_positive_rate_ratio`, `false_positive_rate_ratio`)
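For instance, with made-up per-group rates, the derived fairness metrics above reduce to simple max/min arithmetic. The numbers below are purely illustrative:

```python
# illustrative per-group values of the base metrics
selection_rate = {"a": 0.50, "b": 0.40}
true_positive_rate = {"a": 0.80, "b": 0.60}
false_positive_rate = {"a": 0.10, "b": 0.25}

def difference(by_group):
    return max(by_group.values()) - min(by_group.values())

def ratio(by_group):
    return min(by_group.values()) / max(by_group.values())

demographic_parity_difference = difference(selection_rate)           # ≈ 0.10
equalized_odds_difference = max(difference(true_positive_rate),
                                difference(false_positive_rate))     # ≈ 0.20
equalized_odds_ratio = min(ratio(true_positive_rate),
                           ratio(false_positive_rate))               # ≈ 0.40
```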
### fairlearn.postprocessing

**Status quo.** Our postprocessing algorithms use the following calling conventions:

* `constraints` is a string that represents fairness constraints as follows:

  * `"demographic_parity"`: for all _a_, <br>
    _selection_rate_(_a_) = _selection_rate_(\*).

  * `"equalized_odds"`: for all _a_, <br>
    _true_positive_rate_(_a_) = _true_positive_rate_(\*), <br>
    _false_positive_rate_(_a_) = _false_positive_rate_(\*).

* `objective` is always _accuracy_score_(\*).

**Proposal.** The following proposal is an extension of the status quo; no breaking changes are introduced:

* `constraints`: in addition to `"demographic_parity"` and `"equalized_odds"`, also allow:

  * `"<metric>_parity"`: for all _a_, <br>
    _metric_(_a_) = _metric_(\*).

* `objective` is a string of the form:

  * `"<metric>"`: <br>
    the goal is then to maximize _metric_(\*) subject to the constraints.
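A hypothetical helper shows how the extended `constraints` strings could be resolved under this naming convention. The parser below is not part of the proposal itself, just an illustration of how the `"<metric>_parity"` pattern composes with the status-quo strings:

```python
def parse_constraints(constraints):
    # Accept the status-quo strings as-is; otherwise expect "<metric>_parity"
    # and return the base metric name the parity constraint refers to.
    if constraints in {"demographic_parity", "equalized_odds"}:
        return constraints
    if constraints.endswith("_parity") and len(constraints) > len("_parity"):
        return constraints[: -len("_parity")]
    raise ValueError(f"unsupported constraints string: {constraints!r}")

parse_constraints("recall_score_parity")  # returns "recall_score"
```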
### fairlearn.reductions [TODO]

We support two algorithms: `ExponentiatedGradient` and `GridSearch`. They both represent `constraints` and `objective` via objects of type `Moment`.

**Status quo:**

* In both cases, `objective` is automatically inferred from the provided `constraints`.

* `ExponentiatedGradient(estimator, constraints, eps=`ε`)` considers constraints that are specified jointly by the provided `Moment` object and the numeric value ε as follows:

  * `DemographicParity()`: for all _a_, <br>
    |_selection_rate_(_a_) - _selection_rate_(\*)| ≤ ε.

  * `DemographicParity(ratio=`_r_`)`: for all _a_, <br>
    _r_ ⋅ _selection_rate_(_a_) - _selection_rate_(\*) ≤ ε, <br>
    _r_ ⋅ _selection_rate_(\*) - _selection_rate_(_a_) ≤ ε.

  * `TruePositiveRateDifference()`: for all _a_, <br>
    |_true_positive_rate_(_a_) - _true_positive_rate_(\*)| ≤ ε.

  * `TruePositiveRateDifference(ratio=`_r_`)`: analogous

  * `EqualizedOdds()`: for all _a_, <br>
    |_true_positive_rate_(_a_) - _true_positive_rate_(\*)| ≤ ε, <br>
    |_false_positive_rate_(_a_) - _false_positive_rate_(\*)| ≤ ε.

  * `EqualizedOdds(ratio=`_r_`)`: analogous

  * `ErrorRateRatio()`: for all _a_, <br>
    |_error_rate_(_a_) - _error_rate_(\*)| ≤ ε.

  * `ErrorRateRatio(ratio=`_r_`)`: analogous

  * all of the above constraints are descendants of `ConditionalSelectionRate`

  * the `objective` for all of the above constraints is `ErrorRate()`
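Note that with _r_ = 1 the two ratio inequalities coincide with the absolute-difference form. A quick numeric sanity check, with illustrative values only:

```python
def ratio_constraints_hold(rate_a, rate_overall, r, eps):
    # r * metric(a) - metric(*) <= eps  and  r * metric(*) - metric(a) <= eps
    return (r * rate_a - rate_overall <= eps and
            r * rate_overall - rate_a <= eps)

def difference_constraint_holds(rate_a, rate_overall, eps):
    # |metric(a) - metric(*)| <= eps
    return abs(rate_a - rate_overall) <= eps

# with r = 1 the two formulations agree
assert ratio_constraints_hold(0.55, 0.50, r=1.0, eps=0.1)
assert difference_constraint_holds(0.55, 0.50, eps=0.1)
assert not ratio_constraints_hold(0.70, 0.50, r=1.0, eps=0.1)
```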
* `GridSearch(estimator, constraints)` considers constraints represented by the provided `Moment` object. The behavior of the `GridSearch` algorithm does not depend on the value of the right-hand side of the constraints, so it is not provided to the constructor. In addition to the `Moment` objects above, `GridSearch` also supports the following:

  * `GroupLossMoment(<loss>)`: for all _a_, <br>
    _loss_(_a_) ≤ ζ

    * where `<loss>` is a _loss evaluation_ object; it supports an interface that takes `y_true` and `y_pred` as the input and returns the vector of losses evaluated on individual examples

    * the `objective` for `GroupLossMoment(<loss>)` is `AverageLossMoment(<loss>)`

    * both `GroupLossMoment` and `AverageLossMoment` are descendants of `ConditionalLossMoment`
**Proposal.** The proposal introduces many breaking changes:

* `constraints`:

  * `<Metric>Parity(difference_bound=`ε`)`: for all _a_, <br>
    |_metric_(_a_) - _metric_(\*)| ≤ ε.

  * `<Metric>Parity(ratio_bound=`_r_`, ratio_bound_slack=`ε`)`: for all _a_, <br>
    _r_ ⋅ _metric_(_a_) - _metric_(\*) ≤ ε, <br>
    _r_ ⋅ _metric_(\*) - _metric_(_a_) ≤ ε.

  * `DemographicParity` and `EqualizedOdds` have the same calling convention as `<Metric>Parity`

  * rename `ConditionalSelectionRate` to `UtilityParity`

  * `BoundedGroupLoss(<loss>, upper_bound=`ζ`)`: for all _a_, <br>
    _loss_(_a_) ≤ ζ

  * remove `ConditionalLossMoment`

* `objective`:

  * `<Metric>()` (for classification moments)

  * `MeanLoss(<loss>)` (for loss minimization moments)

* the loss evaluator object `<loss>` needs to support the following API:

  * `<loss>(y_true, y_pred)` returning the vector of losses on each example
  * `<loss>.min_loss`: the minimum value the loss evaluates to
  * `<loss>.max_loss`: the maximum value the loss evaluates to

_Constructors_:

* `SquareLoss(max_loss=...)`
* `AbsoluteLoss(max_loss=...)`
* `LogLoss(max_loss=...)`
* `ZeroOneLoss()`
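A minimal loss evaluator satisfying this API might look as follows. Only the interface, namely `__call__`, `min_loss`, and `max_loss`, is taken from the proposal; the clipping behavior is an assumption made so that `max_loss` genuinely bounds every returned value.

```python
import numpy as np

class SquareLoss:
    # Per-example squared loss; losses are clipped at max_loss so that
    # min_loss and max_loss bound every value the evaluator can return.
    def __init__(self, max_loss):
        self.min_loss = 0.0
        self.max_loss = max_loss

    def __call__(self, y_true, y_pred):
        # vector of losses on each example
        diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
        return np.minimum(diff ** 2, self.max_loss)

loss = SquareLoss(max_loss=4.0)
per_example = loss([0.0, 1.0, 5.0], [0.5, 1.0, 0.0])  # losses: 0.25, 0.0, 4.0
```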