fix indexing bug with DataFrames & null weights #1

Open
baraldian wants to merge 1 commit into MiroDudik:scalable from baraldian:scalable

Conversation

@baraldian commented Dec 6, 2022

Weights can be null due to subsampling, and matrices are not aligned.

Description

Tests

  • no new tests required
  • new tests added
  • existing tests adjusted

Documentation

  • no documentation changes needed
  • user guide added or updated
  • API docs added or updated
  • example notebook added or updated

Screenshots

@MiroDudik (Owner)

@baraldian -- can you construct some short examples that would elicit these bugs?

if self.subsample != None:
index_sub = _sample(n=self.subsample, weights=redW, random_state=self.random_state)
redX_subsampled = self.constraints.X[index_sub, :]
index_sub = redW.index.isin(index_sub)
@MiroDudik (Owner)

This is incorrect, because it replaces any repeated samples by a single sample. We need those repetitions.
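A minimal sketch, with made-up data rather than the PR's variables, of why converting the sampled index to a boolean mask loses repetitions:

```python
import pandas as pd

# Hypothetical stand-ins for redW and the subsample drawn from it.
redW = pd.Series([0.2, 0.3, 0.5], index=[10, 11, 12])
index_sub = pd.Index([10, 10, 12])   # label 10 was sampled twice

# A boolean mask can only say "row present or not", so the repeat is lost.
mask = redW.index.isin(index_sub)
print(mask.sum())                    # 2 rows survive, not 3

# Label-based lookup keeps the repeated sample.
print(len(redW.loc[index_sub]))      # 3
```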

@baraldian (Author)

You are right, but here I got the following:

pandas.errors.InvalidIndexError: (array([50693, 66061, 55676, 50330,...
       29425, 61648]), slice(None, None, None))

or

{KeyError}"None of [Int64Index([50693, 66061, ...        29425, 61648],
          dtype='int64')] are in the [columns]"

self.constraints.X is a pd.DataFrame, so we need at least .iloc;
see the same error here

But we may run into trouble if X turns into a numpy.ndarray.

However, I think
redX_subsampled = self.constraints.X.iloc[index_sub]
may be fine.
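The error is easy to reproduce on a toy DataFrame; this sketch (hypothetical data) shows numpy-style 2-D indexing failing while .iloc keeps both the positions and the repeated row:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("abc"))
idx = np.array([2, 0, 2])            # positional subsample with a repeat

# numpy-style 2-D indexing is not supported on DataFrames; depending on
# the pandas version this raises InvalidIndexError or KeyError, as in
# the tracebacks above.
raised = None
try:
    X[idx, :]
except Exception as exc:
    raised = exc
print(type(raised).__name__)

# .iloc treats idx as positions and keeps the repeated row.
X_sub = X.iloc[idx]
print(X_sub.shape)                   # (3, 3)
```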

current_estimator.fit(X_sub, y_reduction_sub)
else:
current_estimator.fit(X, y_reduction, **{self.sample_weight_name: weights})
current_estimator.fit(X, y_reduction, **{self.sample_weight_name: weights.fillna(0)})
@MiroDudik (Owner)

Something is off when weights take NaN values. Are the labels or sensitive features undefined?
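For context, a small sketch (hypothetical data, plain scikit-learn) of the failure mode the fillna(0) patch works around: fit rejects NaN sample weights outright, while a zero weight keeps the row but gives it no influence:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0]})
y = pd.Series([0, 0, 1, 1])
weights = pd.Series([1.0, np.nan, 1.0, 1.0])  # NaN as after subsampling

est = LogisticRegression()
try:
    est.fit(X, y, sample_weight=weights)      # rejected: weights contain NaN
except ValueError:
    pass

# fillna(0) keeps the row but removes its influence on the fit.
est.fit(X, y, sample_weight=weights.fillna(0))
print(est.coef_.shape)                        # (1, 1)
```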

group_event_select / self.prob_group_event[event, group]
if self.M is not None:
return self.utility_diff * self.M.dot(lambda_vec)
M, lambda_vec = self.M.align(lambda_vec, axis=1, fill_value=0, copy=False)
@MiroDudik (Owner)

Did this come up in a situation when lambda_vec.index was a subset of M.columns or was it some other situation?

@baraldian (Author) commented Jan 10, 2023

I think it's a subset.

self.M.columns is

[('+', 'all', 1), ('+', 'all', 2), ('+', 'all', 3), ('+', 'all', 5), ('+', 'all', 6), ('+', 'all', 7),
 ('+', 'all', 8), ('+', 'all', 9), ('-', 'all', 1), ('-', 'all', 2), ('-', 'all', 3), ('-', 'all', 5),
 ('-', 'all', 6), ('-', 'all', 7), ('-', 'all', 8), ('-', 'all', 9)]

lambda_vec.index

MultiIndex([
('+', 'all', 1),('+', 'all', 2),('+', 'all', 3),('+', 'all', 6),
('+', 'all', 8),('+', 'all', 9),('-', 'all', 1),('-', 'all', 2),('-', 'all', 3),
('-', 'all', 6),('-', 'all', 8),('-', 'all', 9)],
           names=['sign', 'event', 'group_id'])
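A shrunken, hypothetical version of that situation shows what the align call does when lambda_vec.index is a strict subset of M.columns: the missing lambda entries are treated as zeros, so the dot product becomes well defined.

```python
import pandas as pd

names = ['sign', 'event', 'group_id']
M = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0]],
    columns=pd.MultiIndex.from_tuples(
        [('+', 'all', 1), ('+', 'all', 2), ('-', 'all', 1), ('-', 'all', 2)],
        names=names))
lambda_vec = pd.Series(
    [10.0, 20.0],
    index=pd.MultiIndex.from_tuples(
        [('+', 'all', 1), ('-', 'all', 2)], names=names))

# M.dot(lambda_vec) would fail on the label mismatch; align pads the
# missing lambda entries with 0 so the product is well defined.
M_aligned, lv_aligned = M.align(lambda_vec, axis=1, fill_value=0)
print(M_aligned.dot(lv_aligned).iloc[0])      # 1*10 + 4*20 = 90.0
```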

@baraldian (Author)

In this notebook I replicated the setting. (I had to install miniconda because fairlearn does not work with the base Python version on Colab.)

I first train ExponentiatedGradient on a subsample of the data and then apply GridSearch over a bigger sample of the data, passing grid=expgrad_frac.lambda_vecs_.

I'm using datasets from folktables
pip install 'folktables @ git+https://github.com/zykls/folktables'

This is the data loader I'm using

import os
from copy import deepcopy

import numpy as np
import pandas as pd
from folktables import ACSDataSource, generate_categories, ACSPublicCoverage
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

from fairlearn.reductions import ExponentiatedGradient, GridSearch, DemographicParity

def load_transform_ACS(loader_method, states=None):
    data_source = ACSDataSource(survey_year=2018, horizon='1-Year', survey='person')
    definition_df = data_source.get_definitions(download=True)
    categories = generate_categories(features=loader_method.features, definition_df=definition_df)
    acs_data = data_source.get_data(
        download=True, states=states)  # TODO # with density 1  random_seed=0 do nothing | join_household=False ???
    df, label, group = loader_method.df_to_pandas(acs_data, categories=categories)
    del acs_data
    categorical_cols = list(categories.keys())
    # See here for data documentation of cols https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/
    # https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMSDataDict00_02.pdf
    df = pd.get_dummies(df, dtype=np.uint8, columns=categorical_cols)
    numerical_cols = np.setdiff1d(loader_method.features, categorical_cols)
    df[numerical_cols] = StandardScaler().fit_transform(df[numerical_cols])
    return df, label.iloc[:, 0].astype(int), group.iloc[:, 0]

os.makedirs('data/2018/1-Year/', exist_ok=True)

This is how to reproduce the error

X, y, A = load_transform_ACS(loader_method=ACSPublicCoverage, states=['CA'])
base_model = LogisticRegression(solver='liblinear', fit_intercept=True, random_state=42)
constraint = DemographicParity(difference_bound=0.05)
expgrad_frac = ExponentiatedGradient(
    estimator=deepcopy(base_model), constraints=deepcopy(constraint), eps=0.05, nu=1e-6)
expgrad_frac.fit(X.iloc[:1000], y.iloc[:1000], sensitive_features=A.iloc[:1000])
grid_search_frac = GridSearch(
    LogisticRegression(solver='liblinear', fit_intercept=True),
    constraints=constraint, grid=expgrad_frac.lambda_vecs_)
grid_search_frac.fit(X, y, sensitive_features=A)
