fix indexing bug with DataFrames & null weights #1

Open
baraldian wants to merge 1 commit into MiroDudik:scalable from baraldian:scalable

Conversation

@baraldian commented Dec 6, 2022

Weights can be null due to subsampling, and matrices are not aligned.

Description

Tests

  • no new tests required
  • new tests added
  • existing tests adjusted

Documentation

  • no documentation changes needed
  • user guide added or updated
  • API docs added or updated
  • example notebook added or updated

Screenshots

@MiroDudik (Owner)

@baraldian -- can you construct some short examples that would elicit these bugs?

if self.subsample != None:
index_sub = _sample(n=self.subsample, weights=redW, random_state=self.random_state)
redX_subsampled = self.constraints.X[index_sub, :]
index_sub = redW.index.isin(index_sub)
@MiroDudik (Owner)

This is incorrect, because it replaces any repeated samples by a single sample. We need those repetitions.
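A minimal sketch, with made-up data rather than the PR's variables, of why converting the sampled index to a boolean mask loses repetitions:

```python
import pandas as pd

# Hypothetical stand-ins for redW and the subsample drawn from it.
redW = pd.Series([0.2, 0.3, 0.5], index=[10, 11, 12])
index_sub = pd.Index([10, 10, 12])   # label 10 was sampled twice

# A boolean mask can only say "row present or not", so the repeat is lost.
mask = redW.index.isin(index_sub)
print(mask.sum())                    # 2 rows survive, not 3

# Label-based lookup keeps the repeated sample.
print(len(redW.loc[index_sub]))      # 3
```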

@baraldian (Author)

You are right, but here I got the following:

pandas.errors.InvalidIndexError: (array([50693, 66061, 55676, 50330,...
       29425, 61648]), slice(None, None, None))

or

{KeyError}"None of [Int64Index([50693, 66061, ...        29425, 61648],
          dtype='int64')] are in the [columns]"

self.constraints.X is a pd.DataFrame, so we need at least .iloc;
see the same error here

But we may run into trouble if X turns into a numpy.ndarray.

However, I think
redX_subsampled = self.constraints.X.iloc[index_sub]
may be fine.
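The error is easy to reproduce on a toy DataFrame; this sketch (hypothetical data) shows numpy-style 2-D indexing failing while .iloc keeps both the positions and the repeated row:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("abc"))
idx = np.array([2, 0, 2])            # positional subsample with a repeat

# numpy-style 2-D indexing is not supported on DataFrames; depending on
# the pandas version this raises InvalidIndexError or KeyError, as in
# the tracebacks above.
raised = None
try:
    X[idx, :]
except Exception as exc:
    raised = exc
print(type(raised).__name__)

# .iloc treats idx as positions and keeps the repeated row.
X_sub = X.iloc[idx]
print(X_sub.shape)                   # (3, 3)
```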

current_estimator.fit(X_sub, y_reduction_sub)
else:
current_estimator.fit(X, y_reduction, **{self.sample_weight_name: weights})
current_estimator.fit(X, y_reduction, **{self.sample_weight_name: weights.fillna(0)})
@MiroDudik (Owner)

Something is off when weights take NaN values. Are the labels or sensitive features undefined?
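For context, a small sketch (hypothetical data, plain scikit-learn) of the failure mode the fillna(0) patch works around: fit rejects NaN sample weights outright, while a zero weight keeps the row but gives it no influence:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0]})
y = pd.Series([0, 0, 1, 1])
weights = pd.Series([1.0, np.nan, 1.0, 1.0])  # NaN as after subsampling

est = LogisticRegression()
try:
    est.fit(X, y, sample_weight=weights)      # rejected: weights contain NaN
except ValueError:
    pass

# fillna(0) keeps the row but removes its influence on the fit.
est.fit(X, y, sample_weight=weights.fillna(0))
print(est.coef_.shape)                        # (1, 1)
```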

group_event_select / self.prob_group_event[event, group]
if self.M is not None:
return self.utility_diff * self.M.dot(lambda_vec)
M, lambda_vec = self.M.align(lambda_vec, axis=1, fill_value=0, copy=False)
@MiroDudik (Owner)

Did this come up in a situation when lambda_vec.index was a subset of M.columns or was it some other situation?

@baraldian (Author) commented Jan 10, 2023

I think it's a subset.

self.M.columns is

[('+', 'all', 1), ('+', 'all', 2), ('+', 'all', 3), ('+', 'all', 5), ('+', 'all', 6), ('+', 'all', 7),
 ('+', 'all', 8), ('+', 'all', 9), ('-', 'all', 1), ('-', 'all', 2), ('-', 'all', 3), ('-', 'all', 5),
 ('-', 'all', 6), ('-', 'all', 7), ('-', 'all', 8), ('-', 'all', 9)]

lambda_vec.index

MultiIndex([
('+', 'all', 1),('+', 'all', 2),('+', 'all', 3),('+', 'all', 6),
('+', 'all', 8),('+', 'all', 9),('-', 'all', 1),('-', 'all', 2),('-', 'all', 3),
('-', 'all', 6),('-', 'all', 8),('-', 'all', 9)],
           names=['sign', 'event', 'group_id'])
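A shrunken, hypothetical version of that situation shows what the align call does when lambda_vec.index is a strict subset of M.columns: the missing lambda entries are treated as zeros, so the dot product becomes well defined.

```python
import pandas as pd

names = ['sign', 'event', 'group_id']
M = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0]],
    columns=pd.MultiIndex.from_tuples(
        [('+', 'all', 1), ('+', 'all', 2), ('-', 'all', 1), ('-', 'all', 2)],
        names=names))
lambda_vec = pd.Series(
    [10.0, 20.0],
    index=pd.MultiIndex.from_tuples(
        [('+', 'all', 1), ('-', 'all', 2)], names=names))

# M.dot(lambda_vec) would fail on the label mismatch; align pads the
# missing lambda entries with 0 so the product is well defined.
M_aligned, lv_aligned = M.align(lambda_vec, axis=1, fill_value=0)
print(M_aligned.dot(lv_aligned).iloc[0])      # 1*10 + 4*20 = 90.0
```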

@baraldian (Author)

In this notebook I replicated the setting. (I had to install miniconda because fairlearn does not work with the base Python version on Colab.)

I first train ExponentiatedGradient on a subsample of the data and then apply GridSearch over a bigger sample of the data, passing grid=expgrad_frac.lambda_vecs_.

I'm using datasets from folktables
pip install 'folktables @ git+https://github.com/zykls/folktables'

This is the data loader I'm using

import os
from copy import deepcopy

import numpy as np
import pandas as pd
from folktables import ACSDataSource, generate_categories, ACSPublicCoverage
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

from fairlearn.reductions import ExponentiatedGradient, GridSearch, DemographicParity

def load_transform_ACS(loader_method, states=None):
    data_source = ACSDataSource(survey_year=2018, horizon='1-Year', survey='person')
    definition_df = data_source.get_definitions(download=True)
    categories = generate_categories(features=loader_method.features, definition_df=definition_df)
    acs_data = data_source.get_data(
        download=True, states=states)  # TODO # with density 1  random_seed=0 do nothing | join_household=False ???
    df, label, group = loader_method.df_to_pandas(acs_data, categories=categories)
    del acs_data
    categorical_cols = list(categories.keys())
    # See here for data documentation of cols https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/
    # https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMSDataDict00_02.pdf
    df = pd.get_dummies(df, dtype=np.uint8, columns=categorical_cols)
    numerical_cols = np.setdiff1d(loader_method.features, categorical_cols)
    df[numerical_cols] = StandardScaler().fit_transform(df[numerical_cols])
    return df, label.iloc[:, 0].astype(int), group.iloc[:, 0]

os.makedirs('data/2018/1-Year/', exist_ok=True)

This is how to reproduce the error

X, y, A = load_transform_ACS(loader_method=ACSPublicCoverage, states=['CA'])
base_model = LogisticRegression(solver='liblinear', fit_intercept=True, random_state=42)
constraint = DemographicParity(difference_bound=0.05)
expgrad_frac = ExponentiatedGradient(
    estimator=deepcopy(base_model), constraints=deepcopy(constraint), eps=0.05, nu=1e-6)
expgrad_frac.fit(X.iloc[:1000], y.iloc[:1000], sensitive_features=A.iloc[:1000])
grid_search_frac = GridSearch(
    LogisticRegression(solver='liblinear', fit_intercept=True),
    constraints=constraint, grid=expgrad_frac.lambda_vecs_)
grid_search_frac.fit(X, y, sensitive_features=A)
