fix indexing bug with DataFrames & null weights #1
baraldian wants to merge 1 commit into MiroDudik:scalable from
Conversation
@baraldian -- can you construct some short examples that would elicit these bugs?
if self.subsample != None:
    index_sub = _sample(n=self.subsample, weights=redW, random_state=self.random_state)
    redX_subsampled = self.constraints.X[index_sub, :]
    index_sub = redW.index.isin(index_sub)
This is incorrect, because it replaces any repeated samples by a single sample. We need those repetitions.
You are right, but here I got the following:

pandas.errors.InvalidIndexError: (array([50693, 66061, 55676, 50330, ...
29425, 61648]), slice(None, None, None))

or

KeyError: "None of [Int64Index([50693, 66061, ... 29425, 61648],
dtype='int64')] are in the [columns]"
self.constraints.X is a pd.DataFrame, so we need at least .iloc (see the same error here).
But we may have trouble if X turns into a numpy.ndarray. However, I think
redX_subsampled = self.constraints.X.iloc[index_sub]
may be fine.
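A minimal sketch (made-up data, not from the PR) of the two points above: positional .iloc on a DataFrame preserves the repeated rows produced by sampling with replacement, while a boolean .isin mask collapses the repetitions that the first review comment says we need.

```python
# Sketch with hypothetical data: .iloc keeps repeated rows, .isin does not.
import numpy as np
import pandas as pd

X = pd.DataFrame({"a": [10, 20, 30, 40]})
index_sub = np.array([3, 1, 1, 0])  # sampled with replacement: row 1 repeats

sub = X.iloc[index_sub]             # positional indexing keeps the repetition
assert list(sub["a"]) == [40, 20, 20, 10]

mask = X.index.isin(index_sub)      # boolean mask: the duplicate is lost
assert len(X[mask]) == 3            # only 3 distinct rows survive

# If X turns into a numpy.ndarray, plain positional indexing gives the same rows:
assert (X.to_numpy()[index_sub, :] == sub.to_numpy()).all()
```

The same positional indices work on both a DataFrame (via .iloc) and an ndarray, which is why .iloc looks like the safer fix here.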
    current_estimator.fit(X_sub, y_reduction_sub)
else:
    current_estimator.fit(X, y_reduction, **{self.sample_weight_name: weights})
    current_estimator.fit(X, y_reduction, **{self.sample_weight_name: weights.fillna(0)})
Something is off when weights take NaN values. Are the labels or sensitive features undefined?
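A small sketch (hypothetical weights, not the PR's data) of what the fillna(0) in the diff above does: rows whose weight came out as NaN, e.g. after subsampling and reindexing, would otherwise propagate NaN into the fit, whereas a zero weight just makes them contribute nothing.

```python
# Sketch with hypothetical weights: replace NaN weights with 0 so the
# affected rows are effectively ignored instead of corrupting the fit.
import numpy as np
import pandas as pd

weights = pd.Series([0.5, np.nan, 1.5, np.nan])
clean = weights.fillna(0)

assert not clean.isna().any()       # no NaNs remain
assert clean.sum() == 2.0           # 0.5 + 0 + 1.5 + 0

# e.g. estimator.fit(X, y, sample_weight=clean.to_numpy())
```

Whether zeroing is the right semantics depends on the reviewer's question above: if the NaNs come from undefined labels or sensitive features, dropping those rows might be more appropriate than silently down-weighting them.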
group_event_select / self.prob_group_event[event, group]
if self.M is not None:
    return self.utility_diff * self.M.dot(lambda_vec)
M, lambda_vec = self.M.align(lambda_vec, axis=1, fill_value=0, copy=False)
Did this come up in a situation when lambda_vec.index was a subset of M.columns or was it some other situation?
I think it's a subset.
self.M.columns is
[
('+', 'all', 1),('+', 'all', 2),('+', 'all', 3),('+', 'all', 5),('+', 'all', 6),('+', 'all', 7),
('+', 'all', 8),('+', 'all', 9),('-', 'all', 1),('-', 'all', 2),('-', 'all', 3),('-', 'all', 5),
('-', 'all', 6),('-', 'all', 7),('-', 'all', 8),('-', 'all', 9)]
lambda_vec.index
MultiIndex([
('+', 'all', 1),('+', 'all', 2),('+', 'all', 3),('+', 'all', 6),
('+', 'all', 8),('+', 'all', 9),('-', 'all', 1),('-', 'all', 2),('-', 'all', 3),
('-', 'all', 6),('-', 'all', 8),('-', 'all', 9)],
names=['sign', 'event', 'group_id'])
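A reduced sketch (a 4-column toy M rather than the 16-column one above) of the subset situation: a bare M.dot(lambda_vec) needs matching indices, while align(..., axis=1, fill_value=0), as in the diff, pads the lambdas missing from lambda_vec.index with zeros before the dot product.

```python
# Sketch with a toy M: lambda_vec.index is a strict subset of M.columns,
# so we align first, filling the missing lambda entries with 0.
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [("+", "all", 1), ("+", "all", 2), ("-", "all", 1), ("-", "all", 2)],
    names=["sign", "event", "group_id"],
)
M = pd.DataFrame([[1.0, 2.0, 3.0, 4.0]], columns=cols)

# lambda_vec covers only 2 of the 4 columns of M
lambda_vec = pd.Series([1.0, 1.0], index=cols[[0, 2]])

M_aligned, lam = M.align(lambda_vec, axis=1, fill_value=0)
assert (M_aligned.dot(lam) == 4.0).all()  # 1*1 + 2*0 + 3*1 + 4*0
```

With fill_value=0 the missing lambdas simply drop out of the product, which matches the interpretation that those constraint multipliers are inactive.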
In this notebook I replicated the setting. (I had to install miniconda because fairlearn does not work with the base Python version of Colab.) I'm first training ExponentiatedGradient on a subsample of the data and then applying GridSearch over a bigger sample of the data, passing grid=expgrad.lambda_vecs_. I'm using datasets from folktables. This is the data loader I'm using. This is how to reproduce the error.
weights can be null due to subsampling & matrices not aligned