Conversation

@antoinebaker commented Oct 2, 2025

Hi @GaetandeCast

The blog post is neat and easy to follow, I think. Here are a few suggestions in the Python file.

For the overfitted model, I feel the narrative is "feature importance computed on an overfitted model is unreliable; however, it's good enough to identify irrelevant features and trim them down to get a good model".

Is it supported, in theory or in practice, that RFECV with PFI is good for feature selection?

@GaetandeCast (Owner) left a comment

For the overfitted model, I feel the narrative is "feature importance computed on an overfitted model is unreliable; however, it's good enough to identify irrelevant features and trim them down to get a good model".

Yes, I will try to make that clearer.

Is it supported, in theory or in practice, that RFECV with PFI is good for feature selection?

The minimal axiom of Reyero Lobo et al. supports it, in the sense that it makes sense to eliminate the features with zero permutation importance one by one. It does not cover the fact that RFECV can also remove features with non-zero importance, as long as they don't degrade the performance. In practice, however, this is fine, since the cross-validation should still lead to a better model.

I'll mention this in the part that justifies RFECV + permutation importance.
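
For reference, a minimal sketch of one way to plug permutation importance into `RFECV` (not necessarily what the post does): scikit-learn's `RFECV` reads `feature_importances_` (or `coef_`) from the fitted estimator, so a small wrapper can expose permutation importances computed on an internal validation split. The wrapper name and the synthetic data below are illustrative, not taken from the post.

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.feature_selection import RFECV
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


class PermutationImportanceRegressor(BaseEstimator, RegressorMixin):
    # Illustrative wrapper: fit a base regressor, then expose permutation
    # importances (computed on a held-out split) as feature_importances_,
    # which RFECV's default importance_getter picks up.
    def __init__(self, estimator, n_repeats=10, random_state=None):
        self.estimator = estimator
        self.n_repeats = n_repeats
        self.random_state = random_state

    def fit(self, X, y):
        X_fit, X_val, y_fit, y_val = train_test_split(
            X, y, random_state=self.random_state
        )
        self.estimator_ = clone(self.estimator).fit(X_fit, y_fit)
        result = permutation_importance(
            self.estimator_, X_val, y_val,
            n_repeats=self.n_repeats, random_state=self.random_state,
        )
        self.feature_importances_ = result.importances_mean
        return self

    def predict(self, X):
        return self.estimator_.predict(X)


# Synthetic data in the spirit of the post (an assumption): x2 is irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = X[:, 0] + (X[:, 0] + X[:, 1]) ** 2 + rng.normal(0, 0.1, size=600)

base = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
selector = RFECV(PermutationImportanceRegressor(base, random_state=0), step=1, cv=5)
selector.fit(X, y)
print("kept features:", selector.support_)  # x2 should typically be dropped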


linear_regressor = LassoCV(random_state=rng)
linear_regressor.fit(X_train, y_train)
# maybe a dataframe (feature | coef) would look better?
@GaetandeCast (Owner):

Yes, this looks nicer, for instance:

print("Coefficients of the linear model:")
print(
    pd.DataFrame(
        {f"x{idx}": f"{linear_regressor.coef_[idx]:.3f}" for idx in range(X.shape[1])},
        index=["Coefficient"],
    )
)

And for the second model:

print("Coefficients of the linear model:")
print(
    pd.DataFrame(
        {
            f"{feature_names[idx]}": f"{linear_regressor.coef_[idx]:.3f}"
            for idx in np.argsort(linear_regressor.coef_)[::-1]
        },
        index=["Coefficient"],
    )
)

@antoinebaker (Author):

Oh, I was thinking of something even simpler, like:

pd.options.display.precision = 3
pd.DataFrame({"feature": feature_names, "coef": linear_regressor.coef_})

@antoinebaker (Author):

If you want ordering by coef:

pd.DataFrame({"feature": feature_names, "coef": linear_regressor.coef_}).sort_values(by="coef")

# if $X_2$ has a low impact on the target or if the model is overfitting on it.
# interaction with $X_0$. We can now say that $X_1$ is important for the underlying process. Some features involving
# $X_2$ are receiving low but nonzero coefficients in the second model. In our synthetic case, we know
# that the target $Y = X_0 + (X_0+X_1)^2 + \text{noise}$ does not depend on $X_2$, so these small nonzero coefficients
@GaetandeCast (Owner):

I like this clarification.
I use `+ \mathcal{N}(0, \sigma^2)` later for the noise, so we should pick one to be consistent.
I think `+ \text{noise}` might be better, since we don't need to introduce `\sigma` in this case (which I did not do).
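
For concreteness, the two candidate notations, written out for the target quoted above (purely to illustrate the consistency point):

Y = X_0 + (X_0 + X_1)^2 + \text{noise}
\quad \text{versus} \quad
Y = X_0 + (X_0 + X_1)^2 + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)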

# Validation (`RFECV`) provides a good way to trim down irrelevant features.
# [Justify that permutation importance is sensible by citing Reyero Lobo et al.?]
# [Yes! Maybe explain that j irrelevant means X_j \perp Y | X_{-j}, and that PFI
# (in the optimal setting) is able to detect such irrelevant features]
@GaetandeCast (Owner):

Ok, I'll think of something.
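
One way to phrase the bracketed note above, stated roughly: feature j is irrelevant when

X_j \perp\!\!\!\perp Y \mid X_{-j},

and, for the Bayes-optimal predictor, the permutation importance of such a feature is zero, so eliminating the features whose permutation importance is (close to) zero is a sound step.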

)

# %% [markdown]
# [I feel the summary/recap is a bit dry, maybe give more details?]
@GaetandeCast (Owner):

Yeah, I'll improve it; it's kind of a placeholder for now.

@ogrisel commented Oct 3, 2025

Here is a pass of feedback:

  • Mis-specified => Misspecified
    https://en.wiktionary.org/wiki/misspecified

  • "Misspecified model" as a section header => "A first misspecified model" or "Misspecified models" or "Dealing with misspecification"

  • the empty model => the null model / constant predictor

  • rng.normal takes sigma (the standard deviation) as its second argument, not sigma**2.

  • Please split long code cells into subcells with independent outputs. For instance, for the first cell: one cell about the data generating process, then one cell to fit the lasso model and display the non-zero coefs, and one cell to list the feature names with zero coef.

  • Instead of printing a list comprehension, use a for loop with one print statement per iteration.

  • Make the feature names consistent in the print statements: "Feature 0" vs "x1", "x2"

  • Same comment w.r.t. the second cell: split it into separate cells, each with its own output, and interleave the analysis as you go instead of writing the analysis of the second cell before the code.

  • You should state that we can interpret the magnitude of the coefficients of the linear model as a relative importance measure because all the features have the same scale. Actually this is not true anymore once you use PolynomialFeatures: the cross-features do not have the same variance. So we might want to either insert a standard scaler after the PolynomialFeatures step, or multiply the coef values by the standard deviations of the features to get importance values (see the first sketch after this list).

  • Please join the feature names and (signed) feature importances (the scaled coefs) in a pandas dataframe and use horizontal bar plots instead of printing the values.

  • "and does not drop when we train on half the data, we know that the model is well specified"

    The fact that the score does not drop when training on half the data does not guarantee that we are well specified: instead it tells us that we are not overfitting, that is, we have trained on enough training data points. We can only believe that we are well specified if we have chosen an expressive enough model class (and hyperparameter set) given what we know about the structure of the data generating process.

  • Similar comments about the second code block: it's too long, and the conclusions should be interleaved between logically separated sub code cells.

  • [Justify that permutation importance is sensible by citing Reyero Lobo et al.?]

    Yes, please do so.

  • "This score does not drop significantly when we re-train on only half the data, indicating that the model is close to Bayes-optimal. "

    It does drop from 0.989 to 0.980 so it's a bit of a stretch to assert that the final model is "close" to Bayes optimal. Maybe we could be reach 0.999 if we doubled the training size again, or not (e.g. stay below 0.985 what ever the number of data points). We cannot say just from those values. Maybe you can try to tweak the training set size to see if we can get closer to Bayes optimal after feature selection while still being overfitting before feature selection?

  • Could you please merge the two bar plots (train and test PI) into one (with test in blue and train in orange, for instance) so that we can compare the relative sizes of the PIs? (See the second sketch after this list.)
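
A rough sketch of the coefficient-scaling and bar-plot suggestions above (multiplying each coefficient by the standard deviation of its feature, then plotting horizontally); the data-generating process and names below are illustrative, not taken from the post:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data (assumption): three standard-normal features, x2 irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
# Note: the second argument of rng.normal is the standard deviation, not the variance.
y = X[:, 0] + (X[:, 0] + X[:, 1]) ** 2 + rng.normal(0, 0.1, size=1000)

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
lasso = LassoCV(random_state=0).fit(X_poly, y)

# The raw coefficients are not comparable because the polynomial features have
# different variances; rescale each coefficient by its feature's standard deviation.
importances = pd.DataFrame(
    {
        "feature": poly.get_feature_names_out(["x0", "x1", "x2"]),
        "scaled_coef": lasso.coef_ * X_poly.std(axis=0),
    }
).sort_values(by="scaled_coef")

importances.plot.barh(x="feature", y="scaled_coef", legend=False)
plt.xlabel("coef * feature std (signed importance)")
plt.tight_layout()
plt.show()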
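
And a sketch of the merged train/test permutation-importance bar plot, reusing the same illustrative setup:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X[:, 0] + (X[:, 0] + X[:, 1]) ** 2 + rng.normal(0, 0.1, size=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(PolynomialFeatures(degree=2), LassoCV(random_state=0))
model.fit(X_train, y_train)

# Permutation importances on both splits, merged into one frame so that the
# train and test bars for each feature sit next to each other.
pi_train = permutation_importance(model, X_train, y_train, n_repeats=10, random_state=0)
pi_test = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
pi = pd.DataFrame(
    {"test": pi_test.importances_mean, "train": pi_train.importances_mean},
    index=["x0", "x1", "x2"],
)

pi.plot.barh(color=["tab:blue", "tab:orange"])  # test in blue, train in orange
plt.xlabel("mean decrease in R^2 when the feature is permuted")
plt.tight_layout()
plt.show()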

@antoinebaker (Author):

The fact that the score does not drop when training on half the data does not guarantee that we are well specified: instead it tells us that we are not overfitting, that is, we have trained on enough training data points. We can only believe that we are well specified if we have chosen an expressive enough model class (and hyperparameter set) given what we know about the structure of the data generating process.

Ah yes, I was also a bit confused by this "half training" argument :) I think in your case it is self-evident from the data generating process that the first linear model is misspecified and the second is well specified, so I would just remove that part.

If you want to claim that you are "close" to the Bayes-optimal model, maybe you can compare the MSE to the noise variance (MSE >= noise variance, with equality for the Bayes-optimal regressor).
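
A sketch of that check, reusing the fitted `model`, the held-out split and the noise standard deviation (0.1) from the sketches above:

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, model.predict(X_test))
print(f"test MSE = {mse:.4f}  vs  noise variance = {0.1 ** 2:.4f}")
# The Bayes-optimal regressor attains an MSE equal to the noise variance, so a
# test MSE close to sigma**2 supports the "close to Bayes-optimal" claim.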
