Replace legacy Draper–Smith stepwisefit with a MATLAB-compatible stepwisefit #345
base: main
Open for feedback and discussion.
In code commented lines should always use
Use the private
Thanks for the feedback, will update accordingly.
No numerical or behavioral changes intended.
At the very end of the file, add error tests for all input validation. It should be like See other functions to get an idea.
Added remaining input validation and corresponding tests.
This PR replaces the legacy stepwisefit implementation with a ground-up, well-tested implementation aimed at closer MATLAB parity and better diagnostics. The main user-visible changes and new capabilities are:
Core behavior
Full stepwise selection based on conditional p-values (forward-add then backward-remove cycle) using robust regress calls for candidate evaluation.
Selection performed on optionally standardized predictors (Scale = "on"), while final reported coefficients are on the original data scale.
Missing-value handling: rows with any NaN in X or y are ignored and recorded in stats.wasnan.
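The missing-value rule above can be sketched in a few lines. This is an illustrative Python sketch, not the Octave code in this PR; the helper name `drop_nan_rows` is hypothetical, and the returned mask only plays the role that stats.wasnan plays in the description.

```python
import math

def drop_nan_rows(X, y):
    """Drop rows where X or y contains NaN (hypothetical helper).

    Returns the cleaned X and y plus a boolean mask that marks the
    dropped rows, analogous to stats.wasnan in the PR description.
    """
    wasnan = [
        math.isnan(yi) or any(math.isnan(v) for v in row)
        for row, yi in zip(X, y)
    ]
    X_clean = [row for row, bad in zip(X, wasnan) if not bad]
    y_clean = [yi for yi, bad in zip(y, wasnan) if not bad]
    return X_clean, y_clean, wasnan

# Row 2 has a NaN predictor, row 3 has a NaN response.
X = [[1.0, 2.0], [float("nan"), 3.0], [4.0, 5.0], [6.0, 7.0]]
y = [1.0, 2.0, float("nan"), 4.0]
X_clean, y_clean, wasnan = drop_nan_rows(X, y)
```

The mask keeps the original row count, so callers can map diagnostics back to the input rows.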
Name–Value interface (MATLAB-like)
Supports the following Name–Value options (case-insensitive):
Validates inputs (types, lengths) and errors early on invalid options.
Diagnostics & outputs
Returns the full MATLAB-style output set:
b: p×1 coefficient vector (estimates for included predictors and conditional refit for excluded predictors).
se: standard errors (NaN where df insufficient).
pval: two-sided p-values.
stats: structure with detailed diagnostics:
nextstep: placeholder scalar (currently 0).
history: structure summarizing final model; includes in, df0, rmse, B (B currently a final-state snapshot).
covb: covariance for intercept + selected predictors; the rest of the (p+1)x(p+1) matrix is filled with NaN to preserve shape.
xr: residuals for predictors not in the final model (orthogonalized w.r.t. the final model), suitable for diagnostics.
Robustness & edge cases
Tests added
Rationale & design notes
The goal was practical MATLAB compatibility: match MATLAB’s observable behaviour and numeric contract for common usages (selection logic, outputs, statistical fields) rather than reproducing every internal implementation detail.
Selection uses conditional p-values computed from regress fits. Conditional refits for excluded predictors are performed so b, se, and pval match the documented MATLAB contract (estimates for excluded predictors come from fitting the final model plus that predictor).
Scale is applied only during selection to replicate MATLAB’s behavior: selected terms are those that would be chosen if columns were standardized, but final reported statistics are produced on the original scale.
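To illustrate the split between the selection scale and the reporting scale, here is a rough Python sketch, not the Octave code in this PR. It assumes that standardizing means z-scoring each column with the sample standard deviation; `standardize_columns` and `scale_on` are hypothetical names.

```python
import statistics

def standardize_columns(X):
    """Z-score each column of X (assumed meaning of Scale = "on").

    Columns with zero spread are mapped to zeros instead of dividing by 0.
    """
    n_cols = len(X[0])
    cols = [[row[j] for row in X] for j in range(n_cols)]
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [
        [(row[j] - means[j]) / sds[j] if sds[j] > 0 else 0.0
         for j in range(n_cols)]
        for row in X
    ]

X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scale_on = True

# Selection would look at the standardized copy ...
X_select = standardize_columns(X) if scale_on else X
# ... while the final fit and reported coefficients use the original X.
X_report = X
```

Keeping two separate matrices makes it hard to report coefficients on the wrong scale by accident.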
history is intentionally conservative in Phase 1: a final-state summary is provided and history.B stores last-step coefficients. This avoids producing large, possibly inconsistent arrays before we are confident of consistent per-step behavior between Octave and MATLAB.
What remains / future scope (recommended next phases)
I split remaining work into prioritized phases so reviewers and maintainers can assess risk and scope.
Phase 2
nextstep logic — compute the recommended next action (index of variable to add/remove) rather than returning 0. Tests should assert expected next-step values for synthetic multi-step cases.
Full per-step history — build history.in as an (nsteps × p) logical array and history.B as p×nsteps coefficients matrix (excluding intercept in MATLAB convention). This provides parity for users who inspect model evolution.
Positional argument support — accept legacy positional calls stepwisefit(X,y,penter,premove,method) and [] defaults in positional slots, to match older MATLAB usage patterns.
Display output formatting — optionally print step messages matching MATLAB when Display='on'.
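One possible shape for the nextstep logic is sketched below in Python for discussion. It is not the planned Octave implementation. It assumes a forward-then-backward rule, default thresholds of 0.05 to enter and 0.10 to remove, and a 1-based return index; the function name `next_step` is hypothetical.

```python
import math

def next_step(pvals, in_model, penter=0.05, premove=0.10):
    """Return the 1-based index of the variable to add or remove, or 0.

    pvals[j] is the conditional p-value of predictor j and in_model[j]
    says whether it is currently in the model. NaN p-values are skipped.
    """
    usable = [j for j, p in enumerate(pvals) if not math.isnan(p)]

    # Forward step: the excluded predictor with the smallest p-value.
    excluded = [j for j in usable if not in_model[j]]
    if excluded:
        best = min(excluded, key=lambda j: pvals[j])
        if pvals[best] < penter:
            return best + 1

    # Backward step: the included predictor with the largest p-value.
    included = [j for j in usable if in_model[j]]
    if included:
        worst = max(included, key=lambda j: pvals[j])
        if pvals[worst] > premove:
            return worst + 1

    return 0  # nothing to add or remove

add_first = next_step([0.01, 0.2, 0.03], [False, False, False])
remove_second = next_step([0.01, 0.2, 0.5], [True, True, False])
converged = next_step([0.01, 0.02, 0.5], [True, True, False])
```

Tests for synthetic multi-step cases could assert on values like the three computed above.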
Phase 3
Exact tie-breaking determinism — ensure deterministic choice when candidate p-values are numerically equal; add tests to enforce parity with MATLAB in deterministic examples.
Single-precision input support — maintain input type semantics when single is provided.
Small performance improvements — avoid redundant regress calls where possible; microbenchmarks and targeted caching.
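For the tie-breaking item, one option is to treat p-values within a small relative tolerance as equal and let the lowest index win. The Python sketch below shows that idea. The tolerance, the lowest-index rule, and the name `pick_min_pvalue` are all assumptions for discussion, not MATLAB's documented behavior.

```python
import math

def pick_min_pvalue(pvals, candidates, rel_tol=1e-12):
    """Return the 0-based candidate index with the smallest p-value.

    P-values that agree within rel_tol count as tied, and the lowest
    index wins. Returns None if every candidate p-value is NaN.
    """
    best = None
    for j in sorted(candidates):
        p = pvals[j]
        if math.isnan(p):
            continue
        if best is None:
            best = j
        elif p < pvals[best] and not math.isclose(
            p, pvals[best], rel_tol=rel_tol, abs_tol=0.0
        ):
            best = j  # strictly better, beyond rounding noise
    return best

# Index 2 is smaller than index 1 only by floating-point noise.
pvals = [0.03, 0.01, 0.01 - 1e-16, 0.5]
naive = min(range(len(pvals)), key=lambda j: pvals[j])
stable = pick_min_pvalue(pvals, range(len(pvals)))
```

Here the naive minimum follows the rounding noise while the tolerant version does not, which is the kind of difference a parity test would need to pin down.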
Backwards compatibility & risk
The public function signature remains stepwisefit(X, y, varargin). This is a breaking change only for older code that relied on the previous, less featureful Octave implementation. The new implementation is intentionally stricter on input validation to avoid silent misbehavior.
Tests and early runtime checks are included to reduce the risk of regressions elsewhere in the package.
Performance: the new implementation performs many regress calls during selection. For very high-dimensional datasets this may be slower than the legacy implementation — performance profiling and optimizations will be pursued in follow-up PR(s) if necessary.