Skip to content

Conversation

@Pasta-coder
Copy link
Contributor

@Pasta-coder Pasta-coder commented Dec 19, 2025

This PR replaces the legacy stepwisefit implementation with a ground-up, well-tested implementation aimed at closer MATLAB parity and better diagnostics. The main user-visible changes and new capabilities are:

Core behavior

Full stepwise selection based on conditional p-values (forward-add then backward-remove cycle) using robust regress calls for candidate evaluation.

Selection performed on optionally standardized predictors (Scale = "on"), while final reported coefficients are on the original data scale.

Missing-value handling: rows with any NaN in X or y are ignored and recorded in stats.wasnan.

Name–Value interface (MATLAB-like)

Supports the following Name–Value options (case-insensitive):

  • "InModel" — logical vector specifying initial model.
  • "Keep" — logical vector of predictors that must remain in/out.
  • "PEnter" — scalar threshold for entering variables (default 0.05).
  • "PRemove" — scalar threshold for removing variables (default max(PEnter, 0.1)).
  • "Scale" — "on" / "off" (selection only).
  • "MaxIter" — positive integer, max number of stepwise iterations.
  • "Display" — "on" / "off" (accepted for interface compatibility).

Validates inputs (types, lengths) and errors early on invalid options.

Diagnostics & outputs

Returns the full MATLAB-style output set:

  • b — p×1 coefficient vector (estimates for included predictors and conditional refit for excluded predictors).
  • se — standard errors (NaN where df insufficient).
  • pval — two-sided p-values.
  • `finalmodel — logical row vector indicating selected predictors.
  • stats — structure with detailed diagnostics:
  • `source, df0, dfe, SStotal, SSresid, fstat, pval, rmse, xr, yr, B, SE, TSTAT, PVAL, covb, intercept, wasnan.
  • nextstep — placeholder scalar (currently 0).
  • history — structure summarizing final model; includes in, df0, rmse, B (B currently a final-state snapshot).
  • Computes covb (covariance) for intercept + selected predictors and fills a (p+1)x(p+1) matrix with NaN elsewhere to preserve shape.
  • Computes xr — residuals for predictors not in the final model (orthogonalized w.r.t. the final model), suitable for diagnostics.

Robustness & edge cases

  • Guards regress calls and degenerate fits (singular designs) to avoid crashes.
  • Handles small degrees-of-freedom (df <= 0) by returning NaNs for SE/p-values rather than throwing.
  • Validates Keep and InModel vector lengths and types.

Tests added

  • Comprehensive BISTs covering:
  • MATLAB-parity numeric examples (hald and synthetic inline datasets).
  • Input validation tests (invalid Name–Value, mismatched lengths, Keep behavior).
  • Shape and content checks for stats fields, covb, xr orthogonality, RMSE/F-statistic consistency.
  • Scale on/off contract test (shape-preserving).
  • Intercept-only behavior, small-n robustness, and general shape/contract tests for outputs.

Rationale & design notes

The goal was practical MATLAB compatibility: match MATLAB’s observable behaviour and numeric contract for common usages (selection logic, outputs, statistical fields) rather than reproducing every internal implementation detail.

Selection uses conditional p-values computed from regress fits. Conditional refits for excluded predictors are performed so b, se, and pval match the documented MATLAB contract (estimates for excluded predictors come from fitting the final model plus that predictor).

Scale is applied only during selection to replicate MATLAB’s behavior: selected terms are those that would be chosen if columns were standardized, but final reported statistics are produced on the original scale.

history is intentionally conservative in Phase 1: a final-state summary is provided and history.B stores last-step coefficients. This avoids producing large, possibly inconsistent arrays before we are confident of consistent per-step behavior between Octave and MATLAB.

What remains / future scope (recommended next phases)

I split remaining work into prioritized phases so reviewers and maintainers can assess risk and scope.

Phase 2

nextstep logic — compute the recommended next action (index of variable to add/remove) rather than returning 0. Tests should assert expected next-step values for synthetic multi-step cases.

Full per-step history — build history.in as an (nsteps × p) logical array and history.B as p×nsteps coefficients matrix (excluding intercept in MATLAB convention). This provides parity for users who inspect model evolution.

Positional argument support — accept legacy positional calls stepwisefit(X,y,penter,premove,method) and [] defaults in positional slots, to match older MATLAB usage patterns.

Display output formatting — optionally print step messages matching MATLAB when Display='on'.

Phase 3

Exact tie-breaking determinism — ensure deterministic choice when candidate p-values are numerically equal; add tests to enforce parity with MATLAB in deterministic examples.

Single-precision input support — maintain input type semantics when single is provided.

Small performance improvements — avoid redundant regress calls where possible; microbenchmarks and targeted caching.

Backwards compatibility & risk

The public function signature remains stepwisefit(X, y, varargin). This is a breaking change only with regards to older code that relied on the previous, less featureful Octave implementation. The new implementation is intentionally stricter on input validation to avoid silent misbehavior.

Tests and early runtime checks are included to reduce the risk of regressions elsewhere in the package.

Performance: the new implementation performs many regress calls during selection. For very high-dimensional datasets this may be slower than the legacy implementation — performance profiling and optimizations will be pursued in follow-up PR(s) if necessary.

@Pasta-coder Pasta-coder changed the title Stepwisefit new Replace legacy Draper–Smith stepwisefit with a MATLAB-compatible stepwisefit Dec 21, 2025
@Pasta-coder
Copy link
Contributor Author

open for feedbacks and discussions

@Pasta-coder Pasta-coder marked this pull request as ready for review December 21, 2025 22:24
@pr0m1th3as
Copy link
Member

In code commented lines should always use ## instead of %.
Do not use try ... catch blocks in BISTs.

@pr0m1th3as
Copy link
Member

Use the private pairedArgs function to parse paired arguments and declare their default values. You can study a working example of its use in the latest grpstats function starting at line 240.

@Pasta-coder
Copy link
Contributor Author

Thanks for feedback , will update accordingly.

@Pasta-coder
Copy link
Contributor Author

  • Replace manual Name–Value parsing with pairedArgs and centralized defaults
  • Convert all code comments to Octave-style '##'
  • Validate Scale and Display options explicitly
  • Remove try/catch usage from BISTs and use fail() for error tests
  • Minor style and consistency cleanups

No numerical or behavioral changes intended.
All tests pass.

@pr0m1th3as
Copy link
Member

At the very end of the file, add error tests for all input validation. It should be like

%!error <error message> stepwisefit (syntax that causes the specific error being tested)

See other functions to get an idea.

@Pasta-coder
Copy link
Contributor Author

added remaining input validation and corresponding tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants