
Feature/demean accelerated#995

Open
schroedk wants to merge 25 commits into py-econometrics:master from schroedk:feature/demean-accelerated

Conversation

@schroedk
Collaborator

@schroedk schroedk commented Aug 21, 2025

This PR implements a port of the Irons-Tuck acceleration implementation from the fixest package.

It includes:

  • a new package demean_accelerated
  • a new Rust implementation of detect_singletons
  • speed optimizations for feols

Closes #357
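For readers unfamiliar with the method: Irons-Tuck accelerates a fixed-point iteration x ↦ T(x) by extrapolating from two successive applications of the map. A minimal Python sketch of the core step (names and signature are illustrative; the PR's Rust code applies the idea to FE coefficient vectors rather than a generic map):

```python
import numpy as np

def irons_tuck_step(X, fixed_point_map):
    """One Irons-Tuck acceleration step for the fixed-point map T.

    Computes GX = T(X) and GGX = T(GX), then extrapolates along the
    direction of the first difference, scaled by how much the map is
    still changing (second difference). Illustrative sketch only.
    """
    GX = fixed_point_map(X)
    GGX = fixed_point_map(GX)
    delta_GX = GGX - GX              # first difference
    delta2_X = delta_GX - (GX - X)   # second difference: GGX - 2*GX + X
    ssq = delta2_X @ delta2_X
    if ssq == 0.0:                   # already at a fixed point
        return GGX
    coef = (delta2_X @ delta_GX) / ssq
    return GGX - coef * delta_GX
```

For a one-dimensional affine map the step lands on the fixed point exactly; e.g. with T(x) = 0.5x + 1 (fixed point 2.0), one accelerated step from x = 0 returns 2.0.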

@codecov

codecov bot commented Aug 21, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

| Flag | Coverage Δ |
|------|------------|
| core-tests | 75.11% <100.00%> (+0.02%) ⬆️ |
| tests-extended | ? |
| tests-vs-r | 17.59% <31.03%> (+0.03%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|--------------------------|------------|
| pyfixest/core/__init__.py | 100.00% <100.00%> (ø) |
| pyfixest/core/demean.py | 100.00% <100.00%> (ø) |
| pyfixest/core/detect_singletons.py | 100.00% <100.00%> (ø) |
| pyfixest/estimation/__init__.py | 100.00% <100.00%> (ø) |
| pyfixest/estimation/backends.py | 72.41% <100.00%> (ø) |
| pyfixest/estimation/demean_.py | 55.64% <100.00%> (ø) |
| pyfixest/estimation/feols_.py | 86.77% <ø> (-4.61%) ⬇️ |
| pyfixest/estimation/model_matrix_fixest_.py | 90.95% <100.00%> (-0.55%) ⬇️ |

... and 7 files with indirect coverage changes


@s3alfisc
Member

s3alfisc commented Oct 22, 2025

@schroedk just rebased / merged changes from master in here (new features, plus I moved to using a pixi.toml, plus the dev env no longer installs on Windows due to compatibility challenges).

As before, you can build the Rust bindings by typing

pixi r -e dev maturin-develop

and run the benchmarks via

pixi run -e dev pytest tests/test_demean.py::test_demean_complex_fixed_effects

@s3alfisc
Member

s3alfisc commented Dec 2, 2025

Benchmarks of the accelerated vs regular rust vs fixest and FixedEffectsModels.jl. Looks like good progress to me!

[benchmark plot]

schroedk force-pushed the feature/demean-accelerated branch from dbf6e71 to 1e633c1 on December 12, 2025 11:51
@schroedk
Collaborator Author

> Benchmarks of the accelerated vs regular rust vs fixest and FixedEffectsModels.jl. Looks like good progress to me!
>
> [benchmark plot]

@s3alfisc do you have the code to run this benchmark?

@s3alfisc
Member

s3alfisc commented Dec 15, 2025

Yes, it's here: https://github.com/s3alfisc/fixest_benchmarks

Hope I documented the setup well, but I think cloning it plus the `just` task runner should get you started.

It's the OLS benchmarks for the hard problem that are relevant.

Note: @grantmcdermott mentioned the other day there might be a minor issue with the benchmarks (though I don't know what exactly) so best to take it with a grain of salt (though I couldn't spot it, everything looked ok to me).

@s3alfisc
Member

Wait, it looks like I didn't push my local changes, including the just setup. One sec

@s3alfisc
Member

s3alfisc commented Dec 15, 2025

It's in the justfile branch on the remote 😅

https://github.com/s3alfisc/fixest_benchmarks

Requirements: global R and Julia installations, plus Just.

Then type

just setup

to install all package deps as well as Python (in a local env).

Then `just bench-ols` for benchmarks.

@s3alfisc
Member

s3alfisc commented Dec 27, 2025

Is this confirmed? 👀 😄 🚀

[benchmark screenshot]

@s3alfisc
Member

Re: the failing tests atm - seems to be the prediction method, which is notoriously fickle and should be fixable by changing the tolerance or just checking a different subset of the prediction array:

x1 = array([23.03009222,  7.43960451, 14.04928582, 17.83110832, 17.46495242])
x2 = array([23.03009052,  7.43960323, 14.04928428, 17.83110934, 17.4649537 ])
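For concreteness, the two arrays above differ by roughly 2e-7 in relative terms, which sits right at numpy's default rtol of 1e-7; loosening the tolerance slightly makes the comparison robust (a sketch of the suggested fix, not the actual test code):

```python
import numpy as np

# Values from the failing test above.
x1 = np.array([23.03009222, 7.43960451, 14.04928582, 17.83110832, 17.46495242])
x2 = np.array([23.03009052, 7.43960323, 14.04928428, 17.83110934, 17.4649537])

# Largest relative difference is ~2e-7, borderline at numpy's default
# rtol=1e-7; rtol=1e-6 passes comfortably.
np.testing.assert_allclose(x1, x2, rtol=1e-6)
```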

schroedk force-pushed the feature/demean-accelerated branch from dba72f4 to a2e68a2 on January 3, 2026 14:49
@schroedk
Collaborator Author

schroedk commented Jan 3, 2026

@s3alfisc Looks like it is on par now on the benchmark
[plot_ols benchmark plot]

schroedk force-pushed the feature/demean-accelerated branch from 9edc99e to 839fcaa on January 3, 2026 16:16
@s3alfisc
Member

s3alfisc commented Jan 3, 2026

> @s3alfisc Looks like it is on par now on the benchmark

So cool! Super awesome!!!

@s3alfisc
Member

s3alfisc commented Jan 3, 2026

Let me know once you think it is time for a review =) (I will start anyways as it will take me some time to study the code 😄 )

schroedk marked this pull request as ready for review on January 4, 2026 12:20
@schroedk
Collaborator Author

schroedk commented Jan 4, 2026

@s3alfisc the PR is ready for review. My suggestions/questions would be:

  1. Remove the added benchmark code (no diff to .gitignore, benchmarks)
  2. I added a rust version of detect_singletons and wired it into feols for speed -> what to do with the numba implementation?
  3. Rust backend now resolves to the accelerated version -> what to do with the original demean.rs?
  4. Too many commits, after review I would squash merge

@s3alfisc
Member

s3alfisc commented Jan 4, 2026

Awesome, I will start my review then! I will try to be fast, but also have a big PR from @leostimpfle to review thoroughly + all the Rust code will be a bit of a challenge for me 😄

On your questions, I'd say:

  1. Yes, let's delete the benchmark code? I will push some changes to the fixest_benchmark repo that I have been working on going forward, we can use it to benchmark the routines?
  2. This is super awesome! For the numba implementation - I guess I would keep it for now? I haven't fully decided what to do with the numba code once we make the switch and make the rust backend the default - option 1 is to allow users to choose a numba backend, option 2 would be to delete all numba code. Do you happen to have an opinion on this?
  3. I would probably delete it? Or is there any use in keeping it for testing / as a reference?
  4. Yes, squash commit makes sense to me 👍

Implement fixest's Irons-Tuck-Grand acceleration algorithm for high-dimensional
fixed effects demeaning in Rust. This is a coefficient-space iterative method
that provides significant speedups over naive alternating projections.

Key features:
- Irons-Tuck acceleration with grand acceleration steps
- Support for 2-FE and 3+ FE cases with optimized projectors
- Algorithm aligned with R fixest implementation
- Auto-vectorized loops (no explicit SIMD dependencies)

Reference: https://github.com/lrberge/fixest (CCC_demean.cpp)
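The base operation that this acceleration wraps is a single alternating-projection sweep: subtracting group means for one fixed effect. A hedged NumPy sketch of that step (`project_out_fe` and its signature are illustrative, not the PR's API):

```python
import numpy as np

def project_out_fe(x, codes, n_groups):
    """Subtract the group mean of `x` within each level of one fixed effect.

    Naive alternating projections cycles this over all FEs until
    convergence; the accelerated algorithm extrapolates between such
    sweeps in coefficient space. Illustrative sketch only.
    """
    sums = np.bincount(codes, weights=x, minlength=n_groups)   # per-group sums
    counts = np.bincount(codes, minlength=n_groups)            # per-group sizes
    return x - (sums / counts)[codes]                          # demean in place
```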

Performance improvements to the accelerated demeaning implementation:
- Optimize memory layout and share FEInfo across columns
- Add SSR (sum of squared residuals) stopping criterion for 2-FE
- Loop unrolling for 3-FE projection hot paths
- Align tolerance default with fixest (1e-6 instead of 1e-8)

Restructure the Rust demeaning code for clarity and maintainability:
- Introduce Projector trait for FE-specific projection strategies
- Introduce Demeaner trait for high-level solver strategies
- Unified DemeanBuffers struct for scratch space management
- Replace unsafe pointer code with safe iterator-based implementations
- Move related functions into appropriate impl blocks

Eliminate Python/numba overhead in the estimation pipeline:
- Implement detect_singletons in Rust to avoid numba JIT compilation
- Add Python wrapper maintaining API compatibility
- Optimize factorize() using pd.factorize instead of category conversion
- Replace slow df.isin() with np.isinf() for infinite value detection
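A pure-NumPy sketch of the singleton-detection idea this commit moves to Rust: an observation is a singleton if it is the only member of its group in some fixed effect, and dropping it can create new singletons in other FEs, so the search iterates to a fixed point. Names and signature are illustrative, not the pyfixest API:

```python
import numpy as np

def detect_singletons(ids):
    """Flag observations that are singletons in any fixed effect.

    `ids` is an (n_obs, n_fe) integer array of group codes. Returns a
    boolean mask that is True for dropped (singleton) observations.
    Illustrative sketch; the PR implements this in Rust for speed.
    """
    n_obs, n_fe = ids.shape
    keep = np.ones(n_obs, dtype=bool)
    changed = True
    while changed:            # iterate: drops can cascade across FEs
        changed = False
        for j in range(n_fe):
            # Group sizes among currently kept observations.
            counts = np.bincount(ids[keep, j], minlength=ids[:, j].max() + 1)
            is_singleton = keep & (counts[ids[:, j]] == 1)
            if is_singleton.any():
                keep[is_singleton] = False
                changed = True
    return ~keep
```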

Testing and code quality improvements:
- Add edge case tests for demean_accelerated
- Implement buffer reuse via for_each_init pattern
- Extract MultiFEBuffers struct for better readability
- Refactor Demeaner trait to own context and config references

Remove unnecessary abstractions after experimentation phase:
- Remove Accelerator trait in favor of direct IronsTuckGrand impl
- Move config into IronsTuckGrand struct
- Consolidate ConvergenceState and related types
- Update to PyO3 0.26 API (allow_threads -> detach)

Connect the new demean_accelerated module to Python and polish:
- Wire rust backend to use demean_accelerated instead of simple demean
- Fix MultiFE early convergence bug in 3+ FE demeaning
- Rename scatter/gather to apply_design_matrix for clarity
- Avoid per-column copy for Fortran-ordered input arrays
- Add type cast guard and #[inline(always)] on hot methods
schroedk force-pushed the feature/demean-accelerated branch from 65a4cd2 to c3ca143 on January 4, 2026 22:39
@schroedk
Collaborator Author

schroedk commented Jan 4, 2026

> Awesome, I will start my review then! I will try to be fast, but also have a big PR from @leostimpfle to review thoroughly + all the Rust code will be a bit of a challenge for me 😄
>
> On your questions, I'd say:
>
>   1. Yes, let's delete the benchmark code? I will push some changes to the fixest_benchmark repo that I have been working on going forward, we can use it to benchmark the routines?
>   2. This is super awesome! For the numba implementation - I guess I would keep it for now? I haven't fully decided what to do with the numba code once we make the switch and make the rust backend the default - option 1 is to allow users to choose a numba backend, option 2 would be to delete all numba code. Do you happen to have an opinion on this?
>   3. I would probably delete it? Or is there any use in keeping it for testing / as a reference?
>   4. Yes, squash commit makes sense to me 👍
  1. deleted
  2. I will keep it for now, feols already uses the rust version. There is no other usage (besides tests) of the numba code, so one can get rid of it any time and re-export the rust-backed version for backwards compatibility. I cannot think of a reason to keep the numba backend in the middle/long run, so I would go with Option 2.
  3. I deleted it and rewired the python bindings to the accelerated version (commit 6c091d7) -> no breaking change in the python API
  4. I created a smaller, more meaningful set of commits, so you could also step through the commits for the review. If you like, we can squash to one commit for merge.

schroedk force-pushed the feature/demean-accelerated branch from 6c091d7 to e20668e on January 4, 2026 23:55
Replace the simple alternating projections implementation with the
accelerated Irons-Tuck algorithm as the sole Rust demean backend.

Changes:
- Remove src/demean.rs (old simple implementation)
- Update demean.py to call _demean_accelerated_rs
- Remove demean_accelerated.py (was only needed during development)
- Update backends.py and demean_.py imports
- Clean up tests to remove redundant fixtures

The public Python API is unchanged - users calling demean() or using
the "rust" backend get the accelerated implementation transparently.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
schroedk force-pushed the feature/demean-accelerated branch from e20668e to 81e2aa1 on January 4, 2026 23:56
Reorder fixed effects by number of groups (largest first) to match
fixest's default `fixef.reorder = TRUE` behavior. This improves
convergence for 3+ FE cases by making the 2-FE sub-convergence
phase work on the largest FEs first.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
schroedk force-pushed the feature/demean-accelerated branch from 32a52d1 to 420d7fc on January 5, 2026 10:51
schroedk and others added 3 commits January 5, 2026 12:36
- Add coef_sums_buffer to SingleFEDemeaner, TwoFEDemeaner, and MultiFEBuffers
- Change apply_design_matrix_t to write to caller-provided buffer
- Remove unnecessary in_out_2fe.to_vec() copy in MultiFEDemeaner
- Rename in_out to coef_sums/coef_sums_buffer for clarity

This eliminates per-column allocations: 1 for 2FE, 4 for 3+FE cases.
Benchmarks show 4-12% improvement for medium-sized datasets (100K obs).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Unroll the accumulate_fe_contributions loop 4x to enable better
instruction-level parallelism. This produces paired loads (ldp)
and reduces loop overhead, providing ~7% speedup on large 3FE
demeaning workloads.

Also refactor compute_ssr to reuse the optimized accumulate method.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@s3alfisc
Member

s3alfisc commented Jan 6, 2026

If I understand it correctly, the algorithm is set up to operate in "coef space" - this means that we have estimated fixed effects at the end of the demeaning routine. Would it be possible to return these fixed effects? In fixed effects models, we are usually not interested in them, but we might need them to conduct predictions, which in turn is needed to compute marginal effects quickly. This is more or less how the R margins package does it, and Stata's, R's, and Python's marginal effects tooling too. From the margins paper:

[screenshots from the margins paper]

In pyfixest, we currently get the estimated fixed effects post hoc by solving a sparse system via LSMR (see here), which I think we could skip now?
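To illustrate the point about predictions: once demeaning returns per-group FE coefficients, fitted values are just Xb plus each observation's FE coefficients, with no sparse solve needed. A hypothetical sketch (`predict_with_fe` and its signature are made up, not pyfixest's API):

```python
import numpy as np

def predict_with_fe(X, beta, fe_codes, fe_coefs):
    """Prediction as X @ beta plus the sum of estimated fixed effects.

    fe_codes: (n_obs, n_fe) group codes; fe_coefs: one per-group
    coefficient array per fixed effect, as a demeaning routine
    operating in coefficient space could return them.
    """
    yhat = X @ beta
    for j, coefs in enumerate(fe_coefs):
        yhat += coefs[fe_codes[:, j]]   # add each obs's FE coefficient
    return yhat
```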

@schroedk
Collaborator Author

schroedk commented Jan 8, 2026

@s3alfisc latest benchmark
With numba:
[plot_ols benchmark plot]

Without numba:
[plot_ols benchmark plot]

@s3alfisc
Member

s3alfisc commented Jan 8, 2026

This looks awesome! Should we maybe update the benchmarks in the readme?

schroedk and others added 7 commits January 9, 2026 15:37
Fixed effects are now always sorted by number of groups (largest first),
matching fixest's default behavior. This simplifies the API and ensures
optimal convergence properties.

Changes:
- Remove `reorder_fe` field from `FixestConfig`
- Remove `with_reorder` method from `FixedEffectsIndex`
- Remove `with_config` method from `DemeanContext`
- Simplify `FixedEffectsIndex::new()` to always reorder

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Replace `convergence_len() -> usize` with `convergence_range() -> Range<usize>`
in the Projector trait. This makes the accelerator fully generic over any
Projector implementation, not just FE-specific ones that check a prefix.

The accelerator extracts (start, end) from the range to avoid cloning overhead.

Following fixest's approach, FE projectors exclude the last FE (smallest after
reordering) from convergence checking. At a fixed point, if (n_fe - 1) FEs
have converged, the remaining one must also have converged.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add original_to_reordered mapping to FixedEffectsIndex for tracking
  how FEs are reordered internally (by size for optimal convergence)
- Add fe_coefficients field to DemeanResult
- Add reorder_coefficients_to_original() method to restore coefficients
  to the user's original FE order
- Add total_coef buffer to MultiFEBuffers for accumulating coefficients
  across all demeaning phases (warmup, two_fe_convergence, reacceleration)
- Update all demeaners to populate and return FE coefficients

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Test cases:
- Single FE coefficient correctness
- Two FE coefficient correctness
- Three FE coefficient correctness (random order)
- Coefficient ordering preservation (verifies coefficients match
  original FE order, not internal reordered order)
- Weighted demeaning with coefficient extraction

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Rust changes:
- DemeanContext now has weights: Option<ObservationWeights>
- When None, uses group_counts for denominators (no per-obs multiplication)
- _demean_rs binding takes weights=None by default

Python changes:
- demean() wrapper detects uniform weights (all equal) via np.allclose
- Passes None to Rust when weights are uniform, enabling fast path
- Public API unchanged (weights parameter still required)

This saves memory (no per-obs weight storage) and computation
(no weight multiplication in scatter operations) for unweighted regression.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
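The uniform-weights detection this commit describes could look roughly like this on the Python side (a hedged sketch; `normalize_weights` is an illustrative name, not the actual wrapper):

```python
import numpy as np

def normalize_weights(weights):
    """Return None when all weights are (numerically) equal, enabling the
    unweighted fast path in the Rust backend; otherwise return the
    weights unchanged. Constant weights cancel in the weighted group
    means, so the unweighted path gives identical results.
    """
    if weights is None or np.allclose(weights, weights.flat[0]):
        return None
    return weights
```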
@s3alfisc
Member

Just a thought - is it currently possible for users to turn off the "reorder fixed effects" solution via the API? I think it could be useful to allow this for cases where users know which fixed effects are most complicated, which might not necessarily be identical to the "largest" fixed effects.

Example:

We have three fixed effects: nba_player, district, city. NBA player is just an identifier of a player, while district and city tell us about where a game is played. Because of schedules, players play games in different cities and districts and hence there is little correlation. But of course there are many more players than cities / districts.

As cities are nested within districts, the fixed effects will be super correlated. In this case, it might make sense to work on district & city in the two way part of the three part strategy for 3+ FE even if the cardinality of the FE is lower?

@schroedk
Collaborator Author

> Just a thought - is it currently possible for users to turn off the "reorder fixed effects" solution via the API? I think it could be useful to allow this for cases where users know which fixed effects are most complicated, which might not necessarily be identical to the "largest" fixed effects.
>
> Example:
>
> We have three fixed effects: nba_player, district, city. NBA player is just an identifier of a player, while district and city tell us about where a game is played. Because of schedules, players play games in different cities and districts and hence there is little correlation. But of course there are many more players than cities / districts.
>
> As cities are nested within districts, the fixed effects will be super correlated. In this case, it might make sense to work on district & city in the two way part of the three part strategy for 3+ FE even if the cardinality of the FE is lower?

It is not, but easy to change on the rust side. Actually it is just reverting 35ba573, as I had it before but decided to keep it simpler. But I did not consider your very valid point. The bigger problem is: how to add it to the python part?

@s3alfisc
Member

> The bigger problem is, how to add it to the python part?

Fixest has a Boolean argument "reorder_coef" - I think we could just add it to the python interface and then pass it to the Rust code? For the non-accelerated algos, it would just not do anything. Maybe I am missing something?

Remove manual 4x loop unrolling from compute_ssr methods in
TwoFEProjector and MultiFEProjector. LLVM auto-vectorizes simple
loops effectively, making manual unrolling unnecessary complexity.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@schroedk
Collaborator Author

schroedk commented Jan 12, 2026

> The bigger problem is, how to add it to the python part?

> Fixest has a Boolean argument "reorder_coef" - I think we could just add it to the python interface and then pass it to the Rust code? For the non-accelerated algos, it would just not do anything. Maybe I am missing something?

@s3alfisc that would be one possibility, but then you have to also expose it to calling functions like feols. In this case the parameter is confusing, because it is only relevant for the rust backend.

What about this: I make it configurable on the rust side and we handle the python side in a separate issue? In that case, here we only have to talk about the default:

  • true: sorts every time; the average user would profit, but a "power user" sorting the FEs themselves would not be able to override it
  • false: on average, the convergence might be slower due to the missing sorting (no idea how much); we would have to document that manually sorting the hard FE to the front can increase convergence speed. But again, this only holds true for the rust backend.

Edit: I would vote for false (not sorting by default), it is more explicit (explicitness has a higher value for me than pure maximal performance)

Previously, fixed effects were always reordered by size (largest first)
during demeaning. This adds a `reorder_fe` boolean parameter that allows
users to control this behavior. Default is `false` (no reordering).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
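The behavior this commit makes optional can be sketched as follows (illustrative names only; the actual logic lives in the Rust `FixedEffectsIndex`):

```python
import numpy as np

def reorder_fes(fe_codes, reorder_fe=True):
    """Optionally sort fixed effects by number of groups, largest first,
    mirroring fixest's reordering default. Returns the permuted columns
    plus the permutation so results can be mapped back to the user's
    original FE order. Hypothetical sketch, not the PR's API.
    """
    if not reorder_fe:
        order = np.arange(fe_codes.shape[1])          # keep user order
    else:
        n_groups = [len(np.unique(fe_codes[:, j]))
                    for j in range(fe_codes.shape[1])]
        order = np.argsort(n_groups)[::-1]            # largest cardinality first
    return fe_codes[:, order], order
```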


Development

Successfully merging this pull request may close these issues.

Demean: Implement IT Acceleration for demeaning algo

2 participants