
Feature/demean accelerated#995

Open
schroedk wants to merge 25 commits into py-econometrics:master from schroedk:feature/demean-accelerated

Conversation

@schroedk
Collaborator

@schroedk schroedk commented Aug 21, 2025

This PR implements a port of the Irons-Tuck acceleration implementation from the fixest package.

It includes:

  • a new package demean_accelerated
  • a new Rust implementation of detect_singletons
  • speed optimizations for feols

Closes #357
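For readers unfamiliar with the method: Irons-Tuck accelerates a fixed-point iteration x ↦ T(x) by extrapolating from two successive applications of the map. A minimal Python sketch of the core step (names and signature are illustrative; the PR's Rust code applies the idea to FE coefficient vectors rather than a generic map):

```python
import numpy as np

def irons_tuck_step(X, fixed_point_map):
    """One Irons-Tuck acceleration step for the fixed-point map T.

    Computes GX = T(X) and GGX = T(GX), then extrapolates along the
    direction of the first difference, scaled by how much the map is
    still changing (second difference). Illustrative sketch only.
    """
    GX = fixed_point_map(X)
    GGX = fixed_point_map(GX)
    delta_GX = GGX - GX              # first difference
    delta2_X = delta_GX - (GX - X)   # second difference: GGX - 2*GX + X
    ssq = delta2_X @ delta2_X
    if ssq == 0.0:                   # already at a fixed point
        return GGX
    coef = (delta2_X @ delta_GX) / ssq
    return GGX - coef * delta_GX
```

For a one-dimensional affine map the step lands on the fixed point exactly; e.g. with T(x) = 0.5x + 1 (fixed point 2.0), one accelerated step from x = 0 returns 2.0.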

@codecov

codecov bot commented Aug 21, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

| Flag | Coverage Δ |
|------|------------|
| core-tests | 75.11% <100.00%> (+0.02%) ⬆️ |
| tests-extended | ? |
| tests-vs-r | 17.59% <31.03%> (+0.03%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|--------------------------|------------|
| pyfixest/core/__init__.py | 100.00% <100.00%> (ø) |
| pyfixest/core/demean.py | 100.00% <100.00%> (ø) |
| pyfixest/core/detect_singletons.py | 100.00% <100.00%> (ø) |
| pyfixest/estimation/__init__.py | 100.00% <100.00%> (ø) |
| pyfixest/estimation/backends.py | 72.41% <100.00%> (ø) |
| pyfixest/estimation/demean_.py | 55.64% <100.00%> (ø) |
| pyfixest/estimation/feols_.py | 86.77% <ø> (-4.61%) ⬇️ |
| pyfixest/estimation/model_matrix_fixest_.py | 90.95% <100.00%> (-0.55%) ⬇️ |

... and 7 files with indirect coverage changes


@s3alfisc
Member

s3alfisc commented Oct 22, 2025

@schroedk just rebased / merged changes from master in here (new features, plus I moved to using a pixi.toml, plus the dev env no longer installs on Windows due to compatibility challenges).

As before, you can build the Rust bindings by typing

pixi r -e dev maturin-develop

and run the benchmarks via

pixi run -e dev pytest tests/test_demean.py::test_demean_complex_fixed_effects

@s3alfisc
Member

s3alfisc commented Dec 2, 2025

Benchmarks of the accelerated vs regular rust vs fixest and FixedEffectsModels.jl. Looks like good progress to me!

[benchmark plot]

schroedk force-pushed the feature/demean-accelerated branch from dbf6e71 to 1e633c1 on December 12, 2025 11:51
@schroedk
Collaborator Author

> Benchmarks of the accelerated vs regular rust vs fixest and FixedEffectsModels.jl. Looks like good progress to me!
>
> [benchmark plot]

@s3alfisc do you have the code to run this benchmark?

@s3alfisc
Member

s3alfisc commented Dec 15, 2025

Yes, it's here: https://github.com/s3alfisc/fixest_benchmarks

Hope I documented the setup well, but I think cloning it plus the `just` task runner should get you started.

It's the OLS benchmarks for the hard problem that are relevant.

Note: @grantmcdermott mentioned the other day there might be a minor issue with the benchmarks (though I don't know what exactly) so best to take it with a grain of salt (though I couldn't spot it, everything looked ok to me).

@s3alfisc
Member

Wait, it looks like I didn't push my local changes, including the just setup. One sec

@s3alfisc
Member

s3alfisc commented Dec 15, 2025

It's in the justfile branch on the remote 😅

https://github.com/s3alfisc/fixest_benchmarks

Requirements: global R and Julia installations, plus Just.

Then type

just setup

to install all package deps as well as Python (in a local env).

Then `just bench-ols` for benchmarks.

@s3alfisc
Member

s3alfisc commented Dec 27, 2025

Is this confirmed? 👀 😄 🚀

[benchmark screenshot]

@s3alfisc
Member

Re: the failing tests atm - seems to be the prediction method, which is notoriously fickle and should be fixable by changing the tolerance or just checking a different subset of the prediction array:

x1 = array([23.03009222,  7.43960451, 14.04928582, 17.83110832, 17.46495242])
x2 = array([23.03009052,  7.43960323, 14.04928428, 17.83110934, 17.4649537 ])
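For concreteness, the two arrays above differ by roughly 2e-7 in relative terms, which sits right at numpy's default rtol of 1e-7; loosening the tolerance slightly makes the comparison robust (a sketch of the suggested fix, not the actual test code):

```python
import numpy as np

# Values from the failing test above.
x1 = np.array([23.03009222, 7.43960451, 14.04928582, 17.83110832, 17.46495242])
x2 = np.array([23.03009052, 7.43960323, 14.04928428, 17.83110934, 17.4649537])

# Largest relative difference is ~2e-7, borderline at numpy's default
# rtol=1e-7; rtol=1e-6 passes comfortably.
np.testing.assert_allclose(x1, x2, rtol=1e-6)
```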

schroedk force-pushed the feature/demean-accelerated branch from dba72f4 to a2e68a2 on January 3, 2026 14:49
@schroedk
Collaborator Author

schroedk commented Jan 3, 2026

@s3alfisc Looks like it is on par now on the benchmark
[plot_ols benchmark plot]

schroedk force-pushed the feature/demean-accelerated branch from 9edc99e to 839fcaa on January 3, 2026 16:16
@s3alfisc
Member

s3alfisc commented Jan 3, 2026

> @s3alfisc Looks like it is on par now on the benchmark

So cool! Super awesome!!!

@s3alfisc
Member

s3alfisc commented Jan 3, 2026

Let me know once you think it is time for a review =) (I will start anyways as it will take me some time to study the code 😄 )

schroedk marked this pull request as ready for review on January 4, 2026 12:20
@schroedk
Collaborator Author

schroedk commented Jan 4, 2026

@s3alfisc the PR is ready for review. My suggestions/questions would be:

  1. Remove the added benchmark code (no diff to .gitignore, benchmarks)
  2. I added a rust version of detect_singletons and wired it into feols for speed -> what to do with the numba implementation?
  3. Rust backend now resolves to the accelerated version -> what to do with the original demean.rs?
  4. Too many commits, after review I would squash merge

@s3alfisc
Member

s3alfisc commented Jan 4, 2026

Awesome, I will start my review then! I will try to be fast, but also have a big PR from @leostimpfle to review thoroughly + all the Rust code will be a bit of a challenge for me 😄

On your questions, I'd say:

  1. Yes, let's delete the benchmark code? I will push some changes to the fixest_benchmark repo that I have been working on going forward, we can use it to benchmark the routines?
  2. This is super awesome! For the numba implementation - I guess I would keep it for now? I haven't fully decided what to do with the numba code once we make the switch and make the rust backend the default - option 1 is to allow users to choose a numba backend, option 2 would be to delete all numba code. Do you happen to have an opinion on this?
  3. I would probably delete it? Or is there any use in keeping it for testing / as a reference?
  4. Yes, squash commit makes sense to me 👍

Implement fixest's Irons-Tuck-Grand acceleration algorithm for high-dimensional
fixed effects demeaning in Rust. This is a coefficient-space iterative method
that provides significant speedups over naive alternating projections.

Key features:
- Irons-Tuck acceleration with grand acceleration steps
- Support for 2-FE and 3+ FE cases with optimized projectors
- Algorithm aligned with R fixest implementation
- Auto-vectorized loops (no explicit SIMD dependencies)

Reference: https://github.com/lrberge/fixest (CCC_demean.cpp)
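The base operation that this acceleration wraps is a single alternating-projection sweep: subtracting group means for one fixed effect. A hedged NumPy sketch of that step (`project_out_fe` and its signature are illustrative, not the PR's API):

```python
import numpy as np

def project_out_fe(x, codes, n_groups):
    """Subtract the group mean of `x` within each level of one fixed effect.

    Naive alternating projections cycles this over all FEs until
    convergence; the accelerated algorithm extrapolates between such
    sweeps in coefficient space. Illustrative sketch only.
    """
    sums = np.bincount(codes, weights=x, minlength=n_groups)   # per-group sums
    counts = np.bincount(codes, minlength=n_groups)            # per-group sizes
    return x - (sums / counts)[codes]                          # demean in place
```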

Performance improvements to the accelerated demeaning implementation:
- Optimize memory layout and share FEInfo across columns
- Add SSR (sum of squared residuals) stopping criterion for 2-FE
- Loop unrolling for 3-FE projection hot paths
- Align tolerance default with fixest (1e-6 instead of 1e-8)

Restructure the Rust demeaning code for clarity and maintainability:
- Introduce Projector trait for FE-specific projection strategies
- Introduce Demeaner trait for high-level solver strategies
- Unified DemeanBuffers struct for scratch space management
- Replace unsafe pointer code with safe iterator-based implementations
- Move related functions into appropriate impl blocks

Eliminate Python/numba overhead in the estimation pipeline:
- Implement detect_singletons in Rust to avoid numba JIT compilation
- Add Python wrapper maintaining API compatibility
- Optimize factorize() using pd.factorize instead of category conversion
- Replace slow df.isin() with np.isinf() for infinite value detection
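A pure-NumPy sketch of the singleton-detection idea this commit moves to Rust: an observation is a singleton if it is the only member of its group in some fixed effect, and dropping it can create new singletons in other FEs, so the search iterates to a fixed point. Names and signature are illustrative, not the pyfixest API:

```python
import numpy as np

def detect_singletons(ids):
    """Flag observations that are singletons in any fixed effect.

    `ids` is an (n_obs, n_fe) integer array of group codes. Returns a
    boolean mask that is True for dropped (singleton) observations.
    Illustrative sketch; the PR implements this in Rust for speed.
    """
    n_obs, n_fe = ids.shape
    keep = np.ones(n_obs, dtype=bool)
    changed = True
    while changed:            # iterate: drops can cascade across FEs
        changed = False
        for j in range(n_fe):
            # Group sizes among currently kept observations.
            counts = np.bincount(ids[keep, j], minlength=ids[:, j].max() + 1)
            is_singleton = keep & (counts[ids[:, j]] == 1)
            if is_singleton.any():
                keep[is_singleton] = False
                changed = True
    return ~keep
```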

Testing and code quality improvements:
- Add edge case tests for demean_accelerated
- Implement buffer reuse via for_each_init pattern
- Extract MultiFEBuffers struct for better readability
- Refactor Demeaner trait to own context and config references

Remove unnecessary abstractions after experimentation phase:
- Remove Accelerator trait in favor of direct IronsTuckGrand impl
- Move config into IronsTuckGrand struct
- Consolidate ConvergenceState and related types
- Update to PyO3 0.26 API (allow_threads -> detach)

Connect the new demean_accelerated module to Python and polish:
- Wire rust backend to use demean_accelerated instead of simple demean
- Fix MultiFE early convergence bug in 3+ FE demeaning
- Rename scatter/gather to apply_design_matrix for clarity
- Avoid per-column copy for Fortran-ordered input arrays
- Add type cast guard and #[inline(always)] on hot methods
schroedk force-pushed the feature/demean-accelerated branch from 65a4cd2 to c3ca143 on January 4, 2026 22:39
@schroedk
Collaborator Author

schroedk commented Jan 4, 2026

> Awesome, I will start my review then! I will try to be fast, but also have a big PR from @leostimpfle to review thoroughly + all the Rust code will be a bit of a challenge for me 😄
>
> On your questions, I'd say:
>
>   1. Yes, let's delete the benchmark code? I will push some changes to the fixest_benchmark repo that I have been working on going forward, we can use it to benchmark the routines?
>   2. This is super awesome! For the numba implementation - I guess I would keep it for now? I haven't fully decided what to do with the numba code once we make the switch and make the rust backend the default - option 1 is to allow users to choose a numba backend, option 2 would be to delete all numba code. Do you happen to have an opinion on this?
>   3. I would probably delete it? Or is there any use in keeping it for testing / as a reference?
>   4. Yes, squash commit makes sense to me 👍
  1. deleted
  2. I will keep it for now, feols already uses the rust version. There is no other usage (besides tests) of the numba code, so one can get rid of it any time and re-export the rust-backed version for backwards compatibility. I cannot think of a reason to keep the numba backend in the middle/long run, so I would go with Option 2.
  3. I deleted it and rewired the python bindings to the accelerated version (commit 6c091d7) -> no breaking change in the python API
  4. I created a smaller, more meaningful set of commits, so you could also step through the commits for the review. If you like, we can squash to one commit for merge.

schroedk force-pushed the feature/demean-accelerated branch from 6c091d7 to e20668e on January 4, 2026 23:55
Replace the simple alternating projections implementation with the
accelerated Irons-Tuck algorithm as the sole Rust demean backend.

Changes:
- Remove src/demean.rs (old simple implementation)
- Update demean.py to call _demean_accelerated_rs
- Remove demean_accelerated.py (was only needed during development)
- Update backends.py and demean_.py imports
- Clean up tests to remove redundant fixtures

The public Python API is unchanged - users calling demean() or using
the "rust" backend get the accelerated implementation transparently.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
schroedk force-pushed the feature/demean-accelerated branch from e20668e to 81e2aa1 on January 4, 2026 23:56
Reorder fixed effects by number of groups (largest first) to match
fixest's default `fixef.reorder = TRUE` behavior. This improves
convergence for 3+ FE cases by making the 2-FE sub-convergence
phase work on the largest FEs first.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
schroedk force-pushed the feature/demean-accelerated branch from 32a52d1 to 420d7fc on January 5, 2026 10:51
schroedk and others added 3 commits January 5, 2026 12:36
- Add coef_sums_buffer to SingleFEDemeaner, TwoFEDemeaner, and MultiFEBuffers
- Change apply_design_matrix_t to write to caller-provided buffer
- Remove unnecessary in_out_2fe.to_vec() copy in MultiFEDemeaner
- Rename in_out to coef_sums/coef_sums_buffer for clarity

This eliminates per-column allocations: 1 for 2FE, 4 for 3+FE cases.
Benchmarks show 4-12% improvement for medium-sized datasets (100K obs).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Unroll the accumulate_fe_contributions loop 4x to enable better
instruction-level parallelism. This produces paired loads (ldp)
and reduces loop overhead, providing ~7% speedup on large 3FE
demeaning workloads.

Also refactor compute_ssr to reuse the optimized accumulate method.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@s3alfisc
Member

s3alfisc commented Jan 6, 2026

If I understand it correctly, the algorithm is set up to operate in "coef space" - this means that we have estimated fixed effects at the end of the demeaning routine. Would it be possible to return these fixed effects? In fixed effects models, we are usually not interested in them, but we might need them to conduct predictions, which in turn is needed to compute marginal effects quickly. This is more or less how the R margins package does it, and Stata's, R's, and Python's marginal effects tooling too. From the margins paper:

[screenshots from the margins paper]

In pyfixest, we currently get the estimated fixed effects post hoc by solving a sparse system via LSMR (see here), which I think we could skip now?
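To illustrate the point about predictions: once demeaning returns per-group FE coefficients, fitted values are just Xb plus each observation's FE coefficients, with no sparse solve needed. A hypothetical sketch (`predict_with_fe` and its signature are made up, not pyfixest's API):

```python
import numpy as np

def predict_with_fe(X, beta, fe_codes, fe_coefs):
    """Prediction as X @ beta plus the sum of estimated fixed effects.

    fe_codes: (n_obs, n_fe) group codes; fe_coefs: one per-group
    coefficient array per fixed effect, as a demeaning routine
    operating in coefficient space could return them.
    """
    yhat = X @ beta
    for j, coefs in enumerate(fe_coefs):
        yhat += coefs[fe_codes[:, j]]   # add each obs's FE coefficient
    return yhat
```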

@schroedk
Collaborator Author

schroedk commented Jan 8, 2026

@s3alfisc latest benchmark
With numba:
[plot_ols benchmark plot]

Without numba:
[plot_ols benchmark plot]

@s3alfisc
Member

s3alfisc commented Jan 8, 2026

This looks awesome! Should we maybe update the benchmarks in the readme?

schroedk and others added 7 commits January 9, 2026 15:37
Fixed effects are now always sorted by number of groups (largest first),
matching fixest's default behavior. This simplifies the API and ensures
optimal convergence properties.

Changes:
- Remove `reorder_fe` field from `FixestConfig`
- Remove `with_reorder` method from `FixedEffectsIndex`
- Remove `with_config` method from `DemeanContext`
- Simplify `FixedEffectsIndex::new()` to always reorder

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Replace `convergence_len() -> usize` with `convergence_range() -> Range<usize>`
in the Projector trait. This makes the accelerator fully generic over any
Projector implementation, not just FE-specific ones that check a prefix.

The accelerator extracts (start, end) from the range to avoid cloning overhead.

Following fixest's approach, FE projectors exclude the last FE (smallest after
reordering) from convergence checking. At a fixed point, if (n_fe - 1) FEs
have converged, the remaining one must also have converged.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add original_to_reordered mapping to FixedEffectsIndex for tracking
  how FEs are reordered internally (by size for optimal convergence)
- Add fe_coefficients field to DemeanResult
- Add reorder_coefficients_to_original() method to restore coefficients
  to the user's original FE order
- Add total_coef buffer to MultiFEBuffers for accumulating coefficients
  across all demeaning phases (warmup, two_fe_convergence, reacceleration)
- Update all demeaners to populate and return FE coefficients

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Test cases:
- Single FE coefficient correctness
- Two FE coefficient correctness
- Three FE coefficient correctness (random order)
- Coefficient ordering preservation (verifies coefficients match
  original FE order, not internal reordered order)
- Weighted demeaning with coefficient extraction

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Rust changes:
- DemeanContext now has weights: Option<ObservationWeights>
- When None, uses group_counts for denominators (no per-obs multiplication)
- _demean_rs binding takes weights=None by default

Python changes:
- demean() wrapper detects uniform weights (all equal) via np.allclose
- Passes None to Rust when weights are uniform, enabling fast path
- Public API unchanged (weights parameter still required)

This saves memory (no per-obs weight storage) and computation
(no weight multiplication in scatter operations) for unweighted regression.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
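The uniform-weights detection this commit describes could look roughly like this on the Python side (a hedged sketch; `normalize_weights` is an illustrative name, not the actual wrapper):

```python
import numpy as np

def normalize_weights(weights):
    """Return None when all weights are (numerically) equal, enabling the
    unweighted fast path in the Rust backend; otherwise return the
    weights unchanged. Constant weights cancel in the weighted group
    means, so the unweighted path gives identical results.
    """
    if weights is None or np.allclose(weights, weights.flat[0]):
        return None
    return weights
```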
@s3alfisc
Member

Just a thought - is it currently possible for users to turn off the "reorder fixed effects" solution via the API? I think it could be useful to allow this for cases where users know which fixed effects are most complicated, which might not necessarily be identical to the "largest" fixed effects.

Example:

We have three fixed effects: nba_player, district, city. NBA player is just an identifier of a player, while district and city tell us about where a game is played. Because of schedules, players play games in different cities and districts and hence there is little correlation. But of course there are many more players than cities / districts.

As cities are nested within districts, the fixed effects will be super correlated. In this case, it might make sense to work on district & city in the two way part of the three part strategy for 3+ FE even if the cardinality of the FE is lower?

@schroedk
Collaborator Author

> Just a thought - is it currently possible for users to turn off the "reorder fixed effects" solution via the API? I think it could be useful to allow this for cases where users know which fixed effects are most complicated, which might not necessarily be identical to the "largest" fixed effects.
>
> Example:
>
> We have three fixed effects: nba_player, district, city. NBA player is just an identifier of a player, while district and city tell us about where a game is played. Because of schedules, players play games in different cities and districts and hence there is little correlation. But of course there are many more players than cities / districts.
>
> As cities are nested within districts, the fixed effects will be super correlated. In this case, it might make sense to work on district & city in the two way part of the three part strategy for 3+ FE even if the cardinality of the FE is lower?

It is not, but easy to change on the rust side. Actually it is just reverting 35ba573, as I had it before but decided to keep it simpler. But I did not consider your very valid point. The bigger problem is: how to add it to the python part?

@s3alfisc
Member

> The bigger problem is, how to add it to the python part?

Fixest has a Boolean argument "reorder_coef" - I think we could just add it to the python interface and then pass it to the Rust code? For the non-accelerated algos, it would just not do anything. Maybe I am missing something?

Remove manual 4x loop unrolling from compute_ssr methods in
TwoFEProjector and MultiFEProjector. LLVM auto-vectorizes simple
loops effectively, making manual unrolling unnecessary complexity.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@schroedk
Collaborator Author

schroedk commented Jan 12, 2026

> The bigger problem is, how to add it to the python part?

> Fixest has a Boolean argument "reorder_coef" - I think we could just add it to the python interface and then pass it to the Rust code? For the non-accelerated algos, it would just not do anything. Maybe I am missing something?

@s3alfisc that would be one possibility, but then you have to also expose it to calling functions like feols. In this case the parameter is confusing, because it is only relevant for the rust backend.

What about this: I make it configurable on the rust side and we handle the python side in a separate issue? In that case, here we only have to talk about the default:

  • true: sorts every time; the average user would profit, but a "power user" sorting the FEs themselves would not be able to override it
  • false: on average, the convergence might be slower due to the missing sorting (no idea how much); we would have to document that manually sorting the hard FE to the front can increase convergence speed. But again, this only holds true for the rust backend.

Edit: I would vote for false (not sorting by default), it is more explicit (explicitness has a higher value for me than pure maximal performance)

Previously, fixed effects were always reordered by size (largest first)
during demeaning. This adds a `reorder_fe` boolean parameter that allows
users to control this behavior. Default is `false` (no reordering).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
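The behavior this commit makes optional can be sketched as follows (illustrative names only; the actual logic lives in the Rust `FixedEffectsIndex`):

```python
import numpy as np

def reorder_fes(fe_codes, reorder_fe=True):
    """Optionally sort fixed effects by number of groups, largest first,
    mirroring fixest's reordering default. Returns the permuted columns
    plus the permutation so results can be mapped back to the user's
    original FE order. Hypothetical sketch, not the PR's API.
    """
    if not reorder_fe:
        order = np.arange(fe_codes.shape[1])          # keep user order
    else:
        n_groups = [len(np.unique(fe_codes[:, j]))
                    for j in range(fe_codes.shape[1])]
        order = np.argsort(n_groups)[::-1]            # largest cardinality first
    return fe_codes[:, order], order
```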


Development

Successfully merging this pull request may close these issues.

Demean: Implement IT Acceleration for demeaning algo

2 participants