Codecov Report: ✅ All modified and coverable lines are covered by tests.
... and 7 files with indirect coverage changes
@schroedk just rebased / merged changes from master in here (new features, plus I moved to using a pixi.toml; also, the dev env no longer installs on Windows due to compatibility challenges). As before, you can build the Rust bindings and run the benchmarks with the usual commands.
(force-pushed dbf6e71 to 1e633c1)
@s3alfisc do you have the code to run this benchmark?
Yes, it's here: https://github.com/s3alfisc/fixest_benchmarks. I hope I documented the setup well; clone + just + the task runner should get you started. The OLS benchmarks for the hard problem are the relevant ones. Note: @grantmcdermott mentioned the other day that there might be a minor issue with the benchmarks (though I don't know what exactly), so best to take them with a grain of salt (I couldn't spot anything myself; everything looked ok to me).
Wait, it looks like I didn't push my local changes including the just setup. One sec.
It's in the justfile branch on the remote 😅 https://github.com/s3alfisc/fixest_benchmarks. Requirements: global R and Julia installations, plus Just. Then run `just setup` to install all package deps as well as Python (in a local env), and `just bench-ols` for the benchmarks.
Re the failing tests at the moment: it seems to be the prediction method, which is notoriously fickle and should be fixable by loosening the tolerance or just checking a different subset of the prediction array:
x1 = array([23.03009222,  7.43960451, 14.04928582, 17.83110832, 17.46495242])
x2 = array([23.03009052,  7.43960323, 14.04928428, 17.83110934, 17.4649537 ])
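One way the fix could look in the test (the rtol value is an assumption, not necessarily what the PR chose):

```python
import numpy as np

x1 = np.array([23.03009222, 7.43960451, 14.04928582, 17.83110832, 17.46495242])
x2 = np.array([23.03009052, 7.43960323, 14.04928428, 17.83110934, 17.4649537])

# The two runs agree to roughly 7 significant digits; assert_allclose's
# default rtol of 1e-7 sits right at that edge, so loosening it slightly
# makes the test robust to backend-level floating-point differences.
np.testing.assert_allclose(x1, x2, rtol=1e-6)
```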
(force-pushed dba72f4 to a2e68a2)
@s3alfisc Looks like it is on par now on the benchmark.
(force-pushed 9edc99e to 839fcaa)
So cool! Super awesome!!!
Let me know once you think it is time for a review =) (I will start anyway, as it will take me some time to study the code 😄)
@s3alfisc the PR is ready for review. My suggestions/questions would be:
Awesome, I will start my review then! I will try to be fast, but I also have a big PR from @leostimpfle to review thoroughly, plus all the Rust code will be a bit of a challenge for me 😄 On your questions, I'd say:
Implement fixest's Irons-Tuck-Grand acceleration algorithm for high-dimensional fixed-effects demeaning in Rust. This is a coefficient-space iterative method that provides significant speedups over naive alternating projections.

Key features:
- Irons-Tuck acceleration with grand acceleration steps
- Support for 2-FE and 3+ FE cases with optimized projectors
- Algorithm aligned with the R fixest implementation
- Auto-vectorized loops (no explicit SIMD dependencies)

Reference: https://github.com/lrberge/fixest (CCC_demean.cpp)
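Not the PR's Rust code: a minimal NumPy sketch of the Irons-Tuck extrapolation step wrapped around alternating projections on a toy 2-FE problem, to illustrate the coefficient-space idea (all helper names here are made up):

```python
import numpy as np

def update_alpha(y, fe1, fe2, alpha, n1, n2):
    # One alternating-projection sweep in coefficient space: refresh the
    # fe2 group means given alpha, then refresh the fe1 means given beta.
    beta = np.bincount(fe2, weights=y - alpha[fe1], minlength=n2.size) / n2
    return np.bincount(fe1, weights=y - beta[fe2], minlength=n1.size) / n1

def irons_tuck(x, gx, ggx):
    # Irons-Tuck extrapolation: x_acc = GGx - c * (GGx - Gx), with
    # c = <GGx - Gx, d> / <d, d> and d = GGx - 2*Gx + x.
    d = ggx - 2.0 * gx + x
    denom = d @ d
    if denom == 0.0:  # already at a fixed point
        return ggx
    return ggx - ((ggx - gx) @ d) / denom * (ggx - gx)

rng = np.random.default_rng(0)
n = 2000
fe1 = rng.integers(0, 100, n)
fe2 = rng.integers(0, 40, n)
y = rng.normal(size=n) + 0.5 * fe1 + 1.5 * fe2
n1 = np.bincount(fe1, minlength=100).astype(float)
n2 = np.bincount(fe2, minlength=40).astype(float)

alpha = np.zeros(100)
for _ in range(200):
    gx = update_alpha(y, fe1, fe2, alpha, n1, n2)
    ggx = update_alpha(y, fe1, fe2, gx, n1, n2)
    new = irons_tuck(alpha, gx, ggx)
    if np.max(np.abs(new - alpha)) < 1e-12:
        alpha = new
        break
    alpha = new

# Recover beta from the converged alpha; the demeaned outcome is then
# y - alpha[fe1] - beta[fe2], with group means driven to ~0 in both FEs.
beta = np.bincount(fe2, weights=y - alpha[fe1], minlength=40) / n2
resid = y - alpha[fe1] - beta[fe2]
```

The extrapolation reuses the two most recent fixed-point iterates, which is why the Rust implementation can work on per-FE coefficient vectors rather than the full observation vector.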
Performance improvements to the accelerated demeaning implementation:
- Optimize memory layout and share FEInfo across columns
- Add SSR (sum of squared residuals) stopping criterion for 2-FE
- Loop unrolling for 3-FE projection hot paths
- Align tolerance default with fixest (1e-6 instead of 1e-8)
Restructure the Rust demeaning code for clarity and maintainability:
- Introduce a Projector trait for FE-specific projection strategies
- Introduce a Demeaner trait for high-level solver strategies
- Unified DemeanBuffers struct for scratch-space management
- Replace unsafe pointer code with safe iterator-based implementations
- Move related functions into appropriate impl blocks
Eliminate Python/numba overhead in the estimation pipeline:
- Implement detect_singletons in Rust to avoid numba JIT compilation
- Add a Python wrapper maintaining API compatibility
- Optimize factorize() using pd.factorize instead of category conversion
- Replace slow df.isin() with np.isinf() for infinite-value detection
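For illustration, a small sketch of the Python-side ideas mentioned above, on toy data (pyfixest's actual internals may differ):

```python
import numpy as np
import pandas as pd

s = pd.Series(["b", "a", "b", "c", "a"])

# pd.factorize encodes labels as integer codes in a single pass, skipping
# the intermediate Categorical that .astype("category").cat.codes builds.
codes, uniques = pd.factorize(s)

# Singleton detection: a group observed exactly once carries no
# within-group variation and can be dropped before demeaning.
counts = np.bincount(codes)
is_singleton = counts[codes] == 1

# Infinite-value check on the raw ndarray, instead of df.isin([inf, -inf]).
x = np.array([1.0, np.inf, -np.inf, 3.0])
has_inf = bool(np.isinf(x).any())
```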
Testing and code quality improvements:
- Add edge-case tests for demean_accelerated
- Implement buffer reuse via the for_each_init pattern
- Extract a MultiFEBuffers struct for better readability
- Refactor the Demeaner trait to own context and config references
Remove unnecessary abstractions after the experimentation phase:
- Remove the Accelerator trait in favor of a direct IronsTuckGrand impl
- Move config into the IronsTuckGrand struct
- Consolidate ConvergenceState and related types
- Update to the PyO3 0.26 API (allow_threads -> detach)
Connect the new demean_accelerated module to Python and polish:
- Wire the Rust backend to use demean_accelerated instead of the simple demean
- Fix a MultiFE early-convergence bug in 3+ FE demeaning
- Rename scatter/gather to apply_design_matrix for clarity
- Avoid a per-column copy for Fortran-ordered input arrays
- Add a type-cast guard and #[inline(always)] on hot methods
(force-pushed 65a4cd2 to c3ca143)
(force-pushed 6c091d7 to e20668e)
Replace the simple alternating projections implementation with the accelerated Irons-Tuck algorithm as the sole Rust demean backend.

Changes:
- Remove src/demean.rs (old simple implementation)
- Update demean.py to call _demean_accelerated_rs
- Remove demean_accelerated.py (was only needed during development)
- Update backends.py and demean_.py imports
- Clean up tests to remove redundant fixtures

The public Python API is unchanged: users calling demean() or using the "rust" backend get the accelerated implementation transparently.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
(force-pushed e20668e to 81e2aa1)
Reorder fixed effects by number of groups (largest first) to match fixest's default `fixef.reorder = TRUE` behavior. This improves convergence for 3+ FE cases by making the 2-FE sub-convergence phase work on the largest FEs first.
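A sketch of the reordering rule, assuming integer-coded FE columns (the variable names are illustrative):

```python
import numpy as np

# fe_codes: one integer-coded column per fixed effect (toy data).
fe_codes = [np.array([0, 1, 0, 1]),   # 2 groups
            np.array([0, 1, 2, 3]),   # 4 groups
            np.array([0, 0, 1, 1])]   # 2 groups
n_groups = [int(c.max()) + 1 for c in fe_codes]

# Largest FE first, mirroring fixest's `fixef.reorder = TRUE` default;
# a stable sort keeps the original order among ties.
order = np.argsort([-g for g in n_groups], kind="stable")
reordered = [fe_codes[i] for i in order]
```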
(force-pushed 32a52d1 to 420d7fc)
- Add coef_sums_buffer to SingleFEDemeaner, TwoFEDemeaner, and MultiFEBuffers
- Change apply_design_matrix_t to write to a caller-provided buffer
- Remove unnecessary in_out_2fe.to_vec() copy in MultiFEDemeaner
- Rename in_out to coef_sums/coef_sums_buffer for clarity

This eliminates per-column allocations: 1 for 2-FE, 4 for 3+ FE cases. Benchmarks show a 4-12% improvement for medium-sized datasets (100K obs).
Unroll the accumulate_fe_contributions loop 4x to enable better instruction-level parallelism. This produces paired loads (ldp) and reduces loop overhead, providing a ~7% speedup on large 3-FE demeaning workloads. Also refactor compute_ssr to reuse the optimized accumulate method.
If I understand it correctly, the algorithm is set up to operate in "coef space", which means that we have estimated fixed effects at the end of the demeaning routine. Would it be possible to return these fixed effects? In fixed-effects models we are usually not interested in them, but we might need them to conduct predictions, which in turn are needed to compute marginal effects quickly. This is more or less how the R fixest package handles it. In pyfixest, we currently get the estimated fixed effects post hoc by solving a sparse system via LSMR (see here), which I think we could skip now?
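For context, the post hoc route looks roughly like this sparse least-squares solve (a sketch with toy data, not pyfixest's actual code):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import lsmr

rng = np.random.default_rng(1)
n = 500
fe1 = rng.integers(0, 20, n)
fe2 = rng.integers(0, 10, n)
resid = rng.normal(size=n)  # stand-in for y - X @ beta

# Build the sparse dummy matrix [D1 | D2] and solve the least-squares
# problem with LSMR; the dummy system is rank deficient, so LSMR returns
# one (minimum-norm) solution out of the equivalent normalizations.
D1 = sparse.csr_matrix((np.ones(n), (np.arange(n), fe1)), shape=(n, 20))
D2 = sparse.csr_matrix((np.ones(n), (np.arange(n), fe2)), shape=(n, 10))
D = sparse.hstack([D1, D2]).tocsr()
alpha = lsmr(D, resid)[0]   # stacked FE coefficient estimates
fitted = D @ alpha          # fixed-effect contribution per observation
```

If the Rust demeaner already carries the coefficients, returning them would avoid setting up and iterating this system entirely.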
@s3alfisc latest benchmark
This looks awesome! Should we maybe update the benchmarks in the readme?
Fixed effects are now always sorted by number of groups (largest first), matching fixest's default behavior. This simplifies the API and ensures optimal convergence properties.

Changes:
- Remove `reorder_fe` field from `FixestConfig`
- Remove `with_reorder` method from `FixedEffectsIndex`
- Remove `with_config` method from `DemeanContext`
- Simplify `FixedEffectsIndex::new()` to always reorder
Replace `convergence_len() -> usize` with `convergence_range() -> Range<usize>` in the Projector trait. This makes the accelerator fully generic over any Projector implementation, not just FE-specific ones that check a prefix. The accelerator extracts (start, end) from the range to avoid cloning overhead.

Following fixest's approach, FE projectors exclude the last FE (smallest after reordering) from convergence checking. At a fixed point, if (n_fe - 1) FEs have converged, the remaining one must also have converged.
- Add original_to_reordered mapping to FixedEffectsIndex for tracking how FEs are reordered internally (by size, for optimal convergence)
- Add fe_coefficients field to DemeanResult
- Add reorder_coefficients_to_original() method to restore coefficients to the user's original FE order
- Add total_coef buffer to MultiFEBuffers for accumulating coefficients across all demeaning phases (warmup, two_fe_convergence, reacceleration)
- Update all demeaners to populate and return FE coefficients
Test cases:
- Single FE coefficient correctness
- Two FE coefficient correctness
- Three FE coefficient correctness (random order)
- Coefficient ordering preservation (verifies coefficients match the original FE order, not the internal reordered order)
- Weighted demeaning with coefficient extraction
Rust changes:
- DemeanContext now has weights: Option<ObservationWeights>
- When None, uses group_counts for denominators (no per-obs multiplication)
- _demean_rs binding takes weights=None by default

Python changes:
- demean() wrapper detects uniform weights (all equal) via np.allclose
- Passes None to Rust when weights are uniform, enabling the fast path
- Public API unchanged (weights parameter still required)

This saves memory (no per-obs weight storage) and computation (no weight multiplication in scatter operations) for unweighted regressions.
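A sketch of what the Python-side uniform-weights check described above could look like (the function name is made up; the PR describes the behavior, not this exact API):

```python
import numpy as np

def maybe_drop_uniform_weights(weights):
    # Hypothetical wrapper check: when every weight is (numerically) equal,
    # return None so the Rust backend can use group counts as denominators
    # instead of accumulating per-observation weights.
    if weights is None or np.allclose(weights, weights[0]):
        return None
    return weights

w_uniform = np.full(8, 2.0)            # uniform -> fast path
w_varying = np.array([1.0, 2.0, 1.0])  # truly weighted -> keep as-is
```

Uniform weights rescale every group's numerator and denominator by the same constant, so the group means are unchanged and dropping the weights is safe.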
Refactor/demean accelerated
Just a thought: is it currently possible for users to turn off the "reorder fixed effects" solution via the API? Example: we have three fixed effects: nba_player, district, city. NBA player is just an identifier of a player, while district and city tell us where a game is played. Because of scheduling, players play games in different cities and districts, so there is little correlation there. But of course there are many more players than cities/districts. As cities are nested within districts, those two fixed effects will be highly correlated. In this case, it might make sense to work on district and city in the two-way part of the three-part strategy for 3+ FE, even though their cardinality is lower?
It is not, but that is easy to change on the Rust side. Actually, it is just a matter of reverting 35ba573; I had it before but decided to keep things simpler. But I did not consider your very valid point. The bigger problem is how to add it to the Python part.
Fixest has a Boolean argument "reorder_coef". I think we could just add it to the Python interface and then pass it through to the Rust code? For the non-accelerated algos, it would simply do nothing. Maybe I am missing something?
Remove manual 4x loop unrolling from the compute_ssr methods in TwoFEProjector and MultiFEProjector. LLVM auto-vectorizes simple loops effectively, making manual unrolling unnecessary complexity.
@s3alfisc that would be one possibility, but then you also have to expose it to calling functions like feols. In that case the parameter is confusing, because it is only relevant for the Rust backend. How about this: I make it configurable on the Rust side, and we handle the Python side in a separate issue? Then here we only have to agree on the default.
Edit: I would vote for false (no sorting by default); it is more explicit, which has a higher value for me than pure maximal performance.
Previously, fixed effects were always reordered by size (largest first) during demeaning. This adds a `reorder_fe` boolean parameter that allows users to control this behavior. Default is `false` (no reordering).





This PR implements a port of the Irons-Tuck acceleration implementation from the fixest package.
It includes:
Closes #357