feat: implement LSMR demeaning in torch for CUDA MPS support #1220

janfb wants to merge 16 commits into py-econometrics:master from
Conversation
Results on new benchmarking suite

After #1211 was merged, I integrated the torch backend options into this framework and ran all the tasks and algos on my Mac (excluding all CUDA-related algos).
Summary (by Claude Code)
LSMR Unification + better CUDA Results

| Device | Default path | Why |
|---|---|---|
| CUDA | compiled (`torch.compile` + precomputed D^T) | Fuses ~60 per-iteration scalar kernels into 1; avoids sparse transpose reconversion |
| CPU | Givens rotations on CPU (Python math) | No kernel launch overhead to optimize away |
| MPS | Givens rotations on CPU (Python math) | Metal command buffer batching already amortizes launches (see below) |
Callers can override with `use_compile=True/False`. All existing imports (`from pyfixest.estimation.torch.lsmr_torch import lsmr_torch`) continue to work — tests pass unchanged.
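The per-device defaults in the table above, together with the K thresholds for switching to batched LSMR described later in this PR, could be sketched as a small helper. The function name and return shape here are hypothetical, not pyfixest's actual internals, and the interaction between the compiled and batched paths is simplified:

```python
def choose_lsmr_path(device: str, k: int) -> tuple[bool, bool]:
    """Illustrative dispatch sketch: return (use_compile, use_batched).

    Defaults follow this PR's description; this is not the real
    pyfixest dispatcher.
    """
    if device == "cuda":
        # Compiled by default; batched LSMR once K >= 2 RHS columns
        return True, k >= 2
    if device == "mps":
        # Eager scalar path (compiled is slower on MPS); batched for K >= 5
        return False, k >= 5
    # CPU: eager, single-RHS
    return False, False
```

A caller passing `use_compile=True/False` explicitly would simply override the first element of this tuple.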
CUDA benchmark results
Claude's summary:
Benchmarked on DGX (NVIDIA A100), OLS with fixed effects, comparing old (uncompiled) vs optimized (compiled + precomputed D^T):
- Small N (<100K): No measurable difference — few LSMR iterations, kernel launch overhead is negligible
- 500K–1M: Optimized version pulls ahead, ~1.3–1.5x
- 2M–5M: Consistent 1.5–1.8x speedup across both f32 and f64
- Difficult 3FE (most LSMR iterations) shows the largest gains — the per-iteration savings compound over hundreds of iterations
The two optimizations are complementary:
- `torch.compile` fuses the scalar Givens rotation + norm estimation + convergence check into a single GPU kernel, eliminating ~60 tiny kernel launches per iteration
- Precomputed D^T materializes the sparse transpose once upfront instead of reconverting it every LSMR iteration (COO coalesce on MPS, CSR radixSort on CUDA)
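As a torch-free illustration of the precomputed-transpose idea, here is a sketch with SciPy sparse standing in for torch sparse (illustrative only, not the PR's code):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n, g = 5_000, 40
fe = rng.integers(0, g, size=n)
# Sparse FE dummy matrix D (n x g) in CSR
D = sparse.csr_matrix((np.ones(n), (np.arange(n), fe)), shape=(n, g))

# Materialize D^T once, in the layout the iteration needs, instead of
# forming/converting the transpose inside every LSMR iteration.
Dt = sparse.csr_matrix(D.T)

u = rng.normal(size=n)
for _ in range(3):      # stand-in for LSMR iterations
    v = Dt @ u          # the A^T u step reuses the precomputed operand
```

The saving is precisely the repeated layout conversion (coalesce/sort) that the bullet above describes.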
Why torch.compile doesn't help on MPS
On CUDA, each scalar operation (e.g., a Givens rotation on two floats) launches a separate kernel. At ~60 scalar ops per LSMR iteration, the CPU→GPU dispatch overhead dominates. torch.compile fixes this by fusing them into one kernel.
MPS (Metal) doesn't have this problem. Metal uses a command buffer model — scalar operations are batched into a command buffer on the CPU side and submitted to the GPU in bulk. The dispatch overhead is already amortized without compilation. What torch.compile adds on MPS is Python-side tracing and graph capture overhead, which is pure cost with no kernel-side benefit. Our A/B benchmarks on Apple Silicon confirmed this: the compiled path was slower than the scalar path at every dataset size on MPS.
This is why the dispatcher defaults to use_compile=False on MPS — it's not a missing optimization, it's the correct choice for the hardware.
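For reference, the scalar Givens rotation computed many times per iteration on the eager CPU/MPS path can be written in plain Python math. This mirrors the standard LSMR `sym_ortho` construction and is a sketch, not the PR's exact code:

```python
import math

def sym_ortho(a: float, b: float) -> tuple[float, float, float]:
    """Stable Givens rotation: returns (c, s, r) with c*a + s*b = r
    and s*a - c*b = 0, i.e. [[c, s], [s, -c]] zeroes the second entry."""
    if b == 0.0:
        return math.copysign(1.0, a), 0.0, abs(a)
    if a == 0.0:
        return 0.0, math.copysign(1.0, b), abs(b)
    if abs(b) > abs(a):
        tau = a / b
        s = math.copysign(1.0, b) / math.sqrt(1.0 + tau * tau)
        c = s * tau
        r = b / s
    else:
        tau = b / a
        c = math.copysign(1.0, a) / math.sqrt(1.0 + tau * tau)
        s = c * tau
        r = a / c
    return c, s, r

c, s, r = sym_ortho(3.0, 4.0)  # r = 5.0
```

On CUDA, each of these few-float operations becomes its own kernel launch unless fused; on CPU and MPS, running them as Python math is already the cheap option.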
More tweaking?

Of course, there are even more options for optimizing this.

Moving on

@s3alfisc, I suggest getting your review on the LSMR part of this code and the tests. The changes to the benchmarking files are less relevant given that the benchmarking will be refactored anyway. I can push the current timings as a CSV for later reference — shall I?
- Added `lsmr_torch_fused.py` for a fused version of the LSMR algorithm, utilizing branchless Givens rotations and 0-d tensors to reduce CPU-GPU sync overhead.
- Introduced tests for the new fused LSMR implementation in `test_lsmr_fused.py`, ensuring correctness against the original LSMR and benchmarking performance.
- Created `test_lsmr_compiled.py` to validate the compiled version of the original LSMR, including auto-detection and MPS compatibility tests.
- also enhance GPU efficiency with pre-computed transpose
for more information, see https://pre-commit.ci
…oduce fused version

- Deleted `lsmr_torch_compiled.py` and `lsmr_torch_fused.py` files, consolidating functionality into `lsmr_torch.py`.
- Updated tests to reflect changes in the LSMR implementation, ensuring correctness and performance benchmarks.
- Adjusted convergence checks and state management to optimize CPU-GPU synchronization.
- Enhanced the branchless Givens rotation implementation for improved efficiency on CUDA/MPS.
f009def to 824c315
This PR implements LSMR in torch and adds a PyTorch-based fixed-effect demeaning backend to pyfixest. Users can access it through the existing `demeaner_backend` argument. The main motivation is enabling GPU support on consumer Apple laptops via MPS, which is relevant for a large part of the user base. The same backend family also supports CUDA and CPU, and "batched" LSMR for systems with more features.
User-facing API
This PR extends the existing `demeaner_backend` option with:

- `torch`
- `torch_cpu`
- `torch_mps`
- `torch_cuda`
- `torch_cuda32`

No new top-level estimation API is introduced. Internally, we additionally choose between `compiled` and `batched` versions of LSMR.

Main caveat
MPS only supports float32 for this implementation, not the current `pyfixest` default of float64. So the torch MPS backend is effectively a float32 path.

Core implementation choices
Sparse matrix layout is device-specific (due to MPS limitations): MPS uses COO, CPU and CUDA use CSR.

Torch LSMR is integrated into the existing demeaning pipeline: the backend builds the sparse FE dummy matrix directly from encoded fixed effects, applies optional weighting and diagonal preconditioning, and solves the FWL system with LSMR.
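An illustrative SciPy analogue of that pipeline (dummy matrix from encoded fixed effects, LSMR solve, FWL residual); the actual backend does this in torch:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import lsmr

rng = np.random.default_rng(42)
n, g = 1_000, 20
fe = rng.integers(0, g, size=n)            # encoded fixed effect (0..g-1)
x = rng.normal(size=n) + fe.astype(float)  # column with group-level shifts

# Sparse FE dummy matrix D (n x g), built directly from the encoding
D = sparse.csr_matrix((np.ones(n), (np.arange(n), fe)), shape=(n, g))

# FWL: solve min_b ||D b - x||_2 with LSMR, keep the residual
b = lsmr(D, x, atol=1e-10, btol=1e-10)[0]
x_demeaned = x - D @ b

# Within-group means of the demeaned column are numerically zero
group_means = np.bincount(fe, weights=x_demeaned) / np.bincount(fe)
```

Weighting and diagonal preconditioning would enter as row scalings of `D` and `x` and a column scaling of `D`, respectively; they are omitted here for brevity.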
Dispatch is device-specific:

- CPU defaults to the eager (not compiled) single-RHS (no batching) path
- CUDA defaults to the compiled torch path, then switches to batched LSMR for K >= 2
- MPS defaults to the eager path (compiled is slower on MPS), then switches to batched LSMR for K >= 5

Batched LSMR is included for multi-column demeaning:
`lsmr_torch_batched()` is used internally when solving several RHS columns jointly on devices where batched sparse matmul is beneficial.

Scalar-step handling changed during development: the early version relied heavily on `.item()` / CPU scalar math for Givens rotations. The current implementation is more nuanced: it uses Python `math` where that helps, with the compiled core kept in `_lsmr_compiled_core.py`.

Benchmarking