diff --git a/README.md b/README.md index 00f0053..de086a5 100644 --- a/README.md +++ b/README.md @@ -1,72 +1,170 @@ # within -High-performance fixed-effects solver for econometric panel data. Solves -**y = D x** where D is a sparse categorical design matrix, using -iterative methods (preconditioned CG and right-preconditioned GMRES) with -domain decomposition (Schwarz) preconditioners backed by approximate Cholesky -local solvers. +`within` provides high-performance solvers for projecting out high-dimensional fixed effects from regression problems. -Designed for millions of observations with multiple high-dimensional fixed -effects. +By the Frisch-Waugh-Lovell theorem, estimating a regression of the form *y = Xβ + Dα + ε* reduces to a sequence of least-squares projections, one for y and one for each column of X, followed by a cheap regression fit on the resulting residuals. The projection step, solving the normal equations *D'Dx = D'z* for each right-hand side z, is the computational bottleneck, and it is the problem `within` is designed to solve. -## Python quickstart +`within`'s solvers are tailored to the structure of fixed-effects problems, which can be represented as a graph (as first noted by Correia, 2016). Concretely, `within` uses iterative methods (preconditioned CG, right-preconditioned GMRES) with domain decomposition (Schwarz) preconditioners, backed by approximate Cholesky local solvers (Gao et al., 2025). + +## Installation Requires Python >= 3.9. -```bash -pip install within -``` +## Python Quickstart + +`within`'s main user-facing function is `solve`. Provide a 2-D `uint32` array of category codes (one column per fixed-effect factor) and a response vector `y`. The solver finds x in the normal equations **D'D x = D'y**, where D is the sparse categorical design matrix.
```python -from within import solve, CG, GMRES, MultiplicativeSchwarz +from within import solve, solve_batch import numpy as np -# Two factors with 500 levels each, 100k observations -categories = [ - np.random.randint(0, 500, size=100_000), - np.random.randint(0, 500, size=100_000), -] -y = np.random.randn(100_000) - -# Solve with Schwarz-preconditioned CG (default) -result = solve(categories, y) -print(f"converged={result.converged} iters={result.iterations} residual={result.residual:.2e}") - -# Solve with unpreconditioned CG -result = solve(categories, y, CG(preconditioner=None)) -print(f"converged={result.converged} iters={result.iterations}") - -# Solve with GMRES + multiplicative Schwarz -result = solve(categories, y, GMRES(preconditioner=MultiplicativeSchwarz())) -print(f"converged={result.converged} iters={result.iterations}") - -# Weighted solve -weights = np.random.exponential(1.0, size=100_000) -result = solve(categories, y, weights=weights) -print(f"converged={result.converged} iters={result.iterations}") +np.random.seed(1) +n = 100_000 +fe = np.asfortranarray(np.column_stack([ + np.random.randint(0, 500, n).astype(np.uint32), + np.random.randint(0, 200, n).astype(np.uint32), +])) +y = np.random.randn(n) + +result = solve(fe, y) # Schwarz-preconditioned CG +result = solve(fe, y, weights=np.ones(n)) # weighted solve +``` + +### FWL regression example + +```python +beta_true = np.array([1.0, -2.0, 0.5]) +X = np.random.randn(n, 3) +y = X @ beta_true + np.random.randn(n) + +result = solve_batch(fe, np.column_stack([y, X])) +y_tilde, X_tilde = result.demeaned[:, 0], result.demeaned[:, 1:] +beta_hat = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)[0] +print(np.round(beta_hat, 4)) # [ 0.9982 -2.006 0.5005] +``` + +## Python API + +### High-level functions + +| Function | Description | +|---|---| +| `solve(categories, y, config?, weights?, preconditioner?)` | Solve a single right-hand side. Returns `SolveResult`. 
| +| `solve_batch(categories, Y, config?, weights?, preconditioner?)` | Solve multiple RHS vectors in parallel. `Y` has shape `(n_obs, k)`. Returns `BatchSolveResult`. | + +`categories` is a 2-D `uint32` array of shape `(n_obs, n_factors)`. A `UserWarning` is emitted when a C-contiguous array is passed — use `np.asfortranarray(categories)` for best performance. + +### Persistent solver + +For repeated solves with the same design matrix, `Solver` builds the preconditioner once and reuses it. + +```python +from within import Solver + +solver = Solver(fe) +r = solver.solve(y) # reuses preconditioner +r = solver.solve_batch(np.column_stack([y, X])) + +precond = solver.preconditioner() # picklable +solver2 = Solver(fe, preconditioner=precond) # skip re-factorization ``` -### Solver methods +| Property / Method | Description | +|---|---| +| `Solver(categories, config?, weights?, preconditioner?)` | Build solver. Factorizes the preconditioner at construction. | +| `.solve(y)` | Solve a single RHS. Returns `SolveResult`. | +| `.solve_batch(Y)` | Solve multiple RHS columns in parallel. Returns `BatchSolveResult`. | +| `.preconditioner()` | Return the built `FePreconditioner` (picklable), or `None`. | + + +### Solver configuration | Class | Description | |---|---| -| `CG(tol, maxiter, preconditioner, operator)` | Conjugate gradient on the normal equations. Default is additive Schwarz with the implicit operator. | -| `GMRES(tol, maxiter, restart, preconditioner, operator)` | Right-preconditioned GMRES on the normal equations. | +| `CG(tol=1e-8, maxiter=1000, operator=Implicit)` | Conjugate gradient on the normal equations. Default solver. | +| `GMRES(tol=1e-8, maxiter=1000, restart=30, operator=Implicit)` | Right-preconditioned GMRES. | ### Preconditioners | Class | Description | |---|---| -| `AdditiveSchwarz(...)` | Additive one-level Schwarz. Use with `CG` or `GMRES`. | -| `MultiplicativeSchwarz(...)` | Multiplicative one-level Schwarz. Use with `GMRES` only. 
| +| `AdditiveSchwarz(local_solver?)` | Additive one-level Schwarz. Works with CG and GMRES. | +| `MultiplicativeSchwarz(local_solver?)` | Multiplicative one-level Schwarz. GMRES only. | +| `Preconditioner.Off` | Disable preconditioning. | -## Rust quickstart +Pass `None` (the default) to use additive Schwarz with the default local solver. -```bash -cargo add within +### Local solver configuration (advanced) + +| Class | Description | +|---|---| +| `SchurComplement(approx_chol?, approx_schur?, dense_threshold=24)` | Schur complement reduction with approximate Cholesky on the reduced system. Default local solver. | +| `FullSddm(approx_chol?)` | Full bipartite SDDM factorized via approximate Cholesky. | +| `ApproxCholConfig(seed=0, split=1)` | Approximate Cholesky parameters. | +| `ApproxSchurConfig(seed=0, split=1)` | Approximate Schur complement sampling parameters. | + +### Result types + +**`SolveResult`**: `x` (coefficients), `demeaned` (residuals), `converged`, `iterations`, `residual`, `time_total`, `time_setup`, `time_solve`. + +**`BatchSolveResult`**: Same fields, with `converged`, `iterations`, `residual`, and `time_solve` as lists (one entry per RHS). + +## Rust API + +```rust +use ndarray::Array2; +use within::{solve, SolverParams, KrylovMethod, Preconditioner, LocalSolverConfig}; + +let categories = /* Array2 of shape (n_obs, n_factors) */; +let y: &[f64] = /* response vector */; + +// Default: CG + additive Schwarz +let r = solve(categories.view(), &y, None, &SolverParams::default(), None)?; +assert!(r.converged); + +// GMRES + multiplicative Schwarz +let params = SolverParams { + krylov: KrylovMethod::Gmres { restart: 30 }, + ..SolverParams::default() +}; +let precond = Preconditioner::Multiplicative(LocalSolverConfig::default()); +let r = solve(categories.view(), &y, None, ¶ms, Some(&precond))?; ``` -See [`crates/within/README.md`](crates/within/README.md) for Rust API usage and architecture details. 
+Persistent solver — build once, solve many: + +```rust +use within::Solver; + +let solver = Solver::new(categories.view(), None, &SolverParams::default(), None)?; +let r1 = solver.solve(&y)?; +let r2 = solver.solve(&another_y)?; // reuses preconditioner +``` + +| Type | Variants / Fields | +|---|---| +| `SolverParams` | `krylov: KrylovMethod`, `operator: OperatorRepr`, `tol: f64`, `maxiter: usize` | +| `KrylovMethod` | `Cg` (default), `Gmres { restart }` | +| `OperatorRepr` | `Implicit` (default, matrix-free D'WD), `Explicit` (CSR Gramian) | +| `Preconditioner` | `Additive(LocalSolverConfig)`, `Multiplicative(LocalSolverConfig)` | +| `LocalSolverConfig` | `SchurComplement { approx_chol, approx_schur, dense_threshold }`, `FullSddm { approx_chol }` | + +### Lower-level types + +The crate exposes its internals for advanced use: + +| Module | Key types | +|---|---| +| `observation` | `FactorMajorStore`, `ArrayStore`, `ObservationStore` trait | +| `domain` | `WeightedDesign`, `FixedEffectsDesign`, `Subdomain` | +| `operator` | `Gramian` (CSR), `GramianOperator` (implicit), `DesignOperator`, `build_schwarz`, `FeSchwarz` | +| `solver` | `Solver` — generic persistent solver | + +### Feature flags + +| Feature | Default | Effect | +|---|---|---| +| `ndarray` | yes | Enables `from_array` constructors for `ndarray::ArrayView2` interop. | ## Project structure @@ -76,7 +174,7 @@ crates/ within/ Core fixed-effects solver (observation stores, domains, operators, orchestration) within-py/ PyO3 bridge (cdylib → within._within) python/within/ Python package re-exporting the Rust extension -benchmarks/ Python benchmark framework (18 suites) +benchmarks/ Python benchmark framework ``` ## Development @@ -99,18 +197,7 @@ MIT ## References -- Correia, S. (2017). Linear Models with High-Dimensional Fixed Effects: An - Efficient and Feasible Estimator. - https://hdl.handle.net/10.2139/ssrn.3129010 - -- Gaure, S. (2013). OLS with Multiple High Dimensional Category Variables. 
- *Computational Statistics & Data Analysis*, 66, 2--17. - -- Gao, Y., Kyng, R. & Spielman, D. A. (2025). AC(k): Robust Solution of - Laplacian Equations by Randomized Approximate Cholesky Factorization. - *SIAM Journal on Scientific Computing*. - -- Xu, J. (1992). Iterative Methods by Space Decomposition and Subspace - Correction. *SIAM Review*, 34(4), 581--613. - +- Correia, Sergio. "A feasible estimator for linear models with multi-way fixed effects." *Preprint* at http://scorreia.com/research/hdfe.pdf (2016). +- Gao, Y., Kyng, R. & Spielman, D. A. (2025). AC(k): Robust Solution of Laplacian Equations by Randomized Approximate Cholesky Factorization. *SIAM Journal on Scientific Computing*. - Toselli & Widlund (2005). *Domain Decomposition Methods — Algorithms and Theory*. Springer. +- Xu, J. (1992). Iterative Methods by Space Decomposition and Subspace Correction. *SIAM Review*, 34(4), 581--613. diff --git a/crates/within/README.md b/crates/within/README.md index a8bb868..6fb13a7 100644 --- a/crates/within/README.md +++ b/crates/within/README.md @@ -16,66 +16,60 @@ cargo add within ## Quick example ```rust -use within::{ - solve, GmresPrecond, LocalSolverConfig, OperatorRepr, SolverMethod, SolverParams, -}; +use ndarray::Array2; +use within::{solve, SolverParams, KrylovMethod, Preconditioner, LocalSolverConfig}; // Two factors: 100 levels each, 10 000 observations -let factor_0: Vec = (0..10_000).map(|i| (i % 100) as u32).collect(); -let factor_1: Vec = (0..10_000).map(|i| (i / 100) as u32).collect(); -let y: Vec = (0..10_000).map(|i| i as f64 * 0.01).collect(); +let n_obs = 10_000usize; +let mut categories = Array2::::zeros((n_obs, 2)); +for i in 0..n_obs { + categories[[i, 0]] = (i % 100) as u32; + categories[[i, 1]] = (i / 100) as u32; +} +let y: Vec = (0..n_obs).map(|i| i as f64 * 0.01).collect(); // Solve with the default solver: CG + additive Schwarz + implicit operator -let result = solve( - &[factor_0.clone(), factor_1.clone()], - &[100, 100], - &y, - 
None, - &SolverParams::default(), -).expect("solve should succeed"); +let result = solve(categories.view(), &y, None, &SolverParams::default(), None) + .expect("solve should succeed"); assert!(result.converged); -println!("default CG converged in {} iterations", result.iterations); +println!("CG converged in {} iterations", result.iterations); -// Solve with GMRES + multiplicative Schwarz on the implicit operator +// GMRES with multiplicative Schwarz let params = SolverParams { - method: SolverMethod::Gmres { - preconditioner: Some(GmresPrecond::Multiplicative(LocalSolverConfig::gmres_default())), - operator: OperatorRepr::Implicit, - restart: 30, - }, - tol: 1e-8, - maxiter: 1000, + krylov: KrylovMethod::Gmres { restart: 30 }, + ..SolverParams::default() }; -let result = solve( - &[factor_0, factor_1], - &[100, 100], - &y, - None, - ¶ms, -).expect("solve should succeed"); +let precond = Preconditioner::Multiplicative(LocalSolverConfig::default()); +let result = solve(categories.view(), &y, None, ¶ms, Some(&precond)) + .expect("solve should succeed"); assert!(result.converged); println!("GMRES converged in {} iterations", result.iterations); ``` +## Feature flags + +| Feature | Default | Effect | +|-----------|---------|------------------------------------------------------------------------| +| `ndarray` | yes | Enables `from_array` constructors on observation stores for interop with `ndarray::ArrayView2`. | + ## Architecture The crate is organized in four layers: -1. **`observation`** -- Per-observation factor levels and weights via +1. **`observation`** — Per-observation factor levels and weights via `FactorMajorStore` and the `ObservationStore` trait. -2. **`domain`** -- Domain decomposition. `WeightedDesign` wraps a store with +2. **`domain`** — Domain decomposition. `WeightedDesign` wraps a store with factor metadata; `build_local_domains` constructs factor-pair subdomains with partition-of-unity weights for the Schwarz preconditioner. -3. 
**`operator`** -- Linear algebra primitives. `Gramian` (explicit CSR) and +3. **`operator`** — Linear algebra primitives. `Gramian` (explicit CSR) and `GramianOperator` (implicit D^T W D) for the normal equations; Schwarz preconditioner builders that wire approximate Cholesky local solvers into the generic `schwarz-precond` framework. -4. **`orchestrate`** -- End-to-end solve entry points (`solve`, - `solve_normal_equations`) with - typed configuration (`SolverParams`, `SolverMethod`, `GmresPrecond`, +4. **`orchestrate`** — End-to-end solve entry points (`solve`, `solve_batch`) + with typed configuration (`SolverParams`, `KrylovMethod`, `Preconditioner`, `OperatorRepr`). ## License @@ -84,15 +78,7 @@ MIT ## References -- Correia, S. (2017). Linear Models with High-Dimensional Fixed Effects: An - Efficient and Feasible Estimator. *Working paper*. - https://hdl.handle.net/10.2139/ssrn.3129010 - -- Gaure, S. (2013). OLS with Multiple High Dimensional Category Variables. - *Computational Statistics & Data Analysis*, 66, 2--17. - https://doi.org/10.1016/j.csda.2013.03.024 - -- Spielman, D. A. & Teng, S.-H. (2014). Nearly Linear Time Algorithms for - Preconditioning and Solving Symmetric, Diagonally Dominant Linear Systems. - *SIAM Journal on Matrix Analysis and Applications*, 35(3), 835--885. - https://doi.org/10.1137/090771430 +- Correia, Sergio. "A feasible estimator for linear models with multi-way fixed effects." *Preprint* at http://scorreia.com/research/hdfe.pdf (2016). +- Gao, Y., Kyng, R. & Spielman, D. A. (2025). AC(k): Robust Solution of Laplacian Equations by Randomized Approximate Cholesky Factorization. *SIAM Journal on Scientific Computing*. +- Toselli & Widlund (2005). *Domain Decomposition Methods — Algorithms and Theory*. Springer. +- Xu, J. (1992). Iterative Methods by Space Decomposition and Subspace Correction. *SIAM Review*, 34(4), 581--613. 
diff --git a/docs/1_fixed_effects_and_block_methods.md b/docs/1_fixed_effects_and_block_methods.md new file mode 100644 index 0000000..6c31f0c --- /dev/null +++ b/docs/1_fixed_effects_and_block_methods.md @@ -0,0 +1,286 @@ +# Part 1: Fixed Effects and Block Iterative Methods + +This is Part 1 of the algorithm documentation for the `within` solver. It introduces the fixed-effects estimation problem, derives the normal equations and their block structure, describes the classical demeaning algorithm and its limitations, and motivates the domain decomposition approach in [Part 2](2_solver_architecture.md). + +**Series overview**: +- **Part 1: Fixed Effects and Block Iterative Methods** (this document) +- [Part 2: Preconditioned Krylov Solvers and Schwarz Decomposition](2_solver_architecture.md) +- [Part 3: Local Solvers and Approximate Cholesky](3_local_solvers.md) + +--- + +## Notation + +| Symbol | Meaning | +|--------|---------| +| $n$ | Number of observations | +| $Q$ | Number of factors (fixed-effect grouping variables) | +| $m_q$ | Number of levels (categories) in factor $q$ | +| $m = \sum_q m_q$ | Total degrees of freedom / total number of factor levels over all fixed effects| +| $D$ | Design matrix ($n \times m$), with exactly $Q$ ones per row | +| $W$ | Diagonal weight matrix ($n \times n$), $W = \text{diag}(w_1, \dots, w_n)$ | +| $G$ | Gramian matrix, $G = D^\top W D$ ($m \times m$) | +| $y$ | Response vector ($n \times 1$) | +| $\alpha$ | Fixed-effects coefficient vector ($m \times 1$) | +| $D_q$ | Diagonal block of $G$ for factor $q$ (weighted level counts) | +| $C_{qr}$ | Cross-tabulation block of $G$ for factor pair $(q, r)$ | + +--- + +## 1. The Fixed-Effects Model + +A panel dataset records $n$ observations, each classified by $Q$ categorical factors that econometricians commonly refer to as "fixed effects". Factor $q$ has $m_q$ distinct levels. 
The linear fixed-effects model is: + +$$ +y_i = \sum_{q=1}^{Q} \alpha_{q, f_q(i)} + \varepsilon_i, \qquad i = 1, \dots, n +$$ + +where $f_q(i) \in \{1, \dots, m_q\}$ is the level of factor $q$ for observation $i$ and $\alpha_{q,j}$ is the coefficient for level $j$ of factor $q$. + +In matrix form, we can write this as + +$$ +y = D\alpha + \varepsilon, +$$ + +where $D$ is the sparse $n \times m$ design matrix of 0s and 1s. Each row of $D$ has exactly $Q$ ones, one in each factor's column block. + +--- + +## 2. Weighted Normal Equations + +To fit $\alpha$, we minimize the weighted least-squares objective + +$$\hat{\alpha} = \arg\min_\alpha \sum_{i=1}^{n} w_i \left( y_i - +\sum_{q=1}^{Q} \alpha_{q,f_q(i)} \right)^2 +$$ + +where $W = \text{diag}(w_1, \dots, w_n)$ is a diagonal weight matrix (the +unweighted case corresponds to $W = I$). Setting the gradient to zero yields +the weighted normal equations: + +$$ +G\alpha = D^\top W y, \qquad G = D^\top W D +$$ + +The Gramian $G$ is symmetric positive semi-definite and always singular: within each connected component of the factor interaction graph, a constant can be shifted between factors without changing $D\alpha$. The system is always consistent, and the solver (starting from zero) converges to the minimum-norm solution. + +--- + +## 3. Block Structure of the Gramian + +### 3.1 Factor-block partition + +The $m \times m$ Gramian inherits a natural block structure from the factor partition.
With columns ordered by factor ($m_1$ columns for factor 1, then $m_2$ for factor 2, etc.): + +$$ +G = \begin{pmatrix} +{\color{royalblue}D_1} & C_{12} & C_{13} & \cdots & C_{1Q} \\ +C_{12}^\top & {\color{crimson}D_2} & C_{23} & \cdots & C_{2Q} \\ +C_{13}^\top & C_{23}^\top & {\color{forestgreen}D_3} & \cdots & C_{3Q} \\ +\vdots & \vdots & \vdots & \ddots & \vdots \\ +C_{1Q}^\top & C_{2Q}^\top & C_{3Q}^\top & \cdots & {\color{goldenrod}D_Q} +\end{pmatrix} +$$ + +The blocks are: + +- **Diagonal blocks** $D_q$ ($m_q \times m_q$, diagonal): $D_q[j,j] = \sum_{i:\, f_q(i) = j} w_i$ - the weighted count of observations at level $j$ of factor $q$. + +- **Cross-tabulation blocks** $C_{qr}$ ($m_q \times m_r$, sparse): $C_{qr}[j,k] = \sum_{i:\, f_q(i) = j,\; f_r(i) = k} w_i$ - the weighted count of observations simultaneously at level $j$ of factor $q$ and level $k$ of factor $r$. + +There are $Q$ diagonal blocks and $\binom{Q}{2}$ cross-tabulation blocks. + +### 3.2 A concrete example + +Consider a worker-firm data set (often called an AKM-style panel) with $n = 6$ observations and $Q = 3$ factors (worker IDs, firm IDs, and years). Worker W1 moves from Firm F1 to F2. This mobility is what connects the two firms in the estimation graph. Workers W2 (at F1) and W3 (at F2) are each observed at only a single firm. + +| Obs | Worker ($f_1$) | Firm ($f_2$) | Year ($f_3$) | Weight | $y$ | +|-----|---------|------|------|--------|------| +| 1 | W1 | F1 | Y1 | 1 | 3.2 | +| 2 | W1 | F2 | Y2 | 1 | 4.1 | +| 3 | W2 | F1 | Y1 | 1 | 2.8 | +| 4 | W2 | F1 | Y2 | 1 | 3.9 | +| 5 | W3 | F2 | Y1 | 1 | 5.0 | +| 6 | W3 | F2 | Y2 | 1 | 4.5 | + +Factor 1 (the worker fixed effect) has $m_1 = 3$ levels: {W1, W2, W3}. Factor 2 (the firm fixed effect) has $m_2 = 2$ levels: {F1, F2}. Lastly, Factor 3 (Year) has $m_3 = 2$ levels: {Y1, Y2}. In total, we have $m = 7$ fixed-effect levels across all factors.
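The design matrix and Gramian for this toy dataset can be built directly. A short numpy sketch (levels ordered W1–W3, F1–F2, Y1–Y2; the codes below are transcribed from the table above):

```python
import numpy as np

# Category codes per observation (worker, firm, year), 0-indexed,
# transcribed from the table above
codes = np.array([
    [0, 0, 0],  # obs 1: W1, F1, Y1
    [0, 1, 1],  # obs 2: W1, F2, Y2
    [1, 0, 0],  # obs 3: W2, F1, Y1
    [1, 0, 1],  # obs 4: W2, F1, Y2
    [2, 1, 0],  # obs 5: W3, F2, Y1
    [2, 1, 1],  # obs 6: W3, F2, Y2
])
sizes = [3, 2, 2]                                  # m_q for worker, firm, year
offsets = np.concatenate(([0], np.cumsum(sizes)))  # first column of each factor block

n, m = codes.shape[0], sum(sizes)                  # n = 6 observations, m = 7 levels
D = np.zeros((n, m))
for q in range(codes.shape[1]):
    D[np.arange(n), offsets[q] + codes[:, q]] = 1.0

G = D.T @ D        # unweighted Gramian (W = I)
print(np.diag(G))  # weighted level counts: 2 per worker, 3 per firm, 3 per year
```

Each row of `D` has exactly $Q = 3$ ones, and the diagonal of `G` recovers the per-level observation counts quoted in the text.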
+ +The Gramian has $Q = 3$ diagonal blocks and $\binom{3}{2} = 3$ cross-tabulation blocks: + +$$ +G = \begin{pmatrix} +{\color{royalblue}D_W} & {\color{gray}C_{WF}} & {\color{gray}C_{WY}} \\ +{\color{gray}C_{WF}^\top} & {\color{crimson}D_F} & {\color{gray}C_{FY}} \\ +{\color{gray}C_{WY}^\top} & {\color{gray}C_{FY}^\top} & {\color{forestgreen}D_Y} +\end{pmatrix} += \left(\begin{array}{ccc|cc|cc} +{\color{royalblue}2} & {\color{royalblue}0} & {\color{royalblue}0} & {\color{gray}1} & {\color{gray}1} & {\color{gray}1} & {\color{gray}1} \\ +{\color{royalblue}0} & {\color{royalblue}2} & {\color{royalblue}0} & {\color{gray}2} & {\color{gray}0} & {\color{gray}1} & {\color{gray}1} \\ +{\color{royalblue}0} & {\color{royalblue}0} & {\color{royalblue}2} & {\color{gray}0} & {\color{gray}2} & {\color{gray}1} & {\color{gray}1} \\ +\hline +{\color{gray}1} & {\color{gray}2} & {\color{gray}0} & {\color{crimson}3} & {\color{crimson}0} & {\color{gray}2} & {\color{gray}1} \\ +{\color{gray}1} & {\color{gray}0} & {\color{gray}2} & {\color{crimson}0} & {\color{crimson}3} & {\color{gray}1} & {\color{gray}2} \\ +\hline +{\color{gray}1} & {\color{gray}1} & {\color{gray}1} & {\color{gray}2} & {\color{gray}1} & {\color{forestgreen}3} & {\color{forestgreen}0} \\ +{\color{gray}1} & {\color{gray}1} & {\color{gray}1} & {\color{gray}1} & {\color{gray}2} & {\color{forestgreen}0} & {\color{forestgreen}3} +\end{array}\right) +$$ + +$D_W$ is $3 \times 3$ (one row/column per worker) with 2s on the diagonal because each worker appears in exactly 2 observations (e.g. W1 in obs 1, 2). Off-diagonals are zero because no observation belongs to two workers. $D_F$ is $2 \times 2$ with 3s on the diagonal because each firm appears in 3 observations (F1 in obs 1, 3, 4; F2 in obs 2, 5, 6). The cross-tabulation block $C_{WY}$ is $3 \times 2$ (3 workers $\times$ 2 years); entry $[j,k]$ counts observations where worker $j$ is observed in year $k$. Here every worker appears once per year, so $C_{WY}$ is all ones. 
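Each cross-tabulation block is simply a (weighted) contingency table of two factors' codes. A quick numpy check for $C_{WF}$, using the worker and firm codes from the table above:

```python
import numpy as np

# Worker and firm codes for the 6 observations (0-indexed)
workers = np.array([0, 0, 1, 1, 2, 2])  # W1, W1, W2, W2, W3, W3
firms   = np.array([0, 1, 0, 0, 1, 1])  # F1, F2, F1, F1, F2, F2
weights = np.ones(6)                    # unweighted: w_i = 1

# C_WF[j, k] = sum of weights over observations with worker j at firm k
C_WF = np.zeros((3, 2))
np.add.at(C_WF, (workers, firms), weights)
print(C_WF)
# [[1. 1.]   W1 observed once at each firm (the mover)
#  [2. 0.]   W2 observed twice at F1
#  [0. 2.]]  W3 observed twice at F2
```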
+ +The Gramian's sparsity pattern defines an interaction graph on all $m = 7$ factor levels. Each node is a factor level, and each nonzero entry $C_{qr}[j,k]$ in a cross-tabulation block becomes an edge between level $j$ of factor $q$ and level $k$ of factor $r$. For example, $C_{WF}[\text{W1},\text{F1}] = 1$ (W1 is observed once at F1) creates the W1–F1 edge, while $C_{WF}[\text{W2},\text{F1}] = 2$ (W2 is observed twice at F1) creates the W2–F1 edge with weight 2: + +![Gramian interaction graph](images/graph_plain.svg) + +No edges connect degrees of freedom (factor levels) within the same factor. As edges only cross factor boundaries, we say that the graph is $Q$-partite. Each factor-pair subgraph is bipartite. The diagonal entries reflect the same structure: $D_Y[\text{Y1},\text{Y1}] = 3$ because Y1 appears in 3 observations (obs 1, 3, 5), giving it weighted degree 3 in each bipartite subgraph it belongs to. The graph is connected because W1's mobility between F1 and F2 bridges the two firms; without it, the graph would split into disconnected components. + +### 3.3 Key properties of the Fixed Effects Problem + +We note two properties of the Gramian $G$ that drive the algorithmic design of solvers for the fixed effects problem: + +> 1. **The diagonal blocks $D_q$ are diagonal matrices**, which are trivially invertible. This makes the classical demeaning algorithm (Section 4) cheap per iteration. +> +> 2. **The cross-tabulation blocks $C_{qr}$ are typically sparse** - an entry is nonzero only when at least one observation has that specific (level-of-$q$, level-of-$r$) combination. + +--- + +## 4. The Classical Solution: Iterative Demeaning + +### 4.1 The Demeaning Algorithm / MAP Algorithm + +The classical approach to solving the normal equations for fixed effects regression goes back to Guimarães & Portugal (2010) and Gaure (2013). The algorithm sweeps through factors one at a time, updating each factor's coefficients while holding the others fixed.
Guimarães and Portugal called this the "Zig-Zag", but the method is commonly known as the "method of alternating projections" or MAP algorithm. + +More concretely, for factor $q$, each coefficient is set to the weighted average of the factor-$q$ partial residual at that level (for observation $i$, this is $y_i$ minus all fitted effects except factor $q$): + +$$ +\alpha_{q,j} \leftarrow \frac{\sum_{i:\, f_q(i)=j} w_i\, (y_i - \sum_{r \neq q} \alpha_{r,f_r(i)})}{\sum_{i:\, f_q(i)=j} w_i} +$$ + +This is **demeaning by factor $q$**: for each level $j$, compute the weighted average of $y_i - \sum_{r \neq q} \alpha_{r,f_r(i)}$ across all observations $i$ with $f_q(i) = j$, and set $\alpha_{q,j}$ to that average. + +One full sweep updates all $Q$ factors in order - the MAP algorithm is identical to block Gauss-Seidel on the normal equations. Note that each factor update only involves the diagonal block $D_q$; the cross-tabulation blocks $C_{qr}$ are never used. + +### 4.2 Solving a Worker-Firm-Year Regression via the MAP Algorithm + +Continuing the Worker/Firm/Year example, one full sweep processes $Q = 3$ factors in order: + +**Step 1** We update the worker coefficients, holding the firm and year coefficients fixed. Each worker's coefficient becomes the weighted average of $(y_i - \text{firm effect} - \text{year effect})$ across that worker's observations. + +**Step 2** We then update the firm coefficients, using the **updated** worker values from step 1, but **stale** year values. Each firm's coefficient becomes the weighted average of $(y_i - \text{worker effect} - \text{year effect})$, but the year effects are still from the previous iteration. + +**Step 3** Finally, we update the year effects, using updated worker and updated firm values. + +This process is repeated until convergence. + +Only the last factor in a given sweep - years - sees fully updated values from all other factors.
Workers were updated with stale firm and year effects; firms were updated with stale year effects. When factors are correlated (e.g. high-skill workers sort into high-paying firms), the correct worker effect depends on the firm effect and vice versa: updating workers with a stale firm estimate means the worker coefficients won't land in the right place, and the subsequent firm update has to compensate - partially undoing the worker correction. With $Q = 3$, two out of three updates use out-of-date information. As $Q$ grows, the problem worsens: more of each sweep is spent on corrections that the remaining updates will undo, and the method needs more sweeps to converge (see Section 4.4). + +### 4.3 Convergence rate + +How many more sweeps? For two factors, the error contracts by $\cos^2(\theta_F)$ per sweep, where $\theta_F$ is the Friedrichs angle between the factor subspaces: + +![Convergence: high vs low mobility](images/convergence_zigzag.svg) + +The convergence rate is determined by the structure of the cross-tabulation matrix $C_{WF}$. Consider two labor markets, both with 6 workers and 2 firms: + +**High mobility labour market** - Every worker is observed at both firms: + +$$ +C_{WF} = \begin{pmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \end{pmatrix} +$$ + +Every row has weight at both firms. If W1 has a high $y_i$ at F1 *and* a high $y_i$ at F2, that's evidence for a high worker effect of W1; if only the F1 observations are high, that's evidence for a high firm effect, and it is relatively straightforward to disentangle the two. The factor subspaces are nearly orthogonal, $\cos(\theta_F)$ is small, and convergence is fast. + +**Low mobility labour market** - There is only one mover (W1), the rest are stayers: + +$$ +C_{WF} = \begin{pmatrix} 1 & 1 \\ 2 & 0 \\ 2 & 0 \\ 0 & 2 \\ 0 & 2 \\ 0 & 2 \end{pmatrix} +$$ + +The matrix is nearly block-diagonal. W2 and W3 are only observed at F1; W4–W6 only at F2. 
For these stayers, every observation of the worker is also an observation of their firm - the effects are confounded. The only information separating the two firms comes through W1, the single mover. The subspaces are nearly collinear, $\cos(\theta_F) \approx 1$, and the algorithm makes little progress per sweep - the information separating workers from firms lives in $C_{WF}$, which the algorithm never touches. In the limit where there are no movers at all, $C_{WF}$ becomes exactly block-diagonal, the system splits into disconnected components, and $\cos(\theta_F) = 1$. + +For more than two fixed effects ($Q > 2$), any near-collinear pair bottlenecks the entire iteration. + +### 4.4 Limitations of the MAP Algorithm + +Based on the exposition above, there are three structural limitations of the MAP algorithm: + +1. **The local solve can't be improved** - each factor update already solves $D_q$ exactly (it's diagonal), so there's nothing to gain within a single factor. The only way to make progress is to do more sweeps. + +2. **The cross-factor structure is ignored** - the algorithm only ever touches the diagonal blocks $D_q$ of the Gramian, which makes each sweep cheap but means it's entirely oblivious to the cross-tabulation blocks $C_{qr}$ that couple the factors. + +3. **Degradation with more factors** - with more than two fixed effects ($Q > 2$), each factor's update uses stale values from factors not yet processed in the current sweep. The effective convergence rate worsens as the number of interacting pairs grows. + +All three motivate the algorithm used in `within`. + +--- + +## 5. The Domain Decomposition Perspective + +### 5.1 Demeaning as a domain decomposition method + +As we saw in Section 4, the MAP algorithm sweeps through one factor at a time, solving the diagonal block $D_q$ at each step.
In domain decomposition terminology, this is a **multiplicative Schwarz** method with $Q$ non-overlapping subdomains, one per factor: + +![Factor-level decomposition](images/graph_factor_level.svg) + +This is essentially the same interaction graph from Section 3.2, now grouped into factor-level subdomains. Every edge crosses a subdomain boundary - the local operators $D_q$ are trivially diagonal, with no internal edges. The local solves are trivial but they capture **none** of the cross-factor coupling. + +### 5.2 The key idea: factor-pair subdomains + +`within` uses a fundamentally different decomposition: subdomains are **factor pairs**, not individual factors. + +For each pair $(q, r)$, the subdomain contains all levels from both factors. The local operator is the full bipartite block: + +$$ +A_{qr} = \begin{pmatrix} D_q & C_{qr} \\ C_{qr}^\top & D_r \end{pmatrix} +$$ + +This captures the complete interaction between the two factors — the diagonal counts **and** the cross-tabulation. In practical applications, these bipartite systems are often too large to solve exactly, but their structure — a bipartite graph that can be transformed into a graph Laplacian — admits nearly-linear-time approximate solvers (Schur complement reduction + approximate Cholesky, see [Part 3](3_local_solvers.md)). + +The figure below illustrates the difference between what the MAP algorithm and the `within` algorithm "see". The MAP algorithm is only aware of the diagonal blocks $D_q$, which are cheap to invert, but ignores all cross-factor coupling - the off-diagonal blocks $C_{qr}$. `within`'s domain decomposition solver instead incorporates the cross-factor structure into each local solve, at the cost of solving a larger, coupled system per subdomain. This is the central tradeoff: the MAP algorithm buys cheaper iterations by ignoring part of the problem's structure, while `within` accepts more expensive iterations that exploit that structure.
+ +![Factor-level vs factor-pair local solve](images/local_solve_comparison.svg) + +| Property | Factor-level (demeaning) | Factor-pair (`within`) | +|---|---|---| +| Local solve | **Exact** (diagonal inversion) | **Approximate** (sampled Cholesky) | +| Coupling captured | None - ignores $C_{qr}$ entirely | Full pairwise interaction | +| Number of subdomains | $Q$ | One per connected component, across up to $\binom{Q}{2}$ factor pairs | +| Overlap | None | Yes - each factor level appears in $Q - 1$ pairs | + +Demeaning solves each factor *exactly* but learns nothing about cross-factor structure. Factor-pair subdomains solve each pair *approximately* but capture the coupling that makes convergence slow in the first place. As a result, the approximate pair-solves carry much more information per iteration, and fewer outer iterations are needed to converge. + +### 5.3 Continuing the Worker-Firm Example + +The Worker/Firm/Year example ($Q = 3$) produces $\binom{3}{2} = 3$ factor-pair subdomains: + +| Subdomain | Factor pair | Factor Levels | Size | +|-----------|-------------|------|------| +| 1 | (Worker, Firm) | W1, W2, W3, F1, F2 | 5 | +| 2 | (Worker, Year) | W1, W2, W3, Y1, Y2 | 5 | +| 3 | (Firm, Year) | F1, F2, Y1, Y2 | 4 | + +Each factor level appears in $Q - 1 = 2$ subdomains, requiring partition-of-unity weights to prevent double-counting (see [Part 2, Section 4.2](2_solver_architecture.md#42-partition-of-unity)). + +To illustrate the process, we draw the same interaction graph from the beginning of this section, but now with three overlapping factor-pair subdomains drawn around it: + +![Factor-pair decomposition](images/graph_factor_pair.svg) + +The Worker–Firm subdomain (red) covers the top two rows. The Firm–Year subdomain (blue) covers the bottom two. The Worker–Year subdomain (green) wraps around the Firms in a C-shape. Every edge is now *inside* a subdomain. Each node sits in exactly 2 of the 3 boxes, which is why partition-of-unity weights are needed.
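A factor-pair block can be assembled directly from the per-observation category codes. A sketch using SciPy sparse matrices (illustrative only, not `within`'s internal representation):

```python
import numpy as np
from scipy.sparse import coo_matrix

def factor_pair_block(fq, fr):
    """Assemble the diagonal counts D_q, D_r and the cross-tabulation
    C_qr for one factor pair from per-observation category codes."""
    mq, mr = fq.max() + 1, fr.max() + 1
    Dq = np.bincount(fq, minlength=mq).astype(float)   # level counts, factor q
    Dr = np.bincount(fr, minlength=mr).astype(float)   # level counts, factor r
    # duplicate (fq, fr) entries are summed on conversion to CSR
    C = coo_matrix((np.ones(len(fq)), (fq, fr)), shape=(mq, mr)).tocsr()
    return Dq, Dr, C

rng = np.random.default_rng(0)
worker = rng.integers(0, 5, 30)
firm = rng.integers(0, 3, 30)
Dq, Dr, C = factor_pair_block(worker, firm)

# Row/column sums of C_qr equal the diagonal counts -- the property that
# makes the sign-flipped block a valid graph Laplacian (see Part 2).
assert np.allclose(np.asarray(C.sum(axis=1)).ravel(), Dq)
assert np.allclose(np.asarray(C.sum(axis=0)).ravel(), Dr)
```

The asserted identity $D_q[j,j] = \sum_k C_{qr}[j,k]$ is exactly the zero-row-sum property used by the sign-flip transformation later in the series.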
+ +### 5.4 Where this leads + +The factor-pair decomposition raises three algorithmic questions: + +1. **How can we solve the local systems efficiently?** Via an approximate Cholesky factorization with Schur complement reduction ([Part 3](3_local_solvers.md)) +2. **How can we combine the local corrections?** Via additive or multiplicative Schwarz with partition-of-unity weights ([Part 2](2_solver_architecture.md)) +3. **How can we drive the global iteration?** Via preconditioned CG or GMRES ([Part 2](2_solver_architecture.md)) + +--- + +## References + +**Correia, S.** (2016). *A Feasible Estimator for Linear Models with Multi-Way Fixed Effects*. Working paper. Describes the fixed-effects normal equations, their block structure, and iterative solution via alternating projections. + +**Xu, J.** (1992). *Iterative Methods by Space Decomposition and Subspace Correction*. SIAM Review, 34(4), 581–613. Provides the abstract space decomposition framework for additive and multiplicative Schwarz methods. + +**Toselli, A. & Widlund, O. B.** (2005). *Domain Decomposition Methods - Algorithms and Theory*. Springer. Comprehensive reference for the theory and convergence analysis of Schwarz methods. diff --git a/docs/2_solver_architecture.md b/docs/2_solver_architecture.md new file mode 100644 index 0000000..619ebea --- /dev/null +++ b/docs/2_solver_architecture.md @@ -0,0 +1,194 @@ +# Part 2: Preconditioned Krylov Solvers and Schwarz Decomposition + +This is Part 2 of the algorithm documentation for the `within` solver. It describes the three-layer solver architecture, the graph structure that drives the decomposition, the Krylov outer iteration, and the Schwarz preconditioner framework.
+ +**Series overview**: +- [Part 1: Fixed Effects and Block Iterative Methods](1_fixed_effects_and_block_methods.md) +- **Part 2: Preconditioned Krylov Solvers and Schwarz Decomposition** (this document) +- [Part 3: Local Solvers and Approximate Cholesky](3_local_solvers.md) + +**Prerequisites**: Part 1 (problem formulation, Gramian block structure, demeaning as multiplicative Schwarz). + +--- + +## 1. Three-Layer Architecture + +The solver combines three algorithmic ideas in a layered architecture: + +![Three-layer solver architecture](images/three_layer_architecture.svg) + +1. A **Krylov solver** - either conjugate gradient (CG) or restarted GMRES - iterates toward the solution, using the preconditioner to accelerate convergence. + +2. A **Schwarz preconditioner** decomposes the global system into overlapping subdomains derived from the Gramian's block structure, applies local solves independently (additive) or sequentially (multiplicative), and combines corrections using partition-of-unity weights. + +3. **Local solvers**. Each subdomain system is a bipartite Gramian block that becomes a graph Laplacian after a sign-flip, factored in nearly-linear time using approximate Cholesky (see [Part 3](3_local_solvers.md)). + +### Why this combination + +As discussed in [Part 1, Section 5.2](1_fixed_effects_and_block_methods.md#52-the-key-idea-factor-pair-subdomains), the central trade-off for solving the fixed effects problem is between **exact solves on single factors** (demeaning via the MAP algorithm) vs. **approximate solves on factor pairs** (`within`). The three layers exist to make this trade-off pay off: + +- **Block-bipartite structure**: each factor pair $(q,r)$ induces a bipartite subgraph. Connected components of these subgraphs form natural subdomains with limited overlap. + +- **Laplacian connection**: after a sign-flip transformation (Section 2), each bipartite block becomes a graph Laplacian. 
This unlocks nearly-linear-time *approximate* solvers - exact factorization would be too expensive for large subdomains, but the approximate Cholesky factorization ([Part 3](3_local_solvers.md)) produces factors that are accurate enough to make the Krylov solver converge in very few iterations. + +- **Spectral acceleration**: the Krylov outer solver compensates for the approximate nature of the local solves. Even though each preconditioner application is inexact, the Krylov iteration refines the solution globally. The preconditioner clusters the eigenvalues of $M^{-1}G$, reducing the iteration count from $O(\sqrt{\kappa})$ (unpreconditioned demeaning) to a count determined by the quality of the local solves rather than the global condition number. + +--- + +## 2. Graph Structure of the Gramian + +Part 1 derived the block structure of $G = D^\top W D$, with diagonal blocks $D_q$ and cross-tabulation blocks $C_{qr}$. It is convenient to write $G = \mathcal{D} + \mathcal{C}$, where $\mathcal{D} = \operatorname{block-diag}(D_1, \ldots, D_Q)$ collects the diagonal blocks and $\mathcal{C}$ collects the off-diagonal cross-tabulation blocks. This section describes the graph-theoretic properties that drive the domain decomposition. + +### 2.1 Factor-pair bipartite blocks + +Each cross-tabulation block $C_{qr}$ defines a weighted bipartite graph: the left vertices are the $m_q$ levels of factor $q$, the right vertices are the $m_r$ levels of factor $r$, and the edge weight between $j$ and $k$ is $C_{qr}[j,k]$ (nonzero when at least one observation has $f_q = j$ and $f_r = k$). + +The full factor-pair block is: + +$$ +G_{qr} = \begin{pmatrix} D_q & C_{qr} \\ C_{qr}^\top & D_r \end{pmatrix} +$$ + +### 2.2 Connected components as subdomains + +The bipartite graph of $C_{qr}$ may have multiple connected components. Each connected component defines an independent subproblem and becomes a subdomain of the Schwarz preconditioner. 
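Identifying these components is a standard graph operation. A sketch using SciPy's `connected_components` on the Worker/Firm toy example (illustrative only, not `within`'s internal code):

```python
import numpy as np
from scipy.sparse import bmat, coo_matrix
from scipy.sparse.csgraph import connected_components

# Toy Worker-Firm data: W1 works at F1 and F2, W2 at F1, W3 at F2
worker = np.array([0, 0, 1, 2])
firm   = np.array([0, 1, 0, 1])
C = coo_matrix((np.ones(4), (worker, firm)), shape=(3, 2))

# Embed the rectangular bipartite block C_qr in a symmetric
# adjacency matrix on all worker + firm vertices
A = bmat([[None, C], [C.T, None]])
n_comp, labels = connected_components(A, directed=False)
assert n_comp == 1   # W1's mobility connects everything: one subdomain

# Drop W1's second job: the graph splits into two subdomains,
# {W1, W2, F1} and {W3, F2}
C2 = coo_matrix((np.ones(3), (worker[[0, 2, 3]], firm[[0, 2, 3]])), shape=(3, 2))
A2 = bmat([[None, C2], [C2.T, None]])
assert connected_components(A2, directed=False)[0] == 2
```

Each component found this way would become one Schwarz subdomain for the pair.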
+ +| Full interaction graph | Worker–Firm subgraph | +|:---:|:---:| +| ![Full graph](images/graph_plain.svg) | ![Worker–Firm bipartite subgraph](images/graph_bipartite_wf.svg) | + +Continuing the Worker/Firm/Year example from [Part 1](1_fixed_effects_and_block_methods.md): extracting just the Worker–Firm edges (right) gives the bipartite graph of $C_{WF}$. Because W1 worked at both firms, the graph is connected — there is a path between any two nodes — so all 5 DOFs / factor levels belong to a single subdomain. Without W1's mobility, the graph would split into two components: {W1, W2, F1} and {W3, F2}, yielding two independent subdomains. + +### 2.3 Laplacian connection via sign-flip + +The bipartite block $G_{qr}$ has non-negative off-diagonal entries, so it is not directly a graph Laplacian. Negating the off-diagonal blocks produces one: + +$$ +L_{qr} = \begin{pmatrix} D_q & -C_{qr} \\ -C_{qr}^\top & D_r \end{pmatrix} +$$ + +This is a valid graph Laplacian: $L_{qr}$ is symmetric, has non-positive off-diagonal entries, and zero row sums. The zero row-sum property holds because every observation at level $j$ of factor $q$ has exactly one level in factor $r$, so $D_q[j,j] = \sum_k C_{qr}[j,k]$. + +Equivalently, $G_{qr} = S^\top L_{qr} S$ where $S = \mathrm{diag}(I, -I)$. Since $S$ is orthogonal ($S^{-1} = S^\top$), solving $G_{qr} x = b$ - the subproblem we started out with - via the Laplacian is straightforward: negate one block of $b$ to form $\tilde{b} = S b$, solve $L_{qr} z = \tilde{b}$, then recover $x = S z$ by negating the same block of $z$. This Laplacian structure is exploited by the local solvers described in [Part 3](3_local_solvers.md). + +--- + +## 3.
Krylov Outer Iteration + +The outer solver iteratively solves $G\alpha = b$ where $b = D^\top W y$, building a sequence of improving approximations $\alpha_0, \alpha_1, \ldots$ At each step it computes the residual $r_k = b - G\alpha_k$, applies the preconditioner to obtain a search direction $z_k = M^{-1}r_k$, and updates the solution. The preconditioner $M^{-1}$ — the Schwarz solver — is a cheap approximate inverse of $G$ that is not exact, but accurate enough so that the iteration converges in far fewer steps than an unpreconditioned solver would require. + +![Preconditioned vs unpreconditioned convergence](images/preconditioned_convergence.svg) + +### 3.1 Solver selection + +`within` provides two Krylov solvers. The choice between them depends on whether the preconditioner is symmetric; this in turn depends on whether additive or multiplicative Schwarz is used (see [Section 4](#4-schwarz-domain-decomposition)): + +- **CG** (conjugate gradient) is used when the preconditioner is symmetric — this is the case with additive Schwarz. CG is optimal for symmetric positive-definite systems. + +- **GMRES** (generalized minimal residual) is used when the preconditioner is non-symmetric — this is the case with multiplicative Schwarz, where the sequential processing breaks symmetry. + +### 3.2 Convergence criterion + +The solver converges when the normalized residual $\|r_k\|_2 / \|b\|_2 \leq \text{tol}$ (default $10^{-8}$). The residual $r_k = b - G\alpha_k$ measures how well the current $\alpha_k$ satisfies the normal equations — it is a property of the linear system, not a direct measure of how accurately the fixed effects have been removed from $y$. In practice the two are closely linked: a small residual implies that $\alpha_k$ is close to the true solution $\alpha^\ast$, and hence that the demeaned residuals $e = y - D\alpha_k$ are accurate. + +--- + +## 4.
Schwarz Domain Decomposition + +[Part 1, Section 5](1_fixed_effects_and_block_methods.md#5-the-domain-decomposition-perspective) introduced the Schwarz perspective and contrasted factor-level with factor-pair decompositions. This section provides the full algorithmic details. + +### 4.1 How it works + +The Schwarz preconditioner decomposes the global system into overlapping subdomains - one per factor pair - and applies local solves to each. The local operator on subdomain $i$ is $A_i = R_i G R_i^\top$, the principal submatrix of $G$ restricted to that subdomain's DOFs / factor levels. + +Two variants exist - additive and multiplicative - differing in how the local corrections are combined: + +![Additive vs multiplicative Schwarz](images/additive_vs_multiplicative.svg) + +### 4.2 Partition of unity + +When subdomains overlap (a DOF / factor level belongs to multiple subdomains), corrections must be weighted to avoid double-counting: + +![Partition of unity](images/partition_of_unity.svg) + +Each DOF / factor level $j$ that appears in $c_j$ subdomains gets weight $\omega_j = 1/\sqrt{c_j}$ in each subdomain. The weights are applied on both the restriction and prolongation sides, so they contribute $c_j \times \omega_j^2 = 1$ - correctly partitioning the correction. In the running example, every DOF / factor level appears in exactly 2 subdomains, so every weight is $\omega_j = 1/\sqrt{2}$. + +### 4.3 Additive Schwarz + +The additive Schwarz preconditioner applies all local solves independently and sums the weighted corrections: + +$$ +M^{-1}_{\text{add}} r = \sum_{i=1}^{N_s} R_i^\top \tilde{D}_i A_i^+ \tilde{D}_i R_i r +$$ + +Here $R_i$ restricts to subdomain $i$'s DOFs / factor levels, $\tilde{D}_i$ is the diagonal matrix of partition-of-unity weights $\omega_j$, and $A_i^+$ is the (approximate) inverse applied by the local solver. Each subdomain restricts the global residual to its local DOFs / factor levels (with partition-of-unity weights applied on input), solves the local system, and prolongates the correction back to the global space (with weights applied on output). All subdomains are processed independently and in parallel.
Because the weighting is applied symmetrically on both sides, the resulting preconditioner is symmetric, making it compatible with CG. + +### 4.4 Multiplicative Schwarz + +The multiplicative variant processes subdomains sequentially, updating the residual after each correction. Each subdomain "sees" corrections from earlier subdomains, generally improving convergence versus additive Schwarz. However, the sequential processing makes the preconditioner non-symmetric, requiring the use of the GMRES solver. + +### 4.5 Subdomain construction + +Subdomains are derived from the factor-pair structure of the Gramian: + +1. **Enumerate factor pairs**: all $\binom{Q}{2}$ unordered pairs $(q, r)$. +2. **Build cross-tabulation**: for each pair, scan observations to build the sparse bipartite block $C_{qr}$ and diagonal vectors $D_q$, $D_r$. +3. **Find connected components**: run DFS (depth-first search — a standard graph traversal that follows edges recursively until no new nodes are reachable) on the bipartite graph of $C_{qr}$ to identify independent components. +4. **Create subdomains**: each component becomes a subdomain with its global DOF / factor level indices. +5. **Compute partition-of-unity weights**: if subdomains overlap, count how many subdomains each DOF / factor level belongs to and assign $\omega_j = 1/\sqrt{c_j}$; for non-overlapping DOFs / factor levels the weight is trivially 1. + +Factor pairs are processed in parallel. + +--- + +## 5. Full Algorithm Summary + +We conclude with a summary of the full algorithm: + +### 5.1 Setup phase + +1. **Build weighted design** from `categories` and `weights` + - Infer $m_q$ (number of levels) per factor by scanning observations + +2. **For each factor pair $(q, r)$ in parallel:** + - Build cross-tabulation $C_{qr}$ and diagonal blocks $D_q$, $D_r$ + - Find connected components of the bipartite graph of $C_{qr}$ + - Create one subdomain per component, recording its global DOF / factor level indices + +3. 
**Compute partition-of-unity weights**: $\omega_j = 1/\sqrt{c_j}$ for each DOF / factor level $j$ + +4. **For each subdomain in parallel:** + - Build local Laplacian via sign-flip (Section 2.3) + - Factor with approximate Cholesky (or dense Cholesky for small systems; see Part 3) + +5. **Assemble** Schwarz preconditioner $M^{-1}$ from subdomain factors + +6. *(Optional)* Build explicit Gramian $G$ in CSR format from the cross-tabulation blocks + +### 5.2 Solve phase + +1. **Form right-hand side**: $b = D^\top W y$ + +2. **Select Krylov solver**: + - CG with additive Schwarz (symmetric preconditioner) + - GMRES with multiplicative Schwarz (non-symmetric) + +3. **Solve** $G\alpha = b$ with preconditioner $M^{-1}$ + - Each iteration: one $Gx$ product and one $M^{-1}r$ application + - Converge when $\|r_k\| / \|b\| \leq \text{tol}$ + +4. **Compute demeaned residuals**: $e = y - D\alpha$ + +5. **Verify**: report final residual $\|G\alpha - b\| / \|b\|$ + +--- + +## References + +**Correia, S.** (2016). *A Feasible Estimator for Linear Models with Multi-Way Fixed Effects*. Working paper. Describes the fixed-effects normal equations and their block structure. + +**Xu, J.** (1992). *Iterative Methods by Space Decomposition and Subspace Correction*. SIAM Review, 34(4), 581–613. Provides the abstract space decomposition framework for additive and multiplicative Schwarz methods. + +**Toselli, A. & Widlund, O. B.** (2005). *Domain Decomposition Methods - Algorithms and Theory*. Springer. Comprehensive reference for the theory and convergence analysis of Schwarz domain decomposition methods. diff --git a/docs/3_local_solvers.md b/docs/3_local_solvers.md new file mode 100644 index 0000000..71755f4 --- /dev/null +++ b/docs/3_local_solvers.md @@ -0,0 +1,127 @@ +# Part 3: Local Solvers and Approximate Cholesky + +This is Part 3 of the algorithm documentation for the `within` solver. 
It describes how each Schwarz subdomain is solved: the Schur complement reduction that shrinks the problem, and the approximate Cholesky factorization that solves it in nearly-linear time. + +**Series overview**: +- [Part 1: Fixed Effects and Block Iterative Methods](1_fixed_effects_and_block_methods.md) +- [Part 2: Preconditioned Krylov Solvers and Schwarz Decomposition](2_solver_architecture.md) +- **Part 3: Local Solvers and Approximate Cholesky** (this document) + +**Prerequisites**: Part 1 (Gramian block structure), Part 2 (Schwarz framework, Laplacian connection). + +--- + +## 1. The Local Solve Pipeline + +Each subdomain requires solving a system $A_i z_i = r_i$ where $A_i$ is the bipartite Gramian block for a factor pair. These systems are too large to solve exactly in practice - a Worker-Firm block in a real dataset may have millions of DOFs / factor levels. Instead, the solver produces an **approximate** factorization that is cheap to build and cheap to apply, trading exactness for speed. The Krylov outer solver ([Part 2](2_solver_architecture.md)) compensates for this approximation by refining the global solution across iterations. + +The local solve applies a pipeline of transformations that progressively simplify the problem: + +![Local solve pipeline](images/local_solve_pipeline.svg) + +The key stages are: + +0. We start by forming the **bipartite block** of a factor pair. +1. We then **sign-flip** the bipartite block into a graph Laplacian (zero row sums). +2. Next we **eliminate the larger factor** via a Schur complement, which reduces e.g. a (workers + firms) system to a firms-only system. This step can be exact or approximate (Section 3.3). +3. We **factor** the reduced system with approximate Cholesky - the main source of approximation. +4.
Last, we **solve** via cheap triangular substitution, then back-substitute to recover the full solution + +The approximation enters in steps 2 (optionally) and 3 (always): both use **clique-tree sampling** to avoid the quadratic fill that exact Gaussian elimination would produce. + +--- + +## 2. Laplacian Connection + +The local operator for a factor pair is: + +$$ +A_i = G_{\text{local}} = \begin{pmatrix} D_q & C \\ C^\top & D_r \end{pmatrix} +$$ + +Negating the off-diagonal blocks produces a graph Laplacian: + +$$ +L = \begin{pmatrix} D_q & -C \\ -C^\top & D_r \end{pmatrix} +$$ + +This works because every observation at level $j$ of factor $q$ has exactly one level in factor $r$, so $D_q[j,j] = \sum_k C[j,k]$ - the row sums are exactly zero. + +The local solve wrapper handles the sign convention: negate the second block of the RHS before solving, negate the second block of the solution after. Downstream transformations (approximate Schur complement, approximate Cholesky) may introduce small row-sum deficits; when this happens, **Gremban augmentation** adds one extra "ground" node connected to all others, absorbing the deficit and restoring a valid Laplacian. + +--- + +## 3. Schur Complement Reduction + +Since workers share no edges in the bipartite graph, we can **eliminate all workers at once** to get a smaller system on firms only. + +### 3.1 What elimination looks like on the graph + +![Eliminating workers from the bipartite graph](images/schur_eliminate_workers.svg) + +Each worker elimination removes that worker and creates **fill edges** between the firms they worked at. Worker W1 worked at F1 and F2, so eliminating W1 creates a direct F1–F2 connection. These fill edges encode indirect connections mediated through workers. + +Since workers have no edges between themselves (the graph is bipartite), all worker eliminations are independent and can proceed in parallel. 
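On this toy graph, the fill created by eliminating the workers can be checked numerically, anticipating the Schur complement formula of Section 3.2 (a dense sketch for illustration; real blocks are sparse):

```python
import numpy as np

# W1 works at F1 and F2; W2 at F1; W3 at F2 (unit weights)
C = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
Dq = np.diag(C.sum(axis=1))   # worker counts: 2, 1, 1
Dr = np.diag(C.sum(axis=0))   # firm counts: 2, 2

# Eliminate all workers at once: Schur complement onto the firms
S = Dr - C.T @ np.linalg.inv(Dq) @ C
print(S)
# [[ 0.5 -0.5]
#  [-0.5  0.5]]
```

The off-diagonal entry -0.5 is the F1–F2 fill edge created by eliminating W1 (W2 and W3, each attached to a single firm, contribute no fill), and the rows of $S$ still sum to zero, so the reduced system is again a Laplacian.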
+ +### 3.2 The reduced system + +Given the Laplacian: + +$$ +\begin{pmatrix} D_q & -C \\ -C^\top & D_r \end{pmatrix} +\begin{pmatrix} z_q \\ z_r \end{pmatrix} += \begin{pmatrix} b_q \\ b_r \end{pmatrix} +$$ + +Eliminate the larger block (say workers, $D_q$). Since $D_q$ is diagonal, this is exact and cheap: + +$$ +S\, z_r = b_r + C^\top D_q^{-1} b_q +$$ + +where the **Schur complement** is: + +$$ +S = D_r - C^\top D_q^{-1} C +$$ + +The Schur complement $S$ is a Laplacian on the smaller factor's levels (e.g. firms), with edge weights that capture indirect connections through the eliminated factor (e.g. workers). After solving for $z_r$, back-substitution recovers $z_q = D_q^{-1}(b_q + C\, z_r)$. + +### 3.3 Exact vs. approximate elimination + +Each eliminated worker with $d$ firm connections creates a dense **clique** of $\binom{d}{2}$ fill edges among its firms. A worker who worked at 100 firms creates $\binom{100}{2} = 4{,}950$ fill edges - this can be expensive. + +![Schur complement: clique vs. tree](images/schur_clique_vs_tree.svg) + +The **approximate** variant (Gao, Kyng, and Spielman, 2025) replaces each clique with a random **spanning tree** with only $d - 1$ edges instead of $\binom{d}{2}$. The tree weights are chosen so that the expected Laplacian matches the clique Laplacian (i.e. is an **unbiased estimator**). + +For a worker observed at 100 firms, this reduces the fill from 4,950 edges to just 99 - a 50× reduction - with no bias in expectation. + +--- + +## 4. Approximate Cholesky Factorization + +The approximate Cholesky algorithm (Gao, Kyng, and Spielman, 2025) factors the Schur complement $S$ into an approximate lower-triangular factor $\tilde{L}$ such that $\tilde{L}\tilde{L}^\top \approx S$. It is the same core idea as the Schur complement step, but applied sequentially to a general (non-bipartite) graph.
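The unbiasedness claim can be illustrated with a simplified sampler that importance-samples $d-1$ clique edges. Unlike the actual sampler of Gao et al., the sampled edges need not form a spanning tree, but the estimator is likewise unbiased, which a Monte Carlo average confirms:

```python
import numpy as np
from itertools import combinations

def edge_laplacian(n, i, j, w):
    """Laplacian of a single weighted edge (i, j) on n vertices."""
    L = np.zeros((n, n))
    L[[i, j], [i, j]] += w
    L[i, j] -= w
    L[j, i] -= w
    return L

def clique_fill(w):
    """Exact fill from eliminating a vertex with neighbor weights w:
    an edge of weight w_i * w_j / W between every neighbor pair."""
    d, W = len(w), w.sum()
    return sum(edge_laplacian(d, i, j, w[i] * w[j] / W)
               for i, j in combinations(range(d), 2))

def sampled_fill(w, rng):
    """Sparse unbiased estimate: sample d-1 edges with probability
    proportional to w_i * w_j, reweighted so the expectation matches."""
    d, W = len(w), w.sum()
    pairs = list(combinations(range(d), 2))
    p = np.array([w[i] * w[j] for i, j in pairs])
    Z = p.sum()
    idx = rng.choice(len(pairs), size=d - 1, p=p / Z)
    L = np.zeros((d, d))
    for k in idx:
        i, j = pairs[k]
        L += edge_laplacian(d, i, j, Z / (W * (d - 1)))
    return L

rng = np.random.default_rng(0)
w = np.array([2.0, 1.0, 1.0, 3.0])   # a degree-4 vertex's neighbor weights
avg = np.mean([sampled_fill(w, rng) for _ in range(20_000)], axis=0)
assert np.abs(avg - clique_fill(w)).max() < 0.05   # unbiased in expectation
```

Each sample keeps only $d - 1 = 3$ of the $\binom{4}{2} = 6$ clique edges, yet the average over many samples recovers the exact clique Laplacian.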
+ +### 4.1 How it works + +The algorithm eliminates vertices one by one in random order. Each elimination produces fill edges (a clique on the vertex's neighbors). Instead of materializing all $O(d^2)$ clique edges, it samples a random spanning tree with only $d - 1$ edges - the same clique-tree trick used in Section 3.3: + +![Approximate Cholesky: clique vs. tree on the reduced graph](images/ac_clique_vs_tree.svg) + +The difference from the Schur complement step is that here, eliminations are **sequential** - each one modifies the graph for the next: + +![Sequential elimination in approximate Cholesky](images/ac_sequential.svg) + +Eliminating F1 creates a fill edge F2–F4. When F2 is eliminated next, it now connects to F4 (via the fill from step 1), producing further fill F3–F4. This cascading fill is why the Schur complement reduction - which exploits the bipartite independence - is performed first, and approximate Cholesky is applied only to the smaller reduced system. + +### 4.2 Properties + +The key property is that $\mathbb{E}[\tilde{L}\tilde{L}^\top] = S$ (unbiased). The resulting triangular factor is cheap to apply: the local solve becomes a forward substitution followed by a back substitution, costing $O(m_{\text{local}})$ per application. + +--- + +## References + +**Gao, Y., Kyng, R., & Spielman, D. A.** (2025). *Robust and Practical Solution of Laplacian Equations by Approximate Gaussian Elimination*. arXiv:2303.00709. Primary reference for the approximate Cholesky factorization via clique-tree sampling (AC(k) algorithm), Schur complement approximation, and Gremban augmentation. + +**Gremban, K. D.** (1996). *Combinatorial Preconditioners for Sparse, Symmetric, Diagonally Dominant Linear Systems*. PhD thesis, Carnegie Mellon University. SDDM-to-Laplacian augmentation technique. 
diff --git a/docs/images/ac_clique_vs_tree.svg b/docs/images/ac_clique_vs_tree.svg new file mode 100644 index 0000000..2102c0e --- /dev/null +++ b/docs/images/ac_clique_vs_tree.svg @@ -0,0 +1,94 @@ + + + + + + + + + + + Reduced graph: star of F1 + + + + existing + + + + + + + + + F1 + + + F2 + + + F3 + + + F4 + + + + + + Exact: clique (3 fill edges) + + + + F1 + + + + + + + + + + + + F2 + + + F3 + + + F4 + + + + + + Approx: tree (2 fill edges) + + + + F1 + + + + + + + + + + + + + + F2 + + + F3 + + + F4 + + All nodes are firms (reduced graph after Schur complement). Gray edge existed before this elimination step. + diff --git a/docs/images/ac_sequential.svg b/docs/images/ac_sequential.svg new file mode 100644 index 0000000..aa059eb --- /dev/null +++ b/docs/images/ac_sequential.svg @@ -0,0 +1,90 @@ + + + + + + + + + + + Reduced graph + + + + + + + + + + F1 + + + F2 + + + F3 + + + F4 + + + + + + Eliminate F1 + + + + F1 + + + + + + + fill + + + + F2 + + + F3 + + + F4 + + + + + + Eliminate F2 + + + + F1 + + + + F2 + + + + + + + + fill + + + + F3 + + + F4 + + + F2 sees F4 via step-1 fill + diff --git a/docs/images/additive_vs_multiplicative.svg b/docs/images/additive_vs_multiplicative.svg new file mode 100644 index 0000000..c3c6c6b --- /dev/null +++ b/docs/images/additive_vs_multiplicative.svg @@ -0,0 +1,114 @@ + + + + + + + + + + + + + + Additive Schwarz + All subdomains solve in parallel from the same residual + + + + residual r + + + + + + + + + W-F + solve + + + W-Y + solve + + + F-Y + solve + + + + + + + + + + + + + + + correction z + + + Symmetric → use with CG + + + + + + Multiplicative Schwarz + Each subdomain sees the updated residual from the previous one + + + + r + + + + + W-F + solve + + + + update r + + + + r′ + + + + + W-Y + solve + + + + update r + + + + r″ + + + + + F-Y + solve + + + + + + correction z + + + + Each subdomain benefits from corrections + already made → generally converges faster + + + Non-symmetric → use with GMRES + diff --git 
a/docs/images/convergence_zigzag.svg b/docs/images/convergence_zigzag.svg new file mode 100644 index 0000000..9ba4d40 --- /dev/null +++ b/docs/images/convergence_zigzag.svg @@ -0,0 +1,49 @@ + + + + + High mobility + Workers move between many firms + + + + + residual + iterations + + + + + + + tol + + + + converged + + + + + + Low mobility + Workers stuck at one firm + + + + + residual + iterations + + + + + + + tol + + + still far from tol + diff --git a/docs/images/graph_bipartite_wf.svg b/docs/images/graph_bipartite_wf.svg new file mode 100644 index 0000000..ed023e9 --- /dev/null +++ b/docs/images/graph_bipartite_wf.svg @@ -0,0 +1,29 @@ + + + + + + + + + + + + + + W1 + + W2 + + W3 + + + + F1 + + F2 + + + Worker–Firm bipartite subgraph + Single connected component (W1 bridges both firms) + diff --git a/docs/images/graph_factor_level.svg b/docs/images/graph_factor_level.svg new file mode 100644 index 0000000..240bceb --- /dev/null +++ b/docs/images/graph_factor_level.svg @@ -0,0 +1,55 @@ + + + + + + D_W + + + D_F + + + D_Y + + + + + + + + + + + + + + + + + + + + + + W1 + + W2 + + W3 + + + F1 + + F2 + + + Y1 + + Y2 + + + Every edge crosses a subdomain boundary — no internal edges. 
+ diff --git a/docs/images/graph_factor_pair.svg b/docs/images/graph_factor_pair.svg new file mode 100644 index 0000000..1919d14 --- /dev/null +++ b/docs/images/graph_factor_pair.svg @@ -0,0 +1,61 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + W1 + + W2 + + W3 + + + F1 + + F2 + + + Y1 + + Y2 + + + Worker–Firm + Worker–Year + Firm–Year + diff --git a/docs/images/graph_plain.svg b/docs/images/graph_plain.svg new file mode 100644 index 0000000..0612386 --- /dev/null +++ b/docs/images/graph_plain.svg @@ -0,0 +1,50 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + W1 + + W2 + + W3 + + + F1 + + F2 + + + Y1 + + Y2 + + + + Workers + + Firms + + Years + diff --git a/docs/images/local_solve_comparison.svg b/docs/images/local_solve_comparison.svg new file mode 100644 index 0000000..f455a82 --- /dev/null +++ b/docs/images/local_solve_comparison.svg @@ -0,0 +1,122 @@ + + + + + Factor-level solve + Local operator = diagonal block only + + + + + + + + + + + + + + + + 2 + 2 + 2 + + + W1 + W2 + W3 + W1 + W2 + W3 + + + DW = diag(2, 2, 2) + Trivial to invert, but ignores + all cross-factor structure + + + + vs + + + Factor-pair solve + Local operator = full bipartite block + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 2 + 2 + 2 + + + 1 + 1 + 2 + 2 + + 1 + 2 + 1 + 2 + + + 3 + 3 + + + W1 + W2 + W3 + F1 + F2 + W1 + W2 + W3 + F1 + F2 + + + + + + + + Captures how workers and firms + interact within each subdomain + diff --git a/docs/images/local_solve_pipeline.svg b/docs/images/local_solve_pipeline.svg new file mode 100644 index 0000000..af0cb5d --- /dev/null +++ b/docs/images/local_solve_pipeline.svg @@ -0,0 +1,99 @@ + + + + + + + + + + Local solve pipeline (one subdomain, e.g. Worker-Firm) + + + + Bipartite block + + + + + + D + C + Cᵀ + D + + + + flip + signs + + + + Laplacian + + + + + + D + −C + −Cᵀ + D + + + + elim. 
+ workers + + + + Schur complement + S = Dᵣ − CᵀD⁻¹C + + + F1 + + F2 + + + + + approx + Cholesky + + + + Triangular factor + S ≈ ḼḼᵀ + cheap to solve: + forward + back sub. + + + + + + + Back-substitute + Solve for firms, + then recover + worker coefficients + + + + + + size: mₑ + mᵣ + (workers + firms) + + same size, + Laplacian form + + size: mᵣ only + (much smaller!) + + ≈linear time + factorization + + recover full + solution cheaply + diff --git a/docs/images/partition_of_unity.svg b/docs/images/partition_of_unity.svg new file mode 100644 index 0000000..3a4ce97 --- /dev/null +++ b/docs/images/partition_of_unity.svg @@ -0,0 +1,96 @@ + + + + Partition of unity: each DOF appears in 2 subdomains + + + + + + + W1 + + W2 + + W3 + + + F1 + + F2 + + + Y1 + + Y2 + + + W-F + + + + 1/√2 + + 1/√2 + + 1/√2 + + 1/√2 + + 1/√2 + + + + + + W-Y + + + + 1/√2 + + 1/√2 + + 1/√2 + + + + + 1/√2 + + 1/√2 + + + F-Y + + + + + + + 1/√2 + + 1/√2 + + 1/√2 + + 1/√2 + + + + + + w² sum + + + 1 + 1 + 1 + 1 + 1 + 1 + 1 + + + Each DOF appears in exactly 2 subdomains. Squared weights: (1/√2)² + (1/√2)² = 1 everywhere. 
+ diff --git a/docs/images/preconditioned_convergence.svg b/docs/images/preconditioned_convergence.svg new file mode 100644 index 0000000..8e62c50 --- /dev/null +++ b/docs/images/preconditioned_convergence.svg @@ -0,0 +1,40 @@ + + + + Effect of preconditioning on convergence + + + + + residual (log scale) + iterations + + + 1 + + 10⁻² + + 10⁻⁴ + + 10⁻⁶ + + + + + tol = 10⁻⁸ + + + + + + + + + + Unpreconditioned (classical demeaning) + + + Preconditioned (factor-pair Schwarz) + diff --git a/docs/images/schur_clique_vs_tree.svg b/docs/images/schur_clique_vs_tree.svg new file mode 100644 index 0000000..f86e159 --- /dev/null +++ b/docs/images/schur_clique_vs_tree.svg @@ -0,0 +1,77 @@ + + + + + + + + + + + Star of W1 (degree 3) + + + + + + + + + W1 + + + + F1 + + + F2 + + + F3 + + + + exact + + + + approx + + + Clique: d(d−1)/2 = 3 fill edges + + + + + + + + + F1 + + + F2 + + + F3 + + + Tree: d−1 = 2 fill edges + + + + + + + + + + + F1 + + + F2 + + + F3 + diff --git a/docs/images/schur_eliminate_workers.svg b/docs/images/schur_eliminate_workers.svg new file mode 100644 index 0000000..8bcbd0d --- /dev/null +++ b/docs/images/schur_eliminate_workers.svg @@ -0,0 +1,113 @@ + + + + + + + + + + + Worker-Firm bipartite + + + + W1 + + W2 + + W3 + + + + F1 + + F2 + + + + + + + + + + elim W1 + + + Eliminate W1 + + + + W1 + + + + + + + + W2 + + W3 + + + + F1 + + F2 + + + + + + + + fill + + + + elim all + + + All workers eliminated + + + + W1 + + W2 + + W3 + + + + + + + + + + F1 + + F2 + + + + + + + + + + Schur + complement + Smaller system + on firms only. + Fill edges encode + indirect worker + connections. 
+ + + Workers share no edges → all worker eliminations are independent → parallel + diff --git a/docs/images/three_layer_architecture.svg b/docs/images/three_layer_architecture.svg new file mode 100644 index 0000000..1b26d03 --- /dev/null +++ b/docs/images/three_layer_architecture.svg @@ -0,0 +1,61 @@ + + + + + + + + + + + + + + Panel data + weights + + + + + Krylov Solver + CG or GMRES — iterates toward the solution + Each iteration asks: "given the current error, what direction improves most?" + + + + residual r + + + + correction z + + + + Schwarz Preconditioner + Splits the problem into overlapping factor-pair subdomains + Each pair (e.g. Worker-Firm) becomes an independent sub-problem + + + + + + + + + Worker-Firm + approx Cholesky + + + Worker-Year + approx Cholesky + + + Firm-Year + approx Cholesky + + + Independent — solved in parallel + + + + Demeaned outcome +