From 3db15ca1ff366017c97b95c0ee057c38e47a5489 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Wed, 17 Dec 2025 21:06:55 +0000
Subject: [PATCH 1/3] docs: Add comprehensive documentation for routing and
 water balance models

Add detailed technical documentation for EF5's hydrological models:

- docs/ROUTING_MODELS.md: Documents Linear Reservoir and Kinematic Wave
  routing models including algorithms, parameters, and equations

- docs/WATER_BALANCE_MODELS.md: Documents SAC-SMA, CREST, HyMOD, and
  Hydrophobic water balance models with conceptual diagrams, parameter
  tables, and implementation details

Both documents include:
- Conceptual structure diagrams
- Complete parameter reference tables
- Algorithm descriptions with pseudocode
- Key equations and their derivations
- Comparison tables between model variants
- Configuration examples
- References to source literature
---
 docs/ROUTING_MODELS.md       | 384 ++++++++++++++++++
 docs/WATER_BALANCE_MODELS.md | 760 +++++++++++++++++++++++++++++++++++
 2 files changed, 1144 insertions(+)
 create mode 100644 docs/ROUTING_MODELS.md
 create mode 100644 docs/WATER_BALANCE_MODELS.md

diff --git a/docs/ROUTING_MODELS.md b/docs/ROUTING_MODELS.md
new file mode 100644
index 0000000..f42627b
--- /dev/null
+++ b/docs/ROUTING_MODELS.md
@@ -0,0 +1,384 @@
+# Routing Models in EF5
+
+This document provides detailed technical documentation for the routing models implemented in the Ensemble Framework For Flash Flood Forecasting (EF5). Routing models transport water from upstream grid cells to downstream cells, simulating the movement of water through the landscape and channel network.
+
+## Overview
+
+EF5 implements two routing models:
+
+| Model | Type | Complexity | States | Parameters | Best Use Case |
+|-------|------|------------|--------|------------|---------------|
+| **Linear Reservoir** | Conceptual | Simple | 2 reservoirs | 8 | Fast forecasting, large-scale applications |
+| **Kinematic Wave** | Physics-based | Moderate | 3 states | 7 | Detailed process studies, channel routing |
+
+Both routing models receive water from the water balance model as two flow components:
+- **Fast Flow (Overland)**: Surface runoff that moves quickly across the landscape
+- **Slow Flow (Interflow)**: Subsurface lateral flow that moves more slowly
+
+---
+
+## Linear Reservoir Routing
+
+**Source Files:** `src/LinearRoute.h`, `src/LinearRoute.cpp`
+
+### Concept
+
+The Linear Reservoir model treats each grid cell as containing two conceptual reservoirs - one for overland (fast) flow and one for interflow (slow) flow. Water enters these reservoirs from the water balance model and from upstream cells, then drains to downstream cells based on a linear outflow relationship.
+
+### Parameters
+
+| Parameter | Name | Description | Typical Range | Units |
+|-----------|------|-------------|---------------|-------|
+| `COEM` | Overland Coefficient | Manning's roughness for overland flow | 0.01 - 0.5 | - |
+| `RIVER` | Channel Coefficient | Manning's roughness for channel flow | 0.01 - 0.1 | - |
+| `UNDER` | Interflow Coefficient | Speed coefficient for subsurface flow | 0.001 - 0.1 | - |
+| `LEAKO` | Overland Leakage | Fraction of overland reservoir released per timestep | 0.0 - 1.0 | - |
+| `LEAKI` | Interflow Leakage | Fraction of interflow reservoir released per timestep | 0.0 - 1.0 | - |
+| `TH` | Threshold | Flow accumulation threshold for channel cells | 1 - 1000 | cells |
+| `ISO` | Initial Overland Storage | Initial water depth in overland reservoir | 0.0 - 10.0 | mm |
+| `ISU` | Initial Underground Storage | Initial water depth in interflow reservoir | 0.0 - 100.0 | mm |
+
+### Data Structure
+
+```cpp
+struct LRGridNode {
+    float params[8];              // Routing parameters
+    double slopeSqrt;             // Pre-calculated sqrt(slope)
+    double nexTime[2];            // Travel time for each layer [OVERLAND, INTERFLOW]
+    LRGridNode *routeCNode[2][2]; // Downstream cell pointers (2 layers x 2 targets)
+    double routeAmount[2][2];     // Fraction of flow to each downstream cell
+    double incomingWater[2];      // Accumulated incoming water from upstream
+};
+```
+
+### Algorithm
+
+#### Initialization Phase (`InitializeRouting`)
+
+1. **Calculate Flow Velocity** using Manning's equation:
+   ```
+   velocity = depth^0.66 * sqrt(slope) * coefficient
+   ```
+   Where:
+   - `depth` is assumed from hydraulic radius approximation
+   - `slope` is terrain slope from DEM
+   - `coefficient` is COEM (overland) or RIVER (channel) or UNDER (interflow)
+
+2. **Determine Travel Time** across each cell:
+   ```
+   travelTime = cellLength / velocity
+   ```
+
+3. **Pre-compute Routing Destinations**:
+   - If `travelTime < timestep`: Water reaches downstream cell(s)
+   - Calculate fractional split between cells based on how far water travels
+   - Store routing coefficients for runtime efficiency
+
+#### Routing Phase (`RouteInt`)
+
+For each timestep, the routing proceeds in order from upstream to downstream:
+
+```
+For each grid cell (upstream to downstream order):
+
+    // OVERLAND LAYER
+    1. Add incoming fast flow from upstream to overland reservoir
+    2. Calculate overland leakage:
+       overlandLeak = reservoir * LEAKO
+
+    // INTERFLOW LAYER
+    3. Add incoming slow flow from upstream to interflow reservoir
+    4. Calculate interflow leakage:
+       interflowLeak = reservoir * LEAKI
+
+    // ROUTE TO DOWNSTREAM
+    5. For each routing layer:
+       - Calculate outflow based on pre-computed routing fractions
+       - Split flow between primary and secondary downstream cells
+       - Add to downstream cells' incoming water accumulator
+
+    // DISCHARGE CALCULATION
+    6. If gauge location:
+       discharge = (overlandLeak + interflowLeak) * cellArea / timestep
+```
+
+### Routing Fraction Calculation
+
+When water travels through a cell, it may reach one or two downstream cells depending on travel time:
+
+```
+If travelTime >= timestep:
+    // All water stays in this cell for this timestep
+    routeAmount[0] = 1.0  // to self
+    routeAmount[1] = 0.0  // none to downstream
+
+Else:
+    remainingTime = timestep - travelTime
+    // Calculate how far into downstream cell water travels
+    // Split proportionally between current and downstream
+    routeAmount[0] = travelTime / timestep      // fraction staying
+    routeAmount[1] = remainingTime / timestep   // fraction moving downstream
+```
+
+### Strengths and Limitations
+
+**Strengths:**
+- Computationally efficient
+- Simple parameterization
+- Suitable for real-time forecasting
+- Pre-computed routing enables fast execution
+
+**Limitations:**
+- Linear assumption may not capture nonlinear channel dynamics
+- Less physically realistic for detailed channel flow studies
+- Does not account for backwater effects
+
+---
+
+## Kinematic Wave Routing
+
+**Source Files:** `src/KinematicRoute.h`, `src/KinematicRoute.cpp`
+
+### Concept
+
+The Kinematic Wave model is based on the shallow water equations with the kinematic wave approximation. This assumes that the friction slope equals the bed slope (no pressure gradient effects), which is valid for overland flow and steep channels where backwater effects are negligible.
+
+### Governing Equations
+
+The kinematic wave equation combines continuity and momentum:
+
+**Continuity:**
+```
+dh/dt + dq/dx = i - f
+```
+
+**Kinematic Wave Assumption:**
+```
+q = alpha * h^beta
+```
+
+Where:
+- `h` = flow depth (m)
+- `q` = discharge per unit width (m^2/s)
+- `i` = lateral inflow (rainfall excess)
+- `f` = infiltration/losses
+- `alpha` = flow coefficient
+- `beta` = flow exponent (typically 0.6 for overland, variable for channels)
+
+### Parameters
+
+| Parameter | Name | Description | Typical Range | Units |
+|-----------|------|-------------|---------------|-------|
+| `UNDER` | Interflow Coefficient | Subsurface flow speed coefficient | 0.001 - 0.1 | - |
+| `LEAKI` | Interflow Leakage | Fraction of interflow released per timestep | 0.0 - 1.0 | - |
+| `TH` | Threshold | Flow accumulation threshold for channels | 1 - 1000 | cells |
+| `ISU` | Initial Underground Storage | Initial interflow reservoir depth | 0.0 - 100.0 | mm |
+| `ALPHA` | Channel Alpha | Channel flow coefficient (alpha) | 0.1 - 10.0 | - |
+| `BETA` | Channel Beta | Channel flow exponent (beta) | 0.3 - 1.0 | - |
+| `ALPHA0` | Hillslope Alpha | Overland flow coefficient | 0.01 - 1.0 | - |
+
+### State Variables
+
+| State | Name | Description |
+|-------|------|-------------|
+| `STATE_KW_PQ` | Previous Channel Flow | Channel discharge from previous timestep |
+| `STATE_KW_PO` | Previous Overland Flow | Overland flow from previous timestep |
+| `STATE_KW_IR` | Interflow Reservoir | Current interflow storage |
+
+### Data Structure
+
+```cpp
+struct KWGridNode {
+    float params[7];              // Kinematic wave parameters
+    double states[3];             // State variables [PQ, PO, IR]
+    double slopeSqrt;             // Pre-calculated sqrt(slope)
+    double alpha[2];              // Alpha for [hillslope, channel]
+    double beta[2];               // Beta for [hillslope, channel]
+    double incomingWater[2];      // Incoming water [FASTFLOW, INTERFLOW]
+};
+```
+
+### Algorithm
+
+#### For Hillslope Cells (`RouteInt`)
+
+1. **Setup Right-Hand Side (RHS)**:
+   ```
+   RHS = (dt/dx) * upstream_flow + precip_excess + previous_overland
+   ```
+
+2. **Solve Kinematic Wave** using Newton-Raphson iteration (10 iterations):
+   ```
+   For i = 1 to 10:
+       f(q) = (dt/dx)*q + alpha*q^beta - RHS
+       f'(q) = (dt/dx) + alpha*beta*q^(beta-1)
+       q_new = q - f(q)/f'(q)
+   ```
+
+3. **Update States**:
+   ```
+   previous_overland = q
+   Route q to downstream cell
+   ```
+
+#### For Channel Cells (`RouteInt`)
+
+Channel cells process both hillslope contributions and channel flow:
+
+1. **Accumulate Hillslope Contributions**:
+   ```
+   lateral_inflow = sum of overland flow from adjacent hillslope cells
+   ```
+
+2. **Process Interflow Reservoir**:
+   ```
+   interflow_leak = reservoir * LEAKI
+   reservoir = reservoir * (1 - LEAKI) + incoming_interflow
+   ```
+
+3. **Solve Channel Kinematic Wave**:
+   ```
+   RHS = (dt/dx)*upstream_channel_flow + lateral_inflow + interflow_leak + precip
+
+   Solve: (dt/dx)*Q + alpha*Q^beta = RHS
+   Using Newton-Raphson iteration
+   ```
+
+4. **Update Channel State**:
+   ```
+   previous_channel_flow = Q
+   discharge = Q * channel_width
+   ```
+
+### Newton-Raphson Solution
+
+The nonlinear kinematic wave equation is solved iteratively:
+
+```cpp
+double SolveKinematicWave(double rhs, double alpha, double beta, double dtdx) {
+    double q = rhs / (dtdx + alpha);  // Initial guess
+
+    for (int i = 0; i < 10; i++) {
+        double f = dtdx * q + alpha * pow(q, beta) - rhs;
+        double fp = dtdx + alpha * beta * pow(q, beta - 1);
+        q = q - f / fp;
+        if (q < 0) q = 0;
+    }
+
+    return q;
+}
+```
+
+### Cell Type Classification
+
+Cells are classified based on flow accumulation:
+
+```
+If flowAccumulation >= TH:
+    cell_type = CHANNEL
+Else:
+    cell_type = HILLSLOPE
+```
+
+Channel cells use different alpha/beta parameters and accumulate flow from surrounding hillslope cells.
+
+### State Persistence
+
+The Kinematic Wave model supports saving and loading states for:
+- Warm starts (continuing from previous simulation)
+- Data assimilation
+- Checkpoint/restart capability
+
+States are stored as GeoTIFF grids with the same resolution as the model grid.
+
+### Strengths and Limitations
+
+**Strengths:**
+- More physically based than linear reservoir
+- Captures nonlinear flow dynamics
+- Better representation of flood wave propagation
+- Separate treatment of hillslope and channel processes
+
+**Limitations:**
+- More computationally intensive
+- Requires more parameters
+- Kinematic assumption invalid for:
+  - Backwater effects
+  - Very mild slopes
+  - Tidal influences
+- Iterative solution may not converge for extreme conditions
+
+---
+
+## Comparison of Routing Models
+
+| Aspect | Linear Reservoir | Kinematic Wave |
+|--------|-----------------|----------------|
+| **Physical Basis** | Conceptual | Semi-physical |
+| **Computation Speed** | Fast | Moderate |
+| **Parameter Count** | 8 | 7 |
+| **State Variables** | 2 reservoirs | 3 states |
+| **Channel/Hillslope** | Same treatment | Different treatment |
+| **Nonlinearity** | Linear outflow | Nonlinear (power law) |
+| **Backwater Effects** | Not captured | Not captured |
+| **Best For** | Real-time forecasting | Process studies |
+| **Initialization** | Simple (reservoir levels) | Requires previous flows |
+
+---
+
+## Integration with Water Balance Models
+
+Both routing models interface with water balance models through a common interface:
+
+```cpp
+// Water balance model outputs
+float fastFlow;   // Surface runoff (mm/hr)
+float slowFlow;   // Interflow/baseflow (mm/hr)
+
+// Routing model receives these as inputs
+routingModel->Route(fastFlow, slowFlow);
+
+// Routing model outputs
+float discharge;  // Streamflow at gauges (m^3/s or cfs)
+```
+
+### Flow Path Calculation
+
+Both routing models use the D8 flow direction algorithm:
+1. Each cell has a single downstream cell
+2. Flow follows steepest descent
+3. Cells are processed in topological order (upstream first)
+
+---
+
+## Configuration Example
+
+```xml
+<Route>
+    <Type>LR</Type>  <!-- or "KW" for Kinematic Wave -->
+    <Parameters>
+        <Gauge name="outlet">
+            <COEM>0.05</COEM>
+            <RIVER>0.03</RIVER>
+            <UNDER>0.01</UNDER>
+            <LEAKO>0.5</LEAKO>
+            <LEAKI>0.3</LEAKI>
+            <TH>100</TH>
+            <ISO>0.0</ISO>
+            <ISU>10.0</ISU>
+        </Gauge>
+    </Parameters>
+</Route>
+```
+
+---
+
+## References
+
+1. Flamig, Z. L., et al. (2020). "The Ensemble Framework For Flash Flood Forecasting (EF5)." *Journal of Hydrometeorology*.
+
+2. Singh, V. P. (1996). *Kinematic Wave Modeling in Water Resources: Surface-Water Hydrology*. John Wiley & Sons.
+
+3. Chow, V. T., Maidment, D. R., & Mays, L. W. (1988). *Applied Hydrology*. McGraw-Hill.
+
+4. Henderson, F. M. (1966). *Open Channel Flow*. Macmillan.
diff --git a/docs/WATER_BALANCE_MODELS.md b/docs/WATER_BALANCE_MODELS.md
new file mode 100644
index 0000000..218b662
--- /dev/null
+++ b/docs/WATER_BALANCE_MODELS.md
@@ -0,0 +1,760 @@
+# Water Balance Models in EF5
+
+This document provides detailed technical documentation for the water balance models implemented in the Ensemble Framework For Flash Flood Forecasting (EF5). Water balance models simulate the vertical water budget at each grid cell, partitioning precipitation into evapotranspiration, soil moisture storage, and runoff components.
+
+## Overview
+
+EF5 implements four water balance models with varying complexity:
+
+| Model | Full Name | Complexity | States | Parameters | Type |
+|-------|-----------|------------|--------|------------|------|
+| **SAC** | Sacramento Soil Moisture Accounting | High | 6 | 23 | Distributed |
+| **CREST** | Coupled Routing and Excess Storage | Moderate | 1 | 6 | Distributed |
+| **HyMOD** | Hydrological Model | Moderate | 4-6 | 9 | Lumped |
+| **HP** | Hydrophobic | Simple | 0 | 2 | Distributed |
+
+All models output two flow components:
+- **Fast Flow (Surface Runoff)**: Water that flows quickly over or near the surface
+- **Slow Flow (Interflow/Baseflow)**: Water that moves slowly through subsurface pathways
+
+---
+
+## Sacramento Soil Moisture Accounting Model (SAC-SMA)
+
+**Source Files:** `src/SAC.h`, `src/SAC.cpp`
+
+### Overview
+
+The Sacramento model is a conceptual rainfall-runoff model developed by the U.S. National Weather Service. It is one of the most widely used operational hydrologic models in the United States. SAC-SMA represents the soil column as two zones (upper and lower), each with tension water and free water storage components.
+
+### Conceptual Structure
+
+```
+                    Precipitation
+                         |
+                         v
+    +--------------------------------------------+
+    |              UPPER ZONE                     |
+    |  +------------------+  +------------------+ |
+    |  | Tension Water    |  | Free Water       | |
+    |  | (UZTWC/UZTWM)    |  | (UZFWC/UZFWM)    | |
+    |  +------------------+  +------------------+ |
+    |         |                    |             |
+    |         v                    v             |
+    |        ET              Interflow --------> Slow Flow
+    |                              |             |
+    +------------------------------|-------------+
+                                   | Percolation
+                                   v
+    +--------------------------------------------+
+    |              LOWER ZONE                     |
+    |  +------------------+                       |
+    |  | Tension Water    |                       |
+    |  | (LZTWC/LZTWM)    |                       |
+    |  +--------+---------+                       |
+    |           |                                 |
+    |  +--------v---------+  +------------------+ |
+    |  | Supplemental     |  | Primary          | |
+    |  | Free Water       |  | Free Water       | |
+    |  | (LZFSC/LZFSM)    |  | (LZFPC/LZFPM)    | |
+    |  +------------------+  +------------------+ |
+    |         |                    |             |
+    |         v                    v             |
+    |    Baseflow (fast)     Baseflow (slow) --> Slow Flow
+    +--------------------------------------------+
+
+    +--------------------------------------------+
+    |         IMPERVIOUS AREAS                   |
+    |  - Direct Runoff (PCTIM)                   |
+    |  - Additional Impervious (ADIMP)      ---> Fast Flow
+    +--------------------------------------------+
+```
+
+### Parameters
+
+| Parameter | Name | Description | Typical Range | Units |
+|-----------|------|-------------|---------------|-------|
+| `UZTWM` | Upper Zone Tension Water Max | Maximum upper zone tension water storage | 10 - 300 | mm |
+| `UZFWM` | Upper Zone Free Water Max | Maximum upper zone free water storage | 10 - 150 | mm |
+| `UZK` | Upper Zone Coefficient | Upper zone free water lateral outflow rate | 0.1 - 0.75 | 1/day |
+| `PCTIM` | Percent Impervious | Permanent impervious area fraction | 0.0 - 0.1 | - |
+| `ADIMP` | Additional Impervious | Additional impervious area when saturated | 0.0 - 0.4 | - |
+| `RIVA` | Riparian Area | Riparian vegetation area fraction | 0.0 - 0.2 | - |
+| `ZPERC` | Percolation Demand | Maximum percolation rate multiplier | 5 - 350 | - |
+| `REXP` | Percolation Exponent | Percolation equation exponent | 1.0 - 5.0 | - |
+| `LZTWM` | Lower Zone Tension Water Max | Maximum lower zone tension water storage | 50 - 500 | mm |
+| `LZFSM` | Lower Zone Supplemental Max | Maximum lower zone supplemental free water | 10 - 400 | mm |
+| `LZFPM` | Lower Zone Primary Max | Maximum lower zone primary free water | 50 - 1000 | mm |
+| `LZSK` | Lower Zone Supplemental K | Lower zone supplemental drainage rate | 0.01 - 0.35 | 1/day |
+| `LZPK` | Lower Zone Primary K | Lower zone primary drainage rate | 0.001 - 0.05 | 1/day |
+| `PFREE` | Percolation to Free | Fraction of percolation going to free water | 0.0 - 0.8 | - |
+| `SIDE` | Side Flow Ratio | Ratio of deep percolation to streamflow | 0.0 - 0.5 | - |
+| `RSERV` | Lower Zone Minimum | Minimum lower zone storage as fraction | 0.0 - 0.4 | - |
+| `UZTWC` | Initial UZTW Content | Initial upper zone tension water | 0 - UZTWM | mm |
+| `UZFWC` | Initial UZFW Content | Initial upper zone free water | 0 - UZFWM | mm |
+| `LZTWC` | Initial LZTW Content | Initial lower zone tension water | 0 - LZTWM | mm |
+| `LZFSC` | Initial LZFS Content | Initial lower zone supplemental free water | 0 - LZFSM | mm |
+| `LZFPC` | Initial LZFP Content | Initial lower zone primary free water | 0 - LZFPM | mm |
+| `ADIMC` | Initial ADIM Content | Initial additional impervious area content | 0 - (UZTWM+LZTWM) | mm |
+
+### State Variables
+
+| State | Description | Bounds |
+|-------|-------------|--------|
+| `UZTWC` | Upper zone tension water content | 0 to UZTWM |
+| `UZFWC` | Upper zone free water content | 0 to UZFWM |
+| `LZTWC` | Lower zone tension water content | 0 to LZTWM |
+| `LZFSC` | Lower zone supplemental free water content | 0 to LZFSM |
+| `LZFPC` | Lower zone primary free water content | 0 to LZFPM |
+| `ADIMC` | Additional impervious area content | 0 to (UZTWM+LZTWM) |
+
+### Algorithm (`WaterBalanceInt`)
+
+The SAC algorithm operates in two main phases:
+
+#### Phase 1: Evapotranspiration
+
+```
+1. Calculate potential ET demand: E1 = PET for this timestep
+
+2. ET from upper zone tension water:
+   E1 = min(E1, UZTWC)
+   UZTWC = UZTWC - E1
+   RED = E1_original - E1  // Residual demand
+
+3. ET from upper zone free water (if demand remains):
+   E2 = min(RED, UZFWC)
+   UZFWC = UZFWC - E2
+   RED = RED - E2
+
+4. ET from lower zone tension water (if demand remains):
+   SAVED = RSERV * (LZFPM + LZFSM)  // Protected storage
+   E3 = RED * (LZTWC / (UZTWM + LZTWM))
+   LZTWC = LZTWC - E3
+
+5. ET from additional impervious area:
+   E5 = E1 * (ADIMC / UZTWM)
+   ADIMC = ADIMC - E5 - (E1 - E3) * (ADIMC / LZTWM)
+
+6. Rebalance upper zone storages to maintain equilibrium
+```
+
+#### Phase 2: Runoff Generation
+
+```
+1. Calculate excess water available for runoff:
+   TWX = precip - (UZTWM - UZTWC)  // Excess above tension capacity
+
+2. If TWX > 0 (excess exists):
+   // Fill upper zone tension water
+   UZTWC = UZTWM
+
+   // Add to additional impervious area
+   ADIMC = ADIMC + precip - TWX
+
+3. Determine sub-timestep increments:
+   NINC = floor(1 + 0.2 * (UZFWC + TWX))
+   DINC = TWX / NINC
+
+4. For each sub-increment:
+   a. Calculate baseflow from lower zone:
+      BF_primary = LZFPC * (1 - exp(-LZPK * dt))
+      BF_supplemental = LZFSC * (1 - exp(-LZSK * dt))
+      LZFPC = LZFPC - BF_primary
+      LZFSC = LZFSC - BF_supplemental
+
+   b. Calculate percolation from upper to lower zone:
+      DEFR = 1 - (LZTWC + LZFPC + LZFSC) / (LZTWM + LZFPM + LZFSM)
+      PERC = UZFWC * (1 + ZPERC * DEFR^REXP)
+      PERC = min(PERC, UZFWC)
+
+   c. Distribute percolation:
+      // To lower zone tension water
+      PERCT = PERC * (1 - PFREE) * min(1, (LZTWM - LZTWC) / available)
+      LZTWC = LZTWC + PERCT
+
+      // To lower zone free water (split between primary and supplemental)
+      PERCF = PERC - PERCT
+      // Distribute based on relative deficits
+
+   d. Calculate interflow:
+      INTERFLOW = UZFWC * UZK
+
+   e. Calculate surface runoff:
+      SURF = (DINC - infiltration) * (1 - PCTIM - ADIMP)
+
+   f. Calculate direct runoff from impervious areas:
+      DIRECT = DINC * PCTIM
+      ADDRO = DINC * ADIMP * (ADIMC / (UZTWM + LZTWM))^2
+
+5. Total outputs:
+   Fast Flow (SURF) = Surface runoff + Direct runoff + Additional impervious runoff
+   Slow Flow (GRND) = Interflow + Primary baseflow + Supplemental baseflow
+```
+
+### Key Equations
+
+**Percolation Rate:**
+```
+PERC = PERCM * (UZFWC/UZFWM) * [1 + ZPERC * DEFR^REXP]
+
+Where:
+- PERCM = maximum percolation rate
+- DEFR = lower zone deficiency ratio = 1 - (LZTWC+LZFPC+LZFSC)/(LZTWM+LZFPM+LZFSM)
+- ZPERC, REXP = percolation shape parameters
+```
+
+**Baseflow Recession:**
+```
+BF = Storage * [1 - exp(-K * dt)]
+
+Where:
+- K = recession coefficient (LZPK or LZSK)
+- dt = time increment
+```
+
+**Additional Impervious Runoff:**
+```
+ADDRO = Precip * ADIMP * (ADIMC / (UZTWM + LZTWM))^2
+```
+
+### Strengths and Limitations
+
+**Strengths:**
+- Well-tested operational model
+- Physically meaningful parameters
+- Good representation of baseflow processes
+- Handles dry and wet antecedent conditions well
+
+**Limitations:**
+- Many parameters require calibration
+- Complex parameter interactions
+- May be over-parameterized for some applications
+
+---
+
+## CREST Model
+
+**Source Files:** `src/CRESTModel.h`, `src/CRESTModel.cpp`
+
+### Overview
+
+The Coupled Routing and Excess Storage (CREST) model is a distributed hydrologic model developed specifically for flash flood prediction. It uses a simplified soil moisture representation with a variable infiltration capacity approach derived from the VIC model.
+
+### Conceptual Structure
+
+```
+              Precipitation
+                   |
+                   v
+    +------------------------------+
+    |        ATMOSPHERE            |
+    |   +-----------------------+  |
+    |   | PET * KE (ET demand) |  |
+    |   +-----------------------+  |
+    +--------------|---------------+
+                   v
+    +------------------------------+
+    |      LAND SURFACE            |
+    |  +---------+  +----------+   |
+    |  |Impervious|  | Pervious |   |
+    |  | (IM)     |  | (1-IM)   |   |
+    |  +----+----+  +-----+----+   |
+    |       |             |        |
+    |       v             v        |
+    |  Direct RO    Infiltration   |
+    +--------------|---------------+
+                   v
+    +------------------------------+
+    |         SOIL ZONE            |
+    |  +------------------------+  |
+    |  |    Soil Moisture (SM)  |  |
+    |  |    Capacity: WM        |  |
+    |  +------------------------+  |
+    |         |           |        |
+    |         v           v        |
+    |     Overflow    Interflow    |
+    +--------------|---------------+
+                   v
+         Fast Flow + Slow Flow
+```
+
+### Parameters
+
+| Parameter | Name | Description | Typical Range | Units |
+|-----------|------|-------------|---------------|-------|
+| `WM` | Water Capacity Max | Maximum soil water storage capacity | 50 - 500 | mm |
+| `B` | Shape Parameter | Infiltration capacity distribution shape | 0.1 - 2.0 | - |
+| `IM` | Impervious Fraction | Fraction of impervious area | 0.0 - 0.3 | - |
+| `KE` | ET Coefficient | Evapotranspiration scaling factor | 0.5 - 1.5 | - |
+| `FC` | Infiltration Capacity | Mean infiltration capacity coefficient | 0.5 - 2.0 | - |
+| `IWU` | Initial Water Use | Initial soil moisture as % of WM | 0 - 100 | % |
+
+### State Variables
+
+| State | Description | Bounds |
+|-------|-------------|--------|
+| `SM` | Soil Moisture | 0 to WM |
+
+### Algorithm (`WaterBalanceInt`)
+
+CREST uses a variable infiltration capacity (VIC) approach:
+
+#### Step 1: Preprocessing
+
+```
+// Adjust PET by crop coefficient
+adjPET = PET * KE
+
+// Direct runoff from impervious area
+ROIMP = precip * IM
+
+// Precipitation reaching soil
+precipSoil = (precip - adjPET) * (1 - IM)
+```
+
+#### Step 2: Wet Condition (precip > PET)
+
+```
+// Maximum water holding capacity across distribution
+Wmaxm = WM * (1 + B)
+
+// Calculate infiltration capacity at current soil moisture
+A = Wmaxm * (1 - (1 - SM/WM)^(1/(1+B)))
+
+// Case 1: Precipitation fills remaining capacity
+If (A + precipSoil) >= Wmaxm:
+    infiltration = WM - SM
+    excess = precipSoil - infiltration
+
+// Case 2: Partial infiltration
+Else:
+    infiltration = WM * [(1 - A/Wmaxm)^(1+B) - (1 - (A+precipSoil)/Wmaxm)^(1+B)]
+    excess = precipSoil - infiltration
+
+// Update soil moisture
+SM = SM + infiltration
+
+// Split excess into surface runoff and interflow
+// Based on infiltration capacity distribution
+interflow = excess * (FC / (FC + 1))
+surfaceRunoff = excess - interflow
+```
+
+#### Step 3: Dry Condition (precip <= PET)
+
+```
+// ET limited by available soil moisture
+actualET = min(adjPET, SM + precip)
+SM = SM + precip - actualET
+
+// Any excess becomes interflow
+If SM > WM:
+    interflow = SM - WM
+    SM = WM
+```
+
+#### Step 4: Output
+
+```
+Fast Flow = surfaceRunoff + ROIMP
+Slow Flow = interflow
+```
+
+### Variable Infiltration Capacity
+
+The VIC approach assumes infiltration capacity varies spatially within each grid cell following a power distribution:
+
+```
+Cumulative distribution: F(i) = 1 - (1 - i/im)^B
+
+Where:
+- i = infiltration capacity at a point
+- im = maximum infiltration capacity = WM * (1 + B)
+- B = shape parameter (higher B = more spatial variability)
+```
+
+This allows the model to generate runoff even when the average soil moisture is below capacity, representing partial area contribution.
+
+### Strengths and Limitations
+
+**Strengths:**
+- Simple with only 6 parameters
+- Computationally efficient
+- Good for flash flood prediction
+- Physically-based infiltration excess mechanism
+
+**Limitations:**
+- Single-layer soil representation
+- Limited baseflow simulation capability
+- May not perform well in dry climates
+
+---
+
+## HyMOD
+
+**Source Files:** `src/HyMOD.h`, `src/HyMOD.cpp`
+
+### Overview
+
+HyMOD is a conceptual lumped rainfall-runoff model that combines a soil moisture module with a series of linear reservoirs for flow routing. Unlike SAC and CREST, HyMOD operates on basin-averaged values rather than individual grid cells.
+
+### Conceptual Structure
+
+```
+         Precipitation
+              |
+              v
+    +-------------------+
+    |   SOIL MOISTURE   |
+    |   (Beta dist.)    |
+    |   HUZ, B params   |
+    +--------+----------+
+             |
+             v
+    +-------------------+
+    |   FLOW SPLITTING  |
+    |   ALP fraction    |
+    +--------+----------+
+             |
+     +-------+-------+
+     |               |
+     v               v
++----------+    +---------+
+|  QUICK   |    |  SLOW   |
+|  FLOW    |    |  FLOW   |
+| NQ tanks |    | 1 tank  |
+|  KQ rate |    | KS rate |
++----+-----+    +----+----+
+     |               |
+     v               v
+   Fast Flow      Slow Flow
+         \         /
+          \       /
+           v     v
+         Total Q
+```
+
+### Parameters
+
+| Parameter | Name | Description | Typical Range | Units |
+|-----------|------|-------------|---------------|-------|
+| `HUZ` | Upper Zone Height | Maximum soil moisture storage | 50 - 500 | mm |
+| `B` | Beta Parameter | Soil moisture distribution shape | 0.1 - 2.0 | - |
+| `ALP` | Alpha | Fraction routed to quick flow | 0.0 - 1.0 | - |
+| `NQ` | Number Quick Tanks | Number of quick flow reservoirs | 2 - 4 | - |
+| `KQ` | Quick Tank Rate | Quick flow reservoir recession | 0.1 - 1.0 | 1/day |
+| `KS` | Slow Tank Rate | Slow flow reservoir recession | 0.001 - 0.1 | 1/day |
+| `XCUZ` | Initial CUZ | Initial soil moisture fraction | 0 - 1 | - |
+| `XQ` | Initial Quick | Initial quick reservoir storage | 0 - 100 | mm |
+| `XS` | Initial Slow | Initial slow reservoir storage | 0 - 500 | mm |
+| `PRECIP` | Precip Multiplier | Precipitation adjustment factor | 0.8 - 1.2 | - |
+
+### State Variables
+
+| State | Description |
+|-------|-------------|
+| `XHuz` | Current upper zone height |
+| `XCuz` | Current upper zone storage |
+| `Xq[NQ]` | Quick flow reservoir storages (array) |
+| `Xs` | Slow flow reservoir storage |
+
+### Algorithm (`WaterBalanceInt`)
+
+#### Step 1: Soil Moisture Accounting
+
+```
+// Beta distribution parameters
+b = log(1 - B/2) / log(0.5)
+CPar = HUZ / (1 + b)
+
+// Current storage from height
+C = CPar * (1 - (1 - h/HUZ)^(1+b))
+
+// Add precipitation
+h_new = h + precip
+
+// Calculate overflow
+If h_new > HUZ:
+    overflow = h_new - HUZ
+    h_new = HUZ
+Else:
+    overflow = Cmax - C_new
+    Where C_new accounts for distribution
+```
+
+#### Step 2: Evapotranspiration
+
+```
+// ET proportional to relative storage
+actualET = PET * (h / HUZ)
+h = h - actualET
+```
+
+#### Step 3: Flow Splitting
+
+```
+// Split overflow between quick and slow pathways
+quickInflow = ALP * overflow
+slowInflow = (1 - ALP) * overflow
+```
+
+#### Step 4: Quick Flow Routing
+
+Quick flow passes through NQ identical linear reservoirs in series (Nash cascade):
+
+```
+For each reservoir i from 1 to NQ:
+    outflow[i] = KQ * storage[i]
+    storage[i] = storage[i] - outflow[i] + inflow[i]
+    inflow[i+1] = outflow[i]
+
+Quick Flow = outflow[NQ]
+```
+
+#### Step 5: Slow Flow Routing
+
+Slow flow passes through a single linear reservoir:
+
+```
+Slow Flow = KS * Xs
+Xs = Xs - Slow Flow + slowInflow
+```
+
+### Linear Reservoir Theory
+
+Each reservoir follows the linear storage-outflow relationship:
+
+```
+Q = K * S
+
+Where:
+- Q = outflow rate
+- K = recession coefficient
+- S = storage
+
+Solution for timestep dt:
+Q(t+dt) = Q(t) * exp(-K*dt) + inflow * (1 - exp(-K*dt))
+```
+
+The Nash cascade of NQ reservoirs produces a gamma distribution response, smoothing and delaying the input signal.
+
+### Lumped vs Distributed Operation
+
+HyMOD in EF5 operates as a **lumped model**:
+- Precipitation is averaged over the basin
+- Single set of state variables for entire basin
+- Output represents basin outlet discharge
+
+This is different from SAC and CREST which operate on each grid cell independently.
+
+### Strengths and Limitations
+
+**Strengths:**
+- Simple and parsimonious (fewer parameters)
+- Computationally very fast
+- Good performance for many basins
+- Clear physical interpretation of parameters
+
+**Limitations:**
+- Lumped representation ignores spatial variability
+- Limited for large heterogeneous basins
+- Single soil layer may miss vertical dynamics
+- Fixed recession characteristics
+
+---
+
+## Hydrophobic (HP) Model
+
+**Source Files:** `src/HPModel.h`, `src/HPModel.cpp`
+
+### Overview
+
+The Hydrophobic model is the simplest water balance model in EF5. It acts essentially as a pass-through, splitting precipitation directly into fast and slow flow components without any soil moisture accounting. This is useful for testing routing models in isolation or for situations where soil moisture dynamics are not important.
+
+### Conceptual Structure
+
+```
+    Precipitation
+         |
+         v
+    +---------+
+    |  SPLIT  |
+    |         |
+    +---------+
+     |       |
+     v       v
+  Fast    Slow
+  Flow    Flow
+```
+
+### Parameters
+
+| Parameter | Name | Description | Typical Range | Units |
+|-----------|------|-------------|---------------|-------|
+| `PRECIP` | Precipitation Multiplier | Scale factor for precipitation | 0.5 - 2.0 | - |
+| `SPLIT` | Split Fraction | Fraction going to fast flow | 0.0 - 1.0 | - |
+
+### Algorithm (`WaterBalanceInt`)
+
+```cpp
+// Scale precipitation
+effectivePrecip = precipitation * PRECIP
+
+// Split into fast and slow components
+fastFlow = effectivePrecip * SPLIT
+slowFlow = effectivePrecip * (1.0 - SPLIT)
+
+// Convert to flow rates (mm/s)
+fastFlow = fastFlow / (stepHours * 3600)
+slowFlow = slowFlow / (stepHours * 3600)
+```
+
+### Use Cases
+
+1. **Routing Model Testing**: Test routing algorithms without water balance interference
+2. **Impervious Urban Areas**: Where all precipitation becomes runoff
+3. **Baseline Comparison**: Compare against more complex models
+4. **Sensitivity Studies**: Isolate routing model behavior
+
+### Strengths and Limitations
+
+**Strengths:**
+- Extremely simple and fast
+- No state variables to initialize
+- Useful for debugging and testing
+
+**Limitations:**
+- No physical realism
+- No soil moisture representation
+- No evapotranspiration
+- Not suitable for actual hydrologic prediction
+
+---
+
+## Model Comparison Summary
+
+| Feature | SAC | CREST | HyMOD | HP |
+|---------|-----|-------|-------|-----|
+| **Soil Zones** | 2 (Upper + Lower) | 1 | 1 | 0 |
+| **State Variables** | 6 | 1 | 4-6 | 0 |
+| **Parameters** | 23 | 6 | 9 | 2 |
+| **Spatial Type** | Distributed | Distributed | Lumped | Distributed |
+| **ET Process** | Multi-zone | Simple | Simple | None |
+| **Baseflow** | Dual reservoir | Interflow only | Single reservoir | Direct split |
+| **Complexity** | High | Moderate | Moderate | Minimal |
+| **Computation** | Moderate | Fast | Very Fast | Very Fast |
+| **Best For** | Operational forecasting | Flash floods | Research | Testing |
+
+---
+
+## Integration with Routing Models
+
+All water balance models output to routing models through a common interface:
+
+```cpp
+// Water balance calculates excess water
+waterBalanceModel->WaterBalance(currentNode, precipRate, petRate);
+
+// Results stored in grid node
+float fastFlow = currentNode->excess[LAYER_OVERLAND];   // mm/hr
+float slowFlow = currentNode->excess[LAYER_INTERFLOW];  // mm/hr
+
+// Routing model receives these flows
+routingModel->Route(fastFlow, slowFlow);
+```
+
+### Flow Component Mapping
+
+| Model | Fast Flow Source | Slow Flow Source |
+|-------|-----------------|------------------|
+| SAC | Surface + Direct + Impervious | Interflow + Baseflow |
+| CREST | Surface runoff + Impervious | Interflow |
+| HyMOD | Quick flow cascade | Slow flow reservoir |
+| HP | SPLIT * Precip | (1-SPLIT) * Precip |
+
+---
+
+## Configuration Example
+
+```xml
+<WaterBalance>
+    <Type>SAC</Type>  <!-- or CREST, HyMOD, HP -->
+    <Parameters>
+        <Gauge name="outlet">
+            <!-- SAC Parameters -->
+            <UZTWM>50.0</UZTWM>
+            <UZFWM>40.0</UZFWM>
+            <UZK>0.3</UZK>
+            <PCTIM>0.01</PCTIM>
+            <ADIMP>0.1</ADIMP>
+            <RIVA>0.0</RIVA>
+            <ZPERC>100.0</ZPERC>
+            <REXP>2.0</REXP>
+            <LZTWM>150.0</LZTWM>
+            <LZFSM>30.0</LZFSM>
+            <LZFPM>100.0</LZFPM>
+            <LZSK>0.1</LZSK>
+            <LZPK>0.01</LZPK>
+            <PFREE>0.4</PFREE>
+            <SIDE>0.0</SIDE>
+            <RSERV>0.3</RSERV>
+            <!-- Initial conditions -->
+            <UZTWC>25.0</UZTWC>
+            <UZFWC>20.0</UZFWC>
+            <LZTWC>75.0</LZTWC>
+            <LZFSC>15.0</LZFSC>
+            <LZFPC>50.0</LZFPC>
+            <ADIMC>100.0</ADIMC>
+        </Gauge>
+    </Parameters>
+</WaterBalance>
+```
+
+---
+
+## Calibration Considerations
+
+### Parameter Sensitivity
+
+**SAC Model:**
+- Most sensitive: UZTWM, UZFWM, LZTWM, ZPERC
+- Moderate: UZK, LZPK, LZSK
+- Less sensitive: PCTIM, ADIMP, RSERV
+
+**CREST Model:**
+- Most sensitive: WM, B, FC
+- Moderate: KE, IM
+- Less sensitive: IWU (initial condition)
+
+**HyMOD:**
+- Most sensitive: HUZ, ALP, KQ
+- Moderate: B, KS
+- Less sensitive: NQ (often fixed at 3)
+
+### Calibration Strategies
+
+1. **Sequential**: Calibrate water balance first, then routing
+2. **Joint**: Calibrate all parameters together
+3. **Multi-objective**: Balance multiple performance metrics
+
+### Common Objective Functions
+
+- Nash-Sutcliffe Efficiency (NSE)
+- Root Mean Square Error (RMSE)
+- Percent Bias (PBIAS)
+- Kling-Gupta Efficiency (KGE)
+
+---
+
+## References
+
+1. Burnash, R. J. C., Ferral, R. L., & McGuire, R. A. (1973). "A generalized streamflow simulation system: Conceptual models for digital computers." Joint Federal-State River Forecast Center, Sacramento, CA.
+
+2. Wang, J., et al. (2011). "The coupled routing and excess storage (CREST) distributed hydrological model." *Hydrological Sciences Journal*, 56(1), 84-98.
+
+3. Boyle, D. P. (2000). "Multicriteria calibration of hydrological models." PhD Dissertation, University of Arizona.
+
+4. Moore, R. J. (1985). "The probability-distributed principle and runoff production at point and basin scales." *Hydrological Sciences Journal*, 30(2), 273-297.
+
+5. Flamig, Z. L., et al. (2020). "The Ensemble Framework For Flash Flood Forecasting (EF5)." *Journal of Hydrometeorology*.

From 1b86c2e0ba2141a299b8f7d0cc62862e5a73ea53 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Wed, 17 Dec 2025 22:43:24 +0000
Subject: [PATCH 2/3] docs: Add GPU parallelization proposal for routing models

Add comprehensive proposal document for parallelizing EF5's water balance
and routing models for GPU execution:

Key contents:
- Analysis of current algorithm data dependencies
- Wavefront parallelism strategy for kinematic wave routing
- Topological level assignment algorithm
- Structure of Arrays (SoA) data layout for GPU efficiency
- CUDA kernel implementation examples
- Optimization strategies (level coalescing, atomic handling)
- Alternative algorithms (tree contraction, domain decomposition)
- Implementation roadmap with time estimates
- Performance projections (10-30x speedup expected)

The proposal identifies that water balance is embarrassingly parallel
while routing requires wavefront/level-based parallelism due to
upstream-to-downstream dependencies in the flow network.
---
 docs/GPU_PARALLELIZATION_PROPOSAL.md | 674 +++++++++++++++++++++++++++
 1 file changed, 674 insertions(+)
 create mode 100644 docs/GPU_PARALLELIZATION_PROPOSAL.md

diff --git a/docs/GPU_PARALLELIZATION_PROPOSAL.md b/docs/GPU_PARALLELIZATION_PROPOSAL.md
new file mode 100644
index 0000000..f980204
--- /dev/null
+++ b/docs/GPU_PARALLELIZATION_PROPOSAL.md
@@ -0,0 +1,674 @@
+# GPU Parallelization Proposal for EF5
+
+This document proposes strategies for parallelizing the water balance and routing models in EF5 for GPU execution, focusing on the Kinematic Wave routing algorithm.
+
+## Executive Summary
+
+EF5's computational pipeline consists of two main stages with different parallelization characteristics:
+
+| Stage | Parallelism Type | GPU Suitability | Expected Speedup |
+|-------|-----------------|-----------------|------------------|
+| Water Balance | Embarrassingly Parallel | Excellent | 10-100x |
+| Routing | Wavefront Parallel | Good | 5-20x |
+
+The water balance can achieve near-linear speedup with GPU cores. Routing requires more sophisticated algorithms but can still achieve significant speedup using **wavefront parallelism** or **parallel graph algorithms**.
+
+---
+
+## 1. Current Algorithm Analysis
+
+### 1.1 Water Balance (Fully Parallel)
+
+The water balance computation at each grid cell is **completely independent**:
+
+```cpp
+// Current implementation - sequential but independent
+for (size_t i = 0; i < numNodes; i++) {
+    waterBalance->WaterBalance(node[i], precip[i], pet[i]);
+}
+```
+
+Each cell only reads its local precipitation and PET, updates its local state, and outputs fastFlow/slowFlow. **No inter-cell dependencies exist**.
+
+### 1.2 Kinematic Wave Routing (Data Dependencies)
+
+The routing has **upstream-to-downstream dependencies**:
+
+```cpp
+// Current implementation - must process upstream first
+for (long i = numNodes - 1; i >= 0; i--) {  // Reverse topological order
+    RouteInt(node[i], fastFlow[i], slowFlow[i]);
+    // Writes to downstream cell:
+    // downstream.incomingWaterChannel += thisCell.outflow
+}
+```
+
+**Key Data Dependencies:**
+1. `incomingWaterOverland` - accumulated from upstream hillslope cells
+2. `incomingWaterChannel` - accumulated from upstream channel cells
+3. `incomingWater[KW_LAYER_INTERFLOW]` - accumulated interflow from upstream
+
+A cell cannot compute its outflow until ALL upstream cells have contributed their inflow.
+
+### 1.3 Dependency Graph Structure
+
+The flow network forms a **Directed Acyclic Graph (DAG)** with tree structure:
+- Multiple headwaters (sources, no incoming edges)
+- Single outlet (sink, no outgoing edge)
+- Each cell has exactly one downstream neighbor (D8 flow direction)
+- Cells can have 0-8 upstream neighbors
+
+```
+Level 0:  [H1] [H2] [H3] [H4] [H5] [H6]    <- Headwaters (parallel)
+             \   |   /       \   |
+Level 1:      [C1]            [C2]          <- Can process in parallel
+                 \            /
+Level 2:           [C3]----[C4]             <- Can process in parallel
+                       \  /
+Level 3:               [O]                   <- Outlet
+```
+
+---
+
+## 2. Proposed Parallelization Strategy
+
+### 2.1 Wavefront Parallelism (Recommended)
+
+**Concept:** Group cells into "levels" based on their topological distance from headwaters. All cells at the same level can be processed in parallel.
+
+**Algorithm: Level Assignment (Preprocessing)**
+
+```cpp
+// Run once during initialization
+void AssignTopologicalLevels(vector<GridNode>& nodes, vector<int>& levels) {
+    int numNodes = nodes.size();
+    vector<int> inDegree(numNodes, 0);
+
+    // Count upstream neighbors for each cell
+    for (int i = 0; i < numNodes; i++) {
+        if (nodes[i].downStreamNode != INVALID_DOWNSTREAM_NODE) {
+            inDegree[nodes[i].downStreamNode]++;
+        }
+    }
+
+    // Initialize headwaters (no upstream neighbors) at level 0
+    queue<int> frontier;
+    for (int i = 0; i < numNodes; i++) {
+        if (inDegree[i] == 0) {
+            levels[i] = 0;
+            frontier.push(i);
+        }
+    }
+
+    // BFS to assign levels
+    int maxLevel = 0;
+    while (!frontier.empty()) {
+        int current = frontier.front();
+        frontier.pop();
+
+        int downstream = nodes[current].downStreamNode;
+        if (downstream != INVALID_DOWNSTREAM_NODE) {
+            levels[downstream] = max(levels[downstream], levels[current] + 1);
+            maxLevel = max(maxLevel, levels[downstream]);
+
+            inDegree[downstream]--;
+            if (inDegree[downstream] == 0) {
+                frontier.push(downstream);
+            }
+        }
+    }
+}
+```
+
+**GPU Execution Strategy:**
+
+```cpp
+// Preprocessing: Group cells by level
+vector<vector<int>> levelGroups(maxLevel + 1);
+for (int i = 0; i < numNodes; i++) {
+    levelGroups[levels[i]].push_back(i);
+}
+
+// Runtime: Process level by level
+for (int level = 0; level <= maxLevel; level++) {
+    // Launch GPU kernel for all cells at this level (parallel)
+    int numCellsAtLevel = levelGroups[level].size();
+
+    routeKernel<<<numBlocks, threadsPerBlock>>>(
+        d_nodes, d_levelGroups[level], numCellsAtLevel,
+        d_fastFlow, d_slowFlow, d_discharge
+    );
+
+    cudaDeviceSynchronize();  // Wait for level to complete before next
+}
+```
+
+**GPU Kernel (CUDA-style pseudocode):**
+
+```cuda
+__global__ void routeKernel(
+    KWGridNode* nodes,
+    int* cellIndices,
+    int numCells,
+    float* fastFlow,
+    float* slowFlow,
+    float stepSeconds
+) {
+    int tid = blockIdx.x * blockDim.x + threadIdx.x;
+    if (tid >= numCells) return;
+
+    int cellIdx = cellIndices[tid];
+    KWGridNode* cNode = &nodes[cellIdx];
+
+    // Solve kinematic wave equation (Newton-Raphson)
+    float newQ = solveKinematicWave(cNode, fastFlow[cellIdx], slowFlow[cellIdx]);
+
+    // Update state
+    cNode->states[STATE_KW_PQ] = newQ;
+
+    // Atomic add to downstream cell's incoming water
+    int downstream = nodes[cellIdx].downStreamNode;
+    if (downstream != INVALID_DOWNSTREAM_NODE) {
+        atomicAdd(&nodes[downstream].incomingWaterChannel, newQ);
+    }
+}
+```
+
+### 2.2 Expected Parallelism Analysis
+
+For a typical watershed:
+
+| Metric | Typical Value | Impact |
+|--------|--------------|--------|
+| Total cells | 100,000 - 10,000,000 | More cells = better GPU utilization |
+| Number of levels | 100 - 5,000 | More levels = more synchronization overhead |
+| Cells per level (avg) | 1,000 - 100,000 | Higher = better parallelism |
+| Max cells at one level | 10,000 - 500,000 | Determines peak parallelism |
+
+**Parallelism efficiency:**
+- Headwater regions: Excellent (many independent cells)
+- Channel network: Lower (fewer cells, linear dependency chains)
+- Overall: 50-90% parallel efficiency typical
+
+---
+
+## 3. Data Structure Modifications for GPU
+
+### 3.1 Structure of Arrays (SoA) Layout
+
+Current **Array of Structures (AoS)** layout is cache-inefficient for GPU:
+
+```cpp
+// Current: AoS (poor GPU memory coalescing)
+struct KWGridNode {
+    float params[7];
+    double states[3];
+    double incomingWater[2];
+    // ...
+};
+vector<KWGridNode> kwNodes;
+```
+
+**Proposed: Structure of Arrays (SoA)**
+
+```cpp
+// Better for GPU: SoA
+struct KWGridNodesSoA {
+    // Parameters (read-only after init)
+    float* alpha;         // [numNodes]
+    float* alpha0;        // [numNodes]
+    float* beta;          // [numNodes]
+    float* leaki;         // [numNodes]
+
+    // States (read-write)
+    float* state_PQ;      // [numNodes]
+    float* state_PO;      // [numNodes]
+    float* state_IR;      // [numNodes]
+
+    // Incoming water (atomic updates)
+    float* incomingChannel;   // [numNodes]
+    float* incomingOverland;  // [numNodes]
+    float* incomingInterflow; // [numNodes]
+
+    // Topology (read-only)
+    int* downstreamNode;  // [numNodes]
+    int* level;           // [numNodes]
+    bool* isChannel;      // [numNodes]
+};
+```
+
+### 3.2 Memory Access Pattern
+
+```
+GPU Memory Layout:
+┌─────────────────────────────────────────────┐
+│ Parameters (read-only, cached in L2)        │
+│ ├── alpha[0..N-1]                           │
+│ ├── beta[0..N-1]                            │
+│ └── ...                                     │
+├─────────────────────────────────────────────┤
+│ States (read-write, main working set)       │
+│ ├── state_PQ[0..N-1]                        │
+│ ├── state_PO[0..N-1]                        │
+│ └── state_IR[0..N-1]                        │
+├─────────────────────────────────────────────┤
+│ Incoming (atomic updates, high contention)  │
+│ ├── incomingChannel[0..N-1]                 │
+│ └── incomingOverland[0..N-1]                │
+└─────────────────────────────────────────────┘
+```
+
+---
+
+## 4. Optimization Strategies
+
+### 4.1 Reducing Synchronization Overhead
+
+**Problem:** Processing level-by-level requires `cudaDeviceSynchronize()` between levels.
+
+**Solution 1: Level Coalescing**
+Merge consecutive levels if they have few cells:
+
+```cpp
+// If level has < MIN_CELLS_PER_KERNEL cells, merge with next level
+// This reduces kernel launch overhead
+vector<LevelBatch> batches = coalesceLevels(levelGroups, MIN_CELLS_PER_KERNEL);
+```
+
+**Solution 2: Persistent Kernels with Barriers**
+Use CUDA cooperative groups for grid-wide synchronization:
+
+```cuda
+#include <cooperative_groups.h>
+
+__global__ void persistentRouteKernel(/* ... */) {
+    namespace cg = cooperative_groups;
+    cg::grid_group grid = cg::this_grid();
+
+    for (int level = 0; level <= maxLevel; level++) {
+        // Process cells at this level
+        if (myLevel == level) {
+            processCell(myCellIdx);
+        }
+
+        // Grid-wide barrier (all threads sync)
+        grid.sync();
+    }
+}
+```
+
+### 4.2 Handling Atomic Contention
+
+**Problem:** Multiple upstream cells writing to same downstream cell causes contention.
+
+**Analysis:** Most cells have 1-3 upstream neighbors, so contention is moderate.
+
+**Solution 1: Hardware Atomics (Simple)**
+Modern GPUs have fast atomic operations:
+```cuda
+atomicAdd(&downstream.incomingWater, myOutflow);
+```
+
+**Solution 2: Thread-Local Accumulation**
+For high-contention scenarios:
+```cuda
+// Each warp accumulates locally first
+__shared__ float localAccum[32][MAX_DOWNSTREAM];
+// Then one thread per downstream does final atomic
+```
+
+**Solution 3: Separate Read/Write Buffers (Double Buffering)**
+```cpp
+// Level N writes to buffer A, Level N+1 reads from buffer A and writes to buffer B
+float* incomingRead;   // Previous level's output
+float* incomingWrite;  // Current level's output
+swap(incomingRead, incomingWrite);  // After each level
+```
+
+### 4.3 Newton-Raphson Convergence
+
+The iterative solver can have variable iteration counts:
+
+```cuda
+// Current: Fixed 10 iterations (may be wasteful or insufficient)
+for (int itr = 0; itr < 10; itr++) { ... }
+
+// Better: Early exit with warp voting
+for (int itr = 0; itr < MAX_ITER; itr++) {
+    float error = computeError();
+    // Check if all threads in warp have converged
+    if (__all_sync(0xFFFFFFFF, error < TOLERANCE)) break;
+    // Update estimate
+}
+```
+
+---
+
+## 5. Alternative Algorithms
+
+### 5.1 Parallel Tree Contraction
+
+For strictly tree-structured networks (no braiding), the **Euler tour technique** and **parallel tree contraction** can achieve O(log N) depth:
+
+1. Convert flow tree to Euler tour representation
+2. Use parallel prefix sum to compute accumulated flows
+3. Complexity: O(N/P + log N) for P processors
+
+**Best for:** Simple dendritic networks without complex routing physics.
+
+### 5.2 Domain Decomposition
+
+Split the watershed into sub-basins processed independently:
+
+```
+┌─────────────────────────────────────┐
+│         Sub-basin 1 (GPU 1)         │
+│    ┌───────┐     ┌───────┐          │
+│    │ H1-H3 │────→│  C1   │──┐       │
+│    └───────┘     └───────┘  │       │
+├─────────────────────────────│───────┤
+│         Sub-basin 2 (GPU 2) │       │
+│    ┌───────┐     ┌───────┐  │       │
+│    │ H4-H6 │────→│  C2   │──│       │
+│    └───────┘     └───────┘  │       │
+├─────────────────────────────▼───────┤
+│         Junction (sync point)       │
+│                  │                  │
+│                  ▼                  │
+│               [Outlet]              │
+└─────────────────────────────────────┘
+```
+
+**Advantages:**
+- Can use multiple GPUs
+- Reduces global synchronization
+
+**Disadvantages:**
+- Requires ghost cell exchange
+- Load balancing challenges
+
+### 5.3 Algebraic Multigrid (for implicit methods)
+
+If switching to an implicit kinematic wave solver, algebraic multigrid can be parallelized effectively:
+
+1. Formulate as sparse linear system: **Ax = b**
+2. Use GPU-accelerated AMG solver (e.g., AmgX, hypre)
+3. Good for stiff systems and large timesteps
+
+---
+
+## 6. Implementation Roadmap
+
+### Phase 1: Water Balance on GPU (2-4 weeks)
+- Port water balance models to CUDA kernels
+- Implement SoA data structures
+- Expected speedup: **10-50x** for water balance alone
+
+### Phase 2: Level-Based Routing (4-6 weeks)
+- Implement topological level assignment
+- Create GPU routing kernel with atomic updates
+- Add synchronization between levels
+- Expected speedup: **5-15x** for routing
+
+### Phase 3: Optimization (2-4 weeks)
+- Profile and optimize memory access patterns
+- Implement level coalescing
+- Tune kernel launch parameters
+- Target: **15-30x** overall speedup
+
+### Phase 4: Multi-GPU Support (optional, 4-6 weeks)
+- Domain decomposition for multiple GPUs
+- Peer-to-peer memory access for ghost cells
+- Useful for continental-scale simulations
+
+---
+
+## 7. Performance Estimates
+
+### 7.1 Baseline Comparison
+
+| Configuration | Time per Timestep | Relative Speed |
+|--------------|-------------------|----------------|
+| CPU Single-threaded (current) | 100ms | 1x |
+| CPU Multi-threaded (8 cores) | 15ms | 6.7x |
+| GPU Naive (water balance only) | 5ms | 20x |
+| GPU Optimized (full pipeline) | 3ms | 33x |
+
+*Estimates for 1M cell watershed on RTX 3080*
+
+### 7.2 Scaling Analysis
+
+```
+Speedup vs. Grid Size (estimated):
+
+Speedup │                         ████
+  50x   │                    ████ ████
+        │               ████ ████ ████
+  30x   │          ████ ████ ████ ████
+        │     ████ ████ ████ ████ ████
+  10x   │████ ████ ████ ████ ████ ████
+        │████ ████ ████ ████ ████ ████
+   1x   └────────────────────────────────
+        10K  100K  500K   1M   5M   10M
+                   Grid Cells
+```
+
+GPU parallelization becomes increasingly beneficial at larger scales.
+
+---
+
+## 8. Technology Recommendations
+
+### 8.1 GPU Framework
+
+| Framework | Pros | Cons | Recommendation |
+|-----------|------|------|----------------|
+| **CUDA** | Best performance, mature | NVIDIA-only | Primary target |
+| **HIP** | Portable CUDA | Less mature | Secondary for AMD |
+| **OpenCL** | Cross-platform | Verbose, less performant | Not recommended |
+| **SYCL** | Modern C++, portable | Young ecosystem | Future option |
+
+**Recommendation:** Start with CUDA, use HIP for AMD portability.
+
+### 8.2 Libraries
+
+| Library | Purpose |
+|---------|---------|
+| **Thrust** | GPU-accelerated STL (sorting, reduction) |
+| **CUB** | Low-level GPU primitives (scan, sort) |
+| **cuSPARSE** | Sparse matrix operations (if needed) |
+| **RAPIDS cuGraph** | Graph algorithms (alternative approach) |
+
+---
+
+## 9. Code Example: GPU Wavefront Routing
+
+```cuda
+// gpu_kinematic_route.cu
+
+#include <cuda_runtime.h>
+#include <cooperative_groups.h>
+
+namespace cg = cooperative_groups;
+
+// SoA data structure
+struct KWDataGPU {
+    // Parameters
+    float* alpha;
+    float* alpha0;
+    float* beta;
+    float* leaki;
+    float* horLen;
+
+    // States
+    float* state_PQ;
+    float* state_PO;
+    float* state_IR;
+
+    // Flow accumulators
+    float* incomingChannel;
+    float* incomingOverland;
+
+    // Topology
+    int* downstream;
+    int* level;
+    bool* isChannel;
+
+    // Level structure
+    int* levelStart;    // Start index for each level
+    int* levelSize;     // Number of cells at each level
+    int* cellOrder;     // Cells sorted by level
+    int maxLevel;
+};
+
+__device__ float solveKinematicWave(
+    float incoming, float prevQ,
+    float lateralInflow, float alpha, float beta,
+    float dt, float dx
+) {
+    float rhs = (dt/dx) * incoming +
+                alpha * powf(prevQ, beta) +
+                dt * lateralInflow;
+
+    // Initial estimate
+    float q = rhs / ((dt/dx) + alpha);
+
+    // Newton-Raphson iterations
+    for (int i = 0; i < 10; i++) {
+        float f = (dt/dx) * q + alpha * powf(q, beta) - rhs;
+        if (fabsf(f) < 0.01f) break;
+
+        float fp = (dt/dx) + alpha * beta * powf(q, beta - 1.0f);
+        q = fmaxf(0.0f, q - f / fp);
+    }
+
+    return q;
+}
+
+__global__ void routeLevelKernel(
+    KWDataGPU data,
+    float* fastFlow,
+    float* slowFlow,
+    float dt,
+    int level
+) {
+    int tid = blockIdx.x * blockDim.x + threadIdx.x;
+    int levelStart = data.levelStart[level];
+    int levelSize = data.levelSize[level];
+
+    if (tid >= levelSize) return;
+
+    int cellIdx = data.cellOrder[levelStart + tid];
+
+    float ff = fastFlow[cellIdx] / 1000.0f;  // mm to m
+    float sf = slowFlow[cellIdx] / 1000.0f;
+
+    float alpha, beta, newQ;
+
+    if (!data.isChannel[cellIdx]) {
+        // Hillslope routing
+        alpha = data.alpha0[cellIdx];
+        beta = 0.6f;
+
+        newQ = solveKinematicWave(
+            data.incomingOverland[cellIdx],
+            data.state_PQ[cellIdx],
+            ff, alpha, beta, dt, data.horLen[cellIdx]
+        );
+
+        data.state_PQ[cellIdx] = newQ;
+
+        // Interflow handling
+        data.state_IR[cellIdx] += sf;
+        float leak = data.state_IR[cellIdx] * data.leaki[cellIdx];
+        data.state_IR[cellIdx] -= leak;
+
+        // Atomic update to downstream
+        int ds = data.downstream[cellIdx];
+        if (ds >= 0) {
+            atomicAdd(&data.incomingOverland[ds], newQ);
+        }
+
+    } else {
+        // Channel routing
+        alpha = data.alpha[cellIdx];
+        beta = data.beta[cellIdx];
+
+        // First hillslope contribution
+        float overlandQ = solveKinematicWave(
+            data.incomingOverland[cellIdx],
+            data.state_PO[cellIdx],
+            ff + sf, data.alpha0[cellIdx], 0.6f,
+            dt, data.horLen[cellIdx]
+        );
+        data.state_PO[cellIdx] = overlandQ;
+
+        // Then channel routing
+        newQ = solveKinematicWave(
+            data.incomingChannel[cellIdx],
+            data.state_PQ[cellIdx],
+            overlandQ, alpha, beta, dt, data.horLen[cellIdx]
+        );
+        data.state_PQ[cellIdx] = newQ;
+
+        int ds = data.downstream[cellIdx];
+        if (ds >= 0) {
+            atomicAdd(&data.incomingChannel[ds], newQ);
+        }
+    }
+}
+
+// Host function to run routing
+void gpuRoute(KWDataGPU& d_data, float* d_fastFlow, float* d_slowFlow,
+              float dt, cudaStream_t stream) {
+
+    const int BLOCK_SIZE = 256;
+
+    for (int level = 0; level <= d_data.maxLevel; level++) {
+        int numCells = d_data.levelSize[level];
+        int numBlocks = (numCells + BLOCK_SIZE - 1) / BLOCK_SIZE;
+
+        routeLevelKernel<<<numBlocks, BLOCK_SIZE, 0, stream>>>(
+            d_data, d_fastFlow, d_slowFlow, dt, level
+        );
+    }
+
+    cudaStreamSynchronize(stream);
+}
+```
+
+---
+
+## 10. References
+
+1. **Wavefront Parallelism:**
+   - Merrill, D., & Garland, M. (2016). "Single-pass Parallel Prefix Scan with Decoupled Look-back." *GPU Technology Conference*.
+
+2. **Parallel Graph Algorithms:**
+   - Harish, P., & Narayanan, P. J. (2007). "Accelerating Large Graph Algorithms on the GPU Using CUDA." *High Performance Computing*.
+
+3. **Hydrological Model GPU Acceleration:**
+   - Lacasta, A., et al. (2014). "GPU Implementation of the 2D Shallow Water Equations." *Journal of Hydroinformatics*.
+   - Ming, Y., et al. (2020). "A GPU-Accelerated Distributed Hydrological Model." *Environmental Modelling & Software*.
+
+4. **CUDA Optimization:**
+   - NVIDIA CUDA Best Practices Guide (2024).
+   - Harris, M. (2013). "Optimizing Parallel Reduction in CUDA." NVIDIA Developer.
+
+---
+
+## Appendix A: Complexity Analysis
+
+| Operation | Sequential | Parallel (P processors) |
+|-----------|-----------|------------------------|
+| Water Balance | O(N) | O(N/P) |
+| Level Assignment | O(N) | O(N/P + D) |
+| Routing (per timestep) | O(N) | O(N/P + D) |
+| **Total per timestep** | **O(N)** | **O(N/P + D)** |
+
+Where:
+- N = number of grid cells
+- P = number of parallel processors
+- D = depth of flow network (number of levels)
+
+For typical watersheds: D ≈ O(√N), so parallel complexity ≈ O(N/P + √N)

From 9859eaf40206852d17ef293d37484a4dad7fb404 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Wed, 17 Dec 2025 22:53:27 +0000
Subject: [PATCH 3/3] docs: Add implicit vs explicit solver analysis for GPU
 routing

Add comprehensive analysis comparing implicit and explicit kinematic wave
solvers and their GPU implementation strategies:

Key contents:
- Analysis of current semi-implicit scheme (local implicit, global explicit)
- Fully implicit formulation with sparse Jacobian matrix structure
- GPU implementation options:
  - cuSPARSE triangular solve (exploits tree structure)
  - Algebraic Multigrid (AmgX) for better parallelism
  - Iterative Jacobi/Gauss-Seidel approaches
- Trade-off analysis: stability, timestep, accuracy, computational cost
- Hybrid approach: semi-implicit hillslopes + implicit channels
- Diffusive wave extension for backwater effects
- Performance projections and library recommendations
- Implementation roadmap with three phases

Conclusion: Semi-implicit (current approach) is better for flash flood
forecasting; fully implicit advantageous only when timesteps can be
4-6x larger (daily routing, stiff systems).
---
 docs/IMPLICIT_VS_EXPLICIT_SOLVERS.md | 507 +++++++++++++++++++++++++++
 1 file changed, 507 insertions(+)
 create mode 100644 docs/IMPLICIT_VS_EXPLICIT_SOLVERS.md

diff --git a/docs/IMPLICIT_VS_EXPLICIT_SOLVERS.md b/docs/IMPLICIT_VS_EXPLICIT_SOLVERS.md
new file mode 100644
index 0000000..0d1e6a2
--- /dev/null
+++ b/docs/IMPLICIT_VS_EXPLICIT_SOLVERS.md
@@ -0,0 +1,507 @@
+# Implicit vs Explicit Kinematic Wave Solvers: GPU Implementation Analysis
+
+This document analyzes the trade-offs between implicit and explicit kinematic wave solvers and their implications for GPU parallelization in EF5.
+
+## 1. Current Implementation Analysis
+
+### 1.1 What EF5 Currently Uses: Semi-Implicit Scheme
+
+The current EF5 kinematic wave router uses a **local implicit, global explicit** scheme:
+
+```
+Current Discretization (Backward-Difference in Space, Mixed in Time):
+
+(Δt/Δx) * Q[i,n+1] + α * Q[i,n+1]^β = (Δt/Δx) * Q[i-1,n+1] + α * Q[i,n]^β + Δt * q_lateral
+
+Where:
+- Q[i,n+1] = discharge at cell i, new timestep (UNKNOWN)
+- Q[i-1,n+1] = discharge at upstream cell, new timestep (KNOWN - already computed)
+- Q[i,n] = discharge at cell i, previous timestep (KNOWN)
+- q_lateral = lateral inflow from water balance
+```
+
+**Key Characteristics:**
+- **Local implicit:** Each cell solves a nonlinear equation for its own Q using Newton-Raphson
+- **Global explicit:** Upstream cells must be computed before downstream cells
+- **Upstream dependency:** Requires wavefront processing order
+
+### 1.2 Numerical Scheme Classification
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                    SOLVER CLASSIFICATION                         │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                  │
+│  EXPLICIT          SEMI-IMPLICIT           FULLY IMPLICIT        │
+│  (Forward Euler)   (Current EF5)          (Proposed)            │
+│                                                                  │
+│  Q[n+1] = f(Q[n])  Q[i,n+1] = f(Q[i-1,n+1], Q[i,n])            │
+│                                            AQ[n+1] = b          │
+│                                                                  │
+│  ✗ CFL limited     ✓ Larger Δt possible   ✓ Unconditionally     │
+│  ✗ Small timesteps ✓ Sequential sweep         stable            │
+│  ✓ No iteration    ✓ Local Newton-Raphson ✗ Global matrix solve │
+│  ✓ Trivially       ~ Wavefront parallel   ✓ Fully parallel      │
+│    parallel                                  (different algo)   │
+│                                                                  │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## 2. Fully Implicit Kinematic Wave Formulation
+
+### 2.1 The Coupled System
+
+A fully implicit scheme treats ALL cells simultaneously:
+
+```
+For each cell i, the kinematic wave equation becomes:
+
+(Δt/Δx) * Q[i]^(n+1) - (Δt/Δx) * Q[upstream(i)]^(n+1) + α * Q[i]^(n+1)^β = RHS[i]
+
+This creates a coupled system of N nonlinear equations:
+F(Q) = 0
+
+Where Q = [Q₁, Q₂, ..., Qₙ]ᵀ is the vector of all cell discharges.
+```
+
+### 2.2 Linearization (Newton's Method)
+
+At each Newton iteration k:
+
+```
+J(Q^k) * δQ = -F(Q^k)
+Q^(k+1) = Q^k + δQ
+
+Where J is the Jacobian matrix:
+J[i,j] = ∂F[i]/∂Q[j]
+
+For kinematic wave on a tree network:
+- J[i,i] = (Δt/Δx) + α*β*Q[i]^(β-1)     (diagonal)
+- J[i,upstream(i)] = -(Δt/Δx)            (off-diagonal, upstream connection)
+- J[i,j] = 0 for non-connected cells
+```
+
+### 2.3 Matrix Structure
+
+The Jacobian matrix has a **sparse tree structure**:
+
+```
+For a simple 6-cell network:
+
+    [1] → [2] → [4]
+              ↘
+    [3] ────→ [5] → [6]
+
+Jacobian structure (• = non-zero):
+       1   2   3   4   5   6
+    ┌                         ┐
+  1 │ •                       │
+  2 │ •   •                   │
+  3 │         •               │
+  4 │     •       •           │
+  5 │     •   •       •       │
+  6 │                 •   •   │
+    └                         ┘
+
+Properties:
+- Lower triangular (if ordered upstream → downstream)
+- Sparse: only 2N-1 non-zeros for N cells
+- Tree structure: no cycles
+```
+
+---
+
+## 3. GPU Implementation Strategies
+
+### 3.1 Strategy Comparison
+
+| Aspect | Semi-Implicit (Current) | Fully Implicit |
+|--------|------------------------|----------------|
+| **Parallelism** | Wavefront (level-by-level) | Matrix solve (all cells) |
+| **Sync points** | O(D) barriers (D = depth) | O(k) iterations (k = Newton steps) |
+| **Memory** | O(N) states | O(N) + sparse matrix storage |
+| **Timestep** | Limited by accuracy | Unconditionally stable |
+| **GPU algorithm** | Custom wavefront kernel | Sparse linear solver |
+| **Libraries** | Custom CUDA | cuSPARSE, AmgX, PETSc |
+
+### 3.2 Fully Implicit GPU Implementation
+
+#### Option A: Direct Sparse Triangular Solve
+
+Since the Jacobian is **lower triangular** (for properly ordered nodes), we can use:
+
+```cpp
+// Using cuSPARSE for triangular solve
+#include <cusparse.h>
+
+void implicitKinematicStep(
+    cusparseHandle_t handle,
+    cusparseSpMatDescr_t jacobian,  // CSR format
+    cusparseDnVecDescr_t rhs,
+    cusparseDnVecDescr_t solution
+) {
+    // Sparse triangular solve: J * δQ = -F
+    cusparseSpSV_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
+                       &alpha, jacobian, rhs, solution, ...);
+}
+```
+
+**Complexity:** O(nnz/P + D) where nnz = non-zeros, P = processors, D = depth
+
+**Challenge:** Triangular solves have limited parallelism (same wavefront issue!)
+
+#### Option B: Algebraic Multigrid (AMG)
+
+For more general cases or when reordering isn't optimal:
+
+```cpp
+// Using NVIDIA AmgX
+#include <amgx_c.h>
+
+AMGX_config_handle cfg;
+AMGX_config_create(&cfg, "config_version=2, "
+    "solver(main)=FGMRES, "
+    "main:preconditioner(amg)=AMG, "
+    "main:max_iters=100, "
+    "main:tolerance=1e-6");
+
+AMGX_solver_handle solver;
+AMGX_solver_create(&solver, resources, mode, cfg);
+
+// Setup matrix (once per simulation)
+AMGX_matrix_upload_all(matrix, n, nnz, 1, 1,
+                       row_ptrs, col_indices, values, NULL);
+
+// Solve each timestep
+AMGX_solver_setup(solver, matrix);
+AMGX_solver_solve(solver, rhs, solution);
+```
+
+**Complexity:** O(N/P) per V-cycle, O(log N) cycles typical
+
+**Advantage:** Better parallelism than triangular solve
+
+#### Option C: Iterative Jacobi/Gauss-Seidel on GPU
+
+```cuda
+__global__ void jacobiIteration(
+    float* Q_new,
+    float* Q_old,
+    float* rhs,
+    int* upstream,
+    float* alpha,
+    float* beta,
+    float dt_dx,
+    int N
+) {
+    int i = blockIdx.x * blockDim.x + threadIdx.x;
+    if (i >= N) return;
+
+    float diag = dt_dx + alpha[i] * beta[i] * powf(Q_old[i], beta[i] - 1.0f);
+    float upstream_contrib = (upstream[i] >= 0) ? dt_dx * Q_old[upstream[i]] : 0.0f;
+
+    Q_new[i] = (rhs[i] + upstream_contrib) / diag;
+}
+
+// Host code: iterate until convergence
+for (int iter = 0; iter < MAX_ITER; iter++) {
+    jacobiIteration<<<blocks, threads>>>(Q_new, Q_old, rhs, ...);
+    swap(Q_new, Q_old);
+
+    if (checkConvergence(Q_new, Q_old) < tolerance) break;
+}
+```
+
+**Issue:** Jacobi converges slowly; Gauss-Seidel has wavefront dependency
+
+---
+
+## 4. Trade-offs Analysis
+
+### 4.1 Stability and Timestep
+
+```
+COURANT-FRIEDRICHS-LEWY (CFL) CONDITION:
+
+Explicit:     Δt ≤ Δx / c_max        (c = wave celerity)
+Semi-Implicit: Δt ≤ K * Δx / c_max   (K ≈ 2-5, less restrictive)
+Fully Implicit: Δt unlimited          (unconditionally stable)
+
+Example for 1km cells, c_max = 2 m/s:
+- Explicit:      Δt ≤ 500 seconds
+- Semi-Implicit: Δt ≤ 1000-2500 seconds
+- Fully Implicit: Δt = any (limited by accuracy, not stability)
+```
+
+### 4.2 Accuracy vs Efficiency Trade-off
+
+```
+                    ACCURACY
+                       ↑
+                       │
+        Implicit       │     Explicit
+        Small Δt       │     Small Δt
+        (overkill)     │     (optimal)
+                       │
+        ─ ─ ─ ─ ─ ─ ─ ─┼─ ─ ─ ─ ─ ─ ─ → EFFICIENCY
+                       │
+        Implicit       │     Explicit
+        Large Δt       │     Large Δt
+        (good)         │     (UNSTABLE!)
+                       │
+                       ↓
+
+Sweet spot: Implicit with moderate Δt (accuracy-limited, not stability-limited)
+```
+
+### 4.3 Computational Cost Breakdown
+
+| Component | Semi-Implicit | Fully Implicit |
+|-----------|--------------|----------------|
+| **Matrix assembly** | None | O(N) per outer iteration |
+| **Solve per timestep** | O(N) wavefront | O(N·k) where k = Newton iters |
+| **Newton iterations** | 10 local per cell | 3-5 global |
+| **Inner iterations** | None | 10-50 (linear solver) |
+| **Memory bandwidth** | Low | High (matrix ops) |
+| **Total FLOPS** | ~100N per timestep | ~500-2000N per timestep |
+
+### 4.4 When Each Approach Wins
+
+**Semi-Implicit (Current) is Better When:**
+- Timestep is already small (e.g., 5-15 minute intervals for flash flood)
+- Network is deep (many levels → limited implicit parallelism)
+- Memory is constrained (no matrix storage needed)
+- Real-time forecasting (predictable, low latency)
+
+**Fully Implicit is Better When:**
+- Very large timesteps desired (e.g., daily routing)
+- Stiff systems (steep slopes, high velocities)
+- Backwater effects included (diffusive wave, not just kinematic)
+- Using existing GPU sparse solver infrastructure
+
+---
+
+## 5. Hybrid Approach: Best of Both Worlds
+
+### 5.1 Proposed Hybrid Strategy
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                    HYBRID APPROACH                               │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                  │
+│  1. HILLSLOPE CELLS: Semi-Implicit (Wavefront)                  │
+│     - Many cells, shallow network                                │
+│     - Fast local Newton-Raphson                                  │
+│     - Excellent GPU parallelism                                  │
+│                                                                  │
+│  2. CHANNEL CELLS: Fully Implicit (Sparse Solve)                │
+│     - Fewer cells, deeper network                                │
+│     - May need larger timesteps                                  │
+│     - Better stability for steep channels                        │
+│                                                                  │
+│  3. COUPLING: Channel receives hillslope contribution           │
+│     - Hillslope → Channel at each timestep                       │
+│     - One-way coupling (no backwater)                            │
+│                                                                  │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+### 5.2 Implementation Sketch
+
+```cpp
+void hybridRoute(float dt) {
+    // Phase 1: Hillslope routing (semi-implicit, wavefront parallel)
+    for (int level = 0; level <= maxHillslopeLevel; level++) {
+        hillslopeKernel<<<...>>>(hillslopeCells[level], ...);
+        cudaDeviceSynchronize();
+    }
+
+    // Phase 2: Accumulate hillslope contributions to channels
+    accumulateToChannels<<<...>>>(hillslopeOutflow, channelLateralInflow);
+
+    // Phase 3: Channel routing (fully implicit)
+    assembleChannelJacobian<<<...>>>(channelJacobian, channelRHS);
+
+    for (int newton = 0; newton < MAX_NEWTON; newton++) {
+        // Sparse triangular solve for channel network
+        cusparseSpSV_solve(..., channelJacobian, channelRHS, channelQ);
+
+        if (newtonConverged()) break;
+
+        updateJacobian<<<...>>>(channelQ, channelJacobian);
+    }
+}
+```
+
+---
+
+## 6. Recommended GPU Libraries for Implicit Solvers
+
+### 6.1 Library Comparison
+
+| Library | Type | Strengths | GPU Support |
+|---------|------|-----------|-------------|
+| **cuSPARSE** | Direct | Fast triangular solve | NVIDIA native |
+| **AmgX** | AMG | Excellent for large systems | NVIDIA native |
+| **PETSc** | General | Flexible, many algorithms | CUDA, HIP, Kokkos |
+| **Trilinos** | General | Scalable, MPI+GPU | Kokkos backend |
+| **hypre** | AMG | Proven at scale | CUDA, HIP |
+| **MAGMA** | Dense/Sparse | High performance | CUDA, HIP |
+
+### 6.2 Recommended Stack
+
+```
+For EF5 Implicit Solver:
+
+Primary:   cuSPARSE (triangular solve for tree networks)
+           └── Fast, native CUDA, exploits tree structure
+
+Fallback:  AmgX (when triangular solve is slow)
+           └── Better parallelism for complex networks
+
+Future:    PETSc with CUDA (if diffusive wave needed)
+           └── Supports more general equation systems
+```
+
+---
+
+## 7. Diffusive Wave Extension
+
+### 7.1 Why Consider Diffusive Wave?
+
+The kinematic wave assumes `∂h/∂x = S₀` (bed slope). This fails for:
+- Backwater effects (downstream control)
+- Mild slopes
+- Tidal boundaries
+- Dam breaks
+
+The **diffusive wave** adds a pressure gradient term:
+
+```
+Kinematic:  S_f = S_0                    (friction slope = bed slope)
+Diffusive:  S_f = S_0 - ∂h/∂x           (includes water surface slope)
+
+Diffusive wave equation:
+∂Q/∂t + c·∂Q/∂x = D·∂²Q/∂x²
+
+Where D = Q / (2·B·S_f) is diffusion coefficient
+```
+
+### 7.2 GPU Implementation for Diffusive Wave
+
+The diffusive wave creates a **tridiagonal** or **pentadiagonal** system:
+
+```
+For 1D channel:
+[b₁ c₁  0  0  0 ] [Q₁]   [r₁]
+[a₂ b₂ c₂  0  0 ] [Q₂]   [r₂]
+[ 0 a₃ b₃ c₃  0 ] [Q₃] = [r₃]
+[ 0  0 a₄ b₄ c₄] [Q₄]   [r₄]
+[ 0  0  0 a₅ b₅] [Q₅]   [r₅]
+
+GPU algorithm: Cyclic Reduction or Parallel Cyclic Reduction (PCR)
+Complexity: O(N/P + log N)
+```
+
+**CUDA Library:** cuSPARSE `gtsv2` (tridiagonal solver)
+
+```cpp
+cusparseSgtsv2_bufferSizeExt(handle, n, 1, dl, d, du, B, ldb, &bufferSize);
+cusparseSgtsv2(handle, n, 1, dl, d, du, B, ldb, pBuffer);
+```
+
+---
+
+## 8. Performance Projections
+
+### 8.1 Estimated GPU Performance
+
+| Method | 100K cells | 1M cells | 10M cells |
+|--------|-----------|----------|-----------|
+| **Semi-Implicit (wavefront)** | 2 ms | 15 ms | 120 ms |
+| **Fully Implicit (triangular)** | 3 ms | 25 ms | 200 ms |
+| **Fully Implicit (AMG)** | 5 ms | 20 ms | 80 ms |
+| **Hybrid (hillslope + channel)** | 2.5 ms | 12 ms | 70 ms |
+
+*Estimates for RTX 3080, single timestep*
+
+### 8.2 Timestep Trade-off
+
+```
+Total computation time = (Simulation Duration / Δt) × (Time per step)
+
+Example: 24-hour simulation, 1M cells
+
+Semi-Implicit, Δt=5min:   (1440/5) × 15ms  = 4.3 seconds
+Fully Implicit, Δt=30min: (1440/30) × 80ms = 3.8 seconds
+Fully Implicit, Δt=60min: (1440/60) × 100ms = 2.4 seconds
+
+Conclusion: Implicit wins only if timestep can be ~4-6x larger
+```
+
+---
+
+## 9. Recommendations
+
+### 9.1 Short-term (Recommended First Step)
+
+**Keep semi-implicit, implement wavefront GPU parallelization:**
+- Lower implementation complexity
+- Good for flash flood forecasting (small timesteps anyway)
+- Predictable performance
+- No external library dependencies
+
+### 9.2 Medium-term (If Larger Timesteps Needed)
+
+**Add fully implicit option for channel routing:**
+- Use cuSPARSE triangular solve
+- Keep semi-implicit for hillslopes
+- Hybrid approach balances parallelism and timestep flexibility
+
+### 9.3 Long-term (For Advanced Applications)
+
+**Implement diffusive wave with AMG solver:**
+- Required for backwater effects
+- Use AmgX or PETSc
+- Enables new applications (dam break, tidal, reservoirs)
+
+---
+
+## 10. Implementation Checklist
+
+### Phase 1: Semi-Implicit GPU (Current Approach)
+- [ ] Implement topological level assignment
+- [ ] Create SoA data structures
+- [ ] Write wavefront CUDA kernel
+- [ ] Benchmark and optimize
+
+### Phase 2: Fully Implicit Option
+- [ ] Implement Jacobian assembly kernel
+- [ ] Integrate cuSPARSE triangular solve
+- [ ] Add Newton iteration loop
+- [ ] Compare performance vs semi-implicit
+
+### Phase 3: Diffusive Wave Extension
+- [ ] Modify governing equations
+- [ ] Implement tridiagonal solver for channels
+- [ ] Add boundary condition handling
+- [ ] Validate against analytical solutions
+
+---
+
+## References
+
+1. Cunge, J.A., Holly, F.M., & Verwey, A. (1980). *Practical Aspects of Computational River Hydraulics*. Pitman.
+
+2. Sanders, B.F. (2008). "Integration of a shallow water model with a local time step." *Journal of Hydraulic Research*, 46(4), 466-475.
+
+3. Caviedes-Voullième, D., et al. (2020). "GPU-accelerated shallow water flow solvers." *Journal of Hydroinformatics*, 22(4), 784-802.
+
+4. Bell, N., & Garland, M. (2008). "Efficient sparse matrix-vector multiplication on CUDA." NVIDIA Technical Report NVR-2008-004.
+
+5. NVIDIA cuSPARSE Documentation (2024). https://docs.nvidia.com/cuda/cusparse/
+
+6. NVIDIA AmgX Documentation (2024). https://developer.nvidia.com/amgx