💬 Questions or feedback? Start a Discussion. Found it useful? A ⭐ helps others find it.
Distribution-free prediction intervals for insurance GBM and GLM pricing models — for pricing actuaries who need uncertainty quantification that holds regardless of model specification, without the coverage failures that parametric intervals produce on heterogeneous motor books.
50,000 synthetic UK motor policies. CatBoost Tweedie(p=1.5) point forecast. Heteroskedastic Gamma DGP where high-mean risks are more dispersed than Tweedie(1.5) predicts. Temporal 60/20/20 split. 90% target coverage. Run on Databricks serverless, seed=42.
| Metric | Parametric Tweedie | Conformal (pearson_weighted) | LW Conformal |
|---|---|---|---|
| Aggregate coverage @ 90% | 0.931 (over-wide) | 0.902 | 0.903 |
| Worst-decile coverage | 0.904 | 0.879 | 0.906 |
| Mean interval width (£) | 4,393 | 3,806 (-13.4%) | 3,881 (-11.7%) |
| Calibration time | ~0s | 0.01s | ~1.5s |
| Width adapts to risk segment | No | Partial | Yes |
| Distribution-free guarantee | No | Yes (marginal) | Yes (marginal) |
The parametric approach estimates a single sigma on the calibration set. When high-mean risks are genuinely more dispersed — as they are here — it over-covers low-risk policies (widening intervals to compensate) while just barely meeting the target in the top decile. Conformal intervals are 13-14% narrower and meet the target on aggregate. The locally-weighted variant also meets it in the top decile. See the full benchmark at benchmarks/benchmark_gbm.py.
Your Tweedie GBM gives point estimates. A pricing actuary needs to know the uncertainty around those estimates: not a parametric confidence interval that depends on distributional assumptions, but a guarantee that the interval will contain the actual loss at least 90% of the time, for any data distribution.
Conformal prediction provides that guarantee. The catch is that the choice of non-conformity score determines interval width. Most conformal implementations use the raw absolute residual |y - yhat|. For insurance data, that is wrong: it treats a 1-unit error on a £100 risk identically to a 1-unit error on a £10,000 risk, producing intervals that are too wide on low-risk policies and too narrow on large risks.
For Tweedie/Poisson models, Var(Y) ~ mu^p. The correct non-conformity score is the locally-weighted Pearson residual:
score(y, yhat) = |y - yhat| / yhat^(p/2)
This accounts for the inherent heteroscedasticity of insurance claims. The result: 13-14% narrower intervals with identical coverage guarantees in the CatBoost Tweedie(p=1.5) GBM benchmark (pearson_weighted: -13.4%, LW Conformal: -11.7%; 50k synthetic UK motor policies, heteroskedastic Gamma DGP, temporal 60/20/20 split, seed=42). Based on Manna et al. (2025, preprint) and arXiv 2507.06921.
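To see the scaling effect concretely, here is a plain-NumPy sketch (illustrative only, not the library API): two policies miss by the same £500, but the Pearson-weighted score ranks the miss on the small risk as far more anomalous.

```python
import numpy as np

def raw_score(y, y_hat):
    # Raw absolute residual: treats every risk identically
    return np.abs(y - y_hat)

def pearson_weighted_score(y, y_hat, p=1.5):
    # |y - yhat| / yhat^(p/2): rescales by the Tweedie-implied
    # standard deviation, since Var(Y) ~ mu^p
    return np.abs(y - y_hat) / y_hat ** (p / 2)

y_hat = np.array([100.0, 10_000.0])  # low-risk vs large-risk prediction
y = np.array([600.0, 10_500.0])      # both miss by 500

print(raw_score(y, y_hat))               # [500. 500.]
print(pearson_weighted_score(y, y_hat))  # ~[15.81, 0.5]: the miss on the small risk is ~32x "worse"
```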
Conformal Prediction Intervals for Insurance Pricing Models
uv add insurance-conformal
# With CatBoost support:
uv add "insurance-conformal[catboost]"
# With LightGBM support:
uv add "insurance-conformal[lightgbm]"
# With everything (CatBoost, LightGBM, plotting):
uv add "insurance-conformal[all]"Dependencies: polars and pandas are both required. Polars is the primary output format — all prediction and diagnostic methods return pl.DataFrame. Pandas is required for binning utilities (pd.qcut/pd.cut) and for accepting pandas DataFrame inputs. Both install automatically.
import numpy as np
from insurance_conformal import InsuranceConformalPredictor
# Synthetic data: 50k training, 10k calibration, 10k test
rng = np.random.default_rng(42)
n_train, n_cal, n_test = 50_000, 10_000, 10_000
n_features = 6
X_train = rng.standard_normal((n_train, n_features))
X_cal = rng.standard_normal((n_cal, n_features))
X_test = rng.standard_normal((n_test, n_features))
y_train = rng.gamma(shape=1.5, scale=500, size=n_train)
y_cal = rng.gamma(shape=1.5, scale=500, size=n_cal)
y_test = rng.gamma(shape=1.5, scale=500, size=n_test)
# Fit your model however you normally would
import catboost
model = catboost.CatBoostRegressor(
loss_function="Tweedie:variance_power=1.5",
iterations=300,
learning_rate=0.05,
depth=6,
verbose=0,
)
model.fit(X_train, y_train)
# Wrap it
cp = InsuranceConformalPredictor(
model=model,
nonconformity="pearson_weighted", # default, recommended for insurance
distribution="tweedie",
tweedie_power=1.5,
)
# Calibrate on held-out data (must not overlap with training set)
cp.calibrate(X_cal, y_cal)
# Generate 90% prediction intervals
intervals = cp.predict_interval(X_test, alpha=0.10)
# DataFrame with columns: lower, point, upper
print(intervals.head())
# shape: (5, 3)
# ┌───────┬────────────┬─────────────┐
# │ lower ┆ point ┆ upper │
# │ --- ┆ --- ┆ --- │
# │ f64 ┆ f64 ┆ f64 │
# ╞═══════╪════════════╪═════════════╡
# │ 0.0 ┆ 787.800176 ┆ 1629.240867 │
# │ 0.0 ┆ 652.927728 ┆ 1383.831645 │
# │ 0.0 ┆ 741.107597 ┆ 1544.860221 │
# │ 0.0 ┆ 763.402341 ┆ 1585.222083 │
# │ 0.0 ┆ 734.043618 ┆ 1532.043552 │
# └───────┴────────────┴─────────────┘
# Note: lower=0.0 is expected — insurance losses are non-negative and the predictor clips at zero.

conformal_prediction_intervals.py compares Tweedie conformal prediction intervals against a parametric bootstrap baseline on a synthetic motor book, then drills into per-segment coverage analysis across risk deciles and vehicle groups. It shows exactly where the bootstrap fails to meet its stated 90% coverage target — and confirms that the conformal approach holds by construction.
A Databricks-importable version is also available: Databricks notebook.
The marginal coverage guarantee means P(y in interval) >= 1 - alpha averaged over all observations. In insurance, you also need to check that coverage is uniform across risk deciles: a model can achieve 90% overall while covering only 65% of high-risk policies.
# THE key diagnostic
diag = cp.coverage_by_decile(X_test, y_test, alpha=0.10)
print(diag)
# decile mean_predicted n_obs coverage target_coverage
# 0 1 0.0234 400 0.923 0.90
# 1 2 0.0512 400 0.910 0.90
# ...
# 9 10 2.3410 400 0.905 0.90
# Full summary: marginal coverage + decile breakdown
cp.summary(X_test, y_test, alpha=0.10)
# Matplotlib plots - use CoverageDiagnostics for coverage_plot and interval_width_distribution
from insurance_conformal import CoverageDiagnostics
intervals_for_diag = cp.predict_interval(X_test, alpha=0.10)
diag_tool = CoverageDiagnostics(
y_true=y_test,
y_lower=intervals_for_diag["lower"].to_numpy(),
y_upper=intervals_for_diag["upper"].to_numpy(),
y_pred=intervals_for_diag["point"].to_numpy(),
alpha=0.10,
)
fig = diag_tool.coverage_plot()
fig.savefig("coverage_by_decile.png", dpi=150)
# Interval width distribution
fig = diag_tool.interval_width_distribution()

| Score | Formula | When to use |
|---|---|---|
| `pearson_weighted` | \|y - yhat\| / yhat^(p/2) | Default. Tweedie/Poisson pricing models. |
| `pearson` | \|y - yhat\| / sqrt(yhat) | Pure Poisson frequency models (p=1). |
| `deviance` | Deviance residual | When you want exact statistical optimality; slower. |
| `anscombe` | Anscombe transform | Variance-stabilising alternative to deviance. |
| `raw` | \|y - yhat\| | Baseline only. Not appropriate for insurance data. |
The score hierarchy for interval width (narrowest first, coverage identical):
pearson_weighted <= deviance <= anscombe < pearson < raw
Note: ordering is approximate and depends on Tweedie power. At p=1 (Poisson), pearson and pearson_weighted converge. At p=2 (Gamma), deviance and pearson are nearly equivalent. Treat the hierarchy as a guide for p in the range 1.0–2.0.
In insurance, you should calibrate on recent data to capture current loss trends, not a random subsample of all years:
from insurance_conformal.utils import temporal_split
# Split by date - calibration gets the most recent 20%
X_train, X_cal, y_train, y_cal, _, _ = temporal_split(
X, y,
calibration_frac=0.20,
date_col="accident_year", # column in X DataFrame
)
model.fit(X_train, y_train)
cp.calibrate(X_cal, y_cal)

Use insurance-cv if you need full walk-forward cross-validation respecting IBNR development structure.
Split conformal prediction provides the following guarantee for exchangeable data:
P(y_test in [lower, upper]) >= 1 - alpha
This is distribution-free — it holds regardless of the true data distribution or model misspecification. The core assumption is exchangeability: calibration and test observations must be drawn from the same distribution and be interchangeable in order. Temporal covariate shift — where the risk profile of test data differs from calibration data — violates this assumption and can degrade coverage in practice. Use temporal calibration splits (calibrate on the most recent accident year before the test period) to minimise the distribution gap. The temporal_split utility is provided for this purpose.
"Exchangeable" means the joint distribution of calibration and test data is invariant to the order of observations — roughly, no systematic distributional shift between calibration and test. For insurance, this means you should not calibrate on year 5 and test on year 1. Use temporal splits.
For stable interval widths, target n_cal >= 2,000. The coverage guarantee holds with smaller calibration sets — split conformal is valid for any n_cal >= 1 — but with n_cal < 500 the quantile estimate has high variance and intervals will be materially wider and more variable than at larger sizes. With n_cal = 100, the interval width fluctuates by 20-30% across random seeds on realistic insurance data. Pricing teams working with recent 6-month calibration windows on thin books should check the cp.summary() output for the quantile stability diagnostics.
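The small-n_cal variance is easy to reproduce with a standalone sketch (synthetic Gamma-distributed scores; the numbers are illustrative, not library output):

```python
import numpy as np

def conformal_q(scores, alpha=0.10):
    # Finite-sample-valid split-conformal quantile level
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

rel_spread = {}
for n_cal in (100, 2_000):
    qs = [
        conformal_q(np.random.default_rng(seed).gamma(1.5, 500.0, n_cal))
        for seed in range(200)
    ]
    # Relative spread of the quantile (and hence interval half-width) across seeds
    rel_spread[n_cal] = np.std(qs) / np.mean(qs)

print(rel_spread)  # the n_cal=100 quantile is several times more variable
```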
Split conformal, not cross-conformal. Cross-conformal is more statistically efficient but requires refitting the model on each calibration fold. For GBMs that take hours to train, this is not practical. Split conformal trains once, calibrates once.
No MAPIE dependency. MAPIE is excellent but it does not expose the insurance-specific scores implemented here. The split conformal algorithm is simple enough to own: 20 lines of code for conformal_quantile() plus the score functions.
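A sketch of what owning that algorithm looks like, written against the score and quantile formulas above (illustrative, not the library's actual source):

```python
import numpy as np

def pearson_weighted(y, y_hat, p=1.5):
    return np.abs(y - y_hat) / y_hat ** (p / 2)

def conformal_quantile(scores, alpha):
    # Split-conformal quantile with the finite-sample (n+1) correction
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

def predict_interval(mu, q, p=1.5):
    # Invert the score inequality |y - mu| / mu^(p/2) <= q; clip lower at 0
    half = q * mu ** (p / 2)
    return np.clip(mu - half, 0.0, None), mu + half

# Toy check with a perfectly specified mean model on Gamma losses
rng = np.random.default_rng(0)
mu_cal = rng.uniform(500.0, 5_000.0, 10_000)
y_cal = rng.gamma(2.0, mu_cal / 2.0)
q = conformal_quantile(pearson_weighted(y_cal, mu_cal), alpha=0.10)

mu_test = rng.uniform(500.0, 5_000.0, 10_000)
y_test = rng.gamma(2.0, mu_test / 2.0)
lo, hi = predict_interval(mu_test, q)
coverage = np.mean((y_test >= lo) & (y_test <= hi))
print(round(coverage, 3))  # close to the 0.90 target
```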
LightGBM or CatBoost for the spread model. LocallyWeightedConformal now supports both. CatBoost is the default; pass backend="lightgbm" to use LightGBM instead (requires uv add "insurance-conformal[lightgbm]"). The Manna et al. arXiv:2507.06921 paper originally used LightGBM, so this option closes that gap. Both backends take the same spread_model_params override. There is no material coverage difference between the two — pick whichever is already in your stack.
Lower bound clipped at 0. Insurance losses are non-negative. Prediction intervals with negative lower bounds are nonsensical. We clip at 0 unconditionally.
Auto-detection of Tweedie power. For CatBoost, the power parameter is read from the loss function string. For sklearn TweedieRegressor, from model.power. If detection fails, we warn and default to p=1.5. Pass tweedie_power= explicitly if you know the correct value.
Standard conformal prediction controls coverage probability: P(Y in C(X)) >= 1 - alpha. That guarantees a fraction of intervals contain the true outcome — but says nothing about how badly wrong the misses are. For insurance pricing, the question that matters is different: how much are we underpriced, in expectation?
The insurance_conformal.risk subpackage implements Conformal Risk Control (CRC, Angelopoulos et al., ICLR 2024), which controls expected loss directly:
E[L(C_lambda(X), Y)] <= alpha
for any bounded monotone loss L. No parametric assumptions. Finite-sample valid.
Given a GBM that outputs predicted pure premium p(X), find the smallest loading factor lambda* such that the expected shortfall from underpriced policies is bounded:
from insurance_conformal.risk import PremiumSufficiencyController
psc = PremiumSufficiencyController(alpha=0.05, B=5.0)
psc.calibrate(y_cal, premium_cal) # calibrate on held-out year
result = psc.predict(premium_new) # apply to next year's book
# result["upper_bound"]: risk-controlled loading factor per policy
# result["lambda_hat"]: the single lambda* that achieves E[shortfall] <= 5%| Controller | Use case |
|---|---|
PremiumSufficiencyController |
Bound expected underpricing shortfall: E[max(claim - lambda * premium, 0) / premium] <= alpha |
IntervalWidthController |
Find the most efficient conformal quantile level that still bounds expected interval width |
SelectiveRiskController |
Accept/reject risks to bound expected loss on the accepted book |
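Under the hood, CRC calibration is a monotone search over lambda. A hand-rolled sketch of the mechanics (the grid, loss cap, and data here are all illustrative assumptions, not the library implementation): because the capped shortfall loss is nonincreasing in lambda and bounded by B, the smallest lambda satisfying the finite-sample condition inherits the E[loss] <= alpha guarantee.

```python
import numpy as np

def shortfall_loss(y, premium, lam, B=5.0):
    # Underpricing shortfall per unit premium, capped at B so the loss is bounded
    return np.minimum(np.maximum(y - lam * premium, 0.0) / premium, B)

def crc_lambda(y_cal, premium_cal, alpha=0.05, B=5.0):
    # Loss is nonincreasing in lambda: scan the grid upward and stop at the
    # first lambda satisfying the CRC condition (n*Rhat(lam) + B)/(n+1) <= alpha
    n = len(y_cal)
    for lam in np.linspace(0.5, 5.0, 451):
        risk = shortfall_loss(y_cal, premium_cal, lam, B).mean()
        if (n * risk + B) / (n + 1) <= alpha:
            return lam
    return 5.0

rng = np.random.default_rng(42)
premium = rng.uniform(300.0, 1_500.0, 5_000)
claims = rng.gamma(5.0, premium / 5.0)  # mean claim equals premium (break-even book)
lam_hat = crc_lambda(claims, premium)
print(round(lam_hat, 2))  # loading factor above 1 that bounds expected shortfall at 5%
```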
from insurance_conformal.risk import (
PremiumSufficiencyController,
IntervalWidthController,
SelectiveRiskController,
conformal_risk_calibration,
shortfall_loss,
premium_sufficiency_report,
)

- Angelopoulos, A. N., Bates, S., Fisch, A., Lei, L., & Schuster, T. (2024). Conformal Risk Control. ICLR 2024. arXiv:2208.02814.
- Selective CRC: arXiv:2512.12844 (2025).
New in v0.5.1. Conformal prediction intervals for frequency-severity insurance models, based on Graziadei et al. (arXiv:2307.13124). Import from insurance_conformal.claims.
The frequency-severity decomposition is standard in non-life pricing: total loss = E[frequency] × E[severity | claim]. The conformal subtlety is what to feed into the severity model at calibration time. Using the observed claim count would create a distributional mismatch between calibration scores and test scores, breaking the coverage guarantee. The correct approach — as established by Graziadei et al. — is to feed the predicted frequency from the frequency model into the severity model at both calibration and test time. The resulting conformity scores are exchangeable with the test-time prediction, so the coverage guarantee holds.
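The mechanics can be sketched in plain NumPy with stand-in models of known form (a toy, not the library source; the constant severity mean and known frequency coefficient are assumptions): the calibration score is built from the predicted frequency, exactly as at test time, so the guarantee carries over.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    x = rng.standard_normal(n)
    d = rng.poisson(np.exp(0.3 * x))      # observed claim counts
    y = rng.gamma(2.0, 400.0, n) * d      # aggregate loss (simplified; 0 when no claim)
    return x, y

x_cal, y_cal = simulate(5_000)
x_test, y_test = simulate(5_000)

mu_hat = lambda x: np.exp(0.3 * x)        # stand-in fitted frequency model
sev_hat = 800.0                           # stand-in fitted mean severity

# Calibration scores use mu_hat(x) -- the same input available at test time --
# never the observed count, so scores stay exchangeable with test predictions
scores = np.abs(y_cal - mu_hat(x_cal) * sev_hat)
n = len(scores)
q = np.quantile(scores, min(np.ceil((n + 1) * 0.9) / n, 1.0), method="higher")

point = mu_hat(x_test) * sev_hat
covered = np.mean((y_test >= point - q) & (y_test <= point + q))
print(round(covered, 3))  # close to 0.90
```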
from sklearn.linear_model import PoissonRegressor, GammaRegressor
from insurance_conformal.claims import FrequencySeverityConformal
fs = FrequencySeverityConformal(
freq_model=PoissonRegressor(),
sev_model=GammaRegressor(),
# spread_model defaults to CatBoostRegressor if not specified
)
# d_train = observed claim counts; y_train = observed aggregate losses
fs.fit(X_train, d_train, y_train)
# d_cal is passed for validation only; scores use mu_hat(x), not d_cal
fs.calibrate(X_cal, d_cal, y_cal)
# 90% prediction intervals
intervals = fs.predict_interval(X_test, alpha=0.10)
# DataFrame with columns: lower, point, upper

The variability model sigma_hat is fitted on training residuals |y_i - psi_hat(x_i, d_i)| for observed-claim observations, analogous to the spread model in LocallyWeightedConformal. Pass spread_model= to override the default CatBoost variability model.
Coverage guarantee: P(Y in C(X)) in [1-alpha, 1-alpha + 1/(n_cal+1)] — the same finite-sample valid guarantee as standard split conformal, provided calibration and test data are exchangeable.
Reference: Graziadei, H., Janett, C., Embrechts, P. & Bucher, A. (2023). Conformal Prediction for Insurance Data. arXiv:2307.13124.
SCRReport wraps a calibrated conformal predictor and produces per-risk 99.5% upper bounds suitable for internal stress-testing and model validation.
Disclaimer: SCRReport is an internal stress-testing tool. Solvency II SCR calculations for regulatory purposes require sign-off under an approved internal model or the standard formula. Do not use this output in regulatory returns without appropriate actuarial review, governance sign-off, and alignment with your firm's approved methodology.
from insurance_conformal.scr import SCRReport
scr = SCRReport(predictor=cp)
scr_bounds = scr.solvency_capital_requirement(X_test, alpha=0.005)
val_table = scr.coverage_validation_table(X_test, y_test)
print(scr.to_markdown())

The primary use case for this library is pricing uncertainty — but conformal prediction has a secondary application in internal model validation that is worth knowing about.
PRA SS1/23 (model risk management, effective May 2023) requires firms to validate that models perform as stated, including checking that stated confidence levels are actually achieved in out-of-sample data. For reserve and capital models that produce prediction intervals — whether under Solvency II internal model approval or as part of ORSA stress testing — the question "does this model's stated 90% interval actually contain the true outcome 90% of the time?" is a model validation question, not a pricing question.
Conformal prediction answers that question without assuming a specific loss distribution. cp.coverage_by_decile() and scr.coverage_validation_table() produce the empirical coverage evidence that a model validation function needs to challenge whether a model's stated confidence levels hold in practice. This is a distribution-free check: if your internal capital model claims its 99.5th percentile bound is £X, you can use historical out-of-sample data to test whether that claim holds — and document the result for your SS1/23 model validation pack. We are not claiming this replaces the statistical framework required for Solvency II internal model approval; it is one empirical validation tool among several.
Standard conformal prediction with a static calibration set handles exchangeable data well, but insurance books are not static. Mid-year claims inflation (UK motor: +30% in 2021-2022), Ogden rate changes, and CAT events all create abrupt distributional shifts. ACI (Adaptive Conformal Inference) adapts by nudging the miscoverage level alpha_t one step at a time. At the default gamma=0.005, ACI needs O(1/gamma) = 200 steps to fully reprice — about 17 years of monthly data. That is not adaptation; it is drift.
RetroAdj (Jun & Ohn 2025, arXiv:2511.04275) fixes this by retroactively correcting all leave-one-out residuals in the active window simultaneously at each step. The correction uses rank-one updates to the inverse kernel matrix Q = (K + lambda*I)^{-1}, so no additional model fitting is required. After an abrupt shift, the jackknife+ interval responds within 1-3 steps.
Hard constraint: The base model must be kernel ridge regression (KRR) or another self-stable linear smoother. GLMs and GBMs do not qualify. For pricing teams with an existing model, use residual-only mode.
from insurance_conformal import RetroAdj
# Features should be pre-standardised
model = RetroAdj(
bandwidth=1.0, # RBF kernel bandwidth
lambda_reg=0.1, # KRR regularisation
window_size=250, # sliding window length (paper default)
gamma=0.005, # ACI step size
alpha_update="aci", # 'aci' or 'sfogd'
)
model.fit(y_train, X_train)
lower, upper = model.predict_interval(y_test, X_test, alpha=0.10)

When you have a pre-fitted external model, pass residuals instead:
resid_train = y_train - glm.predict(X_train)
resid_test = y_test - glm.predict(X_test)
model = RetroAdj(window_size=250)
model.fit(resid_train) # X=None: kernel degenerates to ridge-mean
lower_r, upper_r = model.predict_interval(resid_test, alpha=0.10)
# Shift back to original scale
lower_claims = lower_r + glm.predict(X_test)
upper_claims = upper_r + glm.predict(X_test)

With X=None the kernel degenerates (K = ones-matrix + lambda*I) so KRR reduces to a ridge-regularised mean. This retains the jackknife+ interval and improved alpha tracking but is an approximation of the full method. Alternatively, use X = np.arange(len(y)).reshape(-1, 1) as a time index to let KRR fit a smooth trend.
| Mode | When to use |
|---|---|
| `alpha_update="aci"` | Default. Fixed step size gamma. Fast response to abrupt shifts. |
| `alpha_update="sfogd"` | AdaGrad-style (Algorithm 5 of Jun & Ohn). Better for slowly-varying shifts. Step size scales down as gradients accumulate. |
After many rank-one updates, Q can lose symmetry or positive definiteness due to floating-point accumulation. RetroAdj handles this with:
- Symmetry enforcement: `Q = (Q + Q.T) / 2` after every update.
- Periodic reset: full recomputation of Q from scratch every `reset_freq` steps (default 500). O(w^3) per reset — for w=250 that is ~15M flops, negligible.
- Instability detection: if the rank-one update denominator goes non-positive (impossible in exact arithmetic), the method resets Q for that step and continues.
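These guards are easy to see in a standalone Sherman-Morrison sketch (assumed mechanics; the kernel, sizes, and variable names here are illustrative, not the library source):

```python
import numpy as np

rng = np.random.default_rng(0)
w, lam = 50, 0.1
X = rng.standard_normal((w, 1))
K = np.exp(-0.5 * (X - X.T) ** 2)             # RBF kernel matrix (symmetric)
Q = np.linalg.inv(K + lam * np.eye(w))        # Q = (K + lambda*I)^{-1}

u = 0.1 * rng.standard_normal((w, 1))         # rank-one change: K -> K + u u^T
denom = 1.0 + (u.T @ Q @ u).item()
if denom <= 0:                                # impossible in exact arithmetic;
    Q_new = np.linalg.inv(K + u @ u.T + lam * np.eye(w))  # reset Q for this step
else:
    Q_new = Q - (Q @ u) @ (u.T @ Q) / denom   # Sherman-Morrison rank-one update
Q_new = (Q_new + Q_new.T) / 2.0               # enforce symmetry after every update

exact = np.linalg.inv(K + u @ u.T + lam * np.eye(w))
print(np.max(np.abs(Q_new - exact)) < 1e-8)   # matches the full recomputation
```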
| Parameter | Default | Notes |
|---|---|---|
| `bandwidth` | 1.0 | RBF bandwidth. Pre-standardise features or tune this. |
| `lambda_reg` | 0.1 | KRR regularisation. Larger = smoother, more biased. |
| `window_size` | 250 | Sliding window length. Paper default. |
| `gamma` | 0.005 | ACI/SFOGD step size. |
| `alpha_update` | `"aci"` | `"aci"` or `"sfogd"`. |
| `symmetric` | `False` | If True, use \|R_loo\| for symmetric intervals. Signed residuals (default) give asymmetric intervals more appropriate for right-skewed claims. |
| `reset_freq` | 500 | Steps between full Q recomputation. |
Reference: Jun, J. & Ohn, I. (2025). "Online Conformal Inference with Retrospective Adjustment for Faster Adaptation to Distribution Shift." arXiv:2511.04275.
Scenario: 2000-step online stream of synthetic UK motor total loss estimates. At timestep 1000, all true claim values inflate by 30% (the UK motor 2021-2022 scenario). The base model is NOT updated — its predictions remain on the pre-inflation scale. Both methods must adapt their intervals online to recover the 90% coverage target.
Methods compared:
- RetroAdj — jackknife+ intervals over KRR with rank-one LOO retroactive recalibration (this library)
- ACI — Adaptive Conformal Inference (Gibbs & Candes 2021): sliding-window quantile intervals with additive alpha_t update. Same window size, same gamma, no retroactive correction.
Parameters: gamma=0.05, window_size=200, target coverage 90%, seed=42.
Expected results (placeholder — run notebooks/benchmark_retroadj.py on Databricks for actual figures):
| Metric | RetroAdj | ACI |
|---|---|---|
| Pre-shift coverage | ~90% | ~90% |
| Post-shift coverage (full 1000-step window) | ~88-91% | ~80-87% |
| Steps to recover 90% coverage after shift | ~15-30 | ~80-150 |
| Post-shift mean interval width | comparable | comparable |
| Speedup vs ACI | 3–8x faster recovery | baseline |
Why RetroAdj wins on recovery speed: When the first post-inflation residual enters the window, RetroAdj recomputes all leave-one-out residuals simultaneously via the updated kernel matrix Q. The jackknife+ interval at the very next step already reflects the new distribution level. ACI must wait for old residuals to age out of the sliding window — one step at a time. At gamma=0.05 this is ~20 steps; at the more common gamma=0.005 it is ~200 steps (~17 years of monthly data).
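The aging-out dynamic is easy to reproduce with a toy sliding-window quantile (a sketch of the window mechanism only, not of either method's full implementation; the shift size and stream lengths are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
w = 200
pre = rng.normal(0.0, 1.0, 1_000)    # residuals before the shift
post = rng.normal(3.0, 1.0, 300)     # abrupt level shift in residuals
stream = np.concatenate([pre, post])

q_trace = []
for t in range(len(pre), len(stream)):
    window = stream[t - w:t]         # sliding window, no retroactive correction
    q_trace.append(np.quantile(np.abs(window), 0.9))

# The window quantile climbs only as post-shift residuals displace old ones,
# reaching the new level once most of the window is post-shift data
print(round(q_trace[2], 2), round(q_trace[250], 2))
```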
When the advantage disappears: for gradual drift (no abrupt step change), both methods perform comparably. RetroAdj's advantage is specifically for abrupt shifts. It also requires more computation: O(w^2) per step vs O(w log w) for ACI. For w=200 this is still fast (milliseconds per step).
See notebooks/benchmark_retroadj.py for the full benchmark. Run on Databricks serverless.
Reference: Jun, J. & Ohn, I. (2025). arXiv:2511.04275.
Exchangeability assumption. Split conformal requires calibration and test data to be exchangeable. Temporal covariate shift — changes in portfolio mix, inflation, or risk profile between calibration and test periods — weakens this assumption. Use temporal calibration splits and monitor coverage drift over time.
IBNR on recent accident years. For severity and pure premium models, calibrating on the most recent accident year means calibrating on incomplete claims. IBNR (incurred but not reported) development causes non-conformity scores to be computed on understated y_cal values, producing intervals that are too narrow for open development periods. Recommend using only fully-developed accident years (typically 3+ years prior) for calibration, or applying a development factor to y_cal before calibration.
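A toy illustration of this failure mode (synthetic numbers; the flat point forecast and the 85% reported fraction are hypothetical assumptions): calibrating on understated losses shrinks the conformal quantile, which then undercovers ultimate losses.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1_000.0                                  # point forecast (ultimate basis)
y_cal_ult = rng.gamma(2.0, mu / 2.0, 10_000)  # what calibration losses develop to
y_test_ult = rng.gamma(2.0, mu / 2.0, 10_000)

def q90(scores):
    n = len(scores)
    return np.quantile(scores, min(np.ceil((n + 1) * 0.9) / n, 1.0), method="higher")

reported = 0.85                               # hypothetical: 85% reported at calibration
q_undeveloped = q90(np.abs(y_cal_ult * reported - mu))  # scores on understated y_cal
q_developed = q90(np.abs(y_cal_ult - mu))               # scores on developed y_cal

cov_bad = np.mean(np.abs(y_test_ult - mu) <= q_undeveloped)
cov_good = np.mean(np.abs(y_test_ult - mu) <= q_developed)
print(round(cov_bad, 3), round(cov_good, 3))  # undercoverage vs ~0.90
```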
Marginal vs. conditional coverage. The conformal guarantee is marginal: it holds on average across all observations. High-risk subgroups can still be systematically under-covered if the non-conformity score does not fully account for heteroscedasticity. Always check coverage_by_decile() after calibration.
Score choice matters. The raw score produces valid but very wide intervals on insurance data. Use pearson_weighted for Tweedie/Poisson models. If you switch scores, recalibrate.
- Manna, S. et al. (2025). "Distribution-free prediction sets for Tweedie regression." arXiv:2507.06921 (preprint; not yet peer-reviewed as of March 2026).
- Angelopoulos, A. N., & Bates, S. (2023). "Conformal prediction: A gentle introduction." Foundations and Trends in Machine Learning, 16(4), 494-591.
- Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic learning in a random world. Springer.
| Library | Description |
|---|---|
| insurance-monitoring | Model drift detection — track coverage stability over time |
| insurance-conformal-ts | Conformal prediction for non-exchangeable claims time series |
| insurance-severity | Spliced severity models and EVT — conformal intervals for tail risk quantification |
The main benchmark uses CatBoost Tweedie(p=1.5) as the point forecast and a heteroskedastic Gamma DGP where variance grows faster than Tweedie(1.5) predicts in the high-mean tail. This is the scenario that motivates conformal prediction: the parametric assumption breaks, and only distribution-free methods give a valid coverage guarantee.
50,000 synthetic UK motor policies. Features: vehicle_age, driver_age, mileage, ncd_years, area_risk. Nonlinear mean structure (young driver + old vehicle interaction). Gamma shape parameter drops from ~2.0 at median predicted mean to ~0.8 at the 90th percentile — high-mean risks have CV ~1.16 vs ~0.95 for low-mean risks. Temporal 60/20/20 split: 30,000 train, 10,000 calibration, 10,000 test. Run on Databricks serverless (2026-03-21, seed=42). Benchmark time: 4s. Run: benchmarks/benchmark_gbm.py.
Parametric Tweedie baseline — global sigma from Pearson residuals on calibration set, intervals as yhat ± z × sigma × yhat^(p/2):
| Decile | Avg predicted (£) | Coverage |
|---|---|---|
| 1 | 1,035 | 0.955 |
| 2 | 1,184 | 0.953 |
| 3 | 1,292 | 0.938 |
| 4 | 1,390 | 0.945 |
| 5 | 1,487 | 0.924 |
| 6 | 1,596 | 0.925 |
| 7 | 1,714 | 0.921 |
| 8 | 1,850 | 0.919 |
| 9 | 2,026 | 0.925 |
| 10 | 2,344 | 0.904 |
Conformal (pearson_weighted score, CatBoost forecast):
| Decile | Coverage |
|---|---|
| 1 | 0.929 |
| 2 | 0.924 |
| 3 | 0.913 |
| 4 | 0.908 |
| 5 | 0.895 |
| 6 | 0.900 |
| 7 | 0.886 |
| 8 | 0.895 |
| 9 | 0.890 |
| 10 | 0.879 |
Locally-weighted conformal (secondary CatBoost spread model):
| Decile | Coverage |
|---|---|
| 1 | 0.907 |
| 2 | 0.913 |
| 3 | 0.900 |
| 4 | 0.901 |
| 5 | 0.897 |
| 6 | 0.899 |
| 7 | 0.895 |
| 8 | 0.903 |
| 9 | 0.910 |
| 10 | 0.906 |
Summary:
| Metric | Parametric | Conformal (pearson_weighted) | LW Conformal |
|---|---|---|---|
| Aggregate coverage @ 90% | 0.931 | 0.902 | 0.903 |
| Aggregate coverage @ 95% | 0.950 | 0.953 | 0.952 |
| Worst-decile coverage @ 90% | 0.904 | 0.879 | 0.906 |
| Mean interval width @ 90% (£) | 4,393 | 3,806 | 3,881 |
| Width vs parametric | ref | -13.4% | -11.7% |
| Distribution-free guarantee | No | Yes (marginal) | Yes (marginal) |
| Width adapts to risk segment | No | Partial | Yes |
- The parametric Tweedie approach estimates a single sigma on the calibration set. Because the DGP has genuinely higher dispersion at higher means, the single sigma overestimates uncertainty for low-risk policies (unnecessary width) while barely meeting the 90% target for the top decile (90.4%). The aggregate coverage of 93.1% signals the over-width problem.
- Conformal pearson_weighted: 90.2% aggregate — correct. Intervals are 13.4% narrower than parametric. The top-decile coverage of 87.9% is a 2.1pp miss, consistent with the marginal guarantee (it holds on average, not per-decile). If per-decile coverage matters, use LW conformal.
- LW conformal: the secondary spread model learns which features predict large residuals. The result: 90.6% in the top decile (slightly above target), 11.7% narrower than parametric, 2.0% wider than standard conformal. If you have the training data available, LW conformal dominates on the metrics that matter for reinsurance attachment decisions.
- The conformal coverage guarantee is marginal, not conditional. Always check `coverage_by_decile()` after calibration.
The original benchmark (2026-03-16) uses Ridge regression on log(y) as the baseline model. With a well-matched log-normal DGP, both parametric and conformal intervals achieve near-uniform coverage across deciles. Conformal wins on interval width (-13% vs raw) but the coverage argument is less compelling. This is the scenario where conformal is not needed — but it still helps with width.
Run: benchmarks/benchmark.py
| Metric | Naive parametric (Ridge) | Conformal (pearson_weighted) |
|---|---|---|
| Aggregate coverage @ 90% | 0.917 | 0.901 |
| Worst-decile coverage | 0.917 | 0.714 |
| Mean interval width (£) | 6,445 | 4,675 |
| Distribution-free guarantee | No | Yes (marginal) |
Note: conformal undercovers the top decile at 71.4% here — a known limitation of the pearson_weighted score with a poor point forecast. The score divides by yhat^0.75, compressing scores for high-predicted-value policies and producing intervals that are too narrow for them. This failure mode is exactly why you should use coverage_by_decile() in practice, and why the GBM benchmark above uses a well-calibrated CatBoost forecast.
Practical guidance: conformal prediction is most valuable when (a) your point forecast is well-calibrated (GBM, not Ridge), and (b) the residual distribution is genuinely more complex than a single parametric family can describe — which is the common case for heterogeneous UK motor books. The LW conformal variant is the recommendation for production use.
Model building
| Library | Description |
|---|---|
| shap-relativities | Extract rating relativities from GBMs using SHAP |
| insurance-interactions | Automated GLM interaction detection via CANN and NID scores |
| insurance-cv | Walk-forward cross-validation respecting IBNR structure |
Uncertainty quantification
| Library | Description |
|---|---|
| bayesian-pricing | Hierarchical Bayesian models for thin-data segments |
| insurance-credibility | Bühlmann-Straub credibility weighting |
| insurance-distributional | Full conditional distribution per risk: mean, variance, CoV |
Deployment and optimisation
| Library | Description |
|---|---|
| insurance-optimise | Constrained rate change optimisation with FCA PS21/5 compliance |
| insurance-demand | Conversion, retention, and price elasticity modelling |
Governance
| Library | Description |
|---|---|
| insurance-fairness | Proxy discrimination auditing for UK insurance models |
| insurance-causal | Double Machine Learning for causal pricing inference |
| insurance-monitoring | Model monitoring: PSI, A/E ratios, Gini drift test |
Spatial
| Library | Description |
|---|---|
| insurance-spatial | BYM2 spatial territory ratemaking for UK personal lines |
MIT. See LICENSE.
Issues and pull requests welcome at github.com/burning-cost/insurance-conformal.