Skip to content

burning-cost/insurance-multilevel

Repository files navigation

insurance-multilevel

Tests

Two-stage CatBoost + REML random effects for insurance pricing with high-cardinality group factors.

Blog post: Your Group Factors Are Not All Worth Modelling

The Problem

UK personal lines pricing teams face a specific structural problem: portfolios are distributed through hundreds of brokers, schemes, and affinity partners. These groups differ systematically — broker A has an older, lower-risk customer base; broker B has young drivers; scheme C operates in flood-prone postcodes. But you cannot capture this by throwing broker IDs into a GBM.

The reasons one-hot encoding fails at scale:

  • 500 brokers means 500 extra features, most with sparse data
  • A GBM with 300 trees will overfit to the largest brokers and ignore the rest
  • New brokers at prediction time have no training data

What you actually need is shrinkage: for a new or low-volume broker, trust the book-wide average. For a high-volume broker with years of data, trust their own experience. The crossover point is determined by how variable brokers are relative to within-group noise.

This is classical Bühlmann-Straub credibility theory, reimplemented as a statistically principled REML random effects model. If you want the classical closed-form version — for scheme pricing or geographic rating without a GBM stage — see insurance-credibility, which implements BuhlmannStraub and HierarchicalBuhlmannStraub directly.

The Solution

Two stages, run sequentially:

Stage 1: CatBoost on individual risk factors Group columns (broker, scheme) are deliberately excluded. CatBoost learns age bands, vehicle classes, postcode sectors, NCB — everything about the individual policy. Output: f̂ᵢ (CatBoost predicted premium).

Stage 2: REML random intercepts on log-ratio residuals Compute rᵢ = log(yᵢ / f̂ᵢ) — the log-ratio of observed to CatBoost-predicted. Fit a one-way random effects model:

rᵢ = μ + b_g(i) + εᵢ
b_g ~ N(0, τ²)      (between-group variation)
εᵢ  ~ N(0, σ²)      (within-group noise)

REML estimates σ² and τ², then computes BLUPs for each group:

b̂_g = Z_g × (r̄_g - μ̂)         (shrunk group mean)
Z_g = τ² / (τ² + σ²/n_g)        (Bühlmann credibility weight)

Final premium: f̂(x) × exp(b̂_g)

Installation

uv add insurance-multilevel

💬 Questions or feedback? Start a Discussion. Found it useful? A ⭐ helps others find it.

Quick Start

import polars as pl
from insurance_multilevel import MultilevelPricingModel

model = MultilevelPricingModel(
    catboost_params={"loss_function": "RMSE", "iterations": 500},
    random_effects=["broker_id", "scheme_id"],
    min_group_size=5,
)

model.fit(X_train, y_train, weights=exposure, group_cols=["broker_id", "scheme_id"])

premiums = model.predict(X_test, group_cols=["broker_id", "scheme_id"])

Credibility Summary

summary = model.credibility_summary()
print(summary)
# shape: (47, 11) — one row per (level, group) combination.
# With random_effects=["broker_id", "scheme_id"] and 30 brokers + 17 schemes,
# you get 47 rows (30 broker-level rows + 17 scheme-level rows), each with its
# own variance components estimated at that level.
shape: (47, 11)
┌───────────┬─────────────┬────────┬────────────┬────────┬────────────┬───────────────────┬───────┬────────┬──────┬──────────┐
│ level     ┆ group       ┆ n_obs  ┆ group_mean ┆ blup   ┆ multiplier ┆ credibility_weight┆ tau2  ┆ sigma2 ┆ k    ┆ eligible │
│ ---       ┆ ---         ┆ ---    ┆ ---        ┆ ---    ┆ ---        ┆ ---               ┆ ---   ┆ ---    ┆ ---  ┆ ---      │
│ str       ┆ str         ┆ f64    ┆ f64        ┆ f64    ┆ f64        ┆ f64               ┆ f64   ┆ f64    ┆ f64  ┆ bool     │
╞═══════════╪═════════════╪════════╪════════════╪════════╪════════════╪═══════════════════╪═══════╪════════╪══════╪══════════╡
│ broker_id ┆ broker_02   ┆ 342.0  ┆ 0.1823     ┆ 0.1691 ┆ 1.184      ┆ 0.928             ┆ 0.103 ┆ 0.248  ┆ 2.41 ┆ true     │
│ broker_id ┆ broker_07   ┆ 289.0  ┆ -0.1354    ┆-0.1231 ┆ 0.884      ┆ 0.919             ┆ 0.103 ┆ 0.248  ┆ 2.41 ┆ true     │
│ ...       ┆ ...         ┆ ...    ┆ ...        ┆ ...    ┆ ...        ┆ ...               ┆ ...   ┆ ...    ┆ ...  ┆ ...      │

The multiplier column is what pricing teams use. broker_02 has consistently worse-than-expected experience; apply a 1.184 loading on top of the base premium for policies written through that broker.

The column names — tau2 (τ², between-group variance), sigma2 (σ², within-group variance), k (Bühlmann's k), credibility_weight (Z) — deliberately mirror the notation in classical Bühlmann-Straub theory. See insurance-credibility for the closed-form version of the same model, where these parameters are explained in detail.

Variance Components

vc = model.variance_components["broker_id"]
print(vc)
# VarianceComponents(sigma2=0.2481, tau2={broker_id=0.1032},
#                   k={broker_id=2.41}, log_likelihood=-412.3,
#                   converged=True, iterations=23)

The Bühlmann k tells you the crossover point: a broker needs k=2.41 claim-years of data before their own experience gets more than 50% credibility. With σ²=0.25 and τ²=0.10, this is a reasonable portfolio — brokers do vary, but not outrageously.

Diagnostics

from insurance_multilevel import (
    icc,
    variance_decomposition,
    high_credibility_groups,
    groups_needing_data,
    lift_from_random_effects,
)

# Intraclass Correlation Coefficient
# "10% of total variance is between-broker"
print(icc(vc, "broker_id"))  # 0.294

# Which brokers have enough data to trust?
hc = high_credibility_groups(model.credibility_summary(), min_z=0.7)

# How much more data does each broker need to reach 80% credibility?
needs = groups_needing_data(model.credibility_summary(), target_z=0.8)

# Does Stage 2 actually help?
stage1 = model.stage1_predict(X_test)
final = model.predict(X_test)
lift = lift_from_random_effects(y_test, stage1, final, weights=exposure_test)
print(lift["malr_improvement_pct"])  # e.g., 4.2% improvement

Design Choices

Why two stages instead of joint estimation? Joint approaches (GPBoost, MERF) are mathematically cleaner but have identifiability problems when group IDs are high-cardinality. If broker_id is in the CatBoost feature set, the tree can absorb some of the group signal, leaving underestimated τ² in Stage 2. Two-stage with group exclusion is simpler and avoids this.

Why REML instead of ML? Maximum likelihood underestimates variance components because it doesn't account for the degrees of freedom consumed by fixed effects (the grand mean μ). REML conditions out μ first. For small numbers of groups (m < 30), the difference is material. For m > 100, ML and REML converge.

Why log-ratio residuals? Insurance premia are multiplicative. A broker loading of 1.15 applies regardless of whether the base premium is £300 or £1,200. By working on the log scale, we get additive random effects that translate cleanly to multiplicative adjustments.

Why min_group_size=5? With n=1, you cannot separate the group random effect from within-group noise. The BLUP for a singleton group would be either 0 (correct: no information) or dominated by that single extreme observation (wrong: no shrinkage possible without group-level data). We exclude singletons from variance estimation and give them Z=0. This is conservative and actuarially correct.

API Reference

MultilevelPricingModel

MultilevelPricingModel(
    catboost_params: dict | None = None,
    random_effects: list[str] | None = None,
    min_group_size: int = 5,
    reml: bool = True,
)

Methods:

  • fit(X, y, weights, group_cols) — fit two-stage model
  • predict(X, group_cols, allow_new_groups) — return premium predictions
  • credibility_summary(group_col) — Bühlmann-Straub summary DataFrame
  • stage1_predict(X) — CatBoost predictions only (no random effects)
  • log_ratio_residuals(X, y) — log(y / f_hat) for diagnostics
  • variance_components — dict of VarianceComponents per group level
  • feature_importances — Stage 1 CatBoost feature importances

RandomEffectsEstimator

Lower-level class if you want to use REML variance components without CatBoost.

est = RandomEffectsEstimator(reml=True, min_group_size=5)
vc = est.fit(residuals, group_ids, weights)
blups = est.predict_blup(group_ids)

VarianceComponents

Dataclass holding σ², τ², k (Bühlmann), log-likelihood, convergence info.

Scope (V1)

  • Random intercepts only (no random slopes)
  • Gaussian residuals on log-transformed response
  • Two-stage estimation (not joint)
  • Nested hierarchy supported: fit separate estimators per group level
  • Crossed effects excluded (broker × territory combinations)
  • One-way random effects per group column

Performance

Benchmarked on Databricks (2026-03-16) on a synthetic UK motor portfolio: 8,000 policies across 200 occupation codes (40 thick groups / 160 thin groups), true sigma_u=0.30 (between-group std), true sigma_e=0.50 (within-group std), ICC=0.36. Three approaches compared on a held-out 20% test set. See benchmarks/benchmark.py for the full script.

Variance component recovery:

Parameter True Estimated
sigma2 (within-group) 0.2500 0.1532
tau2 (between-group) 0.0900 0.0761
ICC 0.360 0.332

REML under-estimates both variance components on this dataset. The within-group sigma2 is substantially lower than the truth, likely because the Stage 1 CatBoost model absorbs much of the variance before the REML step. The estimates converge better when Stage 1 is a weak model or the fixed effects are excluded. Despite this attenuation, the Stage 2 lift is positive and substantial.

Holdout gamma deviance (lower = better):

Method Gamma Deviance
No group effect (CatBoost on X only) 0.338092
One-hot encoding of group 0.272967
MultilevelPricingModel (CatBoost + REML) 0.272280
Stage 1 only (CatBoost) vs Full model lift +15.93% reduction

MultilevelPricingModel beats one-hot encoding on deviance (marginally) and outperforms no-group by 19.4%. The Stage 2 lift of 15.93% is large because ICC=0.33 — when a third of the variance is attributable to group membership, ignoring group structure is expensive.

MAPE by group thickness:

Method Thin groups MAPE Thick groups MAPE
No group effect 68.35% 62.46%
One-hot encoding 66.09% 51.11%
MultilevelPricingModel (REML) 63.55% 52.85%

MultilevelPricingModel achieves the lowest MAPE on thin groups (63.55% vs 66.09% for one-hot). One-hot encoding slightly wins on thick groups because with enough data the one-hot coefficients converge to the correct values, and the REML shrinkage marginally over-regularises. The difference is 1.7pp — within noise. On thin groups the gap is decisive: one-hot overfits noisy 10-policy estimates; REML shrinks them appropriately.

Group effect recovery: Pearson r(BLUP, true effect)=0.7291 (target: >0.6). BLUP adjustments correlate strongly with the planted true occupation effects.

Credibility weight distribution:

  • Thin groups (<15 exposure): mean Z=0.38 (heavily shrunk to portfolio mean)
  • Medium groups (15-80 exposure): mean Z=0.97
  • Thick groups (>=80 exposure): mean Z=0.98

ICC=0.33. The lift is substantial whenever occupation code effects are real. On a portfolio where certain occupations over-perform by 15-20%, this pays for itself.

When to use: Portfolios where group-level (broker, scheme, territory, occupation) effects are material but group membership cannot simply be included as a GBM feature due to cardinality, sparsity, or new groups at prediction time.

When NOT to use: When group effects are small (ICC < 0.05), or when you have fewer than 3-5 observations per group — BLUPs collapse to the grand mean and Stage 2 adds no lift. If you have 10+ observations per group and no new-group problem, one-hot encoding is a reasonable substitute.

Databricks Notebook

A ready-to-run Databricks notebook benchmarking this library against standard approaches is available in burning-cost-examples.

Related libraries

Library Why it's relevant
insurance-credibility Bühlmann-Straub closed-form implementation — use this when you don't need a GBM Stage 1
shap-relativities Extract rating relativities from the Stage 1 CatBoost model
insurance-spatial BYM2 spatial random effects for territory — the geographic equivalent of broker random effects
insurance-cv Walk-forward cross-validation respecting IBNR structure, needed for validating the Stage 1 model

All Burning Cost libraries →

Read more

Your Broker Adjustments Are Guesswork — why ad hoc broker loadings fail and how REML random effects give you defensible, data-driven credibility weights.

Licence

MIT

About

Two-stage CatBoost + REML random effects for high-cardinality group factors in insurance pricing — Bühlmann-Straub credibility, ICC diagnostics

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages