End-to-end insurance risk scoring pipeline using actuarial frequency–severity modeling.
Motor insurance providers must quantify policy-level risk to support underwriting, pricing, and portfolio risk segmentation decisions. The objective of this project is to build an interpretable policy-level insurance risk scoring pipeline that estimates expected annual loss per policy using historical claims data.
The resulting risk scores enable insurers to:
- Differentiate high-risk vs low-risk policies
- Support pricing and underwriting decisions
- Perform portfolio-level risk monitoring and backtesting
The project uses motor insurance policy and claims data, including:
- Policy attributes (driver, vehicle, region)
- Exposure (policy duration)
- Individual claim records with claim amounts
Claims were aggregated at the policy level to construct:
- Claim frequency: number of claims per unit exposure
- Claim severity: total claim cost, conditional on at least one claim
Exposure-adjusted targets ensure comparability across policies with different coverage durations.
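The aggregation above can be sketched as follows. This is a minimal illustration with hypothetical column names (`policy_id`, `exposure`, `claim_amount`), not the project's actual schema:

```python
import numpy as np
import pandas as pd

# Hypothetical policy and claim tables; column names are illustrative.
policies = pd.DataFrame({
    "policy_id": [1, 2, 3],
    "exposure": [1.0, 0.5, 1.0],   # policy-years in force
})
claims = pd.DataFrame({
    "policy_id": [1, 1, 3],
    "claim_amount": [1200.0, 800.0, 3000.0],
})

# Aggregate individual claims to the policy level.
agg = claims.groupby("policy_id").agg(
    n_claims=("claim_amount", "size"),
    total_cost=("claim_amount", "sum"),
).reset_index()

df = policies.merge(agg, on="policy_id", how="left").fillna(
    {"n_claims": 0, "total_cost": 0.0}
)

# Exposure-adjusted frequency; severity defined only where a claim occurred.
df["frequency"] = df["n_claims"] / df["exposure"]
df["severity"] = np.where(
    df["n_claims"] > 0, df["total_cost"] / df["n_claims"], np.nan
)
```

Dividing claim counts by exposure makes a six-month policy with one claim comparable to a full-year policy with two.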
Note on Data Availability
Due to size constraints, raw datasets are not included in this repository.
The data used in this project can be obtained from the original public source
and placed in the data/ directory to reproduce results.
Highly skewed claim count distribution, motivating Poisson modeling for claim frequency.
Right-skewed claim severity distribution motivating log-scale severity modeling.
The modeling framework follows industry-standard actuarial frequency–severity decomposition.
- Poisson Generalized Linear Model (GLM)
- Target: claim count per policy
- Offset: log(exposure)
- Purpose: estimate how often claims occur
Key assumptions such as exposure normalization and equidispersion were explicitly considered.
Calibration of predicted vs actual claim frequency across deciles.
- Log-linked regression on positive claim amounts
- Trained only on policies with at least one claim
- Target: total claim amount (conditional severity)
- Purpose: estimate expected claim cost when a claim occurs
This conditional approach avoids zero-inflation bias and aligns with actuarial best practices.
- Expected Annual Loss = Predicted Frequency × Predicted Severity
- Expected losses normalized to a 0–100 risk score
- Policies grouped into Low / Medium / High risk bands
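The scoring step above can be sketched as follows; the min–max scaling and tercile bands are one simple choice of normalization, and the inputs here are synthetic stand-ins for the two models' predictions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 1000

# Hypothetical model outputs per policy.
df = pd.DataFrame({
    "pred_freq": rng.gamma(2.0, 0.05, n),       # claims per year
    "pred_severity": rng.gamma(2.0, 900.0, n),  # cost per claim
})

# Expected annual loss = frequency x severity.
df["expected_loss"] = df["pred_freq"] * df["pred_severity"]

# Min-max normalize expected loss to a 0-100 risk score.
lo, hi = df["expected_loss"].min(), df["expected_loss"].max()
df["risk_score"] = 100 * (df["expected_loss"] - lo) / (hi - lo)

# Tercile-based bands; the project's actual cut points may differ.
df["risk_band"] = pd.qcut(
    df["risk_score"], 3, labels=["Low", "Medium", "High"]
)
```

Quantile-based bands guarantee each band is populated; fixed monetary thresholds are an alternative when bands must be stable across portfolio vintages.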
Monotonic increase in realized losses across predicted risk deciles, confirming ranking effectiveness.
- Policy ID
- Predicted claim frequency
- Predicted claim severity
- Expected annual loss
- Risk score (0–100)
- Risk band (Low / Medium / High)
Used directly for underwriting and pricing workflows.
- Expected vs actual losses
- Risk deciles
- Loss ratios by decile
Used for portfolio backtesting and model validation.
Column definitions for both output files are documented within the outputs/ directory.
- Claim frequency is the dominant driver of total policy risk.
- Severity is strongly influenced by vehicle characteristics and regional factors.
- High-frequency, moderate-severity policies contribute disproportionately to portfolio loss volatility.
- Risk decile analysis shows clear monotonic separation between predicted risk and realized losses.
Model performance was evaluated using:
- Poisson deviance and calibration checks for frequency
- Error analysis and decile stability for severity
- Portfolio-level decile backtesting comparing expected vs actual losses
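The decile backtest can be sketched as below, using synthetic expected and realized losses in place of the project's model outputs and claims history:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 5000

# Hypothetical per-policy expected losses and noisy realizations.
expected = rng.gamma(2.0, 500.0, n)
actual = expected * rng.lognormal(0.0, 0.5, n)
df = pd.DataFrame({"expected": expected, "actual": actual})

# Rank policies into deciles by predicted (expected) loss.
df["decile"] = pd.qcut(df["expected"], 10, labels=False) + 1

# Compare expected vs actual losses within each decile.
backtest = df.groupby("decile").agg(
    expected_loss=("expected", "sum"),
    actual_loss=("actual", "sum"),
)
backtest["loss_ratio"] = backtest["actual_loss"] / backtest["expected_loss"]
```

A well-ranked model shows realized losses rising monotonically from decile 1 to decile 10 and loss ratios that stay roughly flat across deciles.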
Directional alignment across deciles confirms the model’s ranking effectiveness, which is critical for underwriting and pricing decisions.
- Assumes historical claim patterns remain stable over time
- Does not explicitly model inflation or repair cost escalation
- Assumes claim independence and Poisson equidispersion
- Limited to available features; fraud indicators not included
- Results are specific to the geographic scope of the data
- Python (pandas, numpy)
- Statsmodels (Poisson GLM, log-linked regression)
- Exposure-adjusted actuarial modeling
- Portfolio decile analysis and backtesting
- CSV-based production outputs
To reproduce the results of this project:
- Install the dependencies listed in requirements.txt
- Download the original public dataset and place it in a local data/ directory
- Run INSURANCE_RISK_SCORING.ipynb end to end
Raw datasets are intentionally excluded from version control due to size considerations.
This project demonstrates a production-style, interpretable insurance risk scoring pipeline aligned with actuarial best practices. By combining frequency–severity modeling with portfolio-level validation, the resulting risk scores are suitable for real-world underwriting, pricing, and risk management applications.