
Insurance Risk Scoring using Frequency–Severity Modeling

End-to-end insurance risk scoring pipeline using actuarial frequency–severity modeling.

1. Business Problem

Motor insurance providers must quantify policy-level risk to support underwriting, pricing, and portfolio risk segmentation decisions. The objective of this project is to build an interpretable policy-level insurance risk scoring pipeline that estimates expected annual loss per policy using historical claims data.

The resulting risk scores enable insurers to:

  • Differentiate high-risk from low-risk policies
  • Support pricing and underwriting decisions
  • Perform portfolio-level risk monitoring and backtesting

2. Data Overview

The project uses motor insurance policy and claims data, including:

  • Policy attributes (driver, vehicle, region)
  • Exposure (policy duration)
  • Individual claim records with claim amounts

Claims were aggregated at the policy level to construct:

  • Claim frequency: number of claims per unit exposure
  • Claim severity: total claim cost, conditional on at least one claim

Exposure-adjusted targets ensure comparability across policies with different coverage durations.
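The aggregation step can be sketched as follows. This is a minimal illustration with a hypothetical schema (`policy_id`, `exposure`, `claim_amount` are assumed column names, not necessarily the project's actual ones):

```python
import pandas as pd

# Hypothetical policy table (exposure in policy-years) and claim records.
policies = pd.DataFrame({
    "policy_id": [1, 2, 3],
    "exposure": [1.0, 0.5, 2.0],
})
claims = pd.DataFrame({
    "policy_id": [1, 1, 3],
    "claim_amount": [1200.0, 800.0, 3000.0],
})

# Aggregate individual claims to the policy level.
agg = claims.groupby("policy_id").agg(
    claim_count=("claim_amount", "size"),
    total_claim_cost=("claim_amount", "sum"),
)
df = policies.merge(agg, on="policy_id", how="left").fillna(
    {"claim_count": 0, "total_claim_cost": 0.0}
)

# Exposure-adjusted frequency; severity is left undefined (NaN)
# for policies with no claims, since it is conditional on >= 1 claim.
df["frequency"] = df["claim_count"] / df["exposure"]
df["severity"] = df["total_claim_cost"].where(df["claim_count"] > 0)
```

Dividing the claim count by exposure is what makes a six-month policy comparable with a full-year one.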

Note on Data Availability
Due to size constraints, raw datasets are not included in this repository. The data used in this project can be obtained from the original public source and placed in the data/ directory to reproduce results.

3. Exploratory Analysis

Claim Frequency Distribution

Highly skewed claim count distribution, validating the use of Poisson modeling for claim frequency.

Claim Severity Distribution

Right-skewed claim severity distribution motivating log-scale severity modeling.

4. Methodology

The modeling framework follows industry-standard actuarial frequency–severity decomposition.

Claim Frequency Modeling

  • Poisson Generalized Linear Model (GLM)
  • Target: claim count per policy
  • Offset: log(exposure)
  • Purpose: estimate how often claims occur

Key assumptions such as exposure normalization and equidispersion were explicitly considered.

Frequency Model Calibration

Calibration of predicted vs actual claim frequency across deciles.

Claim Severity Modeling

  • Log-linked regression on positive claim amounts
  • Trained only on policies with at least one claim
  • Target: total claim amount (conditional severity)
  • Purpose: estimate expected claim cost when a claim occurs

This conditional approach avoids zero-inflation bias and aligns with actuarial best practices.

Risk Score Construction

  • Expected Annual Loss = Predicted Frequency × Predicted Severity
  • Expected losses normalized to a 0–100 risk score
  • Policies grouped into Low / Medium / High risk bands
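The score construction above can be sketched directly; min–max normalization and the band thresholds below are one simple choice, not necessarily the project's exact ones:

```python
import pandas as pd

# Hypothetical per-policy model outputs.
df = pd.DataFrame({
    "policy_id": [1, 2, 3, 4],
    "pred_frequency": [0.05, 0.20, 0.10, 0.40],
    "pred_severity": [2000.0, 1500.0, 4000.0, 3500.0],
})

# Expected Annual Loss = Predicted Frequency x Predicted Severity.
df["expected_loss"] = df["pred_frequency"] * df["pred_severity"]

# Min-max normalization to a 0-100 risk score (one simple choice).
lo, hi = df["expected_loss"].min(), df["expected_loss"].max()
df["risk_score"] = 100 * (df["expected_loss"] - lo) / (hi - lo)

# Band thresholds are illustrative assumptions.
df["risk_band"] = pd.cut(
    df["risk_score"], bins=[-0.01, 33, 66, 100],
    labels=["Low", "Medium", "High"],
)
```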

Risk Decile Validation

Monotonic increase in realized losses across predicted risk deciles, confirming ranking effectiveness.
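The decile check can be sketched as below, on simulated expected and realized losses (the noise model is an assumption for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 1000
# Hypothetical predicted expected losses and noisy realized losses.
expected = rng.gamma(shape=2.0, scale=500.0, size=n)
actual = expected * rng.lognormal(mean=0.0, sigma=0.5, size=n)
df = pd.DataFrame({"expected_loss": expected, "actual_loss": actual})

# Group policies into deciles of predicted risk and compare means.
df["decile"] = pd.qcut(df["expected_loss"], 10, labels=False)
backtest = df.groupby("decile").agg(
    mean_expected=("expected_loss", "mean"),
    mean_actual=("actual_loss", "mean"),
)
backtest["loss_ratio"] = backtest["mean_actual"] / backtest["mean_expected"]
```

If the model ranks well, mean realized losses rise across deciles and the loss ratios stay close to 1.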

5. Outputs

Business Output — insurance_risk_scores.csv

  • Policy ID
  • Predicted claim frequency
  • Predicted claim severity
  • Expected annual loss
  • Risk score (0–100)
  • Risk band (Low / Medium / High)

Used directly for underwriting and pricing workflows.

Validation Output — insurance_risk_validation.csv

  • Expected vs actual losses
  • Risk deciles
  • Loss ratios by decile

Used for portfolio backtesting and model validation.

Column definitions for both output files are documented within the outputs/ directory.

6. Key Insights

  • Claim frequency is the dominant driver of total policy risk.
  • Severity is strongly influenced by vehicle characteristics and regional factors.
  • High-frequency, moderate-severity policies contribute disproportionately to portfolio loss volatility.
  • Risk decile analysis shows clear monotonic separation between predicted risk and realized losses.

7. Model Validation

Model performance was evaluated using:

  • Poisson deviance and calibration checks for frequency
  • Error analysis and decile stability for severity
  • Portfolio-level decile backtesting comparing expected vs actual losses

Directional alignment across deciles confirms the model’s ranking effectiveness, which is critical for underwriting and pricing decisions.
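The Poisson deviance used in the frequency checks follows the standard definition; a small self-contained helper (the function name is ours, not from the project code):

```python
import numpy as np

def poisson_deviance(y, mu):
    """Total Poisson deviance: 2 * sum(y*log(y/mu) - (y - mu)).

    Lower is better; the y*log(y/mu) term is defined as 0 when y == 0.
    """
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(y > 0, y * np.log(y / mu), 0.0)
    return 2.0 * np.sum(term - (y - mu))
```

Perfect predictions give a deviance of zero, so the statistic directly measures frequency miscalibration.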

8. Limitations

  • Assumes historical claim patterns remain stable over time
  • Does not explicitly model inflation or repair cost escalation
  • Assumes claim independence and Poisson equidispersion
  • Limited to available features; fraud indicators not included
  • Results are specific to the geographic scope of the data

9. Tech Stack

  • Python (pandas, numpy)
  • Statsmodels (Poisson GLM, log-linked regression)
  • Exposure-adjusted actuarial modeling
  • Portfolio decile analysis and backtesting
  • CSV-based production outputs

9.1 Reproducibility

To reproduce the results of this project:

  1. Install dependencies listed in requirements.txt
  2. Download the original public dataset and place it in a local data/ directory
  3. Run INSURANCE_RISK_SCORING.ipynb end to end

Raw datasets are intentionally excluded from version control due to size considerations.

10. Conclusion

This project demonstrates a production-style, interpretable insurance risk scoring pipeline aligned with actuarial best practices. By combining frequency–severity modeling with portfolio-level validation, the resulting risk scores are suitable for real-world underwriting, pricing, and risk management applications.
