Core concepts

PolicyEngine.py is a Python package for tax-benefit microsimulation analysis. It provides a unified interface for running policy simulations, analysing distributional impacts, and visualising results across different countries.

Architecture overview

The package is organised around several core concepts:

Tax-benefit models: Country-specific implementations (UK, US) that define tax and benefit rules
Datasets: Microdata representing populations at entity level (person, household, etc.)
Simulations: Execution environments that apply tax-benefit models to datasets
Outputs: Analysis tools for extracting insights from simulation results
Policies: Parametric reforms that modify tax-benefit system parameters

Tax-benefit models

Tax-benefit models define the rules and calculations for a country's tax and benefit system. Each model version contains:

Variables: Calculated values (e.g., income tax, universal credit)
Parameters: System settings (e.g., personal allowance, benefit rates)
Parameter values: Time-bound values for parameters
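
To make "time-bound" concrete, here is a minimal sketch in plain Python of how a dated value could be looked up for a given day. The `DatedValue` class and `value_on` helper are hypothetical illustrations, not part of the package, and the figures are illustrative only.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DatedValue:
    """Hypothetical stand-in for a time-bound parameter value."""
    start_date: date
    end_date: Optional[date]  # None means open-ended
    value: float


def value_on(values: list[DatedValue], day: date) -> Optional[float]:
    """Return the value whose date range contains `day`, if any."""
    for v in values:
        if v.start_date <= day and (v.end_date is None or day <= v.end_date):
            return v.value
    return None


# Illustrative figures only
allowance = [
    DatedValue(date(2025, 1, 1), date(2025, 12, 31), 12_570.0),
    DatedValue(date(2026, 1, 1), None, 15_000.0),
]

print(value_on(allowance, date(2026, 6, 1)))
```

A date that falls outside every range returns `None` here; how the real models treat such gaps is not covered by this sketch.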

Using a tax-benefit model

from policyengine.tax_benefit_models.uk import uk_latest
from policyengine.tax_benefit_models.us import us_latest

# UK model includes variables like:
# - income_tax, national_insurance, universal_credit
# - Parameters like personal allowance, NI thresholds

# US model includes variables like:
# - income_tax, payroll_tax, eitc, ctc, snap
# - Parameters like standard deduction, EITC rates

Datasets

Datasets contain microdata representing a population. Each dataset has:

Entity-level data: Separate dataframes for person, household, and other entities
Weights: Survey weights for population representation
Join keys: Relationships between entities (e.g., which household each person belongs to)
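
As a rough illustration of how join keys and weights work together, the sketch below uses plain Python lists instead of the package's dataframes. The column names follow the conventions used on this page; the values and weights are made up.

```python
# Toy person and household tables as lists of dicts (illustrative values)
persons = [
    {"person_id": 0, "person_household_id": 0, "employment_income": 30_000},
    {"person_id": 1, "person_household_id": 0, "employment_income": 0},
    {"person_id": 2, "person_household_id": 1, "employment_income": 50_000},
]
households = [
    {"household_id": 0, "household_weight": 1_200.0},
    {"household_id": 1, "household_weight": 800.0},
]

# Use the join key to total income within each household
income_by_household = {h["household_id"]: 0 for h in households}
for p in persons:
    income_by_household[p["person_household_id"]] += p["employment_income"]

# Each sampled household stands for `household_weight` real households
weighted_total = sum(
    income_by_household[h["household_id"]] * h["household_weight"]
    for h in households
)
print(weighted_total)
```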

Dataset structure

from policyengine.tax_benefit_models.uk import PolicyEngineUKDataset

dataset = PolicyEngineUKDataset(
    name="FRS 2023-24",
    description="Family Resources Survey microdata",
    filepath="./data/frs_2023_24_year_2026.h5",
    year=2026,
)

# Access entity-level data
person_data = dataset.data.person      # MicroDataFrame
household_data = dataset.data.household
benunit_data = dataset.data.benunit    # Benefit unit (UK only)

Creating custom datasets

You can create custom datasets for scenario analysis:

import pandas as pd
from microdf import MicroDataFrame
from policyengine.tax_benefit_models.uk import PolicyEngineUKDataset, UKYearData

# Create person data
person_df = MicroDataFrame(
    pd.DataFrame({
        "person_id": [0, 1, 2],
        "person_household_id": [0, 0, 1],
        "person_benunit_id": [0, 0, 1],
        "age": [35, 8, 40],
        "employment_income": [30000, 0, 50000],
        "person_weight": [1.0, 1.0, 1.0],
    }),
    weights="person_weight"
)

# Create household data
household_df = MicroDataFrame(
    pd.DataFrame({
        "household_id": [0, 1],
        "region": ["LONDON", "SOUTH_EAST"],
        "rent": [15000, 12000],
        "household_weight": [1.0, 1.0],
    }),
    weights="household_weight"
)

# Create benunit data
benunit_df = MicroDataFrame(
    pd.DataFrame({
        "benunit_id": [0, 1],
        "would_claim_uc": [True, True],
        "benunit_weight": [1.0, 1.0],
    }),
    weights="benunit_weight"
)

dataset = PolicyEngineUKDataset(
    name="Custom scenario",
    description="Single parent vs single adult",
    filepath="./custom.h5",
    year=2026,
    data=UKYearData(
        person=person_df,
        household=household_df,
        benunit=benunit_df,
    )
)

Data loading

Before running simulations, you need representative microdata. The package provides three functions for managing datasets:

ensure_datasets(): Load from disk if available, otherwise download and compute (recommended)
create_datasets(): Always download from HuggingFace and compute from scratch
load_datasets(): Load previously saved HDF5 files from disk

# US
from policyengine.tax_benefit_models.us import ensure_datasets

# First run: downloads from HuggingFace, computes variables, saves to ./data/
# Subsequent runs: loads from disk instantly
datasets = ensure_datasets(
    datasets=["hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"],
    years=[2026],
    data_folder="./data",
)
dataset = datasets["enhanced_cps_2024_2026"]

# UK
from policyengine.tax_benefit_models.uk import ensure_datasets

datasets = ensure_datasets(
    datasets=["hf://policyengine/policyengine-uk-data/enhanced_frs_2023_24.h5"],
    years=[2026],
    data_folder="./data",
)
dataset = datasets["enhanced_frs_2023_24_2026"]

All datasets are stored as HDF5 files on disk. No database server is required.
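
The load-or-create behaviour described for ensure_datasets() follows a common pattern, sketched below under simplifying assumptions: JSON stands in for HDF5, and `ensure_file` and `expensive_build` are hypothetical names, not package functions.

```python
import json
import tempfile
from pathlib import Path


def ensure_file(path: Path, build) -> dict:
    """Load `path` if it exists; otherwise call `build()`, save, and return."""
    if path.exists():
        return json.loads(path.read_text())
    data = build()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data))
    return data


calls = []


def expensive_build() -> dict:
    calls.append(1)  # track how often the slow path runs
    return {"year": 2026, "rows": 3}


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "example_2026.json"
    first = ensure_file(target, expensive_build)   # builds and saves
    second = ensure_file(target, expensive_build)  # loads from disk

print(len(calls))
```

The slow build step runs only on the first call; the second call reads the saved file.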

Simulations

Simulations apply tax-benefit models to datasets, calculating all variables for the specified year.

Running a simulation

from policyengine.core import Simulation
from policyengine.tax_benefit_models.uk import uk_latest

simulation = Simulation(
    dataset=dataset,
    tax_benefit_model_version=uk_latest,
)
simulation.run()

# Access output data
output_person = simulation.output_dataset.data.person
output_household = simulation.output_dataset.data.household

# Check calculated variables
print(output_household[["household_id", "household_net_income", "household_tax"]])

Simulation lifecycle: `run()` vs `ensure()`

The Simulation class provides two methods for computing results:

| Method | Behaviour |
| --- | --- |
| `simulation.run()` | Always recomputes from scratch. No caching. |
| `simulation.ensure()` | Checks in-memory LRU cache, then tries loading from disk, then falls back to `run()` + `save()`. |

# One-off computation (no caching)
simulation.run()

# Cache-or-compute (preferred for production use)
simulation.ensure()

ensure() uses a module-level LRU cache (max 100 simulations) and saves output datasets as HDF5 files alongside the input dataset. On repeated calls, it returns cached results instantly. For baseline-vs-reform comparisons, economic_impact_analysis() calls ensure() internally, so you rarely need to call it yourself.

Accessing calculated variables

After running a simulation, you can access the calculated variables from the output dataset:

simulation = Simulation(
    dataset=dataset,
    tax_benefit_model_version=uk_latest,
)
simulation.run()

# Access specific variables
output = simulation.output_dataset.data
person_data = output.person[["person_id", "age", "employment_income", "income_tax"]]
household_data = output.household[["household_id", "household_net_income"]]
benunit_data = output.benunit[["benunit_id", "universal_credit", "child_benefit"]]

Policies

Policies modify tax-benefit system parameters through parametric reforms.

Creating a policy

from policyengine.core import Policy, Parameter, ParameterValue
import datetime

# Define parameter to modify
parameter = Parameter(
    name="gov.hmrc.income_tax.allowances.personal_allowance.amount",
    tax_benefit_model_version=uk_latest,
    description="Personal allowance for income tax",
    data_type=float,
)

# Set new value
parameter_value = ParameterValue(
    parameter=parameter,
    start_date=datetime.date(2026, 1, 1),
    end_date=datetime.date(2026, 12, 31),
    value=15000,  # Increase from ~£12,570 to £15,000
)

policy = Policy(
    name="Increased personal allowance",
    description="Raises personal allowance to £15,000",
    parameter_values=[parameter_value],
)

Running a reform simulation

# Baseline simulation
baseline = Simulation(
    dataset=dataset,
    tax_benefit_model_version=uk_latest,
)
baseline.run()

# Reform simulation
reform = Simulation(
    dataset=dataset,
    tax_benefit_model_version=uk_latest,
    policy=policy,
)
reform.run()

Combining policies

Policies can be combined using the + operator:

combined = policy_a + policy_b
# Concatenates parameter_values and chains simulation_modifiers
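
One way such a `+` operator could behave is sketched below. `SketchPolicy` is a hypothetical stand-in, not the package's Policy class; the left-then-right order of the chained modifiers is an assumption, and a plain list plays the role of the simulation object.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class SketchPolicy:
    """Hypothetical stand-in; not the package's Policy class."""
    name: str
    parameter_values: list = field(default_factory=list)
    simulation_modifier: Optional[Callable] = None

    def __add__(self, other: "SketchPolicy") -> "SketchPolicy":
        first, second = self.simulation_modifier, other.simulation_modifier

        def chained(sim):
            # Apply this policy's modifier, then the other's
            if first is not None:
                sim = first(sim)
            if second is not None:
                sim = second(sim)
            return sim

        return SketchPolicy(
            name=f"{self.name} + {other.name}",
            parameter_values=self.parameter_values + other.parameter_values,
            simulation_modifier=chained if (first or second) else None,
        )


a = SketchPolicy("A", ["pa=15000"], lambda sim: sim + ["a"])
b = SketchPolicy("B", ["ni=0.10"], lambda sim: sim + ["b"])
combined = a + b
print(combined.parameter_values, combined.simulation_modifier([]))
```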

Simulation modifiers

For reforms that cannot be expressed as parameter value changes, Policy accepts a simulation_modifier callable that directly manipulates the underlying policyengine_core simulation:

def my_modifier(sim):
    """Custom reform logic applied to the core simulation object."""
    p = sim.tax_benefit_system.parameters
    # Modify parameters programmatically
    return sim

policy = Policy(
    name="Custom reform",
    simulation_modifier=my_modifier,
)

Note: the UK model supports simulation_modifier. The US model currently only uses the parameter_values path.

Dynamic behavioural responses

The Dynamic class is structurally identical to Policy and represents behavioural responses to policy changes (e.g., labour supply elasticities). It is applied after the policy in the simulation pipeline.

from policyengine.core.dynamic import Dynamic

dynamic = Dynamic(
    name="Labour supply response",
    parameter_values=[...],  # Same format as Policy
)

simulation = Simulation(
    dataset=dataset,
    tax_benefit_model_version=uk_latest,
    policy=policy,
    dynamic=dynamic,
)

Dynamic responses can also be combined using the + operator and support simulation_modifier callables.

Outputs

Output classes provide structured analysis of simulation results.

Aggregate

Calculate aggregate statistics (sum, mean, count) for any variable:

from policyengine.outputs.aggregate import Aggregate, AggregateType

# Total universal credit spending
agg = Aggregate(
    simulation=simulation,
    variable="universal_credit",
    aggregate_type=AggregateType.SUM,
    entity="benunit",  # Map to benunit level
)
agg.run()
print(f"Total UC spending: £{agg.result / 1e9:.1f}bn")

# Mean household income in top decile
agg = Aggregate(
    simulation=simulation,
    variable="household_net_income",
    aggregate_type=AggregateType.MEAN,
    filter_variable="household_net_income",
    quantile=10,     # Split into 10 quantiles (deciles)
    quantile_eq=10,  # Keep only the 10th (top) decile
)
agg.run()
print(f"Mean income in top decile: £{agg.result:,.0f}")

ChangeAggregate

Analyse impacts of policy reforms:

from policyengine.outputs.change_aggregate import ChangeAggregate, ChangeAggregateType

# Count winners and losers
winners = ChangeAggregate(
    baseline_simulation=baseline,
    reform_simulation=reform,
    variable="household_net_income",
    aggregate_type=ChangeAggregateType.COUNT,
    change_geq=1,  # Gain at least £1
)
winners.run()
print(f"Winners: {winners.result / 1e6:.1f}m households")

losers = ChangeAggregate(
    baseline_simulation=baseline,
    reform_simulation=reform,
    variable="household_net_income",
    aggregate_type=ChangeAggregateType.COUNT,
    change_leq=-1,  # Lose at least £1
)
losers.run()
print(f"Losers: {losers.result / 1e6:.1f}m households")

# Revenue impact
revenue = ChangeAggregate(
    baseline_simulation=baseline,
    reform_simulation=reform,
    variable="household_tax",
    aggregate_type=ChangeAggregateType.SUM,
)
revenue.run()
print(f"Revenue change: £{revenue.result / 1e9:.1f}bn")
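
To show what these thresholds compute, here is a minimal sketch of a weighted winners/losers count in plain Python. The figures are invented and the helper `weighted_count` is hypothetical; the package's own calculation may differ in detail.

```python
# Toy household results: (baseline net income, reform net income, weight)
households = [
    (20_000.0, 20_500.0, 1_000.0),  # gains £500
    (35_000.0, 35_000.0, 2_000.0),  # unchanged
    (60_000.0, 59_200.0, 500.0),    # loses £800
    (15_000.0, 15_000.5, 1_500.0),  # gains 50p, below the £1 threshold
]


def weighted_count(rows, keep) -> float:
    """Sum the weights of rows whose change satisfies `keep`."""
    return sum(w for base, reform, w in rows if keep(reform - base))


n_winners = weighted_count(households, lambda change: change >= 1)
n_losers = weighted_count(households, lambda change: change <= -1)
net_change = sum((reform - base) * w for base, reform, w in households)

print(n_winners, n_losers, net_change)
```

Households whose change is smaller than £1 in either direction count as neither winners nor losers, but still contribute to the weighted total change.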

Entity mapping

The package automatically handles entity mapping when variables are defined at different entity levels.

Entity hierarchy

UK:

household
    └── benunit (benefit unit)
            └── person

US:

household
    ├── tax_unit
    ├── spm_unit
    ├── family
    └── marital_unit
            └── person

Automatic mapping

When you request a person-level variable (like ssi) at household level, the package:

Sums person-level values within each household (aggregation)
Returns household-level data with proper weights

# SSI is defined at person level, but we want household-level totals
agg = Aggregate(
    simulation=simulation,
    variable="ssi",  # Person-level variable
    entity="household",  # Target household level
    aggregate_type=AggregateType.SUM,
)
# Internally maps person → household by summing SSI for all persons in each household

When you request a household-level variable at person level, the package replicates the household value to every person in that household (expansion).
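
Both directions can be sketched with plain dictionaries. This only illustrates the idea of aggregation and expansion; it is not how the package implements the mapping, and the values are made up.

```python
from collections import defaultdict

# person_id -> (household_id, ssi), illustrative values
persons = {0: (0, 0.0), 1: (0, 9_000.0), 2: (1, 4_500.0), 3: (1, 4_500.0)}
household_rent = {0: 15_000.0, 1: 12_000.0}

# Aggregation: person -> household by summing within each household
household_ssi = defaultdict(float)
for household_id, ssi in persons.values():
    household_ssi[household_id] += ssi

# Expansion: household -> person by copying the household value to each member
person_rent = {pid: household_rent[hh] for pid, (hh, _) in persons.items()}

print(dict(household_ssi), person_rent)
```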

Direct entity mapping

You can also map data between entities directly using the map_to_entity method:

# Map person income to household level (sum)
household_income = dataset.data.map_to_entity(
    source_entity="person",
    target_entity="household",
    columns=["employment_income"],
    how="sum"
)

# Map household rent to person level (project/broadcast)
person_rent = dataset.data.map_to_entity(
    source_entity="household",
    target_entity="person",
    columns=["rent"],
    how="project"
)

Mapping with custom values

You can map custom value arrays instead of existing columns:

# Map custom per-person values to household level
import numpy as np

# Create custom values (e.g., imputed data), one per person row in the dataset
custom_values = np.array([100, 200, 150, 300])

household_totals = dataset.data.map_to_entity(
    source_entity="person",
    target_entity="household",
    values=custom_values,
    how="sum"
)

Aggregation methods

The how parameter controls how values are mapped:

Person → Group (aggregation):

how='sum' (default): Sum values within each group
how='first': Take first person's value in each group

# `data` is shorthand for dataset.data in the examples below
data = dataset.data

# Sum person incomes to household level
household_income = data.map_to_entity(
    source_entity="person",
    target_entity="household",
    columns=["employment_income"],
    how="sum"
)

# Take first person's age as household reference
household_age = data.map_to_entity(
    source_entity="person",
    target_entity="household",
    columns=["age"],
    how="first"
)

Group → Person (expansion):

how='project' (default): Broadcast group value to all members
how='divide': Split group value equally among members

# Broadcast household rent to each person
person_rent = data.map_to_entity(
    source_entity="household",
    target_entity="person",
    columns=["rent"],
    how="project"
)

# Split household savings equally per person
person_savings = data.map_to_entity(
    source_entity="household",
    target_entity="person",
    columns=["total_savings"],
    how="divide"
)
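
For intuition, the less obvious options 'first' and 'divide' can be written out by hand. The sketch assumes that "first" means the first person in row order and that "divide" is an equal split; the data are invented.

```python
# (person_id, household_id, age) rows in dataset order; illustrative values
persons = [(0, 0, 35), (1, 0, 8), (2, 1, 40)]
household_savings = {0: 10_000.0, 1: 6_000.0}

# how="first": keep the first person's value seen for each household
first_age = {}
for _, household_id, age in persons:
    first_age.setdefault(household_id, age)

# how="divide": split each household value equally among its members
members = {}
for _, household_id, _ in persons:
    members[household_id] = members.get(household_id, 0) + 1
person_savings = {
    pid: household_savings[hh] / members[hh] for pid, hh, _ in persons
}

print(first_age, person_savings)
```

Unlike 'project', 'divide' preserves the household total when the person-level values are summed back up.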

Group → Group (via person entity):

how='sum' (default): Sum through person entity
how='first': Take first source group's value
how='project': Broadcast first source group's value
how='divide': Split proportionally based on person counts

# UK: Sum benunit benefits to household level
household_benefits = data.map_to_entity(
    source_entity="benunit",
    target_entity="household",
    columns=["universal_credit"],
    how="sum"
)

# US: Map tax unit income to household, splitting by members
household_from_tax = data.map_to_entity(
    source_entity="tax_unit",
    target_entity="household",
    columns=["taxable_income"],
    how="divide"
)
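
Group-to-group mapping relies on person rows carrying the join keys for both entities. The sketch below shows one possible way to sum benefit-unit values to households through that link; it assumes each benefit unit sits in exactly one household and is not necessarily how the package does it.

```python
# Person rows carry both join keys: (person_id, benunit_id, household_id)
persons = [(0, 0, 0), (1, 0, 0), (2, 1, 0), (3, 2, 1)]
benunit_uc = {0: 4_000.0, 1: 2_500.0, 2: 0.0}

# Derive the benunit -> household link from the person rows
benunit_household = {}
for _, benunit_id, household_id in persons:
    benunit_household[benunit_id] = household_id

# Sum each benunit's value into its household exactly once
household_uc = {}
for benunit_id, value in benunit_uc.items():
    household_id = benunit_household[benunit_id]
    household_uc[household_id] = household_uc.get(household_id, 0.0) + value

print(household_uc)
```

Going through the derived link, rather than summing over person rows directly, avoids counting a benefit unit's value once per member.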

Visualisation

The package includes utilities for creating PolicyEngine-branded visualisations:

from policyengine.utils.plotting import format_fig, COLORS
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(x=[1, 2, 3], y=[4, 5, 6]))

format_fig(
    fig,
    title="My chart",
    xaxis_title="X axis",
    yaxis_title="Y axis",
    height=600,
    width=800,
)
fig.show()

Brand colours

COLORS = {
    "primary": "#319795",        # Teal
    "success": "#22C55E",        # Green
    "warning": "#FEC601",        # Yellow
    "error": "#EF4444",          # Red
    "info": "#1890FF",           # Blue
    "blue_secondary": "#026AA2", # Dark blue
    "gray": "#667085",           # Gray
}

Common workflows

1. Analyse employment income variation

See UK employment income variation for a complete example of:

Creating custom datasets with varied parameters
Running single simulations
Extracting results with filters
Visualising benefit phase-outs

2. Policy reform analysis

See UK policy reform analysis for:

Applying parametric reforms
Comparing baseline and reform
Analysing winners/losers by decile
Calculating revenue impacts

3. Distributional analysis

See US income distribution for:

Loading representative microdata
Calculating statistics by income decile
Mapping variables across entity levels
Creating interactive visualisations

Best practices

Creating custom datasets

Always set would_claim variables: Benefits won't be claimed unless explicitly enabled
```
"would_claim_uc": [True] * n_households
```

Set disability variables explicitly: Prevents random UC spikes from the LCWRA (limited capability for work and work-related activity) element

"is_disabled_for_benefits": [False] * n_people
"uc_limited_capability_for_WRA": [False] * n_people

Include required join keys: Person data needs entity membership

"person_household_id": household_ids
"person_benunit_id": benunit_ids  # UK only

Set required household fields: Vary by country

# UK
"region": ["LONDON"] * n_households
"tenure_type": ["RENT_PRIVATELY"] * n_households

# US
"state_code": ["CA"] * n_households

Performance optimisation

Single simulation for variations: Create all scenarios in one dataset, run once
Custom variable selection: Only calculate needed variables
Filter efficiently: Use quantile filters for decile analysis
Parallel analysis: Multiple Aggregate calls can run independently
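
The first tip, one dataset covering many scenarios, might look like the sketch below: one single-adult household per income level, built as plain column dictionaries. The column names follow the custom-dataset example earlier on this page; the income grid and other values are arbitrary choices.

```python
# Build one single-adult household per income level, so a single run
# covers the whole range
incomes = list(range(0, 100_001, 10_000))  # £0 to £100,000 in £10k steps
n = len(incomes)

person_columns = {
    "person_id": list(range(n)),
    "person_household_id": list(range(n)),
    "person_benunit_id": list(range(n)),
    "age": [35] * n,
    "employment_income": incomes,
    "person_weight": [1.0] * n,
}
household_columns = {
    "household_id": list(range(n)),
    "region": ["LONDON"] * n,
    "household_weight": [1.0] * n,
}
benunit_columns = {
    "benunit_id": list(range(n)),
    "would_claim_uc": [True] * n,
    "benunit_weight": [1.0] * n,
}

print(n, person_columns["employment_income"][-1])
```

Each dictionary can then be passed to `pd.DataFrame` and wrapped as in the custom-dataset example, so that one simulation run produces results for every income level.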

Data integrity

Check weights: Ensure weights sum to expected population
Validate join keys: All persons should link to valid households
Review output ranges: Check calculated values are reasonable
Test edge cases: Zero income, high income, disabled, elderly
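
The first two checks are easy to automate. Below is a minimal sketch using plain lists of dicts; `check_dataset` is a hypothetical helper and the 1% tolerance is an arbitrary choice.

```python
def check_dataset(persons, households, expected_population, tolerance=0.01):
    """Return a list of problems found; an empty list means all checks pass."""
    problems = []

    # Validate join keys: every person must point at a known household
    household_ids = {h["household_id"] for h in households}
    orphans = [
        p["person_id"] for p in persons
        if p["person_household_id"] not in household_ids
    ]
    if orphans:
        problems.append(f"persons with unknown household: {orphans}")

    # Check weights: person weights should sum close to the expected population
    total = sum(p["person_weight"] for p in persons)
    if abs(total - expected_population) > tolerance * expected_population:
        problems.append(
            f"person weights sum to {total}, expected {expected_population}"
        )

    return problems


persons = [
    {"person_id": 0, "person_household_id": 0, "person_weight": 1.0},
    {"person_id": 1, "person_household_id": 0, "person_weight": 1.0},
    {"person_id": 2, "person_household_id": 7, "person_weight": 1.0},  # bad key
]
households = [{"household_id": 0}, {"household_id": 1}]

print(check_dataset(persons, households, expected_population=3))
```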

Next steps

Economic impact analysis: Full baseline-vs-reform comparison workflow
Advanced outputs: DecileImpact, Poverty, Inequality, IntraDecileImpact
Regions and scoping: Sub-national analysis (states, constituencies, districts)
Country-specific documentation:
- UK tax-benefit model
- US tax-benefit model
Visualisation: Publication-ready charts
Examples: Complete working scripts
