This repository contains tools and scripts for harmonising and processing census data from the 2021 and 2022 UK Censuses. The process is described in detail in the paper:
The dataset is the first unified release covering all four UK nations at the smallest available geographic level: Output Areas in England, Wales, and Scotland, and Data Zones in Northern Ireland. The UK’s three census agencies—ONS (England & Wales), NRS (Scotland), and NISRA (Northern Ireland)—release their data separately, each with distinct variables, formats, and disclosure controls. Through a process of matching, standardisation, and aggregation, 190 comparable variables are produced. The dataset is made available as a series of topic tables indexed across all 239,023 of the UK’s small-area geographies. By providing a standardised dataset, this work enables seamless UK-wide analyses, facilitating cross-national comparisons and supporting research and public policy development.
    └── 📁UK_Census_Data_21_22/
        └── 📁data
            └── 📁output_data_set # The produced dataset, including unified Census tables for the UK and associated metadata
            └── 📁individual_country_census_data # Downloaded Census tables for England & Wales, Scotland and Northern Ireland
            └── 📁uk_census_data # Unified Census tables for the United Kingdom
            └── 📁uk_matching_output # Outputs from the manual matching process between countries
            └── 📁validation_plots # Plots validating the matching for each variable
        └── 📁src
            └── reproduce_ukdataset_creation.py # Script to fully reproduce the creation of the dataset
            └── download_census_data_1.py
            └── produce_uk_tables_2.py
            └── producevalidation_plots_3.py
            └── 📁census_download_scripts # Scripts for downloading census data from each country
            └── 📁utils
        └── README.md
        └── requirements.txt
- Clone the repository:

      git clone git@github.com:ogoodwin505/UK_Census_Data_21_22.git
      cd UK_Census_Data_21_22

- Install dependencies using pip and a virtual environment:

      python -m venv .venv
      source .venv/bin/activate  # On Windows, use `.venv\Scripts\activate`
      pip install -r requirements.txt
To reproduce the release dataset, run:

    python src/reproduce_ukdataset_creation.py

This will perform all stages of the process:
- Download the Census data from each country source.
- Unify that data into UK tables based on the variable matches in `data/uk_matching_output/VariableMatchLookup.csv`.
- Produce validation plots for each variable in the new dataset.

The dataset is produced in `unified_census_data_set`.
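As a rough illustration of the unification stage, the sketch below shows how a match lookup could drive the aggregation of country-specific variables into unified ones. The lookup layout (`unified_id`, `country`, `source_id`), the variable codes, and the counts are all hypothetical and chosen for illustration only; the real `VariableMatchLookup.csv` may be structured differently, and this is not the repository's actual code.

```python
import csv
import io
from collections import defaultdict

# Hypothetical lookup: each row maps one source variable to a unified variable.
# Column names and codes are assumptions, not the real VariableMatchLookup.csv schema.
LOOKUP_CSV = """unified_id,country,source_id
UK_AGE_0_15,EW,ew_age_0_4
UK_AGE_0_15,EW,ew_age_5_15
UK_AGE_0_15,SC,sc_age_0_15
"""


def load_lookup(text):
    """Return {country: {source_id: unified_id}} from lookup CSV text."""
    mapping = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text)):
        mapping[row["country"]][row["source_id"]] = row["unified_id"]
    return dict(mapping)


def unify_row(country, source_counts, lookup):
    """Sum the source counts of one area into unified variables."""
    unified = defaultdict(int)
    for source_id, count in source_counts.items():
        unified_id = lookup.get(country, {}).get(source_id)
        if unified_id is not None:  # unmatched source variables are skipped
            unified[unified_id] += count
    return dict(unified)


lookup = load_lookup(LOOKUP_CSV)
# Made-up counts for one England & Wales area and one Scottish area.
ew_area = unify_row("EW", {"ew_age_0_4": 12, "ew_age_5_15": 30}, lookup)
sc_area = unify_row("SC", {"sc_age_0_15": 41}, lookup)
print(ew_area)  # {'UK_AGE_0_15': 42}
print(sc_area)  # {'UK_AGE_0_15': 41}
```

In this sketch two England & Wales categories are summed to match a single Scottish category, which is one way differing category breakdowns can be made comparable.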
The final output of this code can be found in the `unified_census_data_set` directory. This is the released data product found at Figshare.
    📁unified_census_data_set
        └── 📁topic_tables
            └── 📁csv
            └── 📁parquet
        └── Table_Notes.csv
        └── Variable_Metadata.csv
There are 25 unified topic tables, available in both CSV and Parquet formats. `Table_Notes.csv` lists the table titles, with notes on any features of interest in the harmonisation process.
`Variable_Metadata.csv` contains the lookup between the variable IDs and the full variable descriptions.
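A minimal sketch of how the metadata lookup might be used to replace variable IDs with readable descriptions. The column names `variable_id` and `description`, and the sample rows, are assumptions made for illustration; check the header of the real `Variable_Metadata.csv` and adjust accordingly.

```python
import csv
import io

# Stand-in for Variable_Metadata.csv; the column names and rows are assumed.
SAMPLE_METADATA = """variable_id,description
V001,Example description one
V002,Example description two
"""


def read_variable_descriptions(handle, id_col="variable_id", desc_col="description"):
    """Build a {variable id: description} dict from an open CSV file handle."""
    return {row[id_col]: row[desc_col] for row in csv.DictReader(handle)}


def describe_columns(columns, descriptions):
    """Map column names to descriptions, leaving unknown names unchanged."""
    return [descriptions.get(name, name) for name in columns]


descriptions = read_variable_descriptions(io.StringIO(SAMPLE_METADATA))
labels = describe_columns(["geography_code", "V001", "V002"], descriptions)
print(labels)
```

With the real file, pass a handle opened with `open(path, newline="")` in place of the `io.StringIO` stand-in. Columns without a metadata entry, such as a geography code, are left as they are.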
`validation_plots` contains boxplots and histograms for each variable in the unified dataset. These show the normalised (divided by the table total) distributions of each variable, separated by country.
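The normalisation behind these plots can be sketched as follows. This example assumes the table total for an area is the sum of that area's variable counts; the dataset may instead carry an explicit total column. The rows and variable names are made up for illustration.

```python
from collections import defaultdict


def normalise_row(counts):
    """Divide each count by the row total; return zeros if the total is zero."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: value / total for name, value in counts.items()}


def shares_by_country(rows, variable):
    """Collect the normalised share of one variable, grouped by country."""
    grouped = defaultdict(list)
    for country, counts in rows:
        grouped[country].append(normalise_row(counts)[variable])
    return dict(grouped)


# Made-up rows of (country, counts per variable) for illustration only.
rows = [
    ("England", {"a": 25, "b": 75}),
    ("England", {"a": 10, "b": 30}),
    ("Scotland", {"a": 50, "b": 50}),
]
shares = shares_by_country(rows, "a")
print(shares)  # {'England': [0.25, 0.25], 'Scotland': [0.5]}
```

The per-country lists of shares are the kind of input a boxplot or histogram would be drawn from, which makes differences between countries visible regardless of area population.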
- ONS: Office for National Statistics (England & Wales)
- NRS: National Records of Scotland
- NISRA: Northern Ireland Statistics and Research Agency