Improving The Scalability & Performance Of Data Collection Pipelines #364
eveleighoj asked this question in Statement Of Work
For background on the programme, please see https://www.planning.data.gov.uk/about/
Problem Statement
A major problem we are currently facing is within the pipelines that deliver data providers' data onto the platform. Specifically, there are two main areas:
Scalability - for large datasets we are hitting limitations due to the quantity of data contained across the resources. This initially presented as processing taking too long (hitting 12 hours and timing out). We implemented a much speedier solution using DuckDB, but it is hitting the memory limits of the machines. We know this isn't the largest dataset and it will continue growing, so we need to be assured that the technology can scale appropriately. A specific example is title-boundary.
Performance - while performance also helps with scalability, it is a key area in its own right. Our collection pipelines range from under 7 minutes to over 3 hours. There is huge scope for performance increases across the whole pipeline, but we want to aim to reduce all processing down to minutes rather than hours. This enables data engineers and data managers to get feedback in a much faster timeframe.
Background
Data collection pipelines happen as part of our batch process. These are either run overnight or triggered manually via Airflow. The diagram of the collection pipelines is:
For more information on our data collection pipelines, you can examine the architecture and infrastructure section of our technical documentation, which contains a lot of information about our project. Some specific pages for this work:
The key processes of the pipeline which need optimising are the collect, transform, assemble and package processes.
Functional Requirements
Non-Functional Requirements
Constraints
Previous attempts
We have been looking at tackling this problem for a while and have tried and implemented several solutions that have helped, but not permanently solved the problem. These were:
Areas for Investigation
Scalability
**Improve the assemble process** - The assemble process is where we load and aggregate transformed resources into our provenance model. This is likely to grow a huge amount going forward, as each resource we download is another resource that might need reprocessing. The SQLite files are getting large for large datasets. SQLite may not be the format we choose to keep moving forward; we may instead move to Parquet or a data lake table format where Spark can be used to process the data.
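To illustrate how the assemble step's memory use can be kept bounded even before any move to Parquet or Spark, the sketch below streams transformed rows into the assembled SQLite database in fixed-size batches instead of materialising whole resources in memory. The table name, columns, and `assemble` function are hypothetical, not the real pipeline's schema:

```python
import sqlite3

# Batch size is an arbitrary illustrative value, not a tuned figure.
BATCH_SIZE = 10_000

def assemble(db_path, rows):
    """Append transformed rows into a simple fact table, batch by batch,
    so memory stays bounded regardless of resource size."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS fact (fact TEXT PRIMARY KEY, entity TEXT, value TEXT)"
    )
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            con.executemany("INSERT OR REPLACE INTO fact VALUES (?, ?, ?)", batch)
            batch.clear()
    if batch:
        con.executemany("INSERT OR REPLACE INTO fact VALUES (?, ?, ?)", batch)
    con.commit()
    con.close()
```

Because `rows` can be any iterator (e.g. one reading a transformed CSV line by line), peak memory is roughly one batch rather than one resource.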
**Use multiple machines for the transform process** - currently one machine is used to process all resources, but since each result is passed independently onto the assemble phase, this could easily be done across multiple machines. Then, no matter how many resources need processing, the load can easily be handled. This would also fix any scalability issues arising from limited disk space on a single machine.
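Because each resource is transformed independently, the fan-out is embarrassingly parallel. A minimal local sketch of that shape, using a thread pool as a stand-in for separate machines or containers (the `transform` function here is a placeholder, not the real pipeline code):

```python
from concurrent.futures import ThreadPoolExecutor

def transform(resource):
    """Placeholder for transforming one downloaded resource."""
    return f"transformed:{resource}"

def transform_all(resources, workers=4):
    """Run transforms concurrently; results come back in input order,
    ready to be handed to the assemble phase."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transform, resources))
```

In production the same shape could map onto separate Airflow tasks or machines per resource (or per batch of resources), with only the transformed outputs collected for assembly.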
Performance
**Incremental loading** - by reducing the number of resources that need processing, and by switching to an assemble phase that can simply update rather than recreate everything, we could reduce the amount of time to process significantly.
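One possible shape for an "update rather than recreate" assemble step is an upsert: only facts from newly changed resources are written, and existing rows are updated in place. A minimal sketch using SQLite's `ON CONFLICT` upsert (requires SQLite 3.24+; schema and names are illustrative):

```python
import sqlite3

def apply_increment(con, new_facts):
    """Upsert (fact, value) pairs from newly changed resources only,
    leaving the rest of the database untouched."""
    con.executemany(
        """
        INSERT INTO fact (fact, value) VALUES (?, ?)
        ON CONFLICT(fact) DO UPDATE SET value = excluded.value
        """,
        new_facts,
    )
    con.commit()
```

The cost of a run then scales with the size of the increment rather than the size of the dataset, which is what makes the minutes-not-hours target plausible for large collections.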
**Improve the transform process** - each resource is independently transformed. While this processing isn't where the majority of time is spent, you could argue that using pandas, Polars, Spark, or even converting to Rust could give time savings when processing each file, especially for larger files. Most files are under 1 million rows, but there are some on the horizon that are around 14 million rows.
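For context on where those savings would come from, a per-resource transform written as a plain streaming pass keeps memory constant even on a 14-million-row file, but the per-row Python loop is exactly the CPU-bound part that pandas, Polars, Spark, or a Rust rewrite would accelerate. The field mapping below is a hypothetical stand-in, not the real pipeline's column rules:

```python
import csv
import io

def transform_csv(src, dst):
    """Stream one CSV resource row by row: constant memory,
    but row-at-a-time Python is the slow part for large files."""
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    writer.writerow(["reference", "name"])
    for row in reader:
        # Placeholder cleaning rules for illustration only.
        writer.writerow([row["ref"].strip(), row["name"].strip().title()])
```

A vectorised library would express the same mapping as whole-column operations, trading the interpreter loop for native code.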