Improving The Scalability & Performance Of Data Collection Pipelines #364
eveleighoj asked this question in Statement Of Work
For background on the programme, please see https://www.planning.data.gov.uk/about/
Problem Statement
A major problem we are currently facing is within the pipelines that deliver data providers' data onto the platform. Specifically, there are two main areas:
Scalability - for large datasets we are hitting limitations due to the quantity of data contained across the resources. This initially presented as processing taking too long (hitting 12 hours and timing out). We implemented a much speedier solution using DuckDB, but it is hitting the memory limits of the machines. We know this isn't the largest dataset and it will continue growing, so we need to be assured that the technology can scale appropriately. A specific example is title-boundary.
Performance - while performance also helps with scalability, it is a key area in its own right. Our collection pipelines range from under 7 minutes to over 3 hours. There is huge scope for performance increases across the whole pipeline, but we want to aim to reduce all processing down to minutes rather than hours. This enables data engineers and data managers to get feedback in a much faster timeframe.
Background
Data collection pipelines happen as part of our batch process. These are either run overnight or triggered manually via Airflow. The diagram of the collection pipelines is:
For more information on our data collection pipelines, you can examine the architecture and infrastructure section of our technical documentation, which contains a lot of information about our project. Some specific pages for this work:
The key processes of the pipeline which need optimising are the collect, transform, assemble and package processes.
Functional Requirements
Non-Functional Requirements
Constraints
Previous attempts
We have been looking at tackling this problem for a while and have tried and implemented several solutions that have helped, but not permanently solved the problem. These were:
Areas for Investigation
Scalability
**Improve the assemble process** - The assemble process is where we load and aggregate transformed resources into our provenance model. This is likely to grow a huge amount going forward, as each resource we download is another resource that might need reprocessing. The SQLite files are getting large for large datasets. SQLite may not be the format we choose to keep moving forward; we may instead move to Parquet or a data lake table format where Spark can be used to process the data.
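To illustrate how the assemble step's memory use can be kept bounded even before any move to Parquet or Spark, the sketch below streams transformed rows into the assembled SQLite database in fixed-size batches instead of materialising whole resources in memory. The table name, columns, and `assemble` function are hypothetical, not the real pipeline's schema:

```python
import sqlite3

# Batch size is an arbitrary illustrative value, not a tuned figure.
BATCH_SIZE = 10_000

def assemble(db_path, rows):
    """Append transformed rows into a simple fact table, batch by batch,
    so memory stays bounded regardless of resource size."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS fact (fact TEXT PRIMARY KEY, entity TEXT, value TEXT)"
    )
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            con.executemany("INSERT OR REPLACE INTO fact VALUES (?, ?, ?)", batch)
            batch.clear()
    if batch:
        con.executemany("INSERT OR REPLACE INTO fact VALUES (?, ?, ?)", batch)
    con.commit()
    con.close()
```

Because `rows` can be any iterator (e.g. one reading a transformed CSV line by line), peak memory is roughly one batch rather than one resource.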
**Use multiple machines for the transform process** - currently one machine is used to process all resources, but since each result is passed independently onto the assemble phase, this could easily be done across multiple machines. Then, no matter how many resources need processing, the load can easily be handled. This would also fix any scalability issues arising from limited disk space on a single machine.
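Because each resource is transformed independently, the fan-out is embarrassingly parallel. A minimal local sketch of that shape, using a thread pool as a stand-in for separate machines or containers (the `transform` function here is a placeholder, not the real pipeline code):

```python
from concurrent.futures import ThreadPoolExecutor

def transform(resource):
    """Placeholder for transforming one downloaded resource."""
    return f"transformed:{resource}"

def transform_all(resources, workers=4):
    """Run transforms concurrently; results come back in input order,
    ready to be handed to the assemble phase."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transform, resources))
```

In production the same shape could map onto separate Airflow tasks or machines per resource (or per batch of resources), with only the transformed outputs collected for assembly.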
Performance
**Incremental loading** - by reducing the number of resources that need processing, and by switching to an assemble phase that can simply update rather than recreate everything, we could reduce the amount of time to process significantly.
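One possible shape for an "update rather than recreate" assemble step is an upsert: only facts from newly changed resources are written, and existing rows are updated in place. A minimal sketch using SQLite's `ON CONFLICT` upsert (requires SQLite 3.24+; schema and names are illustrative):

```python
import sqlite3

def apply_increment(con, new_facts):
    """Upsert (fact, value) pairs from newly changed resources only,
    leaving the rest of the database untouched."""
    con.executemany(
        """
        INSERT INTO fact (fact, value) VALUES (?, ?)
        ON CONFLICT(fact) DO UPDATE SET value = excluded.value
        """,
        new_facts,
    )
    con.commit()
```

The cost of a run then scales with the size of the increment rather than the size of the dataset, which is what makes the minutes-not-hours target plausible for large collections.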
**Improve the transform process** - each resource is independently transformed. While this processing isn't where the majority of time is spent, you could argue that using pandas, Polars, Spark, or even converting to Rust could give time savings when processing each file, especially for larger files. Most files are under 1 million rows, but there are some on the horizon that are around 14 million rows.
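For context on where those savings would come from, a per-resource transform written as a plain streaming pass keeps memory constant even on a 14-million-row file, but the per-row Python loop is exactly the CPU-bound part that pandas, Polars, Spark, or a Rust rewrite would accelerate. The field mapping below is a hypothetical stand-in, not the real pipeline's column rules:

```python
import csv
import io

def transform_csv(src, dst):
    """Stream one CSV resource row by row: constant memory,
    but row-at-a-time Python is the slow part for large files."""
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    writer.writerow(["reference", "name"])
    for row in reader:
        # Placeholder cleaning rules for illustration only.
        writer.writerow([row["ref"].strip(), row["name"].strip().title()])
```

A vectorised library would express the same mapping as whole-column operations, trading the interpreter loop for native code.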