012 - Increasing Scalability And Performance of Data Collection Pipelines #396

eveleighoj · 2025-06-04T15:26:17Z

eveleighoj
Jun 4, 2025
Maintainer

TBD

yogkauCZ · 2025-06-18T10:42:02Z

yogkauCZ
Jun 18, 2025

Hi Owen,

I'm adding the scalability presentation (DRAFT version) here so we can all work on it together and resolve any questions or doubts.
Uploading MHCLG- Data Collection Improvements.pptx…

0 replies

eveleighoj · 2025-06-18T12:08:44Z

eveleighoj
Jun 18, 2025
Maintainer Author

After looking into how AWS Glue runs Spark jobs, it's clear that there are limitations on the machines executing the Spark jobs. Specifically, it appears that only Python libraries containing pure Python can be used. This has raised a few questions in my mind:

Where is the code going to be stored for jobs run as AWS Glue Spark jobs? Can GitHub be integrated for this?
How can local testing take place for these jobs? With normal Spark jobs, temporary directories or mocked S3 (via moto) can be used.
What if the executor needs to use logic from digital-land-python? I don't think digital-land-python can be installed or used.

3 replies

mattsan-dev Jun 18, 2025

The actual code for Glue jobs, such as PySpark scripts, could be stored in Amazon S3, and Glue reads it from there when a job runs. Developers won’t access the AWS Console directly, and we can still use GitHub as the source of truth: a CI/CD pipeline (such as CodePipeline or Jenkins) can automatically sync the code from GitHub to S3 and deploy the jobs.
For local development, it is possible to simulate the Glue environment using PySpark and AWS’s open-source Glue libraries (using one of AWS’s prebuilt Docker images). We can also mock S3 using tools like moto for unit testing, and use temporary directories to simulate file I/O as with standard Spark jobs. This would require a bit of setup.
If we are using Glue 3.0, it is possible to specify additional Python modules directly in the job parameters. Otherwise, we will need to package the library as a .whl or .egg file, upload it to S3, and then reference it in the Glue job. We would need to test the modules that you are currently using within the POC.
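As a rough sketch of the two packaging routes described above, the job parameters could be assembled like this. `--additional-python-modules` and `--extra-py-files` are Glue's special job parameters; the helper name, the bucket, the wheel filename and the pinned pandas version are all hypothetical and only there for illustration.

```python
def build_glue_default_arguments(pip_modules=None, wheel_s3_uris=None):
    """Build the DefaultArguments mapping for a Glue job definition.

    Hypothetical helper: it only assembles a dict and does not call AWS.
    """
    args = {}
    if pip_modules:
        # Glue expects a comma-separated list of pip requirement strings.
        args["--additional-python-modules"] = ",".join(pip_modules)
    if wheel_s3_uris:
        # Packaged libraries (.whl/.egg/.zip) uploaded to S3 are referenced by URI.
        args["--extra-py-files"] = ",".join(wheel_s3_uris)
    return args


# Hypothetical values: the bucket and wheel name are placeholders.
example = build_glue_default_arguments(
    pip_modules=["pandas==2.2.2"],
    wheel_s3_uris=["s3://example-glue-artifacts/libs/digital_land-0.0.0-py3-none-any.whl"],
)
print(example)
```

The resulting dict is the kind of mapping that would be passed as `DefaultArguments` when the job is defined, for example from a CI/CD step. Whether a given library actually installs on Glue workers would still need to be tested in the POC.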

eveleighoj Jun 19, 2025
Maintainer Author

This sounds good. Is there a limitation on the Python libraries that can be referenced? I saw somewhere that they had to be pure Python, so libraries that compile C++ might not be usable. For example, pandas?

mattsan-dev Jun 24, 2025

There may be some limitations concerning certain Python libraries within PySpark. However, PySpark often has equivalent data-processing APIs (for example pyspark.sql.DataFrame) that are much more efficient and performant than pandas and similar libraries. It is therefore likely that some code will need to be migrated and refactored to use the related PySpark methods.

eveleighoj · 2025-06-18T12:09:28Z

eveleighoj
Jun 18, 2025
Maintainer Author

Why are we using Glue rather than EMR serverless?

3 replies

yogkauCZ Jun 18, 2025

Yes, we can consider EMR. Glue is a bit easier to use and works well with Athena and catalogs.

eveleighoj Jun 23, 2025
Maintainer Author

Just to add a consideration to this: it is likely that our use of Spark will include geospatial requirements, if not now then later. It looks like Apache Sedona provides this functionality for Spark.

mattsan-dev Jun 24, 2025

Initially we were uncertain as to how quickly a solution needed to be created. However, for a more long-term and flexible approach, EMR Serverless is a good option for us to consider, especially for future Apache Sedona functionality (geospatial data) for Spark. Glue offers geospatial support for ArcGIS GeoAnalytics.

eveleighoj · 2025-06-19T10:51:59Z

eveleighoj
Jun 19, 2025
Maintainer Author

Another area I've been looking at, which might affect Glue and Spark, is how we store the data in S3. Will it just be plain Parquet datasets (i.e. Hive-partitioned Parquet files), or should we attempt to use a tool like Delta Lake?

Delta Lake is supported in Spark and gives some options for optimising pipelines using ACID transactions.
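To make the "plain Parquet" option concrete, here is a small sketch of the Hive-style `column=value` directory layout. The prefix, partition columns and file name are hypothetical, not the layout the pipeline actually uses.

```python
from pathlib import PurePosixPath


def hive_partition_key(prefix, partitions, filename):
    """Return an object key using Hive-style column=value directories."""
    path = PurePosixPath(prefix)
    for column, value in partitions.items():
        path = path / f"{column}={value}"
    return str(path / filename)


def parse_hive_partitions(key):
    """Recover the partition columns encoded in a Hive-style key."""
    parts = PurePosixPath(key).parts
    return dict(part.split("=", 1) for part in parts if "=" in part)


# Hypothetical prefix and partition columns, for illustration only.
key = hive_partition_key(
    "transformed/fact",
    {"dataset": "transport-access-node", "entry_date": "2025-06-19"},
    "part-00000.parquet",
)
print(key)
print(parse_hive_partitions(key))
```

With this layout the partition values live only in the path, so readers prune by listing directories. Delta Lake keeps the same Parquet files but adds a `_delta_log` transaction log alongside them, which is where the ACID guarantees come from.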

2 replies

mattsan-dev Jun 24, 2025

Delta Lake has a number of useful features, including ACID transactions, schema enforcement and schema evolution. However, it may introduce some complexity into pipeline development and increase development time. We could opt for this approach initially, or we could start with plain Parquet for the POC and evolve to using Delta Lake.

lakshmi-kovvuri1 Jun 24, 2025

Initially we can start the POC by addressing the scalability issues. Once we have a stable pipeline, we can add Delta Lake as the next step of the POC.

gmatchett · 2025-06-25T16:14:43Z

gmatchett
Jun 25, 2025
Collaborator

Adding myself to this card as Owen is on leave

0 replies

yogkauCZ · 2025-06-26T08:31:38Z

yogkauCZ
Jun 26, 2025

Added the previously discussed questions/answers sheet.
Uploading MHCLG_client_questions_v1 1 (2).xlsx…

0 replies

yogkauCZ · 2025-07-01T13:15:08Z

yogkauCZ
Jul 1, 2025

We have been looking at the metadata for the transport-access-node dataset using the online service (Specification | Planning Data); however, we suspect that this information is likely related to intermediate post-processing activities.

Therefore, we are unsure whether the raw data schema matches what is available online. Can you perhaps confirm whether the entity, attribute and related metadata for this dataset are the same within AWS?
I'm a bit confused by this question and what you mean by raw data. The raw resources are in random schemas, but I believe we have agreed for you not to focus on the step from raw data to transformed data. The shape of the transformed data is not in the spec right now, but I can provide an example transformed resource. The fact, fact_resource and entity tables are in there and are roughly correct. The SQLite files and the Postgres tables are marginally different: you can view the SQLite files through Datasette, and the shape of the Postgres tables is in the SQLAlchemy models in the digital-land.info repo.
For the POC we would like to request a meeting for a full walkthrough of the transport-access-node dataset, from where we need to start our processing (intermediary data pipeline step) through to the final landing area/s for the data.
Yep, we can do this; there are a few things we'll need to tackle. A good place to start is this line https://github.com/digital-land/collection-task/blob/104df85861401d6088728039792a75038ee580ca/task/run.sh#L65 where we make datasets, and then there are some follow-on steps.
2. Where should we place the PySpark scripts within the project pipeline structure?
For example, should they go in a specific subfolder like jobs/ or scripts/?
Based on my understanding of Spark and how you're planning to use EMR, there are likely a couple of options:
You can create a new GitHub repo which pushes to a new S3 bucket.
You can look at keeping the scripts where the Airflow DAGs are kept, but I'm not sure there is any advantage to this.
For the POC you may not need to worry about how it ties into digital-land-python, but you may want to consider whether anything is worth putting in there so it can be shared later, via a Python library, with other pieces of the infrastructure.

Can we interact with Amazon EMR (development environment) from our local VS Code environment by connecting through a GitHub repository?
I'm not sure what you mean by this.
Data mappings - how to map columns/tables to datasets/files
Git repo - should we create Python scripts in existing repos?
Data flow for one dataset/collection as an end-to-end example
Yes, you can look at creating a DAG for just one collection, called something like test-title-boundary-collection, and then just replace the current collections later.
Config file to be placed in S3 - new directory?
Config is already in there.

0 replies

mattsan-dev · 2025-07-02T11:06:45Z

mattsan-dev
Jul 2, 2025

Following discussions with @eveleighoj on 01/07/2025, it was identified that the current post-transformation data model likely follows an Entity-Attribute-Value (EAV) structure. The rationale for choosing this model remains unclear, and it may not be well-suited to the current data processing requirements. Alternative approaches such as relational models or a fact-dimension (star schema) design may offer better performance and maintainability. Further investigation is needed to assess the feasibility of modifying the existing model.
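For illustration, the difference between an EAV layout and a wide relational record can be sketched as below. The field names and values are invented; the real fact table has more columns than this.

```python
from collections import defaultdict


def pivot_eav(rows):
    """Pivot (entity, field, value) rows into one wide record per entity.

    If a field appears more than once for an entity, the later row wins.
    """
    wide = defaultdict(dict)
    for entity, field, value in rows:
        wide[entity][field] = value
    return dict(wide)


# Invented sample facts in EAV form: one row per (entity, field) pair.
facts = [
    (1, "name", "Stop A"),
    (1, "point", "POINT(-0.1 51.5)"),
    (2, "name", "Stop B"),
]
records = pivot_eav(facts)
print(records)
```

In EAV form, adding a new field needs no schema change but every read of a full entity requires a pivot like this one. A relational or star-schema design moves that cost to write time.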

0 replies

mattsan-dev · 2025-07-02T11:19:21Z

mattsan-dev
Jul 2, 2025

@gmatchett, as requested, I’ve reviewed the Digital Land wiki technical documentation (https://digital-land.github.io/technical-documentation/data-operations-manual/Explanation/Key-Concepts/Data-quality-1-needs/) and identified several areas where improvements could be made, including the following:

There is no search functionality
There are broken links (https://digital-land.github.io/technical-documentation/planning.data.gov.uk)
Pages do not contain last modified dates or version control
The data content provided is mostly at a high level
The table of contents is not collapsible making navigation difficult
There is no quick start or onboarding guide
There is no comment or feedback mechanism
It may be beneficial to align the page layout with a more conventional wiki format to enhance clarity and user experience.

0 replies

mattsan-dev · 2025-07-17T09:17:41Z

mattsan-dev
Jul 17, 2025

@eveleighoj - for benchmarking purposes, do you have any data on load runtimes from transform to assemble for transport-access-node, covering the fact, fact-resource and entity tables?

1 reply

eveleighoj Jul 17, 2025
Maintainer Author

Airflow will track task timings for the transport access nodes collection. Unfortunately the current task does collection, planning, transformation, assembling and some baking (we should at some point separate these out). But by looking at the length of time transport access nodes takes and going through some of the logs you should be able to identify roughly how long this step is taking right now and compare that to yours. It's not a permanent fix but can give you some numbers short term.

In the long term you want to examine how long a collection DAG takes to finish (you may need to exclude queue times), as we want to be able to run a collection in a much shorter amount of time.
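A minimal sketch of the comparison described above, assuming the queued, start and end timestamps for each task have been copied out of Airflow by hand. The task names, dict keys and times are made up for the example.

```python
from datetime import datetime, timedelta


def execution_time(task_runs):
    """Sum the time each task spent running, ignoring time spent queued."""
    return sum((run["end"] - run["start"] for run in task_runs), timedelta())


def queue_time(task_runs):
    """Sum the time each task waited between being queued and starting."""
    return sum((run["start"] - run["queued"] for run in task_runs), timedelta())


# Made-up timings for two tasks in one DAG run.
runs = [
    {
        "task": "collect",
        "queued": datetime(2025, 7, 17, 9, 0),
        "start": datetime(2025, 7, 17, 9, 2),
        "end": datetime(2025, 7, 17, 9, 12),
    },
    {
        "task": "transform",
        "queued": datetime(2025, 7, 17, 9, 12),
        "start": datetime(2025, 7, 17, 9, 15),
        "end": datetime(2025, 7, 17, 9, 45),
    },
]
print("execution:", execution_time(runs))
print("queued:", queue_time(runs))
```

Reporting the two figures separately keeps the benchmark focused on processing time, since queue time depends on cluster load and not on the pipeline code.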

lakshmi-kovvuri1 · 2025-10-23T11:41:36Z

lakshmi-kovvuri1
Oct 23, 2025

@eveleighoj - please review and let us know if there are any questions.
Ref #410 (Update infrastructure diagram to incorporate distributed architecture changes): added the architecture diagram to the proposals in technical-documentation, https://github.com/digital-land/technical-documentation/blob/pyspark_scalability_doc/images/proposals/002-data-pipelines-migration/data-collection-pipeline-deployment-v1.1.drawio.png

The diagram reflects the latest updates related to distributed data processing with new tools, including Elastic MapReduce (EMR) Serverless, as well as the related output areas, including Athena for ad-hoc querying and Aurora RDS Postgres for the entity table output.
@mattsancog @yogkauCZ

0 replies
