012 - Increasing Scalability And Performance of Data Collection Pipelines #396
Replies: 11 comments 9 replies
-
Hi Owen, I'm adding the scalability presentation (DRAFT version) here so we can all work on it together and resolve any questions or doubts.
-
After looking into how AWS Glue runs Spark jobs, it's clear that there are limitations on the machine executing the Spark jobs. Specifically, it appears that only Python libraries that contain pure Python can be used. This has raised a few questions in my mind:
-
Why are we using Glue rather than EMR Serverless?
-
Another area I've been looking at, which might affect both Glue and Spark, is how we store the data in S3. Will it just be plain Parquet datasets (i.e. Hive-partitioned Parquet files), or should we attempt to use a tool like Delta Lake? Delta Lake is supported in Spark and gives some options for optimising pipelines using ACID transactions.
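To make the first option concrete, here is a minimal sketch of what a Hive-partitioned layout means in practice: the partition column values are encoded in the directory path as `key=value` segments rather than stored inside the files. It uses only the Python standard library. The bucket name, the `dataset` partition column and the sample rows are all hypothetical placeholders, not our actual layout.

```python
from collections import defaultdict


def hive_partition_path(base, partition_values):
    """Build a Hive-style directory path: base/key1=value1/key2=value2."""
    parts = [f"{key}={value}" for key, value in partition_values.items()]
    return "/".join([base.rstrip("/")] + parts)


def group_by_partition(rows, partition_keys):
    """Group rows by their partition column values.

    The partition columns are dropped from the grouped rows, because a
    Hive-style layout carries them in the directory path instead.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple((k, row[k]) for k in partition_keys)
        data = {k: v for k, v in row.items() if k not in partition_keys}
        groups[key].append(data)
    return dict(groups)


# Hypothetical sample rows, for illustration only.
rows = [
    {"dataset": "transport-access-node", "entity": 1, "field": "name"},
    {"dataset": "transport-access-node", "entity": 2, "field": "name"},
    {"dataset": "example-dataset", "entity": 3, "field": "name"},
]

# Map each partition directory to the rows that would be written under it.
layout = {
    hive_partition_path("s3://example-bucket/fact", dict(key)): data
    for key, data in group_by_partition(rows, ["dataset"]).items()
}

for path, data in layout.items():
    print(path, len(data))
```

As I understand it, Delta Lake would keep the same kind of Parquet files but add a transaction log alongside them, which is where the ACID guarantees come from.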
-
Adding myself to this card as Owen is on leave.
-
Added the previously discussed questions and answers sheet.
-
Therefore, we are unsure whether the raw data schema matches what is available online. Can you perhaps confirm whether the entity, attribute and related metadata for this dataset is the same within AWS?

I'm a bit confused by this question and by what you mean by raw data. The raw resources are in random schemas, but I believe we have agreed that you will not focus on the step from raw data to transformed data. The shape of the transformed data is not in the spec right now, but I can provide an example transformed resource. The fact, fact_resource and entity tables are in there and are roughly correct. The SQLite files and the Postgres tables are marginally different: you can view the SQLite files through Datasette, and the shape of the Postgres tables is in the SQLAlchemy models in the digital-land.info repo.
-
Following discussions with @eveleighoj on 01/07/2025, it was identified that the current post-transformation data model likely follows an Entity-Attribute-Value (EAV) structure. The rationale for choosing this model remains unclear, and it may not be well-suited to the current data processing requirements. Alternative approaches such as relational models or a fact-dimension (star schema) design may offer better performance and maintainability. Further investigation is needed to assess the feasibility of modifying the existing model.
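For anyone less familiar with the terms, here is a minimal sketch of the difference. In an EAV layout each row holds one (entity, attribute, value) triple, whereas a relational or wide layout holds one record per entity with a column per attribute. The field names and sample values below are hypothetical and are not taken from the real fact table.

```python
from collections import defaultdict


def pivot_eav(facts):
    """Collapse (entity, field, value) triples into one dict per entity."""
    entities = defaultdict(dict)
    for entity, field, value in facts:
        # If the same field appears twice for an entity, the later fact wins.
        entities[entity][field] = value
    return dict(entities)


# Hypothetical EAV rows: one row per attribute of each entity.
facts = [
    (1, "name", "Stop A"),
    (1, "point", "POINT(0 0)"),
    (2, "name", "Stop B"),
]

# Wide form: one record per entity, one key per attribute.
wide = pivot_eav(facts)
print(wide)
```

The trade-off, as far as I can tell, is that the EAV form absorbs new attributes without schema changes, while the wide form avoids a pivot or self-join every time a full entity is needed.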
-
@gmatchett, as requested, I’ve reviewed the Digital Land wiki technical documentation (https://digital-land.github.io/technical-documentation/data-operations-manual/Explanation/Key-Concepts/Data-quality-1-needs/) and identified several areas where improvements could be made, including the following:
-
@eveleighoj - for benchmarking purposes, do you have any data on load runtimes from transform to assemble for transport-access-node, covering the fact, fact-resource and entity tables?
-
@eveleighoj - please review and let us know if there are any questions. The diagram reflects the latest updates to the distributed data processing design, with new tools including Elastic MapReduce (EMR) Serverless, and the related output areas, including Athena for ad-hoc querying and Aurora RDS Postgres for the entity table output.
-
TBD