010 – Data Pipeline API #60
stevenjmesser
started this conversation in Open design proposal
First iteration of the Pipeline Internal API has been implemented. Closing this design proposal: no objections have been raised.
Open Design Proposal - 010 - Data Pipeline API
Author(s)
Introduction
An internal API is proposed for providing access to data pipeline metadata, which includes:
Logs
Issues
Processed & converted files
Performance aggregations
Specification
Configuration
Status
Detail
Overview
Up to the time of writing (October '24), the need for access to information about the data collection pipelines has largely been satisfied through Datasette, which has been configured to ingest SQLite files stored on EFS volumes.
It is no secret to the team that our use of Datasette stretches beyond the purpose for which it was originally designed. Recent advances with the Submit tool have further proven the difficulty of relying upon Datasette for OLAP style queries. We have effectively been using Datasette as an API for data collection pipeline metadata - a task which it doesn't naturally suit. Performance and stability are just two of the problems presented by our attempts to employ Datasette as a drop-in replacement for an API.
Another problem with using Datasette as an API is the lack of versioning for endpoints. Endpoints become very brittle, and the only workaround is to create a shadow database.
An internal API, separate from our existing public platform API, is proposed to provide access to the data consumed and produced by the data collection pipelines. This metadata includes:
Logs
Issues
Processed & converted files
Performance aggregations
Specification
Configuration
API Spec
An example of the potential shape of the API has been provided in OpenAPI specification format.
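The referenced specification is not reproduced here. Purely as an illustration of the "potential shape", a versioned logs endpoint in an OpenAPI document might look like the fragment below; the paths, parameters, and schema fields are assumptions for this sketch, not the agreed spec.

```yaml
openapi: "3.0.3"
info:
  title: Pipeline API (illustrative sketch)
  version: "1.0.0"
paths:
  /v1/logs:
    get:
      summary: List pipeline log entries
      parameters:
        - name: dataset
          in: query
          required: false
          schema:
            type: string
      responses:
        "200":
          description: A page of log entries
          content:
            application/json:
              schema:
                type: array
                items:
                  $ref: "#/components/schemas/LogEntry"
components:
  schemas:
    LogEntry:
      type: object
      properties:
        endpoint:
          type: string
        status:
          type: integer
        entry-date:
          type: string
          format: date-time
```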
Public access
Versioning
For versioning of endpoints and the associated request & response schemas, the root of the API path will contain a version, i.e. /v1. For example, the path to the logs endpoint would be as follows:

https://pipeline-api.planning.data.gov.uk/v1/logs

Ideally, a maximum of two versions of the same resource would be maintained at any one time, e.g.

https://pipeline-api.planning.data.gov.uk/v1/logs
https://pipeline-api.planning.data.gov.uk/v2/logs

Importantly, a deprecation date should be agreed for the older version, and all known API consumers should be notified of it.
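The path-versioning convention can be sketched in a few lines. The base URL is taken from the examples above; the helper function itself is hypothetical, not part of any agreed implementation.

```python
# Sketch of the path-versioning convention described in this proposal.
# BASE_URL comes from the examples above; endpoint_url is illustrative only.
BASE_URL = "https://pipeline-api.planning.data.gov.uk"


def endpoint_url(resource: str, version: int = 1) -> str:
    """Build a versioned endpoint URL, e.g. /v1/logs."""
    return f"{BASE_URL}/v{version}/{resource}"


print(endpoint_url("logs"))             # current version: .../v1/logs
print(endpoint_url("logs", version=2))  # successor, kept alongside v1
```

Keeping the version in the path (rather than a header) makes the two concurrently maintained versions directly visible in logs and consumer configuration.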
Container diagram
Pipeline API only
The following container diagram illustrates how the Pipeline API will be able to communicate across a number of different data sources and formats to provide a single view of pipeline metadata.
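The "single view over multiple formats" idea can be sketched as a set of per-format readers returning the same record shape. This is an assumption-laden illustration: the file layouts, table name, and column names are hypothetical, and a Parquet reader is only noted in a comment since it would need pyarrow.

```python
import csv
import io
import sqlite3


# Illustrative sketch: the Pipeline API reading the same logical "log"
# records from different storage formats. Table and column names are
# assumptions, not the real schema.
def read_logs_sqlite(path: str) -> list[dict]:
    """Read log records from a SQLite file."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    rows = conn.execute("SELECT endpoint, status FROM log").fetchall()
    conn.close()
    return [dict(r) for r in rows]


def read_logs_csv(text: str) -> list[dict]:
    """Read log records from CSV content (values arrive as strings)."""
    return list(csv.DictReader(io.StringIO(text)))


# A Parquet reader would follow the same shape, e.g. with
# pyarrow.parquet.read_table(path).to_pylist() -- not shown here.
```

Normalising each source to the same record shape is what lets one API serve a single view regardless of where the metadata is stored.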
Pipeline API within System Context
The following container diagram shows how the Pipeline API interacts within the Data Collection Pipeline system context:
Note that the Pipeline API reads collection and pipeline metadata in Parquet format from the existing Collection Archive bucket. The existing Collection Task will be modified to write data in Parquet format, as well as CSV and SQLite.
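The multi-format output of the modified Collection Task could look roughly like the sketch below. The CSV and SQLite writes use only the standard library; the additional Parquet output is shown as a comment because it needs pyarrow, and the table/column names are assumptions for illustration.

```python
import csv
import sqlite3


def write_outputs(rows: list[dict], csv_path: str, sqlite_path: str) -> None:
    """Write the same resource rows as CSV and SQLite, mirroring the
    Collection Task's current outputs. Column names are illustrative."""
    fields = ["resource", "status"]

    # CSV output
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)

    # SQLite output
    conn = sqlite3.connect(sqlite_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS resource (resource TEXT, status INTEGER)"
    )
    conn.executemany("INSERT INTO resource VALUES (:resource, :status)", rows)
    conn.commit()
    conn.close()

    # The proposed additional Parquet output could be produced with pyarrow,
    # e.g. pyarrow.parquet.write_table(
    #          pyarrow.Table.from_pylist(rows), parquet_path)
    # -- not run here.
```

Writing all formats from the same in-memory rows keeps the three outputs consistent with each other on every run.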
Remember that the overall system context diagram is helpful if you're not so familiar with the architecture of the Digital Planning Data service.
Implementation considerations
New code repositories, GitHub pipelines, ECR image repositories and ECS tasks will be needed for:
New AWS resources will need to be provisioned for:
Design Comments/Questions
Leave comments and questions in this discussion.