We discussed datatypes and agreed that, overall, these are well covered by the issues already raised.
Linking References
Every reference must link to an entity: referential integrity between linked datasets (e.g. A4DA and A4D, LB and LBO). A ticket is in review for further testing (Referential Integrity - Further Tasks config#589).
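A check of this kind could be sketched as follows. The dataset names, row shapes, and column names here are illustrative assumptions, not the pipeline's real schema:

```python
# Sketch: verify referential integrity between two linked datasets.
# Row shapes and names are illustrative, not the real pipeline schema.

def broken_references(referencing_rows, referenced_rows, key="reference"):
    """Return rows whose reference does not resolve to an entity
    in the referenced dataset."""
    known = {row["entity"] for row in referenced_rows}
    return [row for row in referencing_rows if row[key] not in known]

# Example: article-4-direction-area rows referencing article-4-direction
a4d = [{"entity": "A4D-1"}, {"entity": "A4D-2"}]
a4da = [
    {"name": "area one", "reference": "A4D-1"},
    {"name": "area two", "reference": "A4D-9"},  # dangling reference
]

print(broken_references(a4da, a4d))  # flags the dangling "A4D-9" row
```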
Entity Lookup
Every row remaining after the filter stage must have an entity (still an open question).
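If this test is adopted, it could be as simple as the sketch below. The row shape is an illustrative assumption:

```python
# Sketch: after filtering, every remaining row should resolve to an entity.
# Row shape is an illustrative assumption.

def rows_without_entity(rows):
    """Return filtered rows that failed entity lookup (no entity assigned)."""
    return [row for row in rows if not row.get("entity")]

rows = [
    {"reference": "CA01", "entity": 4400001},
    {"reference": "CA02", "entity": None},   # lookup failed
]
print(rows_without_entity(rows))  # flags the CA02 row
```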
Dataset
remove rows with a missing start date
every conservation area must have a name
geometry and points should be within the provision boundary
check the document matches the entity (AI-assisted)
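The geometry check above could start as the sketch below. A real check would test against the true boundary polygon (e.g. with a geometry library such as shapely); this simplified version only screens against a bounding box, and the coordinates are illustrative:

```python
# Sketch: flag points that fall outside a provision's boundary.
# Simplified to a bounding-box screen; a real check would use the
# boundary polygon itself. Coordinates are illustrative.

def outside_bounds(points, bounds):
    """bounds = (min_x, min_y, max_x, max_y); points = [(x, y), ...]."""
    min_x, min_y, max_x, max_y = bounds
    return [
        (x, y) for x, y in points
        if not (min_x <= x <= max_x and min_y <= y <= max_y)
    ]

england_ish = (-6.5, 49.8, 1.8, 55.9)      # rough lon/lat box, illustrative
points = [(-0.12, 51.50), (2.35, 48.85)]   # London; Paris (outside)
print(outside_bounds(points, england_ish))  # [(2.35, 48.85)]
```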
Information
Freshness (the real-world information has changed but the data hasn't been updated)
tree points are on trees (Faculty-supported, using a satellite map?)
--------------------------------------------------- OTHER DISCUSSIONS ---------------------------------------------------
Source and Endpoint URLs
Charts and discussions around these
How good is our provision? A chart looking at coverage (a map of England) and a graph showing the amount of data collected increasing over time.
ratio of authoritative provisions versus endpoints
sources and endpoints documented over time
Every provision should have a count range. We discussed this and agreed further exploration is required: for some provisions, such as conservation areas, we can be confident of an expected range, but for other datasets (for example, tree) it is unclear whether we can be. Could AI be harnessed to help with this?
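A count-range check could be sketched as below. The configured ranges are illustrative assumptions, not agreed values:

```python
# Sketch: validate a provision's row count against an expected range.
# Ranges per dataset are illustrative assumptions, not agreed values.

EXPECTED_COUNTS = {
    "conservation-area": (1, 500),  # a confident range seems plausible
    # "tree": unclear whether a confident range exists, so omitted
}

def count_in_range(dataset, count):
    """Return True if no range is configured, or the count fits it."""
    if dataset not in EXPECTED_COUNTS:
        return True  # no expectation configured for this dataset
    low, high = EXPECTED_COUNTS[dataset]
    return low <= count <= high

print(count_in_range("conservation-area", 42))    # True
print(count_in_range("conservation-area", 5000))  # False
print(count_in_range("tree", 1_000_000))          # True (no range set)
```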
A chart to show the freshness of data on the platform: every provision would have a TTL (time to live), and endpoints could have a 'mute button'. We noted that a data provider may not have updated a particular dataset for years, yet in some cases this does not mean the data is inaccurate or out of date. The idea of freshness needs more exploration.
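The TTL-plus-mute idea could look something like this sketch. Field names and TTL values are illustrative assumptions:

```python
from datetime import date, timedelta

# Sketch: a per-provision TTL with a per-endpoint "mute button".
# Field names and TTL values are illustrative assumptions.

def is_stale(last_updated, ttl_days, muted=False, today=None):
    """A provision is stale if its data is older than the TTL,
    unless the endpoint has been muted (known to update rarely)."""
    if muted:
        return False
    today = today or date.today()
    return (today - last_updated) > timedelta(days=ttl_days)

today = date(2024, 6, 1)
print(is_stale(date(2023, 1, 1), ttl_days=365, today=today))              # True
print(is_stale(date(2023, 1, 1), ttl_days=365, muted=True, today=today))  # False
```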
provision quality (a table to allow visualisation of this is being built)
more rows than expected?
reduction in duplicates
reduction in old entities and endpoints
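The three quality signals above could be computed together, roughly as in this sketch. The row shape is an illustrative assumption:

```python
from collections import Counter

# Sketch: provision-quality signals from the notes above — unexpected
# row counts, duplicate references, and old (superseded) entities.
# Row shape is an illustrative assumption.

def quality_signals(rows, expected_rows, old_entities):
    refs = [row["reference"] for row in rows]
    duplicates = sum(n - 1 for n in Counter(refs).values() if n > 1)
    return {
        "more_rows_than_expected": len(rows) > expected_rows,
        "duplicate_rows": duplicates,
        "old_entities_present": sum(1 for r in rows if r["entity"] in old_entities),
    }

rows = [
    {"reference": "CA01", "entity": 1},
    {"reference": "CA01", "entity": 2},  # duplicate reference
    {"reference": "CA02", "entity": 3},
]
print(quality_signals(rows, expected_rows=2, old_entities={3}))
```

Tracking these numbers per resource would let the provision-quality table show whether duplicates and old entities are reducing over time.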
Site Information Architecture
We discussed that the future aim for the information architecture is a page per:
@Ben-Hodgkiss, @paris-dp, @psd and @Swati-Dash met yesterday in Darlington to look at improving tests for assessing our data quality
Provision
We start our data collection process with "provision" (the combination of an organisation and a dataset).
The quality of each provision is made available in a provision-quality dataset.
Provisions currently exist only to support our service for data providers; we need to add other provisions.
Endpoint
Source (is it legit?)
Collection Logs
Resource
--------------------------------------------------- PIPELINE ---------------------------------------------------
Mapping Fields
NOTE: No hardcoding in our pipeline code
Datatypes
How do we find people we haven't funded?