diff --git a/docs/data-operations-manual/Explanation/Key-Concepts/Data-quality-1-needs.md b/docs/data-operations-manual/Explanation/Key-Concepts/Data-quality-1-needs.md index cc50071..07b1cca 100644 --- a/docs/data-operations-manual/Explanation/Key-Concepts/Data-quality-1-needs.md +++ b/docs/data-operations-manual/Explanation/Key-Concepts/Data-quality-1-needs.md @@ -49,17 +49,17 @@ The way that LPAs are being funded by DLUHC informs some requirements about supp Different users of the data may also have different quality requirements for it. For example, a service may require a specific set of fields from a dataset to always be present in order to be able to use the data, or another one might need to know that the data is always kept up to date. -## How and where data quality assessments happen +## How and where data quality tests happen -The data processing journey covers the steps of data being collected from endpoints, changed into the required format and then collated and published on the planning data service website. Data quality assessments happen at many different stages throughout this journey and can produce different effects, from making direct changes to data values, to recording alerts that something is not as expected. +The data processing journey covers the steps of data being collected from endpoints, changed into the required format and then collated and published on the planning data service website. Data quality tests happen at many different stages throughout this journey and can produce different effects, from making direct changes to data values, to recording alerts that something is not as expected. -It’s important to note that the quality assessments which take place may be concerned with both validity requirements from the data specifications (e.g. is a field value in the expected format), as well as broader requirements or those which concern data maintenance or operations (e.g. are there duplicated endpoints for an organisation’s dataset). 
Both of these sides are essential to consider because both affect the quality of data that ends up published on the platform for consumers. +It’s important to note that the quality tests which take place may be concerned with both validity requirements from the data specifications (e.g. is a field value in the expected format) and broader requirements or those which concern data maintenance or operations (e.g. are there duplicated endpoints for an organisation’s dataset). Both of these sides are essential to consider, because each affects the quality of data that ends up published on the platform for consumers. Throughout this process a number of records are produced to record what’s happened. These artefacts either: -* Directly record the results of automated data quality assessments (and any associated transformations), or manually configured transformations made during processing +* Directly record the results of automated data quality tests (and any associated transformations), or manually configured transformations made during processing -* Or are processing logs which can be used to make further data quality assessments +* Or are processing logs which can be used to run further data quality tests These artefacts provide the input to services or tools which report on or summarise data quality: @@ -69,7 +69,7 @@ These artefacts provide the input to services or tools which report on or summar * Data management team reporting -This process diagram created by Owen outlines some of the key stages in the process and where these artefacts are produced, and the next section describes some of these stages and artefacts in relation to data quality assessments. +This process diagram created by Owen outlines some of the key stages in the process and where these artefacts are produced, and the next section describes some of these stages and artefacts in relation to data quality tests. 
![data quality artefacts diagram](/images/data-operations-manual/data-quality-artefacts.png) @@ -80,9 +80,9 @@ This is the point at which the system attempts to collect data from a data provi #### Key artefact: [Logs table](https://datasette.planning.data.gov.uk/digital-land/log) -This table records the date of the attempt to collect data, as well as different status codes or exceptions depending on whether it was successful or not. These attempts aren’t strictly data quality assessments themselves, but the Logs can be used to make them. +This table records the date of the attempt to collect data, as well as different status codes or exceptions depending on whether it was successful or not. These attempts aren’t strictly data quality tests themselves, but the Logs can be used to make them. -*Example of using the Logs table for a data quality assessment* +*Example of using the Logs table for a data quality test* The Logs table can be queried to check how many days it has been since an endpoint had new data published on it which might allow us to understand whether data is meeting a timeliness requirement. @@ -90,17 +90,19 @@ The Logs table can be queried to check how many days it has been since an endpoi The pipeline transforms the collected data into the format required for the platform. It is important to note here that as well as transforming the shape and format of data the pipeline **can** **also transform data values** supplied by a data provider. -The general model for assessments made here is comparing the supplied state of a value to a desired state. When the desired state is not met a *data quality issue* is logged. The data processing pipeline makes many such assessments automatically and in the case of common or expected data errors it may be possible to automatically transform the data to the desired state. 
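The Logs-based timeliness check described in the example above could be sketched as a query over a simplified copy of the Logs table. This is a minimal sketch for illustration only — the table and column names here are simplified assumptions, not the exact production schema:

```python
import sqlite3
from datetime import date

# Toy copy of the Logs table; real columns differ, this is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (endpoint TEXT, entry_date TEXT, status TEXT, resource TEXT)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?, ?, ?)",
    [
        ("endpoint-a", "2024-11-01", "200", "res-1"),
        ("endpoint-a", "2024-11-10", "200", "res-2"),  # last successful collection of a resource
        ("endpoint-a", "2024-11-18", "404", None),     # failed collection attempt
    ],
)

# Most recent date on which a successful collection returned a resource.
(last_success,) = conn.execute(
    "SELECT MAX(entry_date) FROM log WHERE status = '200' AND resource IS NOT NULL"
).fetchone()

# Fixed "today" so the sketch is deterministic.
days_stale = (date(2024, 11, 20) - date.fromisoformat(last_success)).days
print(last_success, days_stale)
```

Endpoints whose staleness exceeds an agreed threshold could then be flagged as failing a timeliness requirement.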
The pipeline also allows for the configuration of dataset, endpoint, or resource-specific processing to handle problems which have been identified manually. +The general model for tests made here is comparing the supplied state of a value to a desired state. When the desired state is not met, a *data quality issue* is logged. The data processing pipeline makes many such tests automatically, and in the case of common or expected data errors it may be possible to automatically transform the data to the desired state. The pipeline also allows for the configuration of dataset, endpoint, or resource-specific processing to handle problems which have been identified manually. -*Example of automated assessment:* +> *Example of automated test:* +> +> The values for the point field in a tree dataset are supplied in the OSGB36 coordinate reference system, rather than the WGS84 required by the specification. The point value is automatically re-projected to WGS84, and a data quality issue is created to record this. -The values for the point field in a tree dataset is supplied in the OSGB36 coordinate reference system, rather than the WGS84 required by the specification. The point value is automatically re-projected to WGS84, and a data quality issue is created to record this. +
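The general model described above — compare a supplied value to a desired state, transform it where possible, and log an issue either way — can be sketched as follows. This is a minimal illustration, not the actual pipeline code; the patch map and issue fields are hypothetical:

```python
# Minimal sketch of the pipeline's check -> transform -> log-issue model.
# DESIRED, PATCH, and the issue fields are illustrative, not the production schema.
DESIRED = {"I", "II", "III"}               # values required by the specification
PATCH = {"1": "I", "2": "II", "3": "III"}  # known re-mappings (cf. a patch.csv entry)

issues = []

def process_grade(value):
    """Compare a supplied value to the desired state, patching it when possible."""
    if value in DESIRED:
        return value  # already in the desired state, nothing to log
    if value in PATCH:
        # Transformation possible: fix the value and record what happened.
        issues.append({"issue-type": "patch", "severity": "warning", "value": value})
        return PATCH[value]
    # No transformation possible: log an error and pass the value through.
    issues.append({"issue-type": "invalid value", "severity": "error", "value": value})
    return value

fixed = process_grade("2")        # re-mapped to "II", logged as a warning
unfixable = process_grade("bad")  # left as-is, logged as an error
```

The severity recorded mirrors the pattern described below: a successful transformation is informational or a warning, while an unfixable value is an error.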
-Example of manual configuration: +> *Example of manual configuration:* +> +> When adding a new endpoint for the listed-building-outline dataset it’s noticed that the listed-building-grade field contains the values 1, 2, and 3 rather than the I, II, and III required by the specification. These supplied values are mapped to the desired values by making an addition to the patch.csv file in the listed-building collection configuration, and a data quality issue is automatically created during processing to record this re-mapping. -When adding a new endpoint for the listed-building-outline dataset it’s noticed that the listed-building-grade field contains the values 1, 2, and 3 rather than the I, II, and III required by the specification. These supplied values are mapped to the desired values by making an addition to the patch.csv file in the listed-building collection configuration, and a data quality issue is automatically created during processing to record this re-mapping. - -See our [how to configure an endpoint guide](/docs/data-operations-manual/How-To-Guides/Adding/Configure-an-endpoint.md) for more information on configuration. +See our [how to configure an endpoint guide](../../../How-To-Guides/Adding/Configure-an-endpoint) for more information on configuration. The **severity level** of the data quality issue which is logged during this process indicates whether a transformation was successfully made to the desired state (severity level \= “informational” or “warning”), or whether this was not possible (severity level \= “error”). @@ -118,12 +120,12 @@ This table records how the field names in any supplied resource have been mapped ### Dataset -At this stage transformed resources from data providers are combined into database files and loaded onto the platform. 
Once the database files have been created there is a further opportunity to make data quality assessments which make use of an entire dataset or datasets, rather than just being able to examine data row-by-row. Assessments made at this stage of the process vary from the pipeline stage in that they do not alter the data and simply report data quality issues. +At this stage transformed resources from data providers are combined into database files and loaded onto the platform. Once the database files have been created there is a further opportunity to run data quality tests which make use of an entire dataset or datasets, rather than just being able to examine data row-by-row. Tests made at this stage differ from those at the pipeline stage in that they do not alter the data and simply report data quality issues. The method we use to do this is called "expectations" (see our [configure and run expectations guide](../../../How-To-Guides/Testing/Configure-and-run-expectations) for more detail). -*Example Dataset quality assessment* +> *Example Dataset quality test* +> +> The expectation rule "[Check number of entities inside the local planning authority boundary matches the manual count](https://datasette.planning.data.gov.uk/digital-land/expectation?_facet=name&name=Check+number+of+entities+inside+the+local+planning+authority+boundary+matches+the+manual+count)" counts the number of entities we have on the platform for each LPA, and compares the actual number to an expected number obtained by manually counting the entities published on each LPA's website. This quality test is only possible at this stage as it requires summarising from the entire dataset. -Having access to the whole dataset makes it possible to assess things like whether the values in a reference field are unique, or whether the reference values used across conservation-area and conservation-area-document datasets link correctly.
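The entity-count expectation described in the example above can be sketched as a query over a dataset database. A minimal sketch, assuming a simplified entity table; the real expectations run against the full dataset SQLite files with a richer schema:

```python
import sqlite3

# Sketch of a dataset-level expectation: compare entity counts per organisation
# against manually collected expected counts. Table and columns are simplified.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entity (entity INTEGER, organisation_entity INTEGER)")
conn.executemany("INSERT INTO entity VALUES (?, ?)",
                 [(1, 101), (2, 101), (3, 101), (4, 202)])

# Hypothetical expected counts, e.g. taken from each LPA's website.
expected_counts = {101: 3, 202: 5}

results = []
for org, expected in expected_counts.items():
    (actual,) = conn.execute(
        "SELECT COUNT(*) FROM entity WHERE organisation_entity = ?", (org,)
    ).fetchone()
    results.append({"organisation": org, "expected": expected,
                    "actual": actual, "passed": actual == expected})

print(results)
```

Each result row corresponds to the kind of pass/fail record an expectation run produces for later reporting.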
-#### Key artefact: [Expectation issues table](https://datasette.planning.data.gov.uk/digital-land/expectation_issue) +#### Key artefact: [Expectation table](https://datasette.planning.data.gov.uk/digital-land/expectation) -#### Key artefact: [Expectation results table](https://datasette.planning.data.gov.uk/digital-land/expectation_result) diff --git a/docs/data-operations-manual/Explanation/Key-Concepts/Data-quality-2-framework.md b/docs/data-operations-manual/Explanation/Key-Concepts/Data-quality-2-framework.md index 3865482..c24548e 100644 --- a/docs/data-operations-manual/Explanation/Key-Concepts/Data-quality-2-framework.md +++ b/docs/data-operations-manual/Explanation/Key-Concepts/Data-quality-2-framework.md @@ -1,39 +1,49 @@ # Data quality framework -We have a structured approach to how we identify and fix issues with data quality, which we refer to as our *data quality framework*. +## Quality requirements -A key part of this framework is a [list of data quality requirements](https://docs.google.com/spreadsheets/d/1kMAKOAm6Wam-AJb6R0KU-vzdvRCmLVN7PbbTAUh9Sa0/edit#gid=2142834080) built and maintained by the data management team. +We have a structured approach to how we identify and fix issues with data quality, which we refer to as our *data quality framework*. The core of this framework is a [list of data quality requirements](https://docs.google.com/spreadsheets/d/1kMAKOAm6Wam-AJb6R0KU-vzdvRCmLVN7PbbTAUh9Sa0/edit#gid=2142834080) built and maintained by the data management team. -These documented data quality requirements help the data management team understand *what* needs to be assessed, *why* it needs to be assessed, and plan *how* it can be assessed. We use the process below to go from identifying a data quality need through to being able to actively monitor whether or not it is being met, and then raise the issues to the party responsible for fixing them. 
+These documented data quality requirements help the team plan and document the tests used to identify whether or not quality requirements are being met, as well as the methods for running the tests. We use the process below to go from identifying a data quality need through to actively monitoring whether or not it is being met, and then raising identified quality issues to the party responsible for fixing them. + +The list also acts as a backlog of known data quality requirements. By working through this backlog to build more tests we can expand the coverage of our data quality monitoring and measurement. ![defining-data-quality-process](/images/data-operations-manual/defining-data-quality-process.png) -1. Quality requirement: documenting a need we have of planning data based on its intended uses. +1. **Quality requirement**: documenting a need we have of planning data based on its intended uses. -1. Issue definition: agreeing the method for systematically identifying data which is not meeting a quality requirement. +1. **Test definition**: agreeing the methods for systematically identifying data which is not meeting a quality requirement. -1. Issue check implementation: automating identification of issues, either through a query or report, or through changes to the pipeline code. +1. **Test implementation**: productionising the identification of issues in the pipeline, primarily using the issues or expectations testing frameworks, or possibly other pipeline processing artefacts. -1. Issue check use (monitoring): surfacing information about data quality issues in a structured way so that action can be taken. +1. **Quality monitoring**: surfacing information about data quality issues in a structured way so that action can be taken. **Example** Here's an example using a requirement we have based on the expectation that providers of our ODP datasets should only be providing data within their local planning authority boundary. 
This helps us identify a quality issue such as Bristol City Council supplying conservation-area data where a polygon is in Newcastle. -> 1. Quality requirement: geometry data should be within the expected boundary of the provider’s administrative area +> 1. Quality requirement: geometry data should be within the expected boundary of the provider’s administrative area. > -> 1. Issue definition: An ‘out of expected LPA bounds’ issue for ODP datasets is when the supplied geometry does not intersect at all with the provider’s Local Planning Authority boundary +> 1. Test definition: An ‘out of expected LPA bounds’ issue for ODP datasets is when the supplied geometry does not intersect at all with the provider’s Local Planning Authority boundary. > -> 1. Issue check implementation: [expectation rules](https://datasette.planning.data.gov.uk/digital-land/expectation?_facet=name&name=Check+no+entities+are+outside+of+the+local+planning+authority+boundary) which test for any of these issues on all ODP datasets. +> 1. Test implementation: [expectation rules](https://datasette.planning.data.gov.uk/digital-land/expectation?_facet=name&name=Check+no+entities+are+outside+of+the+local+planning+authority+boundary) which test for any of these issues on all ODP datasets. > -> 1. Issue check use (monitoring): surfacing information about out of bounds issues in the Submit service so that LPAs can act on this and fix the issues. +> 1. Quality monitoring: surfacing information about out of bounds issues in the Submit service so that LPAs can act on this and fix the issues. + +## Quality tests + +During the development phase, the data management team might use Datasette queries or Python code in Jupyter notebooks to design the methods for a quality test. Once the method has been proven, it is usually more formally implemented in one of two ways: * **Issues** - issue logs are raised when quality tests fail as the pipeline is processing individual values from a resource. 
+* **Expectations** - expectations logs are raised when expectation rules fail; these rules are run after the entire dataset is built into a SQLite database file. -# Monitoring data quality +Issue and expectations logs are regenerated each night and provide a detailed record of the results of different data quality tests. For more detail on how they work and where they sit in the pipeline process, see the [data quality needs explainer](../Data-quality-1-needs). -Once data quality issues are defined, and checks for them have been implemented, we're able to systematically monitor for any occurances of data quality issues. +## Quality monitoring + +Once data quality requirements are defined, and tests to identify where they're not being met are implemented, we're able to systematically monitor for any occurrences of data quality issues. Monitoring is carried out in one of two ways, depending on whether the responsibility for fixing the issue is external (i.e. with the data provider) or internal (i.e. with the data management team): @@ -41,14 +51,14 @@ Monitoring is carried out in one of two ways, depending on whether the responsib * By the **Data Management team**, to resolve data quality issues that can be fixed by a change in configuration -See our [monitoring data quality](../../../Tutorials/Monitoring-Data-Quality) page which gives guidance on the processes we follow to fix quality issues raised by our operational monitoring. These processes go hand-in-hand with our [data quality requirements](https://docs.google.com/spreadsheets/d/1kMAKOAm6Wam-AJb6R0KU-vzdvRCmLVN7PbbTAUh9Sa0/edit#gid=2142834080), which defines our full backlog of requirements, issue definitions and monitoring approach. +See our [monitoring data quality](../../../Tutorials/Monitoring-Data-Quality) page which gives guidance on the processes we follow to fix quality issues raised by our operational monitoring. 
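As an illustration of the kind of test behind these monitored issues, the ‘out of expected LPA bounds’ definition from the example above boils down to a spatial intersection check. Below is a much-simplified sketch using bounding boxes; the real expectation intersects full geometries with LPA boundary polygons, and the coordinates here are rough illustrative values:

```python
# Simplified out-of-bounds check using bounding boxes rather than real polygons.
def intersects(a, b):
    """True if two (min_x, min_y, max_x, max_y) boxes overlap at all."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

lpa_bounds = (-2.73, 51.38, -2.51, 51.55)   # rough bounding box around Bristol
in_area = (-2.60, 51.45, -2.59, 51.46)      # geometry inside Bristol
out_of_area = (-1.70, 54.95, -1.55, 55.05)  # geometry near Newcastle

# Geometries with no intersection at all are flagged as out-of-bounds issues.
issues = [g for g in (in_area, out_of_area) if not intersects(g, lpa_bounds)]
print(len(issues))
```

Only the Newcastle geometry is flagged, matching the test definition: an issue is raised when the supplied geometry does not intersect the provider's boundary at all.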
These processes go hand-in-hand with our [data quality requirements](https://docs.google.com/spreadsheets/d/1kMAKOAm6Wam-AJb6R0KU-vzdvRCmLVN7PbbTAUh9Sa0/edit#gid=2142834080), which defines our full backlog of requirements, test definitions and monitoring approach. -Note: The Data Management team also has a [defined process](https://docs.google.com/document/d/1YGM8W0E2_qW60k8hlancVWBe0aYPNIfefpctNwQ3MSs/edit) for tackling ad-hoc data quality issues which are raised for data on the platform. This begins with an investigation, followed by one or both of a data fix and root cause resolution. This process may also result in the formal definition of a data quality requirement and issue check so that it can be handled in future through the data quality management framework. +Note: The Data Management team also has a [defined process](https://docs.google.com/document/d/1YGM8W0E2_qW60k8hlancVWBe0aYPNIfefpctNwQ3MSs/edit) for tackling ad-hoc data quality issues which are raised for data on the platform. This begins with an investigation, followed by one or both of a data fix and root cause resolution. This process may also result in the formal definition of a data quality requirement and test so that it can be handled in future through the data quality management framework. -# Measuring data quality +# Quality measurement -With well defined data quality requirements and issues, it's possible to use them to make useful summaries of data quality at different scales, for example assessing whether the data on a particular endpoint meets all of the requirements for a particular purpose. +With well-defined data quality requirements and tests, it's possible to use them to make useful summaries of data quality at different scales, for example, assessing whether the data on an endpoint meets all of the requirements for a particular purpose. We've created a *data quality measurement framework* to define different data quality levels based on the requirements of ODP software. 
This measurement framework is used to score data provisions (a dataset from a provider) and create summaries of the number of provisions at each quality level. @@ -58,9 +68,26 @@ The table below visualises the framework: The criteria marked as "true" at each level must be met by a data provision in order for it to be scored at that level. Therefore the framework defines 5 criteria that must be met in order for a data provision to be *good for ODP*. The levels are cumulative, so those same 5 criteria plus 3 more must be met in order for a provision to be scored as *data that is trustworthy*. Where we have data from alternative providers (e.g. Historic England conservation-area data) the first criteria cannot be met so it is scored as the first quality level, *some data*. -Each of the criteria are based around one or more data quality requirements. For example, the "No other types of validity errors" criteria is based on meeting 7 different data validity requirements from the specifications, while the "No unknown entities" criteria is based on just one timeliness requirement. We track how requirements are mapped to criteria on the [measurement tab of the data quality requirements tracker](https://docs.google.com/spreadsheets/d/1kMAKOAm6Wam-AJb6R0KU-vzdvRCmLVN7PbbTAUh9Sa0/edit?gid=1268095085#gid=1268095085). +Each of the criteria is based around one or more data quality requirements and their respective tests. + +> **Example** +> +> The "No other types of validity errors" criteria is based on meeting 6 different data quality requirements related to expected data formats. If a provision has failed any of the separate tests for these requirements, the criteria is not met. 
+> +> These requirements and their tests are documented in our [data quality needs tracker](https://docs.google.com/spreadsheets/d/1kMAKOAm6Wam-AJb6R0KU-vzdvRCmLVN7PbbTAUh9Sa0/edit?gid=2142834080#gid=2142834080), and the mapping of requirements to criteria is captured in the [measurement tab](https://docs.google.com/spreadsheets/d/1kMAKOAm6Wam-AJb6R0KU-vzdvRCmLVN7PbbTAUh9Sa0/edit?gid=1268095085#gid=1268095085). + +
+ +The framework is flexible as each data quality test is independent and is carried out by the pipeline automatically each night. To make any changes to the measurement framework we can simply edit the mapping from test results to quality criteria, and quality criteria to quality levels. And to extend the framework we can build new tests and add them into the mapping. + +
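The two mappings described above — test results to criteria, and criteria to levels — can be sketched as plain lookup tables. The criteria, test, and level names below are hypothetical placeholders, not the full framework:

```python
# Sketch of scoring a provision: test results -> criteria -> quality level.
# All names here are illustrative; the real framework has more of each.
criteria_tests = {
    "has LPA data": ["authoritative-provider"],
    "no validity errors": ["valid-geometry", "valid-dates"],
}
# Levels are cumulative: each level lists every criterion it requires.
levels = [("some data", []),
          ("good for ODP", ["has LPA data", "no validity errors"])]

def score(test_results):
    """Return the highest quality level whose criteria are all met."""
    met = {c for c, tests in criteria_tests.items()
           if all(test_results.get(t, False) for t in tests)}
    scored = "some data"
    for level, required in levels:
        if all(c in met for c in required):
            scored = level
    return scored

print(score({"authoritative-provider": True,
             "valid-geometry": True, "valid-dates": True}))
```

Extending the framework is then just a matter of adding a test to a criterion's list, or a criterion to a level's list, as the paragraph above describes.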
+> **Note**: +> +> The scoring process is currently a proof of concept run in a Jupyter notebook, but we're working on productionising it so that scores are output to a `provision-quality` table by the pipeline. + +
-The framework is flexible and allows us to add more criteria to each level, or re-order them as required. Note that the criteria marked as "planned" are in development, and will be able to be used in the measurement framework once live. The chart below is an example of using the framework to measure the quality levels across all ODP dataset provisions (on 2024-11-20):