Config tests expectations#514

Merged
gibahjoe merged 12 commits into main from config_tests_expectations
Apr 2, 2026

Conversation

@gibahjoe
Contributor

@gibahjoe gibahjoe commented Mar 30, 2026

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update

Description

Created reusable checkpoint functions to help with testing in various projects that depend on this project.

  1. check_fields_are_within_range
    Validates one or more numeric fields against min/max ranges from an external CSV, with optional row filtering via lookup_rules.

  2. check_field_is_within_range_by_dataset_org
    Range validation like above, but matched on dataset + organisation keys between source and range files.

  3. check_allowed_values
    Checks a field only contains values from an allowed list and reports invalid rows/values.

  4. check_no_blank_rows
    Fails when a row is fully blank (all columns empty/whitespace).

  5. check_values_have_the_correct_datatype
    Uses datatype validators to verify column values match expected datatypes and returns invalid rows with line_number, field, datatype, and value.
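As an illustration of the reported shape, here is a minimal sketch of how an expectation like `check_allowed_values` might behave (the return keys and exact signature are assumptions, not the PR's actual API):

```python
import csv
import io


def check_allowed_values(rows, field, allowed_values):
    # Hypothetical sketch: the real function lives in this PR's code;
    # this only illustrates the described behaviour.
    invalid = []
    for line_number, row in enumerate(rows, start=1):
        value = row.get(field, "")
        if value not in allowed_values:
            invalid.append({"line_number": line_number, "field": field, "value": value})
    return {"passed": not invalid, "invalid_rows": invalid}


# Example: "unknown" is not in the allowed list, so the check fails.
rows = list(csv.DictReader(io.StringIO("status\nactive\nunknown\n")))
result = check_allowed_values(rows, "status", ["active", "retired"])
```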

Related Tickets & Documents

QA Instructions, Screenshots, Recordings

Please replace this line with instructions on how to test your changes, a note
on the devices and browsers this has been tested on, as well as any relevant
images for UI changes.

Added/updated tests?

We encourage you to keep the code coverage percentage at 80% and above. Please refer to the Digital Land Testing Guidance for more information.

  • Yes
  • No, and this is why: please replace this line with details on why tests
    have not been included
  • I need help with writing tests

[optional] Are there any post deployment tasks we need to perform?

[optional] Are there any dependencies on other PRs or Work?

@gibahjoe gibahjoe requested a review from eveleighoj March 30, 2026 12:45
@gibahjoe gibahjoe marked this pull request as ready for review March 30, 2026 12:46
@gibahjoe gibahjoe force-pushed the config_tests_expectations branch from 171d4e1 to 6818887 on March 30, 2026 15:08
from shapely.geometry import GeometryCollection, MultiPolygon, Point, Polygon, shape


def _is_valid_datetime_value(value):
Contributor

I believe this looks like duplication of code from elsewhere in the repository. I'm not sure we should be duplicating it.

So for this function there should be a CSV expectation which checks the date format of the input file. You can make it so that it accepts an argument which is the valid date format that can be accepted in the date column; DuckDB can then cast to that format.

For the purposes of what you're doing we should fix a specific format. The code below was developed to convert from any number of formats, whereas you should be checking that it directly matches a specific format: for it to be good data, all data in the CSV file should have the same format.
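A sketch of the fixed-format approach suggested here, using only the standard library (the round-trip through `strptime`/`strftime` rejects values that happen to parse but are not in the exact format):

```python
from datetime import datetime


def is_exact_date_format(value, date_format="%Y-%m-%d"):
    # Sketch: accept a value only if it parses in the one fixed format
    # AND formats back to the identical string (rejects e.g. "2026-3-30").
    try:
        return datetime.strptime(value, date_format).strftime(date_format) == value
    except (TypeError, ValueError):
        return False
```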

Contributor

Instead of commenting on each function below, we should look at converting them to DuckDB queries which check the properties are true. We should be as specific as possible.
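As a sketch of what "converting them to DuckDB queries" could look like for a range check (`read_csv_auto` and `TRY_CAST` are real DuckDB constructs; the helper name and its use here are assumptions):

```python
def range_check_sql(filepath, field, minimum, maximum):
    # Hypothetical helper: builds a DuckDB query that counts values of
    # `field` that fail to cast to DOUBLE or fall outside [minimum, maximum].
    # The expectation would pass when the query returns failures = 0.
    return (
        f"SELECT count(*) AS failures FROM read_csv_auto('{filepath}') "
        f'WHERE TRY_CAST("{field}" AS DOUBLE) IS NULL '
        f'OR TRY_CAST("{field}" AS DOUBLE) NOT BETWEEN {minimum} AND {maximum}'
    )


sql = range_check_sql("input.csv", "value", 0, 100)
```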

field: the column name to validate
allowed_values: allowed values for the field
"""
cleaned_allowed_values = [
Contributor

This allows for a lot of cleaning of the values themselves. Is this needed? Surely we want values to match almost exactly?

Contributor Author

No, it's not. Will take it out.

@gibahjoe gibahjoe force-pushed the config_tests_expectations branch from f778c10 to 4ed3035 on March 31, 2026 14:04
"pattern": _is_valid_pattern_value,
"multipolygon": _is_valid_multipolygon_value,
"point": _is_valid_point_value,
}
Contributor Author

I wasn't able to use DuckDB with these data types. I am not sure if custom functions work here like in Postgres.

Contributor

Heya, let's get these as separate expectations for now rather than one massive function. You already need to pass a datatype dictionary in, so you're already having to define what the datatypes are somewhere; instead of defining datatypes, let's just have an expectation that says this column should be x.

e.g.

expect_column_to_be_datetime(conn, filepath, field='entry-date', date_format='%Y-%m-%d')

Separating it out makes it easier to read.

@gibahjoe
Contributor Author

gibahjoe commented Apr 1, 2026

I have pushed up the fixes @eveleighoj. The datatype validators now use DuckDB, except for a few: pattern, multipolygon and point.

@gibahjoe gibahjoe requested a review from eveleighoj April 1, 2026 09:23
Contributor

@eveleighoj eveleighoj left a comment


Let's have different expectation functions for different data validators. If they can be done with DuckDB then great; if not, they can just use Python, and then they can fail individually, e.g. it can pass a datetime one and fail an integer one.

If we find a way to do a better one that tests everything, then we can re-use these functions in the future.


- Replaced the previous `check_values_have_the_correct_datatype` function with specific functions for each datatype (e.g., `expect_column_to_be_integer`, `expect_column_to_be_decimal`, etc.).
- Each new function performs validation for a specific datatype and returns detailed results including invalid rows.
- Updated integration tests to reflect the new validation functions and ensure they cover various scenarios for each datatype.
- Removed unused pattern validation function from `datatype_validators.py`.
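A minimal sketch of what one of the new per-datatype expectations might look like (the name follows the pattern listed above; the exact signature and return shape are assumptions):

```python
def expect_column_to_be_integer(rows, field):
    # Sketch: flag every row whose value for `field` is not a valid integer,
    # mirroring the invalid-row reporting described in the PR.
    invalid = []
    for line_number, row in enumerate(rows, start=1):
        value = row.get(field, "")
        try:
            int(value)
        except (TypeError, ValueError):
            invalid.append({"line_number": line_number, "field": field,
                            "datatype": "integer", "value": value})
    return {"passed": not invalid, "invalid_rows": invalid}


result = expect_column_to_be_integer([{"reference": "1"}, {"reference": "abc"}], "reference")
```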
@gibahjoe gibahjoe merged commit 83476ad into main Apr 2, 2026
5 checks passed
@gibahjoe gibahjoe deleted the config_tests_expectations branch April 2, 2026 12:17