Conversation
…s with check_field_is_within_range in CSV operations and tests
… add dataset organization matching
cleanup
```python
from shapely.geometry import GeometryCollection, MultiPolygon, Point, Polygon, shape

def _is_valid_datetime_value(value):
```
I believe this looks like duplication of code from elsewhere in the repository. I'm not sure we should be duplicating it.
So for this function there should be a CSV expectation which checks the date format of the input file. You can make it accept an argument giving the valid date format for the date column; DuckDB can then cast to that format.
For the purposes of what you're doing we should fix a specific format. The code below was developed to convert from any number of formats, whereas you should be checking that the value directly matches one specific format: for it to be good data, all dates in the CSV file should use the same format.
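A minimal stdlib sketch of the single-format check the reviewer is describing: parse every value under one fixed format and reject anything that doesn't match, rather than converting from many formats. The name `expect_date_format` is hypothetical, not the repo's API.

```python
from datetime import datetime

def expect_date_format(values, date_format="%Y-%m-%d"):
    """Check every value parses under ONE fixed format (no multi-format fallback)."""
    invalid = []
    for line_number, value in enumerate(values, start=1):
        try:
            datetime.strptime(value, date_format)
        except ValueError:
            # Value does not match the agreed format, so flag it as bad data.
            invalid.append((line_number, value))
    return len(invalid) == 0, invalid
```

A value like `31/01/2024` fails under `%Y-%m-%d` even though it is a real date, which is the point: all rows must share the same format.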
Instead of commenting on each function below, we should look at converting them to DuckDB queries which check that the properties are true. We should be as specific as possible.
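The suggestion above is to express each validator as a SQL query that returns the rows violating the property. A sketch of that shape, illustrated with stdlib `sqlite3` for portability (the PR itself targets DuckDB, where e.g. `TRY_CAST` or `try_strptime` would do the type checks); the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (name TEXT, area REAL)")
conn.executemany("INSERT INTO data VALUES (?, ?)", [("a", 1.5), ("b", -2.0)])

# Property under test: every area must be non-negative.
# The query returns the violating rows directly, so the expectation
# passes when the result set is empty.
invalid = conn.execute("SELECT rowid, area FROM data WHERE area < 0").fetchall()
```

Pushing the check into SQL keeps each expectation specific and lets the database engine do the scanning instead of per-row Python loops.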
```python
        field: the column name to validate
        allowed_values: allowed values for the field
    """
    cleaned_allowed_values = [
```
This allows for a lot of cleaning of the values themselves. Is this needed? Surely we want values to match almost exactly?
No, it's not. Will take it out.
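With the cleaning removed, the check reduces to exact membership. A simplified sketch of what the exact-match version might look like (the signature is illustrative, not the repo's actual `check_allowed_values`):

```python
def check_allowed_values(rows, field, allowed_values):
    """Exact matching only: no trimming, case-folding, or other cleaning,
    since good data should match the allowed list verbatim."""
    allowed = set(allowed_values)
    invalid = [
        (line_number, row[field])
        for line_number, row in enumerate(rows, start=1)
        if row[field] not in allowed
    ]
    return len(invalid) == 0, invalid
```

Under exact matching, a value like `" A"` with a stray leading space is reported as invalid rather than silently cleaned to `"A"`.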
```python
    "pattern": _is_valid_pattern_value,
    "multipolygon": _is_valid_multipolygon_value,
    "point": _is_valid_point_value,
}
```
I wasn't able to use DuckDB with these data types. I'm not sure whether custom functions work here like they do in Postgres.
Heya, let's have these as separate expectations for now rather than one massive function. You already need to pass a datatype dictionary in, so you're already having to define what the datatypes are somewhere; instead of defining datatypes, let's just have an expectation that says this column should be x, e.g.:
`expect_column_to_be_datetime(conn, filepath, field='entry-date', date_format='%Y-%m-%d')`
Separating it out makes it easier to read.
I have pushed up the fixes @eveleighoj. The datatype validator now uses DuckDB except a few
eveleighoj
left a comment
Let's have different expectation functions for the different data validators. If they can be done with DuckDB then great; if not, they can just use Python, and then they can fail individually, e.g. a column can pass a datetime check and fail an integer one.
If we find a way to do a better one that tests everything, then we can re-use these functions in the future.
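A minimal sketch of the per-datatype split being asked for: each expectation validates one datatype and fails on its own, so one column can pass the datetime check while another fails the integer check. Function names follow the reviewer's `expect_column_to_be_*` suggestion, but the row-dict signature here is an assumption for illustration.

```python
from datetime import datetime

def expect_column_to_be_integer(rows, field):
    """Every value in `field` must parse as an integer."""
    invalid = []
    for line_number, row in enumerate(rows, start=1):
        value = row.get(field, "")
        try:
            int(value)
        except ValueError:
            invalid.append({"line_number": line_number, "field": field, "value": value})
    return len(invalid) == 0, invalid

def expect_column_to_be_datetime(rows, field, date_format="%Y-%m-%d"):
    """Every value in `field` must match one fixed date format."""
    invalid = []
    for line_number, row in enumerate(rows, start=1):
        value = row.get(field, "")
        try:
            datetime.strptime(value, date_format)
        except ValueError:
            invalid.append({"line_number": line_number, "field": field, "value": value})
    return len(invalid) == 0, invalid
```

Because each expectation is independent, a file with a valid `entry-date` column but a non-numeric `count` column fails only the integer expectation.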
- Replaced the previous `check_values_have_the_correct_datatype` function with specific functions for each datatype (e.g., `expect_column_to_be_integer`, `expect_column_to_be_decimal`, etc.).
- Each new function performs validation for a specific datatype and returns detailed results including invalid rows.
- Updated integration tests to reflect the new validation functions and ensure they cover various scenarios for each datatype.
- Removed unused pattern validation function from `datatype_validators.py`.
What type of PR is this? (check all applicable)
Description
Created reusable checkpoint functions to help with testing in various projects that depend on this project.
- `check_fields_are_within_range`: validates one or more numeric fields against min/max ranges from an external CSV, with optional row filtering via `lookup_rules`.
- `check_field_is_within_range_by_dataset_org`: range validation like the above, but matched on dataset + organisation keys between the source and range files.
- `check_allowed_values`: checks a field only contains values from an allowed list and reports invalid rows/values.
- `check_no_blank_rows`: fails when a row is fully blank (all columns empty/whitespace).
- `check_values_have_the_correct_datatype`: uses datatype validators to verify column values match expected datatypes and returns invalid rows with `line_number`, `field`, `datatype`, and `value`.

Related Tickets & Documents
QA Instructions, Screenshots, Recordings
Added/updated tests?
We encourage you to keep the code coverage percentage at 80% and above. Please refer to the Digital Land Testing Guidance for more information.
have not been included
[optional] Are there any post deployment tasks we need to perform?
[optional] Are there any dependencies on other PRs or Work?