Conversation
…pipeline-csvs' into 2396-check-that-all-entity-numbers-in-lookupcsv-are-contained-within-the-ranges-in-entity-organisationcsv
removed wrong column; made error clearer
52224b1 to a162a9c
…csv-are-contained-within-the-ranges-in-entity-organisationcsv
```python
def _datasette_query_csv(db, sql):
    params = urllib.parse.urlencode({"sql": sql, "_size": "max"})
    query_url = f"{DATASETTE_BASE_URL}/{db}.csv?{params}"

    try:
        with urllib.request.urlopen(query_url, timeout=30) as response:
            content = response.read().decode("utf-8")
    except Exception as exc:
        pytest.skip(f"Could not load Datasette query from {query_url}: {exc}")

    return list(csv.DictReader(io.StringIO(content)))


def _ranges_for_collection_from_datasette(collection_name):
    escaped_collection = collection_name.replace("'", "''")
    sql = (
        "select dataset, collection, entity_minimum, entity_maximum "
        "from dataset "
        f"where collection = '{escaped_collection}'"
    )

    rows = _datasette_query_csv(DATASETTE_DB, sql)

    ranges = []

    for row in rows:
        if not isinstance(row, dict):
            continue

        try:
            min_val = int(row.get("entity_minimum"))
            max_val = int(row.get("entity_maximum"))
        except (TypeError, ValueError):
            continue

        dataset_name = (row.get("collection") or "").strip()
        ranges.append((dataset_name, min_val, max_val))

    return ranges
```
Datasette represents the post-processing state of the data, so if we are adding a new dataset it won't be in there yet and the test might fail.
Instead we should use the raw specification to check these at https://raw.githubusercontent.com/digital-land/specification/refs/heads/main/specification/dataset.csv as this will always be updated first.
Should be a fairly easy swap-around though!
The test can be improved to:
1. only check entities that have an organisation, ignoring rows without an org value;
2. ignore the orgs government-organisation:D1342 and government-organisation:PB1164 from the conservation area collection.
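Those two filters could be sketched as a pre-filter on the lookup rows; the `organisation` column name and the `rows_to_check` entry point are illustrative assumptions, not the PR's actual code.

```python
# Organisations to skip for the conservation-area collection (from the review comment).
IGNORED_CONSERVATION_AREA_ORGANISATIONS = {
    "government-organisation:D1342",
    "government-organisation:PB1164",
}


def rows_to_check(rows, collection_name):
    """Yield only the lookup rows the range test should inspect.

    1. skip rows with no organisation value;
    2. skip the two ignored government organisations for conservation-area.
    """
    for row in rows:
        organisation = (row.get("organisation") or "").strip()
        if not organisation:
            continue
        if (
            collection_name == "conservation-area"
            and organisation in IGNORED_CONSERVATION_AREA_ORGANISATIONS
        ):
            continue
        yield row
```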
…hanged the entity ranges to read from specification
…csv-are-contained-within-the-ranges-in-entity-organisationcsv # Conflicts: # collection/brownfield-land/endpoint.csv # collection/brownfield-land/source.csv
…ead of local file
…ests for CSV data integrity
…tained-within-the-ranges-in-entity-organisationcsv' into 2397-check-that-any-values-in-columns-with-a-date-datatype-contain-valid-dates
eveleighoj left a comment
I think you should look at using expectations to run these checks. It'll be easier to run them on other CSVs in the future. Right now the support is specific to these files, and the analysis is run with inefficient methods.
```python
REPO_ROOT = Path(__file__).resolve().parents[2]
SEARCH_DIRS = ["pipeline", "collection"]
SPECIFICATION_DATASET_URL = "https://raw.githubusercontent.com/digital-land/specification/refs/heads/main/specification/dataset.csv"
```
Instead of just getting the one csv from the specification create a pytest fixture to download the full spec, look at the specification class in digital-land-python. It should have a method to download it to a directory, you can use a temporary directory by using the tmp_path fixture.
This means the spec will be downloaded once and only once for any tests, especially if you set the fixture scope to the session. You can put the fixture in a conftest.py file
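A sketch of such a session-scoped fixture for `conftest.py`. digital-land-python's `Specification` class is the suggested route, but its exact download method isn't shown in this thread, so plain `urllib` stands in here; the `SPEC_FILES` list is also an assumption.

```python
# conftest.py -- session-scoped fixture sketch.
import urllib.request

import pytest

SPEC_BASE = (
    "https://raw.githubusercontent.com/digital-land/specification/"
    "refs/heads/main/specification"
)
SPEC_FILES = ["dataset.csv"]  # extend with the other specification CSVs as needed


@pytest.fixture(scope="session")
def specification_dir(tmp_path_factory):
    # tmp_path is function-scoped, so a session fixture uses tmp_path_factory
    # to get a temporary directory that outlives individual tests.
    spec_dir = tmp_path_factory.mktemp("specification")
    for name in SPEC_FILES:
        urllib.request.urlretrieve(f"{SPEC_BASE}/{name}", spec_dir / name)
    return spec_dir
```

With `scope="session"` the download happens once per test run, and every test that takes `specification_dir` as an argument shares the same directory.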
```python
    lookup_files,
    ids=[_test_id(f) for f in lookup_files],
)
def test_lookup_entities_within_organisation_ranges(lookup_file):
```
I suggest using duckdb or a similar technology to run analytical queries like this, reading and scanning through them can become costly and inefficient.
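The row-scanning loop can become a single SQL query. The reviewer suggests duckdb, which can query a CSV file directly (e.g. `select * from 'lookup.csv'`); this sketch uses stdlib `sqlite3` as a stand-in so it runs without extra dependencies, and the table and column names are illustrative.

```python
import sqlite3


def entities_outside_ranges(entities, ranges):
    """Return entity numbers not covered by any (minimum, maximum) range, inclusive."""
    con = sqlite3.connect(":memory:")
    con.execute("create table lookup (entity integer)")
    con.executemany("insert into lookup values (?)", [(e,) for e in entities])
    con.execute("create table range (minimum integer, maximum integer)")
    con.executemany("insert into range values (?, ?)", ranges)
    # Anti-join: keep entities with no covering range.
    rows = con.execute(
        "select entity from lookup l where not exists ("
        " select 1 from range r where l.entity between r.minimum and r.maximum)"
    ).fetchall()
    return [r[0] for r in rows]
```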
Again, you could write an expectation for this that can be reused on other CSVs.
```python
    return ranges


OLD_ENTITY_IGNORED_ORGANISATIONS = {
```
The old_entity table doesn't contain any organisation data, so I'm not sure why this variable is named this way.
```python
    old_entity_files,
    ids=[_test_id(f) for f in old_entity_files],
)
def test_old_entity_entities_are_within_specification_entity_ranges(old_entity_file):
```
The old_entity files already have a test which can be expanded using expectations from digital-land-python. For this one I suggest creating a csv operation here https://github.com/digital-land/digital-land-python/blob/main/digital_land/expectations/operations/csv.py which checks that a column contains integers that belong to specific ranges.
You can then use that expectation in the OLD_ENTITY_RULES to remove a lot of the code here and keep things simple. It's more complicated to set up, but the code is more portable and can be used in several different scenarios.
Updated tree-preservation-zone entries for local authorities HOR and BEN.
Updated heritage-at-risk entity ranges in the CSV.
Removed duplicate header and fixed buffer zone range.
…csv-are-contained-within-the-ranges-in-entity-organisationcsv # Conflicts: # pipeline/local-plan/entity-organisation.csv # pipeline/tree-preservation-order/entity-organisation.csv
…tained-within-the-ranges-in-entity-organisationcsv' into 2397-check-that-any-values-in-columns-with-a-date-datatype-contain-valid-dates # Conflicts: # collection/article-4-direction/source.csv
Removed HTML entities from the CSV file.
Updated the CSV file to correct the formatting of the documentation URL.
Reordered the header and data rows in the source.csv file.
Reinserted header and added new entries to the local planning authority data.
…csv-are-contained-within-the-ranges-in-entity-organisationcsv # Conflicts: # collection/developer-contributions/endpoint.csv # collection/developer-contributions/source.csv # collection/historic-england/endpoint.csv # collection/historic-england/source.csv # pipeline/developer-contributions/lookup.csv # pipeline/historic-england/entity-organisation.csv # tests/acceptance/test_config_dataset.py
…csv-are-contained-within-the-ranges-in-entity-organisationcsv
…csv-are-contained-within-the-ranges-in-entity-organisationcsv # Conflicts: # collection/border/source.csv # collection/developer-contributions/endpoint.csv # collection/developer-contributions/source.csv # collection/historic-england/endpoint.csv # collection/historic-england/source.csv # pipeline/article-4-direction/entity-organisation.csv # pipeline/brownfield-land/entity-organisation.csv # pipeline/conservation-area/entity-organisation.csv # pipeline/historic-england/entity-organisation.csv # pipeline/listed-building/entity-organisation.csv # pipeline/tree-preservation-order/entity-organisation.csv
Lookup/old-entity entity validation
Datatype Validation coverage
(datatype coverage table: string, text, blob, wkt, Notes)
Datatypes come from field.csv in the specification repo, with field-level overrides where needed. Collapsed #2397 into this PR so we have only one PR to manage instead.