
2396 check that all entity numbers in lookupcsv are contained within the ranges in entity organisationcsv#2406

Open
gibahjoe wants to merge 54 commits into main from
2396-check-that-all-entity-numbers-in-lookupcsv-are-contained-within-the-ranges-in-entity-organisationcsv

Conversation

Contributor

@gibahjoe gibahjoe commented Mar 19, 2026

Lookup/old-entity entity validation

Datatype Validation coverage

  • datetime (also used for date values)
  • integer
  • decimal
  • flag
  • json
  • curie
  • curie-list (override-based for organisations)
  • url
  • hash
  • pattern
  • latitude
  • longitude
  • point
  • multipolygon
  • string
  • text
  • blob
  • wkt

Notes

  • Datatypes come from field.csv in the specification repo, with field-level overrides where needed.
  • Validation is inspired by the specification repo, especially for date formats.

Collapsed #2397 into this PR so we only have one PR to manage.
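For illustration, a minimal sketch of one kind of datatype check listed above: a datetime validator. The accepted format list here is hypothetical; the PR's actual formats come from the specification repo and may differ.

```python
from datetime import datetime

# Hypothetical set of accepted formats; the real list is driven by the
# specification repo's dateformat guidance, not hard-coded like this.
ACCEPTED_FORMATS = ["%Y-%m-%d", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%dT%H:%M:%SZ"]


def is_valid_datetime(value):
    """Return True if the value parses with any accepted format."""
    for fmt in ACCEPTED_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


print(is_valid_datetime("2026-03-19"))  # True
print(is_valid_datetime("19/03/2026"))  # False
```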

…pipeline-csvs' into 2396-check-that-all-entity-numbers-in-lookupcsv-are-contained-within-the-ranges-in-entity-organisationcsv
Swati-Dash and others added 2 commits March 19, 2026 17:15
@gibahjoe gibahjoe force-pushed the 2396-check-that-all-entity-numbers-in-lookupcsv-are-contained-within-the-ranges-in-entity-organisationcsv branch from 52224b1 to a162a9c on March 19, 2026 at 17:15
…csv-are-contained-within-the-ranges-in-entity-organisationcsv
Comment on lines +74 to +112
def _datasette_query_csv(db, sql):
    params = urllib.parse.urlencode({"sql": sql, "_size": "max"})
    query_url = f"{DATASETTE_BASE_URL}/{db}.csv?{params}"

    try:
        with urllib.request.urlopen(query_url, timeout=30) as response:
            content = response.read().decode("utf-8")
    except Exception as exc:
        pytest.skip(f"Could not load Datasette query from {query_url}: {exc}")

    return list(csv.DictReader(io.StringIO(content)))


def _ranges_for_collection_from_datasette(collection_name):
    escaped_collection = collection_name.replace("'", "''")
    sql = (
        "select dataset, collection, entity_minimum, entity_maximum "
        "from dataset "
        f"where collection = '{escaped_collection}'"
    )

    rows = _datasette_query_csv(DATASETTE_DB, sql)

    ranges = []

    for row in rows:
        if not isinstance(row, dict):
            continue

        try:
            min_val = int(row.get("entity_minimum"))
            max_val = int(row.get("entity_maximum"))
        except (TypeError, ValueError):
            continue

        dataset_name = (row.get("collection") or "").strip()
        ranges.append((dataset_name, min_val, max_val))

    return ranges
Contributor

Datasette represents the post-processing state of the data. So if we are adding a new dataset then it won't be in there yet and might fail.

Instead we should use the raw specification to check these at https://raw.githubusercontent.com/digital-land/specification/refs/heads/main/specification/dataset.csv as this will always be updated first.

Should be a fairly easy swap-around though!
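A sketch of that swap-around, reading dataset.csv from the raw specification instead of Datasette. The column names (`collection`, `entity-minimum`, `entity-maximum`) are assumptions to confirm against the spec's actual header.

```python
import csv
import io
import urllib.request

SPECIFICATION_DATASET_URL = (
    "https://raw.githubusercontent.com/digital-land/specification"
    "/refs/heads/main/specification/dataset.csv"
)


def parse_ranges(csv_text, collection_name):
    """Extract (dataset, minimum, maximum) tuples for one collection."""
    ranges = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if (row.get("collection") or "").strip() != collection_name:
            continue
        try:
            # Column names are an assumption; check the spec CSV header.
            min_val = int(row["entity-minimum"])
            max_val = int(row["entity-maximum"])
        except (KeyError, TypeError, ValueError):
            continue
        ranges.append(((row.get("dataset") or "").strip(), min_val, max_val))
    return ranges


def ranges_for_collection_from_specification(collection_name):
    """Fetch dataset.csv from the raw specification and parse its ranges."""
    with urllib.request.urlopen(SPECIFICATION_DATASET_URL, timeout=30) as resp:
        return parse_ranges(resp.read().decode("utf-8"), collection_name)
```

Separating parsing from fetching keeps the range extraction testable without network access.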

Contributor

The test can be improved to:
1. Only check entities with an organisation and ignore those without an org value.
2. Ignore the orgs government-organisation:D1342 and government-organisation:PB1164 in the conservation-area collection.
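A hypothetical sketch of those two filters applied to lookup rows (field names and the collection check are assumptions based on the comment above, not the PR's actual code):

```python
# Organisations exempt from the range check in the conservation-area
# collection, per the review comment.
IGNORED_ORGANISATIONS = {
    "government-organisation:D1342",
    "government-organisation:PB1164",
}


def rows_to_check(rows, collection_name):
    """Yield only the lookup rows that should be range-checked."""
    for row in rows:
        org = (row.get("organisation") or "").strip()
        if not org:
            continue  # 1. skip entities without an organisation value
        if collection_name == "conservation-area" and org in IGNORED_ORGANISATIONS:
            continue  # 2. skip the two exempt government organisations
        yield row
```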

gibahjoe and others added 18 commits March 23, 2026 11:02
…hanged the entity ranges to read from specification
…csv-are-contained-within-the-ranges-in-entity-organisationcsv

# Conflicts:
#	collection/brownfield-land/endpoint.csv
#	collection/brownfield-land/source.csv
…tained-within-the-ranges-in-entity-organisationcsv' into 2397-check-that-any-values-in-columns-with-a-date-datatype-contain-valid-dates
Contributor

@eveleighoj eveleighoj left a comment

I think you should look at using expectations to run these checks. It'll be easier to run them on other CSVs in the future. Right now the support is specific to these files and uses inefficient methods for running this analysis.


REPO_ROOT = Path(__file__).resolve().parents[2]
SEARCH_DIRS = ["pipeline", "collection"]
SPECIFICATION_DATASET_URL = "https://raw.githubusercontent.com/digital-land/specification/refs/heads/main/specification/dataset.csv"
Contributor

Instead of just getting the one CSV from the specification, create a pytest fixture to download the full spec. Look at the Specification class in digital-land-python; it should have a method to download it to a directory, and you can use a temporary directory via the tmp_path fixture.

This means the spec will be downloaded once and only once for all the tests, especially if you set the fixture scope to the session. You can put the fixture in a conftest.py file.
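A sketch of what such a conftest.py fixture could look like. The download helper here is a stand-in: digital-land-python's Specification class is assumed to offer an equivalent method (its name and import path need checking against that repo). Note that at session scope the session-scoped tmp_path_factory fixture replaces the function-scoped tmp_path.

```python
import pytest


def download_specification(spec_dir):
    """Stand-in for the real download. digital-land-python's Specification
    class is assumed to provide an equivalent method; this stub just writes
    a placeholder file so the fixture shape is clear."""
    (spec_dir / "dataset.csv").write_text("dataset,collection\n")
    return spec_dir


@pytest.fixture(scope="session")
def specification_dir(tmp_path_factory):
    # scope="session" means the download runs once for the whole test run;
    # tmp_path_factory is the session-scoped counterpart of tmp_path.
    return download_specification(tmp_path_factory.mktemp("specification"))
```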

lookup_files,
ids=[_test_id(f) for f in lookup_files],
)
def test_lookup_entities_within_organisation_ranges(lookup_file):
Contributor

I suggest using duckdb or a similar technology to run analytical queries like this; reading and scanning through the files row by row in Python can become costly and inefficient.
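A minimal sketch of pushing the range check into SQL. It uses the standard library's sqlite3 as one such "similar technology" (duckdb could read the CSVs directly with read_csv_auto, but sqlite3 keeps this example dependency-free); the table and column names are illustrative.

```python
import sqlite3


def entities_out_of_range(entities, ranges):
    """Find entity numbers not covered by any (dataset, lo, hi) range,
    using a SQL anti-join instead of a Python loop. Sketch only."""
    con = sqlite3.connect(":memory:")
    con.execute("create table entity (n integer)")
    con.execute("create table range (dataset text, lo integer, hi integer)")
    con.executemany("insert into entity values (?)", [(e,) for e in entities])
    con.executemany("insert into range values (?, ?, ?)", ranges)
    rows = con.execute(
        """
        select e.n from entity e
        where not exists (
            select 1 from range r where e.n between r.lo and r.hi
        )
        order by e.n
        """
    ).fetchall()
    return [r[0] for r in rows]


print(entities_out_of_range([5, 150], [("demo", 1, 100)]))  # [150]
```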

Contributor

Again, you could write an expectation for this that can be reused on other CSVs.

return ranges


OLD_ENTITY_IGNORED_ORGANISATIONS = {
Contributor

The old_entity table doesn't contain any organisation data, so I'm not sure why this variable is named this way?

old_entity_files,
ids=[_test_id(f) for f in old_entity_files],
)
def test_old_entity_entities_are_within_specification_entity_ranges(old_entity_file):
Contributor

The old_entity files already have a test which can be expanded using expectations from digital-land-python. For this one I suggest creating a CSV operation in here https://github.com/digital-land/digital-land-python/blob/main/digital_land/expectations/operations/csv.py which checks that a column contains integers that belong to specific ranges.

You can then use that expectation in the OLD_ENTITY_RULES to remove a lot of the code here and keep things simple. It's more complicated to set up but the code is more portable and can be used in several different scenarios.
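A hedged sketch of what such a CSV operation might look like. The function name, signature, and return shape here are assumptions for illustration; the real operations module in digital-land-python follows its own conventions, which should be matched when contributing.

```python
import csv


def check_column_values_within_ranges(csv_path, column, ranges):
    """Sketch of a reusable CSV expectation: every non-empty value in
    `column` must be an integer that falls inside at least one
    (minimum, maximum) range. Returns (passed, issues)."""
    issues = []
    with open(csv_path, newline="") as f:
        # start=2: row 1 of the file is the header.
        for line_no, row in enumerate(csv.DictReader(f), start=2):
            raw = (row.get(column) or "").strip()
            if not raw:
                continue  # empty values are out of scope for this check
            try:
                value = int(raw)
            except ValueError:
                issues.append((line_no, raw, "not an integer"))
                continue
            if not any(lo <= value <= hi for lo, hi in ranges):
                issues.append((line_no, raw, "outside all ranges"))
    return len(issues) == 0, issues
```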

Swati-Dash and others added 9 commits March 25, 2026 21:29
Updated heritage-at-risk entity ranges in the CSV.
Removed duplicate header and fixed buffer zone range.
…csv-are-contained-within-the-ranges-in-entity-organisationcsv

# Conflicts:
#	pipeline/local-plan/entity-organisation.csv
#	pipeline/tree-preservation-order/entity-organisation.csv
…tained-within-the-ranges-in-entity-organisationcsv' into 2397-check-that-any-values-in-columns-with-a-date-datatype-contain-valid-dates

# Conflicts:
#	collection/article-4-direction/source.csv
Swati-Dash and others added 19 commits March 26, 2026 17:35
Removed HTML entities from the CSV file.
Updated the CSV file to correct the formatting of the documentation URL.
Reordered the header and data rows in the source.csv file.
Reinserted header and added new entries to the local planning authority data.
…csv-are-contained-within-the-ranges-in-entity-organisationcsv

# Conflicts:
#	collection/developer-contributions/endpoint.csv
#	collection/developer-contributions/source.csv
#	collection/historic-england/endpoint.csv
#	collection/historic-england/source.csv
#	pipeline/developer-contributions/lookup.csv
#	pipeline/historic-england/entity-organisation.csv
#	tests/acceptance/test_config_dataset.py
…csv-are-contained-within-the-ranges-in-entity-organisationcsv
…csv-are-contained-within-the-ranges-in-entity-organisationcsv

# Conflicts:
#	collection/border/source.csv
#	collection/developer-contributions/endpoint.csv
#	collection/developer-contributions/source.csv
#	collection/historic-england/endpoint.csv
#	collection/historic-england/source.csv
#	pipeline/article-4-direction/entity-organisation.csv
#	pipeline/brownfield-land/entity-organisation.csv
#	pipeline/conservation-area/entity-organisation.csv
#	pipeline/historic-england/entity-organisation.csv
#	pipeline/listed-building/entity-organisation.csv
#	pipeline/tree-preservation-order/entity-organisation.csv