Fix Entity Counts #2421

@eveleighoj

Overview
A dataset resource log is created for each file that we process. When the log is loaded into the dataset package, counts are also created in the table. These counts appear to be incorrect for some of our resources, and we need to fix this.

Pull Request(PR):

Tech Approach

Investigate:

  • The code here is where the counts are currently made
  • @sianteesdale has provided a set of examples where the counts don't match, along with a notebook containing the code to compare the count in dataset_resource against the count from the actual file
  • You can run some of the collections locally (smaller ones are probably best) to investigate why it's happening. It could be caused by a separate issue where facts are not present, although that doesn't appear to happen locally.
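As a rough starting point for the comparison described above, the sketch below counts distinct entities in a transformed file and compares that against the logged count. It is not the code from the notebook. The `entity` column name, the `entity_count` and `resource` columns on dataset_resource, and the `<resource>.csv` file layout are all assumptions and should be checked against the real schema.

```python
import csv
import sqlite3
from contextlib import closing
from pathlib import Path


def count_entities_in_file(transformed_csv):
    """Count distinct non-empty values in the (assumed) 'entity' column."""
    entities = set()
    with open(transformed_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            entity = (row.get("entity") or "").strip()
            if entity:
                entities.add(entity)
    return len(entities)


def find_mismatches(sqlite_path, transformed_dir):
    """Return (resource, logged_count, file_count) where the two counts differ.

    Assumes dataset_resource has 'resource' and 'entity_count' columns and
    that each transformed file is named '<resource>.csv'.
    """
    with closing(sqlite3.connect(sqlite_path)) as conn:
        rows = conn.execute(
            "SELECT resource, entity_count FROM dataset_resource"
        ).fetchall()

    mismatches = []
    for resource, logged_count in rows:
        path = Path(transformed_dir) / f"{resource}.csv"
        if not path.exists():
            # Nothing to compare against locally; skip rather than guess.
            continue
        file_count = count_entities_in_file(path)
        if file_count != logged_count:
            mismatches.append((resource, logged_count, file_count))
    return mismatches
```

Running `find_mismatches` over one of the smaller collections should give a list of resources to look at more closely.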

Solve:

  • Moving forward we will likely reduce the emphasis on the dataset package. This matters because it means it may be best to migrate the counting of entities to the pipeline itself. This will also mean the computation happens at a different time.
  • This calculation can happen here, inside the transform method, as that is where the log is created.
  • A note on the above: check the performance implications. If there are any, we may want a boolean argument to turn the counting on and off for now, as this code runs in the async processor and might slow it down.
  • Remember to remove the code calculating it in the sqlite files.
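One possible shape for counting inside the transform step, with a boolean switch, is sketched below. `DatasetResourceLog`, `transform`, and `count_entities` are hypothetical stand-ins for illustration, not the actual digital-land-python classes or signatures, and the `entity` key on each row is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional


@dataclass
class DatasetResourceLog:
    """Hypothetical stand-in for the real dataset resource log."""
    resource: str
    entity_count: Optional[int] = None


def transform(
    rows: Iterable[Dict[str, str]],
    log: DatasetResourceLog,
    count_entities: bool = True,
) -> Iterator[Dict[str, str]]:
    """Yield rows unchanged, recording the distinct entity count on the log.

    When count_entities is False the set is never built, so the only extra
    cost per row is a single 'is not None' check.
    """
    seen = set() if count_entities else None
    for row in rows:
        if seen is not None:
            entity = row.get("entity")
            if entity:
                seen.add(entity)
        yield row
    # Only reached once the caller has consumed every row.
    if seen is not None:
        log.entity_count = len(seen)
```

Because the count is written after the last row is yielded, it is only available once the stream has been fully consumed. The set of seen entities is the main memory cost, which is what the boolean flag would let us switch off in the async processor if it turns out to matter.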

Acceptance Criteria/Tests

  • entity counts in dataset_resource must match what is in the transformed_resource

Resourcing & Dependencies

  • will require changes to digital-land-python and maybe collection-task and makerules

Projects

Status

Sprint Backlog ⏭️
