Overview
A dataset resource log is created for each file we process. When it is loaded into the dataset package, entity counts are also written to the table. These counts appear to be incorrect for some of our resources, and we need to fix this.
Pull Request(PR):
Tech Approach
Investigate:
- The code here is where the counts are currently generated.
- @sianteesdale has provided a set of examples where the counts don't match, along with a notebook containing code to compare the count in dataset_resource against the count derived from the actual file.
- Run some of the collections locally (smaller ones are a good starting point) to investigate why this happens. It could be caused by a separate issue around facts not being present, which does not appear to occur locally.
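The local comparison from the notebook can be sketched along these lines. This is an illustrative sketch only: the table and column names (`dataset_resource`, `entity_count`, `resource`) and the CSV `entity` column are assumptions about the schema, not confirmed against the actual dataset package.

```python
import csv
import sqlite3

def compare_counts(sqlite_path, transformed_csv, resource_hash):
    """Compare the entity count stored for a resource against the
    number of distinct entities in its transformed CSV.

    Table/column names here (dataset_resource, entity_count, resource,
    entity) are assumptions, not the confirmed schema.
    """
    # Count distinct entities in the transformed file itself.
    with open(transformed_csv, newline="") as f:
        actual = len({row["entity"] for row in csv.DictReader(f)})

    # Read the count recorded in the dataset package.
    con = sqlite3.connect(sqlite_path)
    row = con.execute(
        "SELECT entity_count FROM dataset_resource WHERE resource = ?",
        (resource_hash,),
    ).fetchone()
    con.close()

    stored = row[0] if row else None
    return stored, actual, stored == actual
```

Running this over @sianteesdale's example resources should reproduce the mismatches and show which side is wrong.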
Solve:
- Going forward we will likely reduce the emphasis on the dataset package, so it may be best to migrate the counting of entities into the pipeline itself. Note this also changes when the computation happens.
- The calculation can happen here, inside the transform method, since that is where the log is created.
- Check the performance implications of the above: this code runs in the async processor, so it could slow processing down. If there is a measurable cost, we may want a boolean argument to turn the counting on and off for now.
- Remember to remove the code calculating the counts in the sqlite files.
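The solve steps above could look roughly like this. All names here (`Transformer`, `transform`, `include_entity_counts`, the log dict shape) are hypothetical placeholders, not the real digital-land-python API; the point is the counting alongside the transform and the opt-in flag.

```python
class Transformer:
    """Sketch of counting entities during transform, guarded by a flag.

    Names are hypothetical, not the digital-land-python API.
    """

    def __init__(self, include_entity_counts=False):
        # Off by default until the performance impact in the
        # async processor has been measured.
        self.include_entity_counts = include_entity_counts

    def transform(self, rows, resource):
        entities = set()
        out = []
        for row in rows:
            out.append(row)
            # Collect distinct entity values as rows stream through,
            # so no second pass over the data is needed.
            if self.include_entity_counts and row.get("entity"):
                entities.add(row["entity"])
        log = {"resource": resource}
        if self.include_entity_counts:
            log["entity_count"] = len(entities)
        return out, log
```

Counting inline like this avoids a second pass over the transformed file, which is the main reason to compute it at transform time rather than at dataset-package load time.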
Acceptance Criteria/Tests
- Entity counts in dataset_resource must match what is in the transformed_resource.
Resourcing & Dependencies
- Will require changes to digital-land-python, and possibly to collection-task and makerules.