Summary
The PQG schema includes a project column on SamplingEvent entities to capture OpenContext project structure, but the Zenodo narrow/wide files have this column empty. Eric's OpenContext-specific PQG files have this data populated.
Current State
| File | Project Data | Record Count |
|---|---|---|
| oc_isamples_pqg_wide.parquet (Eric's) | ✅ 1.1M SamplingEvents with project names | 11.6M total |
| zenodo_narrow_*.parquet | ❌ Column exists but empty | 106M total |
| zenodo_wide_*.parquet | ❌ Column exists but empty | 20M total |
Example Project Values (from Eric's OC files)
| Project | Records |
|---|---|
| Pyla-Koutsopetria Archaeological Project I: Pedestrian Survey | 8,475 |
| Pyla-Koutsopetria Archaeological Project II: Geophysics and Excavation | 6,971 |
| Çatalhöyük Zooarchaeology | 126,116 |
| Petra Great Temple Excavations | 108,846 |
Root Cause
The iSamples Central API export used to generate Zenodo files has produced_by populated for OpenContext records (99.5%), but the inner fields are NULL:

    # From iSamples Central API export (OpenContext record)
    produced_by = {
        'label': None,        # NULL
        'description': None,  # NULL
        'identifier': None,   # NULL
        'sampling_site': {
            'label': 'Polis Chrysochous',  # Only this has data
            'place_name': None
        }
    }

Eric's OC PQG files were generated from a more complete OpenContext data source that includes project names.
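To quantify how many exported records have this shape, a small check like the one below could be run over the `produced_by` values. It is only a sketch: the helper name `is_hollow` and the `INNER_FIELDS` tuple are hypothetical, and it assumes each `produced_by` value has already been loaded as a Python dict shaped like the record above.

```python
from typing import Optional

# Inner fields that the export leaves NULL, taken from the record shown above.
INNER_FIELDS = ("label", "description", "identifier")


def is_hollow(produced_by: Optional[dict]) -> bool:
    """Return True when produced_by exists but all of its inner fields are None."""
    if produced_by is None:
        return False  # produced_by missing entirely is a different problem
    return all(produced_by.get(field) is None for field in INNER_FIELDS)


# The OpenContext record from the export, reproduced as a dict.
record = {
    "label": None,
    "description": None,
    "identifier": None,
    "sampling_site": {"label": "Polis Chrysochous", "place_name": None},
}
print(is_hollow(record))
```

Counting `is_hollow` hits per source collection would show whether the gap is limited to OpenContext records or also affects other sources.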
Proposed Solution
Use Eric's oc_isamples_pqg.parquet (narrow) or oc_isamples_pqg_wide.parquet to fill in the missing project values for OpenContext records in the Zenodo files.
Options:
- Replace OpenContext data entirely - Use Eric's OC files as the OpenContext source instead of iSamples Central API export
- Merge project values - Join on pid to add project values to existing Zenodo SamplingEvent records
- Upstream fix - Investigate why iSamples Central API doesn't include OpenContext project names
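The "merge project values" option can be sketched in plain Python as follows. This is an illustration, not a tested pipeline: the function name `fill_projects`, the `otype` column used to pick out SamplingEvent rows, and the sample `pid` values are assumptions, and it presumes that `pid` values are identical in both file sets, which this issue has not verified. Real files would first be read from parquet into rows.

```python
def fill_projects(zenodo_rows, oc_rows):
    """Return copies of zenodo_rows with empty project values filled from oc_rows by pid."""
    # Lookup of pid -> project for OC SamplingEvents that actually carry a project.
    # "otype" is an assumed column name for the entity type.
    project_by_pid = {
        row["pid"]: row["project"]
        for row in oc_rows
        if row.get("otype") == "SamplingEvent" and row.get("project")
    }
    filled = []
    for row in zenodo_rows:
        new_row = dict(row)  # copy so the input rows are not mutated
        if new_row.get("otype") == "SamplingEvent" and not new_row.get("project"):
            # Keep the existing (empty) value when there is no OC match.
            new_row["project"] = project_by_pid.get(new_row["pid"], new_row.get("project"))
        filled.append(new_row)
    return filled


# Hypothetical rows; the pid values are placeholders, not real identifiers.
zenodo = [
    {"pid": "event-1", "otype": "SamplingEvent", "project": None},
    {"pid": "event-2", "otype": "SamplingEvent", "project": None},
]
oc = [
    {"pid": "event-1", "otype": "SamplingEvent", "project": "Çatalhöyük Zooarchaeology"},
]
merged = fill_projects(zenodo, oc)
print(merged[0]["project"])
print(merged[1]["project"])
```

Filling only empty values keeps the merge idempotent and avoids overwriting anything a later upstream fix might populate.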
Files Referenced
- Eric's OC files: ~/Data/iSample/pqg_refining/oc_isamples_pqg*.parquet
- Zenodo files: ~/Data/iSample/pqg_refining/zenodo_*.parquet
- Remote: https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_*.parquet
Why This Matters
Some OpenContext efforts, such as PKAP (Pyla-Koutsopetria Archaeological Project), are published as multiple projects (I and II). Without project data, users can't:
- Filter samples by project
- Understand the organizational structure of OpenContext data
- Distinguish between different phases of the same archaeological site