
OpenContext project names missing in Zenodo narrow/wide files #12

@rdhyee


Summary

The PQG schema includes a project column on SamplingEvent entities to capture OpenContext project structure, but the Zenodo narrow/wide files have this column empty. Eric's OpenContext-specific PQG files have this data populated.

Current State

File                                   Project Data                                Record Count
oc_isamples_pqg_wide.parquet (Eric's)  ✅ 1.1M SamplingEvents with project names   11.6M total
zenodo_narrow_*.parquet                ❌ Column exists but empty                  106M total
zenodo_wide_*.parquet                  ❌ Column exists but empty                  20M total

Example Project Values (from Eric's OC files)

Pyla-Koutsopetria Archaeological Project I: Pedestrian Survey             8,475 records
Pyla-Koutsopetria Archaeological Project II: Geophysics and Excavation    6,971 records
Çatalhöyük Zooarchaeology                                               126,116 records
Petra Great Temple Excavations                                          108,846 records

Root Cause

The iSamples Central API export used to generate the Zenodo files populates produced_by for 99.5% of OpenContext records, but nearly all of its inner fields are NULL. In the example below, only sampling_site.label has data:

# From iSamples Central API export (OpenContext record)
produced_by = {
    'label': None,           # NULL
    'description': None,     # NULL  
    'identifier': None,      # NULL
    'sampling_site': {
        'label': 'Polis Chrysochous',  # Only this has data
        'place_name': None
    }
}

Eric's OC PQG files were generated from a more complete OpenContext data source that includes project names.
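
To check how widespread the empty inner fields are, a small helper that lists the non-NULL leaves of a produced_by value may be useful. This is only a sketch: the function name populated_fields is hypothetical, and it assumes each record has already been loaded as a nested Python dict shaped like the example above.

```python
def populated_fields(value, prefix=""):
    """Return the dotted paths of every non-None leaf in a nested dict."""
    paths = []
    for key, item in value.items():
        path = f"{prefix}{key}"
        if isinstance(item, dict):
            # Recurse into nested structures such as sampling_site
            paths.extend(populated_fields(item, prefix=f"{path}."))
        elif item is not None:
            paths.append(path)
    return paths


# The OpenContext record from the iSamples Central API export shown above
produced_by = {
    "label": None,
    "description": None,
    "identifier": None,
    "sampling_site": {
        "label": "Polis Chrysochous",
        "place_name": None,
    },
}

print(populated_fields(produced_by))  # ['sampling_site.label']
```

Running this over a sample of OpenContext records from both sources would show which fields the export drops.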

Proposed Solution

Use Eric's oc_isamples_pqg.parquet (narrow) or oc_isamples_pqg_wide.parquet to fill in the missing project values for OpenContext records in the Zenodo files.

Options:

  1. Replace OpenContext data entirely - Use Eric's OC files as the OpenContext source instead of iSamples Central API export

  2. Merge project values - Join on pid to add project values to existing Zenodo SamplingEvent records

  3. Upstream fix - Investigate why iSamples Central API doesn't include OpenContext project names
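
Option 2 could look roughly like the sketch below. It is written against plain lists of dicts so the logic is easy to follow; in practice the same join would run against the parquet files with whatever tool is already in use. The function name fill_missing_projects and the sample pid values are hypothetical, and the sketch assumes that SamplingEvent rows in both file sets share the same pid values, which needs to be verified before relying on it.

```python
def fill_missing_projects(zenodo_rows, oc_rows):
    """Copy project values from OC rows into Zenodo rows with the same pid.

    Returns (merged_rows, number_of_rows_filled). Input rows are not mutated.
    """
    # pid -> project lookup, built only from OC rows that have a project
    lookup = {r["pid"]: r["project"] for r in oc_rows if r.get("project")}

    merged, filled = [], 0
    for row in zenodo_rows:
        new_row = dict(row)
        # Only fill empty values; never overwrite an existing project
        if not new_row.get("project") and new_row["pid"] in lookup:
            new_row["project"] = lookup[new_row["pid"]]
            filled += 1
        merged.append(new_row)
    return merged, filled


# Hypothetical pids; project names taken from the examples above
oc_rows = [
    {"pid": "event-1", "project": "Petra Great Temple Excavations"},
    {"pid": "event-2", "project": "Çatalhöyük Zooarchaeology"},
]
zenodo_rows = [
    {"pid": "event-1", "project": None},
    {"pid": "event-2", "project": None},
    {"pid": "event-3", "project": None},  # no match in the OC file
]

merged, filled = fill_missing_projects(zenodo_rows, oc_rows)
print(filled)  # 2
```

Comparing the filled count with the number of OpenContext SamplingEvents in the Zenodo files would show how many records still lack a project after the merge.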

Files Referenced

  • Eric's OC files: ~/Data/iSample/pqg_refining/oc_isamples_pqg*.parquet
  • Zenodo files: ~/Data/iSample/pqg_refining/zenodo_*.parquet
  • Remote: https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_*.parquet

Why This Matters

Some OpenContext research efforts, such as PKAP (Pyla-Koutsopetria Archaeological Project), are published as multiple projects (I and II). Without project data, users can't:

  • Filter samples by project
  • Understand the organizational structure of OpenContext data
  • Distinguish between different phases of the same archaeological site

Related
