Data quality gap: Eric's OpenContext files vs Zenodo files #13

@rdhyee

Summary

A systematic comparison of Eric's OpenContext parquet files against the Zenodo narrow and wide files shows significant data quality differences in both directions. Eric's files contain richer metadata that was lost during the iSamples Central API export process, while the Zenodo files carry coordinates and edge types that Eric's files lack.

Comparison Methodology

Compared files:

  • Eric's: oc_isamples_pqg.parquet (narrow), oc_isamples_pqg_wide.parquet (wide)
  • Zenodo: zenodo_narrow_2025-12-12.parquet, zenodo_wide_2026-01-09.parquet (filtered to n='OPENCONTEXT')

Key Findings

Data Eric Has That Zenodo Is Missing

| Field/Entity | Eric | Zenodo |
| --- | --- | --- |
| project (SamplingEvent) | 100% populated | 0% (all NULL) |
| IdentifiedConcept entities | 25,929 | 0 (OC-specific) |
| Agent entities | 577 | 0 |
| last_modified_time | 100% populated | 0% (all NULL) |
| alternate_identifiers | 100% populated | 0% (all NULL) |
| Additional samples | +66,808 PIDs not in Zenodo | - |
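
The two kinds of check behind this table (how often a field is populated, and which PIDs appear on only one side) can be sketched roughly as follows. This is a minimal illustration on in-memory rows, not the script used for the comparison; the `pid` key and the toy rows are assumptions, and in practice the rows would be read from the parquet files.

```python
def populated_fraction(rows, field):
    """Return the share of rows whose `field` is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(field) is not None)
    return filled / len(rows)


def pids_only_in(left_rows, right_rows, key="pid"):
    """Return identifiers present in left_rows but absent from right_rows.

    `key="pid"` is an assumed column name for the persistent identifier.
    """
    left = {row[key] for row in left_rows}
    right = {row[key] for row in right_rows}
    return left - right


# Toy rows standing in for the two files (hypothetical values).
eric_rows = [
    {"pid": "a", "project": "P1", "last_modified_time": "2024-01-01"},
    {"pid": "b", "project": "P2", "last_modified_time": "2024-01-02"},
    {"pid": "c", "project": "P3", "last_modified_time": "2024-01-03"},
]
zenodo_rows = [
    {"pid": "a", "project": None, "last_modified_time": None},
    {"pid": "b", "project": None, "last_modified_time": None},
]

eric_project_rate = populated_fraction(eric_rows, "project")
zenodo_project_rate = populated_fraction(zenodo_rows, "project")
extra_pids = pids_only_in(eric_rows, zenodo_rows)
```

Running the same two functions in both directions would also show PIDs that exist only on the Zenodo side, which the table above does not report.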

Data Zenodo Has That Eric Is Missing

| Field/Entity | Eric | Zenodo |
| --- | --- | --- |
| Coords on MaterialSampleRecord | 0% (all NULL) | 99.5% populated |
| p__curation edges | NO | YES |
| p__related_resource edges | NO | YES |
| Named graph (n column) | NULL | Populated |

Architectural Differences

Location sharing:

  • Eric: ~5.6 samples per location on average (1,110,412 samples across 199K unique GeospatialCoordLocation rows)
  • Zenodo: 1:1 mapping (1,059K GeospatialCoordLocation rows for 1,064K samples)
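
One way to quantify location sharing is to count how many samples point at each distinct location and take the mean. The sketch below assumes a simple mapping from sample ID to location ID; that mapping, and all IDs in the toy data, are hypothetical stand-ins for whatever the sample-to-location edges look like in the actual files.

```python
from collections import Counter


def samples_per_location(sample_to_location):
    """Average number of samples attached to each distinct location.

    Samples mapped to None (no location) are ignored.
    """
    counts = Counter(
        loc for loc in sample_to_location.values() if loc is not None
    )
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)


# Shared-location model: several samples reuse one location row.
shared = {"s1": "loc1", "s2": "loc1", "s3": "loc1", "s4": "loc2"}
# One-to-one model: every sample gets its own location row.
one_to_one = {"s1": "locA", "s2": "locB"}

shared_ratio = samples_per_location(shared)          # 4 samples / 2 locations
one_to_one_ratio = samples_per_location(one_to_one)  # 2 samples / 2 locations
```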

Identifier strategy:

  • Eric: sample_identifier = label (e.g., "Bone 2631")
  • Zenodo: sample_identifier = ARK (e.g., "ark:/28722/k20k2rv0j")
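
Because the two sources put different kinds of values in sample_identifier, a direct join on that column is unlikely to match. A quick way to confirm which strategy a file uses is to classify the values. The sketch below uses a deliberately simplified ARK pattern (an `ark:` prefix, a numeric NAAN of at least five digits, then a name); it is an assumption for illustration, not a full ARK validator.

```python
import re

# Simplified ARK shape: "ark:/<digits>/<name>". Not a complete ARK grammar.
ARK_PATTERN = re.compile(r"^ark:/?\d{5,}/\S+$")


def identifier_kind(value):
    """Classify an identifier string as 'ark' or 'label'."""
    return "ark" if ARK_PATTERN.match(value) else "label"


zenodo_kind = identifier_kind("ark:/28722/k20k2rv0j")
eric_kind = identifier_kind("Bone 2631")
```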

IdentifiedConcept content:

Entity Count Comparison (Wide Format)

| Entity Type | Eric | Zenodo (OC) | Difference |
| --- | --- | --- | --- |
| MaterialSampleRecord | 1,110,412 | 1,064,831 | +45,581 |
| SamplingEvent | 1,110,412 | 1,059,103 | +51,309 |
| GeospatialCoordLocation | 199,147 | 1,059,025 | -859,878 |
| IdentifiedConcept | 25,929 | 0 | +25,929 |
| Agent | 577 | 0 | +577 |
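
The Difference column is simply Eric's count minus the Zenodo count for each entity type. As a small sketch, the counts from the table can be held in two dictionaries and subtracted; the function name is hypothetical, and in practice the dictionaries would come from a group-by on the entity-type column of each wide file.

```python
# Counts copied from the table above.
eric_counts = {
    "MaterialSampleRecord": 1_110_412,
    "SamplingEvent": 1_110_412,
    "GeospatialCoordLocation": 199_147,
    "IdentifiedConcept": 25_929,
    "Agent": 577,
}
zenodo_counts = {
    "MaterialSampleRecord": 1_064_831,
    "SamplingEvent": 1_059_103,
    "GeospatialCoordLocation": 1_059_025,
}


def count_differences(left, right):
    """Return left minus right per entity type, treating missing types as 0."""
    types = sorted(set(left) | set(right))
    return {t: left.get(t, 0) - right.get(t, 0) for t in types}


differences = count_differences(eric_counts, zenodo_counts)
```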

Root Cause

The Zenodo files were generated from the iSamples Central API export, which:

  1. Did not preserve project field values
  2. Did not export IdentifiedConcept/Agent as separate entities
  3. Lost metadata like last_modified_time and alternate_identifiers

Eric's files were generated directly from OpenContext data, preserving richer metadata.

Recommendation

Consider merging the best of both sources:

  1. From Eric's files:

    • Project names
    • IdentifiedConcept entities with domain-specific labels
    • Agent entities
    • Metadata fields (last_modified_time, alternate_identifiers)
    • Location sharing model (more efficient)
  2. From Zenodo files:

    • Coordinate propagation to MaterialSampleRecord
    • Curation and related_resource edges
    • Named graph identification

This could be implemented as:

  • A merge script that enriches Zenodo data with Eric's metadata
  • Or regenerating Zenodo files from source (like Eric did) for all 4 sources
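
The first option, a merge script, might look roughly like the sketch below. It fills fields that are NULL on the Zenodo side from the matching row in Eric's data and leaves everything else untouched. The `pid` join key, the flat row layout, and the toy values are all assumptions; in the real files, project lives on SamplingEvent, so the actual join would need to follow the relevant edges.

```python
# Fields that the comparison found populated only in Eric's files.
ERIC_FIELDS = ("project", "last_modified_time", "alternate_identifiers")


def enrich(zenodo_rows, eric_rows, key="pid", fields=ERIC_FIELDS):
    """Return copies of zenodo_rows with empty fields filled from eric_rows.

    Rows are matched on `key` (an assumed shared identifier column).
    Existing non-null Zenodo values are never overwritten.
    """
    eric_by_key = {row[key]: row for row in eric_rows}
    merged = []
    for row in zenodo_rows:
        out = dict(row)  # copy so the input rows are not mutated
        source = eric_by_key.get(row[key])
        if source is not None:
            for field in fields:
                if out.get(field) is None and source.get(field) is not None:
                    out[field] = source[field]
        merged.append(out)
    return merged


# Toy rows (hypothetical values).
zenodo_rows = [
    {"pid": "a", "project": None, "latitude": 1.0},
    {"pid": "z", "project": None, "latitude": 2.0},
]
eric_rows = [
    {
        "pid": "a",
        "project": "P1",
        "last_modified_time": "2024-01-01",
        "alternate_identifiers": ["x"],
    },
]

merged = enrich(zenodo_rows, eric_rows)
```

Filling only NULL values keeps the Zenodo-side strengths (coordinates, curation and related_resource edges, named graph) intact while adding the metadata from Eric's files.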

Related Issues
