Summary
A systematic comparison of Eric's OpenContext parquet files with the Zenodo narrow/wide files reveals significant data quality differences. Eric's files contain richer metadata that was lost during the iSamples Central API export process, while the Zenodo files carry coordinates and edges that Eric's files lack.
Comparison Methodology
Compared files:
- Eric's: `oc_isamples_pqg.parquet` (narrow), `oc_isamples_pqg_wide.parquet` (wide)
- Zenodo: `zenodo_narrow_2025-12-12.parquet`, `zenodo_wide_2026-01-09.parquet` (filtered to `n='OPENCONTEXT'`)
Key Findings
Data Eric Has That Zenodo Is Missing
| Field/Entity | Eric | Zenodo |
|---|---|---|
| `project` (SamplingEvent) | 100% populated | 0% (all NULL) |
| IdentifiedConcept entities | 25,929 | 0 (OC-specific) |
| Agent entities | 577 | 0 |
| `last_modified_time` | 100% populated | 0% (all NULL) |
| `alternate_identifiers` | 100% populated | 0% (all NULL) |
| Additional samples | +66,808 PIDs not in Zenodo | - |
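The two kinds of figures in this table (percent populated, and PIDs present on only one side) can be recomputed with a few lines. The following is a minimal sketch, not the script used for this comparison: it assumes rows have already been loaded from the parquet files as plain dicts, that each row has a `pid` key, and the helper names `fill_rate` and `pids_only_in` are hypothetical.

```python
def fill_rate(rows, column):
    """Return the percentage of rows whose `column` is not None."""
    rows = list(rows)
    if not rows:
        return 0.0
    populated = sum(1 for row in rows if row.get(column) is not None)
    return 100.0 * populated / len(rows)


def pids_only_in(left_rows, right_rows):
    """Return the set of PIDs present in left_rows but absent from right_rows."""
    # Assumption: both sides expose a comparable `pid` key.
    return {row["pid"] for row in left_rows} - {row["pid"] for row in right_rows}


# Toy rows standing in for parquet rows (illustrative values only).
eric_rows = [
    {"pid": "a", "last_modified_time": "2024-01-01"},
    {"pid": "b", "last_modified_time": "2024-02-01"},
    {"pid": "c", "last_modified_time": "2024-03-01"},
]
zenodo_rows = [
    {"pid": "a", "last_modified_time": None},
    {"pid": "b", "last_modified_time": None},
]

print(fill_rate(eric_rows, "last_modified_time"))
print(fill_rate(zenodo_rows, "last_modified_time"))
print(sorted(pids_only_in(eric_rows, zenodo_rows)))
```

On the real files, the same checks would run per column and per entity type rather than over toy rows.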
Data Zenodo Has That Eric Is Missing
| Field/Entity | Eric | Zenodo |
|---|---|---|
| Coords on MaterialSampleRecord | 0% (all NULL) | 99.5% populated |
| `p__curation` edges | NO | YES |
| `p__related_resource` edges | NO | YES |
| Named graph (`n` column) | NULL | Populated |
Architectural Differences
Location sharing:
- Eric: ~5.8 samples per location on average (199K unique GeospatialCoordLocation rows)
- Zenodo: roughly 1:1 mapping (1,059K GeospatialCoordLocation rows for 1,064K samples)
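The samples-per-location ratio can be measured by counting how many samples point at each distinct location. This is a rough sketch under stated assumptions: the sample-to-location mapping has already been extracted from the files into a dict, and the function name `samples_per_location` is hypothetical.

```python
from collections import Counter


def samples_per_location(sample_to_location):
    """Average number of samples attached to each distinct location.

    `sample_to_location` maps a sample PID to a location PID (or None
    when the sample has no location). Samples without a location are ignored.
    """
    counts = Counter(
        loc for loc in sample_to_location.values() if loc is not None
    )
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)


# Shared-location model: several samples reuse one location row.
shared = {"s1": "loc1", "s2": "loc1", "s3": "loc1", "s4": "loc2"}
# 1:1 model: every sample gets its own location row.
one_to_one = {"s1": "loc1", "s2": "loc2"}

print(samples_per_location(shared))
print(samples_per_location(one_to_one))
```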
Identifier strategy:
- Eric: `sample_identifier` = label (e.g., "Bone 2631")
- Zenodo: `sample_identifier` = ARK (e.g., "ark:/28722/k20k2rv0j")
IdentifiedConcept content:
- Eric: Domain-specific labels ("Element :: Scapula", "Side :: Right")
- Zenodo: URI strings ("https://w3id.org/isample/vocabulary/...")
Entity Count Comparison (Wide Format)
| Entity Type | Eric | Zenodo (OC) | Difference (Eric - Zenodo) |
|---|---|---|---|
| MaterialSampleRecord | 1,110,412 | 1,064,831 | +45,581 |
| SamplingEvent | 1,110,412 | 1,059,103 | +51,309 |
| GeospatialCoordLocation | 199,147 | 1,059,025 | -859,878 |
| IdentifiedConcept | 25,929 | 0 | +25,929 |
| Agent | 577 | 0 | +577 |
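The Difference column is the Eric count minus the Zenodo count. A small sketch that reproduces it from the counts in the table is shown here; it also includes a counter over an entity-type column, where the column name `otype` and both function names are assumptions made for illustration.

```python
from collections import Counter


def count_entities(rows, type_column="otype"):
    """Count rows per entity type. `otype` is an assumed column name."""
    return Counter(row[type_column] for row in rows)


def diff_counts(left, right):
    """Per-type difference left - right, treating missing types as 0."""
    return {k: left.get(k, 0) - right.get(k, 0) for k in left}


# Counts copied from the table above.
eric = {
    "MaterialSampleRecord": 1_110_412,
    "SamplingEvent": 1_110_412,
    "GeospatialCoordLocation": 199_147,
    "IdentifiedConcept": 25_929,
    "Agent": 577,
}
zenodo_oc = {
    "MaterialSampleRecord": 1_064_831,
    "SamplingEvent": 1_059_103,
    "GeospatialCoordLocation": 1_059_025,
}

for entity, delta in diff_counts(eric, zenodo_oc).items():
    print(f"{entity}: {delta:+,}")
```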
Root Cause
The Zenodo files were generated from the iSamples Central API export, which:
- Did not preserve `project` field values
- Did not export IdentifiedConcept/Agent as separate entities
- Lost metadata like `last_modified_time` and `alternate_identifiers`
Eric's files were generated directly from OpenContext data, preserving richer metadata.
Recommendation
Consider merging the best of both sources:
- From Eric's files:
  - Project names
  - IdentifiedConcept entities with domain-specific labels
  - Agent entities
  - Metadata fields (`last_modified_time`, `alternate_identifiers`)
  - Location sharing model (more efficient)
- From Zenodo files:
  - Coordinate propagation to MaterialSampleRecord
  - Curation and related_resource edges
  - Named graph identification
This could be implemented as:
- A merge script that enriches Zenodo data with Eric's metadata
- Or regenerating Zenodo files from source (like Eric did) for all 4 sources
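The first option (a merge script) might look roughly like the sketch below. It is only an illustration under several assumptions: rows are plain dicts, both sources share a comparable `pid` key (which would need checking, given the different `sample_identifier` strategies noted above), and the function name `enrich` is hypothetical. It fills a Zenodo field from Eric's row only when the Zenodo value is NULL, so existing Zenodo data such as coordinates is never overwritten.

```python
def enrich(zenodo_rows, eric_rows, fields):
    """Return copies of zenodo_rows with NULL `fields` filled from eric_rows.

    Rows are matched on `pid` (assumed to be comparable across sources).
    Zenodo rows with no match in Eric's data are returned unchanged.
    """
    eric_by_pid = {row["pid"]: row for row in eric_rows}
    merged = []
    for row in zenodo_rows:
        out = dict(row)  # copy so the input is not mutated
        source = eric_by_pid.get(row["pid"])
        if source is not None:
            for field in fields:
                if out.get(field) is None and source.get(field) is not None:
                    out[field] = source[field]
        merged.append(out)
    return merged


# Toy rows (illustrative values only).
zenodo_rows = [
    {"pid": "a", "project": None, "latitude": 12.5},
    {"pid": "z", "project": None, "latitude": 40.0},
]
eric_rows = [
    {"pid": "a", "project": "Example Project", "latitude": None},
]

for row in enrich(zenodo_rows, eric_rows, fields=("project",)):
    print(row)
```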
Related Issues
- #12: OpenContext project names missing in Zenodo narrow/wide files
- #10: Add human-readable labels for IdentifiedConcept URIs in Wide/Narrow formats