
Stephen's SESAR parquet: serialization differences and alignment questions #14

@rdhyee


Context

Stephen Richard created a new SESAR parquet file using the wide schema for his web app.

This issue documents serialization differences between Stephen's SESAR parquet and our Zenodo wide format to guide alignment discussions.

Schema Alignment

Both files have 49 columns with matching names, indicating good structural alignment.

Both use:

  • INTEGER[]/BIGINT[] arrays for p__* columns
  • WKB BLOB for geometry
  • latitude/longitude as DOUBLE
  • Same 12 p__* relationship columns
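This kind of alignment check can be automated by diffing the two schema listings; a minimal sketch, where the column names and types shown are illustrative examples rather than the full 49-column schema:

```python
# Minimal sketch: compare two schema listings (column name -> declared type).
# The entries below are illustrative, not the full 49-column wide schema.
sesar_schema = {
    "pid": "VARCHAR",
    "latitude": "DOUBLE",
    "longitude": "DOUBLE",
    "geometry": "BLOB",           # WKB
    "p__site_location": "BIGINT[]",
}
zenodo_schema = {
    "pid": "VARCHAR",
    "latitude": "DOUBLE",
    "longitude": "DOUBLE",
    "geometry": "BLOB",
    "p__site_location": "BIGINT[]",
}

# Columns present in only one file, plus any declared-type mismatches.
only_sesar = sesar_schema.keys() - zenodo_schema.keys()
only_zenodo = zenodo_schema.keys() - sesar_schema.keys()
type_mismatches = {
    col: (sesar_schema[col], zenodo_schema[col])
    for col in sesar_schema.keys() & zenodo_schema.keys()
    if sesar_schema[col] != zenodo_schema[col]
}
print(sorted(only_sesar), sorted(only_zenodo), type_mismatches)
```

Empty results on all three checks correspond to the "good structural alignment" observed above.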

Critical Differences

1. IdentifiedConcept Duplication (Potential Schema Violation)

| Metric | Stephen's SESAR | Zenodo Wide |
| --- | --- | --- |
| Total rows | ~77,000 | 55,893 |
| Unique PIDs | ~51,000 | 55,893 |
| Ratio | 1.5:1 (duplicates) | 1:1 (correct) |

Issue: the same geologic age PID appears multiple times with different scheme_name values:

  • "Geologic Age Older" vs "Geologic Age Younger"

Question: Should IdentifiedConcept support multiple scheme contexts per PID, or should this be modeled differently (e.g., separate concepts)?
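The duplication pattern can be confirmed with a small group-by over (pid, scheme_name); a minimal sketch on toy rows, where the PIDs and scheme strings are made up for illustration:

```python
from collections import defaultdict

# Toy IdentifiedConcept rows: (pid, scheme_name). Values are made up.
rows = [
    ("pid:age/ordovician", "Geologic Age Older"),
    ("pid:age/ordovician", "Geologic Age Younger"),
    ("pid:age/silurian", "Geologic Age Older"),
]

schemes_by_pid = defaultdict(set)
for pid, scheme in rows:
    schemes_by_pid[pid].add(scheme)

# PIDs that appear under more than one scheme context.
duplicated = {pid: s for pid, s in schemes_by_pid.items() if len(s) > 1}
ratio = len(rows) / len(schemes_by_pid)  # total rows vs. unique PIDs
print(duplicated, round(ratio, 2))
```

A ratio above 1:1, as in the table above, means at least some PIDs carry multiple scheme contexts.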

2. SampleRelation Count (7.5x Difference)

| File | SampleRelation Rows | % of Total |
| --- | --- | --- |
| Stephen's SESAR | 3,780,000 | 19% |
| Zenodo (all sources) | 501,579 | 2% |

Question: Is SESAR capturing more parent/child sample relationships? What relationship types are being stored?
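Answering the second question amounts to tallying rows per relationship label; a minimal sketch on toy SampleRelation rows, where the label strings are assumptions rather than confirmed SESAR vocabulary:

```python
from collections import Counter

# Toy SampleRelation rows: (source_pid, target_pid, label).
# The label strings are illustrative, not confirmed SESAR values.
relations = [
    ("pid:s1", "pid:s2", "parent"),
    ("pid:s1", "pid:s3", "parent"),
    ("pid:s3", "pid:s1", "child"),
]

by_label = Counter(label for _, _, label in relations)
print(by_label.most_common())
```

Run against both files, this tally would show whether the 7.5x gap is concentrated in one relationship type (e.g. parent/child links) or spread evenly.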

3. Agent/MaterialSampleCuration Counts

| Entity Type | Stephen's SESAR | Zenodo Wide |
| --- | --- | --- |
| Agent | 832 | 50,087 |
| MaterialSampleCuration | 765 | 720,254 |

Question: Are these entities not being extracted from SESAR source data, or is there a different modeling approach?

4. p__site_location Empty

  • Stephen's SESAR: p__site_location is 0% populated
  • Zenodo: p__site_location is 1.7% populated

Question: Is the SamplingSite → GeospatialCoordLocation relationship not captured in SESAR source?
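The population percentages quoted here are rows with a non-empty relationship array over total rows; a minimal sketch with toy values:

```python
# Toy p__site_location values: each entry is a list of related-row indices,
# or an empty list / None when the relationship is absent.
p__site_location = [[12, 40], [], None, [7], []]

populated = sum(1 for v in p__site_location if v)  # non-empty, non-null
pct = 100 * populated / len(p__site_location)
print(f"{pct:.1f}% populated")
```

By this measure, Stephen's SESAR file scores 0% for p__site_location while Zenodo scores 1.7%.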

Performance Notes from Stephen

"The app times out when I try to run it directly from the parquet file in S3, so what it does is copy the parquet file to the server so the search requests are local, that gets page responses down to about 40 seconds once the app is up and running. On my laptop I loaded the parquet file into a duckdb instance in memory and page requests took less than a second."

Suggestions to discuss:

  1. Partition parquet by otype - faster filtered queries
  2. DuckDB-WASM for browser - our tutorials use this for zero-server architecture
  3. Pre-compute aggregations - avoid runtime grouping
  4. Cloudflare R2 + HTTP range requests - works well for our files

Action Items

  • Clarify IdentifiedConcept duplication intent
  • Document SampleRelation types being captured
  • Investigate Agent/Curation extraction from SESAR
  • Consider adding a validation tool for wide-format compliance
  • Share DuckDB-WASM patterns for performance
