Context
Stephen Richard created a new SESAR parquet file using the wide schema for his web app:
- Parquet file: `s3://sesar-parquet/sesar_samples.parquet` (public)
- Web app: https://isamples-sesar.onrender.com/
- Code: https://github.com/usgin/isamplessesar
This issue documents serialization differences between Stephen's SESAR parquet and our Zenodo wide format to guide alignment discussions.
Schema Alignment
✅ Both files have 49 columns with matching names - good structural alignment.
Both use:
- INTEGER[]/BIGINT[] arrays for p__* columns
- WKB BLOB for geometry
- latitude/longitude as DOUBLE
- Same 12 p__* relationship columns
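A column-level comparison like the one summarized above can be reproduced with a short script. This is a minimal sketch, not the notebook's actual code: `describe_parquet` and `compare_schemas` are hypothetical helper names, and it assumes the `duckdb` Python package is installed (reading `s3://` paths additionally needs DuckDB's `httpfs` extension). Note that a strict type comparison reports `INTEGER[]` vs `BIGINT[]` as a mismatch even though both are acceptable for `p__*` columns.

```python
from typing import Dict, List, Tuple, Union

SchemaDiff = Dict[str, Union[List[str], List[Tuple[str, str, str]]]]


def describe_parquet(path: str) -> Dict[str, str]:
    """Return {column_name: column_type} for a parquet file, using DuckDB."""
    import duckdb  # imported lazily so compare_schemas() works without it

    escaped = path.replace("'", "''")
    rows = duckdb.sql(
        f"DESCRIBE SELECT * FROM read_parquet('{escaped}')"
    ).fetchall()
    # DESCRIBE returns column_name first and column_type second.
    return {row[0]: row[1] for row in rows}


def compare_schemas(a: Dict[str, str], b: Dict[str, str]) -> SchemaDiff:
    """Report columns missing on either side and columns whose types differ."""
    shared = set(a) & set(b)
    return {
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        "type_mismatches": sorted(
            (name, a[name], b[name]) for name in shared if a[name] != b[name]
        ),
    }
```

With two local copies of the files, `compare_schemas(describe_parquet("sesar_samples.parquet"), describe_parquet("zenodo_wide.parquet"))` would list any drift; the second file name here is a placeholder.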
Critical Differences
1. IdentifiedConcept Duplication (Potential Schema Violation)
| Metric | Stephen's SESAR | Zenodo Wide |
|---|---|---|
| IdentifiedConcept rows | ~77,000 | 55,893 |
| Unique PIDs | ~51,000 | 55,893 |
| Rows : unique PIDs | 1.5:1 (duplicates) | 1:1 (correct) |
Issue: The same geologic age PID appears multiple times with different scheme_name values:
- "Geologic Age Older" vs "Geologic Age Younger"
Question: Should IdentifiedConcept support multiple scheme contexts per PID, or should this be modeled differently (e.g., separate concepts)?
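To make the duplication concrete for discussion, the affected PIDs can be listed together with the scheme names they appear under. This is a rough sketch under stated assumptions: the column names `pid` and `otype` are assumed (only `scheme_name` is confirmed above), `duplicate_pids` is a hypothetical helper, and the sample PIDs in the usage note are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Assumed column names: pid, otype, scheme_name.
# Fill in {path} with str.format() before running the query in DuckDB.
IDENTIFIED_CONCEPT_SQL = """
SELECT pid, scheme_name
FROM read_parquet('{path}')
WHERE otype = 'IdentifiedConcept'
"""


def duplicate_pids(
    rows: Iterable[Tuple[str, Optional[str]]]
) -> Dict[str, List[str]]:
    """Map each PID that occurs on more than one row to its distinct scheme names."""
    seen: Dict[str, List[Optional[str]]] = defaultdict(list)
    for pid, scheme_name in rows:
        seen[pid].append(scheme_name)
    return {
        pid: sorted({name for name in names if name is not None})
        for pid, names in seen.items()
        if len(names) > 1
    }
```

For example, `duplicate_pids([("age:1", "Geologic Age Older"), ("age:1", "Geologic Age Younger"), ("age:2", "Geologic Age Older")])` returns only `age:1`, with both scheme names. Rows for the real file could come from `duckdb.sql(IDENTIFIED_CONCEPT_SQL.format(path=...)).fetchall()`.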
2. SampleRelation Count (7.5x Difference)
| File | SampleRelation Rows | % of Total |
|---|---|---|
| Stephen's SESAR | 3,780,000 | 19% |
| Zenodo (all sources) | 501,579 | 2% |
Question: Is SESAR capturing more parent/child sample relationships? What relationship types are being stored?
3. Agent/MaterialSampleCuration Counts
| Entity Type | Stephen's SESAR | Zenodo Wide |
|---|---|---|
| Agent | 832 | 50,087 |
| MaterialSampleCuration | 765 | 720,254 |
Question: Are these entities not being extracted from SESAR source data, or is there a different modeling approach?
4. p__site_location Empty
- Stephen's SESAR: `p__site_location` is 0% populated
- Zenodo: `p__site_location` is 1.7% populated
Question: Is the SamplingSite → GeospatialCoordLocation relationship not captured in SESAR source?
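When comparing population rates, it helps to agree on what counts as "populated" and on the denominator. The sketch below treats a `p__*` value as populated only if it is a non-empty list (so both NULL and `[]` count as empty). That definition, and the helper name `population_rate`, are assumptions. Whether the percentages above are taken over all rows or only over SamplingSite rows is not stated here, so the caller chooses which values to pass in.

```python
from typing import Iterable, Optional, Sequence


def population_rate(values: Iterable[Optional[Sequence[int]]]) -> float:
    """Percentage of values that are non-null, non-empty lists."""
    total = 0
    populated = 0
    for value in values:
        total += 1
        if value is not None and len(value) > 0:
            populated += 1
    # Avoid dividing by zero when no rows were supplied.
    return 100.0 * populated / total if total else 0.0
```

For instance, `population_rate([None, [], [7], None])` gives `25.0`, since only one of the four values holds a row reference.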
Performance Notes from Stephen
"The app times out when I try to run it directly from the parquet file in S3, so what it does is copy the parquet file to the server so the search requests are local, that gets page responses down to about 40 seconds once the app is up and running. On my laptop I loaded the parquet file into a duckdb instance in memory and page requests took less than a second."
Suggestions to discuss:
- Partition parquet by otype - faster filtered queries
- DuckDB-WASM for browser - our tutorials use this for zero-server architecture
- Pre-compute aggregations - avoid runtime grouping
- Cloudflare R2 + HTTP range requests - works well for our files
Action Items
- Clarify IdentifiedConcept duplication intent
- Document SampleRelation types being captured
- Investigate Agent/Curation extraction from SESAR
- Consider adding validation tool for wide format compliance
- Share DuckDB-WASM patterns for performance
Related
- Analysis notebook: `isamples-python/examples/basic/schema_comparison.ipynb` (Section 14)
- Issue #13: Data quality gap: Eric's OpenContext files vs Zenodo files