Stephen's SESAR parquet: serialization differences and alignment questions

## Context

Stephen Richard created a new SESAR parquet file using the wide schema for his web app:
- **Parquet file**: `s3://sesar-parquet/sesar_samples.parquet` (public)
- **Web app**: https://isamples-sesar.onrender.com/
- **Code**: https://github.com/usgin/isamplessesar

This issue documents serialization differences between Stephen's SESAR parquet and our Zenodo wide format to guide alignment discussions.

## Schema Alignment

✅ **Both files have 49 columns with matching names** - good structural alignment.

Both use:
- INTEGER[]/BIGINT[] arrays for p__* columns
- WKB BLOB for geometry
- latitude/longitude as DOUBLE
- Same 12 p__* relationship columns
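
The column-level comparison above could be reproduced with a small helper like the one below. This is a minimal sketch, not the code used in the analysis notebook: `diff_schemas` and `read_schema` are hypothetical names, and `read_schema` assumes the `duckdb` Python package is installed and can reach the file (for `s3://` paths this also assumes the httpfs extension is available).

```python
def diff_schemas(left, right):
    """Compare two {column_name: column_type} mappings.

    Returns (columns only in left, columns only in right,
    {shared column: (left type, right type)} where the types differ).
    """
    left_only = sorted(set(left) - set(right))
    right_only = sorted(set(right) - set(left))
    type_mismatches = {
        name: (left[name], right[name])
        for name in sorted(set(left) & set(right))
        if left[name] != right[name]
    }
    return left_only, right_only, type_mismatches


def read_schema(path):
    """Return {column_name: column_type} for a parquet file via DuckDB."""
    import duckdb  # third-party; imported lazily so diff_schemas works without it

    escaped = path.replace("'", "''")  # escape single quotes for the SQL literal
    con = duckdb.connect()  # in-memory connection
    rows = con.execute(
        f"DESCRIBE SELECT * FROM read_parquet('{escaped}')"
    ).fetchall()
    # DESCRIBE returns column_name first and column_type second
    return {row[0]: row[1] for row in rows}
```

Note that a strict type comparison like this would report `INTEGER[]` vs `BIGINT[]` on the `p__*` columns as a mismatch, so those may need to be treated as equivalent when deciding what counts as aligned.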

## Critical Differences

### 1. IdentifiedConcept Duplication (Potential Schema Violation)

| Metric | Stephen's SESAR | Zenodo Wide |
|--------|-----------------|-------------|
| Total rows | ~77,000 | 55,893 |
| Unique PIDs | ~51,000 | 55,893 |
| Ratio | 1.5:1 (duplicates) | 1:1 (correct) |

**Issue**: The same geologic age PID appears multiple times, once for each `scheme_name` value it is used with (for example, "Geologic Age Older" and "Geologic Age Younger").

**Question**: Should IdentifiedConcept support multiple scheme contexts per PID, or should this be modeled differently (e.g., separate concepts)?
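
One way to list the affected PIDs is sketched below. It assumes the identifier column is named `pid` and that rows are typed by an `otype` column; only `scheme_name` is confirmed above, and the local file name is a placeholder. The SQL string is meant for DuckDB, and `pids_with_multiple_schemes` is a hypothetical pure-Python equivalent for rows that are already loaded.

```python
from collections import defaultdict

# DuckDB query (assumed column names: pid, otype, scheme_name).
DUPLICATE_PID_SQL = """
SELECT pid,
       list(DISTINCT scheme_name) AS scheme_names,
       count(*) AS n_rows
FROM read_parquet('sesar_samples.parquet')
WHERE otype = 'IdentifiedConcept'
GROUP BY pid
HAVING count(*) > 1
ORDER BY n_rows DESC
"""


def pids_with_multiple_schemes(rows):
    """Given (pid, scheme_name) pairs, return {pid: sorted scheme names}
    for every pid that occurs with more than one distinct scheme_name."""
    schemes = defaultdict(set)
    for pid, scheme_name in rows:
        schemes[pid].add(scheme_name)
    return {
        pid: sorted(names)
        for pid, names in schemes.items()
        if len(names) > 1
    }
```

If every duplicated PID turns out to differ only by an "Older"/"Younger" scheme, that would suggest the duplication is a deliberate modeling choice rather than an extraction bug.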

### 2. SampleRelation Count (7.5x Difference)

| File | SampleRelation rows | % of all rows in file |
|------|---------------------|------------|
| Stephen's SESAR | 3,780,000 | 19% |
| Zenodo (all sources) | 501,579 | 2% |

**Question**: Is SESAR capturing more parent/child sample relationships? What relationship types are being stored?
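
The per-entity counts and percentages in this section and the next can be recomputed with a grouping query plus a small helper. This is a sketch under the assumption that the entity type lives in an `otype` column; `otype_shares` is a hypothetical helper and the file name is a placeholder.

```python
# DuckDB query returning one row per entity type (assumed column: otype).
OTYPE_COUNT_SQL = """
SELECT otype, count(*) AS n
FROM read_parquet('sesar_samples.parquet')
GROUP BY otype
ORDER BY n DESC
"""


def otype_shares(counts):
    """Convert {otype: row_count} into {otype: percent of all rows},
    rounded to one decimal place."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {otype: round(100.0 * n / total, 1) for otype, n in counts.items()}
```

Feeding the query result into the helper, for example `otype_shares(dict(rows))`, gives figures that can be compared directly between the two files.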

### 3. Agent/MaterialSampleCuration Counts

| Entity Type | Stephen's SESAR | Zenodo Wide |
|-------------|-----------------|-------------|
| Agent | 832 | 50,087 |
| MaterialSampleCuration | 765 | 720,254 |

**Question**: Are these entities not being extracted from SESAR source data, or is there a different modeling approach?

### 4. p__site_location Empty

- Stephen's SESAR: `p__site_location` is **0% populated**
- Zenodo: `p__site_location` is 1.7% populated

**Question**: Is the SamplingSite → GeospatialCoordLocation relationship not captured in the SESAR source data?
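
A population check for any `p__*` array column might look like the following sketch. It treats both NULL and empty arrays as "not populated", which is an assumption about how the percentages above were computed; the denominator (all rows vs. only rows of a given entity type) is also an assumption. `populated_percent` is a hypothetical helper.

```python
# DuckDB query: share of rows with a non-empty p__site_location array.
# len(NULL) is NULL, so NULL rows are excluded by the FILTER clause.
SITE_LOCATION_SQL = """
SELECT 100.0 * count(*) FILTER (WHERE len(p__site_location) > 0) / count(*)
       AS pct_populated
FROM read_parquet('sesar_samples.parquet')
"""


def populated_percent(values):
    """Percent of values that are neither None nor an empty list."""
    values = list(values)
    if not values:
        return 0.0
    populated = sum(1 for v in values if v)  # None and [] are both falsy
    return 100.0 * populated / len(values)
```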

## Performance Notes from Stephen

> "The app times out when I try to run it directly from the parquet file in S3, so what it does is copy the parquet file to the server so the search requests are local, that gets page responses down to about 40 seconds once the app is up and running. On my laptop I loaded the parquet file into a duckdb instance in memory and page requests took less than a second."

**Suggestions to discuss**:
1. **Partition parquet by otype** - faster filtered queries
2. **DuckDB-WASM for browser** - our tutorials use this for zero-server architecture
3. **Pre-compute aggregations** - avoid runtime grouping
4. **Cloudflare R2 + HTTP range requests** - works well for our files

## Action Items

- [ ] Clarify IdentifiedConcept duplication intent
- [ ] Document SampleRelation types being captured
- [ ] Investigate Agent/Curation extraction from SESAR
- [ ] Consider adding validation tool for wide format compliance
- [ ] Share DuckDB-WASM patterns for performance

## Related

- Analysis notebook: `isamples-python/examples/basic/schema_comparison.ipynb` (Section 14)
- Issue #13: Eric's OpenContext vs Zenodo data quality gap
