
feat: add shardManifestRef and manifests to entry lexicon #26

Merged
maxine-at-forecast merged 1 commit into develop from feature/add-manifest-refs on Apr 4, 2026

@maxine-at-forecast

Summary

Adds per-shard manifest references to science.alt.dataset.entry, enabling query-based access to dataset metadata without downloading full shards.

Changes

  • New def entry#shardManifestRef — object with:
    • header (blob, required): JSON manifest header containing shard-level metadata (schema info, sample count, per-field aggregates). Accepts application/json, max 1 MB.
    • samples (blob, optional): Parquet file with per-sample metadata for query-based filtering. Accepts application/octet-stream, max 100 MB.
  • New property entry.manifests — optional array of #shardManifestRef, max 10,000 entries (one per shard).
  • Updated docs/spec.md — documents the new manifest relationship in the record relationships section.
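
A rough sketch of the constraints listed above, written as a Python dict in AT Protocol lexicon style plus a small checker. The actual contents of entry.json are not shown in this PR description, so the dict layout, the exact byte limits (1 MB and 100 MB taken as decimal), and the helper names `check_blob` and `check_manifests` are assumptions for illustration only.

```python
# Assumed byte limits: the PR says "1 MB" and "100 MB" without exact byte counts.
HEADER_MAX_BYTES = 1_000_000
SAMPLES_MAX_BYTES = 100_000_000
MAX_MANIFESTS = 10_000

# Hypothetical rendering of entry#shardManifestRef; not copied from entry.json.
SHARD_MANIFEST_REF_DEF = {
    "type": "object",
    "required": ["header"],
    "properties": {
        "header": {
            "type": "blob",
            "accept": ["application/json"],
            "maxSize": HEADER_MAX_BYTES,
        },
        "samples": {
            "type": "blob",
            "accept": ["application/octet-stream"],
            "maxSize": SAMPLES_MAX_BYTES,
        },
    },
}


def check_blob(blob, spec):
    """Return a list of problems for one blob reference against its spec."""
    errors = []
    if blob.get("mimeType") not in spec["accept"]:
        errors.append("unexpected mimeType")
    if blob.get("size", 0) > spec["maxSize"]:
        errors.append("blob too large")
    return errors


def check_manifests(manifests):
    """Check an entry.manifests array against the constraints described above."""
    errors = []
    if len(manifests) > MAX_MANIFESTS:
        errors.append("too many manifests")
    required = SHARD_MANIFEST_REF_DEF["required"]
    for i, ref in enumerate(manifests):
        for name, spec in SHARD_MANIFEST_REF_DEF["properties"].items():
            if name not in ref:
                if name in required:
                    errors.append(f"manifests[{i}]: missing required '{name}'")
                continue
            for problem in check_blob(ref[name], spec):
                errors.append(f"manifests[{i}].{name}: {problem}")
    return errors


# Example blob reference; the CID string is a placeholder, not a real link.
example = {
    "header": {
        "$type": "blob",
        "ref": {"$link": "placeholder-cid"},
        "mimeType": "application/json",
        "size": 2048,
    }
}
print(check_manifests([example]))
```

Because `samples` is optional, a reference carrying only `header` passes, which matches the shape a consumer would see for shards published without per-sample metadata.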

Versioning

This is an additive-only change per the versioning policy:

  • New optional field on an existing record (no existing consumers break)
  • New def within the existing entry lexicon (no new NSIDs)
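
One way to read "additive-only" for an object definition is sketched below: every existing property survives unchanged and no new field becomes required. The versioning policy text is not reproduced here, so this interpretation, the helper name `is_additive`, and the example definitions are all assumptions, not the project's actual check.

```python
def is_additive(old_def, new_def):
    """Return True if new_def only adds optional properties to old_def."""
    old_props = old_def.get("properties", {})
    new_props = new_def.get("properties", {})
    # Every existing property must still be present with the same spec.
    for name, spec in old_props.items():
        if new_props.get(name) != spec:
            return False
    # No field may become newly required.
    return set(new_def.get("required", [])) <= set(old_def.get("required", []))


# Hypothetical before/after shapes; the real entry record has other fields.
before = {
    "type": "object",
    "required": ["name"],
    "properties": {"name": {"type": "string"}},
}
after = {
    "type": "object",
    "required": ["name"],
    "properties": {
        "name": {"type": "string"},
        "manifests": {
            "type": "array",
            "maxLength": 10000,
            "items": {"type": "ref", "ref": "#shardManifestRef"},
        },
    },
}
print(is_additive(before, after))
```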

Motivation

The reference implementation (atdata) already has Python types for this (ShardManifestRef, LexDatasetEntry.manifests) and uses manifests for efficient query-based dataset access. This PR aligns the lexicon spec with the implementation.

Related: forecast-bio/atdata#89

Test plan

  • entry.json validates as JSON
  • scripts/validate-nsids.sh passes (all lexicon IDs match file paths)
  • Downstream atdata test suite passes with this lexicon (1762 tests, 25 lexicon-specific)
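
The first two checks can be approximated in a few lines of Python. This is only a sketch: the contents of scripts/validate-nsids.sh are not shown here, and the `lexicons/` root directory, the path-to-NSID mapping, and the helper names are assumptions.

```python
import json
from pathlib import PurePosixPath


def nsid_from_path(path, root="lexicons"):
    """Map lexicons/science/alt/dataset/entry.json to science.alt.dataset.entry."""
    rel = PurePosixPath(path).relative_to(root)
    return ".".join(rel.with_suffix("").parts)


def check_lexicon(text, path, root="lexicons"):
    """Parse the lexicon text as JSON and compare its id with the file path."""
    doc = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    return doc.get("id") == nsid_from_path(path, root)


# Minimal stand-in document; the real entry.json carries full defs.
sample_text = '{"lexicon": 1, "id": "science.alt.dataset.entry", "defs": {}}'
sample_path = "lexicons/science/alt/dataset/entry.json"
print(check_lexicon(sample_text, sample_path))
```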

Add optional manifests array to science.alt.dataset.entry for
query-based access to per-shard metadata. Each shardManifestRef
contains a required header blob (JSON metadata with schema info,
sample count, field aggregates) and an optional samples blob
(Parquet table for per-sample filtering).

This aligns the lexicon spec with the existing Python implementation
in atdata (ShardManifestRef, LexDatasetEntry.manifests).

Follows versioning policy: additive-only change (new optional field,
new def). No breaking changes to existing consumers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
maxine-at-forecast merged commit e958be3 into develop on Apr 4, 2026
1 check passed