feat: add shardManifestRef and manifests to entry lexicon #26
Merged
maxine-at-forecast merged 1 commit into develop on Apr 4, 2026
Add optional manifests array to science.alt.dataset.entry for query-based access to per-shard metadata. Each shardManifestRef contains a required header blob (JSON metadata with schema info, sample count, field aggregates) and an optional samples blob (Parquet table for per-sample filtering).

This aligns the lexicon spec with the existing Python implementation in atdata (ShardManifestRef, LexDatasetEntry.manifests).

Follows versioning policy: additive-only change (new optional field, new def). No breaking changes to existing consumers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary

Adds per-shard manifest references to science.alt.dataset.entry, enabling query-based access to dataset metadata without downloading full shards.

Changes
- entry#shardManifestRef, an object with:
  - header (blob, required): JSON manifest header containing shard-level metadata (schema info, sample count, per-field aggregates). Accepts application/json, max 1 MB.
  - samples (blob, optional): Parquet file with per-sample metadata for query-based filtering. Accepts application/octet-stream, max 100 MB.
- entry.manifests: optional array of #shardManifestRef, max 10,000 entries (one per shard).
- docs/spec.md: documents the new manifest relationship in the record relationships section.

Versioning
This is an additive-only change per the versioning policy:

- New optional field (manifests) and new def (shardManifestRef) in the existing entry lexicon (no new NSIDs)

Motivation
The reference implementation (atdata) already has Python types for this (ShardManifestRef, LexDatasetEntry.manifests) and uses manifests for efficient query-based dataset access. This PR aligns the lexicon spec with the implementation.

Related: forecast-bio/atdata#89
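
For reviewers who want to see the constraints listed under Changes in one place, here is a minimal sketch in Python. It is not the atdata implementation: `BlobSketch`, `ShardManifestRefSketch`, and `check_manifests` are hypothetical names invented for illustration, and the byte limits assume decimal megabytes, which the PR text does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional

# Limits taken from the PR description; decimal megabytes are an assumption.
MAX_HEADER_BYTES = 1_000_000       # header: "max 1 MB"
MAX_SAMPLES_BYTES = 100_000_000    # samples: "max 100 MB"
MAX_MANIFESTS = 10_000             # manifests: "max 10,000 entries"


@dataclass(frozen=True)
class BlobSketch:
    """Hypothetical stand-in for a blob reference (MIME type and size only)."""
    mime_type: str
    size: int  # bytes


@dataclass(frozen=True)
class ShardManifestRefSketch:
    """Hypothetical mirror of entry#shardManifestRef as described in this PR."""
    header: BlobSketch                    # required JSON manifest header
    samples: Optional[BlobSketch] = None  # optional Parquet per-sample table


def check_manifests(manifests: List[ShardManifestRefSketch]) -> List[str]:
    """Return human-readable constraint violations (empty list if none)."""
    problems = []
    if len(manifests) > MAX_MANIFESTS:
        problems.append(f"too many manifests: {len(manifests)} > {MAX_MANIFESTS}")
    for i, ref in enumerate(manifests):
        if ref.header.mime_type != "application/json":
            problems.append(f"manifests[{i}].header: unexpected type {ref.header.mime_type}")
        if ref.header.size > MAX_HEADER_BYTES:
            problems.append(f"manifests[{i}].header: {ref.header.size} bytes exceeds limit")
        if ref.samples is not None:
            if ref.samples.mime_type != "application/octet-stream":
                problems.append(f"manifests[{i}].samples: unexpected type {ref.samples.mime_type}")
            if ref.samples.size > MAX_SAMPLES_BYTES:
                problems.append(f"manifests[{i}].samples: {ref.samples.size} bytes exceeds limit")
    return problems
```

Because `samples` defaults to `None`, a record carrying only headers passes the check, which matches the optional status of that blob in the lexicon.
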
Test plan

- entry.json validates as JSON
- scripts/validate-nsids.sh passes (all lexicon IDs match file paths)
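
The two checks in the test plan can be approximated in Python as follows. This is only a sketch and not the contents of scripts/validate-nsids.sh: `nsid_to_relpath` and `check_lexicon` are hypothetical helpers, and the dotted-ID-to-directory layout is an assumption about how the repository organises its lexicon files.

```python
import json
from pathlib import PurePosixPath


def nsid_to_relpath(nsid: str) -> str:
    """Map a dotted lexicon ID to a relative file path (assumed layout)."""
    return str(PurePosixPath(*nsid.split("."))) + ".json"


def check_lexicon(text: str, relpath: str) -> bool:
    """Parse the lexicon text as JSON and compare its "id" with the file path.

    json.loads raises ValueError (json.JSONDecodeError) if the text is not
    valid JSON, which covers the first test-plan item.
    """
    doc = json.loads(text)
    return nsid_to_relpath(doc["id"]) == relpath
```

For example, `nsid_to_relpath("science.alt.dataset.entry")` yields `science/alt/dataset/entry.json` under the assumed layout.
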