diff --git a/.chainlink/issues.db b/.chainlink/issues.db
index edbde07..03f1fe0 100644
Binary files a/.chainlink/issues.db and b/.chainlink/issues.db differ
diff --git a/.claude/commands/adr.md b/.claude/commands/adr.md
new file mode 100644
index 0000000..8e50c2c
--- /dev/null
+++ b/.claude/commands/adr.md
@@ -0,0 +1,42 @@
+---
+allowed-tools: Bash(git status:*), Bash(git log:*), Bash(chainlink tree:*), Bash(chainlink comment:*), Bash(chainlink subissue:*), Bash(chainlink create:*), Bash(chainlink session:*), Bash(chainlink --help), Bash(chainlink close:*), Bash(uv run pytest:*), Bash(uv run ruff:*)
+description: Perform an adversarial review
+---
+
+## Context
+
+- Current issue tree: !`chainlink tree`
+- Current test outputs: !`uv run pytest -v`
+- Recent commits: !`git log --oneline -10`
+- Chainlink help: !`chainlink --help`
+
+## Your task
+
+1. Develop a summary assessment of the test suite
+   - Review all of the unit tests currently in the project and assess how well they exercise the project's core functionality, how fully they cover desired behavior and edge cases, whether they are formally correct, and whether there is redundancy in the tests or their documentation
+   - Develop a plan for addressing these concerns point by point
+2. Develop a summary assessment of the codebase
+   - Review all of the source files in the project's main modules and assess how well-implemented, efficient, and generalizable the current implementation is, and whether its documentation is adequate, too sparse, or too verbose
+   - Develop a plan for improvements, tweaks, or refactors to the current codebase and its documentation
+3. Create an issue and subissues
+   - Create a base issue in chainlink for this adversarial review
+   - Create subissues for each of the plan items developed in steps 1 and 2
+4. Address all subissues for this adversarial review
+   - In priority order, address and close each of the subissues identified
+   - Document each step you take thoroughly in the chainlink comments
+
+## Constraints
+
+- **Adversarial**: Approach this task as a hyper-critical reviewer.
+- **Optimize for code contraction**: You operate as one half of a cyclical dyad. The other half is responsible for generating a lot of code, but has a propensity to write too much and to produce implementations that are verbose, inefficient, or at times inaccurate. Your job is to be the critical eye: identify and implement revisions that make the code concise, efficient, and formally correct.
+- **Consider test correctness**: The tests you are presented with do not necessarily cover the desired functionality completely. Think through ways to make the test suite more accurate to the task at hand, and ways to test behavior of the codebase that is not currently addressed. Be creative, and leverage web search to find current best practices that could aid in developing tests.
+- **Preserve documentation for API generation**: This project uses quartodoc to auto-generate API documentation from docstrings. Docstrings are a feature, not bloat. When reviewing documentation verbosity, apply these rules:
+  - **KEEP**: Module-level docstrings, class-level docstrings, `Args:`, `Returns:`, `Raises:`, `Examples:` sections on all public APIs
+  - **KEEP**: Docstrings that explain *why* something works a certain way, non-obvious behavior, or protocol/interface contracts
+  - **KEEP**: `Examples:` sections — these render as live code samples in the docs site
+  - **TRIM**: Docstrings that *only* restate the function signature with no added value (e.g. "`name: The name`" when the type hint already says `name: str`)
+  - **TRIM**: Multi-paragraph explanations on private/internal helpers where a one-liner suffices
+  - **NEVER REMOVE**: Docstrings from public API methods, protocol definitions, or decorated classes
+  - When in doubt, leave the docstring. A slightly verbose docstring that helps a user is better than a missing one that forces them to read source.
+- **Batch mechanical fixes**: Group similar changes (e.g. all weak-assertion fixes) into a single commit rather than one subissue per file. Reserve individual subissues for changes that require design thought.
+- **Close low-value issues**: If a finding would add complexity, risk regressions, or save fewer than 10 lines, close it as "not worth the churn" with a comment explaining why.
diff --git a/.claude/commands/changelog.md b/.claude/commands/changelog.md
new file mode 100644
index 0000000..e521936
--- /dev/null
+++ b/.claude/commands/changelog.md
@@ -0,0 +1,61 @@
+---
+allowed-tools: Bash(git log:*), Bash(git tag:*), Bash(git diff:*), Bash(chainlink *)
+description: Generate a clean CHANGELOG entry from recent work
+---
+
+## Context
+
+- Current version: !`grep '^version' pyproject.toml`
+- Recent tags: !`git tag --sort=-creatordate | head -5`
+- CHANGELOG head: !`head -20 CHANGELOG.md`
+- Recent chainlink issues: !`chainlink list`
+
+## Your task
+
+Generate a properly structured CHANGELOG entry for the current release, following the [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) format.
+
+### 1. Gather changes
+
+Identify all changes since the last release by examining:
+- `git log --oneline <last-tag>..HEAD` for commit messages
+- `chainlink list` for closed issues and their descriptions
+- `git diff --stat <last-tag>..HEAD` for files changed
+
+### 2. Categorize changes
+
+Sort changes into Keep a Changelog sections:
+
+- **Added**: New features, new files, new public APIs, new test suites
+- **Changed**: Modifications to existing behavior, refactors, dependency updates, CI changes
+- **Fixed**: Bug fixes, lint fixes, CI fixes
+- **Deprecated**: Newly deprecated APIs (with a migration path)
+- **Removed**: Removed features, deleted files, removed APIs
+
+### 3. Write the entry
+
+Follow these formatting rules:
+- Each item should be a concise, user-facing description — not a chainlink issue title
+- Group related changes under bold sub-headers (e.g. **`LocalDiskStore`**: description)
+- Use nested bullets for sub-items that belong to a feature group
+- Omit internal-only changes (individual subissue closes, review assessments, investigation tickets)
+- Include GitHub issue references where relevant (e.g. `(GH#42)`)
+- Do NOT include chainlink issue numbers — these are internal tracking
+
+### 4. Update CHANGELOG.md
+
+- Insert the new version section between `## [Unreleased]` and the previous release
+- Leave `## [Unreleased]` empty at the top
+- Do not modify any existing release sections below it
+
+### 5. Verify
+
+- Confirm the CHANGELOG renders as valid markdown
+- Confirm no chainlink auto-appended entries have leaked into existing release sections
+
+## Constraints
+
+- Follow the Keep a Changelog format strictly
+- Write for the library's users, not for internal tracking
+- Consolidate — 5 well-written bullets are better than 30 issue titles
+- Preserve existing release sections exactly as they are
+- If chainlink has appended noise to existing sections, clean it up
diff --git a/.claude/commands/release.md b/.claude/commands/release.md
new file mode 100644
index 0000000..f68d383
--- /dev/null
+++ b/.claude/commands/release.md
@@ -0,0 +1,61 @@
+---
+allowed-tools: Bash(git *), Bash(gh *), Bash(uv lock*), Bash(uv run ruff*), Bash(uv run pytest*), Bash(chainlink *)
+description: Prepare and submit a beta release
+---
+
+## Context
+
+- Current branch: !`git branch --show-current`
+- Recent commits: !`git log --oneline -15`
+- Recent release branches: !`git branch --list 'release/*' | tail -5`
+- Current version: !`grep '^version' pyproject.toml`
+- Remotes: !`git remote -v`
+
+## Your task
+
+The user will provide a version string (e.g. `v0.3.0b2`). Perform the full release flow:
+
+### 1. Validate preconditions
+- Confirm all tests pass: `uv run pytest tests/ -x -q`
+- Confirm lint is clean: `uv run ruff check src/ tests/`
+- Confirm there are no uncommitted changes (other than `.chainlink/issues.db`)
+- Identify the previous release branch to branch from (e.g. `release/v0.3.0b1`)
+- Identify the feature branch to merge (current branch, or ask the user)
+
+### 2. Create release branch
+- Stash any uncommitted changes
+- `git checkout <previous-release-branch>`
+- `git checkout -b release/<version>`
+- `git merge --no-ff --no-edit <feature-branch>`
+- `git stash pop` (if anything was stashed)
+
+### 3. Prepare release
+- Bump the version in `pyproject.toml`
+- Run `uv lock` to update the lockfile
+- Run the `/changelog` skill to generate a clean CHANGELOG entry (or write one manually following Keep a Changelog format with Added/Changed/Fixed sections)
+- Run `uv run ruff check src/ tests/` and fix any lint errors
+- Run `uv run pytest tests/ -x -q` to confirm tests pass
+
+### 4. Commit and push
+- `git add pyproject.toml uv.lock CHANGELOG.md .chainlink/issues.db`
+- `git commit -m "release: prepare <version>"`
+- `git push -u origin release/<version>`
+
+### 5. Create PR
+- Create a PR to `upstream/main` using `gh pr create`:
+  - `--repo forecast-bio/atdata`
+  - `--base main`
+  - `--head release/<version>`
+  - Title: `release: <version>`
+  - Body: summary of changes from the CHANGELOG, plus a test plan with pass counts
+
+### 6. Track in chainlink
+- Create a chainlink issue for the release; close it when the PR is submitted
+
+## Constraints
+
+- Always use `--no-ff` for merges to preserve branch topology
+- Always run `uv lock` after version bumps — stale lockfiles break CI
+- Always run the lint check before committing — ruff errors break CI
+- Never force-push to release branches
+- The CHANGELOG should follow Keep a Changelog format with proper Added/Changed/Fixed sections, not a flat list of chainlink issues
diff --git a/.github/workflows/uv-test.yml b/.github/workflows/uv-test.yml
index 2e18832..8d709dd 100644
--- a/.github/workflows/uv-test.yml
+++ b/.github/workflows/uv-test.yml
@@ -17,7 +17,7 @@ concurrency:
 
 jobs:
   lint:
     name: Lint
-    runs-on: ubuntu-latest
+    runs-on: forecast-ci-linux-x64
     steps:
       - uses: actions/checkout@v5
@@ -37,7 +37,7 @@ jobs:
 
   test:
     name: Test (py${{ matrix.python-version }}, redis${{ matrix.redis-version }})
-    runs-on: ubuntu-latest
+    runs-on: forecast-ci-linux-x64
    environment:
      name: test
    strategy:
@@ -78,7 +78,7 @@ jobs:
 
   benchmark:
     name: Benchmarks
-    runs-on: ubuntu-latest
+    runs-on: forecast-ci-linux-x64
     needs: [lint]
     permissions:
       contents: write
diff --git a/.planning/setup/01_overview.md
b/.planning/phases/01-atproto-foundation/01_overview.md similarity index 100% rename from .planning/setup/01_overview.md rename to .planning/phases/01-atproto-foundation/01_overview.md diff --git a/.planning/setup/02_lexicon_design.md b/.planning/phases/01-atproto-foundation/02_lexicon_design.md similarity index 100% rename from .planning/setup/02_lexicon_design.md rename to .planning/phases/01-atproto-foundation/02_lexicon_design.md diff --git a/.planning/setup/03_python_client.md b/.planning/phases/01-atproto-foundation/03_python_client.md similarity index 100% rename from .planning/setup/03_python_client.md rename to .planning/phases/01-atproto-foundation/03_python_client.md diff --git a/.planning/setup/04_appview.md b/.planning/phases/01-atproto-foundation/04_appview.md similarity index 100% rename from .planning/setup/04_appview.md rename to .planning/phases/01-atproto-foundation/04_appview.md diff --git a/.planning/setup/05_codegen.md b/.planning/phases/01-atproto-foundation/05_codegen.md similarity index 100% rename from .planning/setup/05_codegen.md rename to .planning/phases/01-atproto-foundation/05_codegen.md diff --git a/.planning/setup/README.md b/.planning/phases/01-atproto-foundation/README.md similarity index 100% rename from .planning/setup/README.md rename to .planning/phases/01-atproto-foundation/README.md diff --git a/.planning/setup/atproto_integration.md b/.planning/phases/01-atproto-foundation/atproto_integration.md similarity index 100% rename from .planning/setup/atproto_integration.md rename to .planning/phases/01-atproto-foundation/atproto_integration.md diff --git a/.planning/setup/decisions/01_schema_representation_format.md b/.planning/phases/01-atproto-foundation/decisions/01_schema_representation_format.md similarity index 100% rename from .planning/setup/decisions/01_schema_representation_format.md rename to .planning/phases/01-atproto-foundation/decisions/01_schema_representation_format.md diff --git 
a/.planning/setup/decisions/02_lens_code_storage.md b/.planning/phases/01-atproto-foundation/decisions/02_lens_code_storage.md similarity index 100% rename from .planning/setup/decisions/02_lens_code_storage.md rename to .planning/phases/01-atproto-foundation/decisions/02_lens_code_storage.md diff --git a/.planning/setup/decisions/03_webdataset_storage.md b/.planning/phases/01-atproto-foundation/decisions/03_webdataset_storage.md similarity index 100% rename from .planning/setup/decisions/03_webdataset_storage.md rename to .planning/phases/01-atproto-foundation/decisions/03_webdataset_storage.md diff --git a/.planning/setup/decisions/04_schema_evolution.md b/.planning/phases/01-atproto-foundation/decisions/04_schema_evolution.md similarity index 100% rename from .planning/setup/decisions/04_schema_evolution.md rename to .planning/phases/01-atproto-foundation/decisions/04_schema_evolution.md diff --git a/.planning/setup/decisions/05_lexicon_namespace.md b/.planning/phases/01-atproto-foundation/decisions/05_lexicon_namespace.md similarity index 100% rename from .planning/setup/decisions/05_lexicon_namespace.md rename to .planning/phases/01-atproto-foundation/decisions/05_lexicon_namespace.md diff --git a/.planning/setup/decisions/06_lexicon_validation.md b/.planning/phases/01-atproto-foundation/decisions/06_lexicon_validation.md similarity index 100% rename from .planning/setup/decisions/06_lexicon_validation.md rename to .planning/phases/01-atproto-foundation/decisions/06_lexicon_validation.md diff --git a/.planning/setup/decisions/README.md b/.planning/phases/01-atproto-foundation/decisions/README.md similarity index 100% rename from .planning/setup/decisions/README.md rename to .planning/phases/01-atproto-foundation/decisions/README.md diff --git a/.planning/setup/decisions/assessment.md b/.planning/phases/01-atproto-foundation/decisions/assessment.md similarity index 100% rename from .planning/setup/decisions/assessment.md rename to 
.planning/phases/01-atproto-foundation/decisions/assessment.md diff --git a/.planning/setup/decisions/record_lexicon_assessment.md b/.planning/phases/01-atproto-foundation/decisions/record_lexicon_assessment.md similarity index 100% rename from .planning/setup/decisions/record_lexicon_assessment.md rename to .planning/phases/01-atproto-foundation/decisions/record_lexicon_assessment.md diff --git a/.planning/setup/decisions/sampleSchema_design_questions.md b/.planning/phases/01-atproto-foundation/decisions/sampleSchema_design_questions.md similarity index 100% rename from .planning/setup/decisions/sampleSchema_design_questions.md rename to .planning/phases/01-atproto-foundation/decisions/sampleSchema_design_questions.md diff --git a/.planning/setup/examples/code/ndarray_roundtrip.py b/.planning/phases/01-atproto-foundation/examples/code/ndarray_roundtrip.py similarity index 100% rename from .planning/setup/examples/code/ndarray_roundtrip.py rename to .planning/phases/01-atproto-foundation/examples/code/ndarray_roundtrip.py diff --git a/.planning/setup/examples/code/validate_ndarray_shim.py b/.planning/phases/01-atproto-foundation/examples/code/validate_ndarray_shim.py similarity index 100% rename from .planning/setup/examples/code/validate_ndarray_shim.py rename to .planning/phases/01-atproto-foundation/examples/code/validate_ndarray_shim.py diff --git a/.planning/setup/examples/dataset_blob_storage.json b/.planning/phases/01-atproto-foundation/examples/dataset_blob_storage.json similarity index 100% rename from .planning/setup/examples/dataset_blob_storage.json rename to .planning/phases/01-atproto-foundation/examples/dataset_blob_storage.json diff --git a/.planning/setup/examples/dataset_external_storage.json b/.planning/phases/01-atproto-foundation/examples/dataset_external_storage.json similarity index 100% rename from .planning/setup/examples/dataset_external_storage.json rename to .planning/phases/01-atproto-foundation/examples/dataset_external_storage.json 
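The release-branch steps in `.claude/commands/release.md` above (branch from the previous release, merge the feature branch with `--no-ff`) can be sketched as a shell walkthrough. This is an illustrative sketch against a throwaway repository: `VERSION`, `release/v0.3.0b1`, and `feature/demo` are placeholders, not values taken from this project.

```shell
#!/usr/bin/env sh
# Sketch of the release-branch flow, run in a temp directory so it is
# safe to execute anywhere. All names below are illustrative placeholders.
set -eu

VERSION=v0.3.0b2
repo=$(mktemp -d)
cd "$repo"
git init -q -b main
GIT="git -c user.email=ci@example.com -c user.name=ci"
$GIT commit -q --allow-empty -m "init"

# Stand-ins for the previous release branch and the feature branch.
git branch release/v0.3.0b1
git checkout -q -b feature/demo
$GIT commit -q --allow-empty -m "feat: demo change"

# The flow: branch from the previous release, then merge the feature
# with --no-ff so a merge commit preserves the branch topology.
git checkout -q release/v0.3.0b1
git checkout -q -b "release/$VERSION"
$GIT merge -q --no-ff --no-edit feature/demo

git log --oneline -3
```

Without `--no-ff` this merge would fast-forward and leave no merge commit, which is exactly what the "preserve branch topology" constraint guards against.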
diff --git a/.planning/setup/examples/lens_example.json b/.planning/phases/01-atproto-foundation/examples/lens_example.json similarity index 100% rename from .planning/setup/examples/lens_example.json rename to .planning/phases/01-atproto-foundation/examples/lens_example.json diff --git a/.planning/setup/examples/sampleSchema_example.json b/.planning/phases/01-atproto-foundation/examples/sampleSchema_example.json similarity index 100% rename from .planning/setup/examples/sampleSchema_example.json rename to .planning/phases/01-atproto-foundation/examples/sampleSchema_example.json diff --git a/.planning/setup/lexicons/README.md b/.planning/phases/01-atproto-foundation/lexicons/README.md similarity index 100% rename from .planning/setup/lexicons/README.md rename to .planning/phases/01-atproto-foundation/lexicons/README.md diff --git a/.planning/setup/lexicons/README_ARRAY_FORMATS.md b/.planning/phases/01-atproto-foundation/lexicons/README_ARRAY_FORMATS.md similarity index 100% rename from .planning/setup/lexicons/README_ARRAY_FORMATS.md rename to .planning/phases/01-atproto-foundation/lexicons/README_ARRAY_FORMATS.md diff --git a/.planning/setup/lexicons/README_SCHEMA_TYPES.md b/.planning/phases/01-atproto-foundation/lexicons/README_SCHEMA_TYPES.md similarity index 100% rename from .planning/setup/lexicons/README_SCHEMA_TYPES.md rename to .planning/phases/01-atproto-foundation/lexicons/README_SCHEMA_TYPES.md diff --git a/.planning/setup/lexicons/ac.foundation.dataset.arrayFormat.json b/.planning/phases/01-atproto-foundation/lexicons/ac.foundation.dataset.arrayFormat.json similarity index 100% rename from .planning/setup/lexicons/ac.foundation.dataset.arrayFormat.json rename to .planning/phases/01-atproto-foundation/lexicons/ac.foundation.dataset.arrayFormat.json diff --git a/.planning/setup/lexicons/ac.foundation.dataset.getLatestSchema.json b/.planning/phases/01-atproto-foundation/lexicons/ac.foundation.dataset.getLatestSchema.json similarity index 100% rename from 
.planning/setup/lexicons/ac.foundation.dataset.getLatestSchema.json rename to .planning/phases/01-atproto-foundation/lexicons/ac.foundation.dataset.getLatestSchema.json diff --git a/.planning/setup/lexicons/ac.foundation.dataset.lens.json b/.planning/phases/01-atproto-foundation/lexicons/ac.foundation.dataset.lens.json similarity index 100% rename from .planning/setup/lexicons/ac.foundation.dataset.lens.json rename to .planning/phases/01-atproto-foundation/lexicons/ac.foundation.dataset.lens.json diff --git a/.planning/setup/lexicons/ac.foundation.dataset.record.json b/.planning/phases/01-atproto-foundation/lexicons/ac.foundation.dataset.record.json similarity index 100% rename from .planning/setup/lexicons/ac.foundation.dataset.record.json rename to .planning/phases/01-atproto-foundation/lexicons/ac.foundation.dataset.record.json diff --git a/.planning/setup/lexicons/ac.foundation.dataset.sampleSchema.json b/.planning/phases/01-atproto-foundation/lexicons/ac.foundation.dataset.sampleSchema.json similarity index 100% rename from .planning/setup/lexicons/ac.foundation.dataset.sampleSchema.json rename to .planning/phases/01-atproto-foundation/lexicons/ac.foundation.dataset.sampleSchema.json diff --git a/.planning/setup/lexicons/ac.foundation.dataset.schemaType.json b/.planning/phases/01-atproto-foundation/lexicons/ac.foundation.dataset.schemaType.json similarity index 100% rename from .planning/setup/lexicons/ac.foundation.dataset.schemaType.json rename to .planning/phases/01-atproto-foundation/lexicons/ac.foundation.dataset.schemaType.json diff --git a/.planning/setup/lexicons/ac.foundation.dataset.storageBlobs.json b/.planning/phases/01-atproto-foundation/lexicons/ac.foundation.dataset.storageBlobs.json similarity index 100% rename from .planning/setup/lexicons/ac.foundation.dataset.storageBlobs.json rename to .planning/phases/01-atproto-foundation/lexicons/ac.foundation.dataset.storageBlobs.json diff --git 
a/.planning/setup/lexicons/ac.foundation.dataset.storageExternal.json b/.planning/phases/01-atproto-foundation/lexicons/ac.foundation.dataset.storageExternal.json similarity index 100% rename from .planning/setup/lexicons/ac.foundation.dataset.storageExternal.json rename to .planning/phases/01-atproto-foundation/lexicons/ac.foundation.dataset.storageExternal.json diff --git a/.planning/setup/lexicons/ndarray_shim.json b/.planning/phases/01-atproto-foundation/lexicons/ndarray_shim.json similarity index 100% rename from .planning/setup/lexicons/ndarray_shim.json rename to .planning/phases/01-atproto-foundation/lexicons/ndarray_shim.json diff --git a/.planning/setup/ndarray_shim_spec.md b/.planning/phases/01-atproto-foundation/ndarray_shim_spec.md similarity index 100% rename from .planning/setup/ndarray_shim_spec.md rename to .planning/phases/01-atproto-foundation/ndarray_shim_spec.md diff --git a/.planning/roadmap/v0.2/03_human-review-assessment.md b/.planning/phases/02-v0.2-review/03_human-review-assessment.md similarity index 100% rename from .planning/roadmap/v0.2/03_human-review-assessment.md rename to .planning/phases/02-v0.2-review/03_human-review-assessment.md diff --git a/.planning/roadmap/v0.3/01_codebase-review.md b/.planning/phases/03-v0.3-roadmap/01_codebase-review.md similarity index 100% rename from .planning/roadmap/v0.3/01_codebase-review.md rename to .planning/phases/03-v0.3-roadmap/01_codebase-review.md diff --git a/.planning/roadmap/v0.3/02_synthesis-roadmap.md b/.planning/phases/03-v0.3-roadmap/02_synthesis-roadmap.md similarity index 100% rename from .planning/roadmap/v0.3/02_synthesis-roadmap.md rename to .planning/phases/03-v0.3-roadmap/02_synthesis-roadmap.md diff --git a/.planning/roadmap/v0.3/architecture-doc.md b/.planning/phases/03-v0.3-roadmap/architecture-doc.md similarity index 100% rename from .planning/roadmap/v0.3/architecture-doc.md rename to .planning/phases/03-v0.3-roadmap/architecture-doc.md diff --git a/CHANGELOG.md 
b/CHANGELOG.md
index 312972d..5e2328c 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,6 +6,88 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
 ## [Unreleased]
 
+## [0.3.0b2] - 2026-02-02
+
+### Added
+- **`LocalDiskStore`**: Local filesystem data store implementing `AbstractDataStore` protocol
+  - Writes WebDataset shards to disk with `write_shards()`
+  - Default root at `~/.atdata/data/`, configurable via constructor
+- **`write_samples()`**: Module-level function to write samples directly to WebDataset tar files
+  - Single tar or sharded output via `maxcount`/`maxsize` parameters
+  - Returns typed `Dataset[ST]` wrapping the written files
+- **`Index.write()`**: Write samples and create an index entry in one step
+  - Combines `write_samples()` + `insert_dataset()` into a single call
+  - Auto-creates `LocalDiskStore` when no data store is configured
+- **`Index.promote_entry()` and `Index.promote_dataset()`**: Atmosphere promotion via Index
+  - Promote locally-indexed datasets to ATProto without standalone functions
+  - Schema deduplication and automatic publishing
+- Top-level exports: `atdata.Index`, `atdata.LocalDiskStore`, `atdata.write_samples`
+- `write()` method added to `AbstractIndex` protocol
+- 38 new tests: `test_write_samples.py`, `test_disk_store.py`, `test_index_write.py`
+
+### Changed
+- Fix ruff lint errors in test_coverage_gaps.py (#549)
+- Migrate repo references from foundation-ac to forecast-bio/atdata (#548)
+- Update CLAUDE.md to reflect recent additions and fix divergences (#540)
+- Reorganize .planning/ directory for temporal clarity (#547)
+- Investigate and fix code coverage reduction from recent changes (#541)
+- Add tests for dataset.py uncovered edge cases (#546)
+- Add tests for _index.py uncovered error/edge paths (#545)
+- Add tests for testing.py list field, Optional field, and fixture paths (#544)
+- Add tests for providers/_factory.py postgres and unknown provider paths (#543)
+- Add tests for _helpers.py object dtype and legacy npy paths (#542)
+- Create local skills for release, changelog, and adversarial review (#536)
+- Create /changelog skill (#539)
+- Update /ad skill — less aggressive docstring trimming (#538)
+- Create /release skill (#537)
+- Fix CI failures on v0.3.0b2 release (#535)
+- Release v0.3.0b2 beta (#534)
+- `promote.py` updated as backward-compat wrapper delegating to `Index.promote_entry()`
+- Trimmed `_protocols.py` docstrings by 30% (487 → 343 lines)
+- Trimmed verbose test docstrings across test suite (−173 lines)
+- Strengthened weak test assertions (isinstance checks, tautological tests)
+- Removed dead code: `parse_cid()` function and tests
+- Added `@pytest.mark.filterwarnings` to tests exercising deprecated APIs
+
+## [0.3.0b1] - 2026-01-31
+
+### Added
+- **Structured logging**: `atdata.configure_logging()` with pluggable logger protocol
+- **Partial failure handling**: `PartialFailureError` and shard-level error handling in `Dataset.map()`
+- **Testing utilities**: `atdata.testing` module with mock clients, fixtures, and helpers
+- **Per-shard manifest and query system** (GH#35)
+  - `ManifestBuilder`, `ManifestWriter`, `QueryExecutor`, `SampleLocation`
+  - `ManifestField` annotation and `resolve_manifest_fields()`
+  - Aggregate collectors (categorical, numeric, set)
+  - Integrated into write path and `Dataset.query()`
+- **Performance benchmark suite**: `bench_dataset_io`, `bench_index_providers`, `bench_query`, `bench_atmosphere`
+  - HTML benchmark reports with CI integration
+  - Median/IQR statistics with per-sample columns
+- **SQLite/PostgreSQL index providers** (GH#42)
+  - `SqliteIndexProvider`, `PostgresIndexProvider`, `RedisIndexProvider`
+  - `IndexProvider` protocol in `_protocols.py`
+  - SQLite as default provider (replacing Redis)
+- **Developer experience improvements** (GH#38)
+  - CLI: `atdata inspect`, `atdata preview`, `atdata schema show/diff`
+  - `Dataset.head()`, `Dataset.__iter__`, `Dataset.__len__`, `Dataset.select()`
+  - `Dataset.filter()`, `Dataset.map()`, `Dataset.describe()`
+  - `Dataset.get(key)`, `Dataset.schema`, `Dataset.column_names`
+  - `Dataset.to_pandas()`, `Dataset.to_dict()`
+  - Custom exception hierarchy with actionable error messages
+- **Consolidated Index with Repository system**
+  - `Repository` dataclass and `_AtmosphereBackend`
+  - Prefix routing for multi-backend index operations
+  - Default `Index` singleton with `load_dataset` integration
+  - `AtmosphereIndex` deprecated in favor of `Index(atmosphere=client)`
+- Comprehensive test coverage: 1155 tests
+
+### Changed
+- Split `local.py` monolith into `local/` package (`_index.py`, `_entry.py`, `_schema.py`, `_s3.py`, `_repo_legacy.py`)
+- Migrated CLI from argparse to typer
+- Migrated type annotations from `PackableSample` to `Packable` protocol
+- Multiple adversarial review passes with code quality improvements
+- CI: fixed duplicate runs, scoped permissions, benchmark auto-commit
+
 ## [0.2.2b1] - 2026-01-28
 
 ### Added
@@ -25,243 +107,6 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
- **Comprehensive integration test suite**: 593 tests covering E2E flows, error handling, edge cases ### Changed -- Investigate upload-artifact not finding benchmark output (#512) -- Fix duplicate CI runs for push+PR overlap (#511) -- Scope contents:write permission to benchmark job only (#510) -- Add benchmark docs auto-commit to CI workflow (#509) -- Submit PR for v0.3.0b1 release to upstream/main (#508) -- Implement GH#39: Production hardening (observability, error handling, testing infra) (#504) -- Add pluggable structured logging via atdata.configure_logging (#507) -- Add PartialFailureError and shard-level error handling to Dataset.map (#506) -- Add atdata.testing module with mock clients, fixtures, and helpers (#505) -- Fix CI linting failures (20 ruff errors) (#503) -- Adversarial review: Post-benchmark suite assessment (#494) -- Remove redundant protocol docstrings that restate signatures (#500) -- Add missing unit tests for _type_utils.py (#499) -- Strengthen weak assertions (assert X is not None → value checks) (#498) -- Trim verbose exception constructor docstrings (#501) -- Analyze benchmark results for performance improvement opportunities (#502) -- Consolidate remaining duplicate sample types in test files (#497) -- Remove dead code: _repo_legacy.py legacy UUID field, unused imports (#496) -- Trim verbose docstrings in dataset.py and _index.py (#495) -- Benchmark report: replace mean/stddev with median/IQR, add per-sample columns (#492) -- Add parameter descriptions to benchmark suite with automatic report introspection (#491) -- HTML benchmark reports with CI integration (#487) -- Add bench + render step to CI on highest Python version only (#490) -- Update justfile bench commands to export JSON and render (#489) -- Create render_report.py script to convert JSON to HTML (#488) -- Increase test coverage for low-coverage modules (#480) -- Add providers/_postgres.py tests (mock-based) (#485) -- Add _stub_manager.py tests (#484) -- Add 
manifest/_query.py tests (#483) -- Add repository.py tests (#482) -- Add CLI tests (cli/__init__, diagnose, local, preview, schema) (#481) -- Check test coverage for CLI utils (#479) -- Add performance benchmark suite for atdata (#471) -- Verify benchmarks run (#478) -- Update pyproject.toml and justfile (#477) -- Create bench_atmosphere.py (#476) -- Create bench_query.py (#475) -- Create bench_dataset_io.py (#474) -- Create bench_index_providers.py (#473) -- Create benchmarks/conftest.py with shared fixtures (#472) -- Add per-shard manifest and query system (GH #35) (#462) -- Write unit and integration tests (#470) -- Integrate manifest into write path and Dataset.query() (#469) -- Implement QueryExecutor and SampleLocation (#468) -- Implement ManifestWriter (JSON + parquet) (#467) -- Implement ManifestBuilder (#465) -- Implement ShardManifest data model (#466) -- Implement aggregate collectors (categorical, numeric, set) (#464) -- Implement ManifestField annotation and resolve_manifest_fields() (#463) -- Migrate type annotations from PackableSample to Packable protocol (#461) -- Remove LocalIndex factory — consolidate to Index (#460) -- Split local.py monolith into local/ package (#452) -- Verify tests and lint pass (#459) -- Create __init__.py re-export facade and delete local.py (#458) -- Create _repo_legacy.py with deprecated Repo class (#457) -- Create _index.py with Index class and LocalIndex factory (#456) -- Create _s3.py with S3DataStore and S3 helpers (#455) -- Create _schema.py with schema models and helpers (#454) -- Create _entry.py with LocalDatasetEntry and constants (#453) -- Migrate CLI from argparse to typer (#449) -- Investigate test failures (#450) -- Fix ensure_stub receiving LocalSchemaRecord instead of dict (#451) -- GH#38: Developer experience improvements (#437) -- CLI: atdata preview command (#440) -- CLI: atdata schema show/diff commands (#439) -- CLI: atdata inspect command (#438) -- Dataset.__len__ and Dataset.select() for sample count 
and indexed access (#447) -- Dataset.to_pandas() and Dataset.to_dict() export methods (#446) -- Dataset.filter() and Dataset.map() streaming transforms (#445) -- Dataset.get(key) for keyed sample access (#442) -- Dataset.describe() summary statistics (#444) -- Dataset.schema property and column_names (#443) -- Dataset.head(n) and Dataset.__iter__ convenience methods (#441) -- Custom exception hierarchy with actionable error messages (#448) -- Adversarial review: Post-Repository consolidation assessment (#430) -- Remove backwards-compat dict-access methods from SchemaField and LocalSchemaRecord (#436) -- Add missing test coverage for Repository prefix routing edge cases and error paths (#435) -- Trim over-verbose docstrings in local.py module/class level (#434) -- Fix formally incorrect test assertions (batch_size, CID, brace notation) (#433) -- Consolidate duplicate test sample types across test files into conftest.py (#432) -- Consolidate duplicate entry-creation logic in Index (add_entry vs _insert_dataset_to_provider) (#431) -- Switch default Index provider from Redis to SQLite (#429) -- Consolidated Index with Repository system (#424) -- Phase 4: Deprecate AtmosphereIndex, update exports (#428) -- Phase 3: Default Index singleton and load_dataset integration (#427) -- Phase 2: Extend Index with repos/atmosphere params and prefix routing (#426) -- Phase 1: Create Repository dataclass and _AtmosphereBackend in repository.py (#425) -- Adversarial review: Post-IndexProvider pluggable storage assessment (#417) -- Convert TODO comments to tracked issues or remove (#422) -- Remove deprecated shard_list property references from docstrings (#421) -- Replace bare except in _stub_manager.py and cli/local.py with specific exceptions (#423) -- Tighten generic pytest.raises(Exception) to specific exception types in tests (#420) -- Replace assert statements with ValueError in production code (#419) -- Consolidate duplicated _parse_semver into _type_utils.py (#418) -- feat: 
Add SQLite/PostgreSQL index providers (GH #42) (#409) -- Update documentation and public API exports (#416) -- Add tests for all providers (#415) -- Refactor Index class to accept provider parameter (#414) -- Implement PostgresIndexProvider (#413) -- Implement SqliteIndexProvider (#412) -- Implement RedisIndexProvider (extract from Index class) (#411) -- Define IndexProvider protocol in _protocols.py (#410) -- Add just lint command to justfile (#408) -- Add SQLite/PostgreSQL providers for LocalIndex (in addition to Redis) (#407) -- Fix type hints for @atdata.packable decorator to show PackableSample methods (#406) -- Review GitHub workflows and recommend CI improvements (#405) -- Fix type signatures for Dataset.ordered and Dataset.shuffled (GH#28) (#404) -- Investigate quartodoc Example section rendering - missing CSS classes on pre/code tags (#401) -- Update all docstrings from Example: to Examples: format (#403) -- Create GitHub issues for v0.3 roadmap feature domains (#402) -- Expand Quarto documentation with architectural narrative (#395) -- Expand atmosphere tutorial with federation context (#400) -- Expand local-workflow tutorial with system narrative (#399) -- Expand quickstart tutorial with design context (#398) -- Expand index.qmd with architecture narrative (#397) -- Add architecture overview page (reference/architecture.qmd) (#396) -- Adversarial review: Post-PDSBlobStore comprehensive assessment (#389) -- Remove deprecated shard_list property warnings if unused (#394) -- Add test for Dataset iteration over empty tar file (#393) -- Consolidate duplicate sample types in live atmosphere tests (#392) -- Convert TODO comment in dataset.py to design note or remove (#391) -- Remove redundant no-op statements in _stub_manager.py (#390) -- Update atmosphere example with blob storage case (#216) -- Implement PDSBlobStore for atmosphere data storage (#244) -- Update docs and examples to include PDSBlobStore (#384) -- Add API docs for PDSBlobStore and BlobSource 
(#388) -- Update atmosphere_demo.py example (#387) -- Update atmosphere reference docs (#386) -- Update atmosphere tutorial with PDSBlobStore (#385) -- Implement PDSBlobStore for ATProto blob storage (#380) -- Add tests for PDSBlobStore and BlobSource (#383) -- Add BlobSource for reading PDS blobs as DataSource (#382) -- Create PDSBlobStore class in atmosphere module (#381) -- Investigate Redis index entry expiration/reset issue (#376) -- Audit codebase for xs/@property vs list_xs() convention (#377) -- Evaluate PackableSample → Packable protocol migration (#375) -- Fix load_dataset overload type hints for AbstractIndex (#379) -- Fix load_dataset to use source-appropriate credentials (#378) -- Review and plan human-review.md feedback items (#374) -- Create v0.3 roadmap synthesis document (#373) -- Document justfile in CLAUDE.md (#372) -- Make docs script work from any directory (#371) -- Add uv script shortcut 'docs' for documentation build (#370) -- Update docstrings in local.py (#367) -- Update docstrings in _protocols.py (#366) -- Update docstrings in lens.py (#365) -- Update docstrings in dataset.py (#364) -- Review and address human-review.md feedback (#344) -- Fix load_dataset overloads and AbstractIndex compatibility (#348) -- Connect load_dataset to index data_store for S3 credentials (#361) -- Fix load_dataset overload return types for DictSample (#360) -- Add data_store to AbstractIndex protocol (#359) -- Audit and fix xs/list_xs naming convention (#347) -- Fix AtmosphereIndex: list_datasets/list_schemas return types (#357) -- Refactor DataSource/Dataset: shards()/shard_list -> shards/list_shards() (#356) -- Refactor local.py: entries/all_entries -> entries/list_entries (#355) -- Update AbstractIndex protocol to match new naming convention (#358) -- Investigate Redis index entry removal issue (#346) -- Implement 'atdata diagnose' command for Redis health check (#354) -- Implement 'atdata local up' command to run Redis + MinIO (#353) -- Create atdata.cli 
module with entry point (#352) -- Evaluate PackableSample → Packable protocol migration (#345) -- Update publish_schema and other signatures to use Packable protocol (#351) -- Update @packable decorator return type annotation (#350) -- Define Packable protocol in _protocols.py (#349) -- Review and update README for v0.2.2 release (#343) -- Streamline Dataset API with DictSample default type (#338) -- Add tests for DictSample and new API (#342) -- Update load_dataset default type to DictSample (#341) -- Update @packable to auto-register DictSample lens (#340) -- Implement DictSample class with __getattr__ and __getitem__ (#339) -- Fix failing tests in test_integration_error_handling.py (#337) -- v0.2.2 beta release improvements (#326) -- Document to_parquet() memory usage (#336) -- Evaluate splitting local.py into modules (#335) -- Add error path tests (timeouts, partial failures) (#334) -- Add deployment guide to docs (#333) -- Add troubleshooting/FAQ section to docs (#332) -- Document __orig_class__ assumption in Dataset docstring (#331) -- Centralize tar creation helper in test fixtures (#330) -- Consolidate duplicate test sample types to conftest.py (#329) -- Document expected filterwarnings in test suite (#328) -- Complete truncated atmosphere.qmd documentation (#327) -- Comprehensive v0.2.2 beta release review (#321) -- Compile findings into .review/comprehensive-review.md (#325) -- Review documentation website and examples (#324) -- Review test suite coverage and quality (#323) -- Review core codebase architecture and code quality (#322) -- Human Review: Local Workflow API Improvements (#274) -- Update documentation and examples for current codebase (#316) -- Update README.md with current API (#320) -- Update examples/*.py files for current API (#319) -- Update reference/protocols.qmd with DataSource protocol (#318) -- Update reference/datasets.qmd for DataSource API (#317) -- Adversarial review: Post-DataSource refactor assessment (#307) -- Clean up unused 
TypeAlias definitions in dataset.py (#315) -- Remove verbose docstrings that restate function signatures (#314) -- Consolidate schema reference parsing logic in local.py (#313) -- Add error tests for corrupted msgpack data in Dataset.wrap() (#312) -- Remove or implement skipped test_repo_insert_round_trip (#311) -- Fix bare exception handlers in _stub_manager.py and _cid.py (#310) -- Replace assertion with ValueError in lens.py input validation (#309) -- Replace assertions with ValueError in dataset.py msgpack validation (#308) -- Refactor Dataset to use DataSource abstraction (#299) -- Research WebDataset streaming alternatives beyond HTTP/S URLs (#298) -- Write tests for DataSource implementations (#306) -- Update load_dataset to use DataSource (#305) -- Update S3DataStore to create S3Source instances (#304) -- Refactor Dataset to accept DataSource | str (#303) -- Implement S3Source with boto3 streaming (#302) -- Implement URLSource in new _sources.py module (#301) -- Add DataSource protocol to _protocols.py (#300) -- Fix S3 mock fixture regionname typo in tests (#297) -- Human review feedback: API improvements from human-review-01 (#290) -- AbstractIndex: Protocol vs subclass causing linting errors (#296) -- load_dataset linting: no matching overloads error (#295) -- @atdata.lens linting: LocalTextSample not recognized as PackableSample subclass (#291) -- LocalDatasetEntry: underscore-prefixed attributes should be public (#294) -- Default batch_size should be None for Dataset.ordered/shuffled (#292) -- Improve SchemaNamespace typing for IDE support (#289) -- Schema namespace API: index.load_schema() + index.schemas.MyType (#288) -- Auto-typed get_schema/decode_schema return type (#287) -- Improve decode_schema typing for IDE support (#286) -- Fix stub filename collisions with authority-based namespacing (#285) -- Auto-generate stubs on schema access (#281) -- Add tests for auto-stub functionality (#284) -- Integrate auto-stub into Index class (#283) -- Add 
StubManager class for stub file management (#282) -- Improve decoded_type dynamic typing/signatures (#279) -- Document atdata URI specification (#280) -- Create proper SampleSchema Python type (#278) -- Fix @atdata.packable decorator class identity (#275) -- Improve index.publish_schema API (#276) -- Improve list_schemas API semantics (#277) - **Architecture refactor**: `LocalIndex` + `S3DataStore` composable pattern - `LocalIndex` now accepts optional `data_store` parameter - `S3DataStore` implements `AbstractDataStore` for S3 operations diff --git a/CLAUDE.md b/CLAUDE.md index 6b096b4..0245668 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -4,11 +4,13 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co ## Project Overview -`atdata` is a Python library that implements a loose federation of distributed, typed datasets built on top of WebDataset. It provides: +`atdata` is a Python library that implements a loose federation of distributed, typed datasets built on top of WebDataset and ATProto.
It provides: - **Typed samples** with automatic serialization via msgpack +- **Local and atmosphere storage** with pluggable index providers (SQLite, Redis, PostgreSQL) - **Lens-based transformations** between different dataset schemas -- **Batch aggregation** with automatic numpy array stacking +- **ATProto integration** for publishing and discovering datasets on the atmosphere +- **HuggingFace-style API** with `load_dataset()` for convenient access - **WebDataset integration** for efficient large-scale dataset storage ## Development Commands @@ -28,11 +30,10 @@ uv run pytest # Run specific test file uv run pytest tests/test_dataset.py -uv run pytest tests/test_lens.py +uv run pytest tests/test_local.py # Run single test -uv run pytest tests/test_dataset.py::test_create_sample -uv run pytest tests/test_lens.py::test_lens +uv run pytest tests/test_dataset.py::test_create_sample -v ``` ### Building @@ -50,6 +51,13 @@ just test # Run all tests with coverage just test tests/test_dataset.py # Run specific test file just lint # Run ruff check + format check just docs # Build documentation (runs quartodoc + quarto) +just bench # Run full benchmark suite +just bench-io # Run I/O benchmarks only +just bench-index # Run index provider benchmarks +just bench-query # Run query benchmarks +just bench-report # Generate HTML benchmark report +just bench-save # Save benchmark results +just bench-compare a b # Compare two benchmark runs ``` The `justfile` is in the project root. Add new dev tasks there rather than creating shell scripts. @@ -67,23 +75,41 @@ uv run python script.py ## Architecture -### Core Components +### Module Overview -The codebase has three main modules under `src/atdata/`: +The codebase lives under `src/atdata/` with these main components: -1. 
**dataset.py** - Core dataset and sample infrastructure - - `PackableSample`: Base class for samples that can be serialized with msgpack - - `Dataset[ST]`: Generic typed dataset wrapping WebDataset tar files - - `SampleBatch[DT]`: Automatic batching with attribute aggregation - - `@packable` decorator: Converts dataclasses into PackableSample subclasses +**Core modules:** +- `dataset.py` — `PackableSample`, `DictSample`, `Dataset[ST]`, `SampleBatch[DT]`, `@packable`, `write_samples()` +- `lens.py` — `Lens[S, V]`, `LensNetwork`, `@lens` decorator +- `_protocols.py` — Protocol definitions: `Packable`, `IndexEntry`, `AbstractIndex`, `AbstractDataStore`, `DataSource` +- `_hf_api.py` — `load_dataset()`, `DatasetDict`, HuggingFace-style path resolution +- `_exceptions.py` — Custom exception hierarchy (`AtdataError`, `SchemaError`, `ShardError`, etc.) -2. **lens.py** - Type transformation system - - `Lens[S, V]`: Bidirectional transformations between sample types (getter/putter) - - `LensNetwork`: Singleton registry for lens transformations - - `@lens` decorator: Registers lens getters globally +**Index and storage:** +- `local/` — `Index`, `LocalDatasetEntry`, `S3DataStore`, `LocalDiskStore`, schema management +- `providers/` — Pluggable index backends: `SqliteProvider` (default), `RedisProvider`, `PostgresProvider` +- `repository.py` — `Repository` dataclass pairing provider + data store, prefix routing -3. 
**_helpers.py** - Serialization utilities - - `array_to_bytes()` / `bytes_to_array()`: numpy array serialization +**ATProto integration:** +- `atmosphere/` — `AtmosphereClient`, schema/dataset/lens publishers and loaders, `PDSBlobStore` +- `promote.py` — Local-to-atmosphere promotion (deprecated in favor of `Index.promote_entry()`) + +**Data pipeline:** +- `_sources.py` — `URLSource`, `S3Source`, `BlobSource` (streaming shard data to Dataset) +- `manifest/` — Per-shard metadata manifests for query-based access (`ManifestField`, `QueryExecutor`) + +**Utilities:** +- `_helpers.py` — NumPy array serialization (`array_to_bytes` / `bytes_to_array`) +- `_cid.py` — ATProto-compatible CID generation via libipld +- `_schema_codec.py` — Dynamic Python type generation from stored schemas +- `_stub_manager.py` — IDE stub file generation for dynamic types +- `_type_utils.py` — Shared type conversion utilities +- `_logging.py` — Pluggable structured logging +- `testing.py` — Mock clients, fixtures, and test helpers + +**CLI:** +- `cli/` — Typer-based CLI: `atdata inspect`, `atdata preview`, `atdata schema show/diff`, `atdata local up/down/status`, `atdata diagnose` ### Key Design Patterns @@ -105,6 +131,36 @@ class MySample: field2: NDArray ``` +**Writing and Indexing Data** + +```python +# Write samples directly to tar files +ds = atdata.write_samples(samples, "output/data.tar") + +# Or use Index for managed storage +index = atdata.Index(data_store=atdata.LocalDiskStore()) +entry = index.write(samples, name="my-dataset") +``` + +**Index with Pluggable Storage** + +```python +# SQLite backend (default, zero dependencies) +index = atdata.Index() + +# With local disk storage +index = atdata.Index(data_store=atdata.LocalDiskStore()) + +# With S3 storage +from atdata.local import S3DataStore +index = atdata.Index(data_store=S3DataStore(credentials, bucket="my-bucket")) + +# With atmosphere backend +from atdata.atmosphere import AtmosphereClient +client = 
AtmosphereClient.login("handle", "password") +index = atdata.Index(atmosphere=client) +``` + **NDArray Handling** Fields annotated as `NDArray` or `NDArray | None` are automatically: @@ -129,7 +185,7 @@ def my_lens_put(view: ViewType, source: SourceType) -> SourceType: ds = atdata.Dataset[SourceType](url).as_type(ViewType) ``` -The `LensNetwork` singleton (in `lens.py:183`) maintains a global registry of all lenses decorated with `@lens`. +The `LensNetwork` singleton (in `lens.py`) maintains a global registry of all lenses decorated with `@lens`. **Batch Aggregation** @@ -186,26 +242,27 @@ The codebase uses Python 3.12+ generics heavily: ## Testing Notes -- Tests use parametrization heavily via `@pytest.mark.parametrize` -- Test cases cover both decorator and inheritance syntax +- 1155+ tests across 38 test files +- Tests use parametrization via `@pytest.mark.parametrize` where appropriate - Temporary WebDataset tar files created in `tmp_path` fixture -- Tests verify both serialization and batch aggregation behavior +- Shared sample types defined in `conftest.py` (`SharedBasicSample`, `SharedNumpySample`) - Lens tests verify well-behavedness (GetPut/PutGet/PutPut laws) +- Integration tests cover local, atmosphere, cross-backend, and error handling scenarios ### Warning Suppression Convention **Keep warning suppression local to individual tests, not global.** -When tests generate expected warnings (e.g., from third-party library incompatibilities), suppress them using `@pytest.mark.filterwarnings` decorators on each affected test rather than global suppression in `conftest.py`. This: +When tests generate expected warnings (e.g., from deprecated APIs or third-party library incompatibilities), suppress them using `@pytest.mark.filterwarnings` decorators on each affected test rather than global suppression in `conftest.py`. 
This: - Documents which specific tests have known warning behaviors - Makes it easier to track when warnings appear in unexpected places - Avoids masking genuine warnings from new code -Example for s3fs/moto async incompatibility warnings: +Example for deprecated API tests: ```python -@pytest.mark.filterwarnings("ignore::pytest.PytestUnraisableExceptionWarning") -@pytest.mark.filterwarnings("ignore:coroutine.*was never awaited:RuntimeWarning") -def test_repo_insert_with_s3(mock_s3, clean_redis): +@pytest.mark.filterwarnings("ignore::DeprecationWarning") +class TestAtmosphereIndex: + """Tests for deprecated AtmosphereIndex backward compat.""" ... ``` @@ -303,6 +360,16 @@ chainlink show 123 uv run chainlink list # Not needed ``` +## Custom Skills + +Project-level Claude Code skills are defined in `.claude/commands/`: + +- `/release ` — Full release flow: branch, merge, version bump, changelog, PR +- `/adr` — Adversarial review with docstring-preservation rules for quartodoc +- `/changelog` — Generate clean CHANGELOG entry from chainlink history + +User-level skills (in `~/.claude/commands/`) take precedence over project-level skills with the same name. + ## Git Workflow ### Committing Changes @@ -312,20 +379,31 @@ When using the `/commit` command or creating commits: - This ensures issue tracking history is preserved across sessions - The issues.db file tracks all chainlink issues, comments, and status changes +### Release Flow + +Releases follow this pattern (automated by `/release` skill): +1. Create `release/v` branch from previous release +2. Merge feature branch with `--no-ff` to preserve topology +3. Bump version in `pyproject.toml`, run `uv lock` +4. Write CHANGELOG entry (Keep a Changelog format) +5. 
Push and create PR to `upstream/main` + ### CLI Module -- **Track `src/atdata/cli/`** - Always include the CLI module in commits -- The CLI provides `atdata local up/down/status` and `atdata diagnose` commands +- **Track `src/atdata/cli/`** — Always include the CLI module in commits +- The CLI is built with typer and provides `atdata inspect`, `atdata preview`, `atdata schema`, `atdata local`, and `atdata diagnose` commands - Changes to CLI should be committed with the related feature changes ### Planning Documents -- **Track `.planning/` directory in git** - Do not ignore planning documents -- Planning documents in `.planning/` should be committed to preserve design history -- This includes architecture notes, implementation plans, and design decisions +- **Track `.planning/` directory in git** — Do not ignore planning documents +- Planning documents are organized by phase in `.planning/phases/`: + - `01-atproto-foundation/` — Initial ATProto integration design, lexicon definitions, architecture decisions + - `02-v0.2-review/` — Human review assessments from v0.2 cycle + - `03-v0.3-roadmap/` — Codebase review and synthesis roadmap for v0.3 ### Reference Materials -- **Track `.reference/` directory in git** - Include reference documentation in commits +- **Track `.reference/` directory in git** — Include reference documentation in commits - The `.reference/` directory contains external specifications and reference materials - This includes API specs, lexicon definitions, and other reference documentation used for development diff --git a/README.md b/README.md index 0f19324..364fb86 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # atdata -[![codecov](https://codecov.io/gh/foundation-ac/atdata/branch/main/graph/badge.svg)](https://codecov.io/gh/foundation-ac/atdata) +[![codecov](https://codecov.io/gh/forecast-bio/atdata/branch/main/graph/badge.svg)](https://codecov.io/gh/forecast-bio/atdata) A loose federation of distributed, typed datasets built on 
WebDataset. diff --git a/docs/api/AbstractDataStore.html b/docs/api/AbstractDataStore.html index 4ceb07c..6605a66 100644 --- a/docs/api/AbstractDataStore.html +++ b/docs/api/AbstractDataStore.html @@ -256,7 +256,6 @@

On this page

  • Methods
  • @@ -272,15 +271,12 @@

    On this page

    AbstractDataStore

    AbstractDataStore()
    -

    Protocol for data storage operations.

    -

    This protocol abstracts over different storage backends for dataset data: - S3DataStore: S3-compatible object storage - PDSBlobStore: ATProto PDS blob storage (future)

    -

    The separation of index (metadata) from data store (actual files) allows flexible deployment: local index with S3 storage, atmosphere index with S3 storage, or atmosphere index with PDS blobs.

    +

    Protocol for data storage backends (S3, local disk, PDS blobs).

    +

    Separates index (metadata) from data store (shard files), enabling flexible deployment combinations.

    Examples

    >>> store = S3DataStore(credentials, bucket="my-bucket")
    ->>> urls = store.write_shards(dataset, prefix="training/v1")
    ->>> print(urls)
    -['s3://my-bucket/training/v1/shard-000000.tar', ...]
    +>>> urls = store.write_shards(dataset, prefix="training/v1")
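Because `AbstractDataStore` is a protocol, any object exposing the two methods above satisfies it. A minimal in-memory sketch, assuming only the `write_shards`/`read_url` contract shown in these docs (`MemoryStore` and the `mem://` scheme are illustrative, not part of atdata):

```python
# Hypothetical minimal implementation of the AbstractDataStore protocol.
# A real store would serialize `ds` into WebDataset tar shards; this one
# just records placeholder shard URLs so the contract is visible.
class MemoryStore:
    def __init__(self) -> None:
        self._shards: dict[str, bytes] = {}

    def write_shards(self, ds, *, prefix: str, **kwargs) -> list[str]:
        # One placeholder shard under the given prefix.
        url = f"mem://{prefix}/shard-000000.tar"
        self._shards[url] = b""
        return [url]

    def read_url(self, url: str) -> str:
        # Nothing to sign or resolve for the in-memory scheme.
        return url


store = MemoryStore()
urls = store.write_shards(None, prefix="training/v1")
print(urls)  # ['mem://training/v1/shard-000000.tar']
```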

    Methods

    @@ -294,13 +290,9 @@

    Methods

    read_url -Resolve a storage URL for reading. +Resolve a storage URL for reading (e.g., sign S3 URLs). -supports_streaming -Whether this store supports streaming reads. - - write_shards Write dataset shards to storage. @@ -309,84 +301,14 @@

    Methods

    read_url

    AbstractDataStore.read_url(url)
    -

    Resolve a storage URL for reading.

    -

    Some storage backends may need to transform URLs (e.g., signing S3 URLs or resolving blob references). This method returns a URL that can be used directly with WebDataset.

    -
    -

    Parameters

    - - - - - - - - - - - - - - - - - -
    NameTypeDescriptionDefault
    urlstrStorage URL to resolve.required
    -
    -
    -

    Returns

    - - - - - - - - - - - - - - - -
    NameTypeDescription
    strWebDataset-compatible URL for reading.
    -
    -
    -
    -

    supports_streaming

    -
    AbstractDataStore.supports_streaming()
    -

    Whether this store supports streaming reads.

    -
    -

    Returns

    - - - - - - - - - - - - - - - - - - - - -
    NameTypeDescription
    boolTrue if the store supports efficient streaming (like S3),
    boolFalse if data must be fully downloaded first.
    -
    +

    Resolve a storage URL for reading (e.g., sign S3 URLs).
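The URL transformation the docs describe (e.g., presigning) can be sketched as a pure function. This is an assumption-laden stand-in, not the real `S3DataStore` logic, which presumably delegates to an S3 client for actual signing; it only shows the contract of storage URL in, WebDataset-readable URL out:

```python
# Illustrative read_url: rewrite s3:// URLs to signed https:// URLs,
# pass everything else through unchanged. The token is a placeholder.
def read_url(url: str) -> str:
    if url.startswith("s3://"):
        bucket_and_key = url[len("s3://"):]
        return f"https://{bucket_and_key}?X-Amz-Signature=EXAMPLE"
    return url  # non-S3 URLs need no transformation


print(read_url("s3://my-bucket/shard-000000.tar"))
```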

    write_shards

    -
    AbstractDataStore.write_shards(ds, *, prefix, **kwargs)
    +
    AbstractDataStore.write_shards(ds, *, prefix, **kwargs)

    Write dataset shards to storage.

    -
    -

    Parameters

    +
    +

    Parameters

    @@ -406,20 +328,20 @@

    - + - +
    prefix strPath prefix for the shards (e.g., ‘datasets/mnist/v1’).Path prefix (e.g., 'datasets/mnist/v1'). required
    **kwargs Backend-specific options (e.g., maxcount for shard size).Backend-specific options (maxcount, maxsize, etc.). {}
    -
    -

    Returns

    +
    +

    Returns

    @@ -432,12 +354,7 @@

    - - - - - - +
    list[str]List of URLs for the written shards, suitable for use with
    list[str]WebDataset or atdata.Dataset().List of shard URLs suitable for atdata.Dataset().
    diff --git a/docs/api/AbstractIndex.html b/docs/api/AbstractIndex.html index d04ab5d..db70079 100644 --- a/docs/api/AbstractIndex.html +++ b/docs/api/AbstractIndex.html @@ -252,7 +252,6 @@

    On this page

    @@ -278,27 +276,15 @@

    On this page

    AbstractIndex

    AbstractIndex()
    -

    Protocol for index operations - implemented by LocalIndex and AtmosphereIndex.

    -

    This protocol defines the common interface for managing dataset metadata: - Publishing and retrieving schemas - Inserting and listing datasets - (Future) Publishing and retrieving lenses

    -

    A single index can hold datasets of many different sample types. The sample type is tracked via schema references, not as a generic parameter on the index.

    -
    -

    Optional Extensions

    -

    Some index implementations support additional features: - data_store: An AbstractDataStore for reading/writing dataset shards. If present, load_dataset will use it for S3 credential resolution.

    -
    +

    Protocol for index operations — implemented by Index and AtmosphereIndex.

    +

    Manages dataset metadata: publishing/retrieving schemas, inserting/listing datasets. A single index holds datasets of many sample types, tracked via schema references.

    Examples

    >>> def publish_and_list(index: AbstractIndex) -> None:
    -...     # Publish schemas for different types
    -...     schema1 = index.publish_schema(ImageSample, version="1.0.0")
    -...     schema2 = index.publish_schema(TextSample, version="1.0.0")
    -...
    -...     # Insert datasets of different types
    -...     index.insert_dataset(image_ds, name="images")
    -...     index.insert_dataset(text_ds, name="texts")
    -...
    -...     # List all datasets (mixed types)
    -...     for entry in index.list_datasets():
    -...         print(f"{entry.name} -> {entry.schema_ref}")
    +... index.publish_schema(ImageSample, version="1.0.0") +... index.insert_dataset(image_ds, name="images") +... for entry in index.list_datasets(): +... print(f"{entry.name} -> {entry.schema_ref}")
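Since `AbstractIndex` is a protocol, the publish/insert/list flow in the example can be exercised against any duck-typed stand-in. A sketch with a hypothetical `FakeIndex` (the `local://schemas/...@version` ref format follows the `publish_schema` docs below; the class itself is illustrative):

```python
# Hypothetical in-memory index satisfying the subset of AbstractIndex
# used in the Examples section above.
class FakeIndex:
    def __init__(self) -> None:
        self._schemas: dict[str, type] = {}
        self._datasets: list[tuple[str, str | None]] = []

    def publish_schema(self, sample_type, *, version="1.0.0", **kwargs) -> str:
        ref = f"local://schemas/{sample_type.__name__}@{version}"
        self._schemas[ref] = sample_type
        return ref

    def insert_dataset(self, ds, *, name, schema_ref=None, **kwargs):
        self._datasets.append((name, schema_ref))
        return (name, schema_ref)

    def list_datasets(self):
        return list(self._datasets)


index = FakeIndex()
ref = index.publish_schema(str, version="1.0.0")  # str stands in for a sample type
index.insert_dataset(None, name="images", schema_ref=ref)
for name, schema_ref in index.list_datasets():
    print(f"{name} -> {schema_ref}")  # images -> local://schemas/str@1.0.0
```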

    Attributes

    @@ -314,14 +300,6 @@

    Attributes

    data_store Optional data store for reading/writing shards. - -datasets -Lazily iterate over all dataset entries in this index. - - -schemas -Lazily iterate over all schema records in this index. -
    @@ -337,7 +315,7 @@

    Methods

    decode_schema -Reconstruct a Python Packable type from a stored schema. +Reconstruct a Packable type from a stored schema. get_dataset @@ -349,77 +327,22 @@

    Methods

    insert_dataset -Insert a dataset into the index. - - -list_datasets -Get all dataset entries as a materialized list. - - -list_schemas -Get all schema records as a materialized list. +Register an existing dataset in the index. publish_schema Publish a schema for a sample type. + +write +Write samples and create an index entry in one step. +

    decode_schema

    AbstractIndex.decode_schema(ref)
    -

    Reconstruct a Python Packable type from a stored schema.

    -

    This method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a Packable class matching the schema definition.

    -
    -

    Parameters

    - - - - - - - - - - - - - - - - - -
    NameTypeDescriptionDefault
    refstrSchema reference string (local:// or at://).required
    -
    -
    -

    Returns

    - - - - - - - - - - - - - - - - - - - - - - - - - -
    NameTypeDescription
    Type[Packable]A dynamically generated Packable class with fields matching
    Type[Packable]the schema definition. The class can be used with
    Type[Packable]Dataset[T] to load and iterate over samples.
    -
    +

    Reconstruct a Packable type from a stored schema.

    Raises

    @@ -439,64 +362,21 @@

    Rais

    - +
    ValueErrorIf schema cannot be decoded (unsupported field types).If schema has unsupported field types.

    Examples

    -
    >>> entry = index.get_dataset("my-dataset")
    ->>> SampleType = index.decode_schema(entry.schema_ref)
    ->>> ds = Dataset[SampleType](entry.data_urls[0])
    ->>> for sample in ds.ordered():
    -...     print(sample)  # sample is instance of SampleType
    +
    >>> SampleType = index.decode_schema(entry.schema_ref)
    +>>> ds = Dataset[SampleType](entry.data_urls[0])

    get_dataset

    AbstractIndex.get_dataset(ref)

    Get a dataset entry by name or reference.

    -
    -

    Parameters

    - - - - - - - - - - - - - - - - - -
    NameTypeDescriptionDefault
    refstrDataset name, path, or full reference string.required
    -
    -
    -

    Returns

    - - - - - - - - - - - - - - - -
    NameTypeDescription
    IndexEntryIndexEntry for the dataset.
    -

    Raises

    @@ -521,51 +401,6 @@

    Ra

    get_schema

    AbstractIndex.get_schema(ref)

    Get a schema record by reference.

    -
    -

    Parameters

    -
    - - - - - - - - - - - - - - - - -
    NameTypeDescriptionDefault
    refstrSchema reference string (local:// or at://).required
    -
    -
    -

    Returns

    - - - - - - - - - - - - - - - - - - - - -
    NameTypeDescription
    dictSchema record as a dictionary with fields like ‘name’, ‘version’,
    dict‘fields’, etc.
    -

    Raises

    @@ -589,10 +424,9 @@

    Ra

    insert_dataset

    AbstractIndex.insert_dataset(ds, *, name, schema_ref=None, **kwargs)
    -

    Insert a dataset into the index.

    -

    The sample type is inferred from ds.sample_type. If schema_ref is not provided, the schema may be auto-published based on the sample type.

    -
    -

    Parameters

    +

    Register an existing dataset in the index.

    +
    +

    Parameters

    @@ -606,80 +440,70 @@

    - + - + - + - +
    ds DatasetThe Dataset to register in the index (any sample type).The Dataset to register. required
    name strHuman-readable name for the dataset.Human-readable name. required
    schema_ref Optional[str]Optional explicit schema reference. If not provided, the schema may be auto-published or inferred from ds.sample_type.Explicit schema ref; auto-published if None. None
    **kwargs Additional backend-specific options.Backend-specific options. {}
    -
    -

    Returns

    +
    +
    +

    publish_schema

    +
    AbstractIndex.publish_schema(sample_type, *, version='1.0.0', **kwargs)
    +

    Publish a schema for a sample type.

    +
    +

    Parameters

    + - - - + + + + - -
    Name Type DescriptionDefault
    IndexEntryIndexEntry for the inserted dataset.sample_typetypeA Packable type (@packable-decorated or subclass).required
    -
    -
    -
    -

    list_datasets

    -
    AbstractIndex.list_datasets()
    -

    Get all dataset entries as a materialized list.

    -
    -

    Returns

    - - - - - - + + + + + - - + - - + +
    NameTypeDescription
    versionstrSemantic version string.'1.0.0'
    **kwargs list[IndexEntry]List of IndexEntry for each dataset.Backend-specific options.{}
    -
    -
    -

    list_schemas

    -
    AbstractIndex.list_schemas()
    -

    Get all schema records as a materialized list.

    -
    -

    Returns

    +
    +

    Returns

    @@ -691,20 +515,20 @@

    - - + +
    list[dict]List of schema records as dictionaries.strSchema reference string (local://... or at://...).
    -
    -

    publish_schema

    -
    AbstractIndex.publish_schema(sample_type, *, version='1.0.0', **kwargs)
    -

    Publish a schema for a sample type.

    -

    The sample_type is accepted as type rather than Type[Packable] to support @packable-decorated classes, which satisfy the Packable protocol at runtime but cannot be statically verified by type checkers.

    -
    -

    Parameters

    +
    +

    write

    +
    AbstractIndex.write(samples, *, name, schema_ref=None, **kwargs)
    +

    Write samples and create an index entry in one step.

    +

    Serializes samples to WebDataset tar files, stores them via the appropriate backend, and creates an index entry.

    +
    +

    Parameters

    @@ -716,28 +540,34 @@

    -

    - - + + + - + - - + + + + + + + + - +
    sample_typetypeA Packable type (PackableSample subclass or @packable-decorated). Validated at runtime via the @runtime_checkable Packable protocol.samplesIterableIterable of Packable samples. Must be non-empty. required
    versionname strSemantic version string for the schema.'1.0.0'Dataset name, optionally prefixed with target backend.required
    schema_refOptional[str]Optional schema reference.None
    **kwargs Additional backend-specific options.Backend-specific options (maxcount, description, etc.). {}
    -
    -

    Returns

    +
    +

    Returns

    @@ -749,18 +579,8 @@

NameTypeDescription
-strSchema reference string: Local: ‘local://schemas/{module.Class}@version’; Atmosphere: ‘at://did:plc:…/ac.foundation.dataset.sampleSchema/…’
+IndexEntryIndexEntry for the created dataset.
diff --git a/docs/api/AtmosphereIndex.html b/docs/api/AtmosphereIndex.html
index 7b2a301..17f15ed 100644
--- a/docs/api/AtmosphereIndex.html
+++ b/docs/api/AtmosphereIndex.html
@@ -278,20 +278,18 @@

    On this page

    AtmosphereIndex

    atmosphere.AtmosphereIndex(client, *, data_store=None)

    ATProto index implementing AbstractIndex protocol.

    -

    Wraps SchemaPublisher/Loader and DatasetPublisher/Loader to provide a unified interface compatible with LocalIndex.

    +

Deprecated: Use atdata.Index(atmosphere=client) instead. AtmosphereIndex is retained for backwards compatibility and will be removed in a future release.

    +

    Wraps SchemaPublisher/Loader and DatasetPublisher/Loader to provide a unified interface compatible with Index.

    Optionally accepts a PDSBlobStore for writing dataset shards as ATProto blobs, enabling fully decentralized dataset storage.

    Examples

    -
    >>> client = AtmosphereClient()
    ->>> client.login("handle.bsky.social", "app-password")
    ->>>
    ->>> # Without blob storage (external URLs only)
    ->>> index = AtmosphereIndex(client)
    ->>>
    ->>> # With PDS blob storage
    ->>> store = PDSBlobStore(client)
    ->>> index = AtmosphereIndex(client, data_store=store)
    ->>> entry = index.insert_dataset(dataset, name="my-data")
    +
    >>> # Preferred: use unified Index
+>>> from atdata.local import Index
+>>> from atdata.atmosphere import AtmosphereClient
+>>> client = AtmosphereClient()
+>>> index = Index(atmosphere=client)
    +>>>
    +>>> # Legacy (deprecated)
    +>>> index = AtmosphereIndex(client)

    Attributes

diff --git a/docs/api/DataSource.html b/docs/api/DataSource.html
index 186b1e4..e35dfe5 100644
--- a/docs/api/DataSource.html
+++ b/docs/api/DataSource.html
@@ -272,20 +272,12 @@

    On this page

    DataSource

    DataSource()
    -

    Protocol for data sources that provide streams to Dataset.

    -

A DataSource abstracts over different ways of accessing dataset shards:
- URLSource: Standard WebDataset-compatible URLs (http, https, pipe, gs, etc.)
- S3Source: S3-compatible storage with explicit credentials
- BlobSource: ATProto blob references (future)

    -

The key method is shards(), which yields (identifier, stream) pairs. These are fed directly to WebDataset’s tar_file_expander, bypassing URL resolution entirely. This enables:
- Private S3 repos with credentials
- Custom endpoints (Cloudflare R2, MinIO)
- ATProto blob streaming
- Any other source that can provide file-like objects

    +

    Protocol for data sources that stream shard data to Dataset.

    +

    Implementations (URLSource, S3Source, BlobSource) yield (identifier, stream) pairs fed to WebDataset’s tar expander, bypassing URL resolution. This enables private S3, custom endpoints, and ATProto blob streaming.

    Examples

    -
    >>> source = S3Source(
    -...     bucket="my-bucket",
    -...     keys=["data-000.tar", "data-001.tar"],
    -...     endpoint="https://r2.example.com",
    -...     credentials=creds,
    -... )
    ->>> ds = Dataset[MySample](source)
    ->>> for sample in ds.ordered():
    -...     print(sample)
    +
    >>> source = S3Source(bucket="my-bucket", keys=["data-000.tar"])
    +>>> ds = Dataset[MySample](source)
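The (identifier, stream) contract above can be sketched without the library. `MemorySource` below is a hypothetical in-memory implementation of the three protocol methods, not a class shipped by atdata:

```python
import io
from typing import IO, Iterator

class MemorySource:
    """Hypothetical in-memory DataSource: shards held as raw bytes."""

    def __init__(self, shards: dict[str, bytes]):
        self._shards = shards

    def list_shards(self) -> list[str]:
        # Identifiers only; no streams are opened.
        return list(self._shards)

    def shards(self) -> Iterator[tuple[str, IO[bytes]]]:
        # Lazily yield (identifier, stream) pairs, as the protocol requires.
        for shard_id, data in self._shards.items():
            yield shard_id, io.BytesIO(data)

    def open_shard(self, shard_id: str) -> IO[bytes]:
        # Random access to one shard, e.g. for DataLoader worker splitting.
        if shard_id not in self._shards:
            raise KeyError(shard_id)
        return io.BytesIO(self._shards[shard_id])
```

Any object with these three methods satisfies the protocol shape; the identifiers returned by `list_shards()` must match what `shards()` yields.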

    Attributes

    @@ -299,7 +291,7 @@

    Attributes

shards
-Lazily yield (identifier, stream) pairs for each shard.
+Lazily yield (shard_id, stream) pairs for each shard.
@@ -316,84 +308,23 @@

    Methods

list_shards
-Get list of shard identifiers without opening streams.
+Shard identifiers without opening streams.
open_shard
-Open a single shard by its identifier.
+Open a single shard for random access (e.g., DataLoader splitting).

    list_shards

    DataSource.list_shards()
    -

    Get list of shard identifiers without opening streams.

    -

    Used for metadata queries like counting shards without actually streaming data. Implementations should return identifiers that match what shards would yield.

    -
    -

    Returns

    NameTypeDescription
    list[str]List of shard identifier strings.
    -
    +

    Shard identifiers without opening streams.

    open_shard

    DataSource.open_shard(shard_id)
    -

    Open a single shard by its identifier.

    -

    This method enables random access to individual shards, which is required for PyTorch DataLoader worker splitting. Each worker opens only its assigned shards rather than iterating all shards.

    -
    -

    Parameters

    NameTypeDescriptionDefault
    shard_idstrShard identifier from shard_list.required
    -
    -
    -

    Returns

    NameTypeDescription
    IO[bytes]File-like stream for reading the shard.
    -
    +

    Open a single shard for random access (e.g., DataLoader splitting).

    Raises

    @@ -408,7 +339,7 @@

Raises

-KeyErrorIf shard_id is not in shard_list.
+KeyErrorIf shard_id is not in list_shards().
diff --git a/docs/api/Dataset.html b/docs/api/Dataset.html
index f7a6789..29b99ff 100644
--- a/docs/api/Dataset.html
+++ b/docs/api/Dataset.html
@@ -259,9 +259,19 @@

    On this page

  • Methods
    • as_type
    • +
    • describe
    • +
    • filter
    • +
    • get
    • +
    • head
    • list_shards
    • +
    • map
    • ordered
    • +
    • process_shards
    • +
    • query
    • +
    • select
    • shuffled
    • +
    • to_dict
    • +
    • to_pandas
    • to_parquet
    • wrap
    • wrap_batch
    • @@ -354,38 +364,108 @@

      Methods

as_type
-View this dataset through a different sample type using a registered lens.
+View this dataset through a different sample type via a registered lens.
+describe
+Summary statistics: sample_type, fields, num_shards, shards, url, metadata.
+filter
+Return a new dataset that yields only samples matching predicate.
+get
+Retrieve a single sample by its __key__.
+head
+Return the first n samples from the dataset.
list_shards
-Get list of individual dataset shards.
+Return all shard paths/URLs as a list.
+map
+Return a new dataset that applies fn to each sample during iteration.
ordered
Iterate over the dataset in order.
+process_shards
+Process each shard independently, collecting per-shard results.
+query
+Query this dataset using per-shard manifest metadata.
+select
+Return samples at the given integer indices.
shuffled
Iterate over the dataset in random order.
+to_dict
+Materialize the dataset as a column-oriented dictionary.
+to_pandas
+Materialize the dataset (or first limit samples) as a DataFrame.
to_parquet
-Export dataset contents to parquet format.
+Export dataset to parquet file(s).
wrap
-Wrap a raw msgpack sample into the appropriate dataset-specific type.
+Deserialize a raw WDS sample dict into type ST.
wrap_batch
-Wrap a batch of raw msgpack samples into a typed SampleBatch.
+Deserialize a raw WDS batch dict into SampleBatch[ST].

      as_type

      Dataset.as_type(other)
      -

      View this dataset through a different sample type using a registered lens.

      +

      View this dataset through a different sample type via a registered lens.

      +
      +

      Raises

      NameTypeDescription
      ValueErrorIf no lens exists between the current and target types.
      +
      +
      +
      +

      describe

      +
      Dataset.describe()
      +

      Summary statistics: sample_type, fields, num_shards, shards, url, metadata.

      +
      +
      +

      filter

      +
      Dataset.filter(predicate)
      +

      Return a new dataset that yields only samples matching predicate.

      +

      The filter is applied lazily during iteration — no data is copied.

      Parameters

      @@ -399,9 +479,9 @@

      -

      - - + + + @@ -420,24 +500,115 @@

Returns

-otherType[RT]The target sample type to transform into. Must be a type derived from PackableSample.required
+predicateCallable[[ST], bool]A function that takes a sample and returns True to keep it or False to discard it.required
-Dataset[RT]A new Dataset instance that yields samples of type other by applying the appropriate lens transformation from the global LensNetwork registry.
+Dataset[ST]A new Dataset whose iterators apply the filter.
      +
      +
      +

      Examples

      +
      >>> long_names = ds.filter(lambda s: len(s.name) > 10)
      +>>> for sample in long_names:
      +...     assert len(sample.name) > 10
      +
      +
      +
      +

      get

      +
      Dataset.get(key)
      +

      Retrieve a single sample by its __key__.

      +

      Scans shards sequentially until a sample with a matching key is found. This is O(n) for streaming datasets.

      +
      +

      Parameters

      NameTypeDescriptionDefault
      keystrThe WebDataset __key__ string to search for.required
      +
      +
      +

      Returns

NameTypeDescription
+STThe matching sample.
      +
      +
      +

      Raises

NameTypeDescription
+KeyErrorIf no sample with the given key exists.
      -
      -

      Raises

      +
      +

      Examples

      +
      >>> sample = ds.get("00000001-0001-1000-8000-010000000000")
      +
      +
      +
      +
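The documented O(n) scan behaves like the following standalone sketch; `find_by_key` and the dict-shaped samples are illustrative stand-ins, not the library's internals:

```python
def find_by_key(samples, key):
    """Scan samples in order; return the first whose __key__ matches.

    Mirrors the documented behavior: streaming formats have no index,
    so lookup is a sequential O(n) scan that raises KeyError on miss.
    """
    for sample in samples:
        if sample["__key__"] == key:
            return sample
    raise KeyError(key)
```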

      head

      +
      Dataset.head(n=5)
      +

      Return the first n samples from the dataset.

      +
      +

      Parameters

      NameTypeDescriptionDefault
      nintNumber of samples to return. Default: 5.5
      +
      +
      +

      Returns

      @@ -449,48 +620,82 @@

-Raises
-ValueErrorIf no registered lens exists between the current sample type and the target type.
+list[ST]List of up to n samples in shard order.
      +
      +

      Examples

      +
      >>> samples = ds.head(3)
      +>>> len(samples)
      +3
      +

      list_shards

      -
      Dataset.list_shards()
      -

      Get list of individual dataset shards.

      -
      -

      Returns

      +
      Dataset.list_shards()
      +

      Return all shard paths/URLs as a list.

      +
      +
      +

      map

      +
      Dataset.map(fn)
      +

      Return a new dataset that applies fn to each sample during iteration.

      +

      The mapping is applied lazily during iteration — no data is copied.

      +
      +

      Parameters

NameTypeDescriptionDefault
-list[str]A full (non-lazy) list of the individual tar files within the source WebDataset.
+fnCallable[[ST], Any]A function that takes a sample of type ST and returns a transformed value.required
      +
      +
      +

      Returns

NameTypeDescription
+DatasetA new Dataset whose iterators apply the mapping.
      +
      +

      Examples

      +
      >>> names = ds.map(lambda s: s.name)
      +>>> for name in names:
      +...     print(name)
      +
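The lazy, copy-free semantics shared by filter and map can be illustrated with a small standalone wrapper; `LazyView` is hypothetical, not the Dataset implementation:

```python
class LazyView:
    """Hypothetical sketch of lazy filter/map: nothing is copied, and
    predicates/mappers run only while iterating."""

    def __init__(self, source, ops=()):
        self._source = source   # any re-iterable of samples
        self._ops = ops         # chain of ("filter" | "map", fn) pairs

    def filter(self, predicate):
        # Returns a new view; the source is untouched.
        return LazyView(self._source, self._ops + (("filter", predicate),))

    def map(self, fn):
        return LazyView(self._source, self._ops + (("map", fn),))

    def __iter__(self):
        for item in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "filter":
                    if not fn(item):
                        keep = False
                        break
                else:  # map
                    item = fn(item)
            if keep:
                yield item
```

Chaining `.filter(...).map(...)` builds up the op list; work happens only at iteration time, matching the "applied lazily during iteration" note above.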

      ordered

      -
      Dataset.ordered(batch_size=None)
      +
      Dataset.ordered(batch_size=None)

      Iterate over the dataset in order.

      -
      -

      Parameters

      +
      +

      Parameters

      @@ -510,8 +715,8 @@

      -

      Returns

      +
      +

      Returns

      @@ -544,20 +749,223 @@

      -
      -

      Examples

      -
      >>> for sample in ds.ordered():
      -...     process(sample)  # sample is ST
      ->>> for batch in ds.ordered(batch_size=32):
      -...     process(batch)  # batch is SampleBatch[ST]
      +
      +

      Examples

      +
      >>> for sample in ds.ordered():
      +...     process(sample)  # sample is ST
      +>>> for batch in ds.ordered(batch_size=32):
      +...     process(batch)  # batch is SampleBatch[ST]
      +
      +
      +
      +

      process_shards

      +
      Dataset.process_shards(fn, *, shards=None)
      +

      Process each shard independently, collecting per-shard results.

      +

Unlike map() (which is lazy and per-sample), this method eagerly processes each shard in turn, calling fn with the full list of samples from that shard. If some shards fail, raises PartialFailureError containing both the successful results and the per-shard errors.

      +
      +

      Parameters

      NameTypeDescriptionDefault
      fnCallable[[list[ST]], Any]Function receiving a list of samples from one shard and returning an arbitrary result.required
      shardslist[str] | NoneOptional list of shard identifiers to process. If None, processes all shards in the dataset. Useful for retrying only the failed shards from a previous PartialFailureError.None
      +
      +
      +

      Returns

      NameTypeDescription
      dict[str, Any]Dict mapping shard identifier to fn’s return value for each shard.
      +
      +
      +

      Raises

      NameTypeDescription
      PartialFailureErrorIf at least one shard fails. The exception carries .succeeded_shards, .failed_shards, .errors, and .results for inspection and retry.
      +
      +
      +

      Examples

      +
      >>> results = ds.process_shards(lambda samples: len(samples))
      +>>> # On partial failure, retry just the failed shards:
      +>>> try:
      +...     results = ds.process_shards(expensive_fn)
      +... except PartialFailureError as e:
      +...     retry = ds.process_shards(expensive_fn, shards=e.failed_shards)
      +
      +
      +
      +
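The collect-then-raise pattern described above can be sketched in plain Python. `PartialFailure` and this `process_shards` are hypothetical stand-ins for the library's PartialFailureError and method, not its actual code:

```python
class PartialFailure(Exception):
    """Carries both the successful results and the per-shard errors,
    so callers can inspect and retry only what failed."""

    def __init__(self, results, errors):
        super().__init__(f"{len(errors)} shard(s) failed")
        self.results = results                  # shard_id -> fn result
        self.errors = errors                    # shard_id -> exception
        self.succeeded_shards = list(results)
        self.failed_shards = list(errors)

def process_shards(shard_samples, fn, shards=None):
    """Eagerly apply fn to each shard's full sample list."""
    results, errors = {}, {}
    for shard_id in (shards or list(shard_samples)):
        try:
            results[shard_id] = fn(shard_samples[shard_id])
        except Exception as exc:    # collect the error, keep going
            errors[shard_id] = exc
    if errors:
        raise PartialFailure(results, errors)
    return results
```

Retrying is then just calling again with `shards=e.failed_shards`, as in the Examples above.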

      query

      +
      Dataset.query(where)
      +

      Query this dataset using per-shard manifest metadata.

      +

      Requires manifests to have been generated during shard writing. Discovers manifest files alongside the tar shards, loads them, and executes a two-phase query (shard-level aggregate pruning, then sample-level parquet filtering).

      +
      +

      Parameters

      NameTypeDescriptionDefault
      whereCallable[[pd.DataFrame], pd.Series]Predicate function that receives a pandas DataFrame of manifest fields and returns a boolean Series selecting matching rows.required
      +
      +
      +

      Returns

      NameTypeDescription
      list[SampleLocation]List of SampleLocation for matching samples.
      +
      +
      +

      Raises

      NameTypeDescription
      FileNotFoundErrorIf no manifest files are found alongside shards.
      +
      +
      +

      Examples

      +
      >>> locs = ds.query(where=lambda df: df["confidence"] > 0.9)
      +>>> len(locs)
      +42
      +
      +
      +
      +
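The two-phase query (shard-level aggregate pruning, then sample-level filtering) can be sketched without parquet or pandas. The manifest shape below is illustrative only; in the real method manifests are parquet files and the predicate receives a DataFrame:

```python
def query(manifests, min_confidence):
    """Two-phase manifest query, dependency-free sketch.

    Phase 1 prunes whole shards using a per-shard aggregate;
    phase 2 filters individual samples only in surviving shards.
    """
    locations = []
    for shard_id, manifest in manifests.items():
        # Phase 1: skip shards whose best value cannot match.
        if manifest["max_confidence"] <= min_confidence:
            continue
        # Phase 2: per-sample filtering within the shard.
        for key, conf in manifest["samples"]:
            if conf > min_confidence:
                locations.append((shard_id, key))
    return locations
```

The payoff is that pruned shards are never opened, which is what makes manifest queries cheap on large datasets.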

      select

      +
      Dataset.select(indices)
      +

      Return samples at the given integer indices.

      +

      Iterates through the dataset in order and collects samples whose positional index matches. This is O(n) for streaming datasets.

      +
      +

      Parameters

      NameTypeDescriptionDefault
      indicesSequence[int]Sequence of zero-based indices to select.required
      +
      +
      +

      Returns

      NameTypeDescription
      list[ST]List of samples at the requested positions, in index order.
      +
      +
      +

      Examples

      +
      >>> samples = ds.select([0, 5, 10])
      +>>> len(samples)
      +3

      shuffled

      -
      Dataset.shuffled(buffer_shards=100, buffer_samples=10000, batch_size=None)
      +
      Dataset.shuffled(buffer_shards=100, buffer_samples=10000, batch_size=None)

      Iterate over the dataset in random order.

      -
      -

      Parameters

      +
      +

      Parameters

      @@ -589,8 +997,8 @@

      -

      Returns

      +
      +

      Returns

      @@ -623,21 +1031,20 @@

      -
      -

      Examples

      -
      >>> for sample in ds.shuffled():
      -...     process(sample)  # sample is ST
      ->>> for batch in ds.shuffled(batch_size=32):
      -...     process(batch)  # batch is SampleBatch[ST]
      +
      +

      Examples

      +
      >>> for sample in ds.shuffled():
      +...     process(sample)  # sample is ST
      +>>> for batch in ds.shuffled(batch_size=32):
      +...     process(batch)  # batch is SampleBatch[ST]
      -
      -

      to_parquet

      -
      Dataset.to_parquet(path, sample_map=None, maxcount=None, **kwargs)
      -

      Export dataset contents to parquet format.

      -

      Converts all samples to a pandas DataFrame and saves to parquet file(s). Useful for interoperability with data analysis tools.

      -
      -

      Parameters

      +
      +

      to_dict

      +
      Dataset.to_dict(limit=None)
      +

      Materialize the dataset as a column-oriented dictionary.

      +
      +

      Parameters

      @@ -649,56 +1056,57 @@

      -

-pathPathlikeOutput path for the parquet file. If maxcount is specified, files are named {stem}-{segment:06d}.parquet.required
-sample_mapOptional[SampleExportMap]Optional function to convert samples to dictionaries. Defaults to dataclasses.asdict.None
+limitint | NoneMaximum number of samples to include. None means all.None
      +
      +
      +

      Returns

NameTypeDescription
-maxcountOptional[int]If specified, split output into multiple files with at most this many samples each. Recommended for large datasets.None
-**kwargs Additional arguments passed to pandas.DataFrame.to_parquet(). Common options include compression, index, engine.{}
+dict[str, list[Any]]Dictionary mapping field names to lists of values (one entry per sample).

      Warning

      -

      Memory Usage: When maxcount=None (default), this method loads the entire dataset into memory as a pandas DataFrame before writing. For large datasets, this can cause memory exhaustion.

      -

      For datasets larger than available RAM, always specify maxcount::

      -
      # Safe for large datasets - processes in chunks
      -ds.to_parquet("output.parquet", maxcount=10000)
      -

      This creates multiple parquet files: output-000000.parquet, output-000001.parquet, etc.

      +

      With limit=None this loads the entire dataset into memory.

      -
      -

      Examples

      -
      >>> ds = Dataset[MySample]("data.tar")
      ->>> # Small dataset - load all at once
      ->>> ds.to_parquet("output.parquet")
      ->>>
      ->>> # Large dataset - process in chunks
      ->>> ds.to_parquet("output.parquet", maxcount=50000)
      +
      +

      Examples

      +
      >>> d = ds.to_dict(limit=10)
      +>>> d.keys()
      +dict_keys(['name', 'embedding'])
      +>>> len(d['name'])
      +10
      -
      -
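Column-oriented materialization amounts to transposing rows into per-field lists. A minimal sketch, with a hypothetical dataclass sample type standing in for a real Packable sample:

```python
from dataclasses import dataclass, fields
from itertools import islice

@dataclass
class Sample:           # hypothetical sample type, for illustration
    name: str
    value: float

def to_dict(samples, limit=None):
    """Column-oriented materialization: one list per field."""
    rows = list(islice(samples, limit))   # limit=None takes everything
    names = [f.name for f in fields(Sample)]
    return {n: [getattr(r, n) for r in rows] for n in names}
```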

      wrap

      -
      Dataset.wrap(sample)
      -

      Wrap a raw msgpack sample into the appropriate dataset-specific type.

      -
      -

      Parameters

      +
      +

      to_pandas

      +
      Dataset.to_pandas(limit=None)
      +

      Materialize the dataset (or first limit samples) as a DataFrame.

      +
      +

      Parameters

      @@ -710,16 +1118,16 @@

      -

NameTypeDescriptionDefault
-sampleWDSRawSampleA dictionary containing at minimum a 'msgpack' key with serialized sample bytes.required
+limitint | NoneMaximum number of samples to include. None means all samples (may use significant memory for large datasets).None
      -
      -

      Returns

      +
      +

      Returns

      @@ -731,24 +1139,34 @@

-STA deserialized sample of type ST, optionally transformed through a lens if as_type() was called.
+pd.DataFrameA pandas DataFrame with one row per sample and columns matching the sample fields.
      +
      +

      Warning

      +

      With limit=None this loads the entire dataset into memory.

      -
      -

      wrap_batch

      -
      Dataset.wrap_batch(batch)
      -

      Wrap a batch of raw msgpack samples into a typed SampleBatch.

      -
      -

      Parameters

      +
      +

      Examples

      +
      >>> df = ds.to_pandas(limit=100)
      +>>> df.columns.tolist()
      +['name', 'embedding']
      +
      +
      +
      +

      to_parquet

      +
      Dataset.to_parquet(path, sample_map=None, maxcount=None, **kwargs)
      +

      Export dataset to parquet file(s).

      +
      +

      Parameters

      @@ -760,47 +1178,51 @@

      -

NameTypeDescriptionDefault
-batchWDSRawBatchA dictionary containing a 'msgpack' key with a list of serialized sample bytes.required
+pathPathlikeOutput path. With maxcount, files are named {stem}-{segment:06d}.parquet.required
      -
      -
      -

      Returns

NameTypeDescriptionDefault
-SampleBatch[ST]A SampleBatch[ST] containing deserialized samples, optionally transformed through a lens if as_type() was called.
+sample_mapOptional[SampleExportMap]Convert sample to dict. Defaults to dataclasses.asdict.None
+maxcountOptional[int]Split into files of at most this many samples. Without it, the entire dataset is loaded into memory.None
+**kwargs Passed to pandas.DataFrame.to_parquet().{}
      -
      -

      Note

      -

      This implementation deserializes samples one at a time, then aggregates them into a batch.

      +
      +

      Examples

      +
      >>> ds.to_parquet("output.parquet", maxcount=50000)
      +
      +
      +
      +
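The maxcount chunking and the {stem}-{segment:06d}.parquet naming scheme can be sketched generically; the actual parquet write is stubbed out here, so this shows only how the stream is split and named:

```python
from itertools import islice

def segment_paths(stem, samples, maxcount):
    """Split a sample stream into maxcount-sized segments and compute
    the output filename for each, without buffering the whole dataset.
    Returns (filename, segment_size) pairs; writing is stubbed out.
    """
    it = iter(samples)
    segment, out = 0, []
    while True:
        chunk = list(islice(it, maxcount))   # at most maxcount samples
        if not chunk:
            break
        out.append((f"{stem}-{segment:06d}.parquet", len(chunk)))
        segment += 1
    return out
```

Only one chunk is ever materialized at a time, which is why the docs recommend maxcount for datasets larger than RAM.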

      wrap

      +
      Dataset.wrap(sample)
      +

      Deserialize a raw WDS sample dict into type ST.

      +
      +
      +

      wrap_batch

      +
      Dataset.wrap_batch(batch)
      +

      Deserialize a raw WDS batch dict into SampleBatch[ST].

      -