release: atdata v0.3.0b1 #44

Merged: maxine-at-forecast merged 42 commits into foundation-ac:main from forecast-bio:release/v0.3.0b1 on Jan 31, 2026

@maxine-at-forecast
Collaborator

Summary

Release v0.3.0b1 — a major feature release with infrastructure improvements across the board:

  • Per-shard manifest and query system (GH#35): ManifestField annotations, ManifestBuilder, ShardManifest, ManifestWriter (JSON + parquet), QueryExecutor, SampleLocation, and Dataset.query() for two-phase shard-level filtering
  • CLI migration to typer (GH#38): Replace argparse with typer for declarative commands; add atdata inspect, schema show/diff, and preview commands
  • Production hardening (GH#39): atdata.configure_logging for structured logging, PartialFailureError + Dataset.process_shards() for shard-level error handling with retry, atdata.testing module with mock clients and fixtures
  • Local index refactoring: Split 1955-line local.py monolith into local/ package (_entry, _schema, _s3, _index, _repo_legacy); remove LocalIndex factory in favor of Index(provider="sqlite"); consolidate string-based provider selection into Index.__init__
  • Pluggable storage providers: SQLite (default, zero-dependency), Redis, PostgreSQL backends for Index; Repository system with prefix routing
  • Type system improvements: Migrate type bounds from PackableSample to Packable protocol
  • Performance: Compact struct-based array serialization (replaces np.save/np.load), fix numpy_dtype_to_string longest-match ordering
  • Benchmark suite: pytest-benchmark integration with per-category markers, HTML reports via render_report.py, CI benchmark job
  • Test coverage: New tests for CLI, postgres provider, query, repository, stub manager, type utils; strengthen weak assertions
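
As a rough illustration of the shard-level error handling listed under production hardening, here is a minimal sketch of a retry loop that collects per-shard failures. The names `PartialFailure`, `process_shards`, and `max_retries` are hypothetical stand-ins chosen for this example; they are not the signatures of `PartialFailureError` or `Dataset.process_shards()` in atdata.

```python
class PartialFailure(Exception):
    """Hypothetical stand-in: some shards succeeded, others did not."""

    def __init__(self, results, failures):
        super().__init__(f"{len(failures)} shard(s) failed")
        self.results = results    # shard -> value, for shards that succeeded
        self.failures = failures  # shard -> last exception, for shards that failed


def process_shards(shards, fn, max_retries=2):
    """Apply fn to each shard, retrying a failing shard up to max_retries times."""
    results, failures = {}, {}
    for shard in shards:
        for _attempt in range(max_retries + 1):
            try:
                results[shard] = fn(shard)
                failures.pop(shard, None)  # a later attempt succeeded
                break
            except Exception as exc:  # sketch only; real code should catch narrower types
                failures[shard] = exc
    if failures:
        raise PartialFailure(results, failures)
    return results
```

Raising one exception that carries both the successful results and the failures lets a caller keep the completed work and re-run only the shards that failed.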

Breaking changes

  • LocalIndex() factory removed — use Index(provider="sqlite") or Index(redis=conn) directly
  • local.py is now local/ package (import paths unchanged via __init__.py facade)

Test plan

  • uv run pytest — 823 passed, 33 skipped, 0 failed
  • just lint — no ruff errors
  • CI pipeline (lint + test matrix)
  • Benchmark suite (just bench)

🤖 Generated with Claude Code

maxinelevesque and others added 30 commits January 28, 2026 13:43
…hods to type checkers

The decorator now returns type[PackableSample] instead of type[_T].
Combined with @dataclass_transform(), this allows IDEs to recognize both:
- Original class fields (via dataclass_transform)
- PackableSample methods: packed, as_wds, from_bytes, from_data
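
A minimal sketch of the typing pattern described here, under stated assumptions: `PackableBase` and `packable` are hypothetical names, and the body of `packed` is a placeholder rather than atdata's serialization. The point is only that a decorator marked with `dataclass_transform()` and annotated to return the base type exposes both the fields and the inherited methods to a type checker.

```python
from dataclasses import dataclass

try:
    from typing import dataclass_transform  # Python 3.11+
except ImportError:  # older interpreters: a no-op fallback for this sketch
    def dataclass_transform(**_kwargs):
        return lambda obj: obj


class PackableBase:
    """Hypothetical base class supplying the shared methods."""

    def packed(self) -> bytes:
        # Placeholder encoding; not atdata's wire format.
        return repr(self).encode()


@dataclass_transform()
def packable(cls: type) -> type[PackableBase]:
    """Turn cls into a dataclass and mix in PackableBase's methods."""
    base = dataclass(cls)
    # The return annotation tells type checkers the result has PackableBase methods.
    return type(cls.__name__, (base, PackableBase), {})


@packable
class Sample:
    label: str
    score: float


sample = Sample("a", 1.0)
payload = sample.packed()
```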

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add test and lint recipes to justfile, update CLAUDE.md to document
all available just commands, and regenerate docs with updated quarto
theme/styling.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…reSQL, Redis)

Refactor Index class to delegate persistence to an IndexProvider protocol.
Extract existing Redis logic into RedisProvider and add SqliteProvider and
PostgresProvider backends. LocalIndex is now a factory function that selects
the backend by name. Adds optional `psycopg[binary]` dependency, provider
tests, and updated changelog/docs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Consolidate duplicated _parse_semver into _type_utils.py, replace bare
except clauses and assert statements with specific exceptions, tighten
generic pytest.raises(Exception) to exact types, and convert TODO
comments to explanatory notes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… singleton

- Introduce Repository dataclass and _AtmosphereBackend in new
  repository.py module
- Extend Index with repos param for named repositories and atmosphere
  param for ATProto backend with lazy anonymous client default
- Add _resolve_prefix routing for @handle/dataset, atdata:// URIs,
  and repo/name prefixed references
- Add get_default_index/set_default_index singleton so load_dataset
  no longer requires an explicit index argument
- Deprecate AtmosphereIndex in favour of Index(atmosphere=client)
- Update exports, tests, and CHANGELOG
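
The prefix routing in the list above can be pictured with a small sketch. This is an assumption-laden illustration: `resolve_prefix`, the `"atmosphere"` and `"local"` backend labels, and the choice to raise on an unknown repository are all hypothetical and do not describe the behaviour of `_resolve_prefix` itself.

```python
def resolve_prefix(ref, repos):
    """Map a dataset reference to a (backend, remainder) pair.

    repos is the collection of configured repository names (hypothetical).
    """
    if ref.startswith("atdata://"):
        return "atmosphere", ref[len("atdata://"):]
    if ref.startswith("@"):
        handle, sep, name = ref[1:].partition("/")
        if not sep or not name:
            raise ValueError(f"expected @handle/dataset, got {ref!r}")
        return "atmosphere", f"{handle}/{name}"
    prefix, sep, name = ref.partition("/")
    if sep:
        if prefix not in repos:
            raise KeyError(f"unknown repository: {prefix!r}")
        return prefix, name
    # No prefix at all: treat the reference as a plain local dataset name.
    return "local", ref
```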

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…endency setup

Index() now uses SqliteProvider when no provider, redis connection, or
Redis kwargs are given. Explicit redis= or **kwargs still select Redis
for backwards compatibility.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove backwards-compat dict-access methods from SchemaField and
  LocalSchemaRecord (unused since dataclass migration)
- Consolidate add_entry to delegate to _insert_dataset_to_provider,
  eliminating duplicate entry-creation logic
- Trim over-verbose module and class docstrings in local.py
- Narrow pytest.raises to exact IndexError for batch_size=0 test
- Add test coverage for prefix routing edge cases and error paths
  (atmosphere disabled, unknown repo, @handle without slash)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…chy (GH#38)

- Add _exceptions.py with AtdataError, LensNotFoundError, SchemaError,
  SampleKeyError, ShardError
- Add Dataset convenience API: __iter__, __len__, head, get, describe,
  schema, column_names, filter, map, select, to_pandas, to_dict
- Wire filter/map into ordered() and shuffled() via _post_wrap_stages
- LensNetwork.get_lens now raises LensNotFoundError with available targets
- Export new exceptions from atdata.__init__

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add atdata inspect: dataset summary with sample count, schema, shards
- Add atdata schema show/diff: display and compare dataset schemas
- Add atdata preview: print first N samples from a dataset
- Make LensNotFoundError inherit ValueError for backwards compatibility
- Update lens error message and corresponding test assertions
- Add test_dev_experience.py for new Dataset convenience methods
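
To show why inheriting from ValueError preserves backwards compatibility, here is a minimal sketch. The class bodies, the message text, and the `get_lens` helper are assumptions made for illustration; only the class names and the inheritance relationship come from the commit messages.

```python
class AtdataError(Exception):
    """Sketch of a library-wide base exception."""


class LensNotFoundError(AtdataError, ValueError):
    """Sketch: both an AtdataError and a ValueError."""

    def __init__(self, target, available):
        super().__init__(
            f"no lens to {target!r}; available targets: {sorted(available)}"
        )
        self.target = target
        self.available = list(available)


def get_lens(registry, target):
    """Hypothetical lookup that reports the available targets on failure."""
    try:
        return registry[target]
    except KeyError:
        raise LensNotFoundError(target, registry.keys()) from None
```

Callers that already wrap the lookup in `except ValueError` keep working, while newer code can catch `LensNotFoundError` or `AtdataError` specifically.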

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…n (GH#38)

Replace argparse-based CLI with typer for declarative command definitions,
automatic help generation, and better subcommand support. Fix bug where
get_schema() passed a LocalSchemaRecord to ensure_stub() instead of a
plain dict, causing silent stub generation failures.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…used modules

Decompose the 1955-line local.py into a local/ package with dedicated
modules for entry types (_entry.py), schema models (_schema.py), S3
storage (_s3.py), the Index class (_index.py), and the deprecated Repo
class (_repo_legacy.py). The __init__.py facade re-exports all public
names to preserve backward compatibility.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ction into Index

Absorb LocalIndex's string-based provider selection (sqlite/redis/postgres)
directly into Index.__init__ via new `provider: str`, `path`, and `dsn`
parameters. Remove the LocalIndex factory function and update all
references across source, docstrings, and tests.
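
A minimal sketch of string-based provider selection follows. The `provider`, `path`, and `dsn` parameter names come from the commit message; everything else (the `make_provider` helper, the dataclass bodies, the default path and URL) is a hypothetical simplification and not the code in `Index.__init__`.

```python
from dataclasses import dataclass


@dataclass
class SqliteProvider:
    path: str = "index.db"  # hypothetical default location


@dataclass
class PostgresProvider:
    dsn: str


@dataclass
class RedisProvider:
    url: str = "redis://localhost:6379"  # hypothetical default


def make_provider(provider="sqlite", *, path=None, dsn=None):
    """Select a backend by name, defaulting to the zero-dependency SQLite one."""
    if provider == "sqlite":
        return SqliteProvider(path) if path is not None else SqliteProvider()
    if provider == "postgres":
        if dsn is None:
            raise ValueError("provider='postgres' requires dsn")
        return PostgresProvider(dsn)
    if provider == "redis":
        return RedisProvider()
    raise ValueError(f"unknown provider: {provider!r}")
```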

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace TypeVar bounds and type annotations that reference the concrete
PackableSample class with the Packable protocol across dataset, schema
codec, atmosphere, local index, HF API, and tests. This decouples
generic type parameters from the implementation class.
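
As an illustration of bounding a TypeVar by a protocol instead of a concrete class, consider the sketch below. The protocol body here is a hypothetical one-method reduction and `pack_all` and `Point` are invented for the example; the actual `Packable` protocol in atdata may require more.

```python
from dataclasses import dataclass
from typing import Protocol, TypeVar, runtime_checkable


@runtime_checkable
class Packable(Protocol):
    """Hypothetical minimal protocol: anything that can serialize itself."""

    def packed(self) -> bytes: ...


T = TypeVar("T", bound=Packable)


def pack_all(samples: list[T]) -> list[bytes]:
    """Accept any objects that structurally satisfy Packable."""
    return [s.packed() for s in samples]


@dataclass
class Point:
    # Note: Point does not inherit from any library base class.
    x: int
    y: int

    def packed(self) -> bytes:
        return f"{self.x},{self.y}".encode()
```

Because the bound is structural, `Point` qualifies without subclassing anything, which is the decoupling the commit describes.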

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Introduce ManifestField annotation, ManifestBuilder, ShardManifest data
model, ManifestWriter (JSON + parquet), QueryExecutor, and SampleLocation.
Add Dataset.query() for two-phase manifest-based sample lookup. Integrate
manifest generation into S3DataStore.write_shards() with optional
manifest=True flag. Export all public types from atdata.__init__.
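
The two-phase idea can be sketched in a few lines: first prune shards using per-shard summary statistics, then scan only the survivors. The `ShardStats` type, the min/max summary, and both function names are assumptions for illustration; they do not reproduce `ShardManifest`, `QueryExecutor`, or `Dataset.query()`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardStats:
    """Hypothetical per-shard summary: min and max of one numeric field."""
    shard: str
    lo: float
    hi: float


def candidate_shards(manifests, lo, hi):
    """Phase 1: keep shards whose value range overlaps the query range."""
    return [m.shard for m in manifests if m.hi >= lo and m.lo <= hi]


def query(manifests, shards, lo, hi):
    """Phase 2: scan samples only inside the surviving shards."""
    hits = []
    for name in candidate_shards(manifests, lo, hi):
        for offset, value in enumerate(shards[name]):
            if lo <= value <= hi:
                hits.append((name, offset))  # (shard, offset) locates a sample
    return hits


# Toy data: each shard is just a list of values for the indexed field.
shards = {"shard-000": [0.1, 0.4], "shard-001": [2.0, 3.5], "shard-002": [5.0, 9.0]}
manifests = [ShardStats(n, min(v), max(v)) for n, v in shards.items()]
```

With this toy data, a query for values between 3.0 and 6.0 never opens `shard-000`, since its summary range cannot overlap the query.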

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add benchmarks/ package covering dataset I/O, index providers, query
execution, and atmosphere operations. Include shared fixtures in
conftest.py, pytest-benchmark dev dependency, and justfile commands
(bench, bench-save, bench-compare) for running and comparing results.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…y, and stub manager

Add dedicated test modules for previously low-coverage areas:
test_cli.py, test_postgres_provider.py, test_query_coverage.py,
test_repository_coverage.py, and test_stub_manager.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…I job

Introduce per-category pytest markers (bench_serial, bench_index,
bench_io, bench_query, bench_s3) with separate JSON exports. Add
render_report.py for HTML report generation with median/IQR stats.
Update justfile with per-category bench commands and report step.
Add benchmark CI job to uv-test.yml. Use realistic data shapes
(ImageNet uint8, timeseries float32) in conftest constants.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add TestQueryIterationBenchmarks with equality, range, and large result
set iteration benchmarks measuring end-to-end query-and-iterate cost.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…mples

Drop unused imports (PackableSample, Optional, dataclasses, tqdm) from
_hf_api, cli, and dataset modules. Shorten verbose docstrings in
DictSample and Dataset to one-liners. Consolidate duplicate sample
types across test files.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… dtype matching, strengthen assertions

Replace np.save/np.load with compact struct-based binary format in
_helpers.py (backward-compatible with legacy .npy deserialization).
Strip redundant Protocol method docstrings in _protocols.py. Fix
numpy_dtype_to_string to prefer exact match and sort substring keys
longest-first to avoid "int8"/"uint8" ambiguity. Strengthen weak
`assert X is not None` test assertions to verify actual values.
Add test_type_utils.py for _type_utils coverage.
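
The "int8"/"uint8" ambiguity arises because "int8" is a substring of "uint8", so a first-match substring scan can pick the wrong name. The sketch below reproduces the problem and the exact-match-then-longest-first fix using hypothetical helper names and a shortened dtype list; it is not the code of `numpy_dtype_to_string`.

```python
KNOWN = ("int8", "uint8", "int16", "uint16", "float32", "float64")


def naive_match(text):
    """First substring hit wins, so 'int8' shadows 'uint8'."""
    for key in KNOWN:
        if key in text:
            return key
    raise ValueError(f"unrecognised dtype: {text!r}")


def longest_match(text):
    """Prefer an exact match, then try substring keys longest-first."""
    if text in KNOWN:
        return text
    for key in sorted(KNOWN, key=len, reverse=True):
        if key in text:
            return key
    raise ValueError(f"unrecognised dtype: {text!r}")


print(naive_match("dtype('uint8')"))    # int8  (wrong)
print(longest_match("dtype('uint8')"))  # uint8
```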

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
github-actions bot and others added 6 commits January 31, 2026 21:40
Remove release/* from push triggers and branch filter from pull_request
to avoid double runs when PRs target main.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>