Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog.

[Unreleased]

[0.2.2b1] - 2026-01-28

Added

  • Blob storage for atmosphere datasets: Full support for storing dataset shards as ATProto blobs via PDS
    • DatasetPublisher.publish_with_blobs() for uploading shards as blobs
    • DatasetLoader.get_blobs() and get_blob_urls() for retrieval
    • AtmosphereClient.upload_blob() and get_blob() wrappers
  • HuggingFace-style API: load_dataset() function with path resolution, split handling, and streaming support
    • WebDataset brace notation, glob patterns, local directories, remote URLs
    • DatasetDict class for multi-split datasets
    • @handle/dataset path resolution via atmosphere index
  • Protocol-based architecture: Abstract protocols for backend interoperability
    • IndexEntry, AbstractIndex, AbstractDataStore protocols
    • Enables polymorphic code across local and atmosphere backends
  • Local to atmosphere promotion: promote_to_atmosphere() workflow with schema deduplication
  • Quarto documentation site: Tutorials, reference docs, and API reference at docs/
  • Comprehensive integration test suite: 593 tests covering E2E flows, error handling, edge cases
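The protocol-based architecture listed above can be illustrated with a minimal sketch built on `typing.Protocol`. Only the names `IndexEntry` and `AbstractIndex` come from this changelog; the attributes (`name`, `data_urls`), the `get_dataset` method, and the `MemoryEntry`/`MemoryIndex`/`shard_count` helpers are hypothetical and exist only to show how one function can work against any backend that matches the protocol structurally.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterator, Protocol


class IndexEntry(Protocol):
    """Assumed minimal entry shape: a dataset name plus its shard URLs."""

    name: str
    data_urls: list[str]


class AbstractIndex(Protocol):
    """Assumed index surface; the real protocol may expose different methods."""

    def get_dataset(self, name: str) -> IndexEntry: ...

    def list_datasets(self) -> Iterator[IndexEntry]: ...


@dataclass
class MemoryEntry:
    """Hypothetical in-memory entry that satisfies IndexEntry structurally."""

    name: str
    data_urls: list[str]


class MemoryIndex:
    """Hypothetical backend; it never inherits from AbstractIndex."""

    def __init__(self) -> None:
        self._entries: dict[str, MemoryEntry] = {}

    def add(self, entry: MemoryEntry) -> None:
        self._entries[entry.name] = entry

    def get_dataset(self, name: str) -> MemoryEntry:
        return self._entries[name]

    def list_datasets(self) -> Iterator[MemoryEntry]:
        return iter(self._entries.values())


def shard_count(index: AbstractIndex, name: str) -> int:
    """Works with any backend that provides the protocol's methods."""
    return len(index.get_dataset(name).data_urls)
```

Because the check is structural, a local backend and an atmosphere backend could both be passed to `shard_count` without sharing a base class, which is the kind of polymorphism the entry describes.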

Changed

  • Investigate upload-artifact not finding benchmark output (#512)
  • Fix duplicate CI runs for push+PR overlap (#511)
  • Scope contents:write permission to benchmark job only (#510)
  • Add benchmark docs auto-commit to CI workflow (#509)
  • Submit PR for v0.3.0b1 release to upstream/main (#508)
  • Implement GH#39: Production hardening (observability, error handling, testing infra) (#504)
  • Add pluggable structured logging via atdata.configure_logging (#507)
  • Add PartialFailureError and shard-level error handling to Dataset.map (#506)
  • Add atdata.testing module with mock clients, fixtures, and helpers (#505)
  • Fix CI linting failures (20 ruff errors) (#503)
  • Adversarial review: Post-benchmark suite assessment (#494)
  • Remove redundant protocol docstrings that restate signatures (#500)
  • Add missing unit tests for _type_utils.py (#499)
  • Strengthen weak assertions (assert X is not None → value checks) (#498)
  • Trim verbose exception constructor docstrings (#501)
  • Analyze benchmark results for performance improvement opportunities (#502)
  • Consolidate remaining duplicate sample types in test files (#497)
  • Remove dead code: _repo_legacy.py legacy UUID field, unused imports (#496)
  • Trim verbose docstrings in dataset.py and _index.py (#495)
  • Benchmark report: replace mean/stddev with median/IQR, add per-sample columns (#492)
  • Add parameter descriptions to benchmark suite with automatic report introspection (#491)
  • HTML benchmark reports with CI integration (#487)
  • Add bench + render step to CI on highest Python version only (#490)
  • Update justfile bench commands to export JSON and render (#489)
  • Create render_report.py script to convert JSON to HTML (#488)
  • Increase test coverage for low-coverage modules (#480)
  • Add providers/_postgres.py tests (mock-based) (#485)
  • Add _stub_manager.py tests (#484)
  • Add manifest/_query.py tests (#483)
  • Add repository.py tests (#482)
  • Add CLI tests (cli/init, diagnose, local, preview, schema) (#481)
  • Check test coverage for CLI utils (#479)
  • Add performance benchmark suite for atdata (#471)
  • Verify benchmarks run (#478)
  • Update pyproject.toml and justfile (#477)
  • Create bench_atmosphere.py (#476)
  • Create bench_query.py (#475)
  • Create bench_dataset_io.py (#474)
  • Create bench_index_providers.py (#473)
  • Create benchmarks/conftest.py with shared fixtures (#472)
  • Add per-shard manifest and query system (GH #35) (#462)
  • Write unit and integration tests (#470)
  • Integrate manifest into write path and Dataset.query() (#469)
  • Implement QueryExecutor and SampleLocation (#468)
  • Implement ManifestWriter (JSON + parquet) (#467)
  • Implement ManifestBuilder (#465)
  • Implement ShardManifest data model (#466)
  • Implement aggregate collectors (categorical, numeric, set) (#464)
  • Implement ManifestField annotation and resolve_manifest_fields() (#463)
  • Migrate type annotations from PackableSample to Packable protocol (#461)
  • Remove LocalIndex factory — consolidate to Index (#460)
  • Split local.py monolith into local/ package (#452)
  • Verify tests and lint pass (#459)
  • Create __init__.py re-export facade and delete local.py (#458)
  • Create _repo_legacy.py with deprecated Repo class (#457)
  • Create _index.py with Index class and LocalIndex factory (#456)
  • Create _s3.py with S3DataStore and S3 helpers (#455)
  • Create _schema.py with schema models and helpers (#454)
  • Create _entry.py with LocalDatasetEntry and constants (#453)
  • Migrate CLI from argparse to typer (#449)
  • Investigate test failures (#450)
  • Fix ensure_stub receiving LocalSchemaRecord instead of dict (#451)
  • GH#38: Developer experience improvements (#437)
  • CLI: atdata preview command (#440)
  • CLI: atdata schema show/diff commands (#439)
  • CLI: atdata inspect command (#438)
  • Dataset.__len__ and Dataset.select() for sample count and indexed access (#447)
  • Dataset.to_pandas() and Dataset.to_dict() export methods (#446)
  • Dataset.filter() and Dataset.map() streaming transforms (#445)
  • Dataset.get(key) for keyed sample access (#442)
  • Dataset.describe() summary statistics (#444)
  • Dataset.schema property and column_names (#443)
  • Dataset.head(n) and Dataset.__iter__ convenience methods (#441)
  • Custom exception hierarchy with actionable error messages (#448)
  • Adversarial review: Post-Repository consolidation assessment (#430)
  • Remove backwards-compat dict-access methods from SchemaField and LocalSchemaRecord (#436)
  • Add missing test coverage for Repository prefix routing edge cases and error paths (#435)
  • Trim over-verbose docstrings in local.py module/class level (#434)
  • Fix formally incorrect test assertions (batch_size, CID, brace notation) (#433)
  • Consolidate duplicate test sample types across test files into conftest.py (#432)
  • Consolidate duplicate entry-creation logic in Index (add_entry vs _insert_dataset_to_provider) (#431)
  • Switch default Index provider from Redis to SQLite (#429)
  • Consolidated Index with Repository system (#424)
  • Phase 4: Deprecate AtmosphereIndex, update exports (#428)
  • Phase 3: Default Index singleton and load_dataset integration (#427)
  • Phase 2: Extend Index with repos/atmosphere params and prefix routing (#426)
  • Phase 1: Create Repository dataclass and _AtmosphereBackend in repository.py (#425)
  • Adversarial review: Post-IndexProvider pluggable storage assessment (#417)
  • Convert TODO comments to tracked issues or remove (#422)
  • Remove deprecated shard_list property references from docstrings (#421)
  • Replace bare except in _stub_manager.py and cli/local.py with specific exceptions (#423)
  • Tighten generic pytest.raises(Exception) to specific exception types in tests (#420)
  • Replace assert statements with ValueError in production code (#419)
  • Consolidate duplicated _parse_semver into _type_utils.py (#418)
  • feat: Add SQLite/PostgreSQL index providers (GH #42) (#409)
  • Update documentation and public API exports (#416)
  • Add tests for all providers (#415)
  • Refactor Index class to accept provider parameter (#414)
  • Implement PostgresIndexProvider (#413)
  • Implement SqliteIndexProvider (#412)
  • Implement RedisIndexProvider (extract from Index class) (#411)
  • Define IndexProvider protocol in _protocols.py (#410)
  • Add just lint command to justfile (#408)
  • Add SQLite/PostgreSQL providers for LocalIndex (in addition to Redis) (#407)
  • Fix type hints for @atdata.packable decorator to show PackableSample methods (#406)
  • Review GitHub workflows and recommend CI improvements (#405)
  • Fix type signatures for Dataset.ordered and Dataset.shuffled (GH#28) (#404)
  • Investigate quartodoc Example section rendering - missing CSS classes on pre/code tags (#401)
  • Update all docstrings from Example: to Examples: format (#403)
  • Create GitHub issues for v0.3 roadmap feature domains (#402)
  • Expand Quarto documentation with architectural narrative (#395)
  • Expand atmosphere tutorial with federation context (#400)
  • Expand local-workflow tutorial with system narrative (#399)
  • Expand quickstart tutorial with design context (#398)
  • Expand index.qmd with architecture narrative (#397)
  • Add architecture overview page (reference/architecture.qmd) (#396)
  • Adversarial review: Post-PDSBlobStore comprehensive assessment (#389)
  • Remove deprecated shard_list property warnings if unused (#394)
  • Add test for Dataset iteration over empty tar file (#393)
  • Consolidate duplicate sample types in live atmosphere tests (#392)
  • Convert TODO comment in dataset.py to design note or remove (#391)
  • Remove redundant no-op statements in _stub_manager.py (#390)
  • Update atmosphere example with blob storage case (#216)
  • Implement PDSBlobStore for atmosphere data storage (#244)
  • Update docs and examples to include PDSBlobStore (#384)
  • Add API docs for PDSBlobStore and BlobSource (#388)
  • Update atmosphere_demo.py example (#387)
  • Update atmosphere reference docs (#386)
  • Update atmosphere tutorial with PDSBlobStore (#385)
  • Implement PDSBlobStore for ATProto blob storage (#380)
  • Add tests for PDSBlobStore and BlobSource (#383)
  • Add BlobSource for reading PDS blobs as DataSource (#382)
  • Create PDSBlobStore class in atmosphere module (#381)
  • Investigate Redis index entry expiration/reset issue (#376)
  • Audit codebase for xs/@property vs list_xs() convention (#377)
  • Evaluate PackableSample → Packable protocol migration (#375)
  • Fix load_dataset overload type hints for AbstractIndex (#379)
  • Fix load_dataset to use source-appropriate credentials (#378)
  • Review and plan human-review.md feedback items (#374)
  • Create v0.3 roadmap synthesis document (#373)
  • Document justfile in CLAUDE.md (#372)
  • Make docs script work from any directory (#371)
  • Add uv script shortcut 'docs' for documentation build (#370)
  • Update docstrings in local.py (#367)
  • Update docstrings in _protocols.py (#366)
  • Update docstrings in lens.py (#365)
  • Update docstrings in dataset.py (#364)
  • Review and address human-review.md feedback (#344)
  • Fix load_dataset overloads and AbstractIndex compatibility (#348)
  • Connect load_dataset to index data_store for S3 credentials (#361)
  • Fix load_dataset overload return types for DictSample (#360)
  • Add data_store to AbstractIndex protocol (#359)
  • Audit and fix xs/list_xs naming convention (#347)
  • Fix AtmosphereIndex: list_datasets/list_schemas return types (#357)
  • Refactor DataSource/Dataset: shards()/shard_list -> shards/list_shards() (#356)
  • Refactor local.py: entries/all_entries -> entries/list_entries (#355)
  • Update AbstractIndex protocol to match new naming convention (#358)
  • Investigate Redis index entry removal issue (#346)
  • Implement 'atdata diagnose' command for Redis health check (#354)
  • Implement 'atdata local up' command to run Redis + MinIO (#353)
  • Create atdata.cli module with entry point (#352)
  • Evaluate PackableSample → Packable protocol migration (#345)
  • Update publish_schema and other signatures to use Packable protocol (#351)
  • Update @packable decorator return type annotation (#350)
  • Define Packable protocol in _protocols.py (#349)
  • Review and update README for v0.2.2 release (#343)
  • Streamline Dataset API with DictSample default type (#338)
  • Add tests for DictSample and new API (#342)
  • Update load_dataset default type to DictSample (#341)
  • Update @packable to auto-register DictSample lens (#340)
  • Implement DictSample class with __getattr__ and __getitem__ (#339)
  • Fix failing tests in test_integration_error_handling.py (#337)
  • v0.2.2 beta release improvements (#326)
  • Document to_parquet() memory usage (#336)
  • Evaluate splitting local.py into modules (#335)
  • Add error path tests (timeouts, partial failures) (#334)
  • Add deployment guide to docs (#333)
  • Add troubleshooting/FAQ section to docs (#332)
  • Document __orig_class__ assumption in Dataset docstring (#331)
  • Centralize tar creation helper in test fixtures (#330)
  • Consolidate duplicate test sample types to conftest.py (#329)
  • Document expected filterwarnings in test suite (#328)
  • Complete truncated atmosphere.qmd documentation (#327)
  • Comprehensive v0.2.2 beta release review (#321)
  • Compile findings into .review/comprehensive-review.md (#325)
  • Review documentation website and examples (#324)
  • Review test suite coverage and quality (#323)
  • Review core codebase architecture and code quality (#322)
  • Human Review: Local Workflow API Improvements (#274)
  • Update documentation and examples for current codebase (#316)
  • Update README.md with current API (#320)
  • Update examples/*.py files for current API (#319)
  • Update reference/protocols.qmd with DataSource protocol (#318)
  • Update reference/datasets.qmd for DataSource API (#317)
  • Adversarial review: Post-DataSource refactor assessment (#307)
  • Clean up unused TypeAlias definitions in dataset.py (#315)
  • Remove verbose docstrings that restate function signatures (#314)
  • Consolidate schema reference parsing logic in local.py (#313)
  • Add error tests for corrupted msgpack data in Dataset.wrap() (#312)
  • Remove or implement skipped test_repo_insert_round_trip (#311)
  • Fix bare exception handlers in _stub_manager.py and _cid.py (#310)
  • Replace assertion with ValueError in lens.py input validation (#309)
  • Replace assertions with ValueError in dataset.py msgpack validation (#308)
  • Refactor Dataset to use DataSource abstraction (#299)
  • Research WebDataset streaming alternatives beyond HTTP/S URLs (#298)
  • Write tests for DataSource implementations (#306)
  • Update load_dataset to use DataSource (#305)
  • Update S3DataStore to create S3Source instances (#304)
  • Refactor Dataset to accept DataSource | str (#303)
  • Implement S3Source with boto3 streaming (#302)
  • Implement URLSource in new _sources.py module (#301)
  • Add DataSource protocol to _protocols.py (#300)
  • Fix S3 mock fixture regionname typo in tests (#297)
  • Human review feedback: API improvements from human-review-01 (#290)
  • AbstractIndex: Protocol vs subclass causing linting errors (#296)
  • load_dataset linting: no matching overloads error (#295)
  • @atdata.lens linting: LocalTextSample not recognized as PackableSample subclass (#291)
  • LocalDatasetEntry: underscore-prefixed attributes should be public (#294)
  • Default batch_size should be None for Dataset.ordered/shuffled (#292)
  • Improve SchemaNamespace typing for IDE support (#289)
  • Schema namespace API: index.load_schema() + index.schemas.MyType (#288)
  • Auto-typed get_schema/decode_schema return type (#287)
  • Improve decode_schema typing for IDE support (#286)
  • Fix stub filename collisions with authority-based namespacing (#285)
  • Auto-generate stubs on schema access (#281)
  • Add tests for auto-stub functionality (#284)
  • Integrate auto-stub into Index class (#283)
  • Add StubManager class for stub file management (#282)
  • Improve decoded_type dynamic typing/signatures (#279)
  • Document atdata URI specification (#280)
  • Create proper SampleSchema Python type (#278)
  • Fix @atdata.packable decorator class identity (#275)
  • Improve index.publish_schema API (#276)
  • Improve list_schemas API semantics (#277)
  • Architecture refactor: LocalIndex + S3DataStore composable pattern
    • LocalIndex now accepts optional data_store parameter
    • S3DataStore implements AbstractDataStore for S3 operations
  • Deprecated Repo class: Use LocalIndex(data_store=S3DataStore(...)) instead
    • Repo remains as thin backwards-compatibility wrapper with deprecation warning
  • Renamed BasicIndexEntry to LocalDatasetEntry with CID-based identity
  • Added ATProto-compatible CID generation via libipld
  • Performance improvements: cached sample_type property, precompiled regex patterns
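The two techniques named in the last entry, a cached property and a precompiled regex, might look roughly like the sketch below. It is not the project's code: the `ShardSpec` class, its `shards` property, and the simplified `{start..stop}` pattern are assumptions chosen to keep the example self-contained.

```python
import re
from functools import cached_property

# Compiled once at import time rather than on every call.
# Assumed, simplified form of brace notation: a single {start..stop} range.
_BRACE_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


class ShardSpec:
    """Hypothetical holder for a shard URL that may contain a brace range."""

    def __init__(self, url: str) -> None:
        self.url = url
        self.expansions = 0  # counts how often the expensive path runs

    @cached_property
    def shards(self) -> list:
        """Expand the brace range once; later accesses reuse the result."""
        self.expansions += 1
        match = _BRACE_RANGE.search(self.url)
        if match is None:
            return [self.url]
        start, stop = match.group(1), match.group(2)
        width = len(start)  # preserve zero padding, e.g. 000 -> 3 digits
        prefix, suffix = self.url[: match.start()], self.url[match.end():]
        return [
            prefix + str(i).zfill(width) + suffix
            for i in range(int(start), int(stop) + 1)
        ]
```

Accessing `ShardSpec("data-{000..002}.tar").shards` repeatedly runs the expansion only on the first access, since `cached_property` stores the result on the instance.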

Fixed

  • Dark theme styling for callouts and code blocks in Quarto docs
  • Browser chrome color updates on dark/light mode toggle

[0.2.0] - 2026-01-06

Added

  • Initial atmosphere module with ATProto integration
  • Schema, dataset, and lens publishing to ATProto PDS
  • AtmosphereClient for ATProto authentication and record management
  • AtmosphereIndex for querying published datasets and schemas
  • Dynamic sample type reconstruction from published schemas

Changed

  • Improved type hint coverage throughout codebase
  • Enhanced error messages for common failure modes

[0.1.0] - 2025-12-15

Added

  • Core PackableSample and @packable decorator for typed samples
  • Dataset[ST] generic typed dataset with WebDataset backend
  • SampleBatch[DT] with automatic attribute aggregation
  • Lens[S, V] bidirectional transformations
  • Local storage with Redis index and S3 data store
  • WebDataset tar file reading and writing
  • NumPy array serialization via msgpack
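As a rough illustration of what a bidirectional `Lens[S, V]` transformation means, the sketch below pairs a `get` function (source to view) with a `put` function (updated view written back into the source). The field names `get` and `put` and the `FullSample` and `LabelView` types are hypothetical; the library's actual decorator and signatures are not shown in this changelog.

```python
from dataclasses import dataclass, replace
from typing import Callable, Generic, TypeVar

S = TypeVar("S")  # source sample type
V = TypeVar("V")  # view sample type


@dataclass(frozen=True)
class Lens(Generic[S, V]):
    """Assumed shape of a lens: a getter plus a putter."""

    get: Callable[[S], V]
    put: Callable[[V, S], S]


@dataclass(frozen=True)
class FullSample:
    image_id: str
    label: int
    notes: str


@dataclass(frozen=True)
class LabelView:
    image_id: str
    label: int


# The view drops `notes`; `put` restores it from the original source.
label_lens = Lens(
    get=lambda s: LabelView(s.image_id, s.label),
    put=lambda v, s: replace(s, image_id=v.image_id, label=v.label),
)
```

A well-behaved lens round-trips: writing back an unmodified view, `label_lens.put(label_lens.get(sample), sample)`, yields a value equal to the original `sample`.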