Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog.

[Unreleased]

[0.7.0b1] - 2026-02-26

Added

  • Client-side AppView integration: Atmosphere client now supports XRPC queries (xrpc_query()) and procedures (xrpc_procedure()) routed through a configurable AppView service. Schema, lens, label, and record loaders automatically use AppView for listing, search, and resolution when available, falling back to client-side pagination otherwise. New has_appview property and AppViewRequiredError/AppViewUnavailableError exceptions for clean error handling (GH#74)
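
The prefer-AppView-then-fall-back behaviour described above can be pictured roughly as follows. This is a minimal sketch, not atdata's implementation: `StubClient`, `list_client_side()`, `list_schemas()`, the placeholder NSIDs, and the `xrpc_query(nsid, params)` signature are all assumptions made for illustration. Only the names `has_appview`, `xrpc_query`, and `AppViewUnavailableError` come from the entry.

```python
class AppViewUnavailableError(Exception):
    """Stand-in for the exception of the same name, so the sketch runs alone."""


class StubClient:
    """Hypothetical client stub, not atdata's Atmosphere class."""

    def __init__(self, has_appview, appview_healthy=True):
        self.has_appview = has_appview
        self._healthy = appview_healthy

    def xrpc_query(self, nsid, params=None):
        # Pretend to call the AppView; fail when it is unreachable.
        if not self._healthy:
            raise AppViewUnavailableError(nsid)
        return ["appview-record"]

    def list_client_side(self, collection):
        # Pretend to page through the repo with plain record listing.
        return ["client-side-record"]


def list_schemas(client, nsid="example.listSchemas", collection="example.schema"):
    """Prefer the AppView when configured, else use client-side listing."""
    if client.has_appview:
        try:
            return client.xrpc_query(nsid)
        except AppViewUnavailableError:
            pass  # fall through to the client-side path
    return client.list_client_side(collection)
```

With this shape, a client without an AppView and a client whose AppView is down both end up on the same client-side path.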

Fixed

  • Blob URL parameter bug: Fixed incorrect parameter passing in blob URL construction within atmosphere record publishing
  • Fallback logging: Improved diagnostic logging when AppView is unavailable and client-side fallback is used

[0.6.0b1] - 2026-02-22

Added

  • Dataset manifest support: Optional manifests property on dataset entry records for per-shard metadata references, with ShardManifestRef and LensCodeRef Python mirror types (GH#62)
  • Handle-based schema resolution: get_schema() and get_schema_type() now accept @handle/TypeName@version format, resolving schemas by handle + name + optional semver instead of requiring raw AT-URIs (GH#61)
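
To make the `@handle/TypeName@version` format concrete, here is one way such a reference could be split into its parts. The parser below is a hypothetical illustration and may differ from how atdata parses these strings. `SchemaRef`, `parse_schema_ref()`, and the `ImageSample` type name are invented for the example.

```python
from typing import NamedTuple, Optional


class SchemaRef(NamedTuple):
    """Hypothetical container for the three parts of a schema reference."""
    handle: str
    name: str
    version: Optional[str]


def parse_schema_ref(ref: str) -> SchemaRef:
    """Split '@handle/TypeName@version' (version optional) into its parts."""
    if not ref.startswith("@"):
        raise ValueError(f"expected a leading '@': {ref!r}")
    handle, sep, rest = ref[1:].partition("/")
    if not sep or not handle or not rest:
        raise ValueError(f"expected '@handle/TypeName': {ref!r}")
    # Anything after the second '@' is treated as the optional semver.
    name, _, version = rest.partition("@")
    if not name:
        raise ValueError(f"missing type name: {ref!r}")
    return SchemaRef(handle, name, version or None)
```

For example, `parse_schema_ref("@maxine.science/ImageSample@1.2.0")` yields the handle, the type name, and the version as separate fields, while omitting `@1.2.0` leaves `version` as `None`.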

Changed

  • ADR: v0.7.0b1 release review (#876)
  • Migrate test_local.py from deprecated add_entry() to insert_dataset() (#877)
  • AppView-aware loaders/publishers: SchemaLoader, LabelLoader, DatasetLoader, LensLoader and their corresponding publishers now prefer AppView XRPC endpoints when configured, with automatic graceful fallback to client-side com.atproto.repo workarounds (GH#50)
  • load_dataset() atmosphere parameter: New optional atmosphere kwarg passes an Atmosphere client (and its AppView) through to AT URI resolution (GH#50)
  • AbstractIndex deprecation: AbstractIndex protocol is deprecated in favor of direct Index usage; a backward-compatible __getattr__ shim emits DeprecationWarning on import (GH#40)
  • Unified search API: New SearchBackend protocol with LocalSearchBackend and AppViewSearchBackend implementations, SearchAggregator for multi-backend queries, and Index.search() integration (GH#33)
  • Lens verification workflow: New VerificationPublisher/VerificationLoader for science.alt.dataset.lensVerification records, with LexCodeHash and LexLensVerification Python types (GH#34)
  • Lens schema version compatibility: LensPublisher.publish() now accepts source_schema_version and target_schema_version parameters; LexLensRecord updated with corresponding fields (GH#34)
  • Array format support: New serialization helpers for sparse matrices (scipy), structured arrays, Arrow tensors (pyarrow), safetensors, and DataFrames (pandas/Parquet). Codegen and pipeline updated to recognize new shim $ref types. Optional dependency groups added to pyproject.toml (GH#76)
  • NDArray v1.1.0 annotations: Schema codegen supports optional dtype, shape, and dimensionNames annotation fields from the v1.1.0 ndarray shim (GH#76)
  • Upstream lexicon sync: Added lensVerification.json, verificationMethod.json, programmingLanguage.json, and updated arrayFormat.json/lens.json from forecast-bio/atdata-lexicon
  • Namespace rename: Lexicon namespace renamed from ac.foundation.dataset to science.alt.dataset across all source, tests, and documentation. Lexicon JSON files vendored from forecast-bio/atdata-lexicon with NSID-to-path directory structure. Lexicon loader updated to resolve NSIDs via path traversal. Added label and resolveLabel to LEXICON_IDS (GH#71)
  • Lexicon record → entry rename: The dataset record lexicon is renamed from ac.foundation.dataset.record to ac.foundation.dataset.entry throughout the codebase — lexicon files, Python types, collection constants, tests, and documentation (GH#63)
  • Schema version field rename: $atdataSchemaVersion renamed to atdataSchemaVersion (no $ prefix) to follow ATProto naming conventions for non-reserved properties (GH#65)
  • DID resolution refactor: Extracted Atmosphere.resolve_did() as a public method, deduplicating handle-to-DID resolution across schema, label, and record loaders
  • CI: Redis and Postgres container images pulled from AWS ECR Public Gallery (public.ecr.aws/docker/library/) instead of Docker Hub to avoid rate limits; replaced supercharge/redis-github-action with native service containers

Fixed

  • Manifest serialization: Manifests field now uses truthiness check consistent with the tags field pattern, preventing empty lists from being serialized as None

[0.5.1b1] - 2026-02-16

Fixed

  • Cross-account record reads: get_record() and list_records() now route reads for foreign DIDs through the public AppView instead of the authenticated PDS, fixing RecordNotFound errors when fetching records published by other users (e.g. schemas from foundation.ac while logged in as maxine.science)

[0.5.0b1] - 2026-02-07

Added

  • Typed content metadata: Datasets can now carry a metadata schema describing dataset-level properties (description, license, creation date) alongside the sample schema (#58)
  • Atmosphere label integration: Index.insert_dataset() publishes a label record alongside the dataset record, enabling @handle/name path resolution through get_label() and get_dataset() (#59)
  • E2E lens integration tests: Comprehensive tests for the lens publish → retrieve → execute lifecycle against a real ATProto PDS (#55)
  • E2E session management tests: Integration tests covering ATProto login, session export/import round-trip, unauthenticated reads, and error handling (#57)
  • XRPC query workaround docs: Documented client-side list_records() + filter patterns used as temporary workarounds pending AppView support (#56)

Changed

  • Schema lexicon rename: getLatestSchema renamed to resolveSchema to match resolveLabel semantics; added $atdataSchemaVersion property for format versioning (#53)
  • Schema API consolidation: decode_schema, load_schema, decode_schema_as, and schema_to_type consolidated into get_schema_type() with deprecation shims for old names (#54)
  • Build config cleanup: .chainlink/ and .claude/ directories excluded from sdist builds (#54)
  • Security hardening: Sanitized schema field names in _schema_codec.py and _stub_manager.py; removed allow_pickle=True from numpy save/load in _helpers.py

Fixed

  • Lens discovery pagination: find_by_schemas() now paginates with limit=100 + cursor instead of requesting more than ATProto's list_records cap of 100 records per call
  • Truncated session test: Rewrote test_truncated_session_string_raises to handle the atproto SDK silently accepting truncated sessions via its 4-field backward compat path (MarshalX/atproto#656)
  • Exception chaining: Exceptions are now re-raised with raise ... from chaining, and label publish failures are handled on a best-effort basis
  • Content metadata validation: Metadata validated before writing files; dead code removed
  • Chainlink db protection: Added git hooks (.githooks/) to prevent worktree symlinks from overwriting the issues database on merge
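
The lens discovery pagination fix in this list follows the usual cursor pattern: request at most 100 records, then repeat with the returned cursor until none comes back. The sketch below shows that loop against a fake page source. `paginate()`, `make_fake_fetch()`, and the `(items, next_cursor)` return shape are assumptions for illustration, not the atproto SDK's API.

```python
def paginate(fetch_page, page_size=100):
    """Yield every item by following cursors until the server returns none."""
    cursor = None
    while True:
        items, cursor = fetch_page(limit=page_size, cursor=cursor)
        yield from items
        if cursor is None:
            break


def make_fake_fetch(total, cap=100):
    """Build a fake page source that enforces a per-call cap, like a PDS would."""
    data = list(range(total))

    def fetch_page(limit, cursor):
        if limit > cap:
            raise ValueError(f"limit {limit} exceeds cap {cap}")
        start = cursor or 0
        end = start + limit
        # A cursor is returned only while more records remain.
        next_cursor = end if end < total else None
        return data[start:end], next_cursor

    return fetch_page
```

Collecting `paginate(make_fake_fetch(250))` walks three pages (100, 100, 50) and returns all 250 items, whereas a single call with `limit=250` would be rejected by the cap.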

[0.4.1b2] - 2026-02-05

Fixed

  • Atmosphere schema codec: _convert_atmosphere_schema() now handles the flattened ATProto wire format where properties/required are at the top level of the schema dict, not nested under a schemaBody key. Previously, schemas fetched from a PDS produced types with zero fields.
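
As a rough illustration of the two shapes involved, the helper below reads properties/required from either a nested schemaBody dict or the flattened top level. It is a simplified sketch under that assumption; `extract_fields()` is a hypothetical name and does not reproduce `_convert_atmosphere_schema()`.

```python
def extract_fields(schema: dict) -> tuple:
    """Return (properties, required) from a nested or flattened schema dict."""
    # Use the nested body when present, otherwise treat the dict itself as the body.
    body = schema.get("schemaBody", schema)
    return body.get("properties", {}), body.get("required", [])


nested = {"schemaBody": {"properties": {"label": {"type": "string"}}, "required": ["label"]}}
flattened = {"properties": {"label": {"type": "string"}}, "required": ["label"]}
```

Both `nested` and `flattened` produce the same field set here. A reader that only looked under `schemaBody` would see no fields in the flattened case, which matches the zero-field symptom described in the entry.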

[0.4.1b1] - 2026-02-05

Fixed

  • Atmosphere schema routing: get_schema() / get_schema_record() now handle at:// URI refs by delegating to the atmosphere backend instead of raising ValueError
  • Blob URL resolution: AtmosphereIndexEntry.data_urls resolves storageBlobs CIDs to PDS HTTP URLs via plc.directory (with caching), replacing the empty-list placeholder
  • Indexed path routing: load_dataset('@handle/dataset') now passes the full @handle/name path through to Index._resolve_prefix() so atmosphere routing works correctly
  • Atmosphere schema codec: schema_to_type() handles atmosphere JSON Schema format by converting to local field format before type reconstruction
  • Structural lens fallback: Dataset.as_type() falls back to structural field mapping when no registered lens exists between structurally compatible types (e.g. dynamic vs user-defined classes)
  • Blob shard checksums: SHA-256 digests from PDSBlobStore are now attached per-blob in BlobEntry.checksum instead of being buried in metadata.custom

Changed

  • Checksum extraction: Deduplicated checksum extraction logic into _extract_blob_checksums() helper with warning on count mismatch
  • Test nomenclature: Renamed test_integration_*.py → test_workflow_*.py across the test suite
  • CI workflow triggers: Updated workflow file references to match renamed test files

[0.4.0b2] - 2026-02-04

Added

  • Redis live integration tests: 16 tests covering entry CRUD, schema operations, label operations, lens operations, and concurrent access against a real Redis 7 service container
  • Redis service in CI: Integration workflow now includes a Redis 7 service container alongside PostgreSQL and MinIO

[0.4.0b1] - 2026-02-04

Added

  • Persistent lens storage: Index.store_lens() / Index.load_lens() serialize lens definitions (field-mapping or code-reference) to all provider backends (SQLite, Redis, PostgreSQL) with automatic reconstitution, version auto-increment, find_lenses_by_schemas(), and stub generation
  • Typed dataset metadata: DatasetMetadata dataclass replaces opaque bytes metadata in atmosphere records, with structured fields for description, license, tags, and custom key-value pairs

Changed

  • Metadata encoding: Atmosphere metadata now uses ATProto $bytes format to prevent PydanticSerializationError on binary payloads

Fixed

  • Metadata falsy values: DatasetMetadata.from_dict() now preserves falsy values (empty strings, zero, False) instead of dropping them
  • CI: Correct benchmark output path in bench workflow
  • Lint: Fix unused imports and undefined forward references in lens persistence code
  • Live integration tests: New tests/integration/ suite with ATProto, S3, and PostgreSQL live tests against remote sandboxes, plus dedicated CI workflow (.github/workflows/integration.yml)
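
The falsy-value fix in this list comes down to the difference between a truthiness check and an explicit None check. The two helpers below are hypothetical and only illustrate that difference; they are not taken from `DatasetMetadata.from_dict()`.

```python
def keep_truthy(raw: dict) -> dict:
    """Buggy pattern: drops "", 0, and False along with None."""
    return {key: value for key, value in raw.items() if value}


def keep_present(raw: dict) -> dict:
    """Fixed pattern: drops only keys whose value is None."""
    return {key: value for key, value in raw.items() if value is not None}


raw = {"description": "", "count": 0, "public": False, "license": None}
```

Here `keep_truthy(raw)` returns an empty dict, while `keep_present(raw)` keeps the empty string, the zero, and the `False`, and discards only `license`.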

[0.3.4b1] - 2026-02-04

Added

  • Content checksums: Per-shard SHA-256 digests computed at write time across all storage backends (LocalDiskStore, S3DataStore, PDSBlobStore). Checksums are carried via ShardWriteResult and automatically merged into index entry metadata
  • verify_checksums(): Utility function to verify stored checksums against shard files on disk; remote URLs (s3://, at://, http://) are gracefully skipped
  • atdata verify CLI command: Verify content integrity of indexed datasets from the command line
  • AT URI support in load_dataset(): load_dataset("at://did:plc:abc/.../rkey") now fetches dataset records from ATProto and resolves storage (blobs, HTTP, S3) into streamable datasets with automatic schema decoding
  • Lens composition operators: @ (compose) and | (pipe) operators for chaining lenses, plus identity_lens() factory for pass-through transforms
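
For readers unfamiliar with operator-based chaining, the sketch below shows how `@` and `|` can be overloaded on a transform object. It models only a forward function, not a bidirectional lens. The `Transform` class and `identity()` helper are hypothetical, and the ordering shown (`f @ g` applies g first, `f | g` applies f first) is an assumption rather than something the entry confirms for atdata.

```python
class Transform:
    """Hypothetical one-way transform; atdata's Lens is bidirectional."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, value):
        return self.fn(value)

    def __matmul__(self, other):
        # Assumed compose order: (f @ g)(x) == f(g(x))
        return Transform(lambda value: self(other(value)))

    def __or__(self, other):
        # Assumed pipe order: (f | g)(x) == g(f(x))
        return Transform(lambda value: other(self(value)))


def identity():
    """Pass-through transform, analogous in spirit to identity_lens()."""
    return Transform(lambda value: value)


double = Transform(lambda x: x * 2)
increment = Transform(lambda x: x + 1)
```

Under these assumptions, `(double @ increment)(3)` evaluates to 8 and `(double | increment)(3)` to 7, and piping through `identity()` leaves results unchanged.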

[0.3.3b2] - 2026-02-04

Testing

  • Coverage improvements (92% → 94%): 61 new tests across atmosphere client (swap_commit, model conversion fallbacks), DatasetLoader (HTTP/S3 storage paths, get_typed, to_dataset, checksum validation), DatasetPublisher (publish_with_s3), Redis/Postgres provider label CRUD, Redis schema edge cases (bytes decoding, legacy format), and lexicon loading/validation

Fixed

  • CI: Use cp -f in bench workflow to avoid interactive prompt on file overwrite

[0.3.3b1] - 2026-02-04

Added

  • Dataset labels: Named, versioned pointers to dataset records — separating identity (CID-addressed) from naming (mutable labels). store_label(), get_label(), list_labels(), delete_label() across all index providers (SQLite, Redis, PostgreSQL)
  • Atmosphere label records: LabelPublisher and LabelLoader for publishing and resolving ac.foundation.dataset.label records on ATProto PDS, with ac.foundation.dataset.resolveLabel query lexicon
  • Label-aware load_dataset(): Path resolution now tries label lookup before falling back to dataset name, enabling load_dataset("@local/mnist") to resolve through labels

Changed

  • Git flow: Adopted standard git flow branching model — develop as integration branch, feature/* from develop, release/* cut from develop. Updated /release, /feature, /publish, and /featree skills accordingly
  • Worktree chainlink sharing: /featree now symlinks .chainlink/issues.db to the base clone's copy so all worktrees share a single authoritative issue database

[0.3.2b3] - 2026-02-04

Fixed

  • Atmosphere.upload_blob() TypeError: The timeout heuristic passed timeout= to Client.upload_blob(), which only accepts (data: bytes). Switched to the namespace method com.atproto.repo.upload_blob(), which forwards kwargs through to httpx

Testing

  • ATProto SDK signature compatibility tests: New test_atproto_compat.py with 7 tests that instantiate a real atproto Client (with ClientRaw._invoke patched) to validate method signatures without network I/O. Covers upload_blob, create_record, list_records, get_record, delete_record, and export_session

[0.3.2b2] - 2026-02-03

Added

  • Lexicon-mirror type system: StorageHttp, StorageS3, StorageBlobs, BlobEntry, ShardChecksum dataclasses that mirror ATProto lexicon definitions, with storage_from_record() union deserializer
  • ShardUploadResult: Typed return from PDSBlobStore.write_shards() carrying both AT URIs and blob ref dicts
  • Lexicon reference docs: Auto-generated documentation page for the ac.foundation.dataset.* namespace
  • Example docs: dataset-profiler, lens-graph, and query-cookbook with plots and interactive tabsets
  • Typed proxy DSL for manifest queries (foundation-ac #43)

Changed

  • DatasetPublisher refactored: Extracted _create_record() helper, fixing a bug where publish() used dataset.url instead of dataset.list_shards() for multi-shard datasets
  • PDSBlobStore.write_shards() returns ShardUploadResult instead of using a _last_blob_refs side-channel
  • Blob storage uploads: PDS blob uploads now use storageBlobs with embedded blob ref objects instead of string AT URIs in storageExternal, preventing PDS garbage collection of uploaded blobs
  • Replaced lexicon symlinks with real files
  • Guarded redis imports behind TYPE_CHECKING in index/_entry.py and index/_index.py
  • Standardized benchmark outputs to .benchmarks/ directory

Fixed

  • publish() multi-shard bug: was passing single URL instead of full shard list
  • Double-write eliminated in PDSBlobStore
  • Lens-graph example: removed float rounding in calibrate lens that broke law assertions
  • Unused imports and E402 violations in atmosphere module
  • Unused variable and import in test files

Testing

  • Strengthened weak mock assertions with argument verification across 4 test files
  • Fixed misleading unicode tests: real emoji (🌍🎉🚀) and CJK characters (日本語テスト, 中文测试, 한국어시험) instead of ASCII placeholders
  • Exact shard count assertions instead of >= 2 bounds
  • Fixed self-referential assertion in test_publish_schema
  • Removed unnecessary isinstance builtin patch
  • Added content assertions for empty/corrupted shard recovery tests

[0.3.2b1] - 2026-02-03

Changed

  • Index.write() → Index.write_samples(): Renamed, with atmosphere-aware defaults: automatic PDS blob upload, 50 MB per-shard limit, 1 GB total dataset guard
    • New force flag bypasses PDS size limits for large datasets
    • New copy flag forces data transfer from private/remote sources to destination store
    • New data_store kwarg to override the default storage backend
  • Index.insert_dataset() overhaul: Smart source routing for atmosphere targets
    • Local files auto-upload via PDSBlobStore
    • Remote HTTP/HTTPS URLs referenced as external storage (zero-copy)
    • Credentialed S3Source errors by default to prevent leaking private endpoints; pass copy=True to copy data to the destination store
  • PDS constants: PDS_BLOB_LIMIT_BYTES (50 MB) and PDS_TOTAL_DATASET_LIMIT_BYTES (1 GB) in atmosphere/store.py; PDSBlobStore.write_shards() defaults to 50 MB shard size
  • CI overhaul: Sequential Lint → Pilot → Matrix flow; codecov uploads once per run instead of per-matrix-cell; benchmarks split to separate workflow
  • Lazy-import pandas and requests in dataset.py to reduce import time

Fixed

  • Atmosphere blob uploads: Index.write_samples() targeting atmosphere now uploads data as PDS blobs instead of publishing local temp file paths in the ATProto record

Deprecated

  • Index.add_entry() — use Index.insert_dataset() instead
  • Index.promote_entry() and Index.promote_dataset() — use Index.insert_dataset() with an atmosphere-backed Index instead
  • URLSource.shard_list and S3Source.shard_list properties — use list_shards() method instead
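
A deprecation of this kind is commonly implemented as a thin wrapper that warns and then delegates to the replacement. The sketch below shows that pattern on a toy class. `TinyIndex` is hypothetical and its method bodies are placeholders, so it says nothing about how atdata's Index implements either method.

```python
import warnings


class TinyIndex:
    """Toy stand-in used only to illustrate a deprecation shim."""

    def __init__(self):
        self.entries = []

    def insert_dataset(self, name):
        # Placeholder for the supported code path.
        self.entries.append(name)
        return name

    def add_entry(self, name):
        # Deprecated alias: warn, then delegate to the replacement.
        warnings.warn(
            "add_entry() is deprecated; use insert_dataset() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.insert_dataset(name)
```

Passing `stacklevel=2` attributes the warning to the caller's line rather than to the shim itself, which makes the deprecated call site easier to find.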

[0.3.1b1] - 2026-02-03

Added

  • Lexicon packaging: ATProto lexicon JSON files bundled in src/atdata/lexicons/ with importlib.resources access via atdata.lexicons.get_lexicon() and list_lexicons()
  • DatasetDict single-split proxy: When a DatasetDict has one split, .ordered(), .shuffled(), .list_shards(), and other Dataset methods are proxied directly
  • write_samples(manifest=True): Opt-in manifest generation during sample writing for query-based access
  • Example documentation: Five executable Quarto example docs covering typed pipelines, lens transforms, manifest queries, index workflows, and multi-split datasets
  • Bounds checking in bytes_to_array() for truncated/corrupted input buffers

Changed

  • Production hardening: observability and checkpoint/resume (GH#39 5.1/5.2) (#590)
  • Expand logging coverage across write/read/index/atmosphere paths (#593)
  • Add checkpoint/resume and on_shard_error to process_shards (#592)
  • Add log_operation context manager to _logging.py (#591)
  • Add reference documentation for atdata's atproto lexicons (#589)
  • Add version auto-suggest to /release and /publish skills (#588)
  • Create /publish skill for post-merge release tagging and PyPI publish (#587)
  • Fix wheel build: duplicate filename in ZIP archive rejected by PyPI (#586)
  • Update /release skill to run ruff format --check before committing (#585)
  • AtmosphereClient → Atmosphere: Renamed with factory classmethods Atmosphere.login() and Atmosphere.from_env(); AtmosphereClient remains as a deprecated alias
  • sampleSchema → schema: Lexicon record type renamed from ac.foundation.dataset.sampleSchema to ac.foundation.dataset.schema (clean break, no backward compat)
  • Module reorganization: local/ split into index/ (Index, entries, schema management) and stores/ (LocalDiskStore, S3DataStore); local/ remains as backward-compat re-export shim
  • CLI rename: atdata local subcommand renamed to atdata infra
  • Uniform Repository model: Index now treats "local" as a regular Repository, collapsing 3-way routing (local/named/atmosphere) to 2-way (repo/atmosphere)
  • SampleBatch aggregation uses np.stack() instead of np.array(list(...)) for efficiency
  • Numpy scalar coercion in _make_packable — numpy scalars are now extracted to Python primitives before msgpack serialization
  • Removed dead legacy aliases in StubManager (_stub_filename, _stub_path, _stub_is_current, _write_stub_atomic)
  • Streamlined homepage and updated docs site to reflect new APIs

Fixed

  • Schema round-trip in Index.write() — schemas with NDArray fields now survive publish/decode correctly
  • Test isolation: protocol tests now use temporary SQLite databases instead of shared default

[0.3.0b2] - 2026-02-02

Added

  • LocalDiskStore: Local filesystem data store implementing AbstractDataStore protocol
    • Writes WebDataset shards to disk with write_shards()
    • Default root at ~/.atdata/data/, configurable via constructor
  • write_samples(): Module-level function to write samples directly to WebDataset tar files
    • Single tar or sharded output via maxcount/maxsize parameters
    • Returns typed Dataset[ST] wrapping the written files
  • Index.write(): Write samples and create an index entry in one step
    • Combines write_samples() + insert_dataset() into a single call
    • Auto-creates LocalDiskStore when no data store is configured
  • Index.promote_entry() and Index.promote_dataset(): Atmosphere promotion via Index
    • Promote locally-indexed datasets to ATProto without standalone functions
    • Schema deduplication and automatic publishing
  • Top-level exports: atdata.Index, atdata.LocalDiskStore, atdata.write_samples
  • write() method added to AbstractIndex protocol
  • 38 new tests: test_write_samples.py, test_disk_store.py, test_index_write.py

Changed

  • promote.py updated as backward-compat wrapper delegating to Index.promote_entry()
  • Trimmed _protocols.py docstrings by 30% (487 → 343 lines)
  • Trimmed verbose test docstrings across test suite (−173 lines)
  • Strengthened weak test assertions (isinstance checks, tautological tests)
  • Removed dead code: parse_cid() function and tests
  • Added @pytest.mark.filterwarnings to tests exercising deprecated APIs

[0.3.0b1] - 2026-01-31

Added

  • Structured logging: atdata.configure_logging() with pluggable logger protocol
  • Partial failure handling: PartialFailureError and shard-level error handling in Dataset.map()
  • Testing utilities: atdata.testing module with mock clients, fixtures, and helpers
  • Per-shard manifest and query system (GH#35)
    • ManifestBuilder, ManifestWriter, QueryExecutor, SampleLocation
    • ManifestField annotation and resolve_manifest_fields()
    • Aggregate collectors (categorical, numeric, set)
    • Integrated into write path and Dataset.query()
  • Performance benchmark suite: bench_dataset_io, bench_index_providers, bench_query, bench_atmosphere
    • HTML benchmark reports with CI integration
    • Median/IQR statistics with per-sample columns
  • SQLite/PostgreSQL index providers (GH#42)
    • SqliteIndexProvider, PostgresIndexProvider, RedisIndexProvider
    • IndexProvider protocol in _protocols.py
    • SQLite as default provider (replacing Redis)
  • Developer experience improvements (GH#38)
    • CLI: atdata inspect, atdata preview, atdata schema show/diff
    • Dataset.head(), Dataset.__iter__, Dataset.__len__, Dataset.select()
    • Dataset.filter(), Dataset.map(), Dataset.describe()
    • Dataset.get(key), Dataset.schema, Dataset.column_names
    • Dataset.to_pandas(), Dataset.to_dict()
    • Custom exception hierarchy with actionable error messages
  • Consolidated Index with Repository system
    • Repository dataclass and _AtmosphereBackend
    • Prefix routing for multi-backend index operations
    • Default Index singleton with load_dataset integration
    • AtmosphereIndex deprecated in favor of Index(atmosphere=client)
  • Comprehensive test coverage: 1155 tests

Changed

  • Split local.py monolith into local/ package (_index.py, _entry.py, _schema.py, _s3.py, _repo_legacy.py)
  • Migrated CLI from argparse to typer
  • Migrated type annotations from PackableSample to Packable protocol
  • Multiple adversarial review passes with code quality improvements
  • CI: fixed duplicate runs, scoped permissions, benchmark auto-commit

[0.2.2b1] - 2026-01-28

Added

  • Blob storage for atmosphere datasets: Full support for storing dataset shards as ATProto blobs via PDS
    • DatasetPublisher.publish_with_blobs() for uploading shards as blobs
    • DatasetLoader.get_blobs() and get_blob_urls() for retrieval
    • AtmosphereClient.upload_blob() and get_blob() wrappers
  • HuggingFace-style API: load_dataset() function with path resolution, split handling, and streaming support
    • WebDataset brace notation, glob patterns, local directories, remote URLs
    • DatasetDict class for multi-split datasets
    • @handle/dataset path resolution via atmosphere index
  • Protocol-based architecture: Abstract protocols for backend interoperability
    • IndexEntry, AbstractIndex, AbstractDataStore protocols
    • Enables polymorphic code across local and atmosphere backends
  • Local to atmosphere promotion: promote_to_atmosphere() workflow with schema deduplication
  • Quarto documentation site: Tutorials, reference docs, and API reference at docs/
  • Comprehensive integration test suite: 593 tests covering E2E flows, error handling, edge cases

Changed

  • Architecture refactor: LocalIndex + S3DataStore composable pattern
    • LocalIndex now accepts optional data_store parameter
    • S3DataStore implements AbstractDataStore for S3 operations
  • Deprecated Repo class: Use LocalIndex(data_store=S3DataStore(...)) instead
    • Repo remains as thin backwards-compatibility wrapper with deprecation warning
  • Renamed BasicIndexEntry to LocalDatasetEntry with CID-based identity
  • Added ATProto-compatible CID generation via libipld
  • Performance improvements: cached sample_type property, precompiled regex patterns

Fixed

  • Dark theme styling for callouts and code blocks in Quarto docs
  • Browser chrome color updates on dark/light mode toggle

[0.2.0] - 2026-01-06

Added

  • Initial atmosphere module with ATProto integration
  • Schema, dataset, and lens publishing to ATProto PDS
  • AtmosphereClient for ATProto authentication and record management
  • AtmosphereIndex for querying published datasets and schemas
  • Dynamic sample type reconstruction from published schemas

Changed

  • Improved type hint coverage throughout codebase
  • Enhanced error messages for common failure modes

[0.1.0] - 2025-12-15

Added

  • Core PackableSample and @packable decorator for typed samples
  • Dataset[ST] generic typed dataset with WebDataset backend
  • SampleBatch[DT] with automatic attribute aggregation
  • Lens[S, V] bidirectional transformations
  • Local storage with Redis index and S3 data store
  • WebDataset tar file reading and writing
  • NumPy array serialization via msgpack