All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
- Client-side AppView integration:
`Atmosphere` client now supports XRPC queries (`xrpc_query()`) and procedures (`xrpc_procedure()`) routed through a configurable AppView service. Schema, lens, label, and record loaders automatically use AppView for listing, search, and resolution when available, falling back to client-side pagination otherwise. New `has_appview` property and `AppViewRequiredError`/`AppViewUnavailableError` exceptions for clean error handling (GH#74)
- Blob URL parameter bug: Fixed incorrect parameter passing in blob URL construction within atmosphere record publishing
- Fallback logging: Improved diagnostic logging when AppView is unavailable and client-side fallback is used
- Dataset manifest support: Optional
`manifests` property on dataset entry records for per-shard metadata references, with `ShardManifestRef` and `LensCodeRef` Python mirror types (GH#62)
- Handle-based schema resolution:
`get_schema()` and `get_schema_type()` now accept the `@handle/TypeName@version` format, resolving schemas by handle + name + optional semver instead of requiring raw AT-URIs (GH#61)
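A minimal sketch of what parsing that handle-based reference format might look like. The exact grammar is an assumption, and `parse_schema_ref`/`SchemaRef` are hypothetical helpers for illustration, not atdata API:

```python
import re
from typing import NamedTuple, Optional

class SchemaRef(NamedTuple):
    handle: str
    name: str
    version: Optional[str]

# Assumed shape: "@handle/TypeName" with an optional "@semver" suffix.
_REF = re.compile(r"^@(?P<handle>[^/]+)/(?P<name>[^@]+)(?:@(?P<version>[\d.]+))?$")

def parse_schema_ref(ref: str) -> SchemaRef:
    m = _REF.match(ref)
    if m is None:
        raise ValueError(f"not a handle-based schema reference: {ref!r}")
    return SchemaRef(m["handle"], m["name"], m["version"])

ref = parse_schema_ref("@maxine.science/ImageSample@1.2.0")
```

The version component stays `None` when the `@version` suffix is omitted, leaving "latest" semantics to the resolver.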
- ADR: v0.7.0b1 release review (#876)
- Migrate test_local.py from deprecated add_entry() to insert_dataset() (#877)
- AppView-aware loaders/publishers:
`SchemaLoader`, `LabelLoader`, `DatasetLoader`, `LensLoader` and their corresponding publishers now prefer AppView XRPC endpoints when configured, with automatic graceful fallback to client-side `com.atproto.repo` workarounds (GH#50)
- `load_dataset()` atmosphere parameter: New optional `atmosphere` kwarg passes an `Atmosphere` client (and its AppView) through to AT URI resolution (GH#50)
- AbstractIndex deprecation:
The `AbstractIndex` protocol is deprecated in favor of direct `Index` usage; a backward-compatible `__getattr__` shim emits `DeprecationWarning` on import (GH#40)
- Unified search API: New
`SearchBackend` protocol with `LocalSearchBackend` and `AppViewSearchBackend` implementations, `SearchAggregator` for multi-backend queries, and `Index.search()` integration (GH#33)
- Lens verification workflow: New
`VerificationPublisher`/`VerificationLoader` for `science.alt.dataset.lensVerification` records, with `LexCodeHash` and `LexLensVerification` Python types (GH#34)
- Lens schema version compatibility:
`LensPublisher.publish()` now accepts `source_schema_version` and `target_schema_version` parameters; `LexLensRecord` updated with corresponding fields (GH#34)
- Array format support: New serialization helpers for sparse matrices (
`scipy`), structured arrays, Arrow tensors (`pyarrow`), safetensors, and DataFrames (pandas/Parquet). Codegen and pipeline updated to recognize new shim `$ref` types. Optional dependency groups added to `pyproject.toml` (GH#76)
- NDArray v1.1.0 annotations: Schema codegen supports optional
`dtype`, `shape`, and `dimensionNames` annotation fields from the v1.1.0 ndarray shim (GH#76)
- Upstream lexicon sync: Added
`lensVerification.json`, `verificationMethod.json`, `programmingLanguage.json`, and updated `arrayFormat.json`/`lens.json` from `forecast-bio/atdata-lexicon`
- Namespace rename: Lexicon namespace renamed from
`ac.foundation.dataset` to `science.alt.dataset` across all source, tests, and documentation. Lexicon JSON files vendored from forecast-bio/atdata-lexicon with an NSID-to-path directory structure. Lexicon loader updated to resolve NSIDs via path traversal. Added `label` and `resolveLabel` to `LEXICON_IDS` (GH#71)
- Lexicon record → entry rename: The dataset record lexicon is renamed from
`ac.foundation.dataset.record` to `ac.foundation.dataset.entry` throughout the codebase: lexicon files, Python types, collection constants, tests, and documentation (GH#63)
- Schema version field rename:
`$atdataSchemaVersion` renamed to `atdataSchemaVersion` (no `$` prefix) to follow ATProto naming conventions for non-reserved properties (GH#65)
- DID resolution refactor: Extracted
`Atmosphere.resolve_did()` as a public method, deduplicating handle-to-DID resolution across schema, label, and record loaders
- CI: Redis and Postgres container images pulled from AWS ECR Public Gallery (
`public.ecr.aws/docker/library/`) instead of Docker Hub to avoid rate limits; replaced `supercharge/redis-github-action` with native service containers
- Manifest serialization: The manifests field now uses a truthiness check consistent with the tags field pattern, preventing empty lists from being serialized as `None`
- Cross-account record reads:
`get_record()` and `list_records()` now route reads for foreign DIDs through the public AppView instead of the authenticated PDS, fixing `RecordNotFound` errors when fetching records published by other users (e.g. schemas from `foundation.ac` while logged in as `maxine.science`)
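The routing rule described above can be sketched in a few lines. This is a hypothetical illustration of the decision only, with plain strings standing in for the real client objects:

```python
def route_read(record_did: str, session_did: str) -> str:
    """Pick which service should serve a record read.

    Assumed rule from the fix above: reads for the authenticated user's
    own repo go to their PDS; reads for any other (foreign) DID go
    through the public AppView, which can see all repos.
    """
    return "pds" if record_did == session_did else "appview"

# Reading someone else's published schema while logged in as yourself:
service = route_read("did:plc:foundation", "did:plc:maxine")
```

Routing foreign reads to the AppView avoids the `RecordNotFound` failures that occur when the authenticated PDS only hosts the session owner's repo.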
- Typed content metadata: Datasets can now carry a metadata schema describing dataset-level properties (description, license, creation date) alongside the sample schema (#58)
- Atmosphere label integration:
`Index.insert_dataset()` publishes a label record alongside the dataset record, enabling `@handle/name` path resolution through `get_label()` and `get_dataset()` (#59)
- E2E lens integration tests: Comprehensive tests for the lens publish → retrieve → execute lifecycle against a real ATProto PDS (#55)
- E2E session management tests: Integration tests covering ATProto login, session export/import round-trip, unauthenticated reads, and error handling (#57)
- XRPC query workaround docs: Documented client-side
`list_records()` + filter patterns used as temporary workarounds pending AppView support (#56)
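The documented workaround pattern — page through a `list_records`-style endpoint with a cursor, then filter locally — can be sketched as follows. `fetch_page` is a stand-in for the real XRPC call; the page size and cursor shape are assumptions:

```python
def list_all_records(fetch_page, page_size=100):
    """Drain a cursor-paginated listing endpoint client-side.

    `fetch_page(limit=..., cursor=...)` must return (records, next_cursor),
    with next_cursor None on the last page.
    """
    cursor = None
    out = []
    while True:
        records, cursor = fetch_page(limit=page_size, cursor=cursor)
        out.extend(records)
        if cursor is None:
            return out

# Fake two-page backend for illustration.
_pages = {None: (list(range(100)), "p2"), "p2": (list(range(100, 150)), None)}

def fake_fetch(limit, cursor):
    return _pages[cursor]

records = list_all_records(fake_fetch)
matching = [r for r in records if r % 7 == 0]  # the client-side filter step
```

Filtering after the fact is wasteful for large collections, which is why the changelog treats this as temporary pending server-side query support.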
- Schema lexicon rename:
`getLatestSchema` renamed to `resolveSchema` to match `resolveLabel` semantics; added `$atdataSchemaVersion` property for format versioning (#53)
- Schema API consolidation:
`decode_schema`, `load_schema`, `decode_schema_as`, and `schema_to_type` consolidated into `get_schema_type()` with deprecation shims for the old names (#54)
- Build config cleanup:
`.chainlink/` and `.claude/` directories excluded from sdist builds (#54)
- Security hardening: Sanitized schema field names in
`_schema_codec.py` and `_stub_manager.py`; removed `allow_pickle=True` from numpy save/load in `_helpers.py`
- Lens discovery pagination:
`find_by_schemas()` now paginates with `limit=100` + cursor instead of exceeding ATProto's `list_records` cap of 100
- Truncated session test: Rewrote
`test_truncated_session_string_raises` to handle the atproto SDK silently accepting truncated sessions via its 4-field backward compat path (MarshalX/atproto#656)
- Exception chaining: Proper
`raise ... from` exception chaining and best-effort label publish error handling
- Content metadata validation: Metadata validated before writing files; dead code removed
- Chainlink db protection: Added git hooks (
`.githooks/`) to prevent worktree symlinks from overwriting the issues database on merge
- Atmosphere schema codec:
`_convert_atmosphere_schema()` now handles the flattened ATProto wire format where `properties`/`required` are at the top level of the `schema` dict, not nested under a `schemaBody` key. Previously, schemas fetched from a PDS produced types with zero fields.
- Atmosphere schema routing:
`get_schema()`/`get_schema_record()` now handle `at://` URI refs by delegating to the atmosphere backend instead of raising `ValueError`
- Blob URL resolution:
`AtmosphereIndexEntry.data_urls` resolves `storageBlobs` CIDs to PDS HTTP URLs via `plc.directory` (with caching), replacing the empty-list placeholder
- Indexed path routing:
`load_dataset('@handle/dataset')` now passes the full `@handle/name` path through to `Index._resolve_prefix()` so atmosphere routing works correctly
- Atmosphere schema codec:
`schema_to_type()` handles the atmosphere JSON Schema format by converting to the local field format before type reconstruction
- Structural lens fallback:
`Dataset.as_type()` falls back to structural field mapping when no registered lens exists between structurally compatible types (e.g. dynamic vs user-defined classes)
- Blob shard checksums: SHA-256 digests from
`PDSBlobStore` are now attached per-blob in `BlobEntry.checksum` instead of being buried in `metadata.custom`
- Checksum extraction: Deduplicated checksum extraction logic into
`_extract_blob_checksums()` helper with a warning on count mismatch
- Test nomenclature: Renamed
`test_integration_*.py` → `test_workflow_*.py` across the test suite
- CI workflow triggers: Updated workflow file references to match renamed test files
- Redis live integration tests: 16 tests covering entry CRUD, schema operations, label operations, lens operations, and concurrent access against a real Redis 7 service container
- Redis service in CI: Integration workflow now includes a Redis 7 service container alongside PostgreSQL and MinIO
- Persistent lens storage:
`Index.store_lens()`/`Index.load_lens()` serialize lens definitions (field-mapping or code-reference) to all provider backends (SQLite, Redis, PostgreSQL) with automatic reconstitution, version auto-increment, `find_lenses_by_schemas()`, and stub generation
- Typed dataset metadata:
`DatasetMetadata` dataclass replaces opaque bytes metadata in atmosphere records, with structured fields for description, license, tags, and custom key-value pairs
- Metadata encoding: Atmosphere metadata now uses ATProto
`$bytes` format to prevent `PydanticSerializationError` on binary payloads
- Metadata falsy values:
`DatasetMetadata.from_dict()` now preserves falsy values (empty strings, zero, `False`) instead of dropping them
- CI: Correct benchmark output path in bench workflow
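The falsy-value fix above boils down to filtering on `is not None` rather than truthiness. A minimal sketch, with plain dicts standing in for `DatasetMetadata`:

```python
def keep_meaningful_fields(data: dict) -> dict:
    """Keep every field that was actually set.

    A naive `{k: v for k, v in data.items() if v}` would drop "" / 0 /
    False along with None; testing `is not None` preserves legitimate
    falsy values while still omitting unset fields.
    """
    return {k: v for k, v in data.items() if v is not None}

meta = keep_meaningful_fields(
    {"description": "", "version": 0, "public": False, "license": None}
)
```

Only `license` (genuinely unset) is dropped; the empty description, zero version, and `False` flag all survive the round-trip.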
- Lint: Fix unused imports and undefined forward references in lens persistence code
- Live integration tests: New
`tests/integration/` suite with ATProto, S3, and PostgreSQL live tests against remote sandboxes, plus a dedicated CI workflow (`.github/workflows/integration.yml`)
- Content checksums: Per-shard SHA-256 digests computed at write time across all storage backends (
`LocalDiskStore`, `S3DataStore`, `PDSBlobStore`). Checksums are carried via `ShardWriteResult` and automatically merged into index entry metadata
- `verify_checksums()`: Utility function to verify stored checksums against shard files on disk; remote URLs (`s3://`, `at://`, `http://`) are gracefully skipped
- `atdata verify` CLI command: Verify content integrity of indexed datasets from the command line
- AT URI support in
`load_dataset()`: `load_dataset("at://did:plc:abc/.../rkey")` now fetches dataset records from ATProto and resolves storage (blobs, HTTP, S3) into streamable datasets with automatic schema decoding
- Lens composition operators:
`@` (compose) and `|` (pipe) operators for chaining lenses, plus an `identity_lens()` factory for pass-through transforms
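A minimal sketch of what those operators could look like. The real atdata `Lens` is bidirectional; this hypothetical one-directional stand-in only illustrates the composition semantics (`@` right-to-left, `|` left-to-right):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Lens:
    """Toy lens: just a forward transform, enough to show chaining."""
    get: Callable

    def __matmul__(self, other: "Lens") -> "Lens":
        # self @ other: apply `other` first, then `self` (math-style compose)
        return Lens(lambda x: self.get(other.get(x)))

    def __or__(self, other: "Lens") -> "Lens":
        # self | other: apply `self` first, then `other` (pipeline order)
        return Lens(lambda x: other.get(self.get(x)))

def identity_lens() -> Lens:
    return Lens(lambda x: x)

celsius_to_kelvin = Lens(lambda c: c + 273.15)
kelvin_to_string = Lens(lambda k: f"{k:.2f} K")

pipeline = celsius_to_kelvin | kelvin_to_string   # reads left to right
composed = kelvin_to_string @ celsius_to_kelvin   # reads right to left
```

Both spellings build the same chain; which operator reads better depends on whether you think in pipeline order or in function-composition order.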
- Coverage improvements (92% → 94%): 61 new tests across atmosphere client (swap_commit, model conversion fallbacks), DatasetLoader (HTTP/S3 storage paths,
`get_typed`, `to_dataset`, checksum validation), DatasetPublisher (`publish_with_s3`), Redis/Postgres provider label CRUD, Redis schema edge cases (bytes decoding, legacy format), and lexicon loading/validation
- CI: Use
`cp -f` in the bench workflow to avoid an interactive prompt on file overwrite
- Dataset labels: Named, versioned pointers to dataset records, separating identity (CID-addressed) from naming (mutable labels).
`store_label()`, `get_label()`, `list_labels()`, `delete_label()` across all index providers (SQLite, Redis, PostgreSQL)
- Atmosphere label records:
`LabelPublisher` and `LabelLoader` for publishing and resolving `ac.foundation.dataset.label` records on an ATProto PDS, with an `ac.foundation.dataset.resolveLabel` query lexicon
- Label-aware
`load_dataset()`: Path resolution now tries label lookup before falling back to dataset name, enabling `load_dataset("@local/mnist")` to resolve through labels
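The label-first resolution order can be sketched with plain dicts standing in for the index providers. The function name and the two-table model are hypothetical simplifications of the real lookup:

```python
def resolve(path: str, labels: dict, datasets: dict) -> str:
    """Label-aware path resolution sketch.

    Try the mutable label table first (label -> record key), then fall
    back to treating the path as a direct dataset name.
    """
    if path in labels:
        return datasets[labels[path]]
    return datasets[path]

labels = {"@local/mnist": "cid-of-mnist-v3"}
datasets = {
    "cid-of-mnist-v3": "mnist dataset (v3)",
    "scratch": "scratch dataset",
}
```

Because the label is just a pointer, re-pointing `@local/mnist` at a new record key changes what the name resolves to without touching the CID-addressed records themselves.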
- Git flow: Adopted the standard git flow branching model:
`develop` as the integration branch, `feature/*` from `develop`, `release/*` cut from `develop`. Updated the `/release`, `/feature`, `/publish`, and `/featree` skills accordingly
- Worktree chainlink sharing:
`/featree` now symlinks `.chainlink/issues.db` to the base clone's copy so all worktrees share a single authoritative issue database
- `Atmosphere.upload_blob()` TypeError: The timeout heuristic passed `timeout=` to `Client.upload_blob()`, which only accepts `(data: bytes)`. Switched to the namespace method `com.atproto.repo.upload_blob()`, which forwards kwargs through to httpx
- ATProto SDK signature compatibility tests: New
`test_atproto_compat.py` with 7 tests that instantiate a real atproto `Client` (with `ClientRaw._invoke` patched) to validate method signatures without network I/O. Covers `upload_blob`, `create_record`, `list_records`, `get_record`, `delete_record`, and `export_session`
- Lexicon-mirror type system:
`StorageHttp`, `StorageS3`, `StorageBlobs`, `BlobEntry`, `ShardChecksum` dataclasses that mirror ATProto lexicon definitions, with a `storage_from_record()` union deserializer
- `ShardUploadResult`: Typed return from `PDSBlobStore.write_shards()` carrying both AT URIs and blob ref dicts
- Lexicon reference docs: Auto-generated documentation page for the
`ac.foundation.dataset.*` namespace
- Example docs: dataset-profiler, lens-graph, and query-cookbook with plots and interactive tabsets
- Typed proxy DSL for manifest queries (
foundation-ac #43)
- `DatasetPublisher` refactored: Extracted a `_create_record()` helper, fixing a bug where `publish()` used `dataset.url` instead of `dataset.list_shards()` for multi-shard datasets
- `PDSBlobStore.write_shards()` returns `ShardUploadResult` instead of using a `_last_blob_refs` side-channel
- Blob storage uploads: PDS blob uploads now use
`storageBlobs` with embedded blob ref objects instead of string AT URIs in `storageExternal`, preventing PDS garbage collection of uploaded blobs
- Replaced lexicon symlinks with real files
- Guarded
`redis` imports behind `TYPE_CHECKING` in `index/_entry.py` and `index/_index.py`
- Standardized benchmark outputs to
the `.benchmarks/` directory
- `publish()` multi-shard bug: was passing a single URL instead of the full shard list
- Double-write eliminated in
`PDSBlobStore`
- Lens-graph example: removed float rounding in the calibrate lens that broke law assertions
- Unused imports and E402 violations in atmosphere module
- Unused variable and import in test files
- Strengthened weak mock assertions with argument verification across 4 test files
- Fixed misleading unicode tests: real emoji (🌍🎉🚀) and CJK characters (日本語テスト, 中文测试, 한국어시험) instead of ASCII placeholders
- Exact shard count assertions instead of
`>= 2` bounds
- Fixed self-referential assertion in
`test_publish_schema`
- Removed unnecessary
`isinstance` builtin patch
- Added content assertions for empty/corrupted shard recovery tests
- `Index.write()` → `Index.write_samples()`: Renamed with atmosphere-aware defaults (automatic PDS blob upload, 50 MB per-shard limit, 1 GB total dataset guard)
- New
`force` flag bypasses PDS size limits for large datasets
- New
`copy` flag forces data transfer from private/remote sources to the destination store
- New
`data_store` kwarg to override the default storage backend
- New
`Index.insert_dataset()` overhaul: Smart source routing for atmosphere targets
- Local files auto-upload via
`PDSBlobStore`
- Credentialed
S3Sourceerrors by default to prevent leaking private endpoints; passcopy=Trueto copy data to the destination store
- Local files auto-upload via
- PDS constants:
`PDS_BLOB_LIMIT_BYTES` (50 MB) and `PDS_TOTAL_DATASET_LIMIT_BYTES` (1 GB) in `atmosphere/store.py`; `PDSBlobStore.write_shards()` defaults to a 50 MB shard size
- CI overhaul: Sequential Lint → Pilot → Matrix flow; codecov uploads once per run instead of per-matrix-cell; benchmarks split into a separate workflow
- Lazy-import
`pandas` and `requests` in `dataset.py` to reduce import time
- Atmosphere blob uploads:
`Index.write_samples()` targeting atmosphere now uploads data as PDS blobs instead of publishing local temp file paths in the ATProto record
- `Index.add_entry()`: use `Index.insert_dataset()` instead
- `Index.promote_entry()` and `Index.promote_dataset()`: use `Index.insert_dataset()` with an atmosphere-backed Index instead
- `URLSource.shard_list` and `S3Source.shard_list` properties: use the `list_shards()` method instead
- Lexicon packaging: ATProto lexicon JSON files bundled in
`src/atdata/lexicons/` with `importlib.resources` access via `atdata.lexicons.get_lexicon()` and `list_lexicons()`
- `DatasetDict` single-split proxy: When a `DatasetDict` has one split, `.ordered()`, `.shuffled()`, `.list_shards()`, and other `Dataset` methods are proxied directly
- `write_samples(manifest=True)`: Opt-in manifest generation during sample writing for query-based access
- Example documentation: Five executable Quarto example docs covering typed pipelines, lens transforms, manifest queries, index workflows, and multi-split datasets
- Bounds checking in
`bytes_to_array()` for truncated/corrupted input buffers
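Defensive decoding of this kind can be sketched as below. The wire format here (a 4-byte big-endian count followed by 4-byte values) is invented for illustration and is not atdata's actual array encoding; the point is checking lengths before every read:

```python
import struct

def decode_int_array(buf: bytes) -> list:
    """Bounds-checked decode of a length-prefixed buffer.

    Hypothetical format: >I count header, then `count` big-endian uint32
    values. Truncated input raises instead of silently mis-reading.
    """
    if len(buf) < 4:
        raise ValueError("buffer too short for header")
    (count,) = struct.unpack_from(">I", buf, 0)
    need = 4 + 4 * count
    if len(buf) < need:
        raise ValueError(f"truncated buffer: need {need} bytes, got {len(buf)}")
    return list(struct.unpack_from(f">{count}I", buf, 4))

data = struct.pack(">I3I", 3, 10, 20, 30)
```

The explicit `need` computation is what turns a corrupted shard into a clean `ValueError` rather than garbage values or an opaque struct error.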
- Production hardening: observability and checkpoint/resume (GH#39 5.1/5.2) (#590)
- Expand logging coverage across write/read/index/atmosphere paths (#593)
- Add checkpoint/resume and on_shard_error to process_shards (#592)
- Add log_operation context manager to _logging.py (#591)
- Add reference documentation for atdata's atproto lexicons (#589)
- Add version auto-suggest to /release and /publish skills (#588)
- Create /publish skill for post-merge release tagging and PyPI publish (#587)
- Fix wheel build: duplicate filename in ZIP archive rejected by PyPI (#586)
- Update /release skill to run ruff format --check before committing (#585)
- `AtmosphereClient` → `Atmosphere`: Renamed with factory classmethods `Atmosphere.login()` and `Atmosphere.from_env()`; `AtmosphereClient` remains as a deprecated alias
- `sampleSchema` → `schema`: Lexicon record type renamed from `ac.foundation.dataset.sampleSchema` to `ac.foundation.dataset.schema` (clean break, no backward compat)
- Module reorganization:
`local/` split into `index/` (Index, entries, schema management) and `stores/` (LocalDiskStore, S3DataStore); `local/` remains as a backward-compat re-export shim
- CLI rename:
`atdata local` subcommand renamed to `atdata infra`
- Uniform Repository model:
`Index` now treats `"local"` as a regular `Repository`, collapsing 3-way routing (local/named/atmosphere) to 2-way (repo/atmosphere)
- `SampleBatch` aggregation uses `np.stack()` instead of `np.array(list(...))` for efficiency
- Numpy scalar coercion in
`_make_packable`: numpy scalars are now extracted to Python primitives before msgpack serialization
- Removed dead legacy aliases in
`StubManager` (`_stub_filename`, `_stub_path`, `_stub_is_current`, `_write_stub_atomic`)
- Streamlined homepage and updated the docs site to reflect new APIs
- Schema round-trip in
`Index.write()`: schemas with NDArray fields now survive publish/decode correctly
- Test isolation: protocol tests now use temporary SQLite databases instead of a shared default
- `LocalDiskStore`: Local filesystem data store implementing the `AbstractDataStore` protocol
- Writes WebDataset shards to disk with
`write_shards()`
~/.atdata/data/, configurable via constructor
- Writes WebDataset shards to disk with
write_samples(): Module-level function to write samples directly to WebDataset tar files- Single tar or sharded output via
maxcount/maxsizeparameters - Returns typed
Dataset[ST]wrapping the written files
- Single tar or sharded output via
Index.write(): Write samples and create an index entry in one step- Combines
write_samples()+insert_dataset()into a single call - Auto-creates
LocalDiskStorewhen no data store is configured
- Combines
Index.promote_entry()andIndex.promote_dataset(): Atmosphere promotion via Index- Promote locally-indexed datasets to ATProto without standalone functions
- Schema deduplication and automatic publishing
- Top-level exports:
atdata.Index,atdata.LocalDiskStore,atdata.write_samples write()method added toAbstractIndexprotocol- 38 new tests:
test_write_samples.py,test_disk_store.py,test_index_write.py
promote.pyupdated as backward-compat wrapper delegating toIndex.promote_entry()- Trimmed
_protocols.pydocstrings by 30% (487 → 343 lines) - Trimmed verbose test docstrings across test suite (−173 lines)
- Strengthened weak test assertions (isinstance checks, tautological tests)
- Removed dead code:
parse_cid()function and tests - Added
@pytest.mark.filterwarningsto tests exercising deprecated APIs
- Structured logging:
`atdata.configure_logging()` with a pluggable logger protocol
- Partial failure handling:
`PartialFailureError` and shard-level error handling in `Dataset.map()`
- Testing utilities:
`atdata.testing` module with mock clients, fixtures, and helpers
- Per-shard manifest and query system (GH#35)
`ManifestBuilder`, `ManifestWriter`, `QueryExecutor`, `SampleLocation`
- `ManifestField` annotation and `resolve_manifest_fields()`
- Aggregate collectors (categorical, numeric, set)
- Integrated into write path and
`Dataset.query()`
- Performance benchmark suite:
`bench_dataset_io`, `bench_index_providers`, `bench_query`, `bench_atmosphere`
- HTML benchmark reports with CI integration
- Median/IQR statistics with per-sample columns
- SQLite/PostgreSQL index providers (GH#42)
`SqliteIndexProvider`, `PostgresIndexProvider`, `RedisIndexProvider`
- `IndexProvider` protocol in `_protocols.py`
- SQLite as the default provider (replacing Redis)
- Developer experience improvements (GH#38)
- CLI:
`atdata inspect`, `atdata preview`, `atdata schema show/diff`
- `Dataset.head()`, `Dataset.__iter__`, `Dataset.__len__`, `Dataset.select()`
- `Dataset.filter()`, `Dataset.map()`, `Dataset.describe()`
- `Dataset.get(key)`, `Dataset.schema`, `Dataset.column_names`
- `Dataset.to_pandas()`, `Dataset.to_dict()`
- Custom exception hierarchy with actionable error messages
- CLI:
- Consolidated Index with Repository system
`Repository` dataclass and `_AtmosphereBackend`
- Prefix routing for multi-backend index operations
- Default
`Index` singleton with `load_dataset` integration
- `AtmosphereIndex` deprecated in favor of `Index(atmosphere=client)`
- Comprehensive test coverage: 1155 tests
- Split
`local.py` monolith into a `local/` package (`_index.py`, `_entry.py`, `_schema.py`, `_s3.py`, `_repo_legacy.py`)
- Migrated type annotations from
PackableSampletoPackableprotocol - Multiple adversarial review passes with code quality improvements
- CI: fixed duplicate runs, scoped permissions, benchmark auto-commit
- Blob storage for atmosphere datasets: Full support for storing dataset shards as ATProto blobs via PDS
DatasetPublisher.publish_with_blobs()for uploading shards as blobsDatasetLoader.get_blobs()andget_blob_urls()for retrievalAtmosphereClient.upload_blob()andget_blob()wrappers
- HuggingFace-style API:
load_dataset()function with path resolution, split handling, and streaming support- WebDataset brace notation, glob patterns, local directories, remote URLs
DatasetDictclass for multi-split datasets@handle/datasetpath resolution via atmosphere index
- Protocol-based architecture: Abstract protocols for backend interoperability
IndexEntry,AbstractIndex,AbstractDataStoreprotocols- Enables polymorphic code across local and atmosphere backends
- Local to atmosphere promotion:
promote_to_atmosphere()workflow with schema deduplication - Quarto documentation site: Tutorials, reference docs, and API reference at docs/
- Comprehensive integration test suite: 593 tests covering E2E flows, error handling, edge cases
- Architecture refactor:
LocalIndex+S3DataStorecomposable patternLocalIndexnow accepts optionaldata_storeparameterS3DataStoreimplementsAbstractDataStorefor S3 operations
- Deprecated
Repoclass: UseLocalIndex(data_store=S3DataStore(...))insteadReporemains as thin backwards-compatibility wrapper with deprecation warning
- Renamed
BasicIndexEntrytoLocalDatasetEntrywith CID-based identity - Added ATProto-compatible CID generation via libipld
- Performance improvements: cached
`sample_type` property, precompiled regex patterns
- Dark theme styling for callouts and code blocks in Quarto docs
- Browser chrome color updates on dark/light mode toggle
- Initial atmosphere module with ATProto integration
- Schema, dataset, and lens publishing to ATProto PDS
`AtmosphereClient` for ATProto authentication and record management
- `AtmosphereIndex` for querying published datasets and schemas
- Dynamic sample type reconstruction from published schemas
- Improved type hint coverage throughout codebase
- Enhanced error messages for common failure modes
- Core
`PackableSample` and the `@packable` decorator for typed samples
- `Dataset[ST]` generic typed dataset with a WebDataset backend
- `SampleBatch[DT]` with automatic attribute aggregation
- `Lens[S, V]` bidirectional transformations
- Local storage with Redis index and S3 data store
- WebDataset tar file reading and writing
- NumPy array serialization via msgpack