This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
atdata is a Python library that implements a loose federation of distributed, typed datasets built on top of WebDataset and ATProto. It provides:
- Typed samples with automatic serialization via msgpack
- Local and atmosphere storage with pluggable index providers (SQLite, Redis, PostgreSQL)
- Lens-based transformations between different dataset schemas
- ATProto integration for publishing and discovering datasets on the atmosphere
- HuggingFace-style API with `load_dataset()` for convenient access
- WebDataset integration for efficient large-scale dataset storage
```bash
# Uses uv for dependency management
python -m pip install uv  # if not already installed
uv sync

# Always run tests through uv to use the correct virtual environment
# Run all tests with coverage
uv run pytest

# Run specific test file
uv run pytest tests/test_dataset.py
uv run pytest tests/test_local.py

# Run single test
uv run pytest tests/test_dataset.py::test_create_sample -v

# Build the package
uv build
```

Development tasks are managed with just, a command runner. Available commands:
```bash
just test                        # Run all tests with coverage
just test tests/test_dataset.py  # Run specific test file
just lint                        # Run ruff check + format check
just docs                        # Build documentation (runs quartodoc + quarto)
just bench                       # Run full benchmark suite
just bench-io                    # Run I/O benchmarks only
just bench-index                 # Run index provider benchmarks
just bench-query                 # Run query benchmarks
just bench-report                # Generate HTML benchmark report
just bench-save <name>           # Save benchmark results
just bench-compare a b           # Compare two benchmark runs
```

Note on `just docs`: The recipe sets `QUARTO_PYTHON` to the project venv's Python so that Quarto uses the correct interpreter with project dependencies. Without this, Quarto may pick up a system Python that lacks `quartodoc` and other required packages.
The justfile is in the project root. Add new dev tasks there rather than creating shell scripts.
```bash
# Always use uv run for Python commands to use the correct virtual environment
uv run python -c "import atdata; print(atdata.__version__)"
uv run python script.py

# Never use bare python/python3 - it may not have project dependencies
# BAD:  python3 -c "import webdataset"
# GOOD: uv run python -c "import webdataset"
```

The codebase lives under `src/atdata/` with these main components:
Core modules:
- `dataset.py` — `PackableSample`, `DictSample`, `Dataset[ST]`, `SampleBatch[DT]`, `@packable`, `write_samples()`
- `lens.py` — `Lens[S, V]`, `LensNetwork`, `@lens` decorator
- `_protocols.py` — Protocol definitions: `Packable`, `IndexEntry`, `AbstractIndex`, `AbstractDataStore`, `DataSource`
- `_hf_api.py` — `load_dataset()`, `DatasetDict`, HuggingFace-style path resolution
- `_exceptions.py` — Custom exception hierarchy (`AtdataError`, `SchemaError`, `ShardError`, etc.)
Index and storage:
- `index/` — `Index`, `LocalDatasetEntry`, `LocalSchemaRecord`, schema management
- `stores/` — `LocalDiskStore`, `S3DataStore` (data store implementations)
- `local/` — Backward-compat shim re-exporting from `index/` and `stores/`
- `providers/` — Pluggable index backends: `SqliteProvider` (default), `RedisProvider`, `PostgresProvider`
- `repository.py` — `Repository` dataclass pairing provider + data store, prefix routing
ATProto integration:
- `atmosphere/` — `Atmosphere`, schema/dataset/lens publishers and loaders, `PDSBlobStore`
- `promote.py` — Local-to-atmosphere promotion (deprecated in favor of `Index.promote_entry()`)
Data pipeline:
- `_sources.py` — `URLSource`, `S3Source`, `BlobSource` (streaming shard data to `Dataset`)
- `manifest/` — Per-shard metadata manifests for query-based access (`ManifestField`, `QueryExecutor`)
Utilities:
- `_helpers.py` — NumPy array serialization (`array_to_bytes`/`bytes_to_array`)
- `_cid.py` — ATProto-compatible CID generation via `libipld`
- `_schema_codec.py` — Dynamic Python type generation from stored schemas
- `_stub_manager.py` — IDE stub file generation for dynamic types
- `_type_utils.py` — Shared type conversion utilities
- `_logging.py` — Pluggable structured logging
- `testing.py` — Mock clients, fixtures, and test helpers
CLI:
- `cli/` — Typer-based CLI: `atdata inspect`, `atdata preview`, `atdata schema show/diff`, `atdata local up/down/status`, `atdata diagnose`
## Sample Type Definition
Two approaches for defining sample types:
```python
# Approach 1: Explicit inheritance
@dataclass
class MySample(atdata.PackableSample):
    field1: str
    field2: NDArray

# Approach 2: Decorator (recommended)
@atdata.packable
class MySample:
    field1: str
    field2: NDArray
```

## Writing and Indexing Data
```python
# Write samples directly to tar files
ds = atdata.write_samples(samples, "output/data.tar")

# Or use Index for managed storage
index = atdata.Index(data_store=atdata.LocalDiskStore())
entry = index.write(samples, name="my-dataset")
```

## Index with Pluggable Storage
```python
# SQLite backend (default, zero dependencies)
index = atdata.Index()

# With local disk storage
index = atdata.Index(data_store=atdata.LocalDiskStore())

# With S3 storage
from atdata.local import S3DataStore
index = atdata.Index(data_store=S3DataStore(credentials, bucket="my-bucket"))

# With atmosphere backend
from atdata.atmosphere import AtmosphereClient
client = AtmosphereClient.login("handle", "password")
index = atdata.Index(atmosphere=client)
```

## NDArray Handling
Fields annotated as `NDArray` or `NDArray | None` are automatically:
- Converted from bytes during deserialization
- Converted to bytes during serialization (via `_helpers.array_to_bytes`)
- Handled by the `_ensure_good()` method in `PackableSample.__post_init__`
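The mechanism can be sketched with the stdlib `array` module standing in for NumPy. `ToySample` and the two helpers below are hypothetical stand-ins for `_helpers.array_to_bytes`/`bytes_to_array` (the real helpers also preserve dtype and shape):

```python
from array import array
from dataclasses import dataclass


def array_to_bytes(a: array) -> bytes:
    # Stand-in for _helpers.array_to_bytes; here we assume a flat
    # array of doubles, while the real helper records dtype and shape.
    return a.tobytes()


def bytes_to_array(b: bytes) -> array:
    out = array("d")
    out.frombytes(b)
    return out


@dataclass
class ToySample:
    data: array  # the real library annotates such fields as NDArray

    def __post_init__(self):
        # Mirrors the _ensure_good() idea: a field that arrived as raw
        # bytes (fresh from deserialization) is converted back to an array.
        if isinstance(self.data, (bytes, bytearray)):
            self.data = bytes_to_array(self.data)
```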
## Lens Transformations
Lenses enable viewing datasets through different type schemas:
```python
@atdata.lens
def my_lens(source: SourceType) -> ViewType:
    return ViewType(...)

@my_lens.putter
def my_lens_put(view: ViewType, source: SourceType) -> SourceType:
    return SourceType(...)

# Use with datasets
ds = atdata.Dataset[SourceType](url).as_type(ViewType)
```

The `LensNetwork` singleton (in `lens.py`) maintains a global registry of all lenses decorated with `@lens`.
## Batch Aggregation
`SampleBatch` uses `__getattr__` magic to aggregate sample attributes:
- For `NDArray` fields: stacks into a numpy array with a batch dimension
- For other fields: creates a list
- Results are cached in `_aggregate_cache`
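The aggregation idea can be sketched with a minimal hypothetical class (not the real `SampleBatch`, which additionally stacks `NDArray` fields into a batched numpy array):

```python
from dataclasses import dataclass


class ToyBatch:
    """Minimal sketch of __getattr__-based attribute aggregation."""

    def __init__(self, samples):
        self._samples = list(samples)
        self._aggregate_cache = {}

    def __getattr__(self, name):
        # Only invoked for attributes not found normally; underscore
        # names are rejected so _samples/_aggregate_cache never recurse.
        if name.startswith("_"):
            raise AttributeError(name)
        if name not in self._aggregate_cache:
            self._aggregate_cache[name] = [
                getattr(s, name) for s in self._samples
            ]
        return self._aggregate_cache[name]


@dataclass
class Point:
    x: int
    y: int
```

With this sketch, `ToyBatch([Point(1, 2), Point(3, 4)]).x` aggregates to `[1, 3]`, and repeated accesses return the cached list.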
Datasets use WebDataset brace-notation URLs:
- Single shard: `path/to/file-000000.tar`
- Multiple shards: `path/to/file-{000000..000009}.tar`
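The numeric range notation can be illustrated with a small expander. `expand_shards` is a hypothetical helper for this sketch; WebDataset itself relies on the `braceexpand` package for the general case:

```python
import re


def expand_shards(url: str) -> list[str]:
    # Expand one {lo..hi} numeric range into concrete shard URLs.
    m = re.search(r"\{(\d+)\.\.(\d+)\}", url)
    if not m:
        return [url]
    lo, hi = m.group(1), m.group(2)
    width = len(lo)  # zero-padding width taken from the lower bound
    return [
        url[: m.start()] + str(i).zfill(width) + url[m.end() :]
        for i in range(int(lo), int(hi) + 1)
    ]


print(expand_shards("path/to/file-{000000..000002}.tar"))
# ['path/to/file-000000.tar', 'path/to/file-000001.tar', 'path/to/file-000002.tar']
```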
## Property vs Method Pattern for Collections
When exposing collections of items, follow this convention:
- `foo.xs` — `@property` returning `Iterator[X]` (lazy iteration)
- `foo.list_xs()` — method returning `list[X]` (eager, fully evaluated)
Examples:
- `index.datasets` / `index.list_datasets()`
- `index.schemas` / `index.list_schemas()`
- `dataset.shards` / `dataset.list_shards()`
The lazy property enables memory-efficient iteration over large collections, while the method provides a concrete list when needed.
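A minimal sketch of the convention, using a hypothetical `ToyIndex` (not the real `Index`):

```python
from typing import Iterator


class ToyIndex:
    """Hypothetical class illustrating the lazy/eager pair."""

    def __init__(self, names: list[str]):
        self._names = names

    @property
    def datasets(self) -> Iterator[str]:
        # Lazy: yields entries one at a time; nothing is materialized.
        yield from self._names

    def list_datasets(self) -> list[str]:
        # Eager: fully evaluates the lazy property into a list.
        return list(self.datasets)
```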
## Type Parameters
The codebase uses Python 3.12+ generics heavily:
- `Dataset[ST]` where `ST` is the sample type
- `SampleBatch[DT]` where `DT` is the sample type
- Uses `__orig_class__.__args__[0]` at runtime to extract type parameters
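The runtime extraction trick can be demonstrated with a toy generic (`ToyDataset` is hypothetical, not the real `Dataset`):

```python
from typing import Generic, TypeVar

ST = TypeVar("ST")


class ToyDataset(Generic[ST]):
    """Hypothetical generic showing __orig_class__ extraction."""

    @property
    def sample_type(self) -> type:
        # typing sets __orig_class__ on instances created through a
        # subscripted alias, e.g. ToyDataset[int](); it is available
        # once __init__ has returned.
        return self.__orig_class__.__args__[0]


ds = ToyDataset[int]()
print(ds.sample_type)  # <class 'int'>
```

Note that `__orig_class__` only exists when the instance was constructed via a subscripted alias; a bare `ToyDataset()` would raise `AttributeError`.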
## Serialization Flow
- Sample → `as_wds` property → dict with `__key__` and `msgpack` bytes
- Msgpack bytes created by the `packed` property calling `_make_packable()` on fields
- Deserialization: `from_bytes()` → `from_data()` → `__init__` → `_ensure_good()`
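The round trip can be sketched with JSON standing in for msgpack (`ToySample` here is hypothetical, not the real `PackableSample`):

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ToySample:
    name: str
    value: int

    @property
    def packed(self) -> bytes:
        # The real packed property msgpack-encodes _make_packable()
        # output; JSON is a stand-in here.
        return json.dumps(asdict(self)).encode()

    @property
    def as_wds(self) -> dict:
        # WebDataset-style record: a __key__ plus the packed payload.
        return {"__key__": self.name, "msgpack": self.packed}

    @classmethod
    def from_bytes(cls, data: bytes) -> "ToySample":
        return cls(**json.loads(data))


s = ToySample("sample-0", 42)
assert ToySample.from_bytes(s.packed) == s
```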
## WebDataset Integration
- Uses `wds.writer.ShardWriter` / `wds.writer.TarWriter` for writing
  - Important: Always import from `wds.writer` (e.g., `wds.writer.TarWriter`) instead of `wds.TarWriter`; this avoids linting issues while being functionally equivalent
- Dataset iteration via `wds.DataPipeline` with custom `wrap()`/`wrap_batch()` methods
- Supports `ordered()` and `shuffled()` iteration modes
- 1550+ tests across 40+ test files
- Tests use parametrization via `@pytest.mark.parametrize` where appropriate
- Temporary WebDataset tar files are created in the `tmp_path` fixture
- Shared sample types are defined in `conftest.py` (`SharedBasicSample`, `SharedNumpySample`)
- Lens tests verify well-behavedness (GetPut/PutGet/PutPut laws)
- Integration tests cover local, atmosphere, cross-backend, and error handling scenarios
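The three lens laws can be sketched with a toy get/put pair over dict records (a hypothetical field-projection lens, not the library's `@lens` API):

```python
def get(source: dict) -> str:
    # View: project out the "name" field of a source record.
    return source["name"]


def put(view: str, source: dict) -> dict:
    # Update: write a view value back into the source record.
    return {**source, "name": view}


src = {"name": "a", "extra": 1}

# GetPut: putting back what you got changes nothing.
assert put(get(src), src) == src
# PutGet: getting after a put returns exactly what was put.
assert get(put("b", src)) == "b"
# PutPut: a second put overwrites the first.
assert put("c", put("b", src)) == put("c", src)
```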
`tests/test_atproto_compat.py` validates that our `Atmosphere` wrapper calls the atproto SDK with compatible method signatures. It uses a real atproto `Client` instance (not a mock) with `ClientRaw._invoke` patched so no network I/O occurs. This catches `TypeError`s like passing unsupported kwargs that unspecced mocks would silently accept.
Key details:
- The fixture injects a mock session with a far-future JWT expiry so the SDK's session-refresh guard passes through without attempting a token refresh
- Both `client._session` (SDK) and `atmo._session` (wrapper) must be set
- The `Client.upload_blob()` wrapper does NOT accept `**kwargs`; use the namespace method `Client.com.atproto.repo.upload_blob()`, which forwards kwargs through to httpx
Keep warning suppression local to individual tests, not global.
When tests generate expected warnings (e.g., from deprecated APIs or third-party library incompatibilities), suppress them using `@pytest.mark.filterwarnings` decorators on each affected test rather than global suppression in `conftest.py`. This:
- Documents which specific tests have known warning behaviors
- Makes it easier to track when warnings appear in unexpected places
- Avoids masking genuine warnings from new code
Example for deprecated API tests:
```python
@pytest.mark.filterwarnings("ignore::DeprecationWarning")
class TestAtmosphereIndex:
    """Tests for deprecated AtmosphereIndex backward compat."""
    ...
```

This project uses Google-style docstrings with quartodoc for API documentation generation. The most important formatting requirement is for Examples sections.
Use `Examples:` (plural) for code examples. This is recognized by griffe's Google docstring parser and rendered with proper syntax highlighting by quartodoc:
```python
def my_function():
    """Short description.

    Longer description if needed.

    Args:
        param: Description of parameter.

    Returns:
        Description of return value.

    Examples:
        >>> result = my_function()
        >>> print(result)
        'output'
    """
```

Key formatting rules:
- Use `Examples:` (plural, not `Example:` singular)
- Code examples are indented 8 spaces (4 more than `Examples:`)
- Use `>>>` for Python prompts and `...` for continuation lines
- No `::` marker needed; griffe handles the parsing automatically
Incorrect format (will not render with syntax highlighting):
```python
Example:    # Wrong - singular form is treated as an admonition
::          # Wrong - reST literal block marker not needed
    >>> code_here()
```

Correct format:

```python
Examples:
    >>> code_here()  # Correct - plural form, proper indentation
```

For multiple examples, continue in the same section:
```python
Examples:
    >>> # First example
    >>> x = create_thing()

    >>> # Second example
    >>> y = other_thing()
```

Apply the same format to class docstrings and method docstrings:
```python
class MyClass:
    """Class description.

    Examples:
        >>> obj = MyClass()
        >>> obj.do_something()
    """

    def method(self):
        """Method description.

        Examples:
            >>> self.method()
        """
```

This project uses chainlink for issue tracking. Chainlink commands do NOT need to be prefixed with `uv run`:
```bash
# Correct - run chainlink directly
chainlink list
chainlink close 123
chainlink show 123

# Incorrect - don't use uv run
uv run chainlink list  # Not needed
```

Project-level Claude Code skills are defined in `.claude/commands/`:
- `/release <version>` — Full release flow: branch from previous release, merge develop, version bump, changelog, PR to `main`
- `/publish` — Post-merge: create GitHub release, monitor PyPI publish, sync `develop`
- `/feature <description>` — Create a feature branch from `develop` with a slugified name and chainlink issue
- `/featree <description>` — Create a feature branch in a new git worktree (symlinks chainlink db)
- `/kickoff <description>` — Create a worktree via `/featree`, write a self-contained prompt, and launch an autonomous agent in a tmux session
- `/check [session]` — Check status of background feature agents (reads tmux panes and `.kickoff-status` sentinel files)
- `/adr` — Adversarial review with docstring-preservation rules for quartodoc
- `/changelog` — Generate a clean CHANGELOG entry from chainlink history
- `/commit` — Analyze changes and create a well-formatted commit
User-level skills (in `~/.claude/commands/`) take precedence over project-level skills with the same name.
The `/kickoff` command automates feature implementation by launching an autonomous Claude agent in an isolated worktree:
- `/kickoff <description>` creates a worktree (via `/featree`), writes a detailed prompt file, and launches `claude --model opus` in a tmux session named `feat-<slug>`.
- The background agent works autonomously: reads `CLAUDE.md`, implements the feature, runs tests, uses `/commit`, runs `/adr`, fixes issues, and writes `DONE` to `.kickoff-status` when finished.
- `/check` monitors progress by reading the tmux pane output and checking the sentinel file. It reports status (Working/Waiting/Done/Error) and suggests next actions.
Common issues with background agents:
- Agents get stuck on Claude Code's trust/permission prompts; approve with `tmux send-keys -t <session> Enter`
- Context compaction happens around 5% remaining; agents with large changes may compact mid-work
- Monitor directly: `tmux capture-pane -t <session> -p | tail -40`
- Attach interactively: `tmux attach -t <session>`
This project follows git flow branching:
| Branch | Purpose | Branches from | Merges to |
|---|---|---|---|
| `main` | Production releases, always deployable | — | — |
| `develop` | Integration branch, all features land here | `main` (initial) | `main` (via release) |
| `feature/*` | Individual work items | `develop` | `develop` |
| `release/*` | Release prep (version bump, changelog) | `develop` | `main` (via PR) |
| `hotfix/*` | Urgent fixes to production | `main` | `main` + `develop` |
- Branch from `develop`: `git checkout develop && git checkout -b feature/my-feature`
- Do work, commit
- Merge back to `develop` with `--no-ff`
- Delete the feature branch
Releases follow this pattern (automated by the `/release` skill):
- Create a `release/v<version>` branch from the previous release branch (e.g., `release/v0.4.0b2`)
- Merge `develop` into the release branch with `--no-ff`
- Bump the version in `pyproject.toml`, run `uv lock`
- Write a CHANGELOG entry (Keep a Changelog format)
- Push and create a PR to `main`
- After merge, use `/publish` to create the GitHub release, then sync develop: `git checkout develop && git merge main --no-ff`
When using the `/commit` command or creating commits:
- Always include `.chainlink/issues.db` in commits alongside code changes
- This ensures issue tracking history is preserved across sessions
- The `issues.db` file tracks all chainlink issues, comments, and status changes
The `.githooks/` directory contains shared hooks. Activate them after cloning:

```bash
just setup  # sets core.hooksPath to .githooks/
```

Included hooks:
- `pre-commit` — Blocks commits where `issues.db` is staged as a symlink (mode 120000). Prevents worktree artifacts from overwriting the real database on merge.
- `pre-merge-commit` — Backs up `issues.db` to `issues.db.pre-merge` before every merge, so the database can be restored if a merge corrupts it.
When using git worktrees (via `/featree`), the worktree's `.chainlink/issues.db` is replaced with a symlink to the base clone's copy. This ensures all worktrees share a single authoritative database on the develop branch.
Protection layers (in order of defense):
- `/featree` adds `.chainlink/issues.db` to the worktree's `.git/info/exclude` so the symlink is never staged.
- The `pre-commit` hook blocks any commit that stages `issues.db` as a symlink (mode 120000).
- The `pre-merge-commit` hook backs up the db before merges so it can be restored if a symlink slips through.
If the database is corrupted after a merge:
```bash
cp .chainlink/issues.db.pre-merge .chainlink/issues.db
```

- Track `src/atdata/cli/` — Always include the CLI module in commits
- The CLI is built with typer and provides `atdata inspect`, `atdata preview`, `atdata schema`, `atdata local`, and `atdata diagnose` commands
- Changes to the CLI should be committed with the related feature changes
- Track the `.planning/` directory in git — Do not ignore planning documents
- Planning documents are organized by phase in `.planning/phases/`:
  - `01-atproto-foundation/` — Initial ATProto integration design, lexicon definitions, architecture decisions
  - `02-v0.2-review/` — Human review assessments from the v0.2 cycle
  - `03-v0.3-roadmap/` — Codebase review and synthesis roadmap for v0.3
- Track the `.reference/` directory in git — Include reference documentation in commits
- The `.reference/` directory contains external specifications and reference materials
- This includes API specs, lexicon definitions, and other reference documentation used for development