From 047c6efece84bf78018fee262f49efefd9e08ba8 Mon Sep 17 00:00:00 2001
From: Maxine Levesque <170461181+maxinelevesque@users.noreply.github.com>
Date: Wed, 28 Jan 2026 13:43:17 -0800
Subject: [PATCH 01/30] fix(types): change @packable return type to expose
 PackableSample methods to type checkers

The decorator now returns type[PackableSample] instead of type[_T].
Combined with @dataclass_transform(), this allows IDEs to recognize both:
- Original class fields (via dataclass_transform)
- PackableSample methods: packed, as_wds, from_bytes, from_data

Co-Authored-By: Claude Opus 4.5
---
 .chainlink/issues.db  | Bin 483328 -> 487424 bytes
 CHANGELOG.md          | 2 ++
 src/atdata/dataset.py | 9 ++++++++-
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/.chainlink/issues.db b/.chainlink/issues.db
index 12b36ae2d0358448d5941f0e0e33262998fc97f3..5cd1165dd5c81b379b04810d6cbdc9140cff4e6e 100644
GIT binary patch
(base85 delta data omitted)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index feeea7e..05c2463 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -25,6 +25,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 - **Comprehensive integration test suite**: 593 tests covering E2E flows, error handling, edge cases
 
 ### Changed
+- Add SQLite/PostgreSQL providers for LocalIndex (in addition to Redis) (#407)
+- Fix type hints for @atdata.packable decorator to show PackableSample methods (#406)
 - Review GitHub workflows and recommend CI improvements (#405)
 - Fix type signatures for Dataset.ordered and Dataset.shuffled (GH#28) (#404)
 - Investigate quartodoc Example section rendering - missing CSS classes on pre/code tags (#401)
diff --git a/src/atdata/dataset.py b/src/atdata/dataset.py
index cefda6f..c067027 100644
--- a/src/atdata/dataset.py
+++ b/src/atdata/dataset.py
@@ -1009,7 +1009,7 @@ def wrap_batch(self, batch: WDSRawBatch) -> SampleBatch[ST]:
 
 
 @dataclass_transform()
-def packable(cls: type[_T]) -> type[_T]:
+def packable(cls: type[_T]) -> type[PackableSample]:
     """Decorator to convert a regular class into a ``PackableSample``.
 
     This decorator transforms a class into a dataclass that inherits from
@@ -1020,6 +1020,13 @@ def packable(cls: type[_T]) -> type[PackableSample]:
     with all atdata APIs that accept packable types (e.g., ``publish_schema``,
     lens transformations, etc.).
 
+    Type Checking:
+        The return type is annotated as ``type[PackableSample]`` so that IDEs
+        and type checkers recognize the ``PackableSample`` methods (``packed``,
+        ``as_wds``, ``from_bytes``, etc.). The ``@dataclass_transform()``
+        decorator ensures that field access from the original class is also
+        preserved for type checking.
+
     Args:
         cls: The class to convert. Should have type annotations for its fields.
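The typing pattern this commit relies on can be sketched in isolation. Everything below is illustrative stand-in code: the toy `PackableSample` body and the namespace-copying decorator are assumptions for the sketch, not atdata's actual implementation (only the decorator's signature comes from the diff above). The idea is that `@dataclass_transform()` tells type checkers the decorated class gains a dataclass-style `__init__`, while the `type[PackableSample]` return annotation surfaces the base-class methods.

```python
from dataclasses import dataclass
from typing import TypeVar

try:
    from typing import dataclass_transform  # Python 3.11+
except ImportError:  # fallback so the sketch runs on older interpreters
    def dataclass_transform(**_kwargs):
        return lambda fn: fn

_T = TypeVar("_T")


class PackableSample:
    """Stand-in base class; the real atdata version does msgpack (de)serialization."""

    @property
    def packed(self) -> bytes:
        # Toy placeholder: atdata would return the msgpack-encoded fields here.
        return repr(vars(self)).encode()


@dataclass_transform()
def packable(cls: type[_T]) -> type[PackableSample]:
    # Rebuild cls as a dataclass that also inherits PackableSample,
    # mirroring the shape of the decorator the patch annotates.
    ns = {k: v for k, v in cls.__dict__.items() if k not in ("__dict__", "__weakref__")}
    return dataclass(type(cls.__name__, (PackableSample,), ns))


@packable
class ImageSample:
    label: str


s = ImageSample(label="cat")          # field signature seen via @dataclass_transform()
assert isinstance(s, PackableSample)  # method surface seen via the return annotation
```

In an editor this makes both `s.label` and `s.packed` resolvable, which is the behavior the commit message describes.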
From f6665a58c9e4c1e1bd3391647174498fdc91e926 Mon Sep 17 00:00:00 2001 From: Maxine Levesque <170461181+maxinelevesque@users.noreply.github.com> Date: Thu, 29 Jan 2026 10:29:28 -0800 Subject: [PATCH 02/30] chore(dev): add just test/lint commands and regenerate documentation Add test and lint recipes to justfile, update CLAUDE.md to document all available just commands, and regenerate docs with updated quarto theme/styling. Co-Authored-By: Claude Opus 4.5 --- .chainlink/issues.db | Bin 487424 -> 487424 bytes CHANGELOG.md | 1 + CLAUDE.md | 6 +- docs/api/AbstractDataStore.html | 220 ++----------- docs/api/AbstractIndex.html | 220 ++----------- docs/api/AtUri.html | 220 ++----------- docs/api/AtmosphereClient.html | 220 ++----------- docs/api/AtmosphereIndex.html | 220 ++----------- docs/api/AtmosphereIndexEntry.html | 220 ++----------- docs/api/BlobSource.html | 220 ++----------- docs/api/DataSource.html | 220 ++----------- docs/api/Dataset.html | 310 +++++------------- docs/api/DatasetDict.html | 220 ++----------- docs/api/DatasetLoader.html | 220 ++----------- docs/api/DatasetPublisher.html | 220 ++----------- docs/api/DictSample.html | 220 ++----------- docs/api/IndexEntry.html | 220 ++----------- docs/api/Lens.html | 220 ++----------- docs/api/LensLoader.html | 220 ++----------- docs/api/LensPublisher.html | 220 ++----------- docs/api/PDSBlobStore.html | 220 ++----------- docs/api/Packable-protocol.html | 220 ++----------- docs/api/PackableSample.html | 220 ++----------- docs/api/S3Source.html | 220 ++----------- docs/api/SampleBatch.html | 220 ++----------- docs/api/SchemaLoader.html | 220 ++----------- docs/api/SchemaPublisher.html | 220 ++----------- docs/api/URLSource.html | 220 ++----------- docs/api/index.html | 220 ++----------- docs/api/load_dataset.html | 220 ++----------- docs/api/local.Index.html | 220 ++----------- docs/api/local.LocalDatasetEntry.html | 220 ++----------- docs/api/local.S3DataStore.html | 220 ++----------- docs/api/packable.html | 231 
++-----------
 docs/api/promote_to_atmosphere.html | 220 ++-----------
 docs/index.html | 252 +++-----------
 docs/reference/architecture.html | 258 +++-------------
 docs/reference/atmosphere.html | 274 ++++------------
 docs/reference/datasets.html | 256 +++-------------
 docs/reference/deployment.html | 230 ++-----------
 docs/reference/lenses.html | 252 +++-----------
 docs/reference/load-dataset.html | 254 +++-----------
 docs/reference/local-storage.html | 252 +++-----------
 docs/reference/packable-samples.html | 256 +++-------------
 docs/reference/promotion.html | 246 +++-----------
 docs/reference/protocols.html | 254 +++-----------
 docs/reference/troubleshooting.html | 230 ++-----------
 docs/reference/uri-spec.html | 234 ++-----------
 docs/robots.txt | 2 +-
 docs/search.json | 195 +----------
 ...p-62bce24ca844314e7bb1a34dbdfe05cc.min.css | 12 -
 ...p-62ce3d63edf8507b4d15f75c6b92352a.min.css | 12 +
 ...k-7964ffd8887b0991fe8d71c6c8bc75d6.min.css | 12 -
 ...ting-b854dd4081d6110d4acfde180236d7b2.css} | 4 +-
 ...-dark-8dcd8563ea6803ab7cbb3d71ca5772e1.css | 210 ------------
 docs/sitemap.xml | 156 ++++-----
 docs/styles.css | 50 +++
 docs/tutorials/atmosphere.html | 258 +++-------------
 docs/tutorials/local-workflow.html | 246 +++-----------
 docs/tutorials/promotion.html | 254 +++-----------
 docs/tutorials/quickstart.html | 242 +++-----------
 docs_src/_brand.yml | 73 +++++
 docs_src/_quarto.yml | 66 +++-
 docs_src/api/Dataset.qmd | 50 ++-
 docs_src/api/packable.qmd | 18 +-
 docs_src/index.qmd | 4 +-
 docs_src/styles.css | 50 +++
 docs_src/theme-dark.scss | 1 +
 docs_src/theme-light.scss | 15 +
 justfile | 11 +-
 prototyping/human-review-atmosphere.ipynb | 45 ++-
 prototyping/human-review-local.ipynb | 208 +++++-----
 72 files changed, 2267 insertions(+), 10323 deletions(-)
 delete mode 100644 docs/site_libs/bootstrap/bootstrap-62bce24ca844314e7bb1a34dbdfe05cc.min.css
 create mode 100644 docs/site_libs/bootstrap/bootstrap-62ce3d63edf8507b4d15f75c6b92352a.min.css
 delete mode 100644
docs/site_libs/bootstrap/bootstrap-dark-7964ffd8887b0991fe8d71c6c8bc75d6.min.css
 rename docs/site_libs/quarto-html/{quarto-syntax-highlighting-9582434199d49cc9e91654cdeeb4866b.css => quarto-syntax-highlighting-b854dd4081d6110d4acfde180236d7b2.css} (94%)
 delete mode 100644 docs/site_libs/quarto-html/quarto-syntax-highlighting-dark-8dcd8563ea6803ab7cbb3d71ca5772e1.css
 create mode 100644 docs/styles.css
 create mode 100644 docs_src/_brand.yml
 create mode 100644 docs_src/styles.css
 create mode 100644 docs_src/theme-dark.scss
 create mode 100644 docs_src/theme-light.scss

diff --git a/.chainlink/issues.db b/.chainlink/issues.db
index 5cd1165dd5c81b379b04810d6cbdc9140cff4e6e..76ae18ac21fb274815616eff874f3b3a6e194adf 100644
GIT binary patch
(base85 delta data omitted)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 05c2463..804fa5d 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -25,6 +25,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 - **Comprehensive integration test suite**: 593 tests covering E2E flows, error handling, edge cases
 
 ### Changed
+- Add just lint command to justfile (#408)
 - Add SQLite/PostgreSQL providers for LocalIndex (in addition to Redis) (#407)
 - Fix type hints for @atdata.packable decorator to show PackableSample methods (#406)
 - Review GitHub workflows and recommend CI improvements (#405)
diff --git a/CLAUDE.md b/CLAUDE.md
index 349c268..6b096b4 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -46,8 +46,10 @@ uv build
 Development tasks are managed with [just](https://github.com/casey/just), a command runner.
 Available commands:
 
 ```bash
-# Build documentation (runs quartodoc + quarto)
-just docs
+just test  # Run all tests with coverage
+just test tests/test_dataset.py  # Run specific test file
+just lint  # Run ruff check + format check
+just docs  # Build documentation (runs quartodoc + quarto)
 ```
 
 The `justfile` is in the project root. Add new dev tasks there rather than creating shell scripts.
@@ -290,10 +145,6 @@
-
@@ -411,7 +261,7 @@

On this page

- +
@@ -419,7 +269,6 @@

On this page

-

AbstractDataStore

AbstractDataStore()
@@ -602,19 +451,6 @@

- - - + - - - + + + - +
@@ -290,10 +145,6 @@
-
@@ -417,7 +267,7 @@

On this page

- +
@@ -425,7 +275,6 @@

On this page

-

AbstractIndex

AbstractIndex()
@@ -925,19 +774,6 @@

- - - + - - - + + + - +
@@ -290,10 +145,6 @@
-
@@ -410,7 +260,7 @@

On this page

- +
@@ -418,7 +268,6 @@

On this page

-

AtUri

atmosphere.AtUri(authority, collection, rkey)
@@ -547,19 +396,6 @@

Rais

- - - + - - - + + + - +
@@ -290,10 +145,6 @@
-
@@ -424,7 +274,7 @@

On this page

- +
@@ -432,7 +282,6 @@

On this page

-

AtmosphereClient

atmosphere.AtmosphereClient(base_url=None, *, _client=None)
@@ -1460,19 +1309,6 @@

Ra

- - - + - - - + + + - +
@@ -290,10 +145,6 @@
-
@@ -416,7 +266,7 @@

On this page

- +
@@ -424,7 +274,6 @@

On this page

-

AtmosphereIndex

atmosphere.AtmosphereIndex(client, *, data_store=None)
@@ -930,19 +779,6 @@

- - - + - - - + + + - +
@@ -290,10 +145,6 @@
-
@@ -405,7 +255,7 @@

On this page

  • Attributes
  • - +
    @@ -413,7 +263,6 @@

    On this page

    -

    AtmosphereIndexEntry

    atmosphere.AtmosphereIndexEntry(uri, record)
    @@ -449,19 +298,6 @@

    window.document.addEventListener("DOMContentLoaded", function (event) { - // Ensure there is a toggle, if there isn't float one in the top right - if (window.document.querySelector('.quarto-color-scheme-toggle') === null) { - const a = window.document.createElement('a'); - a.classList.add('top-right'); - a.classList.add('quarto-color-scheme-toggle'); - a.href = ""; - a.onclick = function() { try { window.quartoToggleColorScheme(); } catch {} return false; }; - const i = window.document.createElement("i"); - i.classList.add('bi'); - a.appendChild(i); - window.document.body.appendChild(a); - } - setColorSchemeToggle(hasAlternateSentinel()) const icon = ""; const anchorJS = new window.AnchorJS(); anchorJS.options = { @@ -532,7 +368,7 @@

    { return filterRegex.test(href) || localhostRegex.test(href) || mailtoRegex.test(href); } @@ -867,7 +703,7 @@

      - + diff --git a/docs/api/BlobSource.html b/docs/api/BlobSource.html index 5b15b69..6797e32 100644 --- a/docs/api/BlobSource.html +++ b/docs/api/BlobSource.html @@ -71,14 +71,10 @@ - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -412,7 +262,7 @@

    On this page

    - +
    @@ -420,7 +270,6 @@

    On this page

    -

    BlobSource

    BlobSource(blob_refs, pds_endpoint=None, _endpoint_cache=dict())
    @@ -640,19 +489,6 @@

    Ra

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -411,7 +261,7 @@

    On this page

    - +
    @@ -419,7 +269,6 @@

    On this page

    -

    DataSource

    DataSource()
    @@ -573,19 +422,6 @@

    Rais

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -418,7 +268,7 @@

    On this page

    - +
    @@ -426,7 +276,6 @@

    On this page

    -

    Dataset

    Dataset(source=None, metadata_url=None, *, url=None)
    @@ -513,7 +362,7 @@

    Methods

    ordered -Iterate over the dataset in order +Iterate over the dataset in order. shuffled @@ -639,16 +488,10 @@

    ordered

    Dataset.ordered(batch_size=None)
    -

    Iterate over the dataset in order

    +

    Iterate over the dataset in order.

    Parameters

    ------ @@ -659,10 +502,10 @@

    -

    - - - + + + +
    Namebatch_size (obj:int, optional): The size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.requiredbatch_sizeint | NoneThe size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.None
    @@ -680,21 +523,38 @@

    -Iterable[ST] -obj:webdataset.DataPipeline A data pipeline that iterates over +Iterable[ST] | Iterable[SampleBatch[ST]] +A data pipeline that iterates over the dataset in its original -Iterable[ST] -the dataset in its original sample order +Iterable[ST] | Iterable[SampleBatch[ST]] +sample order. When batch_size is None, yields individual + + + +Iterable[ST] | Iterable[SampleBatch[ST]] +samples of type ST. When batch_size is an integer, yields + + + +Iterable[ST] | Iterable[SampleBatch[ST]] +SampleBatch[ST] instances containing that many samples.

    +
    +

    Examples

    +
    >>> for sample in ds.ordered():
    +...     process(sample)  # sample is ST
    +>>> for batch in ds.ordered(batch_size=32):
    +...     process(batch)  # batch is SampleBatch[ST]
    +

    shuffled

    -
    Dataset.shuffled(buffer_shards=100, buffer_samples=10000, batch_size=None)
    +
    Dataset.shuffled(buffer_shards=100, buffer_samples=10000, batch_size=None)

    Iterate over the dataset in random order.

    Parameters

    @@ -742,31 +602,38 @@

    -Iterable[ST] -A WebDataset data pipeline that iterates over the dataset in +Iterable[ST] | Iterable[SampleBatch[ST]] +A data pipeline that iterates over the dataset in randomized order. -Iterable[ST] -randomized order. If batch_size is not None, yields +Iterable[ST] | Iterable[SampleBatch[ST]] +When batch_size is None, yields individual samples of type -Iterable[ST] -SampleBatch[ST] instances; otherwise yields individual ST +Iterable[ST] | Iterable[SampleBatch[ST]] +ST. When batch_size is an integer, yields SampleBatch[ST] -Iterable[ST] -samples. +Iterable[ST] | Iterable[SampleBatch[ST]] +instances containing that many samples.

    +
    +

    Examples

    +
    >>> for sample in ds.shuffled():
    +...     process(sample)  # sample is ST
    +>>> for batch in ds.shuffled(batch_size=32):
    +...     process(batch)  # batch is SampleBatch[ST]
    +

    to_parquet

    -
    Dataset.to_parquet(path, sample_map=None, maxcount=None, **kwargs)
    +
    Dataset.to_parquet(path, sample_map=None, maxcount=None, **kwargs)

    Export dataset contents to parquet format.

    Converts all samples to a pandas DataFrame and saves to parquet file(s). Useful for interoperability with data analysis tools.

    @@ -816,19 +683,19 @@

    Wa ds.to_parquet("output.parquet", maxcount=10000)

    This creates multiple parquet files: output-000000.parquet, output-000001.parquet, etc.

    -
    -

    Examples

    -
    >>> ds = Dataset[MySample]("data.tar")
    ->>> # Small dataset - load all at once
    ->>> ds.to_parquet("output.parquet")
    ->>>
    ->>> # Large dataset - process in chunks
    ->>> ds.to_parquet("output.parquet", maxcount=50000)
    +
    +

    Examples

    +
    >>> ds = Dataset[MySample]("data.tar")
    +>>> # Small dataset - load all at once
    +>>> ds.to_parquet("output.parquet")
    +>>>
    +>>> # Large dataset - process in chunks
    +>>> ds.to_parquet("output.parquet", maxcount=50000)

    wrap

    -
    Dataset.wrap(sample)
    +
    Dataset.wrap(sample)

    Wrap a raw msgpack sample into the appropriate dataset-specific type.

    Parameters

    @@ -878,7 +745,7 @@

    wrap_batch

    -
    Dataset.wrap_batch(batch)
    +
    Dataset.wrap_batch(batch)

    Wrap a batch of raw msgpack samples into a typed SampleBatch.

    Parameters

    @@ -938,19 +805,6 @@

    Note - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -407,7 +257,7 @@

    On this page

  • Attributes
  • - +
    @@ -415,7 +265,6 @@

    On this page

    -

    DatasetDict

    DatasetDict(splits=None, sample_type=None, streaming=False)
    @@ -490,19 +339,6 @@

    Attributes

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -416,7 +266,7 @@

    On this page

    - +
    @@ -424,7 +274,6 @@

    On this page

    -

    DatasetLoader

    atmosphere.DatasetLoader(client)
    @@ -991,19 +840,6 @@

    window.document.addEventListener("DOMContentLoaded", function (event) { - // Ensure there is a toggle, if there isn't float one in the top right - if (window.document.querySelector('.quarto-color-scheme-toggle') === null) { - const a = window.document.createElement('a'); - a.classList.add('top-right'); - a.classList.add('quarto-color-scheme-toggle'); - a.href = ""; - a.onclick = function() { try { window.quartoToggleColorScheme(); } catch {} return false; }; - const i = window.document.createElement("i"); - i.classList.add('bi'); - a.appendChild(i); - window.document.body.appendChild(a); - } - setColorSchemeToggle(hasAlternateSentinel()) const icon = ""; const anchorJS = new window.AnchorJS(); anchorJS.options = { @@ -1074,7 +910,7 @@

    { return filterRegex.test(href) || localhostRegex.test(href) || mailtoRegex.test(href); } @@ -1409,7 +1245,7 @@

      - + diff --git a/docs/api/DatasetPublisher.html b/docs/api/DatasetPublisher.html index 2e69763..da5a1aa 100644 --- a/docs/api/DatasetPublisher.html +++ b/docs/api/DatasetPublisher.html @@ -71,14 +71,10 @@ - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -411,7 +261,7 @@

    On this page

    - +
    @@ -419,7 +269,6 @@

    On this page

    -

    DatasetPublisher

    atmosphere.DatasetPublisher(client)
    @@ -802,19 +651,6 @@

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -417,7 +267,7 @@

    On this page

    - +
    @@ -425,7 +275,6 @@

    On this page

    -

    DictSample

    DictSample(_data=None, **kwargs)
    @@ -678,19 +527,6 @@

    values

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -406,7 +256,7 @@

    On this page

  • Attributes
  • - +
    @@ -414,7 +264,6 @@

    On this page

    -

    IndexEntry

    IndexEntry()
    @@ -460,19 +309,6 @@

    Attributes

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -414,7 +264,7 @@

    On this page

    - +
    @@ -422,7 +272,6 @@

    On this page

    -

    lens

    lens

    @@ -941,19 +790,6 @@

    window.document.addEventListener("DOMContentLoaded", function (event) { - // Ensure there is a toggle, if there isn't float one in the top right - if (window.document.querySelector('.quarto-color-scheme-toggle') === null) { - const a = window.document.createElement('a'); - a.classList.add('top-right'); - a.classList.add('quarto-color-scheme-toggle'); - a.href = ""; - a.onclick = function() { try { window.quartoToggleColorScheme(); } catch {} return false; }; - const i = window.document.createElement("i"); - i.classList.add('bi'); - a.appendChild(i); - window.document.body.appendChild(a); - } - setColorSchemeToggle(hasAlternateSentinel()) const icon = ""; const anchorJS = new window.AnchorJS(); anchorJS.options = { @@ -1024,7 +860,7 @@

    { return filterRegex.test(href) || localhostRegex.test(href) || mailtoRegex.test(href); } @@ -1359,7 +1195,7 @@

      - + diff --git a/docs/api/LensLoader.html b/docs/api/LensLoader.html index b3a3073..adf9921 100644 --- a/docs/api/LensLoader.html +++ b/docs/api/LensLoader.html @@ -71,14 +71,10 @@ - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -411,7 +261,7 @@

    On this page

    - +
    @@ -419,7 +269,6 @@

    On this page

    -

    LensLoader

    atmosphere.LensLoader(client)
    @@ -643,19 +492,6 @@

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -411,7 +261,7 @@

    On this page

    - +
    @@ -419,7 +269,6 @@

    On this page

    -

    LensPublisher

    atmosphere.LensPublisher(client)
    @@ -697,19 +546,6 @@

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -413,7 +263,7 @@

    On this page

    - +
    @@ -421,7 +271,6 @@

    On this page

    -

    PDSBlobStore

    atmosphere.PDSBlobStore(client)
    @@ -759,19 +608,6 @@

    Note

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -411,7 +261,7 @@

    On this page

    - +
    @@ -419,7 +269,6 @@

    On this page

    -

    Packable

    Packable()
    @@ -498,19 +347,6 @@

    from_data

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -411,7 +261,7 @@

    On this page

    - +
    @@ -419,7 +269,6 @@

    On this page

    -

    PackableSample

    PackableSample()
    @@ -576,19 +425,6 @@

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -413,7 +263,7 @@

    On this page

    - +
    @@ -421,7 +271,6 @@

    On this page

    -

    S3Source

    S3Source(
    @@ -767,19 +616,6 @@ 

    Ra

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -408,7 +258,7 @@

    On this page

  • Note
  • - +
    @@ -416,7 +266,6 @@

    On this page

    -

    SampleBatch

    SampleBatch(samples)
    @@ -486,19 +335,6 @@

    Note

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -410,7 +260,7 @@

    On this page

    - +
    @@ -418,7 +268,6 @@

    On this page

    -

    SchemaLoader

    atmosphere.SchemaLoader(client)
    @@ -582,19 +431,6 @@

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -409,7 +259,7 @@

    On this page

    - +
    @@ -417,7 +267,6 @@

    On this page

    -

    SchemaPublisher

    atmosphere.SchemaPublisher(client)
    @@ -569,19 +418,6 @@

    Rais

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -411,7 +261,7 @@

    On this page

    - +
    @@ -419,7 +269,6 @@

    On this page

    -

    URLSource

    URLSource(url)
    @@ -548,19 +397,6 @@

    Rais

    - - - + - - - + + + - +
    @@ -255,10 +110,6 @@
    -
    @@ -375,7 +225,7 @@

    On this page

  • Promotion
  • - +
    @@ -383,7 +233,6 @@

    On this page

    -

    API Reference

    @@ -569,19 +418,6 @@

    Promotion

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -408,7 +258,7 @@

    On this page

  • Examples
  • - +
    @@ -416,7 +266,6 @@

    On this page

    -

    load_dataset

    load_dataset(
    @@ -566,19 +415,6 @@ 

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -424,7 +274,7 @@

    On this page

    - +
    @@ -432,7 +282,6 @@

    On this page

    -

    local.Index

    local.Index(
    @@ -1487,19 +1336,6 @@ 

    Ra

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -410,7 +260,7 @@

    On this page

    - +
    @@ -418,7 +268,6 @@

    On this page

    -

    local.LocalDatasetEntry

    local.LocalDatasetEntry(
    @@ -591,19 +440,6 @@ 

    window.document.addEventListener("DOMContentLoaded", function (event) { - // Ensure there is a toggle, if there isn't float one in the top right - if (window.document.querySelector('.quarto-color-scheme-toggle') === null) { - const a = window.document.createElement('a'); - a.classList.add('top-right'); - a.classList.add('quarto-color-scheme-toggle'); - a.href = ""; - a.onclick = function() { try { window.quartoToggleColorScheme(); } catch {} return false; }; - const i = window.document.createElement("i"); - i.classList.add('bi'); - a.appendChild(i); - window.document.body.appendChild(a); - } - setColorSchemeToggle(hasAlternateSentinel()) const icon = ""; const anchorJS = new window.AnchorJS(); anchorJS.options = { @@ -674,7 +510,7 @@

    { return filterRegex.test(href) || localhostRegex.test(href) || mailtoRegex.test(href); } @@ -1009,7 +845,7 @@

      -

    + diff --git a/docs/api/local.S3DataStore.html b/docs/api/local.S3DataStore.html index 95026d1..4f89741 100644 --- a/docs/api/local.S3DataStore.html +++ b/docs/api/local.S3DataStore.html @@ -71,14 +71,10 @@ - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -411,7 +261,7 @@

    On this page

    - +
    @@ -419,7 +269,6 @@

    On this page

    -

    local.S3DataStore

    local.S3DataStore(credentials, *, bucket)
    @@ -644,19 +493,6 @@

    Rais

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -402,12 +252,13 @@

    On this page

    - +
    @@ -415,13 +266,16 @@

    On this page

    -

    packable

    packable(cls)

    Decorator to convert a regular class into a PackableSample.

    This decorator transforms a class into a dataclass that inherits from PackableSample, enabling automatic msgpack serialization/deserialization with special handling for NDArray fields.

    The resulting class satisfies the Packable protocol, making it compatible with all atdata APIs that accept packable types (e.g., publish_schema, lens transformations, etc.).

    +
    +

    Type Checking

    +

    The return type is annotated as type[PackableSample] so that IDEs and type checkers recognize the PackableSample methods (packed, as_wds, from_bytes, etc.). The @dataclass_transform() decorator ensures that field access from the original class is also preserved for type checking.

    +

    Parameters

    @@ -456,17 +310,17 @@

    Re

    - + - + - + @@ -493,19 +347,6 @@

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -408,7 +258,7 @@

    On this page

  • Examples
  • - +
    @@ -416,7 +266,6 @@

    On this page

    -

    promote_to_atmosphere

    promote.promote_to_atmosphere(
    @@ -552,19 +401,6 @@ 

    - - - + - - - + + + - +
    @@ -290,10 +145,6 @@
    -
    @@ -395,9 +245,9 @@ - - +
    @@ -541,8 +391,7 @@

    On this page

    - +
    -
    -

    atdata

    +

    atdata

    A loose federation of distributed, typed datasets built on WebDataset

    @@ -587,11 +435,12 @@

    atdata

    -
    -

    atdata

    -

    A loose federation of distributed, typed datasets built on WebDataset.

    +
    +

    The Challenge

    Machine learning datasets are everywhere—training data, validation sets, embeddings, features, model outputs. Yet working with them often means:

    @@ -666,7 +515,7 @@

    Quick Example

    1. Define a Sample Type

    The @packable decorator creates a serializable dataclass:

    -
    +
    import numpy as np
     from numpy.typing import NDArray
     import atdata
    @@ -681,7 +530,7 @@ 

    1. Define a Sample Ty

    2. Create and Write Samples

    Use WebDataset’s standard TarWriter:

    -
    +
    import webdataset as wds
     
     samples = [
    @@ -701,7 +550,7 @@ 

    2. Create and Wri

    3. Load and Iterate with Type Safety

    The generic Dataset[T] provides typed access:

    -
    +
    dataset = atdata.Dataset[ImageSample]("data-000000.tar")
     
     for batch in dataset.shuffled(batch_size=32):
    @@ -716,7 +565,7 @@ 

    Scaling Up

    Team Storage with Redis + S3

    When you’re ready to share with your team:

    -
    +
    from atdata.local import LocalIndex, S3DataStore
     
     # Connect to team infrastructure
    @@ -740,7 +589,7 @@ 

    Team Storage wi

    Federation with ATProto

    For public or cross-organization sharing:

    -
    +
    from atdata.atmosphere import AtmosphereClient, AtmosphereIndex, PDSBlobStore
     from atdata.promote import promote_to_atmosphere
     
    @@ -762,7 +611,7 @@ 

    Federation with AT

    HuggingFace-Style Loading

    For convenient access to datasets:

    -
    +
    from atdata import load_dataset
     
     # Load from local files
    @@ -847,19 +696,6 @@ 

    Next Steps

    - - - + - - - + + + - +
    @@ -291,10 +146,6 @@
    -
    @@ -396,9 +246,9 @@ - - +
    @@ -581,15 +431,14 @@

    On this page

  • Summary
  • Related
  • - +
    - -
    +
    -

    Architecture Overview

    +

    Architecture Overview

    @@ -657,7 +506,7 @@

    Core Components

    PackableSample: The Foundation

    Everything in atdata starts with PackableSample—a base class that makes Python dataclasses serializable with msgpack:

    -
    +
    @atdata.packable
     class ImageSample:
         image: NDArray       # Automatically converted to/from bytes
    @@ -680,7 +529,7 @@ 

    PackableSamp

    Dataset: Typed Iteration

    The Dataset[T] class wraps WebDataset tar archives with type information:

    -
    +
    dataset = atdata.Dataset[ImageSample]("data-{000000..000009}.tar")
     
     for batch in dataset.shuffled(batch_size=32):
    @@ -704,7 +553,7 @@ 

    Dataset: Typed Ite

    SampleBatch: Automatic Aggregation

    When iterating with batch_size, atdata returns SampleBatch[T] objects that aggregate sample attributes:

    -
    +
    batch = SampleBatch[ImageSample](samples)
     
     # NDArray fields → stacked numpy array with batch dimension
    @@ -718,7 +567,7 @@ 

    SampleBa

    Lens: Schema Transformations

    Lenses enable viewing datasets through different schemas without duplicating data:

    -
    +
    @atdata.packable
     class SimplifiedSample:
         label: str
    @@ -755,7 +604,7 @@ 

    Local Index (Redis +
  • WebDataset tar shards
  • Any S3-compatible storage (AWS, MinIO, Cloudflare R2)
  • -
    +
    store = S3DataStore(credentials=creds, bucket="datasets")
     index = LocalIndex(data_store=store)
     
    @@ -783,7 +632,7 @@ 

    Atmosphere Index
  • Store actual data shards as ATProto blobs
  • Fully decentralized—no external dependencies
  • -
    +
    client = AtmosphereClient()
     client.login("handle.bsky.social", "app-password")
     
    @@ -801,7 +650,7 @@ 

    Protocol Abstraction

    AbstractIndex

    Common interface for both LocalIndex and AtmosphereIndex:

    -
    +
    def process_dataset(index: AbstractIndex, name: str):
         entry = index.get_dataset(name)
         schema = index.decode_schema(entry.schema_ref)
    @@ -817,7 +666,7 @@ 

    AbstractIndex

    AbstractDataStore

    Common interface for S3DataStore and PDSBlobStore:

    -
    +
    def write_to_store(store: AbstractDataStore, dataset: Dataset):
         urls = store.write_shards(dataset, prefix="data/v1")
         # Works with S3 or PDS blob storage
    @@ -838,7 +687,7 @@

    Data Flow: L

    A typical workflow progresses through three stages:

    Stage 1: Local Development

    -
    +
    # Define type and create samples
     @atdata.packable
     class MySample:

    Stage 2: Team Storage

    # Set up team storage
     store = S3DataStore(credentials=team_creds, bucket="team-datasets")
     index = LocalIndex(data_store=store)

    Stage 3: Federation

    # Promote to atmosphere
     client = AtmosphereClient()
     client.login("handle.bsky.social", "app-password")

    Extension Points

    Custom DataSources

    Implement the DataSource protocol to add new storage backends:

    class MyCustomSource:
         def list_shards(self) -> list[str]: ...
         def open_shard(self, shard_id: str) -> IO[bytes]: ...
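A minimal in-memory implementation makes the protocol concrete (a sketch following the stub above; `InMemorySource` is hypothetical, not part of atdata):

```python
import io
from typing import IO

class InMemorySource:
    """Toy data source holding shard bytes in a dict."""

    def __init__(self, shards: dict[str, bytes]):
        self._shards = shards

    def list_shards(self) -> list[str]:
        # Deterministic ordering keeps iteration reproducible.
        return sorted(self._shards)

    def open_shard(self, shard_id: str) -> IO[bytes]:
        return io.BytesIO(self._shards[shard_id])

source = InMemorySource({"data-000000.tar": b"\x00" * 16})
print(source.list_shards())  # ['data-000000.tar']
```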

    Custom Lenses

    Register transformations between any PackableSample types:

    @atdata.lens
     def my_transform(src: SourceType) -> TargetType:
         return TargetType(...)

    Schema Extensions

    The schema format supports custom metadata for domain-specific needs:

    index.publish_schema(
         MySample,
         version="1.0.0",

    Atmosphere (ATProto Integration)



    AtmosphereClient

    The client handles authentication and record operations:

    from atdata.atmosphere import AtmosphereClient
     
     client = AtmosphereClient()

    Session Management

    Save and restore sessions to avoid re-authentication:

    # Export session for later
     session_string = client.export_session()
     

    Custom PDS

    Connect to a custom PDS instead of bsky.social:

    client = AtmosphereClient(base_url="https://pds.example.com")

    PDSBlobStore

    Store dataset shards as ATProto blobs for fully decentralized storage:

    from atdata.atmosphere import AtmosphereClient, PDSBlobStore
     
     client = AtmosphereClient()

    Size Limits

    PDS blobs typically have size limits (often 50MB-5GB depending on the PDS). Use maxcount and maxsize parameters to control shard sizes:

    urls = store.write_shards(
         dataset,
         prefix="large-data/v1",

    BlobSource

    Read datasets stored as PDS blobs:

    from atdata import BlobSource
     
     # From blob references

    AtmosphereIndex

    The unified interface for ATProto operations, implementing the AbstractIndex protocol:

    from atdata.atmosphere import AtmosphereClient, AtmosphereIndex, PDSBlobStore
     
     client = AtmosphereClient()

    Publishing Schemas

    import atdata
     from numpy.typing import NDArray
     

    Publishing Datasets

    dataset = atdata.Dataset[ImageSample]("data-{000000..000009}.tar")
     
     entry = index.insert_dataset(

    Listing and Retrieving

    # List your datasets
     for entry in index.list_datasets():
         print(f"{entry.name}: {entry.schema_ref}")

    Lower-Level Publish

    For more control, use the individual publisher classes:

    SchemaPublisher

    from atdata.atmosphere import SchemaPublisher
     
     publisher = SchemaPublisher(client)

    DatasetPublisher

    from atdata.atmosphere import DatasetPublisher
     
     publisher = DatasetPublisher(client)

    Blob Storage

    There are two approaches to storing data as ATProto blobs:

    Approach 1: PDSBlobStore (Recommended)

    Use PDSBlobStore with AtmosphereIndex for automatic shard management:

    from atdata.atmosphere import PDSBlobStore, AtmosphereIndex
     
     store = PDSBlobStore(client)

    Approach 2: Manual Blob Publishing

    For more control, use DatasetPublisher.publish_with_blobs() directly:

    import io
     import webdataset as wds
     

    Loading Blob-Stored Datasets

    from atdata.atmosphere import DatasetLoader
     from atdata import BlobSource
     

    LensPublisher

    from atdata.atmosphere import LensPublisher
     
     publisher = LensPublisher(client)

    Lower-Level Loaders

    For direct access to records, use the loader classes:

    SchemaLoader

    from atdata.atmosphere import SchemaLoader
     
     loader = SchemaLoader(client)

    DatasetLoader

    from atdata.atmosphere import DatasetLoader
     
     loader = DatasetLoader(client)

    LensLoader

    from atdata.atmosphere import LensLoader
     
     loader = LensLoader(client)

    AT URIs

    ATProto records are identified by AT URIs:

    from atdata.atmosphere import AtUri
     
     # Parse an AT URI
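The shape of an AT URI can be illustrated with a plain string split (a sketch of the `at://authority/collection/rkey` layout, not the AtUri API):

```python
uri = "at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz789"

# at://<authority>/<collection>/<rkey>
authority, collection, rkey = uri.removeprefix("at://").split("/")

print(authority)   # did:plc:abc123
print(collection)  # ac.foundation.dataset.sampleSchema
print(rkey)        # xyz789
```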

    Complete Example

    This example shows the full workflow using PDSBlobStore for decentralized storage:

    import numpy as np
     from numpy.typing import NDArray
     import atdata

    For external URL storage (without PDSBlobStore):

    # Use AtmosphereIndex without data_store
     index = AtmosphereIndex(client)
     

    Datasets



    The Dataset class provides typed iteration over WebDataset tar files with automatic batching and lens transformations.

    Creating a Dataset

    import atdata
     from numpy.typing import NDArray
     

    Data Sources

    URL Source (default)

    When you pass a string to Dataset, it automatically wraps it in a URLSource:

    # These are equivalent:
     dataset = atdata.Dataset[ImageSample]("data-{000000..000009}.tar")
     dataset = atdata.Dataset[ImageSample](atdata.URLSource("data-{000000..000009}.tar"))

    S3 Source

    For private S3 buckets or S3-compatible storage (Cloudflare R2, MinIO), use S3Source:

    # From explicit credentials
     source = atdata.S3Source(
         bucket="my-bucket",

    Iteration Modes

    Ordered Iteration

    Iterate through samples in their original order:

    # With batching (default batch_size=1)
     for batch in dataset.ordered(batch_size=32):
         images = batch.image  # numpy array (32, H, W, C)

    Shuffled Iteration

    Iterate with randomized order at both shard and sample levels:

    for batch in dataset.shuffled(batch_size=32):
         # Samples are shuffled
         process(batch)

    SampleBatch

    When iterating with a batch_size, each iteration yields a SampleBatch with automatic attribute aggregation.

    @atdata.packable
     class Sample:
         features: NDArray  # shape (256,)

    Type Transformations with Lenses

    View a dataset through a different sample type using registered lenses:

    @atdata.packable
     class SimplifiedSample:
         label: str

    Dataset Properties

    Shard List

    Get the list of individual tar files:

    dataset = atdata.Dataset[Sample]("data-{000000..000009}.tar")
     shards = dataset.shard_list
     # ['data-000000.tar', 'data-000001.tar', ..., 'data-000009.tar']

    Metadata

    Datasets can have associated metadata from a URL:

    dataset = atdata.Dataset[Sample](
         "data-{000000..000009}.tar",
         metadata_url="https://example.com/metadata.msgpack"

    Writing Datasets

    Use WebDataset’s TarWriter or ShardWriter to create datasets:

    import webdataset as wds
     import numpy as np
     

    Parquet Export

    Export dataset contents to parquet format:

    # Export entire dataset
     dataset.to_parquet("output.parquet")
     

    Source

    Access the underlying DataSource:

    dataset = atdata.Dataset[Sample]("data.tar")
     source = dataset.source  # URLSource instance
     print(source.shard_list)  # ['data.tar']

    Sample Type

    Get the type parameter used to create the dataset:

    dataset = atdata.Dataset[ImageSample]("data.tar")
     print(dataset.sample_type)  # <class 'ImageSample'>
     print(dataset.batch_type)   # SampleBatch[ImageSample]

    Deployment Guide



    Lenses



    Creating a Lens

    Use the @lens decorator to define a getter:

    import atdata
     from numpy.typing import NDArray
     

    Adding a Putter

    To enable bidirectional updates, add a putter:

    @simplify.putter
     def simplify_put(view: SimpleSample, source: FullSample) -> FullSample:
         return FullSample(

    Using Lenses with Datasets

    Lenses integrate with Dataset.as_type():

    dataset = atdata.Dataset[FullSample]("data-{000000..000009}.tar")
     
     # View through a different type

    Direct Lens Usage

    Lenses can also be called directly:

    import numpy as np
     
     full = FullSample(

    Lens Laws

    Well-behaved lenses should satisfy these properties:


    If you get a view and immediately put it back, the source is unchanged:

    view = lens.get(source)
     assert lens.put(view, source) == source

    If you put a view, getting it back yields that view:

    updated = lens.put(view, source)
     assert lens.get(updated) == view

    Putting twice is equivalent to putting once with the final value:

    result1 = lens.put(v2, lens.put(v1, source))
     result2 = lens.put(v2, source)
     assert result1 == result2
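The three laws can be checked end to end with a toy lens over plain dicts (illustrative only; real atdata lenses operate on PackableSample types):

```python
# A toy lens: the view is the "label" field of a dict source.
def get(source: dict) -> str:
    return source["label"]

def put(view: str, source: dict) -> dict:
    return {**source, "label": view}

source = {"label": "cat", "confidence": 0.9}

# GetPut: putting back an unchanged view leaves the source unchanged.
assert put(get(source), source) == source

# PutGet: getting after a put returns the view that was put.
assert get(put("dog", source)) == "dog"

# PutPut: only the final put matters.
assert put("bird", put("dog", source)) == put("bird", source)
print("all three lens laws hold")
```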

    Trivial Putter

    If no putter is defined, a trivial putter is used that ignores view updates:

    @atdata.lens
     def extract_label(src: FullSample) -> SimpleSample:
         return SimpleSample(label=src.label, confidence=src.confidence)

    LensNetwork Registry

    The LensNetwork is a singleton that stores all registered lenses:

    from atdata.lens import LensNetwork
     
     network = LensNetwork()

    Example: Feature Extraction

    @atdata.packable
     class RawSample:
         audio: NDArray

    load_dataset API



    Basic Usage

    import atdata
     from atdata import load_dataset
     from numpy.typing import NDArray

    Path Formats

    WebDataset Brace Notation

    # Range notation
     ds = load_dataset("data-{000000..000099}.tar", MySample, split="train")
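The brace range expands to an explicit shard list before loading; a plain-Python sketch of that expansion (webdataset handles this internally — `expand_braces` is illustrative, not a library function):

```python
import re

def expand_braces(pattern: str) -> list[str]:
    """Expand a single {lo..hi} range, preserving zero padding."""
    m = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if m is None:
        return [pattern]
    width = len(m.group(1))
    return [
        pattern[: m.start()] + str(i).zfill(width) + pattern[m.end():]
        for i in range(int(m.group(1)), int(m.group(2)) + 1)
    ]

shards = expand_braces("data-{000000..000099}.tar")
print(len(shards), shards[0], shards[-1])
# 100 data-000000.tar data-000099.tar
```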
     

    Glob Patterns

    # Match all tar files
     ds = load_dataset("path/to/*.tar", MySample)
     

    Local Directory

    # Scans for .tar files
     ds = load_dataset("./my-dataset/", MySample)

    Remote URLs

    # S3 (public buckets)
     ds = load_dataset("s3://bucket/data-{000..099}.tar", MySample, split="train")
     

    Index Lookup

    from atdata.local import LocalIndex
     
     index = LocalIndex()

    Split Detection

    DatasetDict

    When loading without split=, returns a DatasetDict:

    ds_dict = load_dataset("path/to/data/", MySample)
     
     # Access splits

    Explicit Data Files

    Override automatic detection with data_files:

    # Single pattern
     ds = load_dataset(
         "path/to/",

    Streaming Mode

    The streaming parameter signals intent for streaming mode:

    # Mark as streaming
     ds_dict = load_dataset("path/to/data.tar", MySample, streaming=True)
     

    Auto Type Resolution

    When using index lookup, the sample type can be resolved automatically:

    from atdata.local import LocalIndex
     
     index = LocalIndex()

    Error Handling

    try:
         ds = load_dataset("path/to/data.tar", MySample, split="train")
     except FileNotFoundError:

    Complete Example

    import numpy as np
     from numpy.typing import NDArray
     import atdata

    Local Storage



    LocalIndex

    The index tracks datasets in Redis:

    from atdata.local import LocalIndex
     
     # Default connection (localhost:6379)

    Adding Entries

    import atdata
     from numpy.typing import NDArray
     

    Listing and Retrieving

    # Iterate all entries
     for entry in index.entries:
         print(f"{entry.name}: {entry.cid}")

    Repo (Deprecated)

    The Repo class combines S3 storage with Redis indexing:

    from atdata.local import Repo
     
     # From credentials file

    Preferred approach - Use LocalIndex with S3DataStore:

    from atdata.local import LocalIndex, S3DataStore
     
     store = S3DataStore(

    Inserting Datasets

    import webdataset as wds
     import numpy as np
     

    Insert Options

    entry, ds = repo.insert(
         dataset,
         name="my-dataset",

    LocalDatasetEntry

    Index entries provide content-addressable identification:

    entry = index.get_entry_by_name("my-dataset")
     
     # Core properties (IndexEntry protocol)

    Schema Storage

    Schemas can be stored and retrieved from the index:

    # Publish a schema
     schema_ref = index.publish_schema(
         ImageSample,

    S3DataStore

    For direct S3 operations without Redis indexing:

    from atdata.local import S3DataStore
     
     store = S3DataStore(

    Complete Workflow Example

    import numpy as np
     from numpy.typing import NDArray
     import atdata

    Packable Samples



    The @packable Decorator

    The recommended way to define a sample type is with the @packable decorator:

    import numpy as np
     from numpy.typing import NDArray
     import atdata

    Supported Field Types

    Primitives

    @atdata.packable
     class PrimitiveSample:
         name: str

    NumPy Arrays

    Fields annotated as NDArray are automatically converted:

    @atdata.packable
     class ArraySample:
         features: NDArray          # Required array

    Lists

    @atdata.packable
     class ListSample:
         tags: list[str]

    Serialization

    Packing to Bytes

    sample = ImageSample(
         image=np.random.rand(224, 224, 3).astype(np.float32),
         label="cat",

    Unpacking from Bytes

    # Deserialize from bytes
     restored = ImageSample.from_bytes(packed_bytes)
     

    WebDataset Format

    The as_wds property returns a dict ready for WebDataset:

    wds_dict = sample.as_wds
     # {'__key__': '1234...', 'msgpack': b'...'}

    Write samples to a tar file:

    import webdataset as wds
     
     with wds.writer.TarWriter("data-000000.tar") as sink:

    Direct Inheritance (Alternative)

    You can also inherit directly from PackableSample:

    from dataclasses import dataclass
     
     @dataclass

    How It Works

    Serialization Flow

    Packing walks the dataclass fields, encodes NDArray values, and serializes the result to the msgpack payload stored under the sample's key; unpacking reverses the process.

      The _ensure_good() Method

      This method runs automatically after construction and handles NDArray conversion:

      def _ensure_good(self):
           for field in dataclasses.fields(self):
               if _is_possibly_ndarray_type(field.type):
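A self-contained analogue of that hook (illustrative only — `Sample` and `__post_init__` here stand in for what @packable generates, and the real helpers may differ):

```python
import dataclasses
import numpy as np

@dataclasses.dataclass
class Sample:
    features: np.ndarray

    def __post_init__(self):
        # Coerce list-like input to an ndarray, analogous to what
        # _ensure_good does for fields annotated NDArray.
        if not isinstance(self.features, np.ndarray):
            self.features = np.asarray(self.features)

s = Sample(features=[1.0, 2.0, 3.0])
print(type(s.features).__name__, s.features.shape)  # ndarray (3,)
```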

      Best Practices

      @atdata.packable
       class GoodSample:
           features: NDArray           # Clear type annotation

      @atdata.packable
       class BadSample:
           # DON'T: Nested dataclasses not supported

    Promotion Workflow



    Basic Usage

    from atdata.local import LocalIndex
     from atdata.atmosphere import AtmosphereClient
     from atdata.promote import promote_to_atmosphere

    With Metadata

    at_uri = promote_to_atmosphere(
         entry,
         local_index,

    Schema Deduplication

    The promotion workflow automatically checks for existing schemas:

    # First promotion: publishes schema
     uri1 = promote_to_atmosphere(entry1, local_index, client)
     

    Data Storage Options


    By default, promotion keeps the original data URLs:

    # Data stays in original S3 location
     at_uri = promote_to_atmosphere(entry, local_index, client)

    To copy data to a different storage location:

    from atdata.local import S3DataStore
     
     # Create new data store

    Complete Workflow Example

    import numpy as np
     from numpy.typing import NDArray
     import atdata

    Error Handling

    try:
         at_uri = promote_to_atmosphere(entry, local_index, client)
     except KeyError as e:

    Protocols



    IndexEntry Protocol

    Represents a dataset entry in any index:

    from atdata._protocols import IndexEntry
     
     def process_entry(entry: IndexEntry) -> None:

    AbstractIndex Protocol

    Defines operations for managing schemas and datasets:

    from atdata._protocols import AbstractIndex
     
     def list_all_datasets(index: AbstractIndex) -> None:

    Dataset Operations

    # Insert a dataset
     entry = index.insert_dataset(
         dataset,

    Schema Operations

    # Publish a schema
     schema_ref = index.publish_schema(
         MySample,

    AbstractDataStore Protocol

    Abstracts over different storage backends:

    from atdata._protocols import AbstractDataStore
     
     def write_dataset(store: AbstractDataStore, dataset) -> list[str]:

    Methods

    # Write dataset shards
     urls = store.write_shards(
         dataset,

    DataSource Protocol

    Abstracts over different data source backends for streaming dataset shards:

    from atdata._protocols import DataSource
     
     def load_from_source(source: DataSource) -> None:

    Methods

    # Get list of shard identifiers
     shard_ids = source.shard_list  # ['data-000000.tar', 'data-000001.tar', ...]
     

    Creating Custom Data Sources

    Implement the DataSource protocol for custom backends:

    from typing import Iterator, IO
     from atdata._protocols import DataSource
     

    Using Protocols for Polymorphism

    Write code that works with any backend:

    from atdata._protocols import AbstractIndex, IndexEntry
     from atdata import Dataset
     

    Type Checking

    Protocols are runtime-checkable:

    from atdata._protocols import IndexEntry, AbstractIndex
     
     # Check if object implements protocol
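The mechanics are standard typing.Protocol behavior; a generic, self-contained illustration (`HasShards` and `TarSource` are hypothetical, not atdata types):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class HasShards(Protocol):
    def shard_list(self) -> list: ...

class TarSource:
    # Structural match: no inheritance from the protocol is needed.
    def shard_list(self) -> list:
        return ["data-000000.tar"]

print(isinstance(TarSource(), HasShards))  # True
print(isinstance(object(), HasShards))     # False
```

Note that isinstance checks only the presence of the protocol's members, not their signatures.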

    Complete Example

    import atdata
     from atdata.local import LocalIndex, S3DataStore
     from atdata.atmosphere import AtmosphereClient, AtmosphereIndex

    Troubleshooting & FAQ



    URI Specification



    Examples

    Local Development

    from atdata.local import Index
     
     index = Index()

    Atmosphere (ATProto Federation)

    from atdata.atmosphere import Client
     
     client = Client()

    Atmosphere Publishing



    Setup

    import numpy as np
     from numpy.typing import NDArray
     import atdata

    Define Sample Types

    @atdata.packable
     class ImageSample:
         """A sample containing image data with metadata."""

    Type Introspection

    See what information is available from a PackableSample type:

    from dataclasses import fields, is_dataclass
     
     print(f"Sample type: {ImageSample.__name__}")

    AT URI Parsing

    Understanding AT URIs is essential for working with atmosphere datasets, as they’re how you reference schemas, datasets, and lenses.

    ATProto records are identified by AT URIs:

    uris = [
         "at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz789",
         "at://alice.bsky.social/ac.foundation.dataset.record/my-dataset",

    Authentication

    The AtmosphereClient handles ATProto authentication. When you authenticate, you’re proving ownership of your decentralized identity (DID), which gives you permission to create and modify records in your Personal Data Server (PDS).

    Connect to ATProto:

    client = AtmosphereClient()
     client.login("your.handle.social", "your-app-password")
     

    Publish a Schema

    When you publish a schema to ATProto, it becomes a public, immutable record that others can reference. The schema CID ensures that anyone can verify they’re using exactly the same type definition you published.

    schema_publisher = SchemaPublisher(client)
     schema_uri = schema_publisher.publish(
         ImageSample,

    List Your Schemas

    schema_loader = SchemaLoader(client)
     schemas = schema_loader.list_all(limit=10)
     print(f"Found {len(schemas)} schema(s)")

    Publish a Dataset

    With External URLs

    dataset_publisher = DatasetPublisher(client)
     dataset_uri = dataset_publisher.publish_with_urls(
         urls=["s3://example-bucket/demo-data-{000000..000009}.tar"],

    With PDS

  • Federated replication: Relays can mirror your blobs for availability

    For fully decentralized storage, use PDSBlobStore to store dataset shards directly as ATProto blobs in your PDS:

    # Create store and index with blob storage
     store = PDSBlobStore(client)
     index = AtmosphereIndex(client, data_store=store)

    Use BlobSource to stream directly from PDS blobs:

    # Create source from the blob URLs
     source = store.create_source(entry.data_urls)
     

    With External URLs

    For larger datasets that exceed PDS blob limits, or when you already have data in object storage, you can publish a dataset record that references external URLs. The ATProto record serves as the index entry while the actual data lives elsewhere.

    For larger datasets or when using existing object storage:

    dataset_publisher = DatasetPublisher(client)
     dataset_uri = dataset_publisher.publish_with_urls(
         urls=["s3://example-bucket/demo-data-{000000..000009}.tar"],

    List and Load Datasets

    dataset_loader = DatasetLoader(client)
     datasets = dataset_loader.list_all(limit=10)
     print(f"Found {len(datasets)} dataset(s)")

    Load a Dataset

    # Check storage type
     storage_type = dataset_loader.get_storage_type(str(blob_dataset_uri))
     print(f"Storage type: {storage_type}")

    Complete Publishing Workflow

    Notice how similar this is to the local workflow—the same sample types and patterns, just with a different storage backend.

    This example shows the recommended workflow using PDSBlobStore for fully decentralized storage:

    # 1. Define and create samples
     @atdata.packable
     class FeatureSample:

    Local Workflow



    Setup

    import numpy as np
     from numpy.typing import NDArray
     import atdata

    Define Sample Types

    @atdata.packable
     class TrainingSample:
         """A sample containing features and label for training."""

    LocalDatasetEntry

    CIDs are computed from the entry’s schema reference and data URLs, so the same logical dataset will have the same CID regardless of where it’s stored.
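Content addressing can be sketched with a hash over the canonical inputs (illustrative — atdata's real CID computation and encoding may differ; `toy_cid` is hypothetical):

```python
import hashlib

def toy_cid(schema_ref: str, data_urls: list[str]) -> str:
    # Same schema reference + same data URLs -> same digest,
    # no matter which machine or index computes it.
    canonical = schema_ref + "|" + "|".join(sorted(data_urls))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

a = toy_cid("schema/1.0.0", ["s3://bucket/data-000000.tar"])
b = toy_cid("schema/1.0.0", ["s3://bucket/data-000000.tar"])
print(a == b)  # True
```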

    Create entries with content-addressable CIDs:

    # Create an entry manually
     entry = LocalDatasetEntry(
         _name="my-dataset",

    LocalIndex

    The LocalIndex is your team’s dataset registry. It implements the AbstractIndex protocol, meaning code written against LocalIndex will also work with AtmosphereIndex when you’re ready for federated sharing.

    The index tracks datasets in Redis:

    from redis import Redis
     
     # Connect to Redis

    Schema Management

    Schema publishing is how you ensure type consistency across your team. When you publish a schema, atdata stores the complete type definition (field names, types, metadata) so anyone can reconstruct the Python class from just the schema reference.

    This enables a powerful workflow: share a dataset by sharing its name, and consumers can dynamically reconstruct the sample type without having the original Python code.

# Publish a schema
schema_ref = index.publish_schema(TrainingSample, version="1.0.0")
print(f"Published schema: {schema_ref}")
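The "reconstruct the class from just the schema" idea can be sketched with the standard library. This is a conceptual sketch only; atdata's stored schema format is richer than a name plus field list.

```python
from dataclasses import make_dataclass, fields

# Conceptually, a published schema is a name plus typed fields (illustrative)
stored_schema = {
    "name": "TrainingSample",
    "fields": [("features", list), ("label", int)],
}

# A consumer rebuilds a usable Python class from the stored definition,
# without ever seeing the original source code
Reconstructed = make_dataclass(stored_schema["name"], stored_schema["fields"])

sample = Reconstructed(features=[0.1, 0.2], label=3)
assert [f.name for f in fields(Reconstructed)] == ["features", "label"]
assert sample.label == 3
```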

    S3DataStore

    The data store handles uploading tar shards and creating signed URLs for streaming access.

    For direct S3 operations:

creds = {
    "AWS_ENDPOINT": "http://localhost:9000",
    "AWS_ACCESS_KEY_ID": "minioadmin",

Complete Index Workflow

    The index composition pattern (LocalIndex(data_store=S3DataStore(...))) is deliberate—it separates the concern of “where is metadata?” from “where is data?”, making it easy to swap storage backends.

    Use LocalIndex with S3DataStore to store datasets with S3 storage and Redis indexing:

# 1. Create sample data
samples = [
    TrainingSample(
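The composition pattern behind LocalIndex(data_store=S3DataStore(...)) can be sketched with two tiny stand-in classes (hypothetical names, not the real atdata classes): one object answers "where is the data?", the other answers "where is the metadata?", and they only meet at a constructor argument.

```python
class InMemoryDataStore:
    """Stand-in for S3DataStore: owns the bytes."""
    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> str:
        self.blobs[key] = data
        return f"mem://{key}"  # URL the index will record

class InMemoryIndex:
    """Stand-in for LocalIndex: owns the metadata, delegates storage."""
    def __init__(self, data_store) -> None:
        self.data_store = data_store
        self.entries: dict[str, str] = {}

    def insert(self, name: str, data: bytes) -> str:
        url = self.data_store.put(name, data)
        self.entries[name] = url
        return url

# Swapping the storage backend means swapping only the data_store argument
index = InMemoryIndex(data_store=InMemoryDataStore())
url = index.insert("my-dataset", b"tar bytes")
assert url == "mem://my-dataset"
```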

    Using load_dataset with Index

    The load_dataset() function provides a HuggingFace-style API that abstracts away the details of where data lives. When you pass an index, it can resolve @local/ prefixed paths to the actual data URLs and apply the correct credentials automatically.

    The load_dataset() function supports index lookup:

from atdata import load_dataset

# Load from local index
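The @local/ resolution step can be sketched as a prefix lookup. This is illustrative only; the real resolution also applies the correct credentials.

```python
def resolve_path(path: str, index: dict[str, str]) -> str:
    """Resolve an '@local/<name>' path to its recorded data URL (sketch)."""
    prefix = "@local/"
    if path.startswith(prefix):
        return index[path[len(prefix):]]
    return path  # already a concrete URL, pass it through

registry = {"my-dataset": "s3://bucket/my-dataset-000000.tar"}
resolved = resolve_path("@local/my-dataset", registry)
assert resolved == "s3://bucket/my-dataset-000000.tar"
```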

    Next Steps



    Promotion Workflow



    Setup

import numpy as np
from numpy.typing import NDArray
import atdata

    Prepare a Local Dataset

    First, set up a dataset in local storage:

# 1. Define sample type
@atdata.packable
class ExperimentSample:

    Basic Promotion

    Promote the dataset to ATProto:

# Connect to atmosphere
client = AtmosphereClient()
client.login("myhandle.bsky.social", "app-password")

    Promotion with Metadata

    Add description, tags, and license:

at_uri = promote_to_atmosphere(
    local_entry,
    local_index,

    Schema Deduplication

    The promotion workflow automatically checks for existing schemas:

from atdata.promote import _find_existing_schema

# Check if schema already exists

print("No existing schema found, will publish new one")

    When you promote multiple datasets with the same sample type:

# First promotion: publishes schema
uri1 = promote_to_atmosphere(entry1, local_index, client)
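The deduplication behavior can be sketched as a publish routine that first checks a digest of the schema definition. This is a hypothetical helper that mirrors the concept, not what `_find_existing_schema` actually does internally.

```python
import hashlib

published: dict[str, str] = {}  # schema digest -> schema reference

def publish_schema_once(name: str, field_spec: str) -> str:
    """Publish a schema only if an identical one is not already indexed."""
    digest = hashlib.sha256(f"{name}:{field_spec}".encode()).hexdigest()
    if digest in published:
        return published[digest]  # reuse the existing reference
    ref = f"schema://{name}#{digest[:8]}"
    published[digest] = ref
    return ref

# First promotion publishes; the second finds and reuses the same schema
ref1 = publish_schema_once("ExperimentSample", "features:ndarray,label:int")
ref2 = publish_schema_once("ExperimentSample", "features:ndarray,label:int")
assert ref1 == ref2
```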

    Data Migration Options


    By default, promotion keeps the original data URLs:

# Data stays in original S3 location
at_uri = promote_to_atmosphere(local_entry, local_index, client)

    To copy data to a different storage location:

from atdata.local import S3DataStore

# Create new data store

    Verify on Atmosphere

    After promotion, verify the dataset is accessible:

from atdata.atmosphere import AtmosphereIndex

atm_index = AtmosphereIndex(client)

    Error Handling

try:
    at_uri = promote_to_atmosphere(local_entry, local_index, client)
except KeyError as e:

    Complete Workflow

# Complete local-to-atmosphere workflow
import numpy as np
from numpy.typing import NDArray

    Next Steps



    Quick Start



Define a Sample Type

  • Round-trip fidelity: Data survives serialization without loss

Use the @packable decorator to create a typed sample:

import numpy as np
from numpy.typing import NDArray
import atdata

    Create Sample Instances

# Create a single sample
sample = ImageSample(
    image=np.random.rand(224, 224, 3).astype(np.float32),

    Write a Dataset

    The as_wds property on your sample provides the dictionary format WebDataset expects:

    Use WebDataset’s TarWriter to create dataset files:

import webdataset as wds

# Create 100 samples
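The dictionary shape that as_wds produces can be sketched by hand. The exact key layout below (a "__key__" entry plus extension-suffixed field entries) follows WebDataset's sample convention; the specific field names and encodings are assumptions for illustration.

```python
import io
import numpy as np

def to_wds_dict(key: str, image: np.ndarray, label: int) -> dict:
    """Sketch of a WebDataset-style sample dict: '__key__' names the
    sample, and each field becomes an extension-suffixed bytes entry."""
    buf = io.BytesIO()
    np.save(buf, image)  # serialize the array field
    return {
        "__key__": key,
        "image.npy": buf.getvalue(),
        "label.cls": str(label).encode(),
    }

d = to_wds_dict("sample-000000", np.zeros((2, 2), dtype=np.float32), 7)
assert set(d) == {"__key__", "image.npy", "label.cls"}
```

In practice you would not build this dict yourself: sample.as_wds produces it, and TarWriter consumes it directly.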

    Load and Iterate

    This eliminates boilerplate collation code and works automatically with any PackableSample type.

    Create a typed Dataset and iterate with batching:

# Load dataset with type
dataset = atdata.Dataset[ImageSample]("my-dataset-000000.tar")
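The collation boilerplate that typed batching removes looks roughly like this sketch: gather per-sample arrays field by field and stack them into batch arrays.

```python
import numpy as np

def collate(samples: list[dict]) -> dict:
    """Stack per-sample arrays into batch arrays, one entry per field.
    A sketch of what a typed SampleBatch does for you automatically."""
    keys = samples[0].keys()
    return {k: np.stack([s[k] for s in samples]) for k in keys}

samples = [{"image": np.zeros((4, 4)), "label": np.array(i)} for i in range(32)]
batch = collate(samples)
assert batch["image"].shape == (32, 4, 4)
assert batch["label"].shape == (32,)
```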

    Shuffled Iteration

    This approach balances randomness with streaming efficiency—you get well-shuffled data without needing random access to the entire dataset.

    For training, use shuffled iteration:

for batch in dataset.shuffled(batch_size=32):
    # Samples are shuffled at shard and sample level
    images = batch.image
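The sample-level half of that shuffle can be sketched as a buffered streaming shuffle, the standard trick for shuffling without random access to the whole dataset (illustrative code, not atdata's implementation):

```python
import random

def buffered_shuffle(stream, buffer_size: int, seed: int = 0):
    """Streaming shuffle: keep a fixed-size buffer and emit a random
    buffered element as each new item arrives. Memory stays bounded."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            yield buf.pop(rng.randrange(len(buf)))
    while buf:  # drain the remaining buffered items
        yield buf.pop(rng.randrange(len(buf)))

out = list(buffered_shuffle(range(100), buffer_size=10))
assert sorted(out) == list(range(100))  # a permutation of the input
```

A larger buffer gives better randomness at the cost of memory; shard-level shuffling on top of this compensates for the buffer's locality.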

Use Lenses
  • Derived features: Compute fields on-the-fly during iteration
  • View datasets through different schemas:

# Define a simplified view type
@atdata.packable
class SimplifiedSample:
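The core of a lens view can be sketched as a projection from a full sample onto the fields a simplified schema declares (plus optionally a derived field computed on the fly). This is a conceptual sketch; atdata's lens transformations operate on typed samples, not raw dicts.

```python
def project(sample: dict, view_fields: list[str]) -> dict:
    """A minimal 'lens': view a sample through a simplified schema by
    keeping only the fields the view type declares."""
    return {k: sample[k] for k in view_fields}

full = {"image": "pixels", "label": 3, "metadata": {"source": "cam0"}}
simplified = project(full, ["image", "label"])
assert simplified == {"image": "pixels", "label": 3}
```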

    Next Steps

Returns (type[PackableSample]): A new dataclass that inherits from PackableSample with the same name and annotations as the original class. The class satisfies the Packable protocol and can be used with Type[Packable] signatures.