S3DataStore shards published without endpoint in ATProto records #81

@maxine-at-forecast

Summary

When using Index.write_samples() with an S3DataStore and an atmosphere target (@handle/name), the resulting ATProto record uses StorageHttp with raw s3://bucket/path URLs instead of StorageS3 with the bucket, keys, and endpoint.

This means consumers of the published dataset record have no way to resolve the shard URLs — they lack the S3 endpoint needed to connect.

Reproduction

import atdata
from atdata.stores._s3 import S3DataStore
from atdata.atmosphere.client import Atmosphere

store = S3DataStore(
    credentials={
        "AWS_ENDPOINT": "https://account.r2.cloudflarestorage.com",
        "AWS_ACCESS_KEY_ID": "...",
        "AWS_SECRET_ACCESS_KEY": "...",
    },
    bucket="my-bucket",
)
atmo = Atmosphere.login("handle", "password", base_url="https://my-pds.example.com")
index = atdata.Index(atmosphere=atmo, data_store=store)

entry = index.write_samples(
    samples,  # iterable of samples prepared elsewhere
    name="@handle/my-dataset",
    data_store=store,
)
# entry.data_urls contains: ["s3://my-bucket/my-dataset/data--uuid--000000.tar"]
# ATProto record uses StorageHttp with these as URLs — no endpoint info

Expected behavior

When data_store is an S3DataStore, the published ATProto record should use StorageS3 with:

  • bucket: from the S3DataStore
  • endpoint: from the S3DataStore credentials
  • shards[].key: the key portion of each shard URL

This is already supported by DatasetPublisher.publish_with_s3() — it just isn't wired up in the Index.write_samples → insert_dataset path.
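Concretely, wiring this up means splitting each shard's s3:// URL into its bucket and key before building the StorageS3 record. A minimal sketch of that split, as pure Python with no atdata dependency (`split_s3_url` is a hypothetical helper, not part of the library):

```python
from urllib.parse import urlparse

def split_s3_url(url: str) -> tuple[str, str]:
    """Split an s3://bucket/key URL into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        raise ValueError(f"not an s3 URL: {url}")
    # netloc is the bucket; path carries a leading "/" we drop to get the key
    return parsed.netloc, parsed.path.lstrip("/")

bucket, key = split_s3_url("s3://my-bucket/my-dataset/data--uuid--000000.tar")
# bucket == "my-bucket", key == "my-dataset/data--uuid--000000.tar"
```

With the keys extracted this way, the endpoint and bucket already held by the S3DataStore would supply the remaining StorageS3 fields.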

Workaround

Manually call S3DataStore.write_shards(), then DatasetPublisher.publish_with_s3() with explicit bucket, keys, and endpoint.
