## Summary

When using `Index.write_samples()` with an `S3DataStore` and an atmosphere target (`@handle/name`), the resulting ATProto record uses `StorageHttp` with raw `s3://bucket/path` URLs instead of `StorageS3` with the bucket, keys, and endpoint. This means consumers of the published dataset record have no way to resolve the shard URLs — they lack the S3 endpoint needed to connect.
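To illustrate the gap: parsing one of the published `s3://` URLs recovers the bucket and key, but nothing in the URL identifies the endpoint. A minimal standalone sketch (using the example shard URL from the reproduction below):

```python
from urllib.parse import urlsplit

# A shard URL of the form stored in the published record
url = "s3://my-bucket/my-dataset/data--uuid--000000.tar"

parts = urlsplit(url)
bucket = parts.netloc          # "my-bucket"
key = parts.path.lstrip("/")   # "my-dataset/data--uuid--000000.tar"

# The bucket and key are recoverable, but nothing in the URL says which
# S3-compatible endpoint (AWS, R2, MinIO, ...) actually hosts the bucket,
# so a consumer of the record cannot fetch the shard.
print(bucket, key)
```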
## Reproduction

```python
import atdata
from atdata.stores._s3 import S3DataStore
from atdata.atmosphere.client import Atmosphere

store = S3DataStore(
    credentials={
        "AWS_ENDPOINT": "https://account.r2.cloudflarestorage.com",
        "AWS_ACCESS_KEY_ID": "...",
        "AWS_SECRET_ACCESS_KEY": "...",
    },
    bucket="my-bucket",
)

atmo = Atmosphere.login("handle", "password", base_url="https://my-pds.example.com")

index = atdata.Index(atmosphere=atmo, data_store=store)
entry = index.write_samples(
    samples,  # an iterable of samples, defined elsewhere
    name="@handle/my-dataset",
    data_store=store,
)

# entry.data_urls contains: ["s3://my-bucket/my-dataset/data--uuid--000000.tar"]
# ATProto record uses StorageHttp with these as URLs — no endpoint info
```
## Expected behavior

When `data_store` is an `S3DataStore`, the published ATProto record should use `StorageS3` with:

- `bucket`: from the `S3DataStore`
- `endpoint`: from the `S3DataStore` credentials
- `shards[].key`: the key portion of each shard URL

This is already supported by `DatasetPublisher.publish_with_s3()` — it just isn't wired up in the `Index.write_samples` → `insert_dataset` path.
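One possible shape for that wiring, sketched with stub types rather than the real atdata classes (the record field names, the stub's attributes, and the `storage_record` helper are all assumptions for illustration, not the actual internals):

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

# Stub standing in for atdata's S3DataStore; attribute names are assumptions.
@dataclass
class S3StoreStub:
    bucket: str
    endpoint: str

def storage_record(data_store, shard_urls):
    """Choose StorageS3 over StorageHttp when the store is S3-backed."""
    if isinstance(data_store, S3StoreStub):
        return {
            "$type": "StorageS3",
            "bucket": data_store.bucket,
            "endpoint": data_store.endpoint,
            # Keep only the key portion of each s3://bucket/key URL.
            "shards": [
                {"key": urlsplit(u).path.lstrip("/")} for u in shard_urls
            ],
        }
    # Fallback: the current behavior, which loses the endpoint for s3:// URLs.
    return {"$type": "StorageHttp", "urls": list(shard_urls)}

store = S3StoreStub(
    bucket="my-bucket",
    endpoint="https://account.r2.cloudflarestorage.com",
)
record = storage_record(
    store, ["s3://my-bucket/my-dataset/data--uuid--000000.tar"]
)
```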
## Workaround

Manually call `S3DataStore.write_shards()`, then `DatasetPublisher.publish_with_s3()` with explicit bucket/keys/endpoint.