Summary
Build an archival service that watches for new dataset records, downloads and mirrors the actual data shards, and publishes mirror records with provenance tracking. Prevents link rot for datasets stored on ephemeral or unreliable storage.
Architecture
Runs alongside the AppView (separate process or integrated worker):
- Watches for new dataset records via the indexed database
- Downloads shard data from the storage location (HTTP, S3, or PDS blobs)
- Stores mirrored data in a centrally-hosted archive (object storage)
- Publishes ac.foundation.dataset.mirror records linking originals to mirrors
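As a rough illustration of the last step, a mirror record might be assembled as below. This is only a sketch: the ac.foundation.dataset.mirror lexicon is not defined yet, so every field name here (`original`, `mirror`, `sha256`, `size`, `isMirror`, `mirroredAt`) and the `build_mirror_record` helper are assumptions, not an agreed schema.

```python
import hashlib
from datetime import datetime, timezone

# Collection name taken from the issue text; the record shape below is hypothetical.
MIRROR_COLLECTION = "ac.foundation.dataset.mirror"


def build_mirror_record(original_uri, original_url, mirror_url, shard_bytes, now=None):
    """Build a mirror record dict linking an original shard to its archived copy.

    All field names are assumptions for illustration only.
    """
    now = now or datetime.now(timezone.utc)
    return {
        "$type": MIRROR_COLLECTION,
        # Where the data originally lived (record URI plus shard location).
        "original": {"recordUri": original_uri, "url": original_url},
        # Where the archived copy lives.
        "mirror": {"url": mirror_url},
        # Content hash and size let consumers verify the copy matches the original.
        "sha256": hashlib.sha256(shard_bytes).hexdigest(),
        "size": len(shard_bytes),
        # Explicit flag so the record is never mistaken for an original.
        "isMirror": True,
        "mirroredAt": now.isoformat(),
    }
```

Carrying a content hash in the record is one way to make the provenance link checkable, since a consumer can re-hash the mirrored shard and compare.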
Design considerations
- Storage costs scale with dataset size, so a policy is needed for what to mirror (e.g. only datasets under a size threshold, or only from verified publishers)
- Provenance tracking: mirrors should clearly indicate they're copies, not originals
- May require a new ac.foundation.dataset.mirror lexicon (see Design and implement atdata-app: AppView service for network-wide dataset discovery and archival atdata#33)
- Graceful handling of unreachable storage (retry, mark as unavailable)
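The policy and retry considerations above could be sketched roughly as follows. Everything here is an assumption for illustration: `MirrorPolicy`, `fetch_with_retry`, the 10 GiB default threshold, and the exponential backoff are hypothetical choices, not decisions made in this issue.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class MirrorPolicy:
    """Hypothetical mirroring policy: a size cap plus an optional publisher check."""

    max_bytes: int = 10 * 1024**3  # assumed default threshold (10 GiB)
    verified_only: bool = False

    def should_mirror(self, size_bytes: int, publisher_verified: bool) -> bool:
        if self.verified_only and not publisher_verified:
            return False
        return size_bytes <= self.max_bytes


def fetch_with_retry(
    fetch: Callable[[], bytes],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Call fetch() up to `attempts` times with exponential backoff.

    Returns None when every attempt fails, so the caller can mark the
    shard as unavailable instead of crashing the worker.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:  # covers ConnectionError, TimeoutError, urllib.error.URLError
            if attempt < attempts - 1:
                sleep(base_delay * 2**attempt)
    return None
```

A caller would check `should_mirror` before downloading and treat a `None` result from `fetch_with_retry` as the signal to record the shard as unavailable.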
Ref: Design and implement atdata-app: AppView service for network-wide dataset discovery and archival atdata#33 (non-MVP: "archival sidecar with shard mirroring")