Task: Support content-identified datasets #203

@orlandohohmeier

Summary

Introduce support for globally unique dataset identification via a content-derived ID.

  • Define a deterministic content-based ID for datasets (e.g., hash of dataset manifest or slice list).
  • Implement a way to resolve dataset names into these IDs.
  • Ensure the training and scheduling flow can work entirely off this content ID to enable reproducibility and deduplication.
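The first two points could be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed design: the choice of SHA-256 over a canonical JSON manifest, the assumption that slice order is part of a dataset's identity, and the names `dataset_id` and `DatasetRegistry` are all hypothetical.

```python
import hashlib
import json


def dataset_id(slice_ids: list[str]) -> str:
    """Derive a deterministic ID from an ordered list of content-addressed slice IDs.

    Assumption: slice order is significant, so reordering yields a different ID.
    """
    # Canonical encoding: compact separators so equal inputs give identical bytes.
    manifest = json.dumps({"slices": slice_ids}, separators=(",", ":"), sort_keys=True)
    return hashlib.sha256(manifest.encode("utf-8")).hexdigest()


class DatasetRegistry:
    """Hypothetical in-memory mapping from mutable names to content IDs."""

    def __init__(self) -> None:
        self._names: dict[str, str] = {}

    def register(self, name: str, slice_ids: list[str]) -> str:
        # Re-registering a name simply points it at the new content ID.
        ds_id = dataset_id(slice_ids)
        self._names[name] = ds_id
        return ds_id

    def resolve(self, name: str) -> str:
        # Raises KeyError for unknown names.
        return self._names[name]


registry = DatasetRegistry()
id_a = registry.register("train-v1", ["slice-aaa", "slice-bbb"])
id_b = registry.register("train-copy", ["slice-aaa", "slice-bbb"])
```

Here two names with identical slice lists resolve to the same ID, which is what would let scheduling and caching deduplicate on content rather than on name.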

Background

Dataset names currently serve as identifiers, but a name is neither unique nor tied to the dataset's contents. Content-addressed slices are already in place; extending content addressing to the dataset level would give stronger integrity, reproducibility, and cacheability guarantees.
