Description
In Zenodo we have a use case with two layers of versioning. A user can publish a dataset on Zenodo, which gets a DOI. The user can then publish a new version of the dataset, which gets a new DOI. This way a DOI always points to a locked set of digital files. Occasionally, however, we need to change the files of an already published dataset with a DOI (e.g. a user accidentally included personal data in the dataset and discovered it 2 months later). Essentially this means we have two layers of versioning in Zenodo, which I'll call
- Versioning (each version gets a new DOI - at the repository level each version is a separate record)
- Revisions (edits to a single version - at the repository level this is a single record).
In Zenodo's case, our need for deduplication is essentially between versions, because that's where a user may only add 1GB to a 100TB dataset.
The way we have thought about mapping Zenodo to OCFL is that each DOI is essentially an OCFL object. Because OCFL only supports deduplication within an OCFL object, not between OCFL objects, and does not allow symlinks, we cannot do this deduplication.
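To illustrate why deduplication stops at the object boundary, here is a minimal sketch of how an OCFL inventory's manifest (digest to content paths) lets a new version reuse content already stored in the same object. The function name `plan_new_version`, the short placeholder digests, and the simplified content-path naming are assumptions for illustration only, not part of OCFL or Zenodo.

```python
def plan_new_version(manifest, version, files):
    """Plan a new version of ONE OCFL object.

    manifest: digest -> list of content paths already stored in this object
    version:  name of the version directory being created, e.g. "v2"
    files:    logical path -> digest for every file in the new version state

    Returns (updated manifest, version state, content paths that must be written).
    """
    new_manifest = {digest: list(paths) for digest, paths in manifest.items()}
    state = {}
    to_write = []
    for logical_path, digest in sorted(files.items()):
        state.setdefault(digest, []).append(logical_path)
        if digest not in new_manifest:
            # Content is new to this object, so it must be stored physically.
            content_path = f"{version}/content/{logical_path}"
            new_manifest[digest] = [content_path]
            to_write.append(content_path)
    return new_manifest, state, to_write


# Placeholder digests; a real inventory would use e.g. sha512 hex strings.
existing = {
    "aaa": ["v1/content/data-01.zip"],
    "bbb": ["v1/content/mishap.zip"],
}
new_files = {"data-01.zip": "aaa", "data-02.zip": "ccc"}

# Same object: data-01.zip is already in the manifest, only data-02.zip is written.
_, _, same_object_writes = plan_new_version(existing, "v2", new_files)

# Separate object: its manifest starts empty, so everything is written again.
_, _, new_object_writes = plan_new_version({}, "v1", new_files)

print(same_object_writes)
print(new_object_writes)
```

Because the manifest is scoped to a single object, a second object has no way to point at content held by the first, which is the limitation described above.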
Example
Imagine these actions:
- Publish first version 10.5281/zenodo.1234 with two very large (let's just say 100TB to exaggerate) files: data-01.zip and mishap.zip
- Publish new version 10.5281/zenodo.4321 with one new file: data-02.zip (files are thus: data-01.zip and data-02.zip).
- Remove mishap.zip from 10.5281/zenodo.1234
The OCFL objects would be:
[10.5281/zenodo.1234]
├── 0=ocfl_object_1.0
├── inventory.json
├── inventory.json.sha512
├── v1
│   ├── inventory.json
│   ├── inventory.json.sha512
│   └── content
│       ├── data-01.zip
│       └── mishap.zip
└── v2
    ├── inventory.json
    ├── inventory.json.sha512
    └── content
[10.5281/zenodo.4321]
├── 0=ocfl_object_1.0
├── inventory.json
├── inventory.json.sha512
└── v1
    ├── inventory.json
    ├── inventory.json.sha512
    └── content
        ├── data-01.zip (duplicated 100TB of data!!!)
        └── data-02.zip
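As a rough way to quantify the duplication in the layout above, the following sketch tallies bytes stored more than once across the two objects. The digest labels are placeholders, and the 1GB size for data-02.zip is an assumption borrowed from the "add 1GB to a 100TB dataset" scenario. The 100TB sizes follow the exaggerated example.

```python
from collections import defaultdict

GB = 10**9
TB = 10**12

# Hypothetical content listings: object id -> {content path: (digest, size in bytes)}
objects = {
    "10.5281/zenodo.1234": {
        "v1/content/data-01.zip": ("digest-data-01", 100 * TB),
        "v1/content/mishap.zip": ("digest-mishap", 100 * TB),
    },
    "10.5281/zenodo.4321": {
        "v1/content/data-01.zip": ("digest-data-01", 100 * TB),
        "v1/content/data-02.zip": ("digest-data-02", 1 * GB),  # assumed size
    },
}


def duplicated_bytes(objects):
    """Sum the size of every stored copy beyond the first, grouped by digest."""
    copies = defaultdict(list)
    for contents in objects.values():
        for digest, size in contents.values():
            copies[digest].append(size)
    return sum(sum(sizes[1:]) for sizes in copies.values())


print(duplicated_bytes(objects) // TB, "TB stored twice")
```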
What I would like is to avoid duplicating data-01.zip in the 10.5281/zenodo.4321 OCFL object.
Is there a solution for this in OCFL, or a different way to construct our OCFL objects that could support this?