OCFL Object Forking #44

@lnielsen

In Zenodo we have a use case with two layers of versioning. A user can publish a dataset on Zenodo, which gets a DOI. The user can then publish a new version of the dataset, which gets a new DOI. This way a DOI always points to a locked set of digital files. Occasionally, however, we need to change the files of an already published dataset with a DOI (e.g. a user accidentally included personal data in the dataset and discovered it two months later). Essentially this means we have two layers of versioning in Zenodo, which I'll call:

  • Versioning (each version gets a new DOI; at the repository level each version is a separate record)
  • Revisions (edits to a single version; at the repository level this is a single record).

In the Zenodo case, our need for deduplication is essentially between versions, because that's where a user may add only 1GB to a 100TB dataset.

The way we have thought about mapping Zenodo to OCFL is that each DOI is essentially an OCFL object. Because an OCFL object only supports deduplication within itself, not between OCFL objects, and OCFL does not allow symlinks, we cannot do this deduplication.
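
To make the "deduplication within an OCFL object" point concrete, here is a minimal Python sketch of how an inventory's `manifest` and per-version `state` blocks let a later version reuse bytes stored by an earlier one. The helper name `build_inventory` and the short digests (`"aaa"`, `"bbb"`) are hypothetical placeholders, and the sketch leaves out fields a real inventory requires (such as `type`, `digestAlgorithm` and each version's `created`).

```python
def build_inventory(object_id, versions):
    """Build a simplified OCFL-style inventory (hypothetical helper).

    `versions` is a list of dicts, one per version, mapping a logical
    file path to a content digest. Digests here are placeholders.
    """
    manifest = {}       # digest -> content paths stored inside this object
    inv_versions = {}   # version name -> {"state": digest -> logical paths}
    for number, files in enumerate(versions, start=1):
        vname = f"v{number}"
        state = {}
        for logical_path, digest in sorted(files.items()):
            # Content is stored only the first time a digest is seen
            # in this object; later versions just reference it.
            if digest not in manifest:
                manifest[digest] = [f"{vname}/content/{logical_path}"]
            state.setdefault(digest, []).append(logical_path)
        inv_versions[vname] = {"state": state}
    return {
        "id": object_id,
        "head": f"v{len(versions)}",
        "manifest": manifest,
        "versions": inv_versions,
    }


# One object with two versions: the second one drops mishap.zip.
inventory = build_inventory(
    "10.5281/zenodo.1234",
    [
        {"data-01.zip": "aaa", "mishap.zip": "bbb"},
        {"data-01.zip": "aaa"},
    ],
)
```

In this sketch the second version's state points at `v1/content/data-01.zip` through the shared digest, so nothing is copied. The manifest is scoped to one object, which is why the same trick does not carry over to a second OCFL object.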

Example

Imagine these actions:

  1. Publish first version 10.5281/zenodo.1234 with two very large (let's just say 100TB to exaggerate) files: data-01.zip and mishap.zip
  2. Publish new version 10.5281/zenodo.4321 with one new file: data-02.zip (the files are thus data-01.zip and data-02.zip).
  3. Remove mishap.zip from 10.5281/zenodo.1234

The OCFL objects would be:

[10.5281/zenodo.1234]
    ├── 0=ocfl_object_1.0
    ├── inventory.json
    ├── inventory.json.sha512
    ├── v1
    │   ├── inventory.json
    │   ├── inventory.json.sha512
    │   └── content
    │       ├── data-01.zip
    │       └── mishap.zip
    └── v2
        ├── inventory.json
        ├── inventory.json.sha512
        └── content


[10.5281/zenodo.4321]
    ├── 0=ocfl_object_1.0
    ├── inventory.json
    ├── inventory.json.sha512
    └── v1
        ├── inventory.json
        ├── inventory.json.sha512
        └── content
            ├── data-01.zip (duplicated 100TB of data!!!)
            └── data-02.zip
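
The duplication in the layout above can be spotted by comparing the two objects' manifests. The following is a rough Python sketch, assuming each manifest is a plain dict from digest to content paths; `shared_digests` is a hypothetical helper and the digests are placeholders, not real checksums.

```python
def shared_digests(manifest_a, manifest_b):
    """Return digests whose content is stored in both objects.

    Each manifest maps a digest to the content paths inside one object.
    The result maps each shared digest to the pair of path lists.
    """
    common = manifest_a.keys() & manifest_b.keys()
    return {d: (manifest_a[d], manifest_b[d]) for d in sorted(common)}


# Placeholder manifests for the two objects shown above.
manifest_1234 = {
    "aaa": ["v1/content/data-01.zip"],
    "bbb": ["v1/content/mishap.zip"],
}
manifest_4321 = {
    "aaa": ["v1/content/data-01.zip"],
    "ccc": ["v1/content/data-02.zip"],
}

duplicates = shared_digests(manifest_1234, manifest_4321)
```

Here `duplicates` contains only the digest of data-01.zip, which is the content stored once per object, i.e. twice on disk.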

What I would like is to avoid duplicating data-01.zip in the 10.5281/zenodo.4321 OCFL object.

Is there a solution for this in OCFL, or a different way to construct our OCFL objects that could support this?

Labels: Component: Specification; Confirmed: In-scope (use case will be included in the upcoming version of the spec or implementation notes)
