This repository was archived by the owner on Sep 27, 2022. It is now read-only.

On detecting deleted files across versions #3

@marcolarosa

Conversation moved from OCFL/spec#525

In OCFL/spec#522 I talked about a way we might use S3 as a backend. One point I made is that creating a new version would require pulling the entire object down from S3, which would be terrible for very large objects (TB-sized objects, but even a few GB would make updates slow).

In his reply @pwinckles stated that the most recent inventory is likely the only thing needed.

This ticket is about thinking through how to detect that a file has been deleted in the next version without needing the whole object (which I can't see a way to do, hence why I'm asking for help!).

Consider the following:

  v1                                     v2
  |- File A - hash X                     |- File A - hash X

No change; do not create a new version.

  v1                                     v2
  |- File A - hash X                     |- File A - hash X
                                         |- File B - hash Y

New file; create a new version referencing File A -> v1 and File B -> v2

  v1                                     v2
  |- File A - hash X                     |- File A - hash Z
  |- File B - hash Y                     |- File B - hash Y
    
File changed (File A); create new version referencing File B -> v1, File A -> v2

Our library works by digesting the path (walking the object tree and producing pairs of files + hashes) and then comparing the new tree to the existing tree. In all of the cases above we get the expected behaviour. However, if the whole dataset weren't available when creating the new version, then changing a single file would result in every other file being removed from the next version.

  v1                                     v2
  |- File A - hash X                     |- File A - hash Z
  |- File B - hash Y                     
    
File changed (File A); create new version referencing File A -> v2, but File B ends up removed from the new version.
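
The comparison described above can be sketched roughly as follows. This is only an illustration, not the library's actual code: the `Tree` mapping, the `TreeDiff` type and the `diff_trees` function are hypothetical names, and the hashes are the placeholder letters from the diagrams.

```python
from typing import Dict, NamedTuple, Set

# Hypothetical representation: logical path -> content hash.
Tree = Dict[str, str]


class TreeDiff(NamedTuple):
    added: Set[str]
    changed: Set[str]
    removed: Set[str]
    unchanged: Set[str]


def diff_trees(previous: Tree, current: Tree) -> TreeDiff:
    """Compare two digested trees, treating any missing path as a deletion."""
    prev_paths, cur_paths = set(previous), set(current)
    common = prev_paths & cur_paths
    changed = {p for p in common if previous[p] != current[p]}
    return TreeDiff(
        added=cur_paths - prev_paths,
        changed=changed,
        removed=prev_paths - cur_paths,
        unchanged=common - changed,
    )


v1 = {"File A": "X", "File B": "Y"}

# Whole object available locally: only File A is reported as changed.
full = diff_trees(v1, {"File A": "Z", "File B": "Y"})

# Only the changed file available locally: File B looks deleted.
partial = diff_trees(v1, {"File A": "Z"})
```

With the full tree, `full.removed` is empty; with the partial tree, `partial.removed` contains File B even though nobody asked for it to be deleted.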

So, if I've thought this through correctly, comparing against the latest inventory rather than a full digest means we will pick up file changes and file additions, but we won't be able to remove a file from one version to the next: with only a partial set of files to hand we can't tell a deleted file from one that simply wasn't supplied, so we would always need to carry forward everything that is referenced in the latest version.
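
To make that limitation concrete, here is a minimal sketch of the inventory-only approach as I understand it. It is an assumption on my part, not anything from the spec or our library: `overlay_on_inventory` is a hypothetical helper, and the state is simplified to a path -> hash mapping (a real OCFL inventory maps digests to lists of paths).

```python
from typing import Dict

# Simplified, hypothetical view of a version's state: logical path -> hash.
State = Dict[str, str]


def overlay_on_inventory(latest_state: State, uploaded: State) -> State:
    """Build the next version's state from the latest inventory plus
    whatever files were supplied. Paths that were not supplied are
    carried forward unchanged, so nothing can ever be dropped."""
    next_state = dict(latest_state)  # start from everything in the latest version
    next_state.update(uploaded)      # changed and new files win
    return next_state


latest = {"File A": "X", "File B": "Y"}

# A changed file is picked up, and File B is retained.
changed_only = overlay_on_inventory(latest, {"File A": "Z"})

# A new file is picked up as well.
with_addition = overlay_on_inventory(latest, {"File C": "W"})
```

Here `changed_only` still contains File B, which is the desired result for a partial update, but there is no input to this function that would produce a next state without File B.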

Is there another way to detect file deletions across versions without needing all of the object data up to that point?
