On detecting deleted files across versions #3
Conversation moved from OCFL/spec#525
In OCFL/spec#522 I talked about a way we might use S3 as a backend. One point I made is that creating a new version would require pulling the entire object down from S3, which would be terrible for very large objects (TB-sized objects, though even a few GB would make updates slow).
In his reply, @pwinckles stated that the most recent inventory is likely the only thing needed.
This ticket is about thinking through how to detect that a file has been deleted in the next version without needing the whole object (which I can't see how to do, hence why I'm asking for help!).
Consider the following:
v1                      v2
|- File A - hash X      |- File A - hash X
No change; do not create new version.
v1                      v2
|- File A - hash X      |- File A - hash X
                        |- File B - hash Y
New file; create new version referencing File A -> v1 and File B -> v2
v1                      v2
|- File A - hash X      |- File A - hash Z
|- File B - hash Y      |- File B - hash Y
File changed (File A); create new version referencing File B -> v1, File A -> v2
Our library works by digesting the path (walking the object tree and producing pairs of files + hashes) and then comparing the new tree to the existing tree. In all of the cases above we get the expected behaviour. However, if we didn't have the whole dataset available when creating the new version, then changing one file would result in all of the other data being removed from the next version.
v1                      v2
|- File A - hash X      |- File A - hash Z
|- File B - hash Y
File changed (File A); create new version referencing File A -> v2, but File B is missing from the new tree and so ends up removed from the new version.
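To make the cases above concrete, here is a minimal Python sketch of a whole-tree comparison. It is not our library's actual code: `diff_trees` and the path -> hash dictionaries are hypothetical names used only for illustration, and the sketch assumes the new tree is meant to be the complete content of the next version.

```python
def diff_trees(old, new):
    """Compare two {logical_path: digest} maps.

    Assumption: `new` is the COMPLETE content of the next version, so any
    path that is in `old` but not in `new` is treated as deleted.
    """
    added = sorted(p for p in new if p not in old)
    changed = sorted(p for p in new if p in old and new[p] != old[p])
    deleted = sorted(p for p in old if p not in new)
    return {"added": added, "changed": changed, "deleted": deleted}


v1 = {"File A": "X", "File B": "Y"}

# Whole dataset available: only File A is reported as changed.
full_v2 = {"File A": "Z", "File B": "Y"}
full_diff = diff_trees(v1, full_v2)

# Only the changed file available: File B looks like a deletion.
partial_v2 = {"File A": "Z"}
partial_diff = diff_trees(v1, partial_v2)

print(full_diff)
print(partial_diff)
```

With the full tree the diff reports only File A as changed; with the partial tree the same function also reports File B as deleted, which is the unwanted removal shown in the last case.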
So, if I've thought this through correctly, comparing against the latest inventory rather than a full digest means we will pick up file changes and file additions, but we won't be able to remove a file from one version to the next, because we would always need to carry forward everything that is referenced in the latest version.
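A rough Python sketch of that inventory-only comparison, under stated assumptions: `next_state_from_inventory` is a hypothetical helper, and the version state is simplified to a path -> digest map (the real OCFL `state` block maps each digest to a list of logical paths). Only the files supplied locally are digested; everything else is carried forward from the latest version.

```python
def next_state_from_inventory(latest_state, supplied):
    """Build the next version's state from the latest inventory state.

    latest_state: {logical_path: digest} for the latest version
                  (simplified; hypothetical shape for illustration).
    supplied:     {logical_path: digest} for the files available locally.

    A path missing from `supplied` is carried forward unchanged, so this
    approach has no way to express a deletion.
    """
    next_state = dict(latest_state)   # carry forward everything referenced
    next_state.update(supplied)       # apply additions and changes
    added = sorted(p for p in supplied if p not in latest_state)
    changed = sorted(
        p for p in supplied
        if p in latest_state and supplied[p] != latest_state[p]
    )
    return next_state, added, changed


latest = {"File A": "X", "File B": "Y"}
supplied = {"File A": "Z"}  # only the changed file was pulled down

state, added, changed = next_state_from_inventory(latest, supplied)
print(state, added, changed)
```

Here File A is picked up as changed and File B is kept, but an absent path is indistinguishable from a deleted one, so File B can never drop out of the next version.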
Is there another way to detect file deletions across versions without needing all of the object data up to that point?