Description
I work with a group that uses the Python BagIt library to package downloaded software applications. On some occasions, we've been unable to use the resulting package because of odd behaviors with soft links in Linux and macOS environments. (Working links became copies of the referenced file; broken links caused the script to fail when it tried to hash them.) Discussing this with @edsu a bit, he helped me realize that the BagIt specification does not cover soft links.
Soft links are worth recording, as their ability to reference arbitrary paths can provide important context when capturing some file systems. (E.g. Apache web server configurations once recommended, and possibly still recommend, soft links to configuration files. I recently observed some game data that does the same.) However, I appreciate that settling on a specification form might not be straightforward: soft links don't intrinsically have contents to hash like regular files, and there are decisions to be made explicit about whether to follow links that exit the file system scope of the bagging target.
I have a branch of the Bagit Python library that has a shell script that drafts soft link support requirements. The usage comments at the top of that script show my first draft of the test matrix, but I realized later that there's a bit more combinatoric expansion to do. For instance, a bag's manifest file should be able to represent a link whether it points to:
- A regular file
- A directory
- A path to a non-existent file
- A link to a regular file
- A link to a directory
- All of the above when the link references a file either (a) still within a bag, or (b) external to the bag
- All of the above when the link is absolute or relative
- Without capturing any files or directories that sit under a link to a directory
- The applicable cases above when the target exists but lacks read permission (i.e. where file system metadata can be captured, but not file contents)
That brainstorming implies these test dimensions:
- name type - r, d, -, l(r), l(d), l(-)
- relativity - relative, absolute
- containment (within bag target directory) - internal, external
- under soft link to directory - yes, no
- link target has read permission - yes, no
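To get a feel for the size of that matrix, here is a minimal sketch that enumerates the combinations with `itertools.product`. The dictionary keys and the `enumerate_cases` helper are illustrative names of my own, not anything from the existing test script, and some combinations will likely turn out to be inapplicable and need pruning.

```python
from itertools import product

# Test dimensions as listed above. The key names are illustrative only.
DIMENSIONS = {
    "name_type": ["r", "d", "-", "l(r)", "l(d)", "l(-)"],
    "relativity": ["relative", "absolute"],
    "containment": ["internal", "external"],
    "under_link_to_directory": ["yes", "no"],
    "target_readable": ["yes", "no"],
}


def enumerate_cases(dimensions=DIMENSIONS):
    """Return every combination of dimension values as a list of dicts."""
    names = list(dimensions)
    return [dict(zip(names, values)) for values in product(*dimensions.values())]


cases = enumerate_cases()
print(len(cases))  # 6 * 2 * 2 * 2 * 2 = 96 combinations before pruning
```

Each dict could then drive one fixture (create the link, bag the directory, check the manifest), with inapplicable combinations filtered out explicitly so the pruning decisions stay visible.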
Before this kind of testing can be implemented, some kind of special manifest is needed for soft links. That Bash test script assumes a file manifest-links.txt (living alongside manifest-$hashname.txt) with a two-column, tab-delimited format: column 0 holds the link contents, and column 1 holds the path to the link file. This follows the "content, whitespace, path" pattern of the BagIt hash manifests, but relies on tab as a safe character that should not appear in soft links. (Classic HFS allowed tabs in file names, but supported aliases, a subtly different file type from soft links. No more-modern file systems, to my knowledge, allow tabs.)
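As a rough illustration of that draft format, the sketch below walks a payload directory and writes one "link contents, tab, link path" row per soft link. The function names and the choice to sort rows by path are my assumptions, not part of the Bash script or the BagIt Python library. It does not descend into links to directories, and it rejects any field containing a tab or newline instead of assuming they never occur.

```python
import os


def collect_links(payload_dir):
    """Yield (link_contents, link_path) for each soft link under payload_dir.

    link_path is relative to payload_dir and uses forward slashes.
    """
    # followlinks=False: links to directories are reported but not entered.
    for root, dirs, files in os.walk(payload_dir, followlinks=False):
        for name in dirs + files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                rel = os.path.relpath(full, payload_dir).replace(os.sep, "/")
                yield os.readlink(full), rel


def write_links_manifest(payload_dir, manifest_path):
    """Write the hypothetical manifest-links.txt and return the rows written."""
    rows = sorted(collect_links(payload_dir), key=lambda row: row[1])
    with open(manifest_path, "w", encoding="utf-8", newline="\n") as fh:
        for contents, rel in rows:
            # The draft format has no escaping, so refuse ambiguous fields.
            if any(ch in field for field in (contents, rel) for ch in "\t\n"):
                raise ValueError(f"tab or newline in link entry: {rel!r}")
            fh.write(f"{contents}\t{rel}\n")
    return rows
```

Calling `write_links_manifest("bag/data", "bag/manifest-links.txt")` on a bag whose payload lives under `data/` would record broken links too, since `os.readlink` only reads the link itself and never touches the target.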
On a related note, I also work with another representation for file system metadata, Digital Forensics XML. That language represents file metadata as extracted from the file system, with some summaries of file content available (including coverage of the hashes in the Bagit spec). Its solution to representing soft links is to give them a designated type, code "l". In discussion with someone, I recently realized the language could also use a representation for the type of a link's target. So, I've since also come to believe the manifest-links.txt format I'd originally drafted could use an embedded type for the link target.
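One way such an embedded target type might be computed is sketched below, reusing the r, d, -, l(r), l(d), l(-) codes from the test dimensions above. The function names are hypothetical, and treating every `lstat` failure as "-" is a simplifying assumption (it also hides permission errors on parent directories).

```python
import os
import stat


def classify(path):
    """Classify path itself, without following it: 'r', 'd', 'l', '-' or '?'."""
    try:
        mode = os.lstat(path).st_mode
    except OSError:
        return "-"  # assumption: any lstat failure is reported as non-existent
    if stat.S_ISLNK(mode):
        return "l"
    if stat.S_ISDIR(mode):
        return "d"
    if stat.S_ISREG(mode):
        return "r"
    return "?"  # devices, sockets, FIFOs and so on are out of scope here


def link_target_type(link_path):
    """Return the type code for the immediate target of a soft link."""
    target = os.readlink(link_path)
    if not os.path.isabs(target):
        # Relative link contents are resolved against the link's directory.
        target = os.path.join(os.path.dirname(link_path), target)
    kind = classify(target)
    if kind != "l":
        return kind
    # The target is itself a link: report what the chain finally resolves to.
    try:
        mode = os.stat(target).st_mode
    except OSError:
        return "l(-)"  # broken or looping chain
    if stat.S_ISDIR(mode):
        return "l(d)"
    if stat.S_ISREG(mode):
        return "l(r)"
    return "l(?)"
```

The returned code could be carried as an extra column in manifest-links.txt, though where exactly it belongs in the row is one of the open format questions.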
Is support for this kind of data/metadata recording something the BagIt community would be interested in developing further? I'm happy to help spell out the test conditions.