Proposed changes for 1.0 (updated source repo) #19
Open
acdha wants to merge 169 commits into jkunze:master
This appears to have been commented out since at least 2008.
This follows the guidelines in RFC-3629
* Note the existence of namespaces in the security considerations section
* Update previously un-displayed list of reserved DOS/Windows filenames
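As a rough illustration of why the reserved-name list matters, the sketch below flags path components that collide with DOS/Windows device names. The `RESERVED` set is an assumption based on the commonly documented Windows device names and may not match the list in the spec exactly; `reserved_components` is a hypothetical helper, not part of any BagIt library.

```python
from pathlib import PurePosixPath

# Assumed list of DOS/Windows device names; the spec's own list may differ.
RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def reserved_components(path: str) -> list[str]:
    """Return the components of a '/'-separated path that use a reserved name.

    Windows treats "con.txt" the same as "CON", so only the part before the
    first dot is compared, case-insensitively.
    """
    hits = []
    for part in PurePosixPath(path).parts:
        stem = part.split(".", 1)[0].rstrip(" ").upper()
        if stem in RESERVED:
            hits.append(part)
    return hits
```

A bag creator could run such a check over every payload path before writing manifests and warn rather than fail, since these names are only a problem on some platforms.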
Clarify that this section is part of the specification but is not considered a hard requirement for an implementation.
Update the section describing md5sum’s output format and clarify that it is strictly optional to accept bags which are produced using md5sum and will not pass a strict validation.
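The md5sum issue can be sketched as follows. GNU md5sum prefixes the file name with `*` in binary mode, which a strict reader would treat as part of the path. The `parse_manifest_line` function and its `strict` flag are hypothetical names chosen for this illustration, and the regular expression is a simplification of the manifest line format.

```python
import re

# Hex checksum, one or more spaces or tabs, then the file path.
LINE = re.compile(r"^(?P<digest>[0-9A-Fa-f]+)[ \t]+(?P<path>.+)$")


def parse_manifest_line(line: str, strict: bool = True) -> tuple[str, str]:
    """Split a manifest line into (checksum, path).

    In strict mode a leading '*' stays part of the path, so a bag written by
    md5sum in binary mode will fail to validate. In lenient mode the marker
    is stripped, which is an optional accommodation.
    """
    match = LINE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a manifest line: {line!r}")
    digest = match.group("digest").lower()
    path = match.group("path")
    if not strict and path.startswith("*"):
        path = path[1:]
    return digest, path
```

Keeping the lenient behaviour behind an explicit flag makes it clear to callers that accepting md5sum-style lines is a choice and not something validation requires.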
This adds background information for problems related to case-sensitivity and Unicode normalization and adds a list of recommendations for implementors.
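One way to picture the case-sensitivity and normalization problem is a check for payload paths that become indistinguishable on some filesystems. This is a minimal sketch; `find_collisions` is a hypothetical helper, and the choice of NFC plus case folding as the comparison key is an assumption made here, not a recommendation quoted from the spec.

```python
import unicodedata
from collections import defaultdict


def find_collisions(paths: list[str]) -> list[list[str]]:
    """Group paths that differ only by case or Unicode normalization form."""
    groups = defaultdict(list)
    for path in paths:
        # Normalize, fold case, then normalize again because case folding
        # can produce text that is no longer in NFC.
        folded = unicodedata.normalize("NFC", path).casefold()
        key = unicodedata.normalize("NFC", folded)
        groups[key].append(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

For example, a precomposed "é" (U+00E9) and "e" followed by a combining acute accent (U+0301) look identical on screen but are different byte sequences in a manifest, so both spellings would land in the same group.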
This adds the note that, unlike other metadata tags, this element must not be repeated and clarifies that the Payload-Oxum value is not sufficient for validation.
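To make the limitation concrete, here is a small sketch of computing a Payload-Oxum value in the `OctetCount.StreamCount` form (total bytes, then number of payload files). The `payload_oxum` function is a hypothetical helper that assumes the payload lives under `data/` in the bag directory.

```python
from pathlib import Path


def payload_oxum(bag_dir: str) -> str:
    """Return "<total bytes>.<file count>" for the files under data/."""
    files = [p for p in (Path(bag_dir) / "data").rglob("*") if p.is_file()]
    octets = sum(p.stat().st_size for p in files)
    return f"{octets}.{len(files)}"
```

Because the value records only a byte total and a file count, two different payloads can share the same Oxum. It works as a quick sanity check before checksumming, not as a replacement for manifest validation.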
This triggers the standard formatting in HTML, etc. outputs
* Use <organization> for relevant <author> entries * Omit empty <date> attributes
* Remove reference to GRABIT since the spec is now returning HTTP 404 and there are no known public implementations. * Add METALINK (RFC 5854) as an alternative which supports mirrors and protocols such as BitTorrent.
This wording is shorter and doesn’t distinguish between validation for payload and tag files.
The spec shouldn't need to include mechanistic transfer details: if the results validate, it's a bag.
…rror handling "Upon discovering errors in bags, an implementation is free to take action (for example, logging or reporting) in an application-specific manner. This document does not mandate any particular action."
some displays ended with an extra blank line
Per reviewer comment:
> Section 2.1.3), a file named "bagit.txt" (see Section 2.1.1), and
> zero or more additional tag files (see Section 2.2). The tag files
> in the optional tag directories are arbitrary file hierarchies and
> the tag directories MAY have any name that is not reserved for a file
> or directory in this specification.
Above (2) seems to say that all tag directories are optional. Hence
constantly including the word 'optional' for them, in the rest of the
document, is distracting.
>
> The base directory MAY have any name.
>
> <base directory>/
> |   bagit.txt
> |   manifest-<algorithm>.txt
> |   [optional additional tag files]
> \--- data/
>       |   [payload files]
> \--- [optional tag directories]/
>       |   [optional tag files]
The square brackets are probably enough to indicate being optional. The
word just makes things wordier.
_The word “optional” has been removed as redundant, given the bracketing and that all tag directories have been described previously as optional._
acdha commented on May 25, 2018
bagit.xml
-  +-- [optional tag files]
-  </artwork>
+  +-- [optional tag files] </artwork>
Author
Was the intention of this to avoid an extra line in the rendered text output? I'm not a huge fan of the closing tag being on the end of the line like this but I'm not sure it's worth changing everything to go the other way.
Owner
Exactly, and it definitely makes the XML ugly. It may be a defect in the xml2rfc tool. If you can find a way around it, go for it, but if push comes to shove, I think getting the rendered human-oriented document consistent and correct is more important than making the XML pretty.
Per reviewer comment:
> A payload manifest is a tag file that lists payload files and
probably:
that lists payload file names and
Clarified.
Saying "lists" does imply names and not the file contents, but for some
reason I think the modified form will be clearer.
> checksums for those payload files generated using a particular bag
I'm pretty sure it's not the payload files that are generated using a
checksum algorithm... I assume it's a manifest payload file listing...
_That sentence was stricken during recent editing rounds. A similar sentence has been reworded: “Every payload manifest MUST list every payload file name exactly once.”_
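The "exactly once" rule can be checked mechanically. The sketch below is an illustration only; `check_manifest_coverage` is a hypothetical helper that takes the paths listed in one manifest and the paths actually present in the payload, both as plain strings.

```python
from collections import Counter


def check_manifest_coverage(
    manifest_paths: list[str], payload_paths: list[str]
) -> dict[str, list[str]]:
    """Report payload files that are missing from, or repeated in, a manifest."""
    counts = Counter(manifest_paths)
    listed = set(counts)
    payload = set(payload_paths)
    return {
        # Payload files the manifest never mentions.
        "missing": sorted(payload - listed),
        # Paths the manifest lists more than once.
        "duplicated": sorted(p for p, n in counts.items() if n > 1),
        # Manifest entries with no matching payload file.
        "unknown": sorted(listed - payload),
    }
```

A manifest satisfies the sentence above when both `missing` and `duplicated` come back empty.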
Per reviewer comment:
> checksum algorithm. Every bag MUST contain one payload manifest
> file, and MAY contain more than one. A payload manifest file MUST
I think this is unusual enough to warrant, again, an initial, summary
statement. If I'm understanding, it should be something like:
A bag can have more than one data integrity manifest, with each
using a different validation algorithm.
_This sentence has been added: A bag can have more than one payload manifest, with each
using a different validation algorithm._
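For a bag that carries several payload manifests, a tool can compute all the checksums in one pass over each file. This is a minimal sketch using Python's standard `hashlib`; the `file_digests` name and the default algorithm pair are assumptions made for the example.

```python
import hashlib


def file_digests(
    path: str,
    algorithms: tuple[str, ...] = ("sha256", "sha512"),
    chunk_size: int = 1 << 20,
) -> dict[str, str]:
    """Return {algorithm: hex digest} for one file, reading it only once."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Each entry in the result would feed the corresponding `manifest-<algorithm>.txt` file, so adding a second manifest costs extra hashing but no extra reads.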
Per reviewer comment:
> Source-Organization Organization transferring the content.
...
> Organization-Address Mailing address of the organization.
organization -> source organization
> Contact-Name Person at the source organization who is responsible
> for the content transfer.
>
> Contact-Phone International format telephone number of person or
> position responsible.
>
> Contact-Email Fully qualified email address of person or position
> responsible.
> ...
> External-Description A brief explanation of the contents and
> provenance.
...
> Bagging-Date Date (YYYY-MM-DD) that the content was prepared for
> delivery.
I think you mean 'transfer' rather than 'delivery'...
Per reviewer comment:
> The "fetch.txt" file allows a bag to be transmitted with "holes" in
> it, which can be practical for several reasons. For example, it
> obviates the need for the sender to stage a large serialized copy of
> the content while the bag is transferred to the receiver. Also, this
> method allows a sender to construct a bag from components that are
> either a subset of logically related components (e.g., the localized
> logical object could be much larger than what is intended for export)
> or assembled from logically distributed sources (e.g., the object
> components for export are not stored locally under one filesystem
> tree).
This paragraph would be a better introduction to the section.
_Done._
Per reviewer comment:
> Implementors of tools that complete bags by retrieving URLs listed in
> a "fetch.txt" file need to be aware that some of those URLs may point
> to hosts, intentionally or unintentionally, that are not under
> control of the bag's sender. Checksums are intended as a reasonable
> guarantee against corruption during transit, not a strong
> cryptographic protection against intentional spoofing.
Oh?
_This wording was meant to apply to checksums as they are used in bags, as well as to address criticism that many legacy bags used easily broken MD5 checksums. That last sentence has now been reworded to: Moreover, older checksum algorithms, even if reasonable for detecting corruption during transit, may not offer strong cryptographic protection against intentional spoofing._
Per reviewer comment:
> In all text tag files except for the bag declaration file, text MUST
> be encoded in the character encoding specified in the "bagit.txt" bag
be encoded in the character encoding -> use the character encoding
_Done._
Per reviewer comment:
> The size of files, as optionally reported in the "fetch.txt" file,
> cannot be guaranteed to match the actual file size to be downloaded.
> Implementors SHOULD take care to appropriately handle cases where the
> actual file size does not match the file size reported in the
> fetch.txt. Implementors SHOULD NOT use the file size in the
> "fetch.txt" file for critical resource allocation, such as buffer
> sizing or storage requisitioning.
Absent specification of what "appropriately handle" means, this guidance lacks substance.
_Reworded the second sentence to be: Implementers SHOULD take steps to monitor and abort transfer when the received file size exceeds the file size reported in the fetch file._
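The reworded guidance could be implemented along these lines. This is a sketch under assumptions: `copy_with_limit` and `SizeExceeded` are hypothetical names, and `src` and `dst` stand for any binary file-like objects (for example an HTTP response body and a local file).

```python
class SizeExceeded(Exception):
    """Raised when a transfer delivers more bytes than fetch.txt reported."""


def copy_with_limit(src, dst, expected_size: int, chunk_size: int = 64 * 1024) -> int:
    """Copy src to dst, aborting once more than expected_size bytes arrive.

    Returns the number of bytes written. Nothing is preallocated from the
    reported size; it is used only as an upper bound while streaming.
    """
    received = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        received += len(chunk)
        if received > expected_size:
            raise SizeExceeded(
                f"received {received} bytes, fetch.txt reported {expected_size}"
            )
        dst.write(chunk)
    return received
```

A shorter-than-reported transfer is not treated as an error here, since the checksum in the manifest will catch a truncated file during validation.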
Update Justin's contact info
updated email address
Changed reference to character set registry.
Added clarification about malicious attackers.
This is a replacement for #17 reflecting the move from the old loc-rdc organization to the primary LibraryOfCongress. The primary notable change from #17 is restoring the fetch.txt section following discussion with @jkunze, @dbrunton, and @johnscancella.