Add support for Processing instructions #14

Open
Holzhaus wants to merge 5 commits into mckamey:master from RUB-NDS:processing-instructions

@Holzhaus Holzhaus commented May 22, 2017

This adds support for Processing Instructions. These are defined in the XML 1.0 Specification as follows:

[16] PI            ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
[17] PITarget      ::= Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))

Rationale

The JsonML website states:

Processing instruction nodes are inherently ignored in JsonML. There does not seem to be a useful analogy which is consistent with the intent of JsonML.

But to provide a truly lossless way to convert arbitrary XML to JsonML and back, IMHO they should be preserved:

  • As per Section 2.6 of the XML 1.0 Specification, passing PIs through to an application is a "MUST" requirement.

  • Accordingly, PIs are also preserved during XML Canonicalization (C14N), which is a way to normalize the physical appearance of an XML document while retaining its logic. The retention of PIs indicates that they are not insignificant (by contrast, comment support is optional and things like DTD or XML Declaration are lost during C14N).
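The canonicalization point can be checked with a small sketch. It assumes Python 3.8 or later, whose standard library offers `xml.etree.ElementTree.canonicalize` (an implementation of C14N 2.0, not necessarily the C14N version meant above); the sample document and its PI names are made up for illustration.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical input: an XML declaration, a top-level PI, a comment,
# and a PI nested inside the document element.
source = (
    '<?xml version="1.0"?>'
    '<?style sheet?>'
    '<!-- note -->'
    '<foo><?inner pi?></foo>'
)

# Comments are dropped unless with_comments=True is passed;
# the XML declaration is never emitted; PIs are kept.
canonical = canonicalize(source)
print(canonical)
```

Both PIs should still be present in the printed result, while the declaration and the comment should be gone.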

JsonML grammar modifications

Thus, I added support for Processing Instructions:

element
   = '[' tag-name ',' attributes ',' element-list ']'
   | '[' tag-name ',' attributes ']'
   | '[' tag-name ',' element-list ']'
   | '[' tag-name ']'
   | '[' pi-target ',' string ']'
   | string
   ;
pi-target
   = "?" + string
   ;

Processing Instructions can't be mistaken for elements since #x3F (= Question Mark [?]) is not a valid NameStartChar (which means that a tag-name can't begin with a question mark):

[2]  Char          ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
[3]  S             ::= (#x20 | #x9 | #xD | #xA)+
[4]  NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar      ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5]  Name          ::= NameStartChar (NameChar)*
[40] STag          ::= '<' Name (S Attribute)* S? '>'
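To illustrate why the leading question mark is unambiguous, here is a minimal sketch that transcribes production [4] into a regular expression and uses the `?` prefix to tell PI nodes from element nodes. The names `NAME_START_CHAR` and `is_pi_node` are hypothetical helpers, not part of the JsonML library.

```python
import re

# Production [4] (NameStartChar) from the XML 1.0 grammar quoted above,
# written as a single character class.
NAME_START_CHAR = re.compile(
    "[:A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF]"
)


def is_pi_node(node):
    """Return True if a JsonML node has the proposed [pi-target, string] shape."""
    return (
        isinstance(node, list)
        and len(node) == 2
        and isinstance(node[0], str)
        and node[0].startswith("?")
        and isinstance(node[1], str)
    )
```

Since `?` never matches `NAME_START_CHAR`, no legal tag-name can collide with a pi-target, so a check on the first character is enough.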

One subtlety: when a document contains one or more Processing Instructions as top-level constructs (next to the documentElement), the resulting JsonML text will have a top-level element with an empty string as tag-name and an element-list consisting of the PIs and the documentElement in document order.
This behaviour is consistent with the handling of fragments in JsonML.

Please have a look at the examples below and let me know what you think. Thanks in advance!

Examples

Here's a simple example:

<foo><?a-single-pi with data?></foo>

The above XML document represented as JsonML:

[
    "foo",
    [
        "?a-single-pi",
        "with data"
    ]
]
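The mapping shown above could be produced roughly as follows. This is only a sketch built on Python's `xml.dom.minidom`, not the code from this pull request; `to_jsonml` and `document_to_jsonml` are hypothetical names. Note that minidom does not keep whitespace between top-level nodes, so documents with top-level PIs will come out without the `"\n"` strings that a whitespace-preserving converter would emit.

```python
from xml.dom import Node, minidom


def to_jsonml(node):
    """Convert a DOM node to JsonML; returns None for unsupported node types."""
    if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        return node.data
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
        # Proposed mapping: ["?" + target, data]
        return ["?" + node.target, node.data]
    if node.nodeType == Node.ELEMENT_NODE:
        result = [node.tagName]
        if node.attributes.length:
            result.append(dict(node.attributes.items()))
        for child in node.childNodes:
            converted = to_jsonml(child)
            if converted is not None:
                result.append(converted)
        return result
    return None  # comments, doctypes, etc. are skipped in this sketch


def document_to_jsonml(xml_text):
    """Convert a whole document, wrapping top-level PIs in an "" element."""
    doc = minidom.parseString(xml_text)
    top = [to_jsonml(child) for child in doc.childNodes]
    top = [item for item in top if item is not None]
    if len(top) == 1:
        return top[0]
    return [""] + top


print(document_to_jsonml("<foo><?a-single-pi with data?></foo>"))
```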

Here's a more complex example that has Processing Instructions at the document level:

<?some-pi and its data?>
<foo>
    <?another-pi with data?>
</foo>
<?third-pi?>

The above file will map to:

[
    "",
    [
        "?some-pi",
        "and its data"
    ],
    "\n",
    [
        "foo",
        "\n    ",
        [
            "?another-pi",
            "with data"
        ],
        "\n"
    ],
    "\n",
    [
        "?third-pi",
        ""
    ]
]
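For the reverse direction, a minimal sketch of a serializer shows that the JsonML above carries enough information to rebuild the original XML text. The function name `to_xml` is hypothetical, and the sketch assumes well-formed input (it does not validate names or guard against `?>` inside PI data).

```python
from xml.sax.saxutils import escape, quoteattr


def to_xml(node):
    """Serialize a JsonML node (with the proposed PI extension) to XML text."""
    if isinstance(node, str):
        return escape(node)
    head, rest = node[0], node[1:]
    if head.startswith("?"):
        # PI node: ["?target", "data"]; empty data yields <?target?>
        data = rest[0] if rest else ""
        return f"<?{head[1:]} {data}?>" if data else f"<?{head[1:]}?>"
    attrs = {}
    if rest and isinstance(rest[0], dict):
        attrs, rest = rest[0], rest[1:]
    inner = "".join(to_xml(child) for child in rest)
    if head == "":
        # Empty tag-name: a fragment wrapper, emit only the children
        return inner
    attr_text = "".join(f" {k}={quoteattr(str(v))}" for k, v in attrs.items())
    return f"<{head}{attr_text}>{inner}</{head}>"


example = [
    "",
    ["?some-pi", "and its data"],
    "\n",
    ["foo", "\n    ", ["?another-pi", "with data"], "\n"],
    "\n",
    ["?third-pi", ""],
]
print(to_xml(example))
```

Applied to the JsonML above, this reproduces the five-line XML document from the example, including the top-level PIs in their original positions.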
