RDFa parser produces unexpected results with CDATA sections and entity references #4

@nxg

Consider the example below: an RDFa document which, when parsed, produces statements with the predicates content1, ..., content5.

(This is using raptor, rather than using librdfa directly, and indeed is copied from raptor bug report 495 at the suggestion of the raptor maintainer; that is, this is a slightly indirect bug report -- I hope that's OK.)

The results for tests content1, 2, 4 and 5 are, I think, wrong.

For content1, 2, 4 and 5, the CDATA marked section is simply omitted. Although http://www.w3.org/TR/rdfa-syntax/ doesn't mention CDATA marked sections, there's nothing there that seems to warrant ignoring them.

Tests content1, 2 and 5 produce XMLLiteral data which includes both elements and entities. However, in each of the three cases, the Turtle output has the characters denoted by the entities (the &<>) appearing literally in the rdf:XMLLiteral, which makes it invalid XML. That is, they're not escaped in any way. I can't find anything, in either http://www.w3.org/TR/REC-rdf-syntax/ (which I suppose is the definition of rdf:XMLLiteral) or http://www.w3.org/TeamSubmission/turtle/, which spells out what the content of an rdf:XMLLiteral should be, but I would be surprised if invalid XML were allowed. I don't believe this is a (raptor) Turtle serialisation problem, since inspecting the post-parse result programmatically shows that the CDATA marked sections have already been removed, and that the &<> are sitting unescaped in a string which should be an XMLLiteral.

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML+RDFa 1.0//EN' 'http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd'>
<html xmlns='http://www.w3.org/1999/xhtml' xmlns:ns='urn:ns#' xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<head>
<title property='ns:title'>T</title>
<meta about='' property='ns:abstract' content='Abstract &lt;&gt;&amp;%' />
</head>
<body>
<!-- for cases below, see http://www.w3.org/TR/rdfa-syntax/ Sect. 6.3.1.3 -->
<!-- explicit XMLLiteral @datatype -->
<p property='ns:content1'
   datatype='rdf:XMLLiteral'
   >content1: <![CDATA[cdata<>&]]> <span>not</span>&amp;&lt;&gt;</p>
<!-- no @datatype, presence of elements implies it -->
<p property='ns:content2'
   >content2: <![CDATA[cdata<>&]]> <span>not</span>&#38;&#60;&#62;</p>
<!-- no @datatype, but no XML elements, so plain literal -->
<p property='ns:content3'
   >content3: plain content</p>
<!-- explicit empty @datatype, so interpreted as a plain literal -->
<p property='ns:content4'
   datatype=''
   >content4: <![CDATA[cdata<>&]]> <span>not</span>&amp;&#60;&#62;</p>
<!-- basically same as content2 above -->
<div property='ns:content5'
     ><p>content5: <![CDATA[cdata<>&]]> <span>not</span>&amp;&#60;&#62;</p></div>
</body></html>
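
For comparison, here is a minimal sketch of how a generic XML parser treats the content1 paragraph, using only Python's standard library. It is not librdfa or raptor code and says nothing about their internals; the fragment is a cut-down copy of the paragraph above, and the helper names `inner_xml` and `is_well_formed` are made up for this illustration. It shows the two behaviours I would expect: the CDATA text survives parsing as ordinary character data, and re-serialising the element content escapes the &<> so that the result is still well-formed.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Cut-down copy of the content1 paragraph from the test document
# (namespace declarations dropped so the fragment parses on its own).
FRAGMENT = (
    "<p property='ns:content1' datatype='rdf:XMLLiteral'>"
    "content1: <![CDATA[cdata<>&]]> <span>not</span>&amp;&lt;&gt;</p>"
)


def inner_xml(element):
    """Serialise the content of `element` (hypothetical helper), escaping text."""
    parts = [escape(element.text or "")]
    for child in element:
        # ET.tostring() includes the child's tail text and escapes &, < and >.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


def is_well_formed(content):
    """Return True if `content` parses as element content (hypothetical helper)."""
    try:
        ET.fromstring("<wrapper>" + content + "</wrapper>")
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    p = ET.fromstring(FRAGMENT)
    print(repr(p.text))   # the CDATA characters are part of the text
    print(inner_xml(p))   # escaped, well-formed serialisation
```

A string with the defect described in this report (CDATA dropped, &<> left bare), such as `content1:  <span>not</span>&<>`, is rejected by `is_well_formed`, whereas the output of `inner_xml` is accepted. That string is only an illustration of the defect, not captured raptor output.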

And yes, I agree that the world would be a nicer place, if CDATA marked sections did not exist.
