RDFa parser produces unexpected results with CDATA sections and entity references

Consider the example below: parsing this RDFa document produces predicates content1, ..., content5.

(This uses raptor rather than librdfa directly, and is copied from [raptor bug report 495](http://bugs.librdf.org/mantis/view.php?id=495) at the suggestion of the raptor maintainer; that is, this is a slightly indirect bug report. I hope that's OK.)

Tests content1, 2, 4 and 5 are, I think, wrong.

For content1, 2, 4 and 5, the CDATA marked section is simply omitted. Although http://www.w3.org/TR/rdfa-syntax/ doesn't mention CDATA marked sections, there's nothing there that seems to warrant ignoring them.

Tests content1, 2 and 5 produce XMLLiteral data which includes both elements and entities. However, in each of the three cases, the Turtle output has the characters denoted by entities (the `&<>`) appearing literally in the rdf:XMLLiteral, making it not valid XML; i.e. they're not escaped in any way. I can't find anything, in either http://www.w3.org/TR/REC-rdf-syntax/ (which I suppose is the definition of rdf:XMLLiteral) or http://www.w3.org/TeamSubmission/turtle/, that spells out what the content of an rdf:XMLLiteral should be, but I would be surprised if invalid XML were allowed. I don't believe this is a (raptor) Turtle serialisation problem, since looking at the post-parse result programmatically shows that the CDATA marked sections have been removed, and the `&<>` are sitting unescaped in a string which should be an XMLLiteral.

```
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML+RDFa 1.0//EN' 'http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd'>
<html xmlns='http://www.w3.org/1999/xhtml' xmlns:ns='urn:ns#' xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<head>
<title property='ns:title'>T</title>
<meta about='' property='ns:abstract' content='Abstract &lt;&gt;&amp;%' />
</head>
<body>


content1: <![CDATA[cdata<>&]]> not&amp;&lt;&gt;

content2: <![CDATA[cdata<>&]]> not&#38;&#60;&#62;

content3: plain content

content4: <![CDATA[cdata<>&]]> not&amp;&#60;&#62;

<div property='ns:content5'
 >content5: <![CDATA[cdata<>&]]> not&amp;&#60;&#62;</div>
</body></html>
```
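For comparison, here is a minimal Python sketch of what a generic XML parser reports for the content5 element, and of what the character data looks like once it is re-escaped so that it is well-formed XML. It uses only the standard library (`xml.etree.ElementTree` and `xml.sax.saxutils.escape`); it is not librdfa or raptor code, and the helper names (`parsed_text`, `as_xml_literal_text`, `is_well_formed_content`) are hypothetical, chosen for illustration. It assumes that escaping `&`, `<` and `>` is the appropriate fix, which is my expectation rather than something I found spelled out in the specs.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The content5 element from the test document, reduced to a standalone fragment.
FRAGMENT = (
    "<div xmlns='http://www.w3.org/1999/xhtml'>"
    "content5: <![CDATA[cdata<>&]]> not&amp;&#60;&#62;"
    "</div>"
)


def parsed_text(fragment: str) -> str:
    """Return the character data an XML parser reports for the element.

    The CDATA section and the entity/character references are all
    resolved to plain characters; nothing is dropped.
    """
    return ET.fromstring(fragment).text


def as_xml_literal_text(text: str) -> str:
    """Re-escape character data so it can be embedded as XML content."""
    return escape(text)  # escapes '&', '<' and '>'


def is_well_formed_content(content: str) -> bool:
    """Check whether a string parses as the content of an XML element."""
    try:
        ET.fromstring(f"<x>{content}</x>")
        return True
    except ET.ParseError:
        return False


text = parsed_text(FRAGMENT)
literal = as_xml_literal_text(text)

print(repr(text))                       # CDATA text is present, markup characters are raw
print(repr(literal))                    # same text with &, <, > escaped
print(is_well_formed_content(text))     # raw form is not well-formed XML content
print(is_well_formed_content(literal))  # escaped form is
```

In this sketch the CDATA text survives parsing, and only the re-escaped string passes the well-formedness check, which is the behaviour I would have expected from the parser for an rdf:XMLLiteral.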

And yes, I agree that the world would be a nicer place if CDATA marked sections did not exist.
