Skip to content
This repository was archived by the owner on Jan 23, 2020. It is now read-only.
This repository was archived by the owner on Jan 23, 2020. It is now read-only.

Canonicalizing comments in RAWTEXT state #40

@adon-at-work

Description

@adon-at-work

@yukinying

I realize a bug in canonicalizing comments in RAWTEXT state. Documented here are the observation, bug, and suggested fixes.

Facts/Observation:

  • HTML5: no HTML comments are assumed inside RAWTEXT <style> <!-- </style> --> {{insideData}} </style>
  • HTML4: comments can nullify the effect of end tag <style> <!-- </style> --> {{insideStyle}} </style>

Imagine, the placeholder {{insideData}} holds an attack vector x:expression(alert(1)). the filter applied by contextual escaping has no effect to the value, since inHTMLData/yd filter does not touch anything. Given a HTML5 browser, there's no XSS concern. But this placeholder could be interpreted as in inside style tag in HTML4 and become vulnerable to XSS. We can see this hurts users of older browsers.

The current mitigation is exactly the same as in RCDATA state. The core principle is to turn <! into &lt;!. The replacement fulfills the HTML5 spec since it's an equiv. representation, and that the HTML5 spec won't consider them as comments anyway (i.e., no transitions possible to comment states). In addition, the resulted HTML won't have a chance to be interpreted as jumping into comment state, when rendered in HTML4 browser. As a result, this means the parsing experience is aligned across HTML4/5 browsers.

Such implementation worked properly in the mentioned inputs. But what I just realize is that it may break usability in the following use case: <style> div:after{content:'yo <!'} </style>, where it will become <style> selector:after{content:'yo &lt;!'} </style> after canonicalization. When rendered in browser, the user will see really &lt; instead of <. This is bad.

tl;dr.

The cause of such is actually due to the fact that RAWTEXT actually doesn't support char references, according to the html5 spec (what we overlooked). This motivates me to revisit every different RAWTEXT element for a better fix.

HTML5 spec documents that the following elements will result in RAWTEXT state:
style, xmp, iframe, noembed, or noframes + noscript

  • Inside <style>, we should turn <! to \3c !, therefore using the css escaped version. that solved the problem above1.
  • Inside <iframe>, no handling needed, or we can simply drop <! if found, since ultimately all innerHTML will be dropped as browser will render content supplied via the src attr.
  • Inside <noscript>, no handling needed. HTML 5 quits once encountered the first close tag in <noscript><!-- </noscript> A --> B </noscript> and can apply proper contextual filtering after it, i.e. in both positions A and B. Whereas in HTML4, the first close tag inside comment is ignored, and that means more content (A upto B) will get considered as inside noscript tag. So, no matter what filters are applied in them, they won't get rendered if script is enabled. And when script is disabled, we can be sure there're no XSS anyway.

The remaining ones <xmp>, <noembed>, and <noframes> are considered obsolete according to HTML5.

  • <xmp> works as if it's a <pre> with all < escaped as &lt;. therefore, no tags inside it can render. Similar to <noscript>, HTML4 ignoring an commented end tag would consider more content as in xmp, i.e., more content becoming inert to rendering.
  • my final recommendation is to drop <noframes> and <noembed> (as if they're blacklisted tags). Rationale: (1) we do not (and hardly we can) support contextual filtering for contents inside the tags. Imagine <noframes><a href="{{url}}">vulnerable</a></noframes>. We might need to hack the state machine as to treat the inner contents as placed inside DATA state for contextual filtering. (2) those browsers that can possibly render the inner contents are super old, and are not supported in our browser list.

1 discussed with @neraliu. he also considers the new approach makes sense, but would create an extra consideration of allowing \3c when we apply the strict CSS parser in style tags in the future.

note: tested samples

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions