@sirreal sirreal commented Dec 15, 2025

The HTML API currently rejects script tag contents that may be dangerous. This is a proposal to detect JavaScript and JSON script tags and automatically escape contents when necessary.

  • JSON and JavaScript script tags may be detected according to the HTML standard.
  • Script tag contents are escaped only when <script or </script (case-insensitive) is found.
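The detection condition in the second bullet can be sketched in JavaScript as a single case-insensitive test. This is only an illustration: `needsEscaping` is a hypothetical name, and the PR's actual implementation is PHP inside the HTML API.

```javascript
// Hypothetical helper, not part of the HTML API: reports whether script
// contents contain the text that triggers escaping.
function needsEscaping(contents) {
    // Case-insensitive match for "<script" or "</script".
    return /<\/?script/i.test(contents);
}

needsEscaping('console.log("hi");');   // false
needsEscaping('var s = "</SCRIPT>";'); // true
```

Contents that never contain either sequence pass through unchanged, which keeps the common case cheap.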

In JSON

The JavaScript escaping strategy is also applicable to JSON and produces more minimal, readable, and HTML-safe output, so it is applied to JSON content as well.

Outdated description

< is replaced with \u003C. This eliminates the problematic strings and aligns with the approach described in #63851 and applied in r60681.

This is proposed as a simple character replacement with strtr. This should be highly performant. A less invasive replacement could be done to only replace < in <script or </script where it's really necessary. This would preserve more of the JSON string, but likely at the cost of performance. It would require a regular expression with case-insensitive matching (see JavaScript example).

In JavaScript

In <script and </script (followed by a necessary tag termination character \t\n\r\f/>), the s is replaced with its Unicode escape. The result should remain valid in all contexts where the text can appear and maintain identical behavior in all but a few edge cases (see the ticket or the quoted section below for a full explanation and caveats).
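A minimal JavaScript sketch of this replacement follows. It is not the PR's PHP code: the function name is hypothetical, and including the space character among the tag termination characters is an assumption based on the HTML tokenizer rules. The case of the original letter is preserved by choosing \u0073 for s and \u0053 for S.

```javascript
// Hypothetical sketch of the escaping strategy; the real implementation
// lives in the PHP HTML API and may differ in detail.
function escapeJavaScriptScriptContents(source) {
    return source.replace(
        // "<script" or "</script" in any letter case, only when followed by
        // a tag termination character (space included here as an assumption).
        /<(\/?)(s)(cript)(?=[\t\n\r\f \/>])/gi,
        (match, slash, s, rest) => {
            // Replace only the "s", keeping its case in the escape.
            const escaped = s === 's' ? '\\u0073' : '\\u0053';
            return '<' + slash + escaped + rest;
        }
    );
}

// The closing tag inside the string literal no longer appears in the output.
console.log(escapeJavaScriptScriptContents('const s = "</script>";'));
```

Text such as `<scripts>` is left alone because the character after `script` does not terminate a tag name.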

From the ticket:

The HTML API prevents setting SCRIPT tag that could modify the tree either by closing the SCRIPT element prematurely, or by preventing the SCRIPT element from closing at the expected close tag.

This is handled by rejecting any script tag contents that are potentially dangerous, which is safe. There are some improvements that could be made.

If the contents are found to be unsafe and the type of the script tag is JSON or JavaScript (this is well specified in the HTML standard), it should be possible to apply a syntactic transformation to the contents in such a way that the script contents become safe without semantically altering the script.

If the HTML API can safely and automatically escape the majority of SCRIPT tag contents, it can then be used for SCRIPT tag creation and has the potential to eliminate the class of problem from #40737, #62797, and #63851. It also has the potential to address part of #51159 where SCRIPT tag escaping becomes less of an issue.

JSON

In JSON SCRIPT tags, the transformation is a simple replacement of < with its Unicode escape sequence \u003C. This can be applied to the entire contents of the script or specifically in case-insensitive matches for <script and </script.
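The whole-contents variant can be sketched in JavaScript as follows. The function name is hypothetical and this is not the PR's PHP code. It relies on the fact that < can only appear inside string values in valid JSON, where \u003C decodes back to the same character.

```javascript
// Hypothetical sketch: replace every "<" in serialized JSON with its
// Unicode escape so the text cannot open or close a SCRIPT element.
function escapeJsonScriptContents(json) {
    return json.replace(/</g, '\\u003C');
}

const data = { html: '</script><script>alert(1)</script>' };
const escaped = escapeJsonScriptContents(JSON.stringify(data));

escaped.includes('<');                  // false
JSON.parse(escaped).html === data.html; // true
```

The serialized text differs from the input, but it decodes to an identical value.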

JavaScript

JavaScript SCRIPT tags are more difficult because the language has vastly more syntax. Fortunately, there is prior art described in this 2022 blog post (external) from React team member Sophie Alpert. It's the same JavaScript SCRIPT tag contents escaping strategy that React continues to employ today. In summary, the problematic text <script and </script syntactically appear in places where Unicode escape sequences can be used in the script part (Strings, Identifiers, and RegExp literals). React takes the approach of replacing the s character, resulting in <\u0073cript or </\u0073cript, which is completely safe in a SCRIPT tag.

There are a few notable exceptions where the transformed JavaScript has observably different runtime behavior. These are the only examples I'm aware of. They're more esoteric parts of the language and are unlikely to be used in inline JavaScript together with the problematic text sequences, which seems an acceptable tradeoff to me for enabling cheap, automatic JavaScript escaping.

String.raw does not process escape sequences.

'<script>' === '<\u0073cript>'; // true
String.raw`<script>` === String.raw`<\u0073cript>`; // false

Tagged templates can also access the raw strings, again a form without processing escape sequences.

function taggedCooked( strings ) {
    return strings[0];
}
taggedCooked`<script>` === taggedCooked`<\u0073cript>`; // true

function taggedRaw( strings ) {
    return strings.raw[0];
}
taggedRaw`<script>` === taggedRaw`<\u0073cript>`; // false

The source property of RegExp contains a string representation of the pattern. JavaScript RegExp literals support Unicode escape sequences, but the escape sequence is preserved as written in source rather than being decoded.

const rPlain = /<script>/;
const rEscaped = /<\u0073cript>/;

rPlain.test('<script>'); // true
rEscaped.test('<script>'); // true

rPlain.source === rEscaped.source; // false
rPlain.source; // '<script>'
rEscaped.source; // '<\\u0073cript>'

Any better JavaScript escaping would likely require a complete JavaScript parser and much more invasive changes. It would be much more costly to perform. Even then, I'm not sure that the escaping could be done faithfully.

String.raw() could be split and joined:

String.raw`<script>` === String.raw`<s` + String.raw`cript>`; // true

Tagged template raw and RegExp source seem much more challenging.

Trac ticket: https://core.trac.wordpress.org/ticket/64419


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

dmsnell commented Jan 12, 2026

@sirreal I updated the .dot file to clarify labels and sizing of the nodes in the default graphical presentation of the file, and to harmonize with the change I made replacing the dagger with a superscript. (all I intended on doing was adding the link to the HTML spec…but then I ran dot to verify that the comment wouldn’t break it and saw the visual issues)

in the process I accidentally pushed out the build-change-revert that I’ve been using to test and develop. since I can’t force-push to your branch we’ll just have to remember to git rebase -i --rebase-merges --keep-base trunk and remove that merge commit before we merge this PR.

[Screenshots: before (2026-01-12, 2:42:03 PM) and after (2026-01-12, 2:39:58 PM)]

@sirreal sirreal force-pushed the html-api/auto-escape-javascript-json branch from 18b3ca2 to 3f1ba32 Compare January 12, 2026 22:38
);

$processor->set_modifiable_text( "\n{$importmap}\n" );
$decoded_importmap = json_decode( $processor->get_modifiable_text(), true );
Member

this can be handled post-merge or now, but do we really want to assert identical serialization?

with the check below for equality of the input and decoded values, why do we care if the escaping was different?

on the other hand, if we are wanting to assert specific kinds of escapes, maybe they get dedicated unit tests with clear assertions?

Member Author

Let's sort it out in follow-up work.

Are you referring to the assertEqualHTML below on the HTML serialization?

Member

no I’m talking about JSON-serialization. I don’t really care if they are different but decode identically.

Member Author

I'm not sure what exactly you're referring to. If you help me to understand, I'm happy to discuss and consider changes.

Member

@dmsnell dmsnell left a comment

Mon-u-mental effort @sirreal.

I left a comment and think that the tests/phpunit/tests/html-api/wpHtmlTagProcessorModifiableText.php file is a bit…distracted…but it’s not a blocker IMO. Would be nice to follow-up with that but some of the other existing HTML API tests are also a bit chaotic for practical reasons.

This is so exciting I can hardly stand it. Let’s get it in and testing at large!

* @param string $sourcecode Raw contents intended to be serialized into an HTML SCRIPT element.
* @return string Escaped form of input contents which will not lead to premature closing of the containing SCRIPT element.
*/
public static function escape_javascript_script_contents( string $sourcecode ): string {
Member Author

This was made public in bfa60cf.

@westonruter asked about this new public method in #10639 (comment), and it was something I had planned to review.

I'm going to make it private for now and we can consider opening it up in follow-up work.

@@ -0,0 +1,33 @@
// https://html.spec.whatwg.org/multipage/parsing.html#script-data-state
Member Author

I had simplified this graph to produce the ASCII chart that's included in this PR. I used an online graphviz ASCII converter and it rejects some of the syntax in this version. I don't know whether it's a problem with the converter I used or if this ASCII output mode for graphviz does not support some of the features used here.

I want to make sure this is able to produce an ASCII version of the chart. We can revisit in a follow-up.

Member

nothing should reject the syntax. do you know which parts it rejects? in unsupported output formats we should hope that it simply ignores what it can’t support.

perhaps we can find another converter. graphviz has a native one.

pento pushed a commit that referenced this pull request Jan 13, 2026
When setting JavaScript or JSON script tag content, automatically escape sequences like `<script>` and `</script>`. This renders the content safe for HTML. The semantics of any JSON and virtually any JavaScript are preserved.

Script type detection follows the HTML standard for identifying JavaScript and JSON script tags. Other script types continue to reject potentially dangerous content.

Developed in #10635.

Props jonsurrell, dmsnell, westonruter.
Fixes #64419. See #63851, #51159.



git-svn-id: https://develop.svn.wordpress.org/trunk@61477 602fd350-edb4-49c9-b593-d223f7449a82
@github-actions

A commit was made that fixes the Trac ticket referenced in the description of this pull request.

SVN changeset: 61477
GitHub commit: c55a0f2

This PR will be closed, but please confirm the accuracy of this and reopen if there is more work to be done.

@github-actions github-actions bot closed this Jan 13, 2026
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request Jan 13, 2026
Built from https://develop.svn.wordpress.org/trunk@61477


git-svn-id: http://core.svn.wordpress.org/trunk@60789 1a063a9b-81f0-0310-95a4-ce76da25c4cd
@sirreal sirreal deleted the html-api/auto-escape-javascript-json branch January 13, 2026 13:28
github-actions bot pushed a commit to platformsh/wordpress-performance that referenced this pull request Jan 13, 2026
Built from https://develop.svn.wordpress.org/trunk@61477


git-svn-id: https://core.svn.wordpress.org/trunk@60789 1a063a9b-81f0-0310-95a4-ce76da25c4cd
sirreal added a commit to sirreal/wordpress-develop that referenced this pull request Jan 14, 2026