gh-135661: Fix parsing start and end tags in HTMLParser by serhiy-storchaka · Pull Request #135930 · python/cpython

serhiy-storchaka · 2025-06-25T11:46:03Z

* Whitespace is no longer accepted between `</` and the tag name. E.g. `</ script>` does not end the script section.
* Vertical tabulation (`\v`) and non-ASCII whitespace characters are no longer recognized as whitespace. The only whitespace characters are `\t\n\r\f ` (tab, line feed, carriage return, form feed and space).
* The null character (U+0000) no longer ends the tag name.
* An end tag can have attributes and slashes after the tag name. It no longer ends at the first `>` inside a quoted attribute value. E.g. `</script/foo=">"/>`.
* Multiple slashes and whitespace characters between the last attribute and the closing `>` are now accepted in both start and end tags. E.g. `<a foo=bar/ //>`.
* Multiple `=` between the attribute name and value are no longer collapsed. E.g. `<a foo==bar>` produces attribute "foo" with value "=bar".
* Whitespace between the `=` separator and the attribute name or value is no longer ignored. E.g. `<a foo =bar>` produces two attributes, "foo" and "=bar", both with value None; `<a foo= bar>` produces two attributes: "foo" with value "" and "bar" with value None.
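The cases listed above can be probed with a small script. This is a minimal sketch and not part of the PR: `EventCollector`, `parse_events`, `SAMPLES` and `report` are hypothetical names, and only the public `html.parser.HTMLParser` API is used. The events reported for the sample inputs depend on whether the running Python includes this change (and any later follow-ups), so no expected output is shown.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records parser callbacks as tuples."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse_events(html):
    """Feed one complete document and return the recorded events."""
    parser = EventCollector()
    parser.feed(html)
    parser.close()
    return parser.events


# Inputs taken from the examples in the description above.
SAMPLES = [
    "<script>x</ script>y</script>",
    '<script>x</script/foo=">"/>',
    "<a foo=bar/ //>",
    "<a foo==bar>",
    "<a foo =bar>",
    "<a foo= bar>",
]


def report():
    """Print what the running interpreter's HTMLParser reports."""
    for sample in SAMPLES:
        print(repr(sample), "->", parse_events(sample))
```

Calling `report()` on an interpreter with and without the fix is one way to see which of the listed behaviors changed.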

Issue: HTMLParser differences from the HTML5 specification #135661


serhiy-storchaka · 2025-06-25T12:44:22Z

I tried to minimize the changes and to split this PR into several PRs, but they would not be independent, and all of these changes are needed to fix the possible XSS.

I am planning further refactoring, but that is only for the main branch.

ezio-melotti · 2025-07-02T14:36:08Z

Lib/html/parser.py

@@ -36,29 +36,33 @@
 #     explode, so don't do it.


I don't know if you saw and heeded the warning or if you just got lucky, but it looks like you were able to change these regexes!
Since you renamed locatestarttagend, the comment at line 34 should also be updated.

In addition, make sure that existing comments are still relevant. In particular I would appreciate this for comments linking to specific sections of the HTML5 standard.


ezio-melotti · 2025-07-02T14:52:17Z

Lib/html/parser.py

+     )?
+    [\t\n\r\f /]*                   # possibly followed by a space
+   )*
+   >?


These changes make sense to me.

I also noticed that you removed the start from locatestarttagend_tolerant, presumably because you are now using it to find the end of end tags too (which can contain attributes, even if they are invalid).

This variable is not documented; however, I can see two options:

we consider it private and just rename it;

we create an alias to the old name for backward compatibility, in case someone was using it.

Note that there used to be a set of *_strict variables that were removed, so the _tolerant suffix is no longer needed and was kept only for backward compatibility. Since you are refactoring/renaming (some of) these variables, you might want to consider dropping the _tolerant suffix altogether (and possibly adding aliases to preserve backward compatibility), either in this PR or in a separate one.


ezio-melotti · 2025-07-02T14:54:18Z

Lib/html/parser.py

        self.cdata_elem = elem.lower()
-        self.interesting = re.compile(r'</\s*%s\s*>' % self.cdata_elem, re.I)
+        self.interesting = re.compile(r'</%s(?=[\t\n\r\f />])' % self.cdata_elem,
+                                      re.IGNORECASE|re.ASCII)


Any reason for adding re.ASCII here?


ezio-melotti · 2025-07-02T15:01:43Z

Lib/html/parser.py

        assert rawdata[i:i+2] == "</", "unexpected call to parse_endtag"
-        match = endendtag.search(rawdata, i+1) # >
-        if not match:
+        if rawdata.find('>', i+2) < 0:


Suggested change

-        if rawdata.find('>', i+2) < 0:
+        if rawdata.rfind('>', i+2) < 0:

Probably inconsequential performance-wise, but using rfind seems more logical here (and possibly elsewhere).



serhiy-storchaka

Thank you for the review, @ezio-melotti.

serhiy-storchaka · 2025-07-02T17:43:12Z

Lib/html/parser.py

@@ -36,29 +36,33 @@
 #     explode, so don't do it.


There are links below; they still work, although they now redirect to another address. I updated them.

On the other hand, the section numbers have changed. I updated them in the places that I touched.

serhiy-storchaka · 2025-07-02T17:45:55Z

Lib/html/parser.py

+     )?
+    [\t\n\r\f /]*                   # possibly followed by a space
+   )*
+   >?


I restored the removed variables. I will remove them in the main branch in a follow-up PR.

serhiy-storchaka · 2025-07-02T17:51:01Z

Lib/html/parser.py

        self.cdata_elem = elem.lower()
-        self.interesting = re.compile(r'</\s*%s\s*>' % self.cdata_elem, re.I)
+        self.interesting = re.compile(r'</%s(?=[\t\n\r\f />])' % self.cdata_elem,
+                                      re.IGNORECASE|re.ASCII)


Yes, it affects case-insensitive matching. Without it, 'ſ' matches 's' and 'ı' matches 'i'. There may be more such cases after support for title and textarea is added.

This is not actually a problem in the current code, but future changes could make it important.
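To make the effect of this hunk concrete, the sketch below compiles the removed and the added pattern from the diff above, plus a third variant without `re.ASCII` that exists only for comparison. The helper `ends_cdata` and the variable names are hypothetical and are not taken from `Lib/html/parser.py`.

```python
import re

tag = "script"

# Pattern from the removed line of the hunk.
old = re.compile(r'</\s*%s\s*>' % tag, re.I)

# Pattern from the added lines of the hunk.
new = re.compile(r'</%s(?=[\t\n\r\f />])' % tag, re.IGNORECASE | re.ASCII)

# Same as `new` but without re.ASCII; for comparison only.
new_unicode = re.compile(r'</%s(?=[\t\n\r\f />])' % tag, re.IGNORECASE)


def ends_cdata(pattern, text):
    """Return True if `pattern` finds a candidate end tag in `text`."""
    return pattern.search(text) is not None


# Whitespace after "</" was accepted by the old pattern only.
space_after_slash = "</ script>"

# Attributes after the tag name are accepted by the new pattern only.
with_attribute = '</script/foo=">"/>'

# U+017F (LATIN SMALL LETTER LONG S) case-folds to "s" in Unicode mode.
long_s = "</\u017fcript>"

for name, text in [("space", space_after_slash),
                   ("attribute", with_attribute),
                   ("long s", long_s)]:
    print(name,
          ends_cdata(old, text),
          ends_cdata(new, text),
          ends_cdata(new_unicode, text))
```

The lookahead in the new pattern only checks that the tag name is followed by whitespace, `/` or `>`; it does not consume the rest of the end tag.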

serhiy-storchaka · 2025-07-02T17:59:14Z

Lib/html/parser.py

        assert rawdata[i:i+2] == "</", "unexpected call to parse_endtag"
-        match = endendtag.search(rawdata, i+1) # >
-        if not match:
+        if rawdata.find('>', i+2) < 0:


This check is not actually needed. It is simply an optimization for the case of a truncated end tag, because it is faster than endtagopen.match() + locatetagend.match(). I do not know whether it really helps, but I left it as insurance against unexpected performance degradation.

find may be faster than rfind in general, and in the case of an end tag, there is a good chance of finding ">" within the first few characters.
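As a rough illustration of why the choice between `find` and `rfind` does not change the result of this check, the sketch below wraps both calls in hypothetical helpers (`has_gt_find` and `has_gt_rfind` are not names from the parser). Both return the same answer to "is there any `>` after position `i+2`?"; they differ only in which occurrence they locate.

```python
def has_gt_find(rawdata, i):
    """True if a '>' occurs at or after i+2, scanning from the left."""
    return rawdata.find('>', i + 2) >= 0


def has_gt_rfind(rawdata, i):
    """True if a '>' occurs at or after i+2, scanning from the right."""
    return rawdata.rfind('>', i + 2) >= 0


truncated = "</script"            # no '>' yet: the parser must wait for more data
complete = '</script/foo=">"/>'   # example end tag from the PR description

for rawdata in (truncated, complete):
    assert has_gt_find(rawdata, 0) == has_gt_rfind(rawdata, 0)

# The two methods locate different occurrences in the complete tag:
first = complete.find('>', 2)    # the '>' inside the quoted attribute value
last = complete.rfind('>', 2)    # the '>' that closes the tag
print(first, last)
```

Since only the presence of `>` matters here, the check acts as a cheap early exit; locating the real end of the tag is left to the regular expressions mentioned in the reply.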


ezio-melotti · 2025-07-02T21:39:16Z

Lib/html/parser.py

@@ -36,29 +36,33 @@
 #     explode, so don't do it.


miss-islington-app · 2025-07-03T20:33:05Z

Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.9, 3.10, 3.11, 3.12, 3.13, 3.14.
🐍🍒⛏🤖

bedevere-app · 2025-07-03T20:33:15Z

GH-136255 is a backport of this pull request to the 3.14 branch.

serhiy-storchaka requested a review from ezio-melotti as a code owner June 25, 2025 11:46

serhiy-storchaka added needs backport to 3.13 bugs and security fixes needs backport to 3.14 bugs and security fixes labels Jun 25, 2025

bedevere-app bot added the awaiting core review label Jun 25, 2025

bedevere-app bot mentioned this pull request Jun 25, 2025

HTMLParser differences from the HTML5 specification #135661

Open

Fix Sphinx errors.

182b16f

ezio-melotti reviewed Jul 2, 2025


serhiy-storchaka and others added 4 commits July 2, 2025 20:17

Merge branch 'main' into htmlparser-tag

436a8a9

Apply suggestions from code review

ebf8ce3

Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>

Merge remote-tracking branch 'refs/remotes/origin/htmlparser-tag' into htmlparser-tag

d05303b

Address review comments.

955db4e

serhiy-storchaka commented Jul 2, 2025


ezio-melotti approved these changes Jul 2, 2025


Lib/html/parser.py

@@ -36,29 +36,33 @@

# explode, so don't do it.

ezio-melotti Jul 2, 2025

Thanks!

bedevere-app bot added awaiting merge and removed awaiting core review labels Jul 2, 2025

serhiy-storchaka added 2 commits July 3, 2025 18:22

Merge branch 'main' into htmlparser-tag

66ec1a0

Move to Security.

f38ad41

serhiy-storchaka added needs backport to 3.9 needs backport to 3.10 only security fixes needs backport to 3.11 only security fixes needs backport to 3.12 only security fixes labels Jul 3, 2025

serhiy-storchaka merged commit 0243f97 into python:main Jul 3, 2025
48 checks passed

bedevere-app bot removed the awaiting merge label Jul 3, 2025

bedevere-app bot removed the needs backport to 3.14 bugs and security fixes label Jul 3, 2025

serhiy-storchaka mentioned this pull request Jul 21, 2025

gh-135661: Fix parsing attributes with whitespaces around the "=" separator in HTMLParser #136908

Merged

This was referenced Jul 21, 2025

[3.13] gh-135661: Fix parsing attributes with whitespaces around the "=" separator in HTMLParser (GH-136908) #136918

Merged

[3.12] gh-135661: Fix parsing attributes with whitespaces around the "=" separator in HTMLParser (GH-136908) #136919

Merged

miss-islington mentioned this pull request Jul 21, 2025

[3.11] gh-135661: Fix parsing attributes with whitespaces around the "=" separator in HTMLParser (GH-136908) #136920

Merged

miss-islington mentioned this pull request Jul 21, 2025

[3.10] gh-135661: Fix parsing attributes with whitespaces around the "=" separator in HTMLParser (GH-136908) #136921

Merged

miss-islington mentioned this pull request Jul 21, 2025

[3.9] gh-135661: Fix parsing attributes with whitespaces around the "=" separator in HTMLParser (GH-136908) #136922

Merged

serhiy-storchaka mentioned this pull request Jul 21, 2025

[3.14] gh-135661: Fix parsing attributes with whitespaces around the "=" separator in HTMLParser (GH-136908) #136927

Merged

waylan mentioned this pull request Jul 21, 2025

More HTML parsing test failures in Python 3.14 beta 4 Python-Markdown/markdown#1547

Closed

efimov-mikhail mentioned this pull request Nov 1, 2025

html.parser: check_for_whole_start_tag crashes with AssertionError on empty string #140877

Closed
