Fixes HTML to plain text conversion #592

Merged
idlira merged 7 commits into main from ili/OX-12298
Jan 16, 2026

@idlira idlira commented Jan 16, 2026

Description

StringCleanup#htmlToPlainText is expected to convert p and br tags to line breaks.
This works only when these tags are not adjacent to other tags. For example:

➡️ Input: Hello<br />World
✅ Output:

Hello
World

but
➡️ Input: <strong>Hello</strong><br />World
🚫 Output: HelloWorld

This happens because StringCleanup.PATTERN_STRIP_XML also matches whitespace leading or trailing a tag, effectively replacing </strong>\n in the example above with an empty string.

This pattern is used in several other places, so to avoid disrupting it and possibly causing a breaking or behavioral change, we iterate over the lines and clean each one individually.
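The effect can be sketched in isolation. The two patterns below are hypothetical stand-ins chosen for illustration, not the actual sirius-kernel definitions, and the class and method names are made up as well. The sketch only assumes what the description states: the stripping pattern swallows whitespace around a tag, so applying it per line keeps the converted line breaks out of its reach.

```java
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class LineWiseStripSketch {

    // Hypothetical stand-in: turns <br>, <br/>, <br /> and </p> into a line break.
    private static final Pattern BREAK_TAGS = Pattern.compile("(?i)<br\\s*/?>|</p>");

    // Hypothetical stand-in for PATTERN_STRIP_XML: a tag plus any whitespace around it.
    private static final Pattern STRIP_XML = Pattern.compile("\\s*</?[a-zA-Z][^>]*>\\s*");

    // Old behavior: strip tags on the whole input, so "</strong>\n" vanishes as one match.
    static String stripWholeInput(String html) {
        String withBreaks = BREAK_TAGS.matcher(html).replaceAll("\n");
        return STRIP_XML.matcher(withBreaks).replaceAll("");
    }

    // New behavior: strip tags line by line, so the pattern never sees a line break.
    static String stripPerLine(String html) {
        String withBreaks = BREAK_TAGS.matcher(html).replaceAll("\n");
        return withBreaks.lines()
                         .map(line -> STRIP_XML.matcher(line).replaceAll(""))
                         .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String input = "<strong>Hello</strong><br />World";
        System.out.println(stripWholeInput(input)); // prints "HelloWorld"
        System.out.println(stripPerLine(input));    // prints "Hello" and "World" on separate lines
    }
}
```

Because whitespace at the start or end of a line is still matched when it touches a tag, a per-line approach like this one can change leading or trailing spaces on individual lines.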

Additional Notes

  • This PR fixes or works on following ticket(s): OX-12298

Checklist

  • Code change has been tested and works locally
  • Code was formatted via IntelliJ and follows SonarLint & best practices

including combined extra tags

Fixes: OX-12298
preserving the line breaks converted previously

Fixes: OX-12298
@idlira idlira added the 🐛 Bugfix and 🧬 Enhancement labels Jan 16, 2026
@idlira idlira requested a review from Copilot January 16, 2026 09:53

Copilot AI left a comment

Pull request overview

This PR fixes an issue where StringCleanup#htmlToPlainText incorrectly handled line breaks when HTML tags preceded <br /> or <p> tags. The fix processes each line individually to preserve line breaks that would otherwise be removed by the XML stripping regex pattern.

Changes:

  • Added line-by-line processing in htmlToPlainText to prevent the XML stripping pattern from removing intentional line breaks
  • Introduced Strings.iterateLines utility method to support line-by-line iteration with platform-independent line ending handling
  • Added test case to verify the fix for HTML with inline tags followed by line breaks
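The signature of Strings.iterateLines is not shown in this conversation, so the following is only a hypothetical sketch of what platform-independent line iteration can look like in Java. The class name, method names, and the consumer-based shape are assumptions. It relies on the regex linebreak matcher \R, which treats \r\n as a single break and also matches a lone \n or \r.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Pattern;

class LineIterationSketch {

    // \R matches any linebreak sequence, with \r\n consumed as one break.
    private static final Pattern LINE_BREAK = Pattern.compile("\\R");

    // Hypothetical helper: hands every line of the input to the consumer, without its terminator.
    static void forEachLine(String input, Consumer<String> lineConsumer) {
        if (input == null || input.isEmpty()) {
            return;
        }
        // The limit of -1 keeps trailing empty lines instead of silently dropping them.
        for (String line : LINE_BREAK.split(input, -1)) {
            lineConsumer.accept(line);
        }
    }

    // Convenience wrapper that collects the lines into a list.
    static List<String> collectLines(String input) {
        List<String> lines = new ArrayList<>();
        forEachLine(input, lines::add);
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(collectLines("a\r\nb\rc\nd")); // prints [a, b, c, d]
    }
}
```

Keeping empty lines matters for inputs with consecutive breaks, since each empty line represents a break that the caller presumably wants to preserve.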

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File: Description
  • src/main/java/sirius/kernel/commons/StringCleanup.java: Modified htmlToPlainText to process lines individually, preserving converted line breaks
  • src/main/java/sirius/kernel/commons/Strings.java: Added iterateLines utility method for platform-independent line iteration
  • src/test/kotlin/sirius/kernel/commons/StringsTest.kt: Added test case verifying correct handling of inline tags followed by <br />

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@idlira idlira requested a review from jakobvogel January 16, 2026 10:21
Comment on lines +428 to +434
assertEquals(
    "bold\nmove",
    Strings.cleanup(
        "<strong>bold</strong><br />move",
        StringCleanup::htmlToPlainText
    )
)

Member

As mentioned in my other comment: Maybe also add special cases as tests, such as the following one.

<br />
<br />
Hello<br />
World

And I know, the test method was already like that before, but IMHO this is a case for a parameterized test. Right now, the list of unrelated tests in one method gives less reliable statistics if one of the tests fails.
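The idea of evaluating each case independently can be sketched in plain Java. In the project itself, a JUnit 5 @ParameterizedTest would be the natural tool. The converter below is a toy stand-in, not StringCleanup#htmlToPlainText, and all names are hypothetical. Only the two input/output pairs already stated in this PR are used as cases.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

class TableDrivenCasesSketch {

    // Toy stand-in converter: <br> variants become line breaks, other tags are dropped per line.
    static String toyHtmlToPlainText(String html) {
        return html.replaceAll("(?i)<br\\s*/?>", "\n")
                   .lines()
                   .map(line -> line.replaceAll("<[^>]*>", ""))
                   .collect(Collectors.joining("\n"));
    }

    // Runs every case on its own and returns one message per failing case,
    // so a single failure does not hide the outcome of the remaining cases.
    static List<String> runCases(Map<String, String> expectedByInput, UnaryOperator<String> converter) {
        List<String> failures = new ArrayList<>();
        expectedByInput.forEach((input, expected) -> {
            String actual = converter.apply(input);
            if (!expected.equals(actual)) {
                failures.add("input=[" + input + "] expected=[" + expected + "] actual=[" + actual + "]");
            }
        });
        return failures;
    }

    public static void main(String[] args) {
        Map<String, String> cases = new LinkedHashMap<>();
        cases.put("Hello<br />World", "Hello\nWorld");
        cases.put("<strong>bold</strong><br />move", "bold\nmove");
        List<String> failures = runCases(cases, TableDrivenCasesSketch::toyHtmlToPlainText);
        System.out.println(failures.isEmpty() ? "all cases passed" : String.join(System.lineSeparator(), failures));
    }
}
```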

Contributor Author

These exist already:

        assertEquals(
            """
            first

            second
        """.trimIndent(),
            Strings.cleanup("<p>first<br><br/>second</p>", StringCleanup::htmlToPlainText, StringCleanup::trim)
        )

Member

That test is not equivalent to my example: its input contains no line breaks, and it additionally applies trimming. Therefore, it does not test the behaviour of your new approach. 🙃

I played around with this a little:

  • An extension of your test case by adding a line break: <strong>bold</strong><br />\nmove gives bold\n move (with a space in front of move)
  • My example <br />\n<br />\nHello<br />\nWorld gives \n Hello\n World (with a space at the front of each of the three lines!)

I am not sure whether these really are the results we want. Therefore, I would appreciate additional test cases that define the gold standard.

Contributor Author

The logic in this class is so broken... it will need a review and a follow-up PR.

@idlira idlira merged commit 8d171db into main Jan 16, 2026
4 checks passed
@idlira idlira deleted the ili/OX-12298 branch January 16, 2026 11:01