Fix Incomplete CSI Final-Byte And Intermediate-Byte Matching#66

Open
MRayermannMSFT wants to merge 2 commits into chalk:main from MRayermannMSFT:bug-incomplete-csi-final-byte-class

@MRayermannMSFT MRayermannMSFT commented Feb 15, 2026

What

Replace the hand-enumerated CSI final-byte character class with the full ECMA-48 range (0x40–0x7E), add support for intermediate bytes (0x20–0x2F), and extract ESC<, ESC=, ESC> into a dedicated pattern so they are no longer conflated with CSI sequences.

Why

The CSI final-byte class [\dA-PR-TZcf-nq-uy=><~] has gaps: it omits valid final bytes such as X (ECH), @ (ICH), b (REP), d (VPA), e (VPR), and others. It also incorrectly treats digits as valid final bytes, so a sequence like ESC[31X is partially matched as ESC[31 (with 1 consumed as the final byte), leaving a stray X in the stripped output. This breaks any downstream code that relies on clean ANSI stripping. For example, Windows ConPTY emits ECH sequences, and stripping them left stray characters in the output.

Using the spec-defined range [@-~] (equivalently \x40–\x7E) eliminates these gaps.

MRayermannMSFT changed the title from "Fix Incomplete Csi Final-Byte And Intermediate-Byte Matching" to "Fix Incomplete CSI Final-Byte And Intermediate-Byte Matching" on Feb 15, 2026
MRayermannMSFT force-pushed the bug-incomplete-csi-final-byte-class branch from 673bfeb to 63145fe on February 15, 2026 23:55
Contributor

Qix- commented Feb 17, 2026

Hi there, this looks interesting. Do you have any sources for these? I'd be curious to read more about it.

@MRayermannMSFT
Author

Sure! Here's what I'm basing my work off of:


ECMA-48 §5.4 - CSI sequence structure

The spec (ECMA-48) defines the CSI control sequence format in §5.4 as:

CSI P...P I...I F

With the byte ranges spelled out explicitly:

  1. P ... P are Parameter Bytes, which, if present, consist of bit combinations from 03/00 to 03/15 [i.e. 0x30–0x3F];

  2. I ... I are Intermediate Bytes, which, if present, consist of bit combinations from 02/00 to 02/15 [i.e. 0x20–0x2F]. Together with the Final Byte F, they identify the control function;

  3. F is the Final Byte; it consists of a bit combination from 04/00 to 07/14 [i.e. 0x40–0x7E]; it terminates the control sequence.

Wikipedia's ANSI escape code — CSI article also summarizes these ranges with the same ECMA-48 §5.4 citation.
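
As a minimal sketch, the three byte ranges above translate almost directly into a regular expression. The names `CSI_SKETCH` and `stripCsi` are hypothetical and are not taken from this PR; the sketch also assumes only the 7-bit introducer (ESC followed by `[`) needs to be handled.

```javascript
// Sketch of the ECMA-48 §5.4 grammar: CSI P...P I...I F.
//   [\x30-\x3f]*  parameter bytes (zero or more)
//   [\x20-\x2f]*  intermediate bytes (zero or more)
//   [\x40-\x7e]   exactly one final byte
// Assumption: only the 7-bit introducer ESC [ is handled here.
const CSI_SKETCH = /\x1b\[[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]/g;

// Hypothetical helper, for illustration only.
function stripCsi(text) {
  return text.replace(CSI_SKETCH, '');
}

console.log(JSON.stringify(stripCsi('hello\x1b[5Xworld'))); // "helloworld"
console.log(JSON.stringify(stripCsi('a\x1b[2 qb')));        // "ab"
```

Because the final byte is mandatory in this sketch, a sequence that ends after the introducer or after its parameters is not matched at all.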

Problems with the current regex

The current final-byte character class is:

[\dA-PR-TZcf-nq-uy=><~]

1. Missing final bytes. The regex omits 26 valid final bytes from the 0x40–0x7E range, including commonly-used ones like X (0x58, ECH – Erase Character), @ (0x40, ICH – Insert Character), b (0x62, REP – Repeat), d (0x64, VPA – Vertical Position Absolute), and e (0x65, VPR – Vertical Position Relative).

2. Digits treated as final bytes. The class includes \d (0x30–0x39), but digits are parameter bytes per the ECMA-48 §5.4 definition quoted above (0x30–0x3F), not final bytes. This causes partial matching: ESC[31X is consumed as ESC[31 (treating 1 as the final byte), leaving a stray X in the output.
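
Both points can be checked mechanically. The sketch below tests every byte in 0x40–0x7E against the character class quoted above; `OLD_FINAL_BYTE` and `missing` are illustrative names, and the `^...$` anchors are added only so the class is tested against a single character.

```javascript
// The final-byte class quoted above, anchored to test exactly one character.
const OLD_FINAL_BYTE = /^[\dA-PR-TZcf-nq-uy=><~]$/;

// Collect every byte in the ECMA-48 final-byte range that the class rejects.
const missing = [];
for (let code = 0x40; code <= 0x7e; code++) {
  const ch = String.fromCharCode(code);
  if (!OLD_FINAL_BYTE.test(ch)) missing.push(ch);
}

console.log(missing.length);           // 26
console.log(missing.join(''));         // the omitted final bytes, in code order
console.log(OLD_FINAL_BYTE.test('1')); // true: a digit is accepted as a final byte
```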

Real-world impact/what motivated my PR - Windows ConPTY ECH

Windows ConPTY emits ECH sequences (ESC[nX) to erase characters on a line. Microsoft documents this under Console Virtual Terminal Sequences → Text Modification:

| Sequence | Code | Description | Behavior |
| --- | --- | --- | --- |
| ESC [ <n> X | ECH | Erase Character | Erase <n> characters from the current cursor position by overwriting them with a space character. |

After stripping with the current regex, the X leaks through:

Input:  "hello\x1b[5Xworld"
Expect: "helloworld"
Actual: "helloXworld"  ← \x1b[5 matched (digit '5' consumed as final byte), X left behind

Other sequences from the same page that are similarly affected include ICH (ESC[n@) and sequences with intermediate bytes, such as DECSCUSR (ESC[n SP q, documented under Cursor Shape).
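
A rough way to reproduce the leak for all three sequences is sketched below. `OLD_STYLE` is a simplified stand-in built around the final-byte class quoted earlier: its numeric-parameter prefix is an assumption, and it is not the library's actual source. `SPEC_RANGE` is likewise a sketch of the spec ranges, not necessarily the pattern in this PR.

```javascript
// Assumption: simplified stand-in for the pre-fix behaviour, not the library source.
const OLD_STYLE = /\x1b\[(?:\d{1,4}(?:[;:]\d{0,4})*)?[\dA-PR-TZcf-nq-uy=><~]/g;
// Sketch of the ECMA-48 ranges: parameter, intermediate, then final byte.
const SPEC_RANGE = /\x1b\[[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]/g;

const samples = {
  ECH: 'hello\x1b[5Xworld',
  ICH: 'hello\x1b[5@world',
  DECSCUSR: 'hello\x1b[2 qworld',
};

for (const [name, input] of Object.entries(samples)) {
  const before = input.replace(OLD_STYLE, '');
  const after = input.replace(SPEC_RANGE, '');
  console.log(name, JSON.stringify(before), JSON.stringify(after));
}
// ECH "helloXworld" "helloworld"
// ICH "hello@world" "helloworld"
// DECSCUSR "hello qworld" "helloworld"
```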


I'm far from an escape code expert, so there's a chance I'm misreading the spec. Happy to make changes to this PR as needed!

@sindresorhus
Member

A few things:

  1. The new pattern strips ESC[>0h and ESC[<0c as only ESC[, not the full sequence. That means valid private-parameter CSI gets partially consumed and leaves control text behind.
  2. The pattern now matches bare ESC[ as a complete escape sequence. So incomplete or truncated CSI gets treated as valid and removed.
  3. There is no test that asserts full matching for private-parameter CSI (ESC[>..., ESC[<...) or prevents partial ESC[ matches.
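
The missing checks could look roughly like the sketch below. `candidate` is a hypothetical spec-range pattern standing in for whatever pattern is under review, and `check` is an illustrative helper rather than part of the project's test suite.

```javascript
// Hypothetical candidate pattern; substitute the pattern under review.
const candidate = /\x1b\[[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]/g;

// Strip matches from the input and throw if the result differs from expected.
function check(label, input, expected) {
  const actual = input.replace(candidate, '');
  if (actual !== expected) {
    throw new Error(
      `${label}: expected ${JSON.stringify(expected)}, got ${JSON.stringify(actual)}`
    );
  }
}

// Private-parameter CSI must be consumed in full, not just the ESC[ prefix.
check('ESC[>0h', 'a\x1b[>0hb', 'ab');
check('ESC[<0c', 'a\x1b[<0cb', 'ab');

// Truncated CSI (no final byte) must be left untouched.
check('bare ESC[', 'abc\x1b[', 'abc\x1b[');
check('parameters only', 'abc\x1b[31', 'abc\x1b[31');

console.log('all checks passed');
```

In this sketch the private-parameter cases pass because `<` and `>` fall inside the parameter-byte range 0x30–0x3F, and the truncated cases pass because the final byte is not optional.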

sindresorhus force-pushed the bug-incomplete-csi-final-byte-class branch from 63145fe to d52f332 on February 18, 2026 13:44

3 participants