
first implementation of d-string #107

Closed
methane wants to merge 2 commits into main from d-string

Conversation

@methane
Owner

@methane methane commented Jan 2, 2026

Summary by CodeRabbit

  • New Features

    • Added support for dedented string literals ("d-strings"): triple-quoted multiline strings that must start with a newline, end with a properly indented line, and are automatically dedented. Raw variants supported.
  • Tests

    • New unittest suite validating d-string behavior and error cases (single-quote use, empty d-strings, missing start/end newline, simple decoding).
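The summarized behavior can be approximated with today's Python: the proposed d-string semantics roughly correspond to stripping the leading newline and dedenting by the closing line's indentation. A sketch of the intended behavior using textwrap.dedent, not the actual implementation:

```python
import textwrap

# Proposed:  d"""\n    hello\n    world\n    """  ->  "hello\nworld\n"
# Approximation with a regular triple-quoted string:
raw = """
    hello
    world
    """
# Drop the leading newline, then remove the common indentation.
dedented = textwrap.dedent(raw[1:])
print(repr(dedented))  # 'hello\nworld\n'
```

Note that textwrap.dedent silently tolerates lines with inconsistent indentation, whereas the d-string implementation raises a SyntaxError, so this is only a rough model.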


@coderabbitai

coderabbitai bot commented Jan 2, 2026

📝 Walkthrough

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 12.50%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title 'first implementation of d-string' clearly and specifically describes the main change: introducing a new d-string feature across the codebase.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
Lib/test/test_dstring.py (2)

5-12: Avoid shadowing the built-in str.

Using str as a variable name shadows Python's built-in str type. Consider renaming to expr or test_str.

🔎 Proposed fix
     def assertAllRaise(self, exception_type, regex, error_strings):
-        for str in error_strings:
-            with self.subTest(str=str):
+        for expr in error_strings:
+            with self.subTest(expr=expr):
                 with self.assertRaisesRegex(exception_type, regex) as cm:
-                    eval(str)
-                # print("Testing expression:", repr(str))
-                # print(repr(cm.exception))
-                # print(repr(cm.exception.text))
+                    eval(expr)
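For context, a minimal illustration of the pitfall (hypothetical code, not from the PR): once the loop rebinds `str`, any later use of the builtin in the same scope silently resolves to the last loop element instead of the type.

```python
def check(values):
    for str in values:   # shadows the builtin str for the rest of the function
        pass
    return str(42)       # 'str' is now the last element, not the type

try:
    check(["a", "b"])
except TypeError as e:
    print("TypeError:", e)  # 'str' object is not callable
```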

37-43: Consider expanding test coverage for d-strings.

The current tests cover basic cases well. Consider adding tests for:

  • Dedented f-strings: df"""\n hello {world}\n """
  • Dedented t-strings: dt"""\n hello {world}\n """
  • Inconsistent indentation error case
  • Mixed tabs and spaces handling
  • Complex escape sequences within dedented content

Would you like me to generate additional test cases for these scenarios?

Parser/lexer/lexer.c (1)

1113-1114: Minor: Unused variable declaration.

Line 1113 declares const char *p but it's immediately shadowed by the loop variable declaration on line 1114. The declaration on line 1113 can be removed.

🔎 Proposed fix
         enum string_kind_t string_kind = FSTRING;
-        const char *p = tok->start;
         for (const char *p = tok->start; *p != c; p++) {
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 61fc72a and aad9013.

📒 Files selected for processing (5)
  • Lib/test/test_dstring.py
  • Lib/tokenize.py
  • Parser/action_helpers.c
  • Parser/lexer/lexer.c
  • Parser/string_parser.c
🧰 Additional context used
🧬 Code graph analysis (2)
Lib/test/test_dstring.py (1)
Lib/unittest/case.py (1)
  • assertRaisesRegex (1400-1414)
Parser/action_helpers.c (2)
Parser/string_parser.c (1)
  • _PyPegen_decode_string (241-248)
Parser/pegen.h (1)
  • RAISE_ERROR_KNOWN_LOCATION (187-200)
🔇 Additional comments (10)
Lib/tokenize.py (1)

89-90: LGTM! String prefix combinations for d-strings are properly defined.

The new 'd' base prefix and its valid combinations ('df', 'dt', 'dr', 'dfr', 'dtr') are correctly added. The absence of 'db' (dedented bytes) is intentional as confirmed by the assertion in Parser/string_parser.c.
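For reference, tokenize expands each base prefix into every ordering and casing when building its full prefix set; a sketch of that expansion (the function name `expand` is illustrative, not the module's exact code):

```python
from itertools import permutations, product

def expand(prefix):
    # Every ordering and every upper/lower casing of a prefix string,
    # mirroring how tokenize derives the full set of string prefixes.
    forms = set()
    for perm in permutations(prefix):
        for cased in product(*[(c.lower(), c.upper()) for c in perm]):
            forms.add("".join(cased))
    return forms

print(sorted(expand("dr")))  # ['DR', 'Dr', 'RD', 'Rd', 'dR', 'dr', 'rD', 'rd']
```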

Parser/string_parser.c (2)

250-330: Well-structured dedent implementation with thorough validation.

The _PyPegen_dedent_string function correctly:

  1. Validates d-strings start with a newline
  2. Extracts the dedent prefix from the final line
  3. Validates and removes consistent indentation from each line
  4. Handles empty lines correctly (lines with only newlines)

The escape sequence handling at line 319 is a nice optimization - treating lines without backslashes as raw to avoid unnecessary decode overhead.
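The four steps can be modeled in Python (an illustrative sketch of the algorithm, not the C implementation; `dedent_dstring` and its error messages are hypothetical):

```python
def dedent_dstring(s):
    # 1. A d-string body must start with a newline.
    if not s.startswith("\n"):
        raise SyntaxError("d-string must start with a newline")
    body = s[1:]
    # 2. The dedent prefix is the indentation of the final (closing) line.
    last_nl = body.rfind("\n")
    prefix = body[last_nl + 1:]
    if prefix.strip(" \t"):
        raise SyntaxError("d-string closing line must be only indentation")
    # 3./4. Strip the prefix from each line; empty lines pass through.
    out = []
    for line in body[:last_nl].split("\n"):
        if line == "":
            out.append("")
        elif line.startswith(prefix):
            out.append(line[len(prefix):])
        else:
            raise SyntaxError("d-string line missing valid indentation")
    return "\n".join(out) + "\n"

print(repr(dedent_dstring("\n    hello\n    world\n    ")))  # 'hello\nworld\n'
```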


347-410: Clean integration of d-string handling into the string parser.

The changes properly:

  • Track dedentmode alongside other string modes
  • Enforce triple-quote requirement for d-strings (line 405-410)
  • Assert that dedent mode is never combined with bytes mode (line 415)
  • Route to the dedicated _PyPegen_dedent_string decoder when active
Parser/action_helpers.c (4)

1295-1376: Correct implementation of dedent handling for f/t-string parts.

The _PyPegen_dedent_string_part function properly handles:

  • First vs. non-first parts (skipping initial newline only for first part)
  • Empty lines (with and without trailing newlines)
  • Indentation validation against the dedent prefix
  • Line-by-line decoding with proper escape handling

The logic for non-first parts (lines 1310-1330) correctly preserves the first line without dedenting, as interpolated expressions may appear mid-line.


1440-1480: Thorough d-string validation for f/t-string contexts.

The dedent validation logic correctly:

  1. Ensures the string starts with a newline (line 1444)
  2. Ensures the last item is a Constant for prefix extraction (line 1453)
  3. Extracts the dedent prefix by scanning backwards from the end (lines 1466-1479)
  4. Validates only spaces and tabs in the indent line
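The backward scan in step 3 can be modeled as follows (illustrative Python, not the C code; `extract_dedent_prefix` is a hypothetical name):

```python
def extract_dedent_prefix(last_part):
    # Scan backwards from the end past spaces and tabs; everything after
    # the final newline is the indentation prefix.
    i = len(last_part)
    while i > 0 and last_part[i - 1] in " \t":
        i -= 1
    if i == 0 or last_part[i - 1] != "\n":
        raise SyntaxError("d-string must end with a newline plus indentation")
    return last_part[i:]

print(repr(extract_dedent_prefix("hello {x}\n    ")))  # '    '
```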

1378-1414: Clean integration of dedent awareness into f-string part decoding.

The updated _PyPegen_decode_fstring_part properly routes to the dedent path when a dedent_prefix is provided, while preserving the original behavior (including {{/}} handling) for non-dedented strings.


1513-1513: Correct propagation of is_first flag for d-string processing.

The i == 0 check correctly identifies the first string part, ensuring proper dedent handling where the initial newline is skipped only for the first part.

Parser/lexer/lexer.c (3)

456-504: Prefix incompatibility checks for 'd' are correctly implemented.

The new checks properly reject:

  • u + d (line 483-485): Legacy unicode prefix is incompatible with dedent
  • b + d (line 493-495): Bytes cannot be dedented

The absence of f+d and t+d incompatibility checks is correct, as dedented f-strings (df) and dedented t-strings (dt) are valid combinations.


1061-1137: Correct f-string/t-string mode initialization with d-prefix support.

The updated logic properly handles dedented f-strings (df) and t-strings (dt) by:

  1. Including 'd' in the f_string_quote entry condition (line 1061)
  2. Iterating through all prefix characters to set string_kind and raw flags
  3. The d/D case correctly does nothing special - dedent handling is deferred to the parser

Plain d"..." strings correctly bypass this path and are tokenized as regular STRING tokens via letter_quote.


750-781: Complete integration of 'd' prefix detection in the lexer.

The saw_d tracking is properly integrated into the prefix detection loop and passed to the validation function. The detection logic correctly handles both 'd' and 'D'.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
Parser/string_parser.c (1)

250-330: Review the assertion and consider edge cases in dedent logic.

The implementation looks solid overall, but there are a few points to verify:

  1. Line 316 assertion safety: The assertion assert(line_end != NULL) assumes memchr will always find a newline. While the comment states "we know there is at least one newline before prefix," this relies on the loop condition line_start < prefix. Consider whether all edge cases are covered, especially for malformed input.

  2. Tab/space mixing: The validation at lines 292-302 checks character-by-character equality with the dedent prefix but doesn't explicitly detect or warn about mixing tabs and spaces within indentation. This could lead to subtle bugs if the first line uses tabs and subsequent lines use spaces (or vice versa). Consider whether this is intentional or if a warning should be added.

  3. Line 319 optimization: The logic is_raw || memchr(line_start, '\\', line_end - line_start) == NULL is a nice optimization to avoid unnecessary escape processing when there are no backslashes.

🔎 Consider adding validation for consistent whitespace in indentation
         for (i = 0; i < prefix_len && line_start + i < prefix; i++) {
             if (line_start[i] != prefix[i]) {
                 if (line_start[i] == '\n') {
                     break; // empty line
                 }
+                // Provide more specific error for tab/space mixing
+                if ((line_start[i] == ' ' || line_start[i] == '\t') &&
+                    (prefix[i] == ' ' || prefix[i] == '\t')) {
+                    PyUnicodeWriter_Discard(w);
+                    RAISE_ERROR_KNOWN_LOCATION(p, PyExc_SyntaxError, lineno, i, lineno, i+1,
+                        "d-string indentation mixing tabs and spaces");
+                    return NULL;
+                }
                 PyUnicodeWriter_Discard(w);
                 RAISE_ERROR_KNOWN_LOCATION(p, PyExc_SyntaxError, lineno, i, lineno, i+1,
                     "d-string line missing valid indentation");
                 return NULL;
             }
         }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aad9013 and e1dc5c0.

📒 Files selected for processing (5)
  • Lib/test/test_dstring.py
  • Lib/tokenize.py
  • Parser/action_helpers.c
  • Parser/lexer/lexer.c
  • Parser/string_parser.c
🚧 Files skipped from review as they are similar to previous changes (1)
  • Lib/test/test_dstring.py
🧰 Additional context used
🧬 Code graph analysis (1)
Parser/string_parser.c (3)
Parser/pegen.h (1)
  • RAISE_ERROR_KNOWN_LOCATION (187-200)
Objects/unicode_writer.c (5)
  • PyUnicodeWriter_Create (157-185)
  • PyUnicodeWriter_Discard (188-195)
  • PyUnicodeWriter_WriteChar (316-326)
  • PyUnicodeWriter_WriteStr (360-380)
  • PyUnicodeWriter_Finish (625-632)
Include/refcount.h (1)
  • Py_XDECREF (510-515)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: Windows / Build and test (arm64)
  • GitHub Check: Windows / Build and test (Win32)
  • GitHub Check: Windows (free-threading) / Build and test (arm64)
  • GitHub Check: macOS (free-threading) / build and test (macos-14)
  • GitHub Check: Windows (free-threading) / Build and test (x64)
  • GitHub Check: Windows / Build and test (x64)
  • GitHub Check: macOS / build and test (macos-14)
  • GitHub Check: Ubuntu (free-threading) / build and test (ubuntu-24.04)
  • GitHub Check: Address sanitizer (ubuntu-24.04)
  • GitHub Check: Android (x86_64)
  • GitHub Check: Hypothesis tests on Ubuntu
  • GitHub Check: Android (aarch64)
  • GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.5.4)
  • GitHub Check: iOS
  • GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.4.3)
  • GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.0.18)
  • GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.2.6)
  • GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.3.5)
  • GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 1.1.1w)
  • GitHub Check: Check if generated files are up to date
🔇 Additional comments (7)
Lib/tokenize.py (1)

89-90: LGTM! The prefix combinations align with the parser implementation.

The addition of 'd' as a base prefix and its combinations ('df', 'dt', 'dr', 'dfr', 'dtr') correctly excludes incompatible combinations like 'db' (dedent with bytes) and 'du' (dedent with unicode), which are appropriately rejected in the parser and lexer.

Parser/string_parser.c (1)

347-347: LGTM! The dedent mode integration is clean and well-structured.

The changes correctly:

  • Introduce a dedentmode flag to track 'd' prefix detection
  • Enforce triple-quote requirement for d-strings (lines 405-410)
  • Assert mutual exclusivity with bytesmode (line 415), consistent with the tokenizer not including 'db' combinations
  • Route to the dedent handler when appropriate (lines 432-434)

The control flow is logical and the error messages are clear.

Also applies to: 362-365, 405-410, 415-415, 432-434

Parser/lexer/lexer.c (3)

456-504: LGTM! The prefix validation correctly handles the new 'd' prefix.

The validation appropriately:

  • Rejects 'u' + 'd' combinations (line 483-485)
  • Rejects 'b' + 'd' combinations (line 493-495)
  • Allows 'd' with 'f', 't', and 'r', which aligns with the valid prefixes defined in Lib/tokenize.py

750-792: LGTM! The 'd' prefix detection integrates cleanly with existing prefix handling.

The implementation:

  • Follows the same pattern as other prefix detection (b, r, u, f, t)
  • Correctly routes 'd' prefixed f-strings and t-strings to the f_string_quote label (line 787-789)
  • Maintains consistency with the existing codebase structure

1060-1137: LGTM! The refactored prefix handling elegantly supports variable-length prefixes.

The changes:

  • Line 1061: Correctly adds 'd' to the initial prefix check
  • Lines 1113-1131: The loop-based approach is a good improvement over fixed conditionals, making it easier to handle multi-character prefixes like 'dfr' and 'dtr'
  • The logic correctly determines string_kind (TSTRING vs FSTRING) and the raw flag based on the prefix characters

The Py_UNREACHABLE() at line 1129 is appropriate since invalid prefix combinations are already rejected by maybe_raise_syntax_error_for_string_prefixes().

Parser/action_helpers.c (2)

1378-1414: LGTM! The signature update and routing logic are well-implemented.

The changes correctly:

  • Add is_first and dedent_prefix parameters to support dedented f-strings
  • Route through _PyPegen_dedent_string_part when a dedent prefix is provided (lines 1389-1391)
  • Preserve the existing logic for non-dedented strings, including the special handling for {{ and }} (lines 1395-1400)
  • The updated call site at line 1513 correctly passes i == 0 for is_first

1417-1543: Edge case handling in dedent prefix extraction is sound given tokenizer constraints.

The dedent detection and prefix extraction logic (lines 1440-1480) is well-implemented. The three edge cases mentioned in the review are handled appropriately:

  1. Empty first constant: PyUnicode_ReadChar returns (Py_UCS4)-1 for out-of-bounds access (index 0 on empty string), which is not equal to '\n', so the error is raised correctly.

  2. Last item with only a newline or empty string: While the while loop at lines 1467-1479 would not execute for zero-length strings (loop condition bstr < dedent_prefix fails), this scenario is prevented by tokenizer constraints. The parallel implementation in Parser/string_parser.c uses an assertion (assert(prefix > s)) confirming the tokenizer guarantees at least a newline plus indentation characters in the last item.

  3. Line 1444 check: The PyUnicode_ReadChar call safely handles empty strings by returning an out-of-range error value that fails the comparison.

The core logic at line 1513 correctly passes i == 0 and dedent_prefix to the decode function.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6ed5d26 and 5ce8afc.

📒 Files selected for processing (6)
  • Lib/test/test_dstring.py
  • Lib/test/test_tokenize.py
  • Lib/tokenize.py
  • Parser/action_helpers.c
  • Parser/lexer/lexer.c
  • Parser/string_parser.c
🚧 Files skipped from review as they are similar to previous changes (4)
  • Lib/tokenize.py
  • Parser/string_parser.c
  • Lib/test/test_dstring.py
  • Parser/lexer/lexer.c
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: Ubuntu (bolt) / build and test (ubuntu-24.04)
  • GitHub Check: Ubuntu / build and test (ubuntu-24.04-arm)
  • GitHub Check: Sanitizers / TSan
  • GitHub Check: Ubuntu / build and test (ubuntu-24.04)
  • GitHub Check: Ubuntu (free-threading) / build and test (ubuntu-24.04-arm)
  • GitHub Check: macOS / build and test (macos-15-intel)
  • GitHub Check: Windows / Build and test (arm64)
  • GitHub Check: WASI / build and test
  • GitHub Check: Address sanitizer (ubuntu-24.04)
  • GitHub Check: Ubuntu SSL tests with AWS-LC (ubuntu-24.04, 1.55.0)
  • GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 1.1.1w)
  • GitHub Check: Hypothesis tests on Ubuntu
  • GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.2.6)
  • GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.5.4)
  • GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.3.5)
  • GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.0.18)
  • GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.4.3)
  • GitHub Check: Android (x86_64)
  • GitHub Check: Android (aarch64)
  • GitHub Check: Check if generated files are up to date
🔇 Additional comments (5)
Lib/test/test_tokenize.py (1)

3421-3427: LGTM: Proper accommodation for d-string prefix validation.

The change from eval(f'{letter}""') to eval(f'{letter}"""\n"""') correctly handles d-string prefix testing. Since d-strings require a newline at the start (as enforced by the validation in Parser/action_helpers.c), the empty string format wouldn't work for detecting valid d-prefixes. Using the multiline format with a newline ensures d-strings are properly recognized during prefix validation.

Also applies to: 3445-3451

Parser/action_helpers.c (4)

1295-1311: LGTM: Well-structured helper for line writing.

The unicodewriter_write_line function correctly handles both raw strings and strings requiring escape processing. The logic appropriately:

  • Directly writes UTF-8 when raw or no backslashes present (optimization)
  • Decodes strings with backslashes via _PyPegen_decode_string before writing
  • Properly handles cleanup on error (line 1305)

1394-1424: LGTM: Signature updated correctly to support dedenting.

The function signature change appropriately adds parameters needed for d-string dedenting (is_first, indent_char, dedent_count). The logic correctly:

  • Routes to _PyPegen_dedent_string_part when dedent_count > 0 (line 1405-1407)
  • Falls back to regular _PyPegen_decode_string otherwise (line 1410)
  • Optimizes by setting is_raw when no backslashes are present (line 1402)

1452-1509: LGTM: Comprehensive d-string validation logic.

The d-string validation correctly enforces the required format:

  • Start validation (lines 1453-1461): Ensures the first element is a constant starting with \n
  • End validation (lines 1463-1492): Verifies the last element is an indented line by scanning backward to find the last newline and validating that only whitespace follows
  • Indent consistency (lines 1499-1507): Ensures uniform indentation character throughout, raising TabError for mixed tabs/spaces (appropriate error type)

The backward scan implementation (lines 1479-1492) is correct for portability (avoids GNU-specific memrchr).
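The indent-consistency rule can be modeled in Python (a sketch under the assumption that the dedent prefix must consist of a single repeated whitespace character; `check_indent_consistency` is a hypothetical name):

```python
def check_indent_consistency(prefix):
    # Every character in the dedent prefix must match the first one;
    # a mix of tabs and spaces is reported as a TabError.
    if not prefix:
        return None
    indent_char = prefix[0]
    for ch in prefix:
        if ch != indent_char:
            raise TabError(
                "inconsistent use of tabs and spaces in d-string indentation")
    return indent_char

print(repr(check_indent_consistency("    ")))  # ' '
```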


1542-1542: LGTM: Call site updated consistently.

The call to _PyPegen_decode_fstring_part correctly passes the new parameters:

  • i == 0 for is_first (indicates whether this is the first part)
  • indent_char and indent_count from the d-string validation

This matches the updated function signature at line 1395.

Comment on lines +1349 to +1374
while (line_start < s + len) {
    lineno++;

    Py_ssize_t i = 0;
    while (line_start + i < s_end && i < dedent_count && line_start[i] == indent_char) {
        i++;
    }

    if (line_start[i] == '\0') {  // found an empty line without newline.
        break;
    }
    if (line_start[i] == '\n') {  // found an empty line with newline.
        if (PyUnicodeWriter_WriteChar(w, '\n') < 0) {
            PyUnicodeWriter_Discard(w);
            return NULL;
        }
        line_start += i+1;
        continue;
    }
    if (i < dedent_count) {  // found an invalid indent.
        assert(line_start[i] != indent_char);
        PyUnicodeWriter_Discard(w);
        RAISE_ERROR_KNOWN_LOCATION(p, PyExc_SyntaxError, lineno, i, lineno, i+1,
            "d-string line missing valid indentation");
        return NULL;
    }

⚠️ Potential issue | 🟠 Major

Potential buffer access issue after dedent loop.

After the while loop at lines 1353-1355 exits, the code accesses line_start[i] at lines 1357, 1360, and 1369 without first verifying that line_start + i < s_end. The loop could exit because line_start + i >= s_end, making these accesses out of bounds.

🔎 Proposed fix to add bounds checking
     while (line_start < s + len) {
         lineno++;

         Py_ssize_t i = 0;
         while (line_start + i < s_end && i < dedent_count && line_start[i] == indent_char) {
             i++;
         }

-        if (line_start[i] == '\0') {  // found an empty line without newline.
+        if (line_start + i >= s_end || line_start[i] == '\0') {  // found an empty line without newline.
             break;
         }
-        if (line_start[i] == '\n') {  // found an empty line with newline.
+        if (line_start + i < s_end && line_start[i] == '\n') {  // found an empty line with newline.
             if (PyUnicodeWriter_WriteChar(w, '\n') < 0) {
                 PyUnicodeWriter_Discard(w);
                 return NULL;
             }
             line_start += i+1;
             continue;
         }
-        if (i < dedent_count) {  // found an invalid indent.
-            assert(line_start[i] != indent_char);
+        if (line_start + i < s_end && i < dedent_count) {  // found an invalid indent.
+            assert(line_start + i < s_end && line_start[i] != indent_char);
             PyUnicodeWriter_Discard(w);
             RAISE_ERROR_KNOWN_LOCATION(p, PyExc_SyntaxError, lineno, i, lineno, i+1,
                 "d-string line missing valid indentation");
             return NULL;
         }

@methane methane closed this Jan 5, 2026