📝 Walkthrough

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 0
🧹 Nitpick comments (3)
Lib/test/test_dstring.py (2)
5-12: Avoid shadowing the built-in `str`.

Using `str` as a variable name shadows Python's built-in `str` type. Consider renaming to `expr` or `test_str`.

🔎 Proposed fix

```diff
 def assertAllRaise(self, exception_type, regex, error_strings):
-    for str in error_strings:
-        with self.subTest(str=str):
+    for expr in error_strings:
+        with self.subTest(expr=expr):
             with self.assertRaisesRegex(exception_type, regex) as cm:
-                eval(str)
-                # print("Testing expression:", repr(str))
-                # print(repr(cm.exception))
-                # print(repr(cm.exception.text))
+                eval(expr)
```
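For illustration, the renamed helper behaves like this plain-function sketch (`assert_all_raise` and the sample expressions below are hypothetical stand-ins, not part of the patch):

```python
import re

def assert_all_raise(exception_type, pattern, error_strings):
    # Each expression must raise exception_type with a message
    # matching pattern; any other outcome is a test failure.
    for expr in error_strings:
        try:
            eval(expr)
        except exception_type as exc:
            assert re.search(pattern, str(exc)), (expr, exc)
        else:
            raise AssertionError(f"{expr!r} did not raise {exception_type.__name__}")
```

For example, `assert_all_raise(ZeroDivisionError, "division by zero", ["1/0", "2/0"])` passes silently, while a non-raising expression fails the test.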
37-43: Consider expanding test coverage for d-strings.

The current tests cover basic cases well. Consider adding tests for:

- Dedented f-strings: `df"""\n hello {world}\n """`
- Dedented t-strings: `dt"""\n hello {world}\n """`
- Inconsistent indentation error case
- Mixed tabs and spaces handling
- Complex escape sequences within dedented content
Would you like me to generate additional test cases for these scenarios?
Parser/lexer/lexer.c (1)
1113-1114: Minor: Unused variable declaration.

Line 1113 declares `const char *p` but it's immediately shadowed by the loop variable declaration on line 1114. The declaration on line 1113 can be removed.

🔎 Proposed fix

```diff
 enum string_kind_t string_kind = FSTRING;
-const char *p = tok->start;
 for (const char *p = tok->start; *p != c; p++) {
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- Lib/test/test_dstring.py
- Lib/tokenize.py
- Parser/action_helpers.c
- Parser/lexer/lexer.c
- Parser/string_parser.c
🧰 Additional context used
🧬 Code graph analysis (2)
Lib/test/test_dstring.py (1)
Lib/unittest/case.py (1)
- `assertRaisesRegex` (1400-1414)
Parser/action_helpers.c (2)
Parser/string_parser.c (1)
- `_PyPegen_decode_string` (241-248)

Parser/pegen.h (1)

- `RAISE_ERROR_KNOWN_LOCATION` (187-200)
🔇 Additional comments (10)
Lib/tokenize.py (1)
89-90: LGTM! String prefix combinations for d-strings are properly defined.

The new `'d'` base prefix and its valid combinations (`'df'`, `'dt'`, `'dr'`, `'dfr'`, `'dtr'`) are correctly added. The absence of `'db'` (dedented bytes) is intentional, as confirmed by the assertion in Parser/string_parser.c.

Parser/string_parser.c (2)
250-330: Well-structured dedent implementation with thorough validation.

The `_PyPegen_dedent_string` function correctly:
- Validates d-strings start with a newline
- Extracts the dedent prefix from the final line
- Validates and removes consistent indentation from each line
- Handles empty lines correctly (lines with only newlines)
The escape sequence handling at line 319 is a nice optimization - treating lines without backslashes as raw to avoid unnecessary decode overhead.
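The dedent rules listed above can be modeled in a few lines of Python. This is a sketch of the described behavior, not the C implementation, and `dedent_dstring` is a hypothetical name:

```python
def dedent_dstring(body):
    # body is the text between the triple quotes of a d-string.
    if not body.startswith("\n"):
        raise SyntaxError("d-string must start with a newline")
    # The dedent prefix is the indentation of the final line.
    last_nl = body.rindex("\n")
    prefix = body[last_nl + 1:]
    if any(ch not in " \t" for ch in prefix):
        raise SyntaxError("d-string must end with an indentation-only line")
    out = []
    for line in body[1:last_nl].split("\n"):
        if line == "":                       # empty lines need no indentation
            out.append("")
        elif line.startswith(prefix):
            out.append(line[len(prefix):])   # strip the common indentation
        else:
            raise SyntaxError("d-string line missing valid indentation")
    return "\n".join(out)
```

For example, `dedent_dstring("\n    hello\n    world\n    ")` yields `"hello\nworld"`.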
347-410: Clean integration of d-string handling into the string parser.

The changes properly:
- Track `dedentmode` alongside other string modes
- Enforce the triple-quote requirement for d-strings (lines 405-410)
- Assert that dedent mode is never combined with bytes mode (line 415)
- Route to the dedicated `_PyPegen_dedent_string` decoder when active

Parser/action_helpers.c (4)
1295-1376: Correct implementation of dedent handling for f/t-string parts.

The `_PyPegen_dedent_string_part` function properly handles:
- First vs. non-first parts (skipping initial newline only for first part)
- Empty lines (with and without trailing newlines)
- Indentation validation against the dedent prefix
- Line-by-line decoding with proper escape handling
The logic for non-first parts (lines 1310-1330) correctly preserves the first line without dedenting, as interpolated expressions may appear mid-line.
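A Python sketch of that first/non-first split (illustrative only; `dedent_part` approximates the rules described here, with simplified error handling):

```python
def dedent_part(part, prefix, is_first):
    lines = part.split("\n")
    out = []
    if is_first:
        # The d-string body starts with a newline, so the first
        # split element is empty and gets dropped.
        lines = lines[1:]
    else:
        # A non-first part continues a line that an interpolated
        # expression ended mid-way, so it is kept verbatim.
        out.append(lines[0])
        lines = lines[1:]
    for line in lines:
        if line == "":
            out.append("")                   # empty line, no indentation needed
        elif line.startswith(prefix):
            out.append(line[len(prefix):])   # strip the dedent prefix
        else:
            raise SyntaxError("d-string line missing valid indentation")
    return "\n".join(out)
```

For a first part such as `"\n    before {"` the leading newline is dropped and the prefix stripped; a non-first part such as `"} after\n    tail\n    "` keeps its opening fragment untouched.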
1440-1480: Thorough d-string validation for f/t-string contexts.

The dedent validation logic correctly:
- Ensures the string starts with a newline (line 1444)
- Ensures the last item is a Constant for prefix extraction (line 1453)
- Extracts the dedent prefix by scanning backwards from the end (lines 1466-1479)
- Validates only spaces and tabs in the indent line
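That backward scan is straightforward to express in Python (a sketch of the logic described; the function name is illustrative):

```python
def extract_dedent_prefix(last_constant):
    # Walk backwards from the end until the most recent newline,
    # mirroring the backward scan described above.
    i = len(last_constant)
    while i > 0 and last_constant[i - 1] != "\n":
        i -= 1
    prefix = last_constant[i:]
    # The indent line may contain only spaces and tabs.
    if any(ch not in " \t" for ch in prefix):
        raise SyntaxError("only spaces and tabs allowed in the indent line")
    return prefix
```

For example, `extract_dedent_prefix("hello\n    ")` returns the four-space prefix.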
1378-1414: Clean integration of dedent awareness into f-string part decoding.

The updated `_PyPegen_decode_fstring_part` properly routes to the dedent path when a `dedent_prefix` is provided, while preserving the original behavior (including `{{` / `}}` handling) for non-dedented strings.
1513-1513: Correct propagation of the `is_first` flag for d-string processing.

The `i == 0` check correctly identifies the first string part, ensuring proper dedent handling where the initial newline is skipped only for the first part.

Parser/lexer/lexer.c (3)
456-504: Prefix incompatibility checks for 'd' are correctly implemented.

The new checks properly reject:

- `u` + `d` (lines 483-485): the legacy unicode prefix is incompatible with dedent
- `b` + `d` (lines 493-495): bytes cannot be dedented

The absence of `f` + `d` and `t` + `d` incompatibility checks is correct, as dedented f-strings (`df`) and dedented t-strings (`dt`) are valid combinations.
1061-1137: Correct f-string/t-string mode initialization with d-prefix support.

The updated logic properly handles dedented f-strings (`df`) and t-strings (`dt`) by:

- Including `'d'` in the f_string_quote entry condition (line 1061)
- Iterating through all prefix characters to set `string_kind` and `raw` flags
- The `d`/`D` case correctly does nothing special - dedent handling is deferred to the parser

Plain `d"..."` strings correctly bypass this path and are tokenized as regular STRING tokens via `letter_quote`.
750-781: Complete integration of 'd' prefix detection in the lexer.

The `saw_d` tracking is properly integrated into the prefix detection loop and passed to the validation function. The detection logic correctly handles both 'd' and 'D'.
Actionable comments posted: 1
🧹 Nitpick comments (1)
Parser/string_parser.c (1)
250-330: Review the assertion and consider edge cases in dedent logic.

The implementation looks solid overall, but there are a few points to verify:

1. **Line 316 assertion safety**: The assertion `assert(line_end != NULL)` assumes `memchr` will always find a newline. While the comment states "we know there is at least one newline before prefix," this relies on the loop condition `line_start < prefix`. Consider whether all edge cases are covered, especially for malformed input.
2. **Tab/space mixing**: The validation at lines 292-302 checks character-by-character equality with the dedent prefix but doesn't explicitly detect or warn about mixing tabs and spaces within indentation. This could lead to subtle bugs if the first line uses tabs and subsequent lines use spaces (or vice versa). Consider whether this is intentional or if a warning should be added.
3. **Line 319 optimization**: The logic `is_raw || memchr(line_start, '\\', line_end - line_start) == NULL` is a nice optimization to avoid unnecessary escape processing when there are no backslashes.

🔎 Consider adding validation for consistent whitespace in indentation
```diff
 for (i = 0; i < prefix_len && line_start + i < prefix; i++) {
     if (line_start[i] != prefix[i]) {
         if (line_start[i] == '\n') {
             break;  // empty line
         }
+        // Provide more specific error for tab/space mixing
+        if ((line_start[i] == ' ' || line_start[i] == '\t') &&
+            (prefix[i] == ' ' || prefix[i] == '\t')) {
+            PyUnicodeWriter_Discard(w);
+            RAISE_ERROR_KNOWN_LOCATION(p, PyExc_SyntaxError, lineno, i, lineno, i+1,
+                "d-string indentation mixing tabs and spaces");
+            return NULL;
+        }
         PyUnicodeWriter_Discard(w);
         RAISE_ERROR_KNOWN_LOCATION(p, PyExc_SyntaxError, lineno, i, lineno, i+1,
             "d-string line missing valid indentation");
         return NULL;
     }
 }
```
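A Python model of the character-by-character comparison shows how an extra branch would separate the two failure modes (a sketch; names and messages are illustrative):

```python
def classify_indent_error(line, prefix):
    # Compare the line's indentation to the dedent prefix one
    # character at a time.
    for i, expected in enumerate(prefix):
        if i >= len(line) or line[i] != expected:
            if i < len(line) and line[i] == "\n":
                return "empty line"
            got = line[i] if i < len(line) else ""
            # A space where the prefix has a tab (or vice versa)
            # can be reported more specifically than a non-whitespace
            # character in the indent columns.
            if got in (" ", "\t") and expected in (" ", "\t"):
                return "mixing tabs and spaces"
            return "missing valid indentation"
    return "ok"
```

A tab-indented line checked against a four-space prefix would then be reported as mixing, while a line starting with a letter stays a plain indentation error.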
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- Lib/test/test_dstring.py
- Lib/tokenize.py
- Parser/action_helpers.c
- Parser/lexer/lexer.c
- Parser/string_parser.c
🚧 Files skipped from review as they are similar to previous changes (1)
- Lib/test/test_dstring.py
🧰 Additional context used
🧬 Code graph analysis (1)
Parser/string_parser.c (3)
Parser/pegen.h (1)
- `RAISE_ERROR_KNOWN_LOCATION` (187-200)

Objects/unicode_writer.c (5)

- `PyUnicodeWriter_Create` (157-185)
- `PyUnicodeWriter_Discard` (188-195)
- `PyUnicodeWriter_WriteChar` (316-326)
- `PyUnicodeWriter_WriteStr` (360-380)
- `PyUnicodeWriter_Finish` (625-632)

Include/refcount.h (1)

- `Py_XDECREF` (510-515)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: Windows / Build and test (arm64)
- GitHub Check: Windows / Build and test (Win32)
- GitHub Check: Windows (free-threading) / Build and test (arm64)
- GitHub Check: macOS (free-threading) / build and test (macos-14)
- GitHub Check: Windows (free-threading) / Build and test (x64)
- GitHub Check: Windows / Build and test (x64)
- GitHub Check: macOS / build and test (macos-14)
- GitHub Check: Ubuntu (free-threading) / build and test (ubuntu-24.04)
- GitHub Check: Address sanitizer (ubuntu-24.04)
- GitHub Check: Android (x86_64)
- GitHub Check: Hypothesis tests on Ubuntu
- GitHub Check: Android (aarch64)
- GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.5.4)
- GitHub Check: iOS
- GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.4.3)
- GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.0.18)
- GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.2.6)
- GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.3.5)
- GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 1.1.1w)
- GitHub Check: Check if generated files are up to date
🔇 Additional comments (7)
Lib/tokenize.py (1)
89-90: LGTM! The prefix combinations align with the parser implementation.

The addition of 'd' as a base prefix and its combinations ('df', 'dt', 'dr', 'dfr', 'dtr') correctly excludes incompatible combinations like 'db' (dedent with bytes) and 'du' (dedent with unicode), which are appropriately rejected in the parser and lexer.
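This claim is easy to check mechanically. Lib/tokenize.py expands each base prefix into all character orders and case variants; the sketch below mirrors that scheme (`expand_prefixes` is a hypothetical helper, and the base list is taken from this review, not from the module itself):

```python
import itertools

def expand_prefixes(bases):
    # Every valid prefix is valid in any character order and any
    # mix of upper/lower case.
    result = set()
    for base in bases:
        for perm in itertools.permutations(base):
            for cased in itertools.product(*[(c, c.upper()) for c in perm]):
                result.add("".join(cased))
    return result

D_PREFIXES = expand_prefixes(["d", "df", "dt", "dr", "dfr", "dtr"])
```

Under this expansion, variants like `DF`, `fd`, and `rfd` are accepted, while `db` and `du` never appear.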
Parser/string_parser.c (1)
347-347: LGTM! The dedent mode integration is clean and well-structured.

The changes correctly:

- Introduce a `dedentmode` flag to track 'd' prefix detection
- Enforce the triple-quote requirement for d-strings (lines 405-410)
- Assert mutual exclusivity with `bytesmode` (line 415), consistent with the tokenizer not including 'db' combinations
- Route to the dedent handler when appropriate (lines 432-434)
The control flow is logical and the error messages are clear.
Also applies to: 362-365, 405-410, 415-415, 432-434
Parser/lexer/lexer.c (3)
456-504: LGTM! The prefix validation correctly handles the new 'd' prefix.

The validation appropriately:
- Rejects 'u' + 'd' combinations (line 483-485)
- Rejects 'b' + 'd' combinations (line 493-495)
- Allows 'd' with 'f', 't', and 'r', which aligns with the valid prefixes defined in Lib/tokenize.py
750-792: LGTM! The 'd' prefix detection integrates cleanly with existing prefix handling.

The implementation:
- Follows the same pattern as other prefix detection (b, r, u, f, t)
- Correctly routes 'd'-prefixed f-strings and t-strings to the `f_string_quote` label (lines 787-789)
- Maintains consistency with the existing codebase structure
1060-1137: LGTM! The refactored prefix handling elegantly supports variable-length prefixes.

The changes:
- Line 1061: Correctly adds 'd' to the initial prefix check
- Lines 1113-1131: The loop-based approach is a good improvement over fixed conditionals, making it easier to handle multi-character prefixes like 'dfr' and 'dtr'
- The logic correctly determines `string_kind` (TSTRING vs FSTRING) and the `raw` flag based on the prefix characters

The `Py_UNREACHABLE()` at line 1129 is appropriate since invalid prefix combinations are already rejected by `maybe_raise_syntax_error_for_string_prefixes()`.
1378-1414: LGTM! The signature update and routing logic are well-implemented.

The changes correctly:

- Add `is_first` and `dedent_prefix` parameters to support dedented f-strings
- Route through `_PyPegen_dedent_string_part` when a dedent prefix is provided (lines 1389-1391)
- Preserve the existing logic for non-dedented strings, including the special handling for `{{` and `}}` (lines 1395-1400)
- Pass `i == 0` for `is_first` at the updated call site on line 1513
1417-1543: Edge case handling in dedent prefix extraction is sound given tokenizer constraints.

The dedent detection and prefix extraction logic (lines 1440-1480) is well-implemented. The three edge cases mentioned in the review are handled appropriately:

- **Empty first constant**: `PyUnicode_ReadChar` returns `(Py_UCS4)-1` for out-of-bounds access (index 0 on an empty string), which is not equal to `'\n'`, so the error is raised correctly.
- **Last item with only a newline or empty string**: While the while loop at lines 1467-1479 would not execute for zero-length strings (the loop condition `bstr < dedent_prefix` fails), this scenario is prevented by tokenizer constraints. The parallel implementation in Parser/string_parser.c uses an assertion (`assert(prefix > s)`) confirming the tokenizer guarantees at least a newline plus indentation characters in the last item.
- **Line 1444 check**: The `PyUnicode_ReadChar` call safely handles empty strings by returning an out-of-range error value that fails the comparison.

The core logic at line 1513 correctly passes `i == 0` and `dedent_prefix` to the decode function.
Actionable comments posted: 1
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- Lib/test/test_dstring.py
- Lib/test/test_tokenize.py
- Lib/tokenize.py
- Parser/action_helpers.c
- Parser/lexer/lexer.c
- Parser/string_parser.c
🚧 Files skipped from review as they are similar to previous changes (4)
- Lib/tokenize.py
- Parser/string_parser.c
- Lib/test/test_dstring.py
- Parser/lexer/lexer.c
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: Ubuntu (bolt) / build and test (ubuntu-24.04)
- GitHub Check: Ubuntu / build and test (ubuntu-24.04-arm)
- GitHub Check: Sanitizers / TSan
- GitHub Check: Ubuntu / build and test (ubuntu-24.04)
- GitHub Check: Ubuntu (free-threading) / build and test (ubuntu-24.04-arm)
- GitHub Check: macOS / build and test (macos-15-intel)
- GitHub Check: Windows / Build and test (arm64)
- GitHub Check: WASI / build and test
- GitHub Check: Address sanitizer (ubuntu-24.04)
- GitHub Check: Ubuntu SSL tests with AWS-LC (ubuntu-24.04, 1.55.0)
- GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 1.1.1w)
- GitHub Check: Hypothesis tests on Ubuntu
- GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.2.6)
- GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.5.4)
- GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.3.5)
- GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.0.18)
- GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.4.3)
- GitHub Check: Android (x86_64)
- GitHub Check: Android (aarch64)
- GitHub Check: Check if generated files are up to date
🔇 Additional comments (5)
Lib/test/test_tokenize.py (1)
3421-3427: LGTM: Proper accommodation for d-string prefix validation.

The change from `eval(f'{letter}""')` to `eval(f'{letter}"""\n"""')` correctly handles d-string prefix testing. Since d-strings require a newline at the start (as enforced by the validation in Parser/action_helpers.c), the empty-string form wouldn't work for detecting valid d-prefixes. Using the multiline format with a leading newline ensures d-strings are properly recognized during prefix validation.

Also applies to: 3445-3451
Parser/action_helpers.c (4)
1295-1311: LGTM: Well-structured helper for line writing.

The `unicodewriter_write_line` function correctly handles both raw strings and strings requiring escape processing. The logic appropriately:

- Directly writes UTF-8 when the string is raw or contains no backslashes (an optimization)
- Decodes strings with backslashes via `_PyPegen_decode_string` before writing
- Properly handles cleanup on error (line 1305)
1394-1424: LGTM: Signature updated correctly to support dedenting.

The function signature change appropriately adds the parameters needed for d-string dedenting (`is_first`, `indent_char`, `dedent_count`). The logic correctly:

- Routes to `_PyPegen_dedent_string_part` when `dedent_count > 0` (lines 1405-1407)
- Falls back to the regular `_PyPegen_decode_string` otherwise (line 1410)
- Optimizes by setting `is_raw` when no backslashes are present (line 1402)
1452-1509: LGTM: Comprehensive d-string validation logic.

The d-string validation correctly enforces the required format:

- Start validation (lines 1453-1461): Ensures the first element is a constant starting with `\n`
- End validation (lines 1463-1492): Verifies the last element is an indented line by scanning backward to find the last newline and validating that only whitespace follows
- Indent consistency (lines 1499-1507): Ensures a uniform indentation character throughout, raising `TabError` for mixed tabs/spaces (an appropriate error type)

The backward-scan implementation (lines 1479-1492) is correct for portability (it avoids the GNU-specific `memrchr`).
1542-1542: LGTM: Call site updated consistently.

The call to `_PyPegen_decode_fstring_part` correctly passes the new parameters:

- `i == 0` for `is_first` (indicates whether this is the first part)
- `indent_char` and `indent_count` from the d-string validation

This matches the updated function signature at line 1395.
```c
while (line_start < s + len) {
    lineno++;

    Py_ssize_t i = 0;
    while (line_start + i < s_end && i < dedent_count && line_start[i] == indent_char) {
        i++;
    }

    if (line_start[i] == '\0') { // found an empty line without newline.
        break;
    }
    if (line_start[i] == '\n') { // found an empty line with newline.
        if (PyUnicodeWriter_WriteChar(w, '\n') < 0) {
            PyUnicodeWriter_Discard(w);
            return NULL;
        }
        line_start += i+1;
        continue;
    }
    if (i < dedent_count) { // found an invalid indent.
        assert(line_start[i] != indent_char);
        PyUnicodeWriter_Discard(w);
        RAISE_ERROR_KNOWN_LOCATION(p, PyExc_SyntaxError, lineno, i, lineno, i+1,
            "d-string line missing valid indentation");
        return NULL;
    }
```
Potential buffer access issue after dedent loop.
After the inner while loop at lines 1353-1355 exits, the code accesses `line_start[i]` at lines 1357, 1360, and 1369 without first verifying that `line_start + i < s_end`. The loop could exit because `line_start + i >= s_end`, making these accesses out of bounds.
🔎 Proposed fix to add bounds checking
```diff
 while (line_start < s + len) {
     lineno++;
     Py_ssize_t i = 0;
     while (line_start + i < s_end && i < dedent_count && line_start[i] == indent_char) {
         i++;
     }
-    if (line_start[i] == '\0') { // found an empty line without newline.
+    if (line_start + i >= s_end || line_start[i] == '\0') { // found an empty line without newline.
         break;
     }
-    if (line_start[i] == '\n') { // found an empty line with newline.
+    if (line_start + i < s_end && line_start[i] == '\n') { // found an empty line with newline.
         if (PyUnicodeWriter_WriteChar(w, '\n') < 0) {
             PyUnicodeWriter_Discard(w);
             return NULL;
         }
         line_start += i+1;
         continue;
     }
-    if (i < dedent_count) { // found an invalid indent.
-        assert(line_start[i] != indent_char);
+    if (line_start + i < s_end && i < dedent_count) { // found an invalid indent.
+        assert(line_start + i < s_end && line_start[i] != indent_char);
         PyUnicodeWriter_Discard(w);
         RAISE_ERROR_KNOWN_LOCATION(p, PyExc_SyntaxError, lineno, i, lineno, i+1,
             "d-string line missing valid indentation");
         return NULL;
     }
```
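To sanity-check the proposed fix, the loop can be modeled in Python, where every read of `line_start[i]` becomes an explicitly bounds-checked access (an illustrative model of the described logic, not the C code; `dedent_body` is a hypothetical name):

```python
def dedent_body(s, indent_char, dedent_count):
    out = []
    pos = 0
    while pos < len(s):
        i = 0
        # Consume up to dedent_count indent characters, never reading
        # past the end of the buffer.
        while pos + i < len(s) and i < dedent_count and s[pos + i] == indent_char:
            i += 1
        if pos + i >= len(s):
            break                     # trailing indentation only: done
        if s[pos + i] == "\n":        # empty line: keep just the newline
            out.append("")
            pos += i + 1
            continue
        if i < dedent_count:
            raise SyntaxError("d-string line missing valid indentation")
        end = s.find("\n", pos + i)
        if end < 0:                   # last line has no trailing newline
            out.append(s[pos + i:])
            break
        out.append(s[pos + i:end])
        pos = end + 1
    return "\n".join(out)
```

The bounds check before each read is what the proposed diff adds; with it, a body ending in bare indentation terminates the loop instead of reading past the buffer.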
Summary by CodeRabbit
New Features
Tests
✏️ Tip: You can customize this high-level summary in your review settings.