Skip to content

fix: force UTF-8 encoding for Python bridge on Windows#2

Open
quotentiroler wants to merge 1 commit intoblueraai:mainfrom
quotentiroler:fix/windows-utf8-encoding-surrogates
Open

fix: force UTF-8 encoding for Python bridge on Windows#2
quotentiroler wants to merge 1 commit intoblueraai:mainfrom
quotentiroler:fix/windows-utf8-encoding-surrogates

Conversation

@quotentiroler
Copy link

Problem

On Windows, indexing repositories containing source files with Unicode characters (smart quotes "", em dashes, etc.) fails at 99% with:

Failed to parse Python AST: 'utf-8' codec can't encode character '\udc9d' in position 20822: surrogates not allowed

Root Cause

PythonBridge spawns ast_worker.py and writes UTF-8 JSON to its stdin pipe. However, Python on Windows defaults to the system ANSI code page (e.g. cp1252) for stdio streams — not UTF-8.

Multi-byte UTF-8 sequences like \xe2\x80\x9d (right double quote U+201D) get misread: the \x9d byte is undefined in cp1252 and produces a surrogate character \udc9d via Python's surrogateescape error handler. When json.dumps() later tries to encode this surrogate back to UTF-8, it crashes with "surrogates not allowed".

Fix

Applied a belt-and-suspenders fix in two layers:

  1. python/ast_worker.py: On Windows, wraps sys.stdin/sys.stdout with explicit io.TextIOWrapper(encoding='utf-8') to ensure correct decoding regardless of system locale settings.

  2. src/crawl/bridge.ts: Passes PYTHONUTF8=1 and PYTHONIOENCODING=utf-8 environment variables when spawning the Python child process, which forces Python to use UTF-8 for all IO operations.

Testing

Tested on Windows 11 with German locale (cp1252). Before the fix, indexing the huggingface/lerobot repository (517 files) failed 4 consecutive times at 99%. After the fix, indexing completes successfully (517/517 files) and search returns results.

On Windows, Python defaults to the system ANSI code page (e.g. cp1252)
for stdio streams. When Node.js sends UTF-8 JSON containing multi-byte
characters (like smart quotes U+201C/U+201D encoded as \xe2\x80\x9c/\x9d)
over stdin pipes, Python misreads the \x9d byte as a cp1252 character,
producing surrogate \udc9d. This later crashes json.dumps with
'surrogates not allowed' during AST parsing at the finalization step.

Fix applied in two layers (belt and suspenders):

1. python/ast_worker.py: Wrap sys.stdin/sys.stdout with explicit UTF-8
   TextIOWrapper on Windows, ensuring correct decoding regardless of
   system locale settings.

2. src/crawl/bridge.ts: Pass PYTHONUTF8=1 and PYTHONIOENCODING=utf-8
   environment variables when spawning the Python child process, which
   forces Python to use UTF-8 for all IO operations.

Fixes indexing failures on Windows where source files contain Unicode
characters like smart quotes, em dashes, or other non-ASCII text.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant