Replace Python ANTLR parser with C++ parser via pybind11 #588

Draft
javihern98 wants to merge 15 commits into main from perf/cpp-parser

javihern98 (Contributor) commented Mar 11, 2026

Summary

  • Replace antlr4-python3-runtime with a C++ ANTLR parser exposed via pybind11, reducing runtime by 95.7% (9.6s → 0.41s for 8000 statements)
  • Switch build backend from poetry-core to scikit-build-core for C++ extension compilation
  • Update CI workflows to build platform-specific wheels with cibuildwheel for Python 3.9-3.13 on Linux, macOS, and Windows
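
The 95.7% figure is a reduction in elapsed time, which is a different number from a speedup factor. The following is a small sketch of the arithmetic using only the two timings quoted above; the function names are illustrative and not part of the codebase.

```python
def reduction_percent(before_s: float, after_s: float) -> float:
    """Share of the original runtime that was eliminated, in percent."""
    return (before_s - after_s) / before_s * 100.0


def speedup_factor(before_s: float, after_s: float) -> float:
    """How many times faster the new runtime is compared to the old one."""
    return before_s / after_s


# Timings quoted in the summary (8000 statements).
before_s, after_s = 9.6, 0.41
print(f"reduction: {reduction_percent(before_s, after_s):.1f}%")  # 95.7%
print(f"speedup:   {speedup_factor(before_s, after_s):.1f}x")     # 23.4x
```

Read this way, the synthetic 8000-statement measurement corresponds to roughly a 23x speedup, noticeably larger than the end-to-end gains reported in the MG01 benchmark further down.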

Changes

C++ Parser (src/vtlengine/AST/Grammar/_cpp_parser/)

  • bindings.cpp: pybind11 module with lazy-wrapping ParseNode/TerminalNode classes
  • _rule_constants.py: 223 (rule_index, alt_index) constants for isinstance replacement
  • Generated C++ parser/lexer from Vtl.g4 (ANTLR 4.13.1)
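
As a rough illustration of what "lazy-wrapping" can mean here, the sketch below wraps child nodes only when they are first accessed, so untouched subtrees never get a Python object. It is written in pure Python under assumptions: RawNode stands in for a node owned by the C++ tree, LazyNode is a hypothetical wrapper, and only the attribute names ctx_id, is_terminal, text and children come from this PR. The actual bindings.cpp may work differently.

```python
from functools import cached_property
from typing import List, Sequence, Tuple


class RawNode:
    """Stand-in for a node owned by the C++ parse tree (hypothetical)."""

    def __init__(self, rule_index: int, alt_index: int,
                 text: str = "", children: Sequence["RawNode"] = ()) -> None:
        self.rule_index = rule_index
        self.alt_index = alt_index
        self.text = text
        self.children = list(children)


class LazyNode:
    """Wrapper that only wraps its children when they are first requested."""

    wrapped = 0  # counts wrappers created, to make the laziness observable

    def __init__(self, raw: RawNode) -> None:
        self._raw = raw
        LazyNode.wrapped += 1

    @property
    def ctx_id(self) -> Tuple[int, int]:
        return (self._raw.rule_index, self._raw.alt_index)

    @property
    def is_terminal(self) -> bool:
        # Simplification for this sketch: a node without children is a terminal.
        return not self._raw.children

    @property
    def text(self) -> str:
        return self._raw.text

    @cached_property
    def children(self) -> List["LazyNode"]:
        # Runs once on first access; the resulting list is cached afterwards.
        return [LazyNode(child) for child in self._raw.children]


leaf_a = RawNode(-1, -1, text="DS_1")
leaf_b = RawNode(-1, -1, text="DS_2")
root = LazyNode(RawNode(1, 0, children=[leaf_a, leaf_b]))
count_before = LazyNode.wrapped        # only the root has been wrapped so far
first_child_text = root.children[0].text
count_after = LazyNode.wrapped         # both children wrapped on first access
```

The design choice being illustrated is that wrapper creation cost is paid per visited node rather than per parsed node.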

ASTConstructor Refactoring

  • All isinstance(ctx, Parser.XxxContext) → ctx.ctx_id == RC.XXX
  • All isinstance(x, TerminalNodeImpl) → x.is_terminal
  • All x.getSymbol().text → x.text
  • All list(ctx.getChildren()) → ctx.children
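
A minimal sketch of the dispatch style these replacements lead to. The RC values, the Node dataclass and the visit function are hypothetical stand-ins written for this example; only the attribute names (ctx_id, is_terminal, text, children) and the (rule_index, alt_index) idea are taken from the list above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

CtxId = Tuple[int, int]  # (rule_index, alt_index)


class RC:
    """Hypothetical subset of _rule_constants.py; the numbers are made up."""
    EXPR_ARITHMETIC: CtxId = (3, 4)
    CONSTANT: CtxId = (7, 0)


@dataclass
class Node:
    """Stand-in for the parse nodes exposed by the C++ extension."""
    ctx_id: CtxId = (-1, -1)
    is_terminal: bool = False
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def visit(ctx: Node) -> str:
    # Before: isinstance(ctx, TerminalNodeImpl) and ctx.getSymbol().text
    if ctx.is_terminal:
        return ctx.text
    # Before: isinstance(ctx, Parser.XxxContext) and list(ctx.getChildren())
    if ctx.ctx_id == RC.EXPR_ARITHMETIC:
        left, op, right = ctx.children
        return f"({visit(left)} {op.text} {visit(right)})"
    if ctx.ctx_id == RC.CONSTANT:
        return visit(ctx.children[0])
    raise NotImplementedError(f"unhandled ctx_id {ctx.ctx_id}")


def constant(value: str) -> Node:
    return Node(ctx_id=RC.CONSTANT, children=[Node(is_terminal=True, text=value)])


tree = Node(
    ctx_id=RC.EXPR_ARITHMETIC,
    children=[constant("1"), Node(is_terminal=True, text="+"), constant("2")],
)
rendered = visit(tree)
```

Comparing small tuples avoids importing the generated context classes on the Python side, which is presumably why the constants module exists.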

Build System

  • pyproject.toml: scikit-build-core backend, cibuildwheel config
  • CMakeLists.txt: builds pybind11 extension with vendored ANTLR4 C++ runtime
  • scripts/setup_antlr4_runtime.sh: downloads ANTLR4 C++ runtime for development
  • MSVC compatibility: ANTLR4CPP_STATIC define, /FI"chrono" for ProfilingATNSimulator

CI Workflows

  • testing.yml: C++ parser wheel cached via actions/cache (keyed on OS, Python version, and source hash). On cache hit, ANTLR download and C++ compilation are skipped entirely. Poetry dependencies are cached via setup-python
  • ubuntu_test_24_04.yml: Same wheel caching strategy. Uses --no-deps to avoid numpy/pandas conflicts with system packages
  • version.yml: Simplified to pure bash: extracts versions from source files with grep, no build required
  • release.yml: Wheels-only publishing via cibuildwheel (no sdist)
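
To illustrate the wheel caching idea (a key built from OS, Python version, and a hash of the C++ sources), here is a hedged Python sketch. In the workflows the key is computed by GitHub Actions itself, so wheel_cache_key, the key prefix and the file name used in the demo are assumptions made for this example only.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Iterable


def wheel_cache_key(os_name: str, python_version: str, sources: Iterable[Path]) -> str:
    """Build a key that changes whenever the OS, Python version or any source changes."""
    digest = hashlib.sha256()
    for path in sorted(sources):  # sorted so the key does not depend on listing order
        digest.update(path.name.encode("utf-8"))
        digest.update(path.read_bytes())
    return f"cpp-wheel-{os_name}-py{python_version}-{digest.hexdigest()[:16]}"


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "bindings.cpp"
    src.write_text("// v1\n")
    key_first = wheel_cache_key("Linux", "3.12", [src])
    key_repeat = wheel_cache_key("Linux", "3.12", [src])        # same inputs: cache hit
    key_other_python = wheel_cache_key("Linux", "3.13", [src])  # other interpreter: miss
    src.write_text("// v2\n")
    key_after_edit = wheel_cache_key("Linux", "3.12", [src])    # source changed: miss
```

An unchanged key means the previously built wheel can be reused; any edit to the hashed sources forces a rebuild.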

CI Performance (Testing workflow)

| Run type | Duration |
|---|---|
| No cache (first build) | ~7m37s |
| C++ wheel cached | ~5m10s |

Deleted Files

  • src/vtlengine/AST/Grammar/lexer.py (2140 lines)
  • src/vtlengine/AST/Grammar/parser.py (16415 lines)
  • src/vtlengine/AST/VtlVisitor.py (906 lines)
  • src/vtlengine/AST/Grammar/runtime_patches.py
  • src/vtlengine/AST/Grammar/fast_lexer.py

New Files

  • scripts/check_version.sh: standalone version consistency check
  • scripts/setup_antlr4_runtime.sh: downloads ANTLR4 C++ runtime for development
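
For readers unfamiliar with the version consistency check, the sketch below shows the same idea in Python: read the version string from the package source and from pyproject.toml, then compare them. The real check is a bash script built on grep; the regular expressions, function names and sample version strings here are assumptions for illustration.

```python
import re
from typing import Optional


def extract(pattern: str, text: str) -> Optional[str]:
    """Return the first captured group of pattern in text, or None if absent."""
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group(1) if match else None


def versions_match(init_text: str, pyproject_text: str) -> bool:
    """True only when both files declare a version and the two are equal."""
    package_version = extract(r'^__version__\s*=\s*["\']([^"\']+)["\']', init_text)
    project_version = extract(r'^version\s*=\s*["\']([^"\']+)["\']', pyproject_text)
    return package_version is not None and package_version == project_version


# Made-up file contents, not the project's real versions.
sample_init = '__version__ = "1.2.3"\n'
sample_pyproject = '[project]\nname = "example"\nversion = "1.2.3"\n'
stale_pyproject = '[project]\nname = "example"\nversion = "1.2.2"\n'

consistent = versions_match(sample_init, sample_pyproject)
inconsistent = versions_match(sample_init, stale_pyproject)
```

Because this only reads text files, it needs neither the compiled extension nor an installed package, which is the point of the simplification.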

MG01 Benchmark (rc6 vs rc7, 3 runs each)

Real-world benchmark using the MG01 VTL validation suite (7,387 AST statements, 7,379 output datasets). Outputs are identical between both versions.

create_ast

| | rc6 (Python ANTLR) | rc7 (C++ ANTLR) | Speedup |
|---|---|---|---|
| Run 1 | 10.08s | 3.45s | |
| Run 2 | 12.56s | 4.07s | |
| Run 3 | 13.34s | 4.68s | |
| Avg | 11.99s | 4.07s | 2.95x |
| Min | 10.08s | 3.45s | 2.92x |
semantic_analysis

| | rc6 (Python ANTLR) | rc7 (C++ ANTLR) | Speedup |
|---|---|---|---|
| Run 1 | 16.88s | 7.08s | |
| Run 2 | 14.17s | 6.82s | |
| Run 3 | 15.99s | 6.86s | |
| Avg | 15.68s | 6.92s | 2.27x |
| Min | 14.17s | 6.82s | 2.08x |
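
The Avg and Min speedups in the two tables above can be reproduced from the per-run timings. The snippet below is a small sketch of that calculation; the summarize helper is introduced here only for illustration.

```python
from statistics import mean
from typing import Dict, List

# Per-run timings in seconds, copied from the tables above.
create_ast = {"rc6": [10.08, 12.56, 13.34], "rc7": [3.45, 4.07, 4.68]}
semantic_analysis = {"rc6": [16.88, 14.17, 15.99], "rc7": [7.08, 6.82, 6.86]}


def summarize(runs: Dict[str, List[float]]) -> Dict[str, float]:
    """Speedup of rc7 over rc6, based on average and on best (minimum) times."""
    return {
        "avg_speedup": mean(runs["rc6"]) / mean(runs["rc7"]),
        "min_speedup": min(runs["rc6"]) / min(runs["rc7"]),
    }


for name, runs in (("create_ast", create_ast), ("semantic_analysis", semantic_analysis)):
    summary = summarize(runs)
    print(f"{name}: avg {summary['avg_speedup']:.2f}x, min {summary['min_speedup']:.2f}x")
```

With three runs per version the spread is fairly wide (for example 10.08s to 13.34s for rc6 create_ast), so the Min row is probably the more stable comparison.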

Test plan

  • All 4037 tests pass locally
  • mypy clean, ruff clean
  • CI passes on all Python versions (3.9-3.13) and platforms (Linux, macOS, Windows)
  • MG01 benchmark: identical outputs between rc6 and rc7
  • Verify wheel builds correctly via cibuildwheel

Replace the antlr4-python3-runtime dependency with a C++ ANTLR parser
exposed through pybind11, achieving a 95.7% performance improvement
(9.6s → 0.41s for 8000 statements).

Key changes:
- Add C++ ANTLR parser with pybind11 lazy-wrapping bindings
- Refactor ASTConstructor to use (rule_index, alt_index) dispatch
- Switch build backend from poetry-core to scikit-build-core
- Update CI workflows for C++ compilation and cibuildwheel
- Remove antlr4-python3-runtime dependency
- Delete dead files: lexer.py, parser.py, VtlVisitor.py

Fix ruff I001 import sorting errors across 7 files after moving
_cpp_parser into Grammar/. Update cibuildwheel config: test-requires
as array, per-platform before-build commands.

poetry install doesn't invoke scikit-build-core to compile the C++
extension. Use pip install . (with build isolation) after installing
deps to actually build the .so module. Update version.yml to not
require poetry for version extraction.

Testing workflow: copy the compiled C++ extension from site-packages
back to the source tree so mypy can resolve the import.

Ubuntu 24.04: use --no-deps to avoid upgrading system numpy/pandas
which causes binary incompatibility errors.

The YAML folded scalar (>) was preserving leading whitespace in the
python -c command, causing IndentationError. Use single-line command.

Use follow_imports = "silent" for vtlengine.AST.* in mypy config
to suppress errors from AST files when the C++ extension .so isn't
in the source tree (CI builds install to site-packages only).

Remove the copy C++ extension step from the testing workflow.
Instead of silencing all AST modules, target only:
- _cpp_parser: follow_imports=silent (handles missing .so in CI)
- ASTConstructor + ASTConstructorModules: disallow_untyped_calls=false

No need to build the C++ parser just to check version consistency.
Extract __version__ with grep and pyproject version with tomllib,
removing the ANTLR download and pip install steps entirely.

Pure bash version check using grep: no Python, no build needed.

Use actions/cache to store the built wheel keyed on OS, Python
version, and hash of C++ source files. On cache hit, ANTLR download
and C++ compilation are skipped entirely; only the wheel install
runs.

The missing .so cascades errors through all AST files, not just
the constructor. Use follow_imports=silent for vtlengine.AST.*
which matches the existing exclude pattern's intent.

- Define ANTLR4CPP_STATIC to avoid dllimport errors on Windows
- Use /w instead of -w for MSVC warning suppression
- Broaden mypy follow_imports=silent to all vtlengine.AST.*

Not needed for parsing and causes MSVC build errors with
high_resolution_clock on Windows.

ProfilingATNSimulator is referenced by other ANTLR runtime code, so
it can't be excluded. Use /FI"chrono" on MSVC to fix the missing
high_resolution_clock symbol.

Use setup-python's built-in poetry cache for faster dependency
installs. Combine dependency install and wheel install into one step.

javihern98 (Contributor, Author) commented:

This one is postponed until further notice, pending downstream analysis of the build-process changes and how they could affect installing the library in very constrained scenarios. Scheduled for review in April-May.
