Replace Python ANTLR parser with C++ parser and AST builder #589
Closed
javihern98 wants to merge 22 commits into main
Replace the antlr4-python3-runtime dependency with a C++ ANTLR parser exposed through pybind11, cutting parse time by 95.7% (9.6s → 0.41s for 8000 statements).
Key changes:
- Add C++ ANTLR parser with pybind11 lazy-wrapping bindings
- Refactor ASTConstructor to use (rule_index, alt_index) dispatch
- Switch build backend from poetry-core to scikit-build-core
- Update CI workflows for C++ compilation and cibuildwheel
- Remove antlr4-python3-runtime dependency
- Delete dead files: lexer.py, parser.py, VtlVisitor.py
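To illustrate the (rule_index, alt_index) dispatch mentioned above, here is a minimal sketch. It is not the PR's code: the `Node` class, the index constants, and the handler names are hypothetical stand-ins, and the real indices would come from the generated parser.

```python
from typing import Callable, Dict, Tuple

# Hypothetical rule/alternative indices; the real values come from the
# generated parser and are not shown in this PR.
RULE_EXPR = 0
ALT_PARENTHESIS = 0
ALT_VAR_ID = 1


class Node:
    """Minimal stand-in for a parse-tree node exposed by the bindings."""

    def __init__(self, rule_index, alt_index, text="", children=()):
        self.rule_index = rule_index
        self.alt_index = alt_index
        self.text = text
        self.children = list(children)


def visit_parenthesis(node):
    # A parenthesised expression forwards to its inner expression.
    return dispatch(node.children[0])


def visit_var_id(node):
    return {"type": "VarID", "value": node.text}


# One table lookup replaces a per-context visitor method resolution.
HANDLERS: Dict[Tuple[int, int], Callable] = {
    (RULE_EXPR, ALT_PARENTHESIS): visit_parenthesis,
    (RULE_EXPR, ALT_VAR_ID): visit_var_id,
}


def dispatch(node):
    key = (node.rule_index, node.alt_index)
    try:
        handler = HANDLERS[key]
    except KeyError:
        raise NotImplementedError(f"no handler for {key}") from None
    return handler(node)


inner = Node(RULE_EXPR, ALT_VAR_ID, text="DS_1")
outer = Node(RULE_EXPR, ALT_PARENTHESIS, children=[inner])
result = dispatch(outer)
```

Keying on a pair of integers means the Python side never needs the generated context classes, which is one plausible reason for the refactor.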
Fix ruff I001 import sorting errors across 7 files after moving _cpp_parser into Grammar/. Update cibuildwheel config: test-requires as array, per-platform before-build commands.
poetry install doesn't invoke scikit-build-core to compile the C++ extension. Use pip install . (with build isolation) after installing deps to actually build the .so module. Update version.yml to not require poetry for version extraction.
Testing workflow: copy the compiled C++ extension from site-packages back to the source tree so mypy can resolve the import. Ubuntu 24.04: use --no-deps to avoid upgrading system numpy/pandas which causes binary incompatibility errors.
The YAML folded scalar (>) was preserving leading whitespace in the python -c command, causing IndentationError. Use single-line command.
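The failure mode can be reproduced without YAML at all. The sketch below assumes only that the command string reaching `python -c` kept extra leading whitespace on one of its lines; the two sample strings are illustrative, not taken from the workflow.

```python
def compiles(source: str) -> bool:
    """Return True if Python accepts the source, False on IndentationError."""
    try:
        compile(source, "<python -c>", "exec")
    except IndentationError:
        return False
    return True


# What the interpreter may receive when leading whitespace survives folding.
indented = "import sys\n  print(sys.version_info[0])\n"

# The single-line form has no line starts that could carry stray indentation.
single_line = "import sys; print(sys.version_info[0])"

indented_ok = compiles(indented)
single_line_ok = compiles(single_line)
```

Joining statements with `;` on one line sidesteps the problem because no continuation line exists to be indented.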
Use follow_imports = "silent" for vtlengine.AST.* in mypy config to suppress errors from AST files when the C++ extension .so isn't in the source tree (CI builds install to site-packages only). Remove the copy C++ extension step from the testing workflow.
Instead of silencing all AST modules, target only:
- _cpp_parser: follow_imports=silent (handles missing .so in CI)
- ASTConstructor + ASTConstructorModules: disallow_untyped_calls=false
No need to build the C++ parser just to check version consistency. Extract __version__ with grep and pyproject version with tomllib, removing the ANTLR download and pip install steps entirely.
Pure bash version check using grep — no Python, no build needed.
Use actions/cache to store the built wheel keyed on OS, Python version, and hash of C++ source files. On cache hit, ANTLR download and C++ compilation are skipped entirely — only the wheel install runs.
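The keying idea can be sketched in a few lines. In a workflow the hash would normally come from the `hashFiles` expression; the version below only mirrors the logic, and the `wheel-` prefix, function names, and inputs are assumptions for illustration.

```python
import hashlib
from typing import Mapping


def sources_digest(sources: Mapping[str, bytes]) -> str:
    """Hash file names and contents in a stable (sorted) order."""
    digest = hashlib.sha256()
    for name in sorted(sources):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator so name and content cannot blur
        digest.update(sources[name])
    return digest.hexdigest()


def cache_key(os_name: str, python_version: str, sources: Mapping[str, bytes]) -> str:
    # Any change to OS, Python version, or C++ sources yields a new key,
    # which forces a rebuild instead of reusing a stale wheel.
    return f"wheel-{os_name}-py{python_version}-{sources_digest(sources)}"


before = cache_key("linux", "3.12", {"bindings.cpp": b"int x;"})
same = cache_key("linux", "3.12", {"bindings.cpp": b"int x;"})
after = cache_key("linux", "3.12", {"bindings.cpp": b"int y;"})
```

Identical inputs give the same key (a cache hit), while editing any source file changes the digest and therefore the key.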
The missing .so cascades errors through all AST files, not just the constructor. Use follow_imports=silent for vtlengine.AST.* which matches the existing exclude pattern's intent.
- Define ANTLR4CPP_STATIC to avoid dllimport errors on Windows
- Use /w instead of -w for MSVC warning suppression
- Broaden mypy follow_imports=silent to all vtlengine.AST.*
Exclude ProfilingATNSimulator: it is not needed for parsing and causes MSVC build errors with high_resolution_clock on Windows.
ProfilingATNSimulator is referenced by other ANTLR runtime code, so it can't be excluded. Use /FI"chrono" on MSVC to fix the missing high_resolution_clock symbol.
Use setup-python's built-in poetry cache for faster dependency installs. Combine dependency install and wheel install into one step.
Phase 0: Infrastructure - cached py::object refs for all 44 AST classes, ScalarType classes, Model classes, SemanticError. Helper functions for token info extraction, node type checking, and Python class construction.
Phase 1: Port all 50 Terminal visitor methods from Terminals.py to C++, including visitConstant, visitVarID, visitComponentID, visitBasicScalarType, visitWindowingClause, and all other leaf-level visitors.
Phase 2: Port all 48 ExprComponent visitor methods from ExprComponents.py to C++, including visitExprComponent (recursive dispatch), all function component visitors (string, numeric, time, comparison, conditional, aggregate, analytic), and the cast/eval operators.
All 4037 tests pass. The C++ functions are exposed as pybind11 bindings but not yet wired into the main AST construction path; the Python visitor still runs unchanged.
Implement all 81 dataset-level expression visitors in C++:
- visitExpr dispatch and all expression alternatives
- Join functions (inner/left/cross/full join with clause handling)
- Dataset clauses (rename, calc, filter, keep/drop, pivot/unpivot, subspace, aggr)
- String, numeric, comparison, time, conditional functions
- Set functions (union, intersect, diff, symdiff, setdiff)
- Hierarchy and validation functions
- Aggregate and analytic functions with grouping clauses
- de_ruleset_elements dict mutation for validation operators
Implement all 18 top-level visitor methods in C++:
- visitStart entry point with statement collection
- visitStatement dispatch (temporary/persist assignment, define expression)
- Operator definition with parameter items and return types
- Datapoint ruleset definition with signature and rule clauses
- Hierarchical ruleset definition with code item relations
- build_ast() now calls visitStart() for full tree walk
- Initialize missing cached AST class refs (Argument, Operator, HRBinOp, etc.)
- Replace ASTVisitor().visitStart() with C++ build_ast() in create_ast()
- Fix default Windowing to use raw ints (-1, 0) matching Python behavior (create_windowing normalizes to strings, but ASTString checks for ints)
- Fix visitSignedInteger overflow: use stoll instead of stoi for large numbers
- Add ASTBuilder::cleanup() to release all cached Python class references
- Add cleanup_phase3() for Phase 3 statics
- Use stoll instead of stoi for large integer support in visitSignedInteger
- Fix default Windowing to preserve raw ints matching Python behavior
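The stoi-to-stoll change is a range question, which a short worked check makes concrete. The sketch assumes a 32-bit `int` and a 64-bit `long long`, as on mainstream platforms. The literal is an arbitrary example, not one from the test suite.

```python
# Assumed widths: 32-bit int (std::stoi) and 64-bit long long (std::stoll).
INT32_MAX = 2**31 - 1  # 2147483647
INT64_MAX = 2**63 - 1  # 9223372036854775807


def fits(value_text: str, max_value: int) -> bool:
    """Check whether a signed integer literal fits a two's-complement range."""
    value = int(value_text)
    return -max_value - 1 <= value <= max_value


literal = "3000000000"  # arbitrary example just above the 32-bit limit
fits_int = fits(literal, INT32_MAX)
fits_long_long = fits(literal, INT64_MAX)
```

A literal that fails the first check would make `std::stoi` throw `std::out_of_range`, while `std::stoll` parses it.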
…::object refs
Static py::object destructors were calling Py_DECREF after Python interpreter finalization, causing a segfault in __run_exit_handlers. Replace assignment with .release() (sets internal PyObject* to nullptr without Py_DECREF) and register cleanup via atexit from Python side so it runs while the interpreter is still alive.
Delete ASTConstructor.py, ASTConstructorModules/, _rule_constants.py, 60+ unused visit_* pybind11 bindings, ~190 token/rule constants, and the backwards-compat ASTVisitor import. Only ML_COMMENT constant kept (used by ASTComment.py for prettify). Total: -5,942 lines.

Summary
Replace the pure-Python ANTLR parser and AST visitor with a C++ implementation via pybind11, dramatically improving parse + AST construction performance.
Key changes:
- … .g4 grammar, compiled as a native pybind11 extension (vtl_cpp_parser.so)
- … py::object refs cleaned up via atexit to prevent segfault at interpreter finalization
Performance (MG01 benchmark, 52K-line VTL script):
Checklist
- … ruff format, ruff check, mypy)
- … pytest): all 4037 tests pass
Impact / Risk
- … run(), semantic_analysis(), prettify()) is unchanged. The C++ builder produces identical Python AST objects.
Notes
- … lexer.py, parser.py, VtlVisitor.py) is removed since the C++ extension fully replaces it.
- … antlr4-python3-runtime dependency is no longer needed at runtime (ANTLR C++ runtime is statically linked).
- … perf/cpp-parser base branch contains the parser-only work; this branch adds the AST builder on top.