refactor: Replace string-based import parsing with AST-based detection#592
Open
StefanoFioravanzo wants to merge 5 commits intokubeflow:mainfrom
Open
refactor: Replace string-based import parsing with AST-based detection#592StefanoFioravanzo wants to merge 5 commits intokubeflow:mainfrom
StefanoFioravanzo wants to merge 5 commits intokubeflow:mainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
a9f7105 to
e74ab06
Compare
jesuino
reviewed
Feb 4, 2026
e74ab06 to
a947477
Compare
Add new imports.py module with centralized package name resolution: - STDLIB_MODULES: Comprehensive set of Python 3.10+ stdlib modules - PACKAGE_NAME_MAP: Centralized registry mapping import names to PyPI package names (sklearn→scikit-learn, cv2→opencv-python, etc.) - ImportInfo dataclass: Structured representation of parsed imports - parse_imports_ast(): Parse all import forms using Python AST - get_packages_to_install(): Extract pip package names from code - is_stdlib_module(): Check if a module is part of stdlib This replaces the brittle string-based import parsing in the compiler with proper AST analysis that handles all Python import forms including multi-line imports, parenthesized imports, and aliases. Signed-off-by: Stefano Fioravanzo <stefano.fioravanzo@gmail.com>
Add test_imports.py with 70 test cases covering: - TestStdlibModules: Verify stdlib module detection - TestPackageNameMap: Verify import-to-PyPI mappings - TestImportInfo: Test the ImportInfo dataclass methods - TestParseImportsAst: Test AST parsing of various import forms - Simple imports, aliases, from imports - Multiple names, nested modules, parenthesized imports - Dotted imports, mixed code, line number tracking - TestGetPackagesToInstall: Test package extraction - stdlib filtering, package name mapping, deduplication - Real-world data science import patterns - TestIsStdlibModule: Test stdlib detection helper Signed-off-by: Stefano Fioravanzo <stefano.fioravanzo@gmail.com>
Replace the brittle string-based import parsing in compiler.py with the new AST-based approach from the imports module. Before: - String splitting on 'import ' and 'from ' prefixes - Hardcoded if/elif chains for package name mapping - Only handled 'random' and 'sklearn' special cases - Could not handle multi-line imports or parenthesized imports - Added stdlib modules to packages_to_install After: - Proper AST parsing via get_packages_to_install() - Centralized PACKAGE_NAME_MAP with 12+ common mappings - Handles all Python import forms correctly - Filters out stdlib modules automatically - Extensible: just add to PACKAGE_NAME_MAP for new cases The _get_package_list_from_imports method now delegates to the imports module, reducing it from 25 lines to 10 lines while significantly improving correctness and maintainability. Signed-off-by: Stefano Fioravanzo <stefano.fioravanzo@gmail.com>
a947477 to
e907eee
Compare
Sort imports alphabetically to satisfy ruff linter. Signed-off-by: Stefano Fioravanzo <stefano.fioravanzo@gmail.com>
|
so many lint error |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
imports.pymodule with proper AST-based import parsingProblem
When Kale compiles a notebook into a Kubeflow Pipeline, it needs to determine which Python packages should be installed in the pipeline steps. This is done by analyzing the import statements in the notebook cells.
The previous implementation used simple string matching to parse imports:
kale/backend/kale/compiler.py
Lines 279 to 298 in 5c4a324
This approach had several issues:
randomandsklearnwere mapped to their PyPI names. Other common mismatches (e.g.,PIL→pillow,cv2→opencv-python,yaml→pyyaml) were not handled.os,sys,json, etc. were being added to the package list unnecessarily.import a, b, c(multiple imports)from x import (a, b, c)(parenthesized imports)Solution
This PR introduces a new
backend/kale/common/imports.pymodule that uses Python'sastmodule to properly parse imports:Key Components
STDLIB_MODULES: A comprehensive set of Python 3.12+ standard library modules to exclude from package requirements.PACKAGE_NAME_MAP: A centralized registry mapping import names to PyPI package names:ImportInfo:A dataclass providing structured information about each import (module, names, alias, line number).parse_imports_ast(): Parses all imports from source code using AST, handling all valid Python import syntaxes.get_packages_to_install(): Returns the set of PyPI packages needed for given source code, filtering stdlib and applying name mappings.Benefits
Changes
backend/kale/common/imports.pybackend/kale/compiler.py- Use new module instead of string matchingbackend/kale/tests/unit_tests/test_imports.py- 70 comprehensive tests