refactor: Replace string-based import parsing with AST-based detection#592

Open
StefanoFioravanzo wants to merge 5 commits into kubeflow:main from StefanoFioravanzo:refactor/ast-based-import-detection

Conversation

@StefanoFioravanzo
Member

Summary

  • Introduce a new imports.py module with proper AST-based import parsing
  • Replace brittle hardcoded string matching in the compiler
  • Add comprehensive test coverage (70 unit tests)

Problem

When Kale compiles a notebook into a Kubeflow Pipeline, it needs to determine which Python packages should be installed in the pipeline steps. This is done by analyzing the import statements in the notebook cells.

The previous implementation used simple string matching to parse imports:

for line in lines:
    line = line.strip()
    if line.startswith("import "):
        # For 'import package' or 'import package as alias'
        parts = line.split(" ")
        if len(parts) > 1:
            package_name = parts[1].split(".")[0]
            if package_name == "random":
                package_name = "random2"
            if package_name == "sklearn":
                package_name = "scikit-learn"
            package_names.add(package_name)
    elif line.startswith("from "):
        parts = line.split(" ")
        if len(parts) > 1:
            package_name = parts[1].split(".")[0]
            if package_name == "sklearn":
                package_name = "scikit-learn"
            package_names.add(package_name)
return sorted(package_names)

This approach had several issues:

  1. Hardcoded mappings: Only random and sklearn were mapped to their PyPI names. Other common mismatches (e.g., PIL → pillow, cv2 → opencv-python, yaml → pyyaml) were not handled.
  2. No stdlib filtering: Standard library modules like os, sys, json, etc. were being added to the package list unnecessarily.
  3. Fragile parsing: String splitting doesn't handle all valid Python import syntaxes:
     • import a, b, c (multiple imports)
     • from x import (a, b, c) (parenthesized imports)
     • Multi-line imports
     • Comments on import lines
  4. Not extensible: Adding new package mappings required modifying the compiler code directly.
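
To make the fragile-parsing point concrete, here is a minimal sketch that runs a condensed version of the old string-splitting loop next to a walk over ast nodes. Both helper names (naive_top_level_names, ast_top_level_names) are hypothetical and exist only for this illustration; the name remapping from the original loop is left out to keep the comparison short.

```python
import ast


def naive_top_level_names(source: str) -> set[str]:
    """Condensed form of the old string-splitting loop (no name remapping)."""
    names = set()
    for line in source.splitlines():
        line = line.strip()
        if line.startswith(("import ", "from ")):
            parts = line.split(" ")
            if len(parts) > 1:
                # Only the second whitespace-separated token is ever inspected.
                names.add(parts[1].split(".")[0])
    return names


def ast_top_level_names(source: str) -> set[str]:
    """Collect top-level module names from Import and ImportFrom nodes."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


SOURCE = "import numpy, pandas, yaml  # several modules on one line\n"

print(sorted(naive_top_level_names(SOURCE)))  # ['numpy,']
print(sorted(ast_top_level_names(SOURCE)))    # ['numpy', 'pandas', 'yaml']
```

The string-based version keeps the trailing comma and silently drops pandas and yaml, while the AST walk sees every alias on the line and ignores the comment.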

Solution

This PR introduces a new backend/kale/common/imports.py module that parses imports with Python's ast module instead of string matching.

Key Components

  • STDLIB_MODULES: A comprehensive set of Python 3.12+ standard library modules to exclude from package requirements.
  • PACKAGE_NAME_MAP: A centralized registry mapping import names to PyPI package names:
PACKAGE_NAME_MAP = {
    "sklearn": "scikit-learn",
    "cv2": "opencv-python",
    "PIL": "pillow",
    "yaml": "pyyaml",
    # ... and more
}
  • ImportInfo: A dataclass providing structured information about each import (module, names, alias, line number).
  • parse_imports_ast(): Parses all imports from source code using AST, handling all valid Python import syntaxes.
  • get_packages_to_install(): Returns the set of PyPI packages needed for given source code, filtering stdlib and applying name mappings.

Benefits

  • Correct parsing: Handles all valid Python import syntax
  • Extensible: New package mappings can be added to PACKAGE_NAME_MAP
  • Maintainable: Centralized configuration instead of scattered conditionals
  • Tested: 70 unit tests covering all import forms and edge cases

Changes

  • New module: backend/kale/common/imports.py
  • Refactored: backend/kale/compiler.py - Use new module instead of string matching
  • Tests: backend/kale/tests/unit_tests/test_imports.py - 70 comprehensive tests

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from stefanofioravanzo. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot requested a review from ederign February 4, 2026 08:43
@StefanoFioravanzo StefanoFioravanzo changed the title Refactor/ast based import detection refactor: Replace string-based import parsing with AST-based detection Feb 4, 2026
@StefanoFioravanzo StefanoFioravanzo force-pushed the refactor/ast-based-import-detection branch from a9f7105 to e74ab06 Compare February 4, 2026 08:48
@StefanoFioravanzo StefanoFioravanzo added this to the Kale 2.0 milestone Feb 4, 2026
@StefanoFioravanzo StefanoFioravanzo force-pushed the refactor/ast-based-import-detection branch from e74ab06 to a947477 Compare February 4, 2026 12:36
Add new imports.py module with centralized package name resolution:

- STDLIB_MODULES: Comprehensive set of Python 3.10+ stdlib modules
- PACKAGE_NAME_MAP: Centralized registry mapping import names to PyPI
  package names (sklearn→scikit-learn, cv2→opencv-python, etc.)
- ImportInfo dataclass: Structured representation of parsed imports
- parse_imports_ast(): Parse all import forms using Python AST
- get_packages_to_install(): Extract pip package names from code
- is_stdlib_module(): Check if a module is part of stdlib

This replaces the brittle string-based import parsing in the compiler
with proper AST analysis that handles all Python import forms including
multi-line imports, parenthesized imports, and aliases.

Signed-off-by: Stefano Fioravanzo <stefano.fioravanzo@gmail.com>
Add test_imports.py with 70 test cases covering:

- TestStdlibModules: Verify stdlib module detection
- TestPackageNameMap: Verify import-to-PyPI mappings
- TestImportInfo: Test the ImportInfo dataclass methods
- TestParseImportsAst: Test AST parsing of various import forms
  - Simple imports, aliases, from imports
  - Multiple names, nested modules, parenthesized imports
  - Dotted imports, mixed code, line number tracking
- TestGetPackagesToInstall: Test package extraction
  - stdlib filtering, package name mapping, deduplication
  - Real-world data science import patterns
- TestIsStdlibModule: Test stdlib detection helper

Signed-off-by: Stefano Fioravanzo <stefano.fioravanzo@gmail.com>
Replace the brittle string-based import parsing in compiler.py with
the new AST-based approach from the imports module.

Before:
- String splitting on 'import ' and 'from ' prefixes
- Hardcoded if/elif chains for package name mapping
- Only handled 'random' and 'sklearn' special cases
- Could not handle multi-line imports or parenthesized imports
- Added stdlib modules to packages_to_install

After:
- Proper AST parsing via get_packages_to_install()
- Centralized PACKAGE_NAME_MAP with 12+ common mappings
- Handles all Python import forms correctly
- Filters out stdlib modules automatically
- Extensible: just add to PACKAGE_NAME_MAP for new cases

The _get_package_list_from_imports method now delegates to the
imports module, reducing it from 25 lines to 10 lines while
significantly improving correctness and maintainability.

Signed-off-by: Stefano Fioravanzo <stefano.fioravanzo@gmail.com>
@StefanoFioravanzo StefanoFioravanzo force-pushed the refactor/ast-based-import-detection branch from a947477 to e907eee Compare February 4, 2026 19:31
Sort imports alphabetically to satisfy ruff linter.

Signed-off-by: Stefano Fioravanzo <stefano.fioravanzo@gmail.com>
@FAUST-BENCHOU

There are a lot of lint errors.
Running uv run ruff check backend --fix may fix them?
