diff --git a/FAQ.md b/FAQ.md index e84bde465..90705ce21 100644 --- a/FAQ.md +++ b/FAQ.md @@ -18,25 +18,9 @@ pipeline, Kale sets the notebook server's image as the steps' base image (or a custom user-defined image), so all those incremental changes (e.g. new installations) will be lost. -You will notice this is not happening in our CodeLab because, when running in -MiniKF, Kale integrates with Rok, a data management platform that takes care of -snapshotting the mounted volumes and making them available to the pipeline step. -Thus preserving the exact development environment found in the notebook. - -### Pod has unbound immediate PersistentVolumeClaim - -In order to data, Kale mounts a data volume on each pipeline step. Since steps -can run concurrently, your storage class needs to support `RWX` -(`ReadWriteMany`) volumes. If that is not the case, the pod will be left -unschedulable as it won't find this kind of resource. - -What you can do in this case is either install a storage class that enables -`RWX` volumes or: - -1. Retrieve the `.py` file generated by Kale (it should be next to the `.ipynb`) -2. Search for `marshal_vop` definition (`marshal_vop = dsl.VolumeOp...`) -3. Change this line `modes=dsl.VOLUME_MODE_RWM`, to `modes=dsl.VOLUME_MODE_RWO` -4. Run the `.py` file +To solve this, you can either: +1. Build a custom Docker image with all your dependencies pre-installed +2. List additional packages in the cell tags that Kale will include in `packages_to_install` ### Data passing and pickle errors @@ -61,27 +45,17 @@ implemented. ### Compiler errors -When compiling your notebook you may encounter the following error: -``` -Internal compiler error: Compiler has produced Argo-incompatible workflow. -Please create a new issue at https://github.com/kubeflow/pipelines/issues attaching the pipeline code and the pipeline package. -``` -followed by some explanation. For example: -``` -Error: time="2020-10-12T17:57:45-07:00" level=fatal msg="/dev/stdin failed to parse: error unmarshaling JSON: while decoding JSON: json: unknown field \"volumes\"" -``` - -This is an error raised by the KFP compiler. Kale compile process contains -converting to KFP DSL and then compiling it, so it triggers the KFP compiler. +If you encounter compiler errors, ensure you're using a compatible version of +KFP (v2.4.0+). The KFP v2 compiler produces IR YAML that is submitted to the +Kubeflow Pipelines backend. -The KFP compiler runs `argo lint` on the generated workflow, if it finds the -`argo` executable in your environment's `PATH`. +Common issues: +- **Missing dependencies**: Ensure all required packages are listed in your imports cell +- **Invalid Python syntax**: Check that your notebook cells contain valid Python 3.12+ code +- **Type mismatches**: KFP v2 uses typed artifacts; ensure inputs/outputs match expected types -To overcome this issue, you could either remove `argo` from your `PATH` or -replace it with a version that is supported by KFP. At the time of writing this -section, the recommended version is 2.4.3. Follow [this -link](https://github.com/argoproj/argo/releases/tag/v2.4.3) to get the proper -binary. +If issues persist, check the generated `.kale.py` file in the `.kale/` directory +and file an issue at https://github.com/kubeflow-kale/kale/issues. ## Limitations diff --git a/README.md b/README.md index 660de184a..b95b122e4 100644 --- a/README.md +++ b/README.md @@ -52,11 +52,13 @@ See the `Kale v2.0 Demo` video at the bottom of the `README` for more details. 
Read more about Kale and how it works in this Medium post: [Automating Jupyter Notebook Deployments to Kubeflow Pipelines with Kale](https://medium.com/kubeflow/automating-jupyter-notebook-deployments-to-kubeflow-pipelines-with-kale-a4ede38bea1f) +For a detailed technical overview, see the [Architecture Documentation](docs/ARCHITECTURE.md). + ## Getting started ### Requirements -- **Python 3.10+** +- **Python 3.12+** - **Kubeflow Pipelines v2.4.0+** - Install as recommended in the official [Kubeflow Pipelines Installation](https://www.kubeflow.org/docs/components/pipelines/operator-guides/installation/) documentation - A Kubernetes cluster (`minikube`, `kind`, or any K8s cluster) diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md new file mode 100644 index 000000000..971922ba7 --- /dev/null +++ b/docs/ARCHITECTURE.md @@ -0,0 +1,376 @@ +# Kale Architecture + +This document provides a high-level overview of Kale's architecture for contributors and users who want to understand how the system works. + +## Overview + +Kale (Kubeflow Automated pipeLines Engine) converts Jupyter notebooks into Kubeflow Pipelines. It consists of two main components: + +1. **Backend** (`backend/kale/`) - Python package that processes notebooks and generates KFP pipelines +2. **JupyterLab Extension** (`labextension/`) - TypeScript/React UI for annotating notebooks and deploying pipelines + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ User's JupyterLab │ +│ ┌──────────────────────────────────┐ ┌───────────────────────────────┐ │ +│ │ Jupyter Notebook │ │ Kale Extension Panel │ │ +│ │ ┌────────────────────────────┐ │ │ ┌─────────────────────────┐ │ │ +│ │ │ # imports │ │ │ │ Pipeline Name: [____] │ │ │ +│ │ │ import pandas as pd │ │ │ │ Experiment: [____] │ │ │ +│ │ └────────────────────────────┘ │ │ │ │ │ │ +│ │ ┌────────────────────────────┐ │ │ │ Steps: │ │ │ +│ │ │ # step:load_data │ │◄───┼──┤ ☑ load_data │ │ │ +│ │ │ df = pd.read_csv(...) │ │ │ │ ☑ preprocess │ │ │ +│ │ └────────────────────────────┘ │ │ │ ☑ train_model │ │ │ +│ │ ┌────────────────────────────┐ │ │ │ │ │ │ +│ │ │ # step:preprocess │ │ │ │ [Compile & Run] │ │ │ +│ │ │ # prev:load_data │ │ │ └─────────────────────────┘ │ │ +│ │ │ df = df.dropna() │ │ │ │ │ +│ │ └────────────────────────────┘ │ └───────────────────────────────┘ │ +│ └──────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────────────────┘ + │ + │ RPC (via Jupyter Kernel) + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ Kale Backend │ +│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────────┐ │ +│ │ NotebookProcessor│───►│ Pipeline │───►│ Compiler │ │ +│ │ │ │ │ │ │ │ +│ │ • Parse cells │ │ • Steps DAG │ │ • Jinja2 templates │ │ +│ │ • Extract tags │ │ • Dependencies │ │ • KFP DSL generation │ │ +│ │ • Detect deps │ │ • Configuration │ │ • packages_to_install │ │ +│ └─────────────────┘ └─────────────────┘ └───────────┬─────────────┘ │ +└───────────────────────────────────────────────────────────┬─────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ Generated KFP Pipeline │ +│ ┌────────────────────────────────────────────────────────────────────────┐ │ +│ │ @kfp_dsl.component(base_image='python:3.12', ...) │ │ +│ │ def load_data(...): │ │ +│ │ ... 
│ │ +│ │ │ │ +│ │ @kfp_dsl.pipeline(name='my-pipeline') │ │ +│ │ def auto_generated_pipeline(): │ │ +│ │ load_data_task = load_data() │ │ +│ │ preprocess_task = preprocess(load_data_task.outputs[...]) │ │ +│ └────────────────────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────────────────┘ + │ + │ kfp.compiler.Compiler() + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ Kubeflow Pipelines │ +│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────────┐ │ +│ │ Pipeline YAML │───►│ KFP Server │───►│ Kubernetes Cluster │ │ +│ │ (.yaml) │ │ │ │ │ │ +│ └─────────────────┘ └─────────────────┘ │ ┌───┐ ┌───┐ ┌───┐ │ │ +│ │ │Pod│►│Pod│►│Pod│ │ │ +│ │ └───┘ └───┘ └───┘ │ │ +│ └─────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +## Data Flow + +### 1. Notebook Annotation + +Users annotate notebook cells with special tags (comments) that define the pipeline structure: + +```python +# imports +import pandas as pd +import numpy as np +``` + +```python +# step:load_data +df = pd.read_csv('data.csv') +``` + +```python +# step:preprocess +# prev:load_data +df = df.dropna() +processed_data = df.values +``` + +### 2. Notebook Processing (`NotebookProcessor`) + +The `NotebookProcessor` reads the notebook and: + +1. **Parses cell tags** - Identifies imports, functions, steps, dependencies +2. **Extracts code** - Collects source code for each step +3. **Detects dependencies** - Uses PyFlakes + AST analysis to find variables passed between steps +4. **Builds Pipeline object** - Creates a DAG of steps with their configurations + +### 3. Pipeline Compilation (`Compiler`) + +The `Compiler` takes the `Pipeline` object and: + +1. **Generates lightweight components** - Each step becomes a `@kfp_dsl.component` +2. **Resolves package dependencies** - Parses imports to build `packages_to_install` +3. **Renders templates** - Uses Jinja2 to generate the final Python script +4. **Produces KFP DSL** - Outputs a standalone Python file with the pipeline definition + +### 4. Pipeline Execution + +The generated script is: + +1. **Compiled to YAML** - `kfp.compiler.Compiler()` produces pipeline.yaml +2. **Uploaded to KFP** - Sent to the Kubeflow Pipelines server +3. **Executed on Kubernetes** - Each step runs as a container in a pod + +## Backend Components + +### Directory Structure + +``` +backend/kale/ +├── __init__.py # Package exports +├── cli.py # Command-line interface +├── compiler.py # Pipeline → KFP DSL conversion +├── pipeline.py # Pipeline and configuration classes +├── step.py # Step and StepConfig classes +├── config/ # Configuration system +│ ├── config.py # Base Config class with Field system +│ └── validators.py # Field validators +├── processors/ +│ ├── baseprocessor.py # Base class for processors +│ └── nbprocessor.py # Notebook processing logic +├── common/ # Shared utilities +│ ├── astutils.py # AST parsing for dependency detection +│ ├── flakeutils.py # PyFlakes integration +│ ├── graphutils.py # DAG operations +│ ├── kfputils.py # KFP client utilities +│ └── ... +├── marshal/ # Data serialization between steps +│ ├── backend.py # Marshal backend +│ └── backends.py # Type-specific backends (sklearn, numpy, etc.) +├── rpc/ # RPC handlers for JupyterLab extension +│ ├── nb.py # Notebook operations +│ ├── kfp.py # KFP operations +│ └── ... 
+└── templates/ # Jinja2 templates for code generation + ├── new_nb_function_template.jinja2 + └── new_pipeline_template.jinja2 +``` + +### Key Classes + +#### `NotebookProcessor` (`processors/nbprocessor.py`) + +Converts a Jupyter notebook into a `Pipeline` object. + +```python +processor = NotebookProcessor( + nb_path='my_notebook.ipynb', + config=NotebookConfig(...) +) +pipeline = processor.run() +``` + +Key responsibilities: +- Parse notebook cells and extract tags +- Detect variable dependencies between steps using PyFlakes/AST +- Build the step dependency graph +- Merge imports and functions into each step + +#### `Pipeline` (`pipeline.py`) + +Represents the pipeline DAG and configuration. + +```python +class Pipeline: + steps: List[Step] # Pipeline steps + config: PipelineConfig # Pipeline configuration + processor: BaseProcessor # Reference to the processor + + def get_step(name) -> Step + def get_ordered_ancestors(step_name) -> List[str] +``` + +#### `Step` (`step.py`) + +Represents a single pipeline step. + +```python +class Step: + name: str # Step name (from tag) + source: List[str] # Source code lines + ins: Set[str] # Input variables (from previous steps) + outs: Set[str] # Output variables (to next steps) + config: StepConfig # Step configuration +``` + +#### `Compiler` (`compiler.py`) + +Converts a `Pipeline` into executable KFP DSL code. + +```python +compiler = Compiler(pipeline, imports_and_functions) +dsl_path = compiler.compile() # Returns path to generated .py file +compiler.run() # Compile, upload, and run +``` + +### Dependency Detection + +Kale uses a combination of PyFlakes and AST analysis to detect variable dependencies: + +1. **PyFlakes** identifies undefined names in each step's code +2. **AST analysis** finds all names defined in ancestor steps +3. **Matching** connects undefined names to their definitions + +``` +Step A: defines `df`, `model` +Step B: uses `df` (undefined) → detected as input from A +Step C: uses `model` (undefined) → detected as input from A +``` + +## JupyterLab Extension + +### Directory Structure + +``` +labextension/ +├── src/ +│ ├── index.ts # Plugin registration +│ ├── widget.tsx # Main plugin activation +│ ├── components/ # React components +│ │ ├── DeployButton.tsx +│ │ ├── Input.tsx +│ │ └── ... 
+│ ├── widgets/ +│ │ ├── LeftPanel.tsx # Main sidebar panel +│ │ └── cell-metadata/ # Cell tag editors +│ └── lib/ +│ ├── RPCUtils.tsx # Kernel RPC communication +│ ├── NotebookUtils.tsx # Notebook manipulation +│ └── TagsUtils.ts # Cell tag parsing +├── style/ # CSS styles +└── jupyter-config/ # Server extension config +``` + +### Communication Flow + +The extension communicates with the backend via Jupyter kernel RPC: + +``` +Extension (TypeScript) Kernel (Python) + │ │ + │──── executeRpc() ───────────►│ + │ "nb.compile_notebook" │ + │ │ + │ │ NotebookProcessor + │ │ Compiler + │ │ + │◄──── result ─────────────────│ + │ {pipeline_path: ...} │ +``` + +## Cell Tag Language + +Kale uses special comments to annotate notebook cells: + +| Tag | Description | Example | +|-----|-------------|---------| +| `# imports` | Mark cell as imports (prepended to all steps) | `# imports` | +| `# functions` | Mark cell as functions (prepended to all steps) | `# functions` | +| `# skip` | Skip this cell entirely | `# skip` | +| `# step:name` | Define a pipeline step | `# step:train_model` | +| `# prev:name` | Declare dependency on another step | `# prev:load_data` | +| `# pipeline-parameters` | Define pipeline input parameters | `# pipeline-parameters` | +| `# pipeline-metrics` | Define pipeline output metrics | `# pipeline-metrics` | +| `# annotation:key:value` | Add Kubernetes pod annotation | `# annotation:team:ml` | +| `# label:key:value` | Add Kubernetes pod label | `# label:env:prod` | +| `# limit:resource:value` | Set resource limit | `# limit:nvidia.com/gpu:1` | + +## KFP v2 Integration + +Kale generates KFP v2-compatible pipelines using: + +- **Lightweight Python components** via `@kfp_dsl.component` decorator +- **Native artifacts** (`Input`, `Output`, `Dataset`, `Model`, etc.) +- **Modern compiler** (`kfp.compiler.Compiler`) + +### Generated Component Example + +```python +@kfp_dsl.component( + base_image='python:3.12', + packages_to_install=['pandas', 'numpy', 'kubeflow-kale', 'kfp>=2.0.0'], +) +def load_data( + load_data_html_report: Output[HTML], + data_output_artifact: Output[Dataset], +): + # User's notebook code here + df = pd.read_csv('data.csv') + + # Kale marshal: save outputs for next steps + from kale.marshal import Marshaller + _kale_marshaller = Marshaller() + _kale_marshaller.save(df, data_output_artifact.path, 'data') +``` + +## Configuration System + +Kale uses a typed configuration system with validation: + +```python +class PipelineConfig(Config): + pipeline_name = Field(type=str, required=True) + experiment_name = Field(type=str, default="Default") + docker_image = Field(type=str, default="python:3.12") + volumes = Field(type=list, default=[]) + # ... + +class StepConfig(Config): + labels = Field(type=dict, default={}) + annotations = Field(type=dict, default={}) + limits = Field(type=dict, default={}) + retry_count = Field(type=int, default=0) + timeout = Field(type=int) +``` + +## Marshal System + +The marshal system handles data passing between pipeline steps: + +1. **Output**: At the end of each step, variables are serialized to artifacts +2. 
**Input**: At the start of each step, artifacts are deserialized to variables + +```python +# End of step A +_kale_marshaller.save(df, artifact_path, 'df') + +# Start of step B +df = _kale_marshaller.load(artifact_path, 'df') +``` + +Type-specific backends handle different data types: +- `PandasBackend` - DataFrames, Series +- `NumpyBackend` - Arrays +- `SKLearnBackend` - Scikit-learn models +- `PyTorchBackend` - PyTorch models +- `PickleBackend` - Generic Python objects (fallback) + +## Development Workflow + +```bash +# Setup development environment +make dev + +# Run JupyterLab with live reload +make jupyter + +# Run tests +make test-backend +make test-labextension + +# Build for release +make build +``` + +See [CONTRIBUTE.md](../CONTRIBUTE.md) for detailed development instructions.
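+
+## Putting It Together
+
+For a concrete picture of how the backend pieces above connect, here is a
+minimal sketch that drives the `NotebookProcessor` → `Pipeline` → `Compiler`
+flow from Python. It is assembled from the class snippets earlier in this
+document; the import paths, the `NotebookConfig` fields, and the
+`imports_and_functions` argument are assumptions for illustration, so check
+the sources under `backend/kale/` for the actual signatures.
+
+```python
+# Illustrative sketch only: import paths, config fields, and argument types
+# are assumptions based on the directory layout and snippets in this document.
+from kale.config import NotebookConfig
+from kale.processors.nbprocessor import NotebookProcessor
+from kale.compiler import Compiler
+
+# 1. Parse the annotated notebook into a Pipeline (steps DAG + configuration).
+notebook_config = NotebookConfig(
+    pipeline_name="my-pipeline",   # hypothetical fields, mirroring the
+    experiment_name="Default",     # PipelineConfig shown above
+    docker_image="python:3.12",
+)
+processor = NotebookProcessor(nb_path="my_notebook.ipynb", config=notebook_config)
+pipeline = processor.run()
+
+# 2. Render the Pipeline into a standalone KFP v2 DSL script, then compile it
+#    to YAML, upload it, and start a run on the Kubeflow Pipelines server.
+compiler = Compiler(pipeline, imports_and_functions="")  # assumed: extra source prepended to the script
+dsl_path = compiler.compile()   # path to the generated .py file
+compiler.run()                  # compile, upload, and run
+```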