This project demonstrates building an end-to-end machine learning system on Kubeflow, encompassing model training, deployment, and inference. It integrates a custom ML model application with official Kubeflow manifests for platform deployment.
- Model Training (Kubeflow Trainer): Defines the model architecture and outlines the conceptual training flow.
- Model Inference (Kubeflow KServe): Deploys trained models as API services.
- MLOps Platform (Kubeflow Dashboard and Notebooks): Leverages Kubeflow for comprehensive ML lifecycle management.
- Reproducible Deployment (InferenceService with Knative): Provides complete Kubeflow configurations for consistent environment setup across Kubernetes clusters.
Before you begin, ensure the following environment is set up:
- A running Kubernetes cluster (e.g., Minikube, Kind, GKE, EKS, etc.).
- kubectl and kustomize installed.
If you don't have a Kubernetes cluster, you can quickly set one up using Kind.
# Create a Kind cluster named 'kubeflow' using the specified configuration file.
# This configuration file (kubeflow-example-config.yaml) should define cluster-specific settings.
kind create cluster --name=kubeflow --config kubeflow-example-config.yaml
Save Kubeconfig:
kind get kubeconfig --name kubeflow > /tmp/kubeflow-config
export KUBECONFIG=/tmp/kubeflow-config
Create a Secret Based on Existing Credentials to Pull the Images:
docker login
kubectl create secret generic regcred \
--from-file=.dockerconfigjson=$HOME/.docker/config.json \
--type=kubernetes.io/dockerconfigjson
The manifests directory contains all resources required to deploy Kubeflow. In this project, I disable Kubeflow Pipelines, Katib, and the Spark Operator due to limited resources.
Note: Deploying a full Kubeflow instance is complex and may require environment-specific adjustments (e.g., storage, networking, authentication). The following command provides a basic deployment example.
# git clone the kubeflow/manifest repo
git clone https://github.com/kubeflow/manifests.git
# Change directory to the manifests folder
cd manifests
# Build Kubeflow configurations using kustomize and apply them to the Kubernetes cluster.
# This process downloads and creates numerous Kubernetes resources and may take some time.
while ! kustomize build example | kubectl apply --server-side --force-conflicts -f -; do echo "Retrying to apply resources"; sleep 20; done
After deployment, refer to the Official Kubeflow Documentation to access your Kubeflow Dashboard.
export ISTIO_NAMESPACE=istio-system
kubectl port-forward svc/istio-ingressgateway -n ${ISTIO_NAMESPACE} 8080:80
DCNv2 (Deep & Cross Network v2) is a CTR/recommendation model that combines a deep network with explicit cross layers to efficiently capture both low- and high-order feature interactions. TaobaoAd_x1 is a large-scale display advertising dataset from the Taobao platform, with user, ad/item, and context features labeled for click-through prediction. In this project, we use a 1% sample of the training split for faster experimentation.
- Model: DCNv2
- Dataset: TaobaoAd_x1
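The core of DCNv2 is the cross layer, x_{l+1} = x_0 ⊙ (W x_l + b) + x_l. Here is a minimal PyTorch sketch of a single cross layer (illustrative only; the real definitions live in model.py):

import torch
import torch.nn as nn

class CrossLayerV2(nn.Module):
    """One DCNv2 cross layer: x_{l+1} = x_0 * (W x_l + b) + x_l."""
    def __init__(self, input_dim: int):
        super().__init__()
        self.linear = nn.Linear(input_dim, input_dim)  # full-rank W with bias b

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        # Element-wise interaction with the original input x0 gives explicit
        # feature crosses; the residual term (+ xl) preserves lower-order terms.
        return x0 * self.linear(xl) + xl

# Stacking k such layers captures up to (k+1)-order explicit interactions,
# while the parallel deep network models implicit ones.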
The model_weights.pth file contains pre-trained model weights. To retrain:
- Use the Central Dashboard to launch a notebook. (It will also create a PVC for you.)
- Copy train.ipynb, model.py and feature_encoder.py to the working directory.
- Refer to train.ipynb to explore distributed PyTorch model training (see the sketch below).
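As a rough sketch of what the distributed setup looks like (assuming torchrun-style environment variables; this is not the notebook's exact code):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# The launcher (torchrun or the Kubeflow trainer) sets RANK, WORLD_SIZE, MASTER_ADDR.
dist.init_process_group(backend="gloo")  # use "nccl" on GPU nodes

model = torch.nn.Linear(16, 1)  # stand-in for the DCNv2 model defined in model.py
ddp_model = DDP(model)          # gradients are all-reduced across ranks on backward()
optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-3)

# ...the usual loop: forward, loss, backward, optimizer.step()...
dist.destroy_process_group()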
We will use serve.py to create and deploy an inference server to Kubernetes. It is a custom predictor implemented with the KServe API.
(It seems that KServe has recently migrated TorchServe to the Triton TorchScript backend, hence the custom predictor.)
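For orientation, a KServe custom predictor generally subclasses kserve.Model; a minimal sketch (not the exact contents of serve.py, and the payload handling is illustrative):

import torch
from kserve import Model, ModelServer

class DCNv2Predictor(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.weights = None
        self.load()

    def load(self):
        # Paths correspond to the env vars in serve.yaml (MODEL_PATH, ENCODER_PATH);
        # the real server also rebuilds the network from model.py and loads the encoder.
        self.weights = torch.load("/mnt/models/model_weights.pth", map_location="cpu")
        self.ready = True

    def predict(self, payload: dict, headers=None) -> dict:
        instances = payload["instances"]  # v1 protocol request body
        # ...encode dense/sparse columns, run the model, collect scores...
        return {"predictions": [0.0 for _ in instances]}  # placeholder scores

if __name__ == "__main__":
    ModelServer().start([DCNv2Predictor("dcnv2")])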
- Package the Inference Service as a Docker Image
  Pack the image with Procfile, .python-version and pyproject.toml:
  pack build --builder=heroku/builder:24 ${DOCKER_USER}/dcnv2:v1
- Push the Image to Docker Hub
  docker push ${DOCKER_USER}/dcnv2:v1
Create a serve.yaml file to define the InferenceService CR. Modify image and STORAGE_URI if needed.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: dcnv2
  namespace: kubeflow-user-example-com
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  predictor:
    scaleTarget: 1
    scaleMetric: concurrency
    maxReplicas: 10
    containers:
      - name: kserve-container
        image: boboru/dcnv2:v1
        resources:
          requests:
            cpu: "100m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "1Gi"
        env:
          - name: PROTOCOL
            value: v2
          - name: MODEL_PATH
            value: /mnt/models/model_weights.pth
          - name: ENCODER_PATH
            value: /mnt/models/preprocess_metadata.pkl
          - name: DENSE_COLS
            value: price
          - name: SPARSE_COLS
            value: userid,cms_segid,cms_group_id,final_gender_code,age_level,pvalue_level,shopping_level,occupation,new_user_class_level,adgroup_id,cate_id,campaign_id,customer,brand,pid,btag
          - name: STORAGE_URI
            value: pvc://torch-workspace
Deploy it:
# Apply the Kubernetes manifest defined in serve.yaml to create the InferenceService
kubectl apply -f serve.yaml
After deployment, test the service locally via port-forward.
- Forward the Service Port to Local
  # Ignore this if the service port has already been forwarded.
  export ISTIO_NAMESPACE=istio-system
  kubectl port-forward svc/istio-ingressgateway -n ${ISTIO_NAMESPACE} 8080:80
- Send Inference Request
  Because the model is deployed on Kubeflow, you need appropriate permissions. Use a ServiceAccount (SA) to obtain a JWT token to access the model. Adjust the --duration value as needed. For details, see the KServe Istio + Dex sample.
  INGRESS_HOST=localhost
  INGRESS_PORT=8080
  MODEL_NAME=dcnv2
  INPUT_PATH=./input.json
  SERVICE_HOSTNAME=$(kubectl get inferenceservice -n kubeflow-user-example-com $MODEL_NAME -o jsonpath='{.status.url}' | cut -d "/" -f 3)
  TOKEN=$(kubectl create token default-editor -n kubeflow-user-example-com --audience=istio-ingressgateway.istio-system.svc.cluster.local --duration=24h)
Use curl to send an inference request (v1):
curl -v -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" -H "Authorization: Bearer $TOKEN" -d @$INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
Furthermore, test_infer.py demonstrates how to send requests directly or via the InferenceRESTClient from KServe:
uv run test_infer.py --token $TOKEN --host $SERVICE_HOSTNAME
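Under the hood this amounts to a plain HTTP POST; a minimal sketch with the requests library (placeholders for the host and token; payload shape per input.json):

import json
import requests

url = "http://localhost:8080/v1/models/dcnv2:predict"
headers = {
    "Host": "<SERVICE_HOSTNAME>",       # routes the request through the Istio gateway
    "Authorization": "Bearer <TOKEN>",  # JWT from `kubectl create token`
}
with open("input.json") as f:
    payload = json.load(f)

resp = requests.post(url, headers=headers, json=payload, timeout=30)
print(resp.status_code, resp.json())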
Since we are inside the cluster, the JWT token can be omitted. Also, the internal service endpoint can be accessed directly.
Visit inference.ipynb and execute it in the cluster for more examples.
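For instance, from a notebook pod, something like this should work (the internal hostname below assumes KServe's usual <name>-predictor.<namespace>.svc.cluster.local pattern; verify yours with kubectl get inferenceservice):

import json
import requests

# Internal predictor endpoint: no Istio gateway and no JWT needed inside the cluster.
url = ("http://dcnv2-predictor.kubeflow-user-example-com.svc.cluster.local"
       "/v1/models/dcnv2:predict")
with open("input.json") as f:
    payload = json.load(f)
print(requests.post(url, json=payload, timeout=30).json())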
With the Knative Pod Autoscaler configured in serve.yaml, you can load test the service using hey.
scaleTarget: 1
scaleMetric: concurrency
maxReplicas: 10
Load test with hey:
hey -z 30s -c 30 -m POST -host ${SERVICE_HOSTNAME} -H "Content-Type: application/json" -H "Authorization: Bearer $TOKEN" -D $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
The number of InferenceService pods will scale up until it reaches maxReplicas.

