Set up an MLflow server on Kubernetes with a PostgreSQL DB (Google Cloud SQL) as backend store and Google Cloud Storage as artifact storage.
We will create the following resources in Google Cloud:

- Bucket in Cloud Storage that will be used as artifact storage
- PostgreSQL DB in Cloud SQL that will be used as the MLflow backend DB
- Container registry (GCR) that will host the MLflow image defined in `mlflow_server/Dockerfile`
- Service account (and JSON key) with access to GCS and Cloud SQL
- Service account (and JSON key) with access to GCR (used by the GKE node pool to pull images from GCR)
The Kubernetes cluster contains:

- Kubernetes Secret that holds the credentials of the service account with GCS and SQL access, as well as the credentials for the backend DB
- Kubernetes ConfigMap
- Kubernetes Deployment where each pod holds two containers:
  - a Cloud SQL Auth Proxy container that creates a secure connection to the PostgreSQL DB
  - an MLflow server that connects to the PostgreSQL DB via the Cloud SQL Auth Proxy (see the sketch after this list); we use a custom-built image that is defined in `./mlflow_server`
- Kubernetes Service
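For orientation, this is roughly the command the mlflow-server container ends up running. The database name, user, and ports below are assumptions for illustration; the actual values are wired in through the ConfigMap and Secret:

```sh
# Rough sketch of the mlflow-server container's entrypoint (illustrative only;
# db name, user and ports are assumptions -- the real values come from the
# ConfigMap/Secret). The Cloud SQL Auth Proxy sidecar exposes PostgreSQL on
# localhost, and artifacts go straight to the GCS bucket.
mlflow server \
  --host 0.0.0.0 \
  --port 5000 \
  --backend-store-uri "postgresql://mlflow:${SQL_PWD}@127.0.0.1:5432/mlflow" \
  --default-artifact-root "gs://${BUCKET_NAME}"
```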
- Install the terraform version manager `tfenv`. We will work with version 1.2.7:

```sh
tfenv install 1.2.7
tfenv use 1.2.7
```
- Set the `project`, `region` and `zone` in `./terraform/variables.tf` and authenticate:

```sh
gcloud init
gcloud auth application-default login
```
- The following command will create the required infrastructure (backend DB, cloud storage, Kubernetes cluster, and service accounts). It will also create the namespace `mlflow` and add to it a ConfigMap and a Secret with all relevant credentials for the service:

```sh
# in the ./terraform directory
terraform init
terraform apply
```
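To sanity-check the result (optional; the resource names depend on what is set in `./terraform/variables.tf`):

```sh
# optional sanity checks after `terraform apply`
gcloud sql instances list         # backend PostgreSQL instance
gsutil ls                         # artifact bucket
gcloud container clusters list    # GKE cluster
```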
- To deploy the service, we first have to build an mlflow-server image (content in the `./mlflow_server` directory) and push it to the container registry of our project. We will use Google Cloud Build:

```sh
# Docker image tag
export TAG_NAME=1.0.0
export PROJECT_ID=$(gcloud config list --format='value(core.project)')
gcloud builds submit mlflow_server \
  --config mlflow_server/cloudbuild.yaml \
  --substitutions=TAG_NAME=$TAG_NAME
```
As a result, the image `gcr.io/${PROJECT_ID}/mlflow:${TAG_NAME}` should be created.
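You can verify that the image landed in GCR:

```sh
# list the tags of the freshly pushed image
gcloud container images list-tags gcr.io/$PROJECT_ID/mlflow
```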
- The remaining components that have to be created are described in `kubernetes/mlflow.yaml`. We have to change the image of the `mlflow-server-container` (line 21) to point to the image that we created in the previous step. We can use `kubectl` to create the missing components (alternatively, you can also use `helm`, as described in helm/README.md):

```sh
# change the kubectl context (gcc-mlflow: name of the container cluster)
# get zone and project from ./terraform/variables.tf
gcloud container clusters get-credentials gcc-mlflow \
  --zone <zone> \
  --project <project_id>
kubectl -n mlflow create -f kubernetes/mlflow.yaml
```
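Once applied, you can check that everything came up:

```sh
# the pod should show 2/2 ready containers (mlflow server + Cloud SQL proxy)
kubectl -n mlflow get pods
# the Service entry shows how to reach the MLflow UI
kubectl -n mlflow get svc
```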
Alternatively, we can rely on the Google Cloud SDK to create the resources of interest. To run the commands below you need to have the `gsutil`, `gcloud`, and OpenSSL CLIs installed.
- Set up environment variables:

```sh
# Setup environment variables
echo """
export REGION=europe-west3
export TAG_NAME=1.0.0
export BUCKET_NAME="artifacts-$(openssl rand -hex 12)"
export SQL_INSTANCE_NAME=mlflow-backend
export SQL_PWD=$(openssl rand -hex 12)
# locations where some credentials will be stored
export MLFLOW_CREDENTIALS=".mlflow_credentials/gcs-csql-access.json"
export GCR_CREDENTIALS=".mlflow_credentials/gcr-access.json"
""" > .env

source .env
chmod +x kubernetes.sh
chmod +x create_infra.sh
```
- Create the required resources in Google Cloud (except the Kubernetes cluster):

```sh
# gcloud configuration
gcloud init

# Create infrastructure
./create_infra.sh
```
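For orientation, `./create_infra.sh` is expected to run commands along these lines. This is a sketch only; the exact flags, tiers, service-account names, and IAM bindings live in the script itself:

```sh
# Sketch of the kind of commands ./create_infra.sh runs (assumptions: the
# actual flags, tiers and IAM bindings are defined in the script itself).
gsutil mb -l $REGION gs://$BUCKET_NAME                 # artifact bucket
gcloud sql instances create $SQL_INSTANCE_NAME \
  --database-version=POSTGRES_13 \
  --region=$REGION \
  --tier=db-f1-micro                                   # backend PostgreSQL DB
gcloud sql users set-password postgres \
  --instance=$SQL_INSTANCE_NAME \
  --password=$SQL_PWD
gcloud iam service-accounts keys create $MLFLOW_CREDENTIALS \
  --iam-account=<sa-with-gcs-and-sql-access>           # JSON key for the server
```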
Unfortunately, I am not able to create a Kubernetes cluster with the gcloud SDK, so you have to use the UI to create it.
- Create the Kubernetes cluster components. First, you have to change the kubectl context:

```sh
gcloud container clusters get-credentials <container cluster name> \
  --zone <zone> \
  --project <project_id>

# gcloud deployment
./kubernetes.sh gcloud
```
If you have looked at the last shell script, you will have seen that there is an option for local deployment with `docker-desktop`:

```sh
# switch kubectl context
kubectl config use-context docker-desktop

# local deployment
./kubernetes.sh local
```
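To reach the MLflow UI without an external load balancer, you can port-forward to the Service. The service name and port below are assumptions; check `kubectl -n mlflow get svc` for the actual values:

```sh
# forward the MLflow UI to localhost (service name `mlflow` and port 5000 are
# assumptions; check `kubectl -n mlflow get svc` for the actual name/port)
kubectl -n mlflow port-forward svc/mlflow 5000:5000
# then open http://localhost:5000
```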
- Test if everything works:

To test if the MLflow server is running, you can run the experiment `python test/train.py` and verify through the MLflow UI that the results are logged. In the experiment definition you will see that we are using the `GCS_CREDENTIALS` to store the artifacts in GCS.
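A hedged example of how such a run might look; the tracking URI here is an assumption (use your Service's external IP, or the port-forward from above):

```sh
# point the MLflow client at the tracking server (address is an assumption;
# use your Service's external IP or the port-forward set up earlier)
export MLFLOW_TRACKING_URI="http://localhost:5000"
python test/train.py
```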
