From ec37d5fdb09003c5757145df25660a6468b653b4 Mon Sep 17 00:00:00 2001 From: Rui Vieira Date: Mon, 22 May 2023 09:44:27 +0100 Subject: [PATCH 01/10] Add operator ADR-0003 --- ...styai-service-deployment-using-operator.md | 184 ++++++++++++++++++ adr/README.md | 3 +- 2 files changed, 186 insertions(+), 1 deletion(-) create mode 100644 adr/ADR-0003-trustyai-service-deployment-using-operator.md diff --git a/adr/ADR-0003-trustyai-service-deployment-using-operator.md b/adr/ADR-0003-trustyai-service-deployment-using-operator.md new file mode 100644 index 0000000..d825dc0 --- /dev/null +++ b/adr/ADR-0003-trustyai-service-deployment-using-operator.md @@ -0,0 +1,184 @@ +--- +num: 3 # allocate an id when the draft is created +title: TrustyAI Service Deployment using Operator +status: "Draft" # One of Draft, Accepted, Rejected +authors: + - "ruivieira" # One item for each author, as github id or "firstname lastname" +tags: + - "service" # e.g. service, python, java, etc +--- + +## Title + +TrustyAI Service Deployment using Operator + +## Context and Problem Statement + +The [TrustyAI Service](https://github.com/trustyai-explainability/trustyai-explainability/tree/main/explainability-service) is currently deployed manually, leading to potential inconsistencies and errors. +A Kubernetes operator would provide a simple and consistent way to deploy and manage the TrustyAI service. + +## Goals + +1. Automate the deployment, management, and maintenance of the TrustyAI service. +2. Reduce manual errors and increase consistency in deployments. +3. Help updating of the TrustyAI service. + +## Non-Goals + +1. Implementing mechanisms that perform actions other than the deployment, management, and maintenance of the TrustyAI service. + +## Current Situation + +Currently, TrustyAI service deployments are done manually or through scripts that do not fully take advantage of Kubernetes. This can be inneficent and lead to errors introduced by manual steps. 
+As an example, the TrustyAI service needs to update ModelMesh's configuration to add the TrustyAI service as a new endpoint. This is currently done via a deployment-time script that patches the ModelMesh configuration. This could be done automatically by the Operator. + +Althought TrustyAI's deployment needs to configure a considerable number of resources (_e.g._ `Deployment`, `Service`, `ConfigMap`, `Route`, `ServiceMonitor`), the actual configuration options available for a custom TrustyAI deployment are limited. This means that the a custom TrustyAI Custom Resource Definition (CRD) would be quite simple, and the Operator would be able to handle the creation and management of the required resources. + +## Proposal + +We propose to use a stand-alone TrustyAI Kubernetes Operator which would create and manage the required `Deployment`, `Service`, `ConfigMap`, `Route`, and `ServiceMonitor` resources based on a simple Custom Resource while keeping the state consistent with the desired one. + +### Custom Resource + +An example of a custom resource is: + +```yaml +apiVersion: trustyai.opendatahub.io/v1 +kind: TrustyAIService +metadata: + name: trustyai-service-example + namespace: default +spec: + replicas: 1 + image: quay.io/trustyaiservice/trustyai-service + tag: v1.0 + storage: + format: "PVC" + folder: "/inputs" + data: + filename: "data.csv" + format: "CSV" + metrics: + schedule: "5s" +``` + +In this example: + +1. `replicas` is an optional field that specifies the number of replicas of the TrustyAI service that you want to run. If not provided, the default is one replica. + +2. `image` and `tag` are optional fields that allow you to specify a custom image and tag for the TrustyAI service. If not provided, the default is `quay.io/trustyaiservice/trustyai-service:latest`. + +3. `storage` is a mandatory field that specifies the storage details. It has two nested fields: + - `format` - the storage format, (example: a Persistent Volume Claim (PVC)). 
+ - `folder` - the folder path where data is stored. + +4. `data` is a mandatory field that specifies the data details. It has two nested fields: + - `filename` - the suffix of the file that the service uses for data. + - `format` - the format of the data file (example: a CSV file). + +5. `metrics` is a mandatory field that specifies the metrics details. It has one nested field: + - `schedule` - the schedule for metrics collection, (example: every 5 seconds). + + +The storage, data and metrics keys consist of the only mandatory configuration fields for the TrustyAI service, at the moment. Future configuration keys can be added to the custom resource as needed. + +The proposed `apiVersion` and `kind` are `trustyai.opendatahub.io/v1` and `TrustyAIService`, respectively. + +### ModelMesh Serving Integration + +The operator also ensures the correct configuration of the ModelMesh Serving component. Once the TrustyAI Service is deployed and reachable, the operator will patch the ModelMesh Serving configuration to include a custom payload processor and it will be configured to point to the consumer endpoint of the deployed TrustyAI Service. + +The processor configuration is embedded in a Kubernetes `ConfigMap` and follows the format: + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: model-serving-config + namespace: default +data: + config.yaml: | + payloadProcessors: http://trustyai-service.$NAMESPACE/consumer/kserve/v2 +``` + +In this configuration, `$NAMESPACE` is replaced by the Operator with the namespace where the TrustyAI Service and ModelMesh Serving are deployed ensuring that ModelMesh sends payloads correctly to the TrustyAI Service. + +### Monitoring (Prometheus) + +The TrustyAI Operator also creates a `ServiceMonitor` object which defines the services to be monitored by Prometheus. 
The `ServiceMonitor` will have the following configuration: + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: trustyai-metrics + labels: + modelmesh-service: modelmesh-serving +spec: + endpoints: + - interval: 4s + path: /q/metrics + honorLabels: true + honorTimestamps: true + scrapeTimeout: 3s + bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token + targetPort: 8080 + scheme: http + params: + 'match[]': + - '{__name__= "trustyai_spd"}' + - '{__name__= "trustyai_dir"}' + metricRelabelings: + - action: keep + regex: trustyai_.* + sourceLabels: + - __name__ + selector: + matchLabels: + app.kubernetes.io/name: trustyai-service +``` + +The `ServiceMonitor` object targets the TrustyAI Service and specifies how Prometheus should scrape metrics from the service, which includes the path to the metrics endpoint (`/q/metrics`), the interval at which it should scrape the metrics (every 4 seconds), and the type of metrics it should scrape (metrics with names that start with `trustyai_`). +The selector would also be updated to match the labels of the TrustyAI Service from the Custom Resource. +The scrape interval and metrics names could potentially also be configurable via the custom resource (with the current values as defaults). + +### Route + +If deployed on OpenShift, the Operator will also create a `Route` object to expose the TrustyAI Service to external clients. The `Route` object will have the following configuration: + +```yaml +kind: Route +apiVersion: route.openshift.io/v1 +metadata: + name: trustyai + labels: + app: trustyai + app.kubernetes.io/name: trustyai-service + app.kubernetes.io/part-of: trustyai + app.kubernetes.io/version: 0.1.0 +spec: + to: + kind: Service + name: trustyai-service + port: + targetPort: http + tls: null +``` + +Note that TrustyAI isn't currently implementing HTTPS endpoints, so the `tls` field will be set to `null` for now. 
Once HTTPS is implemented, the `tls` field will be updated to include the TLS configuration. + +### Threat Model + +No other threats additionally to the ones common to any operators themselves, which include misconfiguration of the operator, security vulnerabilities in the operator code or in the created resources. + +## Challenges + +1. Go and Kubernetes/OpenShift knowledge required to develop the Operator. + +## Dependencies + +1. Operator Lifecycle Manager (OLM) for installing and managing the Operator. + +## Consequences if not completed + +If not completed, we will continue with the manual deployment and management of the TrustyAI service which would make it harder to scale and update the service. \ No newline at end of file diff --git a/adr/README.md b/adr/README.md index a52d13b..f67ef5f 100644 --- a/adr/README.md +++ b/adr/README.md @@ -15,4 +15,5 @@ # Approved ADRs * [ADR-0001: TrustyAI external library integration](ADR-0001-trustyai-external-library-integration.md) -* [ADR-0002: Metrics and XAI namespaces](ADR-0002-metrics-and-xai-namespaces.md) \ No newline at end of file +* [ADR-0002: Metrics and XAI namespaces](ADR-0002-metrics-and-xai-namespaces.md) +* [ADR-0003: TrustyAI Service Deployment using Operator](ADR-0003-trustyai-service-deployment-using-operator.md) \ No newline at end of file From ef42a274d03b4040c2a6fd7330fbe18c5f2cab0e Mon Sep 17 00:00:00 2001 From: Rui Vieira Date: Tue, 23 May 2023 10:31:35 +0100 Subject: [PATCH 02/10] Update proposal --- ...rvice-deployment-using-operator-pattern.md | 245 ++++++++++++++++++ ...styai-service-deployment-using-operator.md | 184 ------------- adr/README.md | 2 +- 3 files changed, 246 insertions(+), 185 deletions(-) create mode 100644 adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md delete mode 100644 adr/ADR-0003-trustyai-service-deployment-using-operator.md diff --git a/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md 
b/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md new file mode 100644 index 0000000..d5003eb --- /dev/null +++ b/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md @@ -0,0 +1,245 @@ +--- +num: 3 # allocate an id when the draft is created +title: TrustyAI Service Deployment using Operator pattern +status: "Draft" # One of Draft, Accepted, Rejected +authors: + - "ruivieira" + - "danielezonca" # One item for each author, as github id or "firstname lastname" +tags: + - "service" # e.g. service, python, java, etc +--- + +## Title + +TrustyAI Service Deployment using Operator pattern + +## Context and Problem Statement + +The [TrustyAI Service](https://github.com/trustyai-explainability/trustyai-explainability/tree/main/explainability-service) can be deployed manually as a standalone container or via [ODH-manifest](https://github.com/opendatahub-io/odh-manifests/blob/master/trustyai-service/) as part of ODH KfDef. Both cases have limitations: a plain Deployment is error prone for the users (some parameters are mandatory) and the ODH-manifest contains [some hacks](https://github.com/opendatahub-io/odh-manifests/blob/master/trustyai-service/default/trustyai-deployment.yaml#L69-L87). +A Kubernetes operator would provide a simple and consistent way to deploy and manage the TrustyAI service. + +In addition to this, the deployment and the storage (PVC for now) must be created into a user owned namespace to give users full control and prevent security issues. An operator can enforce this. + +## Goals + +* Automate the deployment, management, and maintenance of the TrustyAI service. +* Reduce manual errors and increase consistency in deployments. +* Help updating the TrustyAI service. + +## Non-goals + +Implementing mechanisms that perform actions unrelated with the lifecycle of the TrustyAI service (create, upgrade, monitor, etc).. 
+ +## Current situation + +Currently, TrustyAI service deployments are done manually or through scripts that do not fully take advantage of Kubernetes. This can be inefficient and lead to errors introduced by manual steps. + +As an example, the TrustyAI service needs to update ModelMesh's configuration to add the TrustyAI service as a new endpoint. This is currently done via a deployment-time script that patches the ModelMesh configuration. This could be done automatically by the Operator. + +Although TrustyAI's deployment needs to configure a considerable number of resources (_e.g._ `Deployment`, `Service`, `ConfigMap`, `Route`, `ServiceMonitor`), the actual configuration options available for a custom TrustyAI deployment are limited. This means that a custom TrustyAI Custom Resource Definition (CRD) would be quite simple, and the Operator would be able to handle the creation and management of the required resources. + +## Proposal + +We propose to use a stand-alone TrustyAI Kubernetes Operator which would create and manage the required Deployment, Service, ConfigMap, Route, and ServiceMonitor resources based on a simple Custom Resource while keeping the state consistent with the desired one [^1]. + +[^1]: Initial implementation at https://github.com/ruivieira/trustyai-service-operator + +### Custom Resource + +An example of a custom resource is: + +```yaml +apiVersion: trustyai.opendatahub.io/v1alpha1 +kind: TrustyAIService +metadata: + name: trustyai-service-example + +spec: + storage: + format: "PVC" + folder: "/inputs" + data: + filename: "data.csv" + format: "CSV" + trustyaiMetrics: + schedule: "15s" +status: + phase: … + replicas: … + conditions: + - type: Ready + … + - type: ModelMeshReady + status: "True" + lastTransitionTime: … + reason: ModelMeshHealthy + message: ModelMesh is running and healthy. + - type: StorageReady + status: "True" + lastTransitionTime: … + reason: StorageHealthy + message: Storage system is functioning correctly. 
+ lastUpdateTime: … +``` + +In this example: + +* `replicas` is an optional field that specifies the number of replicas of the TrustyAI service that you want to run. If not provided, the default is one replica. +* `storage` is a mandatory field that specifies the storage details. It has two nested fields: + * `format` - the storage format, (example: a Persistent Volume Claim (PVC)). + * `folder` - the folder path where data is stored. +* data is a mandatory field that specifies the data details. It has two nested fields: + * `filename` - the suffix of the file that the service uses for data. + * `format` - the format of the data file (example: a CSV file). +* `trustyaiMetrics` is a mandatory field that specifies the metrics details. It has one nested field: + * `schedule` - the schedule for metrics calculation (example: every 15 seconds). +* `status` - as part of the reconciliation process, the operator will add additional conditions, apart from the standard ones, to the custom resource to indicate the status of the deployment. These conditions will be: + * `ModelMeshReady`, which indicates that the ModelMesh Serving component is running. + * `StorageReady`, which indicates that the storage component is running. + +The `storage`, `data` and `trustyaiMetrics` keys are currently the only mandatory configuration fields for the TrustyAI service. Future configuration keys can be added to the custom resource as needed. + +The proposed `apiVersion` and `kind` are `trustyai.opendatahub.io/v1alpha1` and `TrustyAIService`, respectively. + +### ModelMesh Serving Integration + +The operator also ensures the correct configuration of the ModelMesh Serving component. Once the TrustyAI Service is deployed and reachable, the operator will patch the ModelMesh Serving configuration to include a custom payload processor that points to the consumer endpoint of the deployed TrustyAI Service. 
+ +The processor configuration is embedded in a Kubernetes ConfigMap and follows the format: + + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: model-serving-config + namespace: default +data: + config.yaml: | + payloadProcessors: http://trustyai-service.$NAMESPACE/consumer/kserve/v2 +``` + +In this configuration, `$NAMESPACE` is replaced by the Operator with the namespace where the TrustyAI Service and ModelMesh Serving are deployed ensuring that ModelMesh sends payloads correctly to the TrustyAI Service. + +### Monitoring (Prometheus) + +The TrustyAI Operator also creates a `ServiceMonitor` object which defines the services to be monitored by Prometheus. The `ServiceMonitor` will have the following configuration: + + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: trustyai-metrics + labels: + modelmesh-service: modelmesh-serving +spec: + endpoints: + - interval: 4s + path: /q/metrics + honorLabels: true + honorTimestamps: true + scrapeTimeout: 3s + bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token + targetPort: 8080 + scheme: http + params: + 'match[]': + - '{__name__= "trustyai_spd"}' + - '{__name__= "trustyai_dir"}' + metricRelabelings: + - action: keep + regex: trustyai_.* + sourceLabels: + - __name__ + selector: + matchLabels: + app.kubernetes.io/name: trustyai-service +``` + + +The `ServiceMonitor` object targets the TrustyAI Service and specifies how Prometheus should scrape metrics from the service, which includes the path to the metrics endpoint (`/q/metrics`), the interval at which it should scrape the metrics (every 4 seconds), and the type of metrics it should scrape (metrics with names that start with `trustyai_`). +The selector would also be updated to match the labels of the TrustyAI Service from the Custom Resource. +The scrape interval and metrics names could potentially also be configurable via the custom resource (with the current values as defaults). 
+ + +A possibility for the service monitor customization is the inclusion of custom values using a nested `ref:` field for `serviceMonitoring`. This would allow users to specify a custom configuration for the `ServiceMonitor` object. For example: + + +```yaml +apiVersion: trustyai.opendatahub.io/v1 +kind: TrustyAIService +metadata: + name: trustyai-service-example + namespace: default +spec: + ... + serviceMonitoring: + ref: + apiVersion: monitoring.coreos.com/v1 + kind: ServiceMonitor + ... + spec: + endpoints: + - interval: 15s +``` + +If such configuration is not provided, the operator will use the default configuration. + +### Route + +If deployed on OpenShift, the Operator will also create a `Route` object to expose the TrustyAI Service to external clients. The `Route` object will have the following configuration: + + +```yaml +kind: Route +apiVersion: route.openshift.io/v1 +metadata: + name: trustyai + labels: + app: trustyai + app.kubernetes.io/name: trustyai-service + app.kubernetes.io/part-of: trustyai + app.kubernetes.io/version: 0.1.0 +spec: + to: + kind: Service + name: trustyai-service + port: + targetPort: http + tls: null +``` + + +Note that TrustyAI isn't currently implementing HTTPS endpoints, so the `tls` field will be set to `null` for now. Once HTTPS is implemented, the `tls` field will be updated to include the TLS configuration. + +### Testing + +The testing and CI of the TrustyAI Operator will be performed using the following approaches: + +* Unit tests for the Operator code, to ensure that the Operator's functionality is correct. +* Integration tests using [Kuttl](https://kuttl.dev/) to ensure that the Operator is correctly deployed and configured. The Kuttl tests will, for instance, ensure that: + * The state is correctly updated when the Custom Resource is updated. + * Routes and ServiceMonitors are correctly created. + * ModelMesh Payload Processors are correctly configured. 
+* End-to-End (E2E) tests, by integrating with the work already being implemented with the [TrustyAI E2E tests](https://github.com/trustyai-explainability/trustyai-explainability/tree/main/e2e_tests) + +Regarding the Operator's distribution, OperatorHub is out of scope for now, but it could be considered in the future. + +## Threat Model + +* No other threats additionally to the ones common to any operators themselves, which include misconfiguration of the operator, security vulnerabilities in the operator code or in the created resources. + +## Challenges + +* Go and Kubernetes/OpenShift knowledge required to develop the Operator. + +## Dependencies + +* Operator Lifecycle Manager (OLM) for installing and managing the Operator. + +## Consequences if not completed + +If not completed, we will continue with the manual deployment and management of the TrustyAI service which would make it harder to scale and update the service. + + diff --git a/adr/ADR-0003-trustyai-service-deployment-using-operator.md b/adr/ADR-0003-trustyai-service-deployment-using-operator.md deleted file mode 100644 index d825dc0..0000000 --- a/adr/ADR-0003-trustyai-service-deployment-using-operator.md +++ /dev/null @@ -1,184 +0,0 @@ ---- -num: 3 # allocate an id when the draft is created -title: TrustyAI Service Deployment using Operator -status: "Draft" # One of Draft, Accepted, Rejected -authors: - - "ruivieira" # One item for each author, as github id or "firstname lastname" -tags: - - "service" # e.g. service, python, java, etc ---- - -## Title - -TrustyAI Service Deployment using Operator - -## Context and Problem Statement - -The [TrustyAI Service](https://github.com/trustyai-explainability/trustyai-explainability/tree/main/explainability-service) is currently deployed manually, leading to potential inconsistencies and errors. -A Kubernetes operator would provide a simple and consistent way to deploy and manage the TrustyAI service. - -## Goals - -1. 
Automate the deployment, management, and maintenance of the TrustyAI service. -2. Reduce manual errors and increase consistency in deployments. -3. Help updating of the TrustyAI service. - -## Non-Goals - -1. Implementing mechanisms that perform actions other than the deployment, management, and maintenance of the TrustyAI service. - -## Current Situation - -Currently, TrustyAI service deployments are done manually or through scripts that do not fully take advantage of Kubernetes. This can be inneficent and lead to errors introduced by manual steps. -As an example, the TrustyAI service needs to update ModelMesh's configuration to add the TrustyAI service as a new endpoint. This is currently done via a deployment-time script that patches the ModelMesh configuration. This could be done automatically by the Operator. - -Althought TrustyAI's deployment needs to configure a considerable number of resources (_e.g._ `Deployment`, `Service`, `ConfigMap`, `Route`, `ServiceMonitor`), the actual configuration options available for a custom TrustyAI deployment are limited. This means that the a custom TrustyAI Custom Resource Definition (CRD) would be quite simple, and the Operator would be able to handle the creation and management of the required resources. - -## Proposal - -We propose to use a stand-alone TrustyAI Kubernetes Operator which would create and manage the required `Deployment`, `Service`, `ConfigMap`, `Route`, and `ServiceMonitor` resources based on a simple Custom Resource while keeping the state consistent with the desired one. - -### Custom Resource - -An example of a custom resource is: - -```yaml -apiVersion: trustyai.opendatahub.io/v1 -kind: TrustyAIService -metadata: - name: trustyai-service-example - namespace: default -spec: - replicas: 1 - image: quay.io/trustyaiservice/trustyai-service - tag: v1.0 - storage: - format: "PVC" - folder: "/inputs" - data: - filename: "data.csv" - format: "CSV" - metrics: - schedule: "5s" -``` - -In this example: - -1. 
`replicas` is an optional field that specifies the number of replicas of the TrustyAI service that you want to run. If not provided, the default is one replica. - -2. `image` and `tag` are optional fields that allow you to specify a custom image and tag for the TrustyAI service. If not provided, the default is `quay.io/trustyaiservice/trustyai-service:latest`. - -3. `storage` is a mandatory field that specifies the storage details. It has two nested fields: - - `format` - the storage format, (example: a Persistent Volume Claim (PVC)). - - `folder` - the folder path where data is stored. - -4. `data` is a mandatory field that specifies the data details. It has two nested fields: - - `filename` - the suffix of the file that the service uses for data. - - `format` - the format of the data file (example: a CSV file). - -5. `metrics` is a mandatory field that specifies the metrics details. It has one nested field: - - `schedule` - the schedule for metrics collection, (example: every 5 seconds). - - -The storage, data and metrics keys consist of the only mandatory configuration fields for the TrustyAI service, at the moment. Future configuration keys can be added to the custom resource as needed. - -The proposed `apiVersion` and `kind` are `trustyai.opendatahub.io/v1` and `TrustyAIService`, respectively. - -### ModelMesh Serving Integration - -The operator also ensures the correct configuration of the ModelMesh Serving component. Once the TrustyAI Service is deployed and reachable, the operator will patch the ModelMesh Serving configuration to include a custom payload processor and it will be configured to point to the consumer endpoint of the deployed TrustyAI Service. 
- -The processor configuration is embedded in a Kubernetes `ConfigMap` and follows the format: - -```yaml -apiVersion: v1 -kind: ConfigMap -metadata: - name: model-serving-config - namespace: default -data: - config.yaml: | - payloadProcessors: http://trustyai-service.$NAMESPACE/consumer/kserve/v2 -``` - -In this configuration, `$NAMESPACE` is replaced by the Operator with the namespace where the TrustyAI Service and ModelMesh Serving are deployed ensuring that ModelMesh sends payloads correctly to the TrustyAI Service. - -### Monitoring (Prometheus) - -The TrustyAI Operator also creates a `ServiceMonitor` object which defines the services to be monitored by Prometheus. The `ServiceMonitor` will have the following configuration: - -```yaml -apiVersion: monitoring.coreos.com/v1 -kind: ServiceMonitor -metadata: - name: trustyai-metrics - labels: - modelmesh-service: modelmesh-serving -spec: - endpoints: - - interval: 4s - path: /q/metrics - honorLabels: true - honorTimestamps: true - scrapeTimeout: 3s - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token - targetPort: 8080 - scheme: http - params: - 'match[]': - - '{__name__= "trustyai_spd"}' - - '{__name__= "trustyai_dir"}' - metricRelabelings: - - action: keep - regex: trustyai_.* - sourceLabels: - - __name__ - selector: - matchLabels: - app.kubernetes.io/name: trustyai-service -``` - -The `ServiceMonitor` object targets the TrustyAI Service and specifies how Prometheus should scrape metrics from the service, which includes the path to the metrics endpoint (`/q/metrics`), the interval at which it should scrape the metrics (every 4 seconds), and the type of metrics it should scrape (metrics with names that start with `trustyai_`). -The selector would also be updated to match the labels of the TrustyAI Service from the Custom Resource. -The scrape interval and metrics names could potentially also be configurable via the custom resource (with the current values as defaults). 
- -### Route - -If deployed on OpenShift, the Operator will also create a `Route` object to expose the TrustyAI Service to external clients. The `Route` object will have the following configuration: - -```yaml -kind: Route -apiVersion: route.openshift.io/v1 -metadata: - name: trustyai - labels: - app: trustyai - app.kubernetes.io/name: trustyai-service - app.kubernetes.io/part-of: trustyai - app.kubernetes.io/version: 0.1.0 -spec: - to: - kind: Service - name: trustyai-service - port: - targetPort: http - tls: null -``` - -Note that TrustyAI isn't currently implementing HTTPS endpoints, so the `tls` field will be set to `null` for now. Once HTTPS is implemented, the `tls` field will be updated to include the TLS configuration. - -### Threat Model - -No other threats additionally to the ones common to any operators themselves, which include misconfiguration of the operator, security vulnerabilities in the operator code or in the created resources. - -## Challenges - -1. Go and Kubernetes/OpenShift knowledge required to develop the Operator. - -## Dependencies - -1. Operator Lifecycle Manager (OLM) for installing and managing the Operator. - -## Consequences if not completed - -If not completed, we will continue with the manual deployment and management of the TrustyAI service which would make it harder to scale and update the service. 
\ No newline at end of file diff --git a/adr/README.md b/adr/README.md index f67ef5f..6669ff1 100644 --- a/adr/README.md +++ b/adr/README.md @@ -16,4 +16,4 @@ * [ADR-0001: TrustyAI external library integration](ADR-0001-trustyai-external-library-integration.md) * [ADR-0002: Metrics and XAI namespaces](ADR-0002-metrics-and-xai-namespaces.md) -* [ADR-0003: TrustyAI Service Deployment using Operator](ADR-0003-trustyai-service-deployment-using-operator.md) \ No newline at end of file +* [ADR-0003: TrustyAI Service Deployment using Operator pattern](ADR-0003-trustyai-service-deployment-using-operator-pattern.md) \ No newline at end of file From 80e86ada991cf547c7b416faab17e881008b5549 Mon Sep 17 00:00:00 2001 From: Rui Vieira Date: Thu, 25 May 2023 18:21:20 +0100 Subject: [PATCH 03/10] Remove OLM dependency --- ...0003-trustyai-service-deployment-using-operator-pattern.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md b/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md index d5003eb..22ce49d 100644 --- a/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md +++ b/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md @@ -236,10 +236,8 @@ Regarding the Operator's distribution, OperatorHub is out of scope for now, but ## Dependencies -* Operator Lifecycle Manager (OLM) for installing and managing the Operator. +None ## Consequences if not completed If not completed, we will continue with the manual deployment and management of the TrustyAI service which would make it harder to scale and update the service. 
- - From e733daf68ded472a955a65cb75f8699aac3cf776 Mon Sep 17 00:00:00 2001 From: Rui Vieira Date: Thu, 25 May 2023 18:23:04 +0100 Subject: [PATCH 04/10] Add OperatorHub as a non-goal --- ...-0003-trustyai-service-deployment-using-operator-pattern.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md b/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md index 22ce49d..651c9be 100644 --- a/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md +++ b/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md @@ -28,7 +28,8 @@ In addition to this, the deployment and the storage (PVC for now) must be create ## Non-goals -Implementing mechanisms that perform actions unrelated with the lifecycle of the TrustyAI service (create, upgrade, monitor, etc).. +* Implementing mechanisms that perform actions unrelated to the lifecycle of the TrustyAI service (create, upgrade, monitor, etc.). +* In the initial stage, distribution via OperatorHub is not a goal. This may be considered in the future. ## Current situation From 65ea47368fdd0f46cb595edd05f19525c9b71869 Mon Sep 17 00:00:00 2001 From: Rui Vieira Date: Thu, 25 May 2023 18:24:07 +0100 Subject: [PATCH 05/10] Remove OperatorHub paragraph --- ...R-0003-trustyai-service-deployment-using-operator-pattern.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md b/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md index 651c9be..a677c3d 100644 --- a/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md +++ b/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md @@ -225,8 +225,6 @@ The testing and CI of the TrustyAI Operator will be performed using the followin * ModelMesh Payload Processors are correctly configured. 
* End-to-End (E2E) tests, by integrating with the work already being implemented with the [TrustyAI E2E tests](https://github.com/trustyai-explainability/trustyai-explainability/tree/main/e2e_tests) -Regarding the Operator's distribution, OperatorHub is out of scope for now, but it could be considered in the future. - ## Threat Model * No other threats additionally to the ones common to any operators themselves, which include misconfiguration of the operator, security vulnerabilities in the operator code or in the created resources. From 07f30cd631d8215b0d42cd953047557a3f0d74cb Mon Sep 17 00:00:00 2001 From: Rui Vieira Date: Thu, 25 May 2023 18:33:30 +0100 Subject: [PATCH 06/10] Add storage and envtest --- ...rvice-deployment-using-operator-pattern.md | 28 ++++++++++++++++++- 1 file changed, 27 insertions(+), 1 deletion(-) diff --git a/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md b/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md index a677c3d..56fb6f5 100644 --- a/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md +++ b/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md @@ -59,6 +59,8 @@ spec: storage: format: "PVC" folder: "/inputs" + pv: "mypv" + size: "1Gi" data: filename: "data.csv" format: "CSV" @@ -89,6 +91,8 @@ In this example: * `storage` is a mandatory field that specifies the storage details. It has two nested fields: * `format` - the storage format, (example: a Persistent Volume Claim (PVC)). * `folder` - the folder path where data is stored. + * `pv` - the name of the Persistent Volume (PV) to use (already existing). + * `size` - the size of the PV to use (example: 1Gi). * data is a mandatory field that specifies the data details. It has two nested fields: * `filename` - the suffix of the file that the service uses for data. * `format` - the format of the data file (example: a CSV file). 
@@ -214,12 +218,34 @@ spec:
 
 Note that TrustyAI isn't currently implementing HTTPS endpoints, so the `tls` field will be set to `null` for now. Once HTTPS is implemented, the `tls` field will be updated to include the TLS configuration.
 
+### Storage
+
+The TrustyAI service requires storage to store inference data. Upon CR deployment, the operator will create a `PersistentVolumeClaim` object to request storage for the TrustyAI Service. The `PersistentVolumeClaim` object will have the following configuration:
+
+```yaml
+kind: PersistentVolumeClaim
+apiVersion: v1
+metadata:
+  name: trustyai-service-pvc
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 1Gi
+  volumeMode: Filesystem
+  storageClassName: ""
+```
+
+and bind it to the TrustyAI Service deployment and supplied PV.
+The PVC will be created in the same namespace as the TrustyAI Service is being deployed.
+
 ### Testing
 
 The testing and CI of the TrustyAI Operator will be performed using the following approaches:
 
 * Unit tests for the Operator code, to ensure that the Operator's functionality is correct.
-* Integration tests using [Kuttl](https://kuttl.dev/) to ensure that the Operator is correctly deployed and configured. The Kuttl tests will, for instance, ensure that:
+* Integration tests using [envtest](https://book.kubebuilder.io/reference/envtest.html) to ensure that the Operator is correctly deployed and configured. The Kuttl tests will, for instance, ensure that:
   * The state is correctly updated when the Custom Resource is updated.
   * Routes and ServiceMonitors are correctly created.
   * ModelMesh Payload Processors are correctly configured.

From a75bb5752d44355b40b7750d08a937857a152db8 Mon Sep 17 00:00:00 2001
From: Rui Vieira
Date: Thu, 25 May 2023 18:35:01 +0100
Subject: [PATCH 07/10] Remove hardcoded namespaces

---
 ...R-0003-trustyai-service-deployment-using-operator-pattern.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md b/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md
index 56fb6f5..6fe704e 100644
--- a/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md
+++ b/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md
@@ -118,7 +118,6 @@ apiVersion: v1
 kind: ConfigMap
 metadata:
   name: model-serving-config
-  namespace: default
 data:
   config.yaml: |
     payloadProcessors: http://trustyai-service.$NAMESPACE/consumer/kserve/v2
@@ -176,7 +175,6 @@ apiVersion: trustyai.opendatahub.io/v1
 kind: TrustyAIService
 metadata:
   name: trustyai-service-example
-  namespace: default
 spec:
   ...
   serviceMonitoring:

From 557376e12facebebfba86cb00a469a3173f0b852 Mon Sep 17 00:00:00 2001
From: Rui Vieira
Date: Thu, 25 May 2023 18:36:21 +0100
Subject: [PATCH 08/10] Add full service hostname

---
 ...R-0003-trustyai-service-deployment-using-operator-pattern.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md b/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md
index 6fe704e..e8dd535 100644
--- a/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md
+++ b/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md
@@ -120,7 +120,7 @@ metadata:
   name: model-serving-config
 data:
   config.yaml: |
-    payloadProcessors: http://trustyai-service.$NAMESPACE/consumer/kserve/v2
+    payloadProcessors: http://trustyai-service.$NAMESPACE.svc.cluster.local/consumer/kserve/v2
 ```
 
 In this configuration, `$NAMESPACE` is replaced by the Operator with the namespace where the TrustyAI Service and ModelMesh Serving are deployed ensuring that ModelMesh sends payloads correctly to the TrustyAI Service.

From 7e9f4c50d5a7e8c50d1b4e72c6a58a1f6a967e53 Mon Sep 17 00:00:00 2001
From: Rui Vieira
Date: Tue, 6 Jun 2023 13:55:02 +0100
Subject: [PATCH 09/10] Add custom image via ConfigMap

---
 ...rvice-deployment-using-operator-pattern.md | 21 +++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md b/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md
index e8dd535..7d8ac51 100644
--- a/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md
+++ b/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md
@@ -238,6 +238,27 @@ spec:
 and bind it to the TrustyAI Service deployment and supplied PV.
 The PVC will be created in the same namespace as the TrustyAI Service is being deployed.
 
+### Custom Image Configuration using ConfigMap
+
+If a custom image is required for the TrustyAI service (_e.g._ for development or testing), you can configure the operator to use custom images by creating a `ConfigMap` in the operator's namespace.
+The operator only checks the `ConfigMap` at deployment, so changes made afterward won't trigger a redeployment of services.
+
+An example of a ConfigMap that specifies a custom image:
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: trustyai-service-operator-config
+data:
+  trustyaiServiceImageName: 'quay.io/mycustomrepo/mycustomimage'
+  trustyaiServiceImageTag: 'v1.0.0'
+```
+
+After the ConfigMap is applied, the operator will use the image name and tag specified in the `ConfigMap` for the CR deployment.
+
+Since this functionality is mainly for development and testing, if you want to use a different image or tag after deployment, you'll need to update the `ConfigMap` and redeploy the operator to have the changes take effect. The running TrustyAI services won't be redeployed automatically. To use the new image or tag, you'll need to delete and recreate the TrustyAIService resources.
+
 ### Testing
 
 The testing and CI of the TrustyAI Operator will be performed using the following approaches:

From 6768c6a4263ee50fed0296e8cca07989552892da Mon Sep 17 00:00:00 2001
From: Rui Vieira
Date: Fri, 7 Jul 2023 09:40:37 +0100
Subject: [PATCH 10/10] Add storage scenario for multiple CRs in the same namespace

---
 ...-0003-trustyai-service-deployment-using-operator-pattern.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md b/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md
index 7d8ac51..270170b 100644
--- a/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md
+++ b/adr/ADR-0003-trustyai-service-deployment-using-operator-pattern.md
@@ -238,6 +238,9 @@ spec:
 and bind it to the TrustyAI Service deployment and supplied PV.
 The PVC will be created in the same namespace as the TrustyAI Service is being deployed.
 
+If multiple CRs are deployed in the same namespace, each will have its own PVC.
+The PVC and PV naming rule is respectively `${CR_NAME}-pvc` and `${CR_NAME}-pv`, where `$CR_NAME` is the name of the CR.
+
 ### Custom Image Configuration using ConfigMap
 
 If a custom image is required for the TrustyAI service (_e.g._ for development or testing), you can configure the operator to use custom images by creating a `ConfigMap` in the operator's namespace.