diff --git a/docs/en/infrastructure_management/hardware_profile/functions/hardware_profile.mdx b/docs/en/infrastructure_management/hardware_profile/functions/hardware_profile.mdx new file mode 100644 index 0000000..ce7cd5f --- /dev/null +++ b/docs/en/infrastructure_management/hardware_profile/functions/hardware_profile.mdx @@ -0,0 +1,128 @@ +--- +weight: 10 +--- + +# Hardware Profile Management + +To provide your data scientists and engineers with standardized hardware configurations and constraints for deploying model inference services on the platform, you create and manage hardware profiles. A hardware profile encapsulates node affinities, tolerations, and resource constraints into a single, reusable entity. + +## Create a hardware profile + +**Prerequisites** + +* You have logged in to the platform as a user with administrator privileges. +* You have verified that the required computing resources (CPU, memory, and any specialized accelerators such as specific GPU models) are available in the underlying Kubernetes cluster. +* You are familiar with Kubernetes scheduling concepts such as Node Selectors, Taints, and Tolerations. + +**Procedure** + + + +### Step 1: Navigate to Hardware Profile +From the main navigation menu, go to **Hardware Profile**. The Hardware Profiles page opens, displaying the existing hardware profiles in the system. + +### Step 2: Initiate hardware profile creation +Click **Create hardware profile** in the top right corner. The Create hardware profile configuration page opens. + +### Step 3: Configure basic details +In the Basic Details section, provide identifying information for the profile: +* **Name**: Enter a unique and descriptive name for the hardware profile (e.g., `gpu-high-performance-profile`). +* **Description**: (Optional) Enter a clear description of the hardware profile to help other users understand its intended use case.
+ +### Step 4: Configure resource identifiers (requests and limits) +You can define constraints for compute resources, such as CPU, memory, or specific accelerators (e.g., `nvidia.com/gpu`). Click **Add Identifier** or modify the pre-existing resource fields. You can add two types of identifiers: + +- **Built-in Identifiers**: Select from a dropdown list of standard resource types configured by the platform (e.g., `cpu`, `memory`, `nvidia.com/gpu`). For these built-in types, the **Identifier**, **Display Name**, and **Resource Type** are strictly predefined by the platform and cannot be altered. +- **Custom Identifiers**: Enter your own unique resource parameters. You must manually define: + * **Identifier**: The exact Kubernetes resource key (e.g., `nvidia.com/a100` or a custom vendor ASIC). + * **Display Name**: A human-readable name for the resource that will appear on the UI (e.g., `NVIDIA A100 GPU`). + * **Resource Type**: Categorize the resource accurately for the cluster: + * **`CPU` / `Memory`**: Select to define standard compute boundaries. + * **`Accelerator`**: Select this primarily for any specialized AI chips (like NVIDIA GPUs, AMD GPUs, or Intel Gaudi accelerators) used for model training or heavy inference tasks. By setting the type to Accelerator, the platform explicitly recognizes the dependency as a core AI computing engine. + * **`Other`**: Select this for non-AI auxiliary devices attached to nodes (such as high-speed network interfaces for RDMA, infiniband, or unique storage parameters). + +For both built-in and custom identifiers, you must configure the exact allocation boundaries: +* **Default**: Set the baseline amount of this resource to allocate. This is initially injected into the user's workload when they select the profile. +* **Minimum allowed**: Define the minimum acceptable request amount. This acts as a hard lower bound to prevent users from requesting insufficient resources for critical models. 
+* **Maximum allowed**: (Optional) Specify an absolute maximum limit. This prevents users from reserving excessive cluster resources beyond the defined capacity threshold. + +### Step 5: Configure node scheduling rules +To control which nodes inference workloads are scheduled onto, set Node Selectors and Tolerations. This ensures high-performance workloads land on the correct node pools. +* **Node Selectors**: Under the Node Selectors section, click **Add Node Selector**. Enter the **Key** and **Value** constraints. The platform automatically injects these key-value pairs to restrict workloads to nodes with matching labels. +* **Tolerations**: Under the Tolerations section, click **Add Toleration** to allow workloads to be scheduled onto nodes with matching taints. Define the **Key**, **Operator** (e.g., `Equal`, `Exists`), **Value**, **Effect** (e.g., `NoSchedule`, `NoExecute`), and optional **Toleration Seconds**. As with native Kubernetes tolerations, you can add multiple tolerations to a single hardware profile. + +### Step 6: Finalize creation +Review the configurations you have entered to ensure accuracy. Click **Create** to create the hardware profile. + + + +## Updating a hardware profile + +You can update existing hardware profiles to adapt to infrastructure changes, hardware upgrades, or revised resource policies. You can change identifying information, minimum and maximum resource constraints, or node placement via node selectors and tolerations. + + + +### Step 1: Locate the hardware profile +From the navigation menu, click **Hardware Profile**. Locate the hardware profile you want to update in the list. + +### Step 2: Edit the hardware profile +On the right side of the row containing the relevant hardware profile, click the Action menu (⋮) and select **Update**.
+ +### Step 3: Modify the configurations +Make the necessary modifications to your hardware profile configurations: +* Adjust the **Description**. +* Update the **Default**, **Minimum allowed**, or **Maximum allowed** thresholds for specific resource identifiers to match your current cluster capacity. +* Modify the **Node Selectors** to target different node labels, or update **Tolerations** to align with newly tainted worker nodes. + +### Step 4: Apply changes +Click **Update** to apply your changes. + + + +*Note: Updating a hardware profile only affects workloads configured after the update. Active deployments previously created with this hardware profile keep their originally injected constraints. To apply the new hardware profile settings to an already-running workload, you must edit or redeploy the corresponding inference service.* + +## Deleting a hardware profile + +When a hardware configuration becomes outdated or refers to decommissioned Kubernetes nodes, you can delete its hardware profile. This prevents data scientists from selecting obsolete node configurations or invalid limits in the future. + + + +### Step 1: Locate the hardware profile +From the main navigation menu, click **Hardware Profile**. Locate the hardware profile you want to delete. + +### Step 2: Delete +Click the Action menu (⋮) on the far right side of the relevant hardware profile row, and select **Delete**. + +### Step 3: Confirm deletion +A warning dialog appears asking you to confirm the deletion. Click **Delete**. + + + +*Note: Deleting a hardware profile does not delete or disrupt running inference services that were deployed with this profile. They continue to operate with the resource limits and topology constraints originally injected by the platform's webhook.
However, the deleted hardware profile immediately disappears from the profile selection dropdown for all newly created deployments.* + +## Using a hardware profile for inference services + +When users (such as data scientists, AI engineers, and developers) create or configure model inference services (both `InferenceService` and `LLMInferenceService`), they can use predefined hardware profiles. + +A hardware profile removes the need to manually configure node scheduling rules and explicit resource limits. Depending on your workload, you can accept the default configuration or customize your limits within the boundaries authorized by the selected profile. + + + +### Step 1: Launch the deployment form +From the navigation menu, go to **Service Manage**. Click **Create** to open the form for deploying a new model inference service. + +### Step 2: Select a Hardware Profile
+In the deployment form, scroll down to the **Deployment Resources** section. Here, you define your resource limits by first choosing a **Config Type**: +* By default, it is set to **Hardware Profile**. You can then click the **Profile** drop-down menu to select a hardware profile that the platform administrator has enabled for your desired compute environment. +* Alternatively, choose **Custom** if you prefer to bypass predefined profiles and manually supply raw Kubernetes resource limits. + +### Step 3: Review and customize resource allocations +Once you've selected a hardware profile, the form applies the baseline definitions curated by the administrator. You can then refine your exact resource limits: +* To view the administrator's designated boundaries, click the **View Detail** button adjacent to the profile dropdown.
This opens a drawer or modal highlighting the hardware profile specifics, including the configured node rules and the absolute limits for CPU, Memory, and GPUs. +* Depending on your workload needs, click the **Custom Configuration** button displayed below the hardware profile section. Custom requests and limits must remain *within the range* defined by the hardware profile's minimum and maximum constraints. +* Triggering this customization lets you directly modify the final **Requests** and **Limits** configuration for the inference service. If you submit an invalid request parameter, the validation engine catches the violation and presents a validation error. + +### Step 4: Deploy +Populate the remaining parameters for your service and click **Deploy**. + + \ No newline at end of file diff --git a/docs/en/infrastructure_management/hardware_profile/functions/index.mdx b/docs/en/infrastructure_management/hardware_profile/functions/index.mdx new file mode 100644 index 0000000..7176c7c --- /dev/null +++ b/docs/en/infrastructure_management/hardware_profile/functions/index.mdx @@ -0,0 +1,10 @@ +--- +weight: 50 +i18n: + title: + en: Guides +--- + +# Guides + + diff --git a/docs/en/infrastructure_management/hardware_profile/how_to/cpu_and_gpu_profiles.mdx b/docs/en/infrastructure_management/hardware_profile/how_to/cpu_and_gpu_profiles.mdx new file mode 100644 index 0000000..0e46ebc --- /dev/null +++ b/docs/en/infrastructure_management/hardware_profile/how_to/cpu_and_gpu_profiles.mdx @@ -0,0 +1,104 @@ +--- +weight: 30 +i18n: + title: + en: Creating CPU-Only and GPU-Accelerated Profiles + zh: 创建纯 CPU 与 GPU 加速的 Hardware Profile +--- + +# Creating CPU-Only and GPU-Accelerated Profiles + +In a production AI platform, you often need to serve different types of machine learning workloads.
For example, traditional machine learning models (like scikit-learn or XGBoost) or simple data processing tasks only require CPU resources, while Large Language Models (LLMs) or complex deep learning models require GPU acceleration. + +By creating distinct Hardware Profiles for CPU-only and GPU-accelerated workloads, you can effectively isolate these two types of services and prevent lightweight CPU models from unintentionally consuming expensive GPU resources. + +## Example 1: CPU-Only Hardware Profile + +A CPU-only profile omits any accelerator identifiers (such as `nvidia.com/gpu`) and strictly relies on `cpu` and `memory` identifiers. + +When creating a CPU-only profile, ensure that: +1. The **Accelerator** resource type is entirely excluded. +2. The Node Selector does not target any GPU-specific nodes. +3. The name and description clearly indicate that this profile is meant for standard ML inference or lightweight models. + +Here is an example of a CPU-only hardware profile: + +```yaml +apiVersion: infrastructure.opendatahub.io/v1alpha1 +kind: HardwareProfile +metadata: + name: standard-cpu-profile + namespace: kube-public +spec: + # Do not include nvidia.com/gpu + identifiers: + - identifier: "cpu" + displayName: "CPU" + minCount: "1" + maxCount: "8" + defaultCount: "2" + resourceType: CPU + - identifier: "memory" + displayName: "Memory" + minCount: "2Gi" + maxCount: "16Gi" + defaultCount: "4Gi" + resourceType: Memory + # Standard CPU nodes + scheduling: + type: Node + node: + nodeSelector: + node-role.kubernetes.io/worker: "true" +``` + +## Example 2: GPU-Accelerated Hardware Profile + +A GPU-accelerated profile explicitly requires the `nvidia.com/gpu` identifier, ensuring that any workload selecting this profile will be allocated physical GPU resources. + +When creating a GPU-accelerated profile: +1. Include an identifier for the specific accelerator (e.g., `nvidia.com/gpu`). +2. 
Add the corresponding Tolerations if your GPU nodes are tainted (e.g., `nvidia.com/gpu:NoSchedule`). +3. Optionally add a Node Selector to target specific GPU architectures (e.g., `accelerator: nvidia-t4`). + +Here is an example of a GPU-accelerated hardware profile: + +```yaml +apiVersion: infrastructure.opendatahub.io/v1alpha1 +kind: HardwareProfile +metadata: + name: gpu-t4-profile + namespace: kube-public +spec: + identifiers: + # Crucially include the GPU resource + - identifier: "nvidia.com/gpu" + displayName: "GPU" + minCount: "1" + maxCount: "4" + defaultCount: "1" + resourceType: Accelerator + - identifier: "cpu" + displayName: "CPU" + minCount: "4" + maxCount: "16" + defaultCount: "8" + resourceType: CPU + - identifier: "memory" + displayName: "Memory" + minCount: "16Gi" + maxCount: "64Gi" + defaultCount: "32Gi" + resourceType: Memory + scheduling: + type: Node + node: + nodeSelector: + accelerator: nvidia-t4 + tolerations: + - key: "nvidia.com/gpu" + operator: "Exists" + effect: "NoSchedule" +``` + +By providing these two distinctly different profiles, platform administrators can ensure Data Scientists have the exact environment they need, without wasting high-value compute resources on simple tasks. diff --git a/docs/en/infrastructure_management/hardware_profile/how_to/create_hardware_profile_cli.mdx b/docs/en/infrastructure_management/hardware_profile/how_to/create_hardware_profile_cli.mdx new file mode 100644 index 0000000..fa08e77 --- /dev/null +++ b/docs/en/infrastructure_management/hardware_profile/how_to/create_hardware_profile_cli.mdx @@ -0,0 +1,84 @@ +--- +weight: 15 +i18n: + title: + en: Create Hardware Profile using CLI + zh: 使用 CLI 创建 Hardware Profile +--- + +# Create Hardware Profile using CLI + +This document describes how to create `HardwareProfile` resources using the command line and provides a sample YAML. + +## Prerequisites + +- You have access to a Kubernetes cluster with the platform installed. 
+- You have configured `kubectl` to communicate with your cluster. +- You have permissions in the namespace where `HardwareProfile` resources are managed (for example, an admin namespace such as `kube-public`). + +## Create a HardwareProfile + +Create a YAML file named `gpu-high-performance-profile.yaml` with the following content: + +```yaml +apiVersion: infrastructure.opendatahub.io/v1alpha1 +kind: HardwareProfile +metadata: + name: gpu-high-performance-profile + namespace: kube-public +spec: + # Define resource limitations and defaults + identifiers: + - identifier: "nvidia.com/gpu" + displayName: "GPU" + minCount: "1" + maxCount: "8" + defaultCount: "1" + resourceType: Accelerator + - identifier: "cpu" + displayName: "CPU" + minCount: "4" + maxCount: "32" + defaultCount: "8" + resourceType: CPU + - identifier: "memory" + displayName: "Memory" + minCount: "16Gi" + maxCount: "128Gi" + defaultCount: "32Gi" + resourceType: Memory + # Configure Node Selectors and Tolerations for scheduling + scheduling: + type: Node + node: + nodeSelector: + accelerator: nvidia-a100 + node-role.kubernetes.io/worker: "true" + tolerations: + - key: "nvidia.com/gpu" + operator: "Exists" + effect: "NoSchedule" +``` + +Then apply the YAML file to your cluster using `kubectl`: + +```bash +kubectl apply -f gpu-high-performance-profile.yaml -n kube-public +``` + +## Check HardwareProfile Status + +You can check whether the `HardwareProfile` has been successfully created using the following command: + +```bash +kubectl get hardwareprofile gpu-high-performance-profile -n kube-public +``` + +The output should look similar to this: + +```bash +NAME                            AGE +gpu-high-performance-profile    2m +``` + +Once correctly applied, your Data Scientists will be able to select **GPU High Performance** when deploying their Inference Services using the UI, and the constraints specified in the profile will automatically be validated and injected into the deployed workloads.
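When authoring profiles like the one above, it helps to sanity-check the identifier bounds before applying them. The following standalone Python sketch is an illustrative helper, not a platform tool: it verifies that each identifier's `defaultCount` falls between `minCount` and `maxCount`, mirroring the bounds the platform enforces at deploy time. Its simplified quantity parser handles only plain integers and the binary suffixes used in this document.

```python
# Offline sanity check for HardwareProfile identifiers: verify that
# minCount <= defaultCount <= maxCount for every entry. Runs without
# a cluster; illustrative only.

SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def parse_quantity(q: str) -> int:
    """Convert a quantity string such as "8" or "16Gi" to an integer."""
    for suffix, factor in SUFFIXES.items():
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * factor
    return int(q)

def validate_identifiers(identifiers: list) -> list:
    """Return human-readable violations; an empty list means the bounds are consistent."""
    errors = []
    for ident in identifiers:
        lo = parse_quantity(ident["minCount"])
        hi = parse_quantity(ident["maxCount"])
        default = parse_quantity(ident["defaultCount"])
        if not lo <= default <= hi:
            errors.append(
                f"{ident['identifier']}: default {ident['defaultCount']} "
                f"is outside [{ident['minCount']}, {ident['maxCount']}]"
            )
    return errors

# The identifiers from gpu-high-performance-profile.yaml above:
identifiers = [
    {"identifier": "nvidia.com/gpu", "minCount": "1", "maxCount": "8", "defaultCount": "1"},
    {"identifier": "cpu", "minCount": "4", "maxCount": "32", "defaultCount": "8"},
    {"identifier": "memory", "minCount": "16Gi", "maxCount": "128Gi", "defaultCount": "32Gi"},
]
print(validate_identifiers(identifiers))  # [] (the profile is internally consistent)
```

Running the script before `kubectl apply` catches inverted or out-of-range bounds early, rather than at deployment time.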
diff --git a/docs/en/infrastructure_management/hardware_profile/how_to/index.mdx b/docs/en/infrastructure_management/hardware_profile/how_to/index.mdx new file mode 100644 index 0000000..aca892f --- /dev/null +++ b/docs/en/infrastructure_management/hardware_profile/how_to/index.mdx @@ -0,0 +1,11 @@ +--- +weight: 60 +i18n: + title: + en: How To +title: How To +--- + +# How To + + diff --git a/docs/en/infrastructure_management/hardware_profile/how_to/schedule_to_specific_gpu_nodes.mdx b/docs/en/infrastructure_management/hardware_profile/how_to/schedule_to_specific_gpu_nodes.mdx new file mode 100644 index 0000000..cdaacec --- /dev/null +++ b/docs/en/infrastructure_management/hardware_profile/how_to/schedule_to_specific_gpu_nodes.mdx @@ -0,0 +1,45 @@ +--- +weight: 20 +i18n: + title: + en: Schedule Workloads to Specific GPU Nodes + zh: 将工作负载调度到特定的 GPU 节点 +--- + +# Schedule Workloads to Specific GPU Nodes + +When defining a Hardware Profile, you often need to ensure that the AI inference workload is strictly scheduled onto nodes with a specific type of GPU (such as an NVIDIA A100 or H100) and that the workload tolerates the taints on those dedicated nodes to avoid regular CPU workloads taking over the GPU nodes. + +This guide demonstrates how to configure these constraints in a Hardware Profile so that your Data Scientists don't need to manually configure them. + +## Use Node Selectors + +Node selectors allow you to guide pods to specific nodes based on node labels. + +1. Find the exact Kubernetes label of the GPU nodes in your cluster. For example: + * `accelerator: nvidia-a100` + * `nvidia.com/gpu.present: "true"` +2. Edit or create your Hardware Profile. +3. In the **Node Selectors** section, add the Key-Value pair corresponding to the label: + * **Key**: `accelerator` + * **Value**: `nvidia-a100` + +Once saved, any Inference Service attempting to use this Hardware Profile will inherently receive this node selector, ensuring it only lands on a node with an A100 GPU. 
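Under the hood, this selector is equivalent to the following fragment of the workload's pod spec (standard Kubernetes scheduling fields, shown with the example label from this guide):

```yaml
# Fragment effectively merged into the pod template of any Inference
# Service that uses this Hardware Profile: only nodes labeled
# accelerator=nvidia-a100 are eligible to run the pod.
spec:
  nodeSelector:
    accelerator: nvidia-a100
```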
+ +## Use Taints and Tolerations + +GPU nodes are frequently "tainted" by cluster administrators so that standard pods (like web servers or generic databases) are not scheduled on them, thereby reserving the GPU processing power for AI workloads. + +If your GPU nodes have a taint like `nvidia.com/gpu:NoSchedule`, your Hardware Profile must include a corresponding toleration. + +1. Under the **Tolerations** section of your Hardware Profile, add a new toleration. +2. Configure it to match the taint on the GPU node: + * **Key**: `nvidia.com/gpu` + * **Operator**: `Exists` (This tolerates any value for the key `nvidia.com/gpu`. Alternatively, use `Equal` and explicitly set the **Value**). + * **Effect**: `NoSchedule` (Matches the restrictive effect of the taint). + +By adding this toleration to the Hardware Profile, the deployed Inference Service is explicitly granted "permission" to be scheduled on the dedicated GPU nodes. + +## Combined Configuration + +By combining both a **Node Selector** (to instruct the scheduler *where* to go) and a **Toleration** (to allow the scheduler to *place* it there), your Hardware Profile effectively acts as a reliable blueprint for heterogeneous node architectures. diff --git a/docs/en/infrastructure_management/hardware_profile/index.mdx b/docs/en/infrastructure_management/hardware_profile/index.mdx new file mode 100644 index 0000000..9f31ec2 --- /dev/null +++ b/docs/en/infrastructure_management/hardware_profile/index.mdx @@ -0,0 +1,7 @@ +--- +weight: 60 +--- + +# Hardware Profile + + diff --git a/docs/en/infrastructure_management/hardware_profile/intro.mdx b/docs/en/infrastructure_management/hardware_profile/intro.mdx new file mode 100644 index 0000000..ebd18aa --- /dev/null +++ b/docs/en/infrastructure_management/hardware_profile/intro.mdx @@ -0,0 +1,35 @@ +--- +weight: 5 +--- + +# Introduction + +Hardware profiles centrally allow platform administrators to provision specific and standardized hardware configurations. 
These configurations encapsulate computing resource limits, node selectors, and node tolerations into a cohesive unit that platform users can select when deploying model inference services. + +Using hardware profiles reduces manual errors from raw YAML configuration, prevents unintentional scheduling onto the wrong node groups, and ensures consistent resource management for cluster workloads. + +Hardware profiles work with the platform's `InferenceService` and `LLMInferenceService` resources. + +## Why do we need a Hardware Profile? + +While standard Kubernetes offers resource requests and limits through Pod specifications, deploying AI inference workloads (such as Large Language Models or specialized KServe predictors) introduces unique operational challenges. Our implementation of Hardware Profiles is tailored to solve these challenges with the following platform-specific characteristics: + +1. **Topology & Specialized Accelerator Abstraction** + Data scientists focus on model performance and logic rather than the underlying cluster topology. They may not know the exact node labels or taints required to schedule workloads onto specific GPU nodes, vGPU resources, or interconnect networks. A Hardware Profile abstracts away these complexities: administrators embed precise `Node Selectors` and `Tolerations` directly into the profile, so that when a user selects a "High-End NVIDIA A100" profile from the UI, the workload automatically targets the correct physical machine pools. + +2. **Dynamic Bounded Customization (Not Just Rigid Quotas)** + Unlike platforms that enforce a single, immutable resource size (t-shirt sizing), our system defines a scalable boundary for each resource type. Administrators configure the **Minimum allowed**, **Default**, and **Maximum allowed** limits.
When a user selects a profile, they inherit the *Default* settings immediately. Through the **Customize Data** option, they can still manually fine-tune their specific Requests and Limits. As long as those values fall within the authorized profile boundaries, the deployment succeeds, allowing elasticity for different models without risking excessive cluster monopolization. + +3. **Smart Webhook Validation & Asymmetric Auto-Correction** + Our platform employs a dedicated Mutating Webhook that integrates with the model serving pipelines. Instead of relying on users to perfectly craft YAML manifests, the webhook intercepts the request and injects the profile's constraints into the workload. It also safeguards the cluster: for instance, if a user specifies limits but omits requests (or vice versa), the webhook performs semantic adjustments (capping requests at limits, or filling in defaults) and blocks configurations that violate the profile's defined minimum or maximum limits before any Pods are spawned. + +4. **Native Interoperability with Custom Serving Engines** + Whether deploying a standard `InferenceService` or a heavily customized `LLMInferenceService`, the hardware profile engine tracks the Pod/Container structures behind the scenes and injects constraints into the active predictor container's resources. + +### Key Aspects of a Hardware Profile + +* **Resource Identifiers (Limits & Requests):** Profiles govern native Kubernetes resource settings (such as minimum CPU thresholds, default Memory allocations, and maximum GPU limits) to prevent overload while maintaining operational stability. +* **Taints & Tolerations:** Hardware profiles carry the tolerations that allow workload pods to be scheduled onto tainted nodes (e.g., nodes dedicated to specialized hardware).
+* **Node Selectors:** They strictly constrain workloads to distinct node label selectors to match the correct machine architectures without implicit guessing. +* **Backend Webhook Injection:** Through automated interception mechanisms installed in the cluster, hardware constraints transparently merge and attach to submitted workloads directly from the management namespace. + diff --git a/docs/en/model_inference/inference_service/functions/inference_service.mdx b/docs/en/model_inference/inference_service/functions/inference_service.mdx index 5697d8a..e4915c5 100644 --- a/docs/en/model_inference/inference_service/functions/inference_service.mdx +++ b/docs/en/model_inference/inference_service/functions/inference_service.mdx @@ -8,39 +8,40 @@ The core definition of the inference service feature is to deploy trained machin ## Advantages -* Simplifies the model deployment process, reducing deployment complexity. -* Provides high-availability, high-performance online and batch inference services. -* Supports dynamic model updates and version management. -* Realizes automated operation, maintenance, and monitoring of model inference services. +- Simplifies the model deployment process, reducing deployment complexity. +- Provides high-availability, high-performance online and batch inference services. +- Supports dynamic model updates and version management. +- Realizes automated operation, maintenance, and monitoring of model inference services. ## Core Features **Direct Model Deployment for Inference Services** -* Allows users to directly select specific versions of model files from the model repository and specify the inference runtime image to quickly deploy online inference services. The system automatically downloads, caches, and loads the model, starting the inference service. This simplifies the model deployment process and lowers the deployment threshold. 
+- Allows users to select specific versions of model files from a **Model repository** or directly mount models from a **PVC**, and specify the inference runtime image to quickly deploy online inference services. The system automatically downloads, caches, and loads the model, starting the inference service. This simplifies the model deployment process and lowers the deployment threshold. **Application for Inference Services** -* Use Kubernetes applications as inference services. This approach provides greater flexibility, allowing users to customize the inference environment according to their needs. +- Use Kubernetes applications as inference services. This approach provides greater flexibility, allowing users to customize the inference environment according to their needs. **Inference Service Template Management** -* Supports the creation, management, and deletion of inference service templates, allowing users to quickly deploy inference services based on predefined templates. + +- Supports the creation, management, and deletion of inference service templates, allowing users to quickly deploy inference services based on predefined templates. **Batch Operation of Inference Services** -* Supports batch operations on multiple inference services, such as batch starting, stopping, updating, and deleting. -* Able to support the creation, monitoring, and result export of batch inference tasks. -* Provides batch resource management, which can allocate and adjust the resources of inference services in batches. +- Supports batch operations on multiple inference services, such as batch starting, stopping, updating, and deleting. +- Able to support the creation, monitoring, and result export of batch inference tasks. +- Provides batch resource management, which can allocate and adjust the resources of inference services in batches. **Inference Experience** -* Provides an interactive interface to facilitate user testing and experience of inference services. 
-* Supports multiple input and output formats to meet the needs of different application scenarios. -* Provides model performance evaluation tools to help users optimize model deployment. +- Provides an interactive interface to facilitate user testing and experience of inference services. +- Supports multiple input and output formats to meet the needs of different application scenarios. +- Provides model performance evaluation tools to help users optimize model deployment. **Inference Runtime Support** -* Integrates a variety of mainstream inference frameworks, such as vLLM, Seldon MLServer, etc., and supports user-defined inference runtimes. +- Integrates a variety of mainstream inference frameworks, such as vLLM, Seldon MLServer, etc., and supports user-defined inference runtimes. :::tip @@ -51,74 +52,79 @@ The core definition of the inference service feature is to deploy trained machin **Access Methods, Logs, Swagger, Monitoring, etc.** -* Provides multiple access methods, such as HTTP API and gRPC. -* Supports detailed log recording and analysis to facilitate user troubleshooting. -* Automatically generates Swagger documentation to facilitate user integration and invocation of inference services. -* Provides real-time monitoring and alarm features to ensure stable service operation. - +- Provides multiple access methods, such as HTTP API and gRPC. +- Supports detailed log recording and analysis to facilitate user troubleshooting. +- Automatically generates Swagger documentation to facilitate user integration and invocation of inference services. +- Provides real-time monitoring and alarm features to ensure stable service operation. ## Create inference service -### Step 1: Navigate to Model Repository -In the left navigation bar, click **Model Repository** +### Step 1: Access the Inference Service Creation Page + +There are two ways to start the inference service publishing process: + +1. 
**From Model Repository**: In the left navigation bar, click **Model Repository**. Click the target model name to enter the model details page, and click **Publish Inference Service** in the upper right corner. +2. **From Inference Service**: In the left navigation bar, click **Inference Service**, and then click **Create Inference Service**. :::tip Custom publishing inference service requires manual setting of parameters. You can also create a "template" by combining input parameters for quick publishing of inference services. ::: -### Step 2: Initiate Inference Service Publishing -Click the model name to enter the model details page, and click **Publish Inference Service** in the upper right corner. +### Step 2: Configure Model Metadata (if needed) + +If you start from the Model Repository and the **"Publish Inference Service"** button is not clickable, go to the **"File Management"** tab, click "Edit Metadata", and select **"Task Type"** and **"Framework"** based on the actual model information. (You must edit the metadata of the default branch for it to take effect.) -### Step 3: Configure Model Metadata (if needed) -If the **"Publish Inference Service"** button is not clickable, go to the **"File Management"** tab, click "Edit Metadata", and select **"Task Type"** and **"Framework"** based on the actual model information. (You must edit the metadata of the default branch for it to take effect.) +### Step 3: Select Publish Mode and Configure -### Step 4: Select Publish Mode and Configure Enter the **Publish Mode Selection** page. AML provides **Custom Publish** and **Template Publish** options. 1. Template Publish: - - Select the model and click **Template Name** - Enter the template publish form, where parameters from the template are preloaded but can be manually edited - Click **Publish** to deploy the inference service + 2. 
Custom Publish: - Click **Custom Publish** - - Enter the custom publish form and configure the parameters + - Enter the custom publish form and configure the parameters, including the **Model Location** (Model repository or PVC). - Click **Publish** to deploy the inference service -### Step 5: Monitor and Manage Inference Service +### Step 4: Monitor and Manage Inference Service + You can view the status, logs, and other details of the published inference service under Inference Service in the left navigation. **If the inference service fails to start or the running resources are insufficient, you may need to update or republish the inference service and modify the configuration that may cause the startup failure.** **Note:** The inference service will automatically scale up and down between the "minimum number of replicas" and "maximum number of replicas" according to the request traffic. If the "minimum number of replicas" is set to 0, the inference service will automatically pause and release resources when there is no request for a period of time. At this time, if a request comes, the inference service can automatically start and load the model cached in the PVC. AML completes the release and operation of cloud native inference services based on [kserve](https://github.com/kserve/kserve) InferenceService CRD. If you are familiar with the use of kserve, you can also click the "YAML" button in the upper right corner when "Publish inference service directly from model" to directly modify the YAML file to complete more advanced release operations. - **Parameter Descriptions for model publishing** -| Parameters | Description | -| :----- | :--------------------------------- | -| Name | Required, The name of the inference API. | -| Description | A detailed description of the inference API, explaining its functionality and purpose. | -| Model | Required, The name of the model used for inference. | -| Version | Required, The version of the model. 
options include Branch and Tag.| -| Inference Runtimes | Required, The engine used for inference runtime | -| Requests CPU | Required, The amount of CPU resources requested by the inference service.| -| Requests Memory | Required, The amount of memory resources requested by the inference service.| -| Limits CPU | Required, The maximum amount of CPU resources that the inference service can use.| -| Limits Memory | Required, The maximum amount of memory resources that the inference service can use.| -| GPU Acceleration Type | The type of GPU acceleration.| -| GPU Acceleration Value | The value of GPU acceleration.| -| Temporary storage | Temporary storage space used by the inference service.| -| Mount existing PVC | Mount an existing Kubernetes Persistent Volume Claim (PVC) as storage.| -| Capacity | Required, The capacity size of temporary storage or PVC.| -| Auto scaling | Enable or disable auto-scaling functionality.| -| Number of instances | Required, The number of instances running the inference service.| -| Environment variables | Key-value pairs injected into the container runtime environment.| -| Add parameters | Parameters passed to the container's entrypoint executable. Array of strings (e.g. ["--port=8080", "--batch_size=4"]).| -| Startup command | Overrides the default ENTRYPOINT instruction in the container image. Executable + arguments (e.g. ["python", "serve.py"])| - +| Parameters | Description | +| :--------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------- | +| Name | Required, The name of the inference API. | +| Description | A detailed description of the inference API, explaining its functionality and purpose. | +| Model Location | Required, Choose the source of the model files: `Model repository` or `PVC`. | +| Model PVC | Required when Model Location is `PVC`. Select an existing PVC containing the model files. 
| +| Model PVC Path | Required when Model Location is `PVC`. Specify the path within the PVC where the model resides. | +| Model Type | Required when Model Location is `PVC`. Choose between `Generative AI model` or `Predictive Model`. This selection determines the available Runtimes. | +| Model | Required when Model Location is `Model repository`. The name of the model used for inference. | +| Version | Required when Model Location is `Model repository`. The version of the model. Options include Branch and Tag. | +| Inference Runtimes | Required, The engine used for inference runtime. When `Model Type` is `Generative AI model`, `llm-d` is available for distributed inference. | +| Config Source | Required, Choose the source for deploying resources: `Hardware profile` or `Custom`. | +| Hardware Profile | Required when Config Source is `Hardware profile`. Select a predefined hardware profile that configures the required resources. | +| Requests CPU | Required when Config Source is `Custom`. The amount of CPU resources requested by the inference service. | +| Requests Memory | Required when Config Source is `Custom`. The amount of memory resources requested by the inference service. | +| Limits CPU | Required when Config Source is `Custom`. The maximum amount of CPU resources that the inference service can use. | +| Limits Memory | Required when Config Source is `Custom`. The maximum amount of memory resources that the inference service can use. | +| GPU Acceleration Type | Available when Config Source is `Custom`. The type of GPU acceleration. | +| GPU Acceleration Value | Available when Config Source is `Custom`. The value of GPU acceleration. | +| Storage Capacity | Required when Model Location is `Model repository`. The capacity size of the temporary storage used to cache the model. | +| Auto scaling | Enable or disable auto-scaling functionality. | +| Number of instances | Required, The number of instances running the inference service. 
| +| Environment variables | Key-value pairs injected into the container runtime environment. | +| Add parameters | Parameters passed to the container's entrypoint executable. Array of strings (e.g. ["--port=8080", "--batch_size=4"]). | +| Startup command | Overrides the default ENTRYPOINT instruction in the container image. Executable + arguments (e.g. ["python", "serve.py"]) | @@ -129,38 +135,40 @@ AML introduces **Template Publish** for quickly deploying inference services. Yo ### Step 1: Create a Template - - In the left navigation bar, click **Inference Service > Create Inference Service** - - Click **Custom Publish** - - Enter the form page and configure parameters - - Click **Create Template** + +- In the left navigation bar, click **Inference Service > Create Inference Service** +- Click **Custom Publish** +- Enter the form page and configure parameters +- Click **Create Template** ### Step 2: Create a New Template from Existing - - In the left navigation bar, click **Inference Service > Create Inference Service** - - Select the model and click **Template Name** - - Edit the parameters as needed - - Click **Create Template** to save as a new template + +- In the left navigation bar, click **Inference Service > Create Inference Service** +- Select the model and click **Template Name** +- Edit the parameters as needed +- Click **Create Template** to save as a new template ### Step 3: Delete a Template - - In the left navigation bar, click **Inference Service > Create Inference Service** - - On the template card, click **Actions > Delete** - - Confirm the deletion - +- In the left navigation bar, click **Inference Service > Create Inference Service** +- On the template card, click **Actions > Delete** +- Confirm the deletion + ## Inference service update + 1. In the left navigation bar, click **Inference Service.** 2. Click the **inference service name.** 3. 
On the inference service detail page, click **Actions > Update** in the upper right to enter the update page. 4. Modify the necessary fields and click **Update**. The system will perform a rolling update to avoid disrupting existing client requests. - - - ## Calling the published inference service + AML provides a visual **"Inference Experience"** method for common task types to access the published inference service; you can also use the HTTP API method to call the inference service. ### Inference Experience + AML supports the following task type inference service inference demonstration (the task type is specified in the model metadata): - Text generation @@ -171,7 +179,6 @@ AML supports the following task type inference service inference demonstration ( After successfully publishing the inference service of the above task types, you can display the **"Inference Experience"** dialog box on the right side of the model details page and the inference service details page. Depending on the type of inference task, the input and output data types may be different. Taking text generation as an example, enter text, and you can append the model-generated text in blue font after the text entered in the text box. Inference experience supports selecting different inference services deployed in different clusters and published multiple times by the same model. After selecting an inference service, this inference service will be called to return the inference result. - ### Calling by HTTP API After publishing the inference service, you can call this inference service in applications or other services. This document will take Python code as an example to show how to call the published inference API. @@ -180,9 +187,10 @@ After publishing the inference service, you can call this inference service in a 2. Click the **Access Method** tab to get the in-cluster or out-cluster access method. 
The in-cluster access method can be accessed directly from Notebook or other containers in this K8s cluster. If you need to access it from a location outside the cluster (such as a local laptop), you need to use the out-cluster access method. 3. Click Call Example to view the sample code. - ***Note: The code provided in the call example is only the API call protocol supported by the inference service published using the mlserver runtime (Seldon MLServer). In addition, the Swagger tab also only supports access to the inference service published by the mlserver runtime.*** + **_Note: The code provided in the call example is only the API call protocol supported by the inference service published using the mlserver runtime (Seldon MLServer). In addition, the Swagger tab also only supports access to the inference service published by the mlserver runtime._** ### Inference parameter description + When calling the inference service, you can adjust the model output effect by adjusting the model inference parameters. In the **Inference Experience** interface, common parameters and default values are pre-made, and any custom parameters can also be added. @@ -192,33 +200,33 @@ In the **Inference Experience** interface, common parameters and default values ##### Preset Parameters -| Parameter | Data Type | Description | -|---|---|---| -| `do_sample` | bool | Whether to use sampling; if not, greedy decoding is used. | -| `max_new_tokens` | int | The maximum number of tokens to generate, ignoring tokens in the prompt. | -| `repetition_penalty` | float | Repetition penalty to control repeated content in the generated text; 1.0 means no repetition, 0 means repetition. | -| `temperature` | float | The randomness of the model for the next token when generating text; 1.0 is high randomness, 0 is low randomness. | -| `top_k` | int | When calculating the probability distribution of the next token, only consider the top k tokens with the highest probability. 
| -| `top_p` | float | Controls the cumulative probability distribution considered by the model when selecting the next token. | -| `use_cache` | bool | Whether to use the intermediate results calculated by the model during the generation process. | +| Parameter | Data Type | Description | +| -------------------- | --------- | ----------------------------------------------------------------------------------------------------------------------------- | +| `do_sample` | bool | Whether to use sampling; if not, greedy decoding is used. | +| `max_new_tokens` | int | The maximum number of tokens to generate, ignoring tokens in the prompt. | +| `repetition_penalty` | float | Repetition penalty to control repeated content in the generated text; 1.0 means no repetition, 0 means repetition. | +| `temperature` | float | The randomness of the model for the next token when generating text; 1.0 is high randomness, 0 is low randomness. | +| `top_k` | int | When calculating the probability distribution of the next token, only consider the top k tokens with the highest probability. | +| `top_p` | float | Controls the cumulative probability distribution considered by the model when selecting the next token. | +| `use_cache` | bool | Whether to use the intermediate results calculated by the model during the generation process. | ##### Other Parameters -| Parameter | Data Type | Description | -|---|---|---| -| `max_length` | int | The maximum number of generated tokens. Corresponds to the number of tokens in the input prompt + `max_new_tokens`. If `max_new_tokens` is set, its effect is overridden by `max_new_tokens`. | -| `min_length` | int | The minimum number of generated tokens. Corresponds to the number of tokens in the input prompt + `min_new_tokens`. If `min_new_tokens` is set, its effect is overridden by `min_new_tokens`. | -| `min_new_tokens` | int | The minimum number of generated tokens, ignoring tokens in the prompt. 
| -| `early_stop` | bool | Controls the stopping condition for beam-based methods. True: generation stops when `num_beams` complete candidates appear. False: applies heuristics to stop generation when it is unlikely to find better candidates. | -| `num_beams` | int | Number of beams used for beam search. 1 means no beam search. | -| `max_time` | int | The maximum time allowed for calculation to run, in seconds. | -| `num_beam_groups` | int | Divides `num_beams` into groups to ensure diversity among different beam groups. | -| `diversity_penalty` | float | Effective when `num_beam_groups` is enabled. This parameter applies a diversity penalty between groups to ensure that the content generated by each group is as different as possible. | -| `penalty_alpha` | float | Contrastive search is enabled when `penalty_alpha` is greater than 0 and `top_k` is greater than 1. The larger the `penalty_alpha` value, the stronger the contrastive penalty, and the more likely the generated text is to meet expectations. If the `penalty_alpha` value is set too large, it may cause the generated text to be too uniform. | -| `typical_p` | float | Local typicality measures the similarity between the conditional probability of predicting the next target token and the expected conditional probability of predicting the next random token given the generated partial text. If set to a floating-point number less than 1, the smallest set of locally typical tokens whose probabilities add up to or exceed `typical_p` will be retained for generation. | -| `epsilon_cutoff` | float | If set to a floating-point number strictly between 0 and 1, only tokens with conditional probabilities greater than `epsilon_cutoff` will be sampled. Suggested values range from {/* lint ignore unit-case */} 3e-4 to 9e-4, depending on the model size. | -| `eta_cutoff` | float | Eta sampling is a hybrid of local typical sampling and epsilon sampling. 
If set to a floating-point number strictly between 0 and 1, a token will only be considered if it is greater than `eta_cutoff` or sqrt(`eta_cutoff`) * exp(-entropy(softmax(next_token_logits))). Suggested values range from {/* lint ignore unit-case */} 3e-4 to 2e-3, depending on the model size. | -| `repetition_penalty` | float | Parameter for repetition penalty. 1.0 means no penalty. | +| Parameter | Data Type | Description | +| -------------------- | --------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `max_length` | int | The maximum number of generated tokens. Corresponds to the number of tokens in the input prompt + `max_new_tokens`. If `max_new_tokens` is set, its effect is overridden by `max_new_tokens`. | +| `min_length` | int | The minimum number of generated tokens. Corresponds to the number of tokens in the input prompt + `min_new_tokens`. If `min_new_tokens` is set, its effect is overridden by `min_new_tokens`. | +| `min_new_tokens` | int | The minimum number of generated tokens, ignoring tokens in the prompt. | +| `early_stop` | bool | Controls the stopping condition for beam-based methods. True: generation stops when `num_beams` complete candidates appear. False: applies heuristics to stop generation when it is unlikely to find better candidates. | +| `num_beams` | int | Number of beams used for beam search. 1 means no beam search. | +| `max_time` | int | The maximum time allowed for calculation to run, in seconds. | +| `num_beam_groups` | int | Divides `num_beams` into groups to ensure diversity among different beam groups. 
|
+| `diversity_penalty`  | float     | Effective when `num_beam_groups` is enabled. This parameter applies a diversity penalty between groups to ensure that the content generated by each group is as different as possible. |
+| `penalty_alpha`      | float     | Contrastive search is enabled when `penalty_alpha` is greater than 0 and `top_k` is greater than 1. The larger the `penalty_alpha` value, the stronger the contrastive penalty, and the more likely the generated text is to meet expectations. If the `penalty_alpha` value is set too large, it may cause the generated text to be too uniform. |
+| `typical_p`          | float     | Local typicality measures the similarity between the conditional probability of predicting the next target token and the expected conditional probability of predicting the next random token given the generated partial text. If set to a floating-point number less than 1, the smallest set of locally typical tokens whose probabilities add up to or exceed `typical_p` will be retained for generation. |
+| `epsilon_cutoff`     | float     | If set to a floating-point number strictly between 0 and 1, only tokens with conditional probabilities greater than `epsilon_cutoff` will be sampled. Suggested values range from {/* lint ignore unit-case */} 3e-4 to 9e-4, depending on the model size. |
+| `eta_cutoff`         | float     | Eta sampling is a hybrid of local typical sampling and epsilon sampling. If set to a floating-point number strictly between 0 and 1, a token will only be considered if it is greater than `eta_cutoff` or sqrt(`eta_cutoff`) \* exp(-entropy(softmax(next_token_logits))). Suggested values range from {/* lint ignore unit-case */} 3e-4 to 2e-3, depending on the model size. |
+| `repetition_penalty` | float     | Parameter for repetition penalty. 1.0 means no penalty. |

For more parameters, please refer to [Text Generation Parameter Configuration](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig).
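To make the role of these generation parameters concrete, the Python sketch below assembles a V2-style (Open Inference Protocol, as used by the Seldon MLServer runtime) request body that carries the preset parameters alongside the prompt. This is a hedged sketch: the input tensor name `text`, and whether a given runtime honors request-level `parameters`, are assumptions — the authoritative payload shape is the one shown in the service's Call Example tab.

```python
# Hedged sketch: build a V2-style inference request carrying text-generation
# parameters. The tensor name "text" and the parameter passthrough are
# illustrative; consult the service's "Call Example" tab for the real protocol.
import json

def build_generation_request(prompt: str, **gen_params) -> dict:
    """Assemble a V2-style request body with optional generation parameters."""
    allowed = {"do_sample", "max_new_tokens", "repetition_penalty",
               "temperature", "top_k", "top_p", "use_cache"}
    unknown = set(gen_params) - allowed
    if unknown:
        raise ValueError(f"unsupported generation parameters: {sorted(unknown)}")
    return {
        "inputs": [{
            "name": "text",        # input tensor name expected by the runtime (assumed)
            "shape": [1],
            "datatype": "BYTES",
            "data": [prompt],
        }],
        "parameters": gen_params,  # runtime-specific extra parameters
    }

body = build_generation_request(
    "Once upon a time",
    do_sample=True, max_new_tokens=64, temperature=0.8, top_k=50, top_p=0.95,
)
print(json.dumps(body, indent=2))
```

Restricting the keyword arguments to the preset parameter names above catches typos before the request ever reaches the service; drop the check if you rely on custom parameters.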
@@ -226,19 +234,19 @@ For more parameters, please refer to [Text Generation Parameter Configuration](h ##### Preset Parameters -| Parameter | Data Type | Description | -|---|---|---| -| `num_inference_steps` | int | The number of denoising steps. More denoising steps usually result in higher quality images but slower inference. | -| `use_cache` | bool | Whether to use the intermediate results calculated by the model during the generation process. | +| Parameter | Data Type | Description | +| --------------------- | --------- | ----------------------------------------------------------------------------------------------------------------- | +| `num_inference_steps` | int | The number of denoising steps. More denoising steps usually result in higher quality images but slower inference. | +| `use_cache` | bool | Whether to use the intermediate results calculated by the model during the generation process. | ##### Other Parameters -| Parameter | Data Type | Description | -|---|---|---| -| `height` | int | The height of the generated image, in pixels. | -| `width` | int | The width of the generated image, in pixels. | -| `guidance_scale` | float | Used to adjust the balance between quality and diversity of the generated image. Larger values increase diversity but reduce quality; suggested range is 7 to 8.5. | -| `negative_prompt` | str or List[str] | Used to guide what content should not be included in the image generation. | +| Parameter | Data Type | Description | +| ----------------- | ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `height` | int | The height of the generated image, in pixels. | +| `width` | int | The width of the generated image, in pixels. | +| `guidance_scale` | float | Used to adjust the balance between quality and diversity of the generated image. 
Larger values make the generated image adhere more closely to the prompt but reduce diversity, possibly at the expense of image quality; suggested range is 7 to 8.5. |
+| `negative_prompt` | str or List[str] | Used to guide what content should not be included in the image generation. |

For more parameters, please refer to [Text-to-Image Parameter Configuration](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline).

@@ -246,10 +254,10 @@ For more parameters, please refer to [Text-to-Image Parameter Configuration](htt

##### Preset Parameters

-| Parameter | Data Type | Description |
-|---|---|---|
-| `top_k` | int | The number of top-scoring type labels. If the provided number is None or higher than the number of labels available in the model configuration, the default is to return the number of labels. |
-| `use_cache` | bool | Whether to use the intermediate results calculated by the model during the generation process. |
+| Parameter   | Data Type | Description |
+| ----------- | --------- | ----------- |
+| `top_k`     | int       | The number of top-scoring type labels. If the provided number is None or higher than the number of labels available in the model configuration, the default is to return the number of labels. |
+| `use_cache` | bool      | Whether to use the intermediate results calculated by the model during the generation process. |

For more parameters, please refer to [Text Classification Parameter Configuration](https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/pipelines#transformers.TextClassificationPipeline.__call__)
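As a rough end-to-end illustration of the HTTP API call described above, the standard-library Python sketch below posts a single text input (with a `top_k` parameter) to a classification service's V2 infer endpoint. `BASE_URL`, `MODEL_NAME`, and the input tensor name are placeholders — replace them with the values shown on the service's Access Method tab, and note that this payload shape applies to services published with the mlserver runtime.

```python
# Hedged sketch of calling a published inference service over HTTP using the
# V2 inference protocol (as served by the Seldon MLServer runtime).
# BASE_URL, MODEL_NAME, and the tensor name "text" are placeholders: copy the
# real values from the service's "Access Method" tab.
import json
import urllib.request

BASE_URL = "http://<inference-service-host>"   # placeholder host
MODEL_NAME = "my-text-classifier"              # placeholder model name

def build_payload(text: str, top_k: int = 1) -> dict:
    """V2-protocol request body for a single text input."""
    return {
        "inputs": [{
            "name": "text",        # tensor name expected by the runtime (assumed)
            "shape": [1],
            "datatype": "BYTES",
            "data": [text],
        }],
        "parameters": {"top_k": top_k},
    }

def classify(text: str, top_k: int = 1) -> dict:
    """POST the request to the V2 infer endpoint and decode the JSON reply."""
    req = urllib.request.Request(
        f"{BASE_URL}/v2/models/{MODEL_NAME}/infer",
        data=json.dumps(build_payload(text, top_k)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())

# Example (requires a reachable inference service):
# print(classify("This product works great!", top_k=3))
```

For out-of-cluster access, substitute the external URL from the Access Method tab and add any authentication headers your platform requires.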