docs: Add missing plugin documentation for resource-strategy-fit, capacity,… #440
Open: shiverse94 wants to merge 2 commits into volcano-sh:master from shiverse94:shive-plugins
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,180 @@ | ||
| +++ | ||
| title = "Capacity Plugin" | ||
|
|
||
| date = 2025-01-21 | ||
| lastmod = 2025-01-21 | ||
|
|
||
| draft = false # Is this a draft? true/false | ||
| toc = true # Show table of contents? true/false | ||
| type = "docs" # Do not modify. | ||
|
|
||
| # Add menu entry to sidebar. | ||
| linktitle = "Capacity" | ||
| [menu.plugins] | ||
| weight = 2 | ||
| +++ | ||
|
|
||
| ### Capacity | ||
|
|
||
| #### Overview | ||
|
|
||
| The Capacity plugin manages queue resource allocation using a capacity-based model. It enforces queue capacity limits, guarantees minimum resource allocations, and supports hierarchical queue structures. The plugin calculates each queue's deserved resources based on its capacity, guarantee, and the cluster's total available resources. | ||
|
|
||
| #### Features | ||
|
|
||
| - **Queue Capacity Management**: Enforces queue capacity limits based on configured capability | ||
| - **Resource Guarantees**: Supports minimum resource guarantees for queues | ||
| - **Hierarchical Queues**: Supports hierarchical queue structures with parent-child relationships | ||
| - **Dynamic Resource Allocation**: Calculates deserved resources dynamically based on queue configuration | ||
| - **Resource Reclamation**: Supports resource reclamation from queues exceeding their capacity | ||
| - **Job Enqueue Control**: Validates resource availability before allowing jobs to be enqueued | ||
|
|
||
| #### Configuration | ||
|
|
||
| The Capacity plugin is configured through Queue resources. Here's an example: | ||
|
|
||
| ```yaml | ||
| apiVersion: scheduling.volcano.sh/v1beta1 | ||
| kind: Queue | ||
| metadata: | ||
| name: queue-capacity-example | ||
| spec: | ||
| weight: 1 | ||
| capability: | ||
| cpu: "100" | ||
| memory: "100Gi" | ||
| guarantee: | ||
| resource: | ||
| cpu: "20" | ||
| memory: "20Gi" | ||
| deserved: | ||
| cpu: "50" | ||
| memory: "50Gi" | ||
| ``` | ||
|
|
||
| ##### Queue Configuration Fields | ||
|
|
||
| - **capability**: Maximum resources the queue can consume | ||
| - **guarantee**: Minimum resources guaranteed to the queue | ||
| - **deserved**: Desired resource allocation for the queue (calculated automatically if not specified) | ||
| - **parent**: Parent queue name for hierarchical queue structures | ||
|
|
||
| ##### Hierarchical Queue Configuration | ||
|
|
||
| ```yaml | ||
| apiVersion: scheduling.volcano.sh/v1beta1 | ||
| kind: Queue | ||
| metadata: | ||
| name: root-queue | ||
| spec: | ||
| weight: 1 | ||
| capability: | ||
| cpu: "1000" | ||
| memory: "1000Gi" | ||
| --- | ||
| apiVersion: scheduling.volcano.sh/v1beta1 | ||
| kind: Queue | ||
| metadata: | ||
| name: child-queue | ||
| spec: | ||
| parent: root-queue | ||
| weight: 1 | ||
| capability: | ||
| cpu: "500" | ||
| memory: "500Gi" | ||
| guarantee: | ||
| resource: | ||
| cpu: "100" | ||
| memory: "100Gi" | ||
| ``` | ||
|
|
||
| #### How It Works | ||
|
|
||
| 1. **Capacity Calculation**: The plugin calculates each queue's real capacity by considering the total cluster resources, total guarantees, and the queue's own guarantee and capability. | ||
| 2. **Deserved Resources**: Deserved resources are calculated based on the queue's real capacity and configured deserved values. | ||
| 3. **Job Enqueue**: Before a job is enqueued, the plugin validates that the queue has sufficient capacity to accommodate the job's minimum resource requirements. | ||
| 4. **Resource Allocation**: During scheduling, the plugin ensures that queues don't exceed their allocated capacity. | ||
| 5. **Reclamation**: Queues that exceed their deserved resources can have tasks reclaimed to make room for other queues. | ||
|
|
||
| #### Scenario | ||
|
|
||
| The Capacity plugin is suitable for: | ||
|
|
||
| - **Resource Quota Management**: Enforcing resource limits per queue or department | ||
| - **Multi-tenant Clusters**: Isolating resources between different tenants or teams | ||
| - **Resource Reservations**: Guaranteeing minimum resources for critical workloads | ||
| - **Hierarchical Organizations**: Organizations with nested resource allocation structures | ||
|
|
||
| #### Examples | ||
|
|
||
| ##### Example 1: Basic Capacity Management | ||
|
|
||
| ```yaml | ||
| apiVersion: scheduling.volcano.sh/v1beta1 | ||
| kind: Queue | ||
| metadata: | ||
| name: team-a | ||
| spec: | ||
| weight: 1 | ||
| capability: | ||
| cpu: "200" | ||
| memory: "200Gi" | ||
| nvidia.com/gpu: "8" | ||
| guarantee: | ||
| resource: | ||
| cpu: "50" | ||
| memory: "50Gi" | ||
| nvidia.com/gpu: "2" | ||
| ``` | ||
|
|
||
| ##### Example 2: Hierarchical Capacity | ||
|
|
||
| ```yaml | ||
| # Root queue | ||
| apiVersion: scheduling.volcano.sh/v1beta1 | ||
| kind: Queue | ||
| metadata: | ||
| name: root | ||
| spec: | ||
| weight: 1 | ||
| capability: | ||
| cpu: "1000" | ||
| memory: "1000Gi" | ||
|
|
||
| --- | ||
| # Development queue | ||
| apiVersion: scheduling.volcano.sh/v1beta1 | ||
| kind: Queue | ||
| metadata: | ||
| name: dev | ||
| spec: | ||
| parent: root | ||
| weight: 1 | ||
| capability: | ||
| cpu: "300" | ||
| memory: "300Gi" | ||
|
|
||
| --- | ||
| # Production queue | ||
| apiVersion: scheduling.volcano.sh/v1beta1 | ||
| kind: Queue | ||
| metadata: | ||
| name: prod | ||
| spec: | ||
| parent: root | ||
| weight: 1 | ||
| capability: | ||
| cpu: "500" | ||
| memory: "500Gi" | ||
| guarantee: | ||
| resource: | ||
| cpu: "200" | ||
| memory: "200Gi" | ||
| ``` | ||
|
|
||
| #### Notes | ||
|
|
||
| - When hierarchical queues are enabled, only leaf queues can allocate tasks | ||
| - Queues without a capacity configuration are treated as best-effort queues | ||
| - The plugin automatically calculates real capacity considering parent queue constraints | ||
| - Resource guarantees cannot exceed queue capabilities | ||
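
The enqueue check described under "How It Works" and the note that guarantees cannot exceed capabilities can be illustrated with a minimal Python sketch. This is not Volcano's implementation (the scheduler itself is written in Go); the `Queue` dataclass and the `fits`, `validate_queue`, and `can_enqueue` helpers are hypothetical names, and resources are simplified to plain numbers keyed by resource name.

```python
from dataclasses import dataclass, field
from typing import Dict

# Resource name -> quantity, simplified to floats (e.g. CPU cores, memory in Gi).
Resources = Dict[str, float]


def fits(request: Resources, limit: Resources) -> bool:
    """Return True if every requested quantity is within the limit.

    Assumption for this sketch only: a resource missing from `limit`
    is treated as unlimited.
    """
    return all(
        amount <= limit.get(name, float("inf"))
        for name, amount in request.items()
    )


@dataclass
class Queue:
    """Hypothetical stand-in for the Queue fields shown in the YAML examples."""
    name: str
    capability: Resources
    guarantee: Resources = field(default_factory=dict)
    allocated: Resources = field(default_factory=dict)


def validate_queue(queue: Queue) -> None:
    """Reject a queue whose guarantee exceeds its capability."""
    if not fits(queue.guarantee, queue.capability):
        raise ValueError(f"queue {queue.name}: guarantee exceeds capability")


def can_enqueue(queue: Queue, min_request: Resources) -> bool:
    """Check that current usage plus the job's minimum request stays within capability."""
    projected = {
        name: queue.allocated.get(name, 0.0) + amount
        for name, amount in min_request.items()
    }
    return fits(projected, queue.capability)


# Values mirror the team-a queue from Example 1 (memory expressed in Gi).
team_a = Queue(
    name="team-a",
    capability={"cpu": 200.0, "memory": 200.0, "nvidia.com/gpu": 8.0},
    guarantee={"cpu": 50.0, "memory": 50.0, "nvidia.com/gpu": 2.0},
    allocated={"cpu": 180.0},
)
validate_queue(team_a)
```

With 180 CPUs already allocated, `can_enqueue(team_a, {"cpu": 16.0})` passes while a request for 32 CPUs does not. The real plugin also accounts for parent queues and deserved resources, which this sketch leaves out.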
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,193 @@ | ||
| +++ | ||
| title = "Device Share Plugin" | ||
|
|
||
| date = 2025-01-21 | ||
| lastmod = 2025-01-21 | ||
|
|
||
| draft = false # Is this a draft? true/false | ||
| toc = true # Show table of contents? true/false | ||
| type = "docs" # Do not modify. | ||
|
|
||
| # Add menu entry to sidebar. | ||
| linktitle = "Device Share" | ||
| [menu.plugins] | ||
| weight = 3 | ||
| +++ | ||
|
|
||
| ### Device Share | ||
|
|
||
| #### Overview | ||
|
|
||
| The Device Share plugin manages the sharing and allocation of device resources such as GPUs, NPUs, and other accelerators. It supports multiple device types, including NVIDIA GPUs (both GPU sharing and vGPU) and Ascend NPUs, and provides flexible scheduling policies for device allocation. The plugin enables efficient utilization of expensive accelerator resources through sharing capabilities. | ||
|
|
||
| #### Features | ||
|
|
||
| - **GPU Sharing**: Enable sharing of GPU resources among multiple pods | ||
| - **GPU Number**: Schedule based on the number of GPUs requested | ||
| - **vGPU Support**: Support for virtual GPU (vGPU) allocation | ||
| - **Ascend NPU Support**: Support for Ascend NPU devices including MindCluster VNPU and HAMi VNPU | ||
| - **Node Locking**: Optional node-level locking to prevent concurrent device allocations | ||
| - **Flexible Scheduling Policies**: Configurable scoring policies for device allocation | ||
| - **Batch Node Scoring**: Support for batch scoring of nodes for NPU devices | ||
|
|
||
| #### Configuration | ||
|
|
||
| The Device Share plugin can be configured with the following arguments: | ||
|
|
||
| ```yaml | ||
| actions: "allocate, backfill" | ||
| tiers: | ||
| - plugins: | ||
| - name: deviceshare | ||
| arguments: | ||
| deviceshare.GPUSharingEnable: true | ||
| deviceshare.GPUNumberEnable: false | ||
| deviceshare.VGPUEnable: false | ||
| deviceshare.NodeLockEnable: false | ||
| deviceshare.SchedulePolicy: "binpack" | ||
| deviceshare.ScheduleWeight: 10 | ||
| deviceshare.AscendMindClusterVNPUEnable: false | ||
| deviceshare.AscendHAMiVNPUEnable: false | ||
| deviceshare.KnownGeometriesCMName: "volcano-vgpu-device-config" | ||
| deviceshare.KnownGeometriesCMNamespace: "kube-system" | ||
| ``` | ||
|
|
||
| ##### Configuration Parameters | ||
|
|
||
| - **deviceshare.GPUSharingEnable** (bool): Enable GPU sharing mode | ||
| - **deviceshare.GPUNumberEnable** (bool): Enable GPU number-based scheduling (mutually exclusive with GPUSharingEnable) | ||
| - **deviceshare.VGPUEnable** (bool): Enable vGPU support (mutually exclusive with GPU sharing) | ||
| - **deviceshare.NodeLockEnable** (bool): Enable node-level locking for device allocation | ||
| - **deviceshare.SchedulePolicy** (string): Scheduling policy for device scoring (e.g., "binpack", "spread") | ||
| - **deviceshare.ScheduleWeight** (int): Weight for device scoring in node ordering | ||
| - **deviceshare.AscendMindClusterVNPUEnable** (bool): Enable Ascend MindCluster VNPU support | ||
| - **deviceshare.AscendHAMiVNPUEnable** (bool): Enable Ascend HAMi VNPU support | ||
| - **deviceshare.KnownGeometriesCMName** (string): ConfigMap name for vGPU geometries | ||
| - **deviceshare.KnownGeometriesCMNamespace** (string): Namespace for vGPU geometries ConfigMap | ||
|
|
||
| #### Device Types | ||
|
|
||
| ##### NVIDIA GPU Sharing | ||
|
|
||
| Enable GPU sharing to allow multiple pods to share a single GPU: | ||
|
|
||
| ```yaml | ||
| - name: deviceshare | ||
| arguments: | ||
| deviceshare.GPUSharingEnable: true | ||
| deviceshare.ScheduleWeight: 10 | ||
| ``` | ||
|
|
||
| Pods request GPU resources using: | ||
|
|
||
| ```yaml | ||
| resources: | ||
| requests: | ||
| nvidia.com/gpu: 2 # Request 2 GPU units (out of 100 per GPU) | ||
| limits: | ||
| nvidia.com/gpu: 2 | ||
| ``` | ||
|
|
||
| ##### NVIDIA GPU Number | ||
|
|
||
| Schedule based on the number of physical GPUs: | ||
|
|
||
| ```yaml | ||
| - name: deviceshare | ||
| arguments: | ||
| deviceshare.GPUNumberEnable: true | ||
| deviceshare.ScheduleWeight: 10 | ||
| ``` | ||
|
|
||
| Pods request whole GPUs: | ||
|
|
||
| ```yaml | ||
| resources: | ||
| requests: | ||
| nvidia.com/gpu: 1 # Request 1 whole GPU | ||
| limits: | ||
| nvidia.com/gpu: 1 | ||
| ``` | ||
|
|
||
| ##### vGPU | ||
|
|
||
| Enable virtual GPU support: | ||
|
|
||
| ```yaml | ||
| - name: deviceshare | ||
| arguments: | ||
| deviceshare.VGPUEnable: true | ||
| deviceshare.ScheduleWeight: 10 | ||
| deviceshare.KnownGeometriesCMName: "volcano-vgpu-device-config" | ||
| deviceshare.KnownGeometriesCMNamespace: "kube-system" | ||
| ``` | ||
|
|
||
| ##### Ascend NPU | ||
|
|
||
| Enable Ascend NPU support: | ||
|
|
||
| ```yaml | ||
| - name: deviceshare | ||
| arguments: | ||
| deviceshare.AscendMindClusterVNPUEnable: true | ||
| # or | ||
| deviceshare.AscendHAMiVNPUEnable: true | ||
| deviceshare.ScheduleWeight: 10 | ||
| ``` | ||
|
|
||
| #### Scenario | ||
|
|
||
| The Device Share plugin is suitable for: | ||
|
|
||
| - **GPU Clusters**: Clusters with NVIDIA GPU resources requiring efficient sharing | ||
| - **AI Training**: Machine learning training workloads requiring GPU acceleration | ||
| - **Multi-tenant GPU Sharing**: Environments where multiple users need access to GPU resources | ||
| - **NPU Workloads**: Workloads running on Ascend NPU devices | ||
| - **Cost Optimization**: Maximizing utilization of expensive accelerator hardware | ||
|
|
||
| #### Examples | ||
|
|
||
| ##### Example 1: GPU Sharing for Small Workloads | ||
|
|
||
| Configure GPU sharing for workloads that don't require full GPU resources: | ||
|
|
||
| ```yaml | ||
| - name: deviceshare | ||
| arguments: | ||
| deviceshare.GPUSharingEnable: true | ||
| deviceshare.SchedulePolicy: "binpack" | ||
| deviceshare.ScheduleWeight: 10 | ||
| ``` | ||
|
|
||
| ##### Example 2: Whole GPU Allocation | ||
|
|
||
| Configure for workloads requiring full GPU resources: | ||
|
|
||
| ```yaml | ||
| - name: deviceshare | ||
| arguments: | ||
| deviceshare.GPUNumberEnable: true | ||
| deviceshare.SchedulePolicy: "spread" | ||
| deviceshare.ScheduleWeight: 10 | ||
| ``` | ||
|
|
||
| ##### Example 3: vGPU with Custom ConfigMap | ||
|
|
||
| Configure vGPU with custom geometry configuration: | ||
|
|
||
| ```yaml | ||
| - name: deviceshare | ||
| arguments: | ||
| deviceshare.VGPUEnable: true | ||
| deviceshare.ScheduleWeight: 10 | ||
| deviceshare.KnownGeometriesCMName: "custom-vgpu-config" | ||
| deviceshare.KnownGeometriesCMNamespace: "gpu-system" | ||
| ``` | ||
|
|
||
| #### Notes | ||
|
|
||
| - GPU sharing and GPU number modes are mutually exclusive | ||
| - GPU sharing and vGPU cannot be enabled simultaneously | ||
| - Node locking prevents race conditions in device allocation | ||
| - The plugin automatically registers supported devices based on configuration | ||
| - Batch scoring is used for NPU devices to optimize allocation decisions |
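
The mutual-exclusion rules listed in the notes can be checked before a scheduler configuration is applied. The following Python sketch is an illustration under stated assumptions, not part of Volcano: `validate_deviceshare_args` is a hypothetical helper, and it encodes only the two constraints stated in the documentation above, using the argument keys from the configuration example.

```python
from typing import Mapping


def validate_deviceshare_args(args: Mapping[str, object]) -> None:
    """Raise ValueError if the deviceshare arguments combine exclusive modes.

    Only the two rules stated in the plugin notes are enforced here;
    unset flags are assumed to default to False.
    """
    sharing = bool(args.get("deviceshare.GPUSharingEnable", False))
    number = bool(args.get("deviceshare.GPUNumberEnable", False))
    vgpu = bool(args.get("deviceshare.VGPUEnable", False))

    if sharing and number:
        raise ValueError("GPU sharing and GPU number modes are mutually exclusive")
    if sharing and vgpu:
        raise ValueError("GPU sharing and vGPU cannot be enabled simultaneously")


# Arguments taken from Example 1 (GPU sharing with binpack scoring).
example_1 = {
    "deviceshare.GPUSharingEnable": True,
    "deviceshare.SchedulePolicy": "binpack",
    "deviceshare.ScheduleWeight": 10,
}
validate_deviceshare_args(example_1)  # passes: only one mode is enabled
```

Running a check like this in CI over the scheduler ConfigMap would surface a conflicting combination before the scheduler starts, rather than at session open.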
For better clarity and consistency with other parts of the documentation that refer to configuration fields, consider formatting `capability` as code to indicate it's a specific field name.