content/en/docs/plugins/capacity.md
+++
title = "Capacity Plugin"

date = 2025-01-21
lastmod = 2025-01-21

draft = false # Is this a draft? true/false
toc = true # Show table of contents? true/false
type = "docs" # Do not modify.

# Add menu entry to sidebar.
linktitle = "Capacity"
[menu.plugins]
weight = 2
+++

### Capacity

#### Overview

The Capacity plugin manages queue resource allocation using a capacity-based model. It enforces queue capacity limits, guarantees minimum resource allocations, and supports hierarchical queue structures. The plugin calculates each queue's deserved resources based on its capacity, guarantee, and the cluster's total available resources.

#### Features

- **Queue Capacity Management**: Enforces queue capacity limits based on the configured `capability`
- **Resource Guarantees**: Supports minimum resource guarantees for queues
- **Hierarchical Queues**: Supports hierarchical queue structures with parent-child relationships
- **Dynamic Resource Allocation**: Calculates deserved resources dynamically based on queue configuration
- **Resource Reclamation**: Supports resource reclamation from queues exceeding their capacity
- **Job Enqueue Control**: Validates resource availability before allowing jobs to be enqueued

#### Configuration

The Capacity plugin is configured through Queue resources. Here's an example:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue-capacity-example
spec:
  weight: 1
  capability:
    cpu: "100"
    memory: "100Gi"
  guarantee:
    resource:
      cpu: "20"
      memory: "20Gi"
  deserved:
    cpu: "50"
    memory: "50Gi"
```

##### Queue Configuration Fields

- **capability**: Maximum resources the queue can consume
- **guarantee**: Minimum resources guaranteed to the queue
- **deserved**: Desired resource allocation for the queue (calculated automatically if not specified)
- **parent**: Parent queue name for hierarchical queue structures
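
For these fields to take effect, the `capacity` plugin must be enabled in the scheduler configuration; it is mutually exclusive with the default `proportion` plugin, so only one of the two should appear. A minimal sketch of the scheduler configuration (plugin ordering and tier placement here are assumptions to verify against your Volcano release):

```yaml
# volcano-scheduler.conf (sketch)
actions: "enqueue, allocate, backfill, reclaim"
tiers:
- plugins:
  - name: priority
  - name: gang
- plugins:
  - name: capacity    # use instead of the proportion plugin
  - name: predicates
  - name: nodeorder
```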

##### Hierarchical Queue Configuration

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: root-queue
spec:
  weight: 1
  capability:
    cpu: "1000"
    memory: "1000Gi"
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: child-queue
spec:
  parent: root-queue
  weight: 1
  capability:
    cpu: "500"
    memory: "500Gi"
  guarantee:
    resource:
      cpu: "100"
      memory: "100Gi"
```

#### How It Works

1. **Capacity Calculation**: The plugin calculates each queue's real capacity by considering the total cluster resources, total guarantees, and the queue's own guarantee and capability.
2. **Deserved Resources**: Deserved resources are calculated based on the queue's real capacity and configured deserved values.
3. **Job Enqueue**: Before a job is enqueued, the plugin validates that the queue has sufficient capacity to accommodate the job's minimum resource requirements.
4. **Resource Allocation**: During scheduling, the plugin ensures that queues don't exceed their allocated capacity.
5. **Reclamation**: Queues that exceed their deserved resources can have tasks reclaimed to make room for other queues.
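
The enqueue check in step 3 can be exercised with a Volcano Job that names a queue and declares its minimum demand via `minAvailable`; the job name, image, and resource figures below are illustrative:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: capacity-demo             # illustrative name
spec:
  schedulerName: volcano
  queue: queue-capacity-example   # queue from the configuration example above
  minAvailable: 2                 # enqueued only if the queue has room for this minimum
  tasks:
  - replicas: 2
    name: worker
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: worker
          image: busybox
          command: ["sleep", "60"]
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
```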

#### Scenario

The Capacity plugin is suitable for:

- **Resource Quota Management**: Enforcing resource limits per queue or department
- **Multi-tenant Clusters**: Isolating resources between different tenants or teams
- **Resource Reservations**: Guaranteeing minimum resources for critical workloads
- **Hierarchical Organizations**: Organizations with nested resource allocation structures

#### Examples

##### Example 1: Basic Capacity Management

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-a
spec:
  weight: 1
  capability:
    cpu: "200"
    memory: "200Gi"
    nvidia.com/gpu: "8"
  guarantee:
    resource:
      cpu: "50"
      memory: "50Gi"
      nvidia.com/gpu: "2"
```

##### Example 2: Hierarchical Capacity

```yaml
# Root queue
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: root
spec:
  weight: 1
  capability:
    cpu: "1000"
    memory: "1000Gi"
---
# Development queue
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: dev
spec:
  parent: root
  weight: 1
  capability:
    cpu: "300"
    memory: "300Gi"
---
# Production queue
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: prod
spec:
  parent: root
  weight: 1
  capability:
    cpu: "500"
    memory: "500Gi"
  guarantee:
    resource:
      cpu: "200"
      memory: "200Gi"
```
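
In a hierarchy like this, workloads are submitted to the leaf queues (`dev`, `prod`); the root queue only aggregates capacity. A minimal job targeting the `dev` leaf queue (name, image, and sizes are illustrative):

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: dev-training    # illustrative name
spec:
  schedulerName: volcano
  queue: dev            # must be a leaf queue when hierarchy is enabled
  minAvailable: 1
  tasks:
  - replicas: 1
    name: trainer
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: trainer
          image: busybox
          command: ["sleep", "30"]
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
```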

#### Notes

- When hierarchical queues are enabled, only leaf queues can allocate tasks
- Queues without a `capability` configuration are treated as best-effort queues
- The plugin automatically calculates real capacity considering parent queue constraints
- Resource guarantees cannot exceed queue capabilities
content/en/docs/plugins/deviceshare.md
+++
title = "Device Share Plugin"

date = 2025-01-21
lastmod = 2025-01-21

draft = false # Is this a draft? true/false
toc = true # Show table of contents? true/false
type = "docs" # Do not modify.

# Add menu entry to sidebar.
linktitle = "Device Share"
[menu.plugins]
weight = 3
+++

### Device Share

#### Overview

The Device Share plugin manages the sharing and allocation of device resources such as GPUs, NPUs, and other accelerators. It supports multiple device types including NVIDIA GPUs (both GPU sharing and vGPU), Ascend NPUs, and provides flexible scheduling policies for device allocation. The plugin enables efficient utilization of expensive accelerator resources through sharing capabilities.

#### Features

- **GPU Sharing**: Enable sharing of GPU resources among multiple pods
- **GPU Number**: Schedule based on the number of GPUs requested
- **vGPU Support**: Support for virtual GPU (vGPU) allocation
- **Ascend NPU Support**: Support for Ascend NPU devices including MindCluster VNPU and HAMi VNPU
- **Node Locking**: Optional node-level locking to prevent concurrent device allocations
- **Flexible Scheduling Policies**: Configurable scoring policies for device allocation
- **Batch Node Scoring**: Support for batch scoring of nodes for NPU devices

#### Configuration

The Device Share plugin can be configured with the following arguments:

```yaml
actions: "allocate, backfill"
tiers:
- plugins:
  - name: deviceshare
    arguments:
      deviceshare.GPUSharingEnable: true
      deviceshare.GPUNumberEnable: false
      deviceshare.VGPUEnable: false
      deviceshare.NodeLockEnable: false
      deviceshare.SchedulePolicy: "binpack"
      deviceshare.ScheduleWeight: 10
      deviceshare.AscendMindClusterVNPUEnable: false
      deviceshare.AscendHAMiVNPUEnable: false
      deviceshare.KnownGeometriesCMName: "volcano-vgpu-device-config"
      deviceshare.KnownGeometriesCMNamespace: "kube-system"
```

##### Configuration Parameters

- **deviceshare.GPUSharingEnable** (bool): Enable GPU sharing mode
- **deviceshare.GPUNumberEnable** (bool): Enable GPU number-based scheduling (mutually exclusive with GPUSharingEnable)
- **deviceshare.VGPUEnable** (bool): Enable vGPU support (mutually exclusive with GPU sharing)
- **deviceshare.NodeLockEnable** (bool): Enable node-level locking for device allocation
- **deviceshare.SchedulePolicy** (string): Scheduling policy for device scoring (e.g., "binpack", "spread")
- **deviceshare.ScheduleWeight** (int): Weight for device scoring in node ordering
- **deviceshare.AscendMindClusterVNPUEnable** (bool): Enable Ascend MindCluster VNPU support
- **deviceshare.AscendHAMiVNPUEnable** (bool): Enable Ascend HAMi VNPU support
- **deviceshare.KnownGeometriesCMName** (string): ConfigMap name for vGPU geometries
- **deviceshare.KnownGeometriesCMNamespace** (string): Namespace for vGPU geometries ConfigMap

#### Device Types

##### NVIDIA GPU Sharing

Enable GPU sharing to allow multiple pods to share a single GPU:

```yaml
- name: deviceshare
  arguments:
    deviceshare.GPUSharingEnable: true
    deviceshare.ScheduleWeight: 10
```

Pods request GPU resources using:

```yaml
resources:
  requests:
    nvidia.com/gpu: 2 # Request 2 GPU units (out of 100 per GPU)
  limits:
    nvidia.com/gpu: 2
```

##### NVIDIA GPU Number

Schedule based on the number of physical GPUs:

```yaml
- name: deviceshare
  arguments:
    deviceshare.GPUNumberEnable: true
    deviceshare.ScheduleWeight: 10
```

Pods request whole GPUs:

```yaml
resources:
  requests:
    nvidia.com/gpu: 1 # Request 1 whole GPU
  limits:
    nvidia.com/gpu: 1
```

##### vGPU

Enable virtual GPU support:

```yaml
- name: deviceshare
  arguments:
    deviceshare.VGPUEnable: true
    deviceshare.ScheduleWeight: 10
    deviceshare.KnownGeometriesCMName: "volcano-vgpu-device-config"
    deviceshare.KnownGeometriesCMNamespace: "kube-system"
```
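
Pods then request vGPU slices rather than whole devices. The resource names below follow the Volcano vGPU convention, but treat them as assumptions to confirm against the device-plugin version you deploy:

```yaml
resources:
  limits:
    volcano.sh/vgpu-number: 1    # one virtual GPU slice
    volcano.sh/vgpu-memory: 3000 # vGPU memory (assumed to be in MB)
```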

##### Ascend NPU

Enable Ascend NPU support:

```yaml
- name: deviceshare
  arguments:
    deviceshare.AscendMindClusterVNPUEnable: true
    # or, alternatively:
    # deviceshare.AscendHAMiVNPUEnable: true
    deviceshare.ScheduleWeight: 10
```

#### Scenario

The Device Share plugin is suitable for:

- **GPU Clusters**: Clusters with NVIDIA GPU resources requiring efficient sharing
- **AI Training**: Machine learning training workloads requiring GPU acceleration
- **Multi-tenant GPU Sharing**: Environments where multiple users need access to GPU resources
- **NPU Workloads**: Workloads running on Ascend NPU devices
- **Cost Optimization**: Maximizing utilization of expensive accelerator hardware

#### Examples

##### Example 1: GPU Sharing for Small Workloads

Configure GPU sharing for workloads that don't require full GPU resources:

```yaml
- name: deviceshare
  arguments:
    deviceshare.GPUSharingEnable: true
    deviceshare.SchedulePolicy: "binpack"
    deviceshare.ScheduleWeight: 10
```

##### Example 2: Whole GPU Allocation

Configure for workloads requiring full GPU resources:

```yaml
- name: deviceshare
  arguments:
    deviceshare.GPUNumberEnable: true
    deviceshare.SchedulePolicy: "spread"
    deviceshare.ScheduleWeight: 10
```

##### Example 3: vGPU with Custom ConfigMap

Configure vGPU with custom geometry configuration:

```yaml
- name: deviceshare
  arguments:
    deviceshare.VGPUEnable: true
    deviceshare.ScheduleWeight: 10
    deviceshare.KnownGeometriesCMName: "custom-vgpu-config"
    deviceshare.KnownGeometriesCMNamespace: "gpu-system"
```

#### Notes

- GPU sharing and GPU number modes are mutually exclusive
- GPU sharing and vGPU cannot be enabled simultaneously
- Node locking prevents race conditions in device allocation
- The plugin automatically registers supported devices based on configuration
- Batch scoring is used for NPU devices to optimize allocation decisions