
proposal(scheduler): cross-dimensional resource balance scoring to prevent GPU fragmentation #1373

@0x-auth

Description

What would you like to be added?

A cross-dimensional resource balance scoring plugin that detects and prevents resource fragmentation at scheduling time, rather than fixing it after the fact.

Currently, KAI supports bin-packing (fill nodes fully) and spread (distribute evenly), but neither considers the shape of resource usage across dimensions. A node can sit at 95% GPU-Memory utilization but only 15% GPU-Compute — bin-packing sees it as "mostly full" and spread sees it as "partially used." Both are wrong: that node has stranded compute.

The proposed plugin scores nodes by measuring how much more balanced they become after placing a pod, using cosine alignment between the pod's request vector and the node's free capacity vector, combined with Shannon entropy reduction.
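As a rough illustration (the vectors and numbers here are made up, not from any KAI code), the alignment term rewards pods whose request shape matches the node's free-capacity shape — which is exactly how it catches the jagged node described above:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-negative resource vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Jagged node from the example above: GPU-Memory 95% used, GPU-Compute 15% used,
# so its free capacity (as fractions) is [compute, memory] = [0.85, 0.05].
node_free = [0.85, 0.05]

compute_heavy_pod = [0.50, 0.05]   # wants mostly GPU compute -> good shape match
memory_heavy_pod  = [0.05, 0.50]   # wants mostly GPU memory  -> poor shape match

print(cosine_similarity(compute_heavy_pod, node_free))  # close to 1.0
print(cosine_similarity(memory_heavy_pod, node_free))   # well below 0.5
```

A compute-heavy pod "points the same way" as the node's free capacity and scores near 1, while a memory-heavy pod scores low — so the scheduler steers it toward a node whose free memory can actually absorb it.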

Why is this needed?

In GPU clusters running mixed inference workloads, resource fragmentation happens across dimensions, not just within them:

  • Large model serving (LLaMA 70B, DeepSeek): VRAM full, GPU compute ~20% → compute is stranded
  • Many small models (batch inference): GPU compute saturated, VRAM ~30% used → memory is stranded
  • Mixed CPU+GPU nodes: CPU 90%, GPU 40% → GPU capacity wasted

This is the "jagged cluster" problem described in #1311, but caught at scheduling time instead of after the fact. Prevention > cure.

With the vectorized resource representation landing (#1353), adding a vector-based scoring plugin becomes natural — the infrastructure is already there.

Proposed approach

Score each node using a 6D resource vector [CPU, Memory, GPU-Compute, GPU-Memory, IOPS, Network]:

alignment     = cosine_similarity(pod_request, node_free_capacity)
exhaustion    = φ × entropy_reduction(node_before, node_after) + utilization_delta
leak_penalty  = 0.15 × count(stranded_dimensions)

score = φ × alignment + exhaustion - leak_penalty + headroom × 0.3

Where φ ≈ 1.618 (the golden ratio) provides a fixed, parameter-free weighting — no per-cluster tuning needed.
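A minimal sketch of the scoring function, assuming all vectors are 6D fractions of node capacity ([CPU, Memory, GPU-Compute, GPU-Memory, IOPS, Network]). The φ, 0.15, and 0.3 constants come from the formula above; the concrete definitions of entropy reduction, utilization delta, stranded dimensions, and headroom are illustrative assumptions, not the actual plugin:

```python
import math

PHI = 1.618  # golden ratio, as in the proposal

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def shannon_entropy(util):
    """Entropy of the utilization distribution across dimensions.
    Higher entropy = more even (balanced) usage."""
    total = sum(util)
    if total == 0:
        return 0.0
    probs = [u / total for u in util if u > 0]
    return -sum(p * math.log(p) for p in probs)

def score_node(pod_request, node_used, stranded_threshold=0.8):
    """Score one candidate node for one pod (higher is better)."""
    free = [max(0.0, 1.0 - u) for u in node_used]
    after = [u + r for u, r in zip(node_used, pod_request)]
    if any(a > 1.0 for a in after):
        return float("-inf")  # pod does not fit on this node

    alignment = cosine_similarity(pod_request, free)

    # Illustrative reading of "entropy_reduction": positive when the
    # placement makes cross-dimensional usage more even.
    entropy_gain = shannon_entropy(after) - shannon_entropy(node_used)
    utilization_delta = sum(pod_request) / len(pod_request)
    exhaustion = PHI * entropy_gain + utilization_delta

    # Assumed stranded-dimension rule: some dimension is nearly full
    # while this one is left idle.
    stranded = sum(1 for a in after
                   if a < 0.2 and max(after) > stranded_threshold)
    leak_penalty = 0.15 * stranded

    headroom = min(free)  # illustrative: worst-case remaining slack
    return PHI * alignment + exhaustion - leak_penalty + headroom * 0.3
```

Under these assumptions, a compute-heavy pod scores higher on a node with stranded GPU compute (VRAM nearly full, compute idle) than on a node whose compute is already saturated — the prevention-at-scheduling-time behavior the proposal describes.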

Benchmark (synthetic, 50 nodes × 500 pods): 13% improvement in resource balance vs LeastAllocated, zero stranded nodes in 4/5 scenarios. Scoring latency: ~115ns/op, zero allocations.

This has also been proposed as a Koordinator scheduler plugin (koordinator-sh/koordinator#2839). A companion open-source auditor that detects existing imbalance is available at github.com/0x-auth/lambda-g-auditor.

Happy to contribute the implementation if there's interest. The scoring function is stateless and fits naturally into the existing GpuOrderFn plugin pattern.

Metadata

Assignees: no one assigned
Labels: enhancement (New feature or request)
Projects: none
Milestone: none
Relationships: none yet
Development: no branches or pull requests