Add L3 cache-aware CPU isolation and full L3 cache claims to CPU DRA #104

@SchSeba

Summary

Support L3 cache-aware CPU allocation in CPU DRA.
Users should be able to request an entire L3 cache domain as a resource. When such a request is allocated, all CPUs that belong to that L3 cache should be treated as isolated for that workload.

Motivation

For latency-sensitive and packet-processing workloads, CPU isolation alone is not always sufficient. Two workloads can still interfere with each other if they share the same L3 cache, even when they do not share the exact same CPUs.
We need a way to express cache-domain exclusivity, not just CPU-count exclusivity.

Requested behavior

Two related behaviors are needed.

1. Full L3 cache request

A user can request one full L3 cache domain.
When that request is allocated:

  • the allocation returns the CPUs that belong to that L3 cache domain
  • all CPUs in that domain are treated as isolated for that workload
  • other workloads cannot allocate CPUs from that same L3 cache domain
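For illustration, here is a minimal sketch of how the CPUs of each L3 cache domain could be derived. It assumes the driver has already read, for every CPU, the Linux `shared_cpu_list` string of its L3 cache (on most systems `/sys/devices/system/cpu/cpuN/cache/index3/shared_cpu_list`, though the `level` file should confirm that `index3` really is L3). The function names and the sample topology are hypothetical and not part of the CPU DRA driver.

```python
def parse_cpu_list(text):
    """Parse a Linux-style CPU list such as "0-3,8-11" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def group_by_l3(shared_cpu_lists):
    """Group CPUs into L3 domains, keyed by the lowest CPU id in each domain.

    shared_cpu_lists maps a CPU id to the shared_cpu_list string of its L3 cache.
    """
    domains = {}
    for cpu_list in shared_cpu_lists.values():
        cpus = parse_cpu_list(cpu_list)
        if cpus:
            domains[cpus[0]] = cpus
    return domains


# Hypothetical topology: two L3 domains, four cores each, SMT siblings at +8.
topology = {cpu: "0-3,8-11" for cpu in (0, 1, 2, 3, 8, 9, 10, 11)}
topology.update({cpu: "4-7,12-15" for cpu in (4, 5, 6, 7, 12, 13, 14, 15)})
domains = group_by_l3(topology)
```

A full L3 request would then return one entry of `domains`, which already includes the SMT siblings that share the cache.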

2. Partial CPU request blocks future full-cache allocation

If a workload allocates even a single CPU from a given L3 cache domain, that L3 cache domain should no longer be eligible for a later "full L3 cache" allocation for another workload.
The inverse should also be true:

  • once a full L3 cache domain is allocated, later per-CPU allocations from that domain must be blocked for other workloads

This is needed to avoid placing unrelated applications on CPUs that would still share the same L3 cache and impact latency / determinism.
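The two blocking rules can be sketched with a small in-memory allocator. This is only an illustration of the requested semantics, assuming a hypothetical `L3Allocator` class and workload names. It is not how the driver or the scheduler tracks allocations today.

```python
class L3Allocator:
    """Tracks per-CPU ownership and enforces full-L3 exclusivity."""

    def __init__(self, domains):
        # domains: dict of domain id -> list of CPU ids sharing that L3 cache
        self.domains = {d: list(cpus) for d, cpus in domains.items()}
        self.cpu_owner = {}  # cpu id -> workload name

    def allocate_cpu(self, workload, cpu):
        """Allocate one CPU; fails if it is already owned.

        CPUs inside a fully claimed domain are all owned by the claiming
        workload, so per-CPU requests into that domain fail here.
        """
        if cpu in self.cpu_owner:
            return False
        self.cpu_owner[cpu] = workload
        return True

    def allocate_full_domain(self, workload):
        """Allocate the first L3 domain in which no CPU is owned yet."""
        for domain_id, cpus in self.domains.items():
            if not any(cpu in self.cpu_owner for cpu in cpus):
                for cpu in cpus:
                    self.cpu_owner[cpu] = workload
                return domain_id, list(cpus)
        return None  # every domain is at least partially in use


alloc = L3Allocator({0: [0, 1, 8, 9], 1: [2, 3, 10, 11]})
alloc.allocate_cpu("app-a", 0)               # domain 0 is now partially used
full = alloc.allocate_full_domain("app-b")   # skips domain 0, takes domain 1
blocked = alloc.allocate_cpu("app-c", 2)     # CPU 2 is in domain 1: rejected
```

In this sketch a single allocated CPU is enough to disqualify a domain from full-cache claims, and a full-cache claim marks every CPU in the domain as owned, which covers both directions of the rule.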

Why the new shared consumable capacity KEP looks relevant

KEP-5075: DRA Consumable Capacity looks like a strong planning reference for this feature.
It seems like a good fit for modeling each L3 cache domain as a DRA allocation domain with shared capacity:

  • a full L3 request can consume the entire capacity of that cache domain
  • an individual CPU allocation can consume part of the same domain capacity
  • once part of the domain is consumed, a later request for the full domain becomes unschedulable
  • once the full domain is consumed, later per-CPU allocations from that domain are also blocked

That is very close to the behavior we want.
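To make the capacity idea concrete, here is a rough accounting sketch. It does not use the KEP-5075 API. The `L3Domain` type and the choice of one capacity unit per CPU are assumptions made purely to show why both blocking directions fall out of simple capacity arithmetic.

```python
from dataclasses import dataclass


@dataclass
class L3Domain:
    name: str
    capacity: int      # assumed: one unit per CPU in the domain
    consumed: int = 0

    def consume(self, units):
        """Consume units only if they fit in the remaining capacity."""
        if units <= 0 or self.consumed + units > self.capacity:
            return False
        self.consumed += units
        return True

    def consume_cpus(self, count=1):
        """A per-CPU allocation consumes one unit per CPU."""
        return self.consume(count)

    def consume_full(self):
        """A full-L3 request asks for the whole capacity, so it only
        fits while nothing in the domain has been consumed."""
        return self.consume(self.capacity)


shared = L3Domain("l3-0", capacity=8)
shared.consume_cpus(1)                 # one CPU taken from l3-0
full_after_partial = shared.consume_full()    # 1 + 8 > 8, does not fit

exclusive = L3Domain("l3-1", capacity=8)
exclusive.consume_full()               # whole domain consumed
cpu_after_full = exclusive.consume_cpus(1)    # 8 + 1 > 8, does not fit
```

One gap this model leaves open: capacity alone does not record which workload consumed it, so mapping consumed units back to specific CPUs would still need to be handled by the driver.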

Acceptance criteria

  • a workload can request a full L3 cache domain as a resource
  • the allocation result exposes the selected L3 cache domain and the CPUs that belong to it
  • all CPUs in that L3 cache domain are treated as isolated for that workload
  • if any CPU in an L3 domain is already allocated, that domain is excluded from future full-domain claims
  • if a full L3 domain is allocated, future per-CPU allocations from that domain are excluded
  • the behavior is clearly defined across NUMA nodes, sockets, and SMT topologies
