🚀 Feature Description and Motivation
I'm proposing dynamic weight calculation for HTTPRoute backendRefs based on Application ReadyReplicas because the current static weight configuration causes unfair traffic distribution in multi-tenant environments.
Current Problem
In the current Arks operator implementation (internal/controller/arksendpoint_controller.go), HTTPRoute backend weights are statically set to DefaultWeight (typically 1) for all applications, regardless of their replica counts. When multiple applications with different replica counts share the same HTTPRoute endpoint, this causes resource allocation inefficiency.
Concrete Example:
Consider three applications sharing the same endpoint:
- Application A: 10 replicas, weight=1 → receives 33.3% traffic (should be 62.5%)
- Application B: 5 replicas, weight=1 → receives 33.3% traffic (should be 31.25%)
- Application C: 1 replica, weight=1 → receives 33.3% traffic (should be 6.25%)
Result: Application C (1 replica) receives roughly 5.3x its fair share of traffic and is overloaded, while Application A (10 replicas) runs at only about 53% of its fair-share load, leaving nearly half its capacity idle.
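The shares above follow directly from Gateway API weight semantics (a backend's traffic share is its weight divided by the sum of all weights). A small sketch, using a hypothetical `trafficShares` helper, makes the static-vs-dynamic contrast concrete:

```go
package main

import "fmt"

// trafficShares converts per-application weights into percentage traffic
// shares, following Gateway API semantics: share = weight / sum(weights).
func trafficShares(weights map[string]int32) map[string]float64 {
	var total int32
	for _, w := range weights {
		total += w
	}
	shares := make(map[string]float64, len(weights))
	for name, w := range weights {
		shares[name] = float64(w) / float64(total) * 100
	}
	return shares
}

func main() {
	// Static weights (all 1) vs. dynamic weights (= replica counts).
	static := trafficShares(map[string]int32{"A": 1, "B": 1, "C": 1})
	dynamic := trafficShares(map[string]int32{"A": 10, "B": 5, "C": 1})
	fmt.Printf("static:  A=%.1f%% B=%.1f%% C=%.1f%%\n", static["A"], static["B"], static["C"])
	fmt.Printf("dynamic: A=%.2f%% B=%.2f%% C=%.2f%%\n", dynamic["A"], dynamic["B"], dynamic["C"])
}
```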
Root Cause
The root cause is a mismatch between Gateway-layer traffic distribution (based on static weights) and Application capacity (based on dynamic replica counts).
Proposed Feature
Implement dynamic weight calculation: weight = Application.Status.ReadyReplicas
This would help solve the multi-tenant load balancing problem and benefit Arks users by:
- Fair Traffic Distribution: Traffic distributed proportionally to each application's capacity
- Automatic Scaling Support: Weight automatically adjusts during HPA scaling operations
- Progressive Deployment: Allow partial Ready state (e.g., 7/10 Ready) instead of requiring 100% Ready before adding to HTTPRoute
- Better Resource Utilization: High-capacity applications receive proportionally more traffic, preventing overload of small applications
Use Case
In my enterprise multi-tenant deployment, I often need to share a single LLM model endpoint across multiple tenants with different QPS requirements and replica counts.
Specific Scenario:
We have a production Kubernetes cluster running Arks with multiple tenants:
- Tenant A (high-traffic customer): Deploys ArksApplication with 10 replicas for high QPS
- Tenant B (medium-traffic customer): Deploys ArksApplication with 5 replicas
- Tenant C (low-traffic customer): Deploys ArksApplication with 1 replica
All three tenants share the same ArksEndpoint for cost efficiency and resource sharing.
Current Problem:
With static weights (all set to 1), the Gateway distributes traffic equally (33.3% each), causing:
- Tenant C's single replica is overwhelmed with excessive traffic → high latency, potential service degradation
- Tenant A's 10 replicas are underutilized → wasted infrastructure costs
- Poor multi-tenant experience and unfair resource allocation
During Scaling Operations:
When Tenant A scales from 3 to 10 replicas (HPA-triggered), the current implementation requires all replicas Ready before adding to HTTPRoute. This causes traffic interruption during normal scaling operations.
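The difference between the two gating behaviors can be sketched with two hypothetical helper functions (a simplification of the gate logic in arksendpoint_controller.go; function names are illustrative):

```go
package main

import "fmt"

// currentGate mirrors today's behavior: an application joins the HTTPRoute
// only once every desired replica is Ready, and then gets a static weight.
func currentGate(desired, ready int32) (weight int32, included bool) {
	if desired != ready {
		return 0, false
	}
	return 1, true // static DefaultWeight
}

// proposedGate mirrors the dynamic-weight behavior: any Ready replica counts,
// and the weight tracks the Ready count.
func proposedGate(desired, ready int32) (weight int32, included bool) {
	if ready == 0 {
		return 0, false
	}
	return ready, true
}

func main() {
	// Tenant A scaling from 3 to 10 replicas.
	for _, ready := range []int32{3, 4, 7, 10} {
		_, cur := currentGate(10, ready)
		w, _ := proposedGate(10, ready)
		fmt.Printf("ready=%d/10: current included=%v, proposed weight=%d\n", ready, cur, w)
	}
}
```

Under the current gate, Tenant A drops out of the route for the entire scale-up until 10/10 are Ready; under the proposed gate it stays routable the whole time with a weight that grows as pods become Ready.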
Expected Outcome with Dynamic Weight:
This feature would allow me to:
- Achieve fair traffic distribution: Each tenant receives traffic proportional to their capacity (Tenant A: 62.5%, Tenant B: 31.25%, Tenant C: 6.25%)
- Support progressive scaling: During scaling, Tenant A receives traffic as soon as pods become Ready (e.g., 4/10 Ready → weight=4), preventing traffic loss
- Eliminate manual weight configuration: Weights automatically adjust as replicas scale up/down
- Improve resource efficiency: Better utilization across all tenants, reducing infrastructure waste
This aligns with Arks' positioning as an enterprise multi-tenant LLM inference platform and addresses a core requirement for fair resource sharing in shared infrastructure environments.
Proposed Solution
One possible approach could be to modify the ArksEndpoint controller to calculate HTTPRoute backend weights dynamically based on Application ReadyReplicas. This would integrate with Arks's existing reconciliation logic by updating the weight calculation during HTTPRoute backendRef creation.
Implementation Approach
Modified Component: internal/controller/arksendpoint_controller.go (ArksEndpoint reconciliation logic)
Change 1: ArksApplication (Monolithic Architecture)
Current Implementation (Lines 292-322):
```go
// Require all replicas Ready
if app.Spec.Replicas != int(app.Status.ReadyReplicas) {
	klog.V(4).InfoS("application status not ready", "service", svcName)
	continue // Skip if not fully Ready
}
backendRef := gatewayv1.HTTPBackendRef{
	Weight: &ep.Spec.DefaultWeight, // Static weight
}
```

Proposed Implementation:

```go
// Calculate dynamic weight based on ReadyReplicas
weight := int32(app.Status.ReadyReplicas)
if weight == 0 {
	klog.V(4).InfoS("application has no ready replicas, skipping",
		"service", svcName,
		"replicas", app.Spec.Replicas,
		"readyReplicas", app.Status.ReadyReplicas)
	continue // Skip only if zero Ready
}
backendRef := gatewayv1.HTTPBackendRef{
	Weight: &weight, // Dynamic weight
}
klog.InfoS("application added to http route with dynamic weight",
	"service", svcName,
	"weight", weight,
	"readyReplicas", app.Status.ReadyReplicas,
	"replicas", app.Spec.Replicas)
```

Change 2: ArksDisaggregatedApplication (Prefill/Decode Separation)
Current Implementation (Lines 326-365):
```go
// Require Prefill and Decode fully Ready
if app.Status.Router.ReadyReplicas > 0 &&
	app.Status.Prefill.ReadyReplicas == app.Status.Prefill.Replicas &&
	app.Status.Decode.ReadyReplicas == app.Status.Decode.Replicas {
	// OK
} else {
	continue
}
backendRef := gatewayv1.HTTPBackendRef{
	Weight: &ep.Spec.DefaultWeight,
}
```

Proposed Implementation:

```go
// Use Router.ReadyReplicas as weight
weight := int32(app.Status.Router.ReadyReplicas)
if weight == 0 {
	klog.InfoS("disaggregated application has no ready router pods, skipping",
		"service", svcName,
		"routerReplicas", app.Status.Router.Replicas,
		"routerReadyReplicas", app.Status.Router.ReadyReplicas)
	continue
}
// Verify Prefill and Decode have at least some ready replicas
if app.Status.Prefill.ReadyReplicas == 0 || app.Status.Decode.ReadyReplicas == 0 {
	klog.InfoS("disaggregated application prefill/decode not ready, skipping",
		"service", svcName,
		"prefillReadyReplicas", app.Status.Prefill.ReadyReplicas,
		"decodeReadyReplicas", app.Status.Decode.ReadyReplicas)
	continue
}
backendRef := gatewayv1.HTTPBackendRef{
	Weight: &weight, // Dynamic weight = Router.ReadyReplicas
}
```

Rationale: In disaggregated architecture, Router pods serve as the entry point and concurrency bottleneck. The weight should reflect Router capacity for Gateway-layer load balancing.
Integration with Existing Components
This approach integrates with Arks's existing cloud-native architecture by:
- ArksEndpoint CRD: No schema changes required; uses the existing `Status.ReadyReplicas` field
- Gateway API: Leverages the standard HTTPRoute `weight` field (fully compatible with the Kubernetes Gateway API spec)
- Reconciliation Loop: Automatically updates weights during normal ArksEndpoint reconciliation cycles
- Envoy Gateway: No changes needed; Envoy respects updated HTTPRoute weights immediately
Compatibility
Backward Compatible:
- No API/CRD breaking changes
- No migration required
- Existing deployments automatically benefit after operator upgrade
- Safe rollback using `kubectl rollout undo`
Validation
This approach has been validated in a production-like multi-tenant environment:
- Test 1: Multi-tenant load balancing with 3 applications (1, 10, 5 replicas) - PASS
- Test 2: Progressive deployment during scaling - PASS (weight tracks ReadyReplicas in real-time)
- Test 3: Edge cases (zero replicas, recovery) - PASS
- Test 4: Performance and stability - PASS (0 restarts, 3s reconcile time)