Merged
6 changes: 4 additions & 2 deletions Dockerfile
@@ -1,6 +1,8 @@
-FROM node:12-alpine
+FROM node:20-alpine
 
 #ENV PATH $PATH:/node_modules/.bin
 
 COPY . /hyperflow
-RUN npm install -g /hyperflow
+WORKDIR /hyperflow
+RUN npm install
+RUN npm install -g .
2 changes: 1 addition & 1 deletion Makefile
@@ -8,7 +8,7 @@ all: push
container: image

image:
-docker build -t $(PREFIX)/$(REPO_NAME) . # Build new image and automatically tag it as latest
+docker build --no-cache -t $(PREFIX)/$(REPO_NAME) . # Build new image and automatically tag it as latest
docker tag $(PREFIX)/$(REPO_NAME) $(PREFIX)/$(REPO_NAME):$(TAG) # Add the version tag to the latest image

push: image
145 changes: 145 additions & 0 deletions functions/kubernetes/ADMISSION_CONTROLLER.md
@@ -0,0 +1,145 @@
# Kubernetes Admission Controller

## Overview

The K8s Admission Controller prevents overwhelming the Kubernetes scheduler when submitting large numbers of workflow tasks as Pods.

### Problem

When thousands of tasks are ready to execute, creating thousands of Pods immediately can overwhelm the Kubernetes scheduler, resulting in many Pods stuck in Pending state and API throttling (HTTP 429 errors).

### Solution

The admission controller implements a local gate using:

1. **Watch-based state tracking** - Monitors Pod state via a long-lived watch (no polling)
2. **Token bucket rate limiter** - Smoothly paces Pod creation requests
3. **Concurrency window** - Caps the number of pending Pods
4. **Adaptive tuning** - Automatically adjusts limits based on observed cluster throughput

## How It Works

### Two-Gate Model

Every task submission must pass through two gates:

**Gate 1: Concurrency Window**
```
IF pendingCount >= pendingMax THEN wait
```
- Prevents too many Pods from piling up in the Pending state
- `pendingCount` is tracked in real time via the Kubernetes Watch API
- `pendingMax` is the maximum allowed pending Pods

**Gate 2: Token Bucket**
```
IF tokens < 1 THEN wait
tokens = tokens - 1
```
- Smooths out bursts and paces submissions over time
- Tokens refill continuously at rate `fillRate` (tokens/second)
- Maximum token accumulation capped at `burst`
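Taken together, the two gates reduce to a simple predicate. A minimal sketch in JavaScript (the field names `pendingCount`, `pendingMax`, and `tokens` are illustrative assumptions, not the controller's actual identifiers):

```javascript
// Sketch: a submission may proceed only if both gates pass.
// State field names are illustrative assumptions.
function canAdmit(state) {
  if (state.pendingCount >= state.pendingMax) return false; // Gate 1: concurrency window
  if (state.tokens < 1) return false;                       // Gate 2: token bucket
  return true;
}

function admit(state) {
  if (!canAdmit(state)) return false;
  state.tokens -= 1;       // consume one token per submission
  state.pendingCount += 1; // in the real controller, the Pod watch updates this
  return true;
}
```

A submission that fails either gate simply waits and retries later; no Pod is created until both gates pass.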

### Token Bucket Refill

```
elapsed = currentTime - lastRefillTime
newTokens = elapsed * fillRate
tokens = min(burst, tokens + newTokens)
```

**Parameters:**
- `fillRate` - Tokens added per second (controls submission rate)
- `burst` - Maximum token accumulation (handles bursts)
- `tokens` - Current available tokens
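The refill rule above can be sketched as a small generic token bucket (an illustrative implementation, not the controller's actual code):

```javascript
// Generic token bucket implementing the refill rule above.
// Timestamps are in milliseconds; rates are tokens per second.
class TokenBucket {
  constructor(fillRate, burst, now = Date.now()) {
    this.fillRate = fillRate;
    this.burst = burst;
    this.tokens = burst;       // start full
    this.lastRefillTime = now;
  }

  refill(now = Date.now()) {
    const elapsed = (now - this.lastRefillTime) / 1000; // seconds
    this.tokens = Math.min(this.burst, this.tokens + elapsed * this.fillRate);
    this.lastRefillTime = now;
  }

  tryTake(now = Date.now()) {
    this.refill(now);
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

With the defaults (`fillRate = 1`, `burst = 20`), an idle workflow can submit up to 20 Pods back-to-back, then falls back to roughly one submission per second.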

### Adaptive Tuning

Every second, the controller measures cluster performance and adjusts parameters:

**Throughput Measurement:**
```
currentRate = Pending→Running transitions / elapsed time (pods/sec)
runningRateEWMA = α × currentRate + (1 - α) × runningRateEWMA
```
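As a worked example of the EWMA update with α = 0.2 (function name is illustrative):

```javascript
// EWMA update for the observed Pending->Running rate (alpha = 0.2).
const ALPHA = 0.2;

function updateRateEWMA(ewma, transitions, elapsedSec) {
  const currentRate = transitions / elapsedSec; // pods/sec this interval
  return ALPHA * currentRate + (1 - ALPHA) * ewma;
}
```

For instance, with a previous EWMA of 5 pods/sec and 10 transitions observed in the last second, the new EWMA is 0.2 × 10 + 0.8 × 5 = 6 pods/sec.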

**Fill Rate Adaptation:**
```
targetRate = max(1.0, 1.2 × runningRateEWMA)
errorPenalty = max(0.5, 1 - 2 × createErrorEWMA)
fillRate = max(1.0, min(configuredRate × 2, targetRate × errorPenalty))
```
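The same formulas transcribed into code (a sketch; `configuredRate` stands for the value of `HF_VAR_ADMISSION_FILL_RATE`):

```javascript
// Fill-rate adaptation, transcribed from the formulas above.
function adaptFillRate(runningRateEWMA, createErrorEWMA, configuredRate) {
  const targetRate = Math.max(1.0, 1.2 * runningRateEWMA);
  const errorPenalty = Math.max(0.5, 1 - 2 * createErrorEWMA);
  return Math.max(1.0, Math.min(configuredRate * 2, targetRate * errorPenalty));
}
```

For example, with `runningRateEWMA = 6` pods/sec, no errors, and `configuredRate = 10`, the result is `min(20, 7.2) = 7.2` tokens/sec; the error penalty never cuts that by more than half, and the floor never drops below 1 token/sec.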

**Pending Max Adaptation (with Hysteresis):**
```
minuteBuffer = round(60 × max(0.5, runningRateEWMA))
targetPendingMax = clamp(minuteBuffer, minPendingMax, maxPendingMax)

IF targetPendingMax > pendingMax THEN
pendingMax = targetPendingMax // Increase immediately (aggressive)
ELSE IF targetPendingMax < pendingMax AND timeSinceLastDecrease > 2 minutes THEN
pendingMax = targetPendingMax // Decrease only if sustained (conservative)
lastPendingMaxDecrease = now
ELSE
// Keep current pendingMax (ignore temporary dip)
END
```
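A sketch of the hysteresis rule in code, using the clamp bounds and 2-minute window listed under Parameters (field names are illustrative):

```javascript
// Hysteresis for pendingMax: raise immediately, lower only after
// a sustained (>= 2 min) dip. Field names are illustrative.
const MIN_PENDING_MAX = 50;
const MAX_PENDING_MAX = 2000;
const HYSTERESIS_MS = 2 * 60 * 1000;

const clamp = (x, lo, hi) => Math.min(hi, Math.max(lo, x));

function adaptPendingMax(state, runningRateEWMA, now) {
  const minuteBuffer = Math.round(60 * Math.max(0.5, runningRateEWMA));
  const target = clamp(minuteBuffer, MIN_PENDING_MAX, MAX_PENDING_MAX);

  if (target > state.pendingMax) {
    state.pendingMax = target; // increase immediately (aggressive)
  } else if (target < state.pendingMax &&
             now - state.lastPendingMaxDecrease > HYSTERESIS_MS) {
    state.pendingMax = target; // sustained slowdown: decrease (conservative)
    state.lastPendingMaxDecrease = now;
  }                            // else: ignore the temporary dip
  return state.pendingMax;
}
```

The asymmetry is deliberate: raising the window too late wastes cluster capacity, while lowering it too eagerly on a one-off scheduling hiccup would throttle a healthy workflow.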

**Parameters:**
- `α` (alpha) = 0.2 - EWMA smoothing factor (higher = more reactive)
- `runningRateEWMA` - Exponentially weighted moving average of scheduling rate
- `createErrorEWMA` - EWMA of API error rate (0-1)
- `minPendingMax` = 50 - Minimum allowed pending limit
- `maxPendingMax` = 2000 - Maximum allowed pending limit
- `hysteresisWindow` = 2 minutes - Minimum time before decreasing pendingMax

**Adaptive behavior:**
- **High throughput observed** → increase `fillRate` to submit more
- **API errors detected** → reduce `fillRate` via error penalty
- **Fast scheduling** → increase `pendingMax` immediately (aggressive)
- **Temporary slowdown** → keep `pendingMax` high (prevent under-utilization)
- **Sustained slowdown (>2 min)** → decrease `pendingMax` conservatively

## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `HF_VAR_ADMISSION_CONTROLLER` | `1` | Set to `0` to disable |
| `HF_VAR_ADMISSION_PENDING_MAX` | `200` | Max pending Pods |
| `HF_VAR_ADMISSION_FILL_RATE` | `1` | Token fill rate (tokens/sec) |
| `HF_VAR_ADMISSION_BURST` | `20` | Max token bucket size |
| `HF_VAR_ADMISSION_ADAPTIVE` | `1` | Set to `0` to disable adaptive tuning |
| `HF_VAR_ADMISSION_DEBUG` | `0` | Set to `1` to enable debug logging |
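The table above maps onto a config object roughly like the following sketch (illustrative only; this is not the executor's actual parsing code, and the helper names are invented):

```javascript
// Sketch: reading the admission-controller settings with their defaults.
function getAdmissionConfig(env = process.env) {
  const num = (name, dflt) => {
    const raw = env[name];
    const v = Number(raw);
    return raw !== undefined && raw !== '' && Number.isFinite(v) ? v : dflt;
  };
  return {
    enabled: env.HF_VAR_ADMISSION_CONTROLLER !== '0',
    pendingMax: num('HF_VAR_ADMISSION_PENDING_MAX', 200),
    fillRate: num('HF_VAR_ADMISSION_FILL_RATE', 1),
    burst: num('HF_VAR_ADMISSION_BURST', 20),
    adaptive: env.HF_VAR_ADMISSION_ADAPTIVE !== '0',
    debug: env.HF_VAR_ADMISSION_DEBUG === '1',
  };
}
```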

## Usage

The admission controller is **automatically enabled** when using the standard HyperFlow k8s executor. No code changes required.

To disable:
```bash
export HF_VAR_ADMISSION_CONTROLLER=0
```

## Tuning Examples

### Conservative (Small Clusters)
```bash
export HF_VAR_ADMISSION_PENDING_MAX=50
export HF_VAR_ADMISSION_FILL_RATE=1
export HF_VAR_ADMISSION_BURST=10
```

### Aggressive (Large Clusters)
```bash
export HF_VAR_ADMISSION_PENDING_MAX=500
export HF_VAR_ADMISSION_FILL_RATE=10
export HF_VAR_ADMISSION_BURST=50
```

### Adaptive (Recommended)
```bash
export HF_VAR_ADMISSION_ADAPTIVE=1
# Controller automatically adjusts based on observed cluster performance
```