Description
Performance Issue Description
The Performance: zsh rprompt GitHub Actions job is flaky because it enforces a hard average-time threshold of 60ms, and recent runs sometimes land slightly above it on hosted runners.
This makes CI fail even when the benchmark is only marginally slower and there is no clear functional regression. Recent failures show averages like 64ms while other nearby runs on the same workflow passed, which strongly suggests runner variance rather than a deterministic regression.
Issue Type
Tool execution time
Steps to Reproduce
- Open the CI workflow that includes the Performance: zsh rprompt job.
- Run the workflow on GitHub-hosted ubuntu-latest runners.
- Observe that the benchmark executes ./scripts/benchmark.sh --threshold 60 zsh rprompt.
- Compare multiple runs of the same job.
- Some runs pass while others fail because the average runtime drifts slightly above 60ms.
Example failing runs:
- https://github.com/antinomyhq/forge/actions/runs/23092691613/job/67080023936
- https://github.com/antinomyhq/forge/actions/runs/23100069509/job/67099178762
Example nearby passing runs:
- https://github.com/antinomyhq/forge/actions/runs/23127232999/job/67172783383
- https://github.com/antinomyhq/forge/actions/runs/23127275990/job/67172898894
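The flaky pass/fail pattern can be sketched with a self-contained script. This is illustrative only: the per-run averages below are made-up stand-ins shaped like the values seen in CI; in a real checkout you would run ./scripts/benchmark.sh --threshold 60 zsh rprompt instead.

```shell
#!/bin/sh
# Sketch of the flaky pass/fail pattern on a hard 60ms cut-off.
# The per-run averages are illustrative, not real CI measurements.
threshold=60
pass=0
fail=0

for avg in 58 64 59 61 57; do
  if [ "$avg" -le "$threshold" ]; then
    pass=$((pass + 1))
  else
    fail=$((fail + 1))
  fi
done

echo "pass=$pass fail=$fail"   # → pass=3 fail=2
```

With averages clustered around the threshold, a few milliseconds of runner noise is enough to flip a run from passing to failing.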
Expected Performance
The benchmark should be stable enough that normal variance on GitHub-hosted runners does not fail CI. Small fluctuations around the threshold should not block unrelated changes.
Actual Performance
The benchmark occasionally reports an average runtime above 60ms and exits with code 1, causing the workflow to fail.
Performance Measurements
Recent failing measurement from CI:
- Threshold: 60ms
- Average: 64ms
- Min: 58ms
- Max: 80ms
Example output:
📊 Summary
avg 64 ms
min 58 ms
max 80 ms
❌ Performance regression detected!
Average time 64ms exceeds threshold 60ms
Project Size
Full Forge workspace running in the main CI workflow on GitHub-hosted runners.
Environment Details
- Forge: built from the current repository revision in CI
- OS: Ubuntu 24.04 GitHub-hosted runner (ubuntu-latest)
- CPU: GitHub-hosted runner VM (shared/variable performance)
- RAM: GitHub-hosted runner VM
- Provider: N/A
- Model: N/A
- Shell: /usr/bin/bash
Configuration
zsh_rprompt_perf:
name: 'Performance: zsh rprompt'
runs-on: ubuntu-latest
steps:
- name: Run performance benchmark
run: './scripts/benchmark.sh --threshold 60 zsh rprompt'
The benchmark script currently fails the job whenever the computed average exceeds the threshold:
threshold: 60ms average
iterations: 10
failure_condition: avg > threshold
Profiling Data
🚀 Performance Test — forge zsh rprompt
📦 Building...
Finished `dev` profile [unoptimized] target(s) in 32.79s
📋 Sample output:
%B%F{240} FORGE%f%b
⏱️ Running 10 iterations...
1 59 ms
2 67 ms
3 80 ms
4 58 ms
5 59 ms
6 63 ms
7 65 ms
8 59 ms
9 65 ms
10 74 ms
📊 Summary
avg 64 ms
min 58 ms
max 80 ms
❌ Performance regression detected!
Average time 64ms exceeds threshold 60ms
Frequency
Sometimes (25-50% of the time)
Workarounds
Re-run the workflow. The same job has passed on nearby runs without code changes to the benchmark logic.
Additional Context
The current implementation is very strict for a shared CI environment:
- CI config hardcodes --threshold 60 for the zsh rprompt benchmark.
- The benchmark script fails immediately when avg > threshold.
- Recent history shows both failing and passing runs for the same job, which makes the signal noisy.
This could likely be stabilized by relaxing the threshold, reducing sensitivity to outliers, warming up before measurement, or comparing a more robust statistic (such as the median or minimum) against the threshold instead of a single average on shared runners.
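As a hedged sketch of one of those options (this is not the actual benchmark.sh logic), the failure condition could gate on the minimum iteration time rather than the average, since the minimum is far less sensitive to scheduler noise on shared runners. The timings below are the ten iterations reported in the profiling data above:

```shell
#!/bin/sh
# Sketch only: a more noise-tolerant check than `avg > threshold`.
# Timings are the 10 iterations from the failing CI run in this issue.
timings="59 67 80 58 59 63 65 59 65 74"
threshold=60

sum=0; n=0; min=
for t in $timings; do
  sum=$((sum + t))
  n=$((n + 1))
  if [ -z "$min" ] || [ "$t" -lt "$min" ]; then min=$t; fi
done
avg=$((sum / n))

echo "avg ${avg} ms"   # 64: the current avg > 60 check fails this run
echo "min ${min} ms"   # 58: a min-based check would pass it

if [ "$min" -gt "$threshold" ]; then
  echo "regression: even the fastest iteration exceeded ${threshold}ms"
  exit 1
fi
echo "ok (min-based check)"
```

An alternative with the same shape is to keep the average but raise the threshold above the observed noise band until the benchmark runs on less variable hardware.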