[Performance]: Stabilize flaky zsh rprompt CI benchmark #2574

@tusharmath

Description

Performance Issue Description

The Performance: zsh rprompt GitHub Actions job is flaky because it enforces a hard average-time threshold of 60ms, and recent runs sometimes land slightly above it on hosted runners.

This makes CI fail even when the benchmark is only marginally slower and there is no clear functional regression. Recent failures show averages like 64ms while other nearby runs on the same workflow passed, which strongly suggests runner variance rather than a deterministic regression.

Issue Type

Tool execution time

Steps to Reproduce

  1. Open the CI workflow that includes the Performance: zsh rprompt job.
  2. Run the workflow on GitHub-hosted ubuntu-latest runners.
  3. Observe that the benchmark executes ./scripts/benchmark.sh --threshold 60 zsh rprompt.
  4. Compare multiple runs of the same job.
  5. Some runs pass while others fail because the average runtime drifts slightly above 60ms.

Expected Performance

The benchmark should be stable enough that normal variance on GitHub-hosted runners does not fail CI. Small fluctuations around the threshold should not block unrelated changes.

Actual Performance

The benchmark occasionally reports an average runtime above 60ms and exits with code 1, causing the workflow to fail.

Performance Measurements

Recent failing measurement from CI:

  • Threshold: 60ms
  • Average: 64ms
  • Min: 58ms
  • Max: 80ms

Example output:

📊 Summary

  avg     64 ms
  min     58 ms
  max     80 ms

❌ Performance regression detected!
   Average time 64ms exceeds threshold 60ms

Project Size

Full Forge workspace running in the main CI workflow on GitHub-hosted runners.

Environment Details

  • Forge: built from the current repository revision in CI
  • OS: Ubuntu 24.04 GitHub-hosted runner (ubuntu-latest)
  • CPU: GitHub-hosted runner VM (shared/variable performance)
  • RAM: GitHub-hosted runner VM
  • Provider: N/A
  • Model: N/A
  • Shell: /usr/bin/bash

Configuration

zsh_rprompt_perf:
  name: 'Performance: zsh rprompt'
  runs-on: ubuntu-latest
  steps:
    - name: Run performance benchmark
      run: './scripts/benchmark.sh --threshold 60 zsh rprompt'

The benchmark script currently fails the job whenever the computed average exceeds the threshold:

threshold: 60ms average
iterations: 10
failure_condition: avg > threshold
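The gate described above can be sketched as a small shell function (a hypothetical reconstruction for illustration; the function name and exact logic in ./scripts/benchmark.sh may differ):

```shell
# Hypothetical sketch of the pass/fail gate: compare the computed average
# against the threshold and report a regression when it is exceeded.
check_avg() {
  local avg=$1 threshold=$2
  if [ "$avg" -gt "$threshold" ]; then
    echo "Average time ${avg}ms exceeds threshold ${threshold}ms"
    return 1
  fi
  return 0
}

# With the failing run's numbers (avg 64ms, threshold 60ms) the gate trips:
check_avg 64 60 || echo "gate tripped, CI job fails"
```

Because the comparison is a strict `avg > threshold` with no tolerance band, a 4ms drift on a shared runner is enough to flip the job from green to red.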

Profiling Data

🚀 Performance Test — forge zsh rprompt
📦 Building...
Finished `dev` profile [unoptimized] target(s) in 32.79s
📋 Sample output:
 %B%F{240}󱙺 FORGE%f%b
⏱️  Running 10 iterations...
   1     59 ms
   2     67 ms
   3     80 ms
   4     58 ms
   5     59 ms
   6     63 ms
   7     65 ms
   8     59 ms
   9     65 ms
  10     74 ms
📊 Summary
  avg     64 ms
  min     58 ms
  max     80 ms
❌ Performance regression detected!
   Average time 64ms exceeds threshold 60ms

Frequency

Sometimes (25-50% of the time)

Workarounds

Re-run the workflow. The same job has passed on nearby runs without code changes to the benchmark logic.

Additional Context

The current implementation is very strict for a shared CI environment:

  • CI config hardcodes --threshold 60 for the zsh rprompt benchmark.
  • The benchmark script fails immediately when avg > threshold.
  • Recent history shows both failing and passing runs for the same job, which makes the signal noisy.

This should likely be stabilized by relaxing the threshold, reducing sensitivity to outliers, warming up before measurement, or comparing against a more robust metric than a single average on shared runners.
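As one concrete option for a more robust statistic, the median ignores tail outliers such as the 80ms iteration above. A minimal sketch, using the ten iteration times from the profiling data as sample input (the gating logic itself is illustrative, not the current script):

```shell
# Median of the ten iteration times reported in CI: sort the samples and
# take the middle value (mean of the two middle values for even counts).
samples="59 67 80 58 59 63 65 59 65 74"
median=$(printf '%s\n' $samples | sort -n | awk '
  { vals[NR] = $1 }
  END {
    if (NR % 2) print vals[(NR + 1) / 2]
    else print int((vals[NR / 2] + vals[NR / 2 + 1]) / 2)
  }')
echo "median: ${median} ms"
```

A median-based gate (possibly combined with a warm-up iteration that is discarded and a slightly higher threshold) would keep a single easy-to-read number while being less sensitive to one slow iteration on a shared runner.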
