Description
Performance Issue Description
The Performance: zsh rprompt GitHub Actions job is flaky because it enforces a hard average-time threshold of 60ms, and recent runs sometimes land slightly above it on hosted runners.
This makes CI fail even when the benchmark is only marginally slower and there is no clear functional regression. Recent failures show averages like 64ms while other nearby runs on the same workflow passed, which strongly suggests runner variance rather than a deterministic regression.
Issue Type
Tool execution time
Steps to Reproduce
- Open the CI workflow that includes the Performance: zsh rprompt job.
- Run the workflow on GitHub-hosted ubuntu-latest runners.
- Observe that the benchmark executes ./scripts/benchmark.sh --threshold 60 zsh rprompt.
- Compare multiple runs of the same job.
- Some runs pass while others fail because the average runtime drifts slightly above 60ms.
Example failing runs:
- https://github.com/antinomyhq/forge/actions/runs/23092691613/job/67080023936
- https://github.com/antinomyhq/forge/actions/runs/23100069509/job/67099178762
Example nearby passing runs:
- https://github.com/antinomyhq/forge/actions/runs/23127232999/job/67172783383
- https://github.com/antinomyhq/forge/actions/runs/23127275990/job/67172898894
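The flaky pass/fail pattern can be sketched with a self-contained script. This is illustrative only: the per-run averages below are made-up stand-ins shaped like the values seen in CI; in a real checkout you would run ./scripts/benchmark.sh --threshold 60 zsh rprompt instead.

```shell
#!/bin/sh
# Sketch of the flaky pass/fail pattern on a hard 60ms cut-off.
# The per-run averages are illustrative, not real CI measurements.
threshold=60
pass=0
fail=0

for avg in 58 64 59 61 57; do
  if [ "$avg" -le "$threshold" ]; then
    pass=$((pass + 1))
  else
    fail=$((fail + 1))
  fi
done

echo "pass=$pass fail=$fail"   # → pass=3 fail=2
```

With averages clustered around the threshold, a few milliseconds of runner noise is enough to flip a run from passing to failing.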
Expected Performance
The benchmark should be stable enough that normal variance on GitHub-hosted runners does not fail CI. Small fluctuations around the threshold should not block unrelated changes.
Actual Performance
The benchmark occasionally reports an average runtime above 60ms and exits with code 1, causing the workflow to fail.
Performance Measurements
Recent failing measurement from CI:
- Threshold: 60ms
- Average: 64ms
- Min: 58ms
- Max: 80ms
Example output:
📊 Summary
avg 64 ms
min 58 ms
max 80 ms
❌ Performance regression detected!
Average time 64ms exceeds threshold 60ms
Project Size
Full Forge workspace running in the main CI workflow on GitHub-hosted runners.
Environment Details
- Forge: built from the current repository revision in CI
- OS: Ubuntu 24.04 GitHub-hosted runner (ubuntu-latest)
- CPU: GitHub-hosted runner VM (shared/variable performance)
- RAM: GitHub-hosted runner VM
- Provider: N/A
- Model: N/A
- Shell: /usr/bin/bash
Configuration
zsh_rprompt_perf:
name: 'Performance: zsh rprompt'
runs-on: ubuntu-latest
steps:
- name: Run performance benchmark
run: './scripts/benchmark.sh --threshold 60 zsh rprompt'
The benchmark script currently fails the job whenever the computed average exceeds the threshold:
threshold: 60ms average
iterations: 10
failure_condition: avg > threshold
Profiling Data
🚀 Performance Test — forge zsh rprompt
📦 Building...
Finished `dev` profile [unoptimized] target(s) in 32.79s
📋 Sample output:
%B%F{240} FORGE%f%b
⏱️ Running 10 iterations...
1 59 ms
2 67 ms
3 80 ms
4 58 ms
5 59 ms
6 63 ms
7 65 ms
8 59 ms
9 65 ms
10 74 ms
📊 Summary
avg 64 ms
min 58 ms
max 80 ms
❌ Performance regression detected!
Average time 64ms exceeds threshold 60ms
Frequency
Sometimes (25-50% of the time)
Workarounds
Re-run the workflow. The same job has passed on nearby runs without code changes to the benchmark logic.
Additional Context
The current implementation is very strict for a shared CI environment:
- CI config hardcodes --threshold 60 for the zsh rprompt benchmark.
- The benchmark script fails immediately when avg > threshold.
- Recent history shows both failing and passing runs for the same job, which makes the signal noisy.
This could likely be stabilized by relaxing the threshold, reducing sensitivity to outliers, warming up before measurement, or comparing a more robust statistic (such as the median or minimum) against the threshold instead of a single average on shared runners.
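As a hedged sketch of one of those options (this is not the actual benchmark.sh logic), the failure condition could gate on the minimum iteration time rather than the average, since the minimum is far less sensitive to scheduler noise on shared runners. The timings below are the ten iterations reported in the profiling data above:

```shell
#!/bin/sh
# Sketch only: a more noise-tolerant check than `avg > threshold`.
# Timings are the 10 iterations from the failing CI run in this issue.
timings="59 67 80 58 59 63 65 59 65 74"
threshold=60

sum=0; n=0; min=
for t in $timings; do
  sum=$((sum + t))
  n=$((n + 1))
  if [ -z "$min" ] || [ "$t" -lt "$min" ]; then min=$t; fi
done
avg=$((sum / n))

echo "avg ${avg} ms"   # 64: the current avg > 60 check fails this run
echo "min ${min} ms"   # 58: a min-based check would pass it

if [ "$min" -gt "$threshold" ]; then
  echo "regression: even the fastest iteration exceeded ${threshold}ms"
  exit 1
fi
echo "ok (min-based check)"
```

An alternative with the same shape is to keep the average but raise the threshold above the observed noise band until the benchmark runs on less variable hardware.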