feat: /benchmark — catch performance regressions before users feel them by HMAKT99 · Pull Request #153 · garrytan/gstack

HMAKT99 · 2026-03-18T04:58:41Z

Performance dies by a thousand paper cuts

No single PR makes the app slow. But every PR adds 50ms here, 20KB there. In 6 weeks the landing page takes 4 seconds to load and nobody can point to when it happened. The boiling frog problem.

/benchmark gives you the thermometer. Capture baselines, measure after every PR, catch regressions before they compound. It uses the $B perf command that's been in gstack since day one but that no skill has been using systematically.

What it looks like

You:   /benchmark https://myapp.com --baseline
Claude: Baseline captured for 5 pages. Deploy your changes, then run /benchmark.

You:   [make changes, deploy]

You:   /benchmark https://myapp.com
Claude: PERFORMANCE REPORT — myapp.com
        Branch: feature-xyz vs baseline (main)
        ═══════════════════════════════════════════════════
        Metric          Baseline    Current     Delta    Status
        ─────           ────────    ───────     ─────    ──────
        TTFB            120ms       135ms       +15ms    OK
        FCP             450ms       480ms       +30ms    OK
        LCP             800ms       1600ms      +800ms   REGRESSION ←
        DOM Complete    1200ms      1350ms      +150ms   WARNING
        JS Bundle       450KB       720KB       +270KB   REGRESSION ←
        CSS Bundle      85KB        88KB        +3KB     OK
        Requests        42          58          +16      WARNING

        REGRESSIONS DETECTED: 2
        [1] LCP doubled (800ms → 1600ms)
            Likely cause: new hero image loaded synchronously
        [2] JS bundle +60% (450KB → 720KB)
            Likely cause: new dependency or missing tree-shaking

        TOP 5 SLOWEST RESOURCES:
        #  Resource              Size     Duration
        1  vendor.chunk.js      320KB    480ms    ← consider code-splitting
        2  main.js              250KB    320ms
        3  hero-image.webp      180KB    280ms    ← add width/height, lazy load
        4  analytics.js          45KB    250ms    ← load async
        5  inter-var.woff2       95KB    180ms

        PERFORMANCE BUDGET:
        FCP < 1.8s    PASS    (0.48s)
        LCP < 2.5s    PASS    (1.6s)
        JS  < 500KB   FAIL    (720KB)  ←
        CSS < 100KB   PASS    (88KB)
        Grade: B (was A)
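The PERFORMANCE BUDGET block of the report can be sketched as a small pass/fail check. This is a minimal sketch, not the skill's actual logic: the limits are copied from the sample report, while `BudgetRule`, `Measurements`, and `checkBudget` are hypothetical names introduced here. The letter grade is left out because the PR does not say how it is derived.

```typescript
// Budget limits copied from the sample report above.
// All type and function names are hypothetical, for illustration only.
interface BudgetRule {
  metric: "FCP" | "LCP" | "JS" | "CSS";
  limit: number;
  unit: "s" | "KB";
}

const BUDGET: BudgetRule[] = [
  { metric: "FCP", limit: 1.8, unit: "s" },
  { metric: "LCP", limit: 2.5, unit: "s" },
  { metric: "JS", limit: 500, unit: "KB" },
  { metric: "CSS", limit: 100, unit: "KB" },
];

type Measurements = Record<BudgetRule["metric"], number>;

interface BudgetResult {
  metric: string;
  pass: boolean;
  line: string;
}

// Compare each measured value against its limit and format one report line.
function checkBudget(measured: Measurements): BudgetResult[] {
  return BUDGET.map(({ metric, limit, unit }) => {
    const value = measured[metric];
    const pass = value < limit; // strict comparison, matching the "<" in the report
    const line = `${metric} < ${limit}${unit}  ${pass ? "PASS" : "FAIL"}  (${value}${unit})`;
    return { metric, pass, line };
  });
}

// Values taken from the "Current" column of the sample report.
const budgetResults = checkBudget({ FCP: 0.48, LCP: 1.6, JS: 720, CSS: 88 });
for (const r of budgetResults) console.log(r.line);
```

With the sample values, only the JS rule fails, which matches the single FAIL line in the report.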

What it measures

Real data from performance.getEntries() — not estimates, not Lighthouse scores, actual browser timing:

| Metric | Source | Regression Threshold |
| --- | --- | --- |
| TTFB | Navigation Timing API | >50% or >500ms increase |
| FCP | Paint Timing API | >50% or >500ms increase |
| LCP | Largest Contentful Paint API | >50% or >500ms increase |
| DOM Complete | Navigation Timing API | >50% or >500ms increase |
| JS bundle size | Resource Timing API | >25% increase |
| CSS bundle size | Resource Timing API | >25% increase |
| Request count | Resource Timing API | >30% increase |
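The regression thresholds above can be expressed as a small classifier. This is a sketch under stated assumptions: the threshold numbers come from the table, but `MetricKind`, `THRESHOLDS`, and `isRegression` are hypothetical names, and the WARNING tier shown in the sample report is not modelled because its cutoffs are not given in this PR.

```typescript
// Regression thresholds from the table above.
// Names are hypothetical; the WARNING tier is intentionally not modelled.
type MetricKind = "timing" | "bundleSize" | "requestCount";

interface Threshold {
  maxRelativeIncrease: number;  // 0.5 means a 50% increase over baseline
  maxAbsoluteIncrease?: number; // same unit as the metric (ms for timings)
}

const THRESHOLDS: Record<MetricKind, Threshold> = {
  timing: { maxRelativeIncrease: 0.5, maxAbsoluteIncrease: 500 },
  bundleSize: { maxRelativeIncrease: 0.25 },
  requestCount: { maxRelativeIncrease: 0.3 },
};

// A metric regresses when it grows past EITHER the relative or the absolute limit.
function isRegression(kind: MetricKind, baseline: number, current: number): boolean {
  const { maxRelativeIncrease, maxAbsoluteIncrease } = THRESHOLDS[kind];
  const delta = current - baseline;
  if (delta <= 0) return false; // unchanged or improved
  const relative = baseline > 0 ? delta / baseline : Infinity;
  const overRelative = relative > maxRelativeIncrease;
  const overAbsolute = maxAbsoluteIncrease !== undefined && delta > maxAbsoluteIncrease;
  return overRelative || overAbsolute;
}

console.log(isRegression("timing", 800, 1600));    // LCP row: true (+100%)
console.log(isRegression("timing", 120, 135));     // TTFB row: false (+12.5%, +15ms)
console.log(isRegression("bundleSize", 450, 720)); // JS bundle row: true (+60%)
```

The "or" in the timing rows matters for slow pages: a jump from 2000ms to 2600ms is only +30%, but the +600ms absolute increase still trips the check.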

The `$B perf` command finally gets a home

gstack has had $B perf since v0.1. It returns page load performance data. But no skill uses it systematically:

/qa checks for visual bugs, not performance
/review checks code quality, not runtime speed
/ship runs tests, not benchmarks

/benchmark is the skill that $B perf was waiting for.
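For readers unfamiliar with the browser data behind these numbers, here is a hedged sketch of how a page summary could be derived from Performance API entries. It does not show what $B perf actually returns or how it is implemented. The interfaces mirror a few standard fields of the navigation, paint, and resource timing entries and are defined locally so the sketch runs outside a browser; `summarize` and `slowest` are hypothetical helpers. LCP is left out because browsers typically report it through a PerformanceObserver.

```typescript
// Minimal structural types mirroring fields of standard Performance API entries.
// Defined locally so this sketch runs outside a browser; helper names are hypothetical.
interface NavEntry { responseStart: number; domComplete: number; }
interface PaintEntry { name: string; startTime: number; }
interface ResourceEntry { name: string; transferSize: number; duration: number; }

interface PageSummary {
  ttfbMs: number;
  fcpMs: number | null;
  domCompleteMs: number;
  jsBytes: number;
  cssBytes: number;
  requests: number;
}

// Sum transferSize for resources whose URL path ends with the given extension.
// Note: transferSize can be 0 for cached or opaque cross-origin responses.
function bytesFor(resources: ResourceEntry[], ext: string): number {
  return resources
    .filter((r) => r.name.replace(/[?#].*$/, "").endsWith(ext))
    .reduce((sum, r) => sum + r.transferSize, 0);
}

function summarize(nav: NavEntry, paints: PaintEntry[], resources: ResourceEntry[]): PageSummary {
  const fcp = paints.find((p) => p.name === "first-contentful-paint");
  return {
    ttfbMs: nav.responseStart,
    fcpMs: fcp ? fcp.startTime : null,
    domCompleteMs: nav.domComplete,
    jsBytes: bytesFor(resources, ".js"),
    cssBytes: bytesFor(resources, ".css"),
    requests: resources.length,
  };
}

// Return the n resources with the longest duration, slowest first.
function slowest(resources: ResourceEntry[], n: number): ResourceEntry[] {
  return [...resources].sort((a, b) => b.duration - a.duration).slice(0, n);
}

// Mock entries standing in for what a browser would report.
const mockResources: ResourceEntry[] = [
  { name: "https://myapp.com/main.js", transferSize: 250000, duration: 320 },
  { name: "https://myapp.com/vendor.chunk.js?v=2", transferSize: 320000, duration: 480 },
  { name: "https://myapp.com/styles.css", transferSize: 88000, duration: 90 },
  { name: "https://myapp.com/hero-image.webp", transferSize: 180000, duration: 280 },
];
const summary = summarize(
  { responseStart: 135, domComplete: 1350 },
  [{ name: "first-paint", startTime: 400 }, { name: "first-contentful-paint", startTime: 480 }],
  mockResources,
);
console.log(summary, slowest(mockResources, 2).map((r) => r.name));
```

In a browser, the three inputs would come from `performance.getEntriesByType("navigation")`, `"paint"`, and `"resource"` respectively.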

Features

Baselines: Capture before deploying, compare after. Baselines are JSON — diffable, trackable, reviewable.
Trend analysis: /benchmark --trend shows performance over time from historical data. Spot the week things started getting slow.
Resource waterfall: Top 10 slowest resources with specific fix recommendations (code-split, lazy-load, async, compress).
Performance budget: Grade against industry standards (FCP < 1.8s, LCP < 2.5s, JS < 500KB).
Diff-aware: /benchmark --diff only benchmarks pages affected by current branch changes.
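Because baselines are plain JSON, the before/after comparison reduces to a key-by-key diff. The sketch below assumes a hypothetical snapshot shape (`branch` plus a flat `metrics` map); the real baseline schema is not shown in this PR, and `Snapshot`, `DeltaRow`, and `diffSnapshots` are illustrative names only.

```typescript
// Hypothetical baseline shape; the real JSON schema is not shown in this PR.
interface Snapshot {
  branch: string;
  metrics: Record<string, number>;
}

interface DeltaRow {
  metric: string;
  baseline: number;
  current: number;
  delta: number;
  percent: number | null; // null when the baseline value is 0
}

// Compare every metric present in both snapshots.
function diffSnapshots(baseline: Snapshot, current: Snapshot): DeltaRow[] {
  return Object.keys(baseline.metrics)
    .filter((metric) => metric in current.metrics)
    .map((metric) => {
      const b = baseline.metrics[metric];
      const c = current.metrics[metric];
      const delta = c - b;
      const percent = b !== 0 ? Math.round((delta / b) * 100) : null;
      return { metric, baseline: b, current: c, delta, percent };
    });
}

// A baseline as it might be read from disk, and a fresh measurement.
const savedBaseline: Snapshot = JSON.parse(
  '{"branch":"main","metrics":{"lcpMs":800,"jsBundleKB":450}}',
);
const currentRun: Snapshot = {
  branch: "feature-xyz",
  metrics: { lcpMs: 1600, jsBundleKB: 720 },
};

const deltaRows = diffSnapshots(savedBaseline, currentRun);
for (const row of deltaRows) {
  console.log(`${row.metric}: ${row.baseline} -> ${row.current} (+${row.delta}, ${row.percent}%)`);
}
```

Using the LCP and JS bundle numbers from the sample report, this yields deltas of +800 (100%) and +270 (60%), the same figures listed under REGRESSIONS DETECTED.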

Arguments

/benchmark <url>              — full benchmark with baseline comparison
/benchmark <url> --baseline   — capture baseline (run before changes)
/benchmark <url> --quick      — single-pass timing (no baseline needed)
/benchmark <url> --pages ...  — specific pages
/benchmark --diff             — benchmark pages affected by current branch
/benchmark --trend            — show historical performance trends

This slots into the engineering workflow

/plan-eng-review    → what to build
/review             → is the code correct?
/benchmark          → is it fast?        ← NEW
/a11y               → is it accessible?  ← NEW
/ship               → push it
/canary             → did it break?      ← NEW
/qa                 → full QA pass

Test plan

bun test — all tests pass, 0 failures
bun run gen:skill-docs --dry-run — FRESH
Uses {{PREAMBLE}} + {{BROWSE_SETUP}} — follows template pipeline
Reports saved to .gstack/benchmark-reports/

…wse daemon

Catches the death-by-a-thousand-cuts performance decay:
- Before/after comparison using browse daemon's perf command
- Core Web Vitals tracking (TTFB, FCP, LCP, DOM Complete)
- JS/CSS bundle size monitoring with regression thresholds
- Resource waterfall analysis with optimization recommendations
- Performance budget checking against industry standards
- Trend analysis from historical benchmark data
- Diff-aware mode: only benchmark pages affected by current branch

HMAKT99 changed the title ~~feat: add /benchmark — performance regression detection via browse daemon~~ feat: /benchmark — catch performance regressions before users feel them Mar 18, 2026

HMAKT99 added 2 commits March 18, 2026 10:44

fix: remove generated SKILL.md — only commit .tmpl template

5c6b66f

Per maintainer feedback: generated .md files should not be committed. Only the .tmpl template is source of truth. Build generates the .md.

fix: remove leaked cross-branch entries from test arrays

7eea4da

HMAKT99 wants to merge 3 commits into garrytan:main from HMAKT99:arun/benchmark-skill