Conversation
📝 WalkthroughWalkthroughThis PR introduces Quipbench, a standalone benchmarking system with live self-play rounds, Elo-based ratings, SQLite persistence, configurable concurrency, and a web dashboard. Includes database layer, orchestration, export functionality, CLI tools, and integration tests. Changes
Sequence DiagramsequenceDiagram
participant User
participant RunOrch as run.ts<br/>(Orchestrator)
participant Worker as Concurrent Workers
participant AI as AI Integration
participant DB as SQLite DB
participant Export as export.ts
participant Dashboard as Dashboard UI
User->>RunOrch: runQuipbench(options)
RunOrch->>DB: openBenchDb, insertRunStart
RunOrch->>RunOrch: Initialize ratings, create workers
loop Concurrent Rounds
RunOrch->>Worker: Schedule round
Worker->>AI: generatePrompt, generateAnswer
AI-->>Worker: responses
Worker->>AI: vote (all participants)
AI-->>Worker: votes
Worker->>Worker: Aggregate winner, compute Elo
Worker->>DB: insertMatch (serialized write)
end
RunOrch->>DB: updateRunProgress, replaceRatings
RunOrch->>DB: finalizeRun
RunOrch->>Export: exportLatestSnapshot
Export->>DB: getRunRow, getRatingsForRun
Export->>Export: Build snapshot + leaderboard
Export->>Export: Write latest.json, latest.js
Export-->>RunOrch: snapshot + paths
User->>Dashboard: Open dashboard (CLI or browser)
Dashboard->>Dashboard: Fetch latest.js
Dashboard->>Dashboard: Render leaderboard chart + table
Dashboard-->>User: Display results
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Tip Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Add QuipBench by implementing
|
| @@ -0,0 +1,277 @@ | |||
| const root = document.getElementById("app"); | |||
There was a problem hiding this comment.
🟢 Low dashboard/app.js:1
Consider adding a null check for root before using innerHTML, or document that the HTML file is expected to always contain an element with id="app".
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file bench/dashboard/app.js around line 1:
Consider adding a null check for `root` before using `innerHTML`, or document that the HTML file is expected to always contain an element with `id="app"`.
Evidence trail:
bench/dashboard/app.js line 1: `const root = document.getElementById("app");` - bench/dashboard/app.js lines 168, 190: `root.innerHTML = ...` used without null check - Verified at commit 630bd0c62165216ec01b4c5ed48a098eea0db153
| } | ||
|
|
||
| function render(snapshotData) { | ||
| const meta = snapshotData.runMeta; |
There was a problem hiding this comment.
🟡 Medium dashboard/app.js:186
The guard at line 272 only checks snapshot.leaderboard, but render accesses snapshotData.runMeta properties. Consider adding snapshot.runMeta to the guard to prevent a TypeError if it's missing.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file bench/dashboard/app.js around line 186:
The guard at line 272 only checks `snapshot.leaderboard`, but `render` accesses `snapshotData.runMeta` properties. Consider adding `snapshot.runMeta` to the guard to prevent a TypeError if it's missing.
Evidence trail:
bench/dashboard/app.js:272-276 shows guard `if (!snapshot || !snapshot.leaderboard)` without checking `runMeta`. bench/dashboard/app.js:186 shows `const meta = snapshotData.runMeta;` and lines 212-234 access `meta.runId`, `meta.roundsCompleted`, `meta.roundsRequested`, `meta.failures`, `meta.startedAt`, `meta.endedAt`, `meta.seed`. Commit 630bd0c62165216ec01b4c5ed48a098eea0db153.
| import { spawnSync } from "node:child_process"; | ||
| import { resolve } from "node:path"; | ||
|
|
||
| const dashboardPath = resolve(import.meta.dir, "dashboard", "index.html"); |
There was a problem hiding this comment.
🟢 Low bench/open.ts:4
import.meta.dir is Bun-specific and will be undefined in Node.js. If Node.js support is needed, consider using import.meta.dirname (Node 20.11+) or dirname(fileURLToPath(import.meta.url)). If this is intentionally Bun-only, consider adding a comment to clarify.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file bench/open.ts around line 4:
`import.meta.dir` is Bun-specific and will be `undefined` in Node.js. If Node.js support is needed, consider using `import.meta.dirname` (Node 20.11+) or `dirname(fileURLToPath(import.meta.url))`. If this is intentionally Bun-only, consider adding a comment to clarify.
Evidence trail:
bench/open.ts line 4 (commit 630bd0c62165216ec01b4c5ed48a098eea0db153): `const dashboardPath = resolve(import.meta.dir, "dashboard", "index.html");` - Bun documentation confirms `import.meta.dir` is a Bun-specific API: https://bun.sh/docs/api/import-meta. Node.js does not provide this property.
| }); | ||
| } | ||
| }), | ||
| ); |
There was a problem hiding this comment.
🟡 Medium bench/run.ts:265
The votes array order is non-deterministic because Promise.all callbacks resolve in unpredictable order. Consider sorting votes by voterId after the Promise.all completes, or collecting results with indices to preserve deterministic ordering.
}),
+ ).then(() => votes.sort((a, b) => a.voterId.localeCompare(b.voterId)));
- );🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file bench/run.ts around line 265:
The `votes` array order is non-deterministic because `Promise.all` callbacks resolve in unpredictable order. Consider sorting `votes` by `voterId` after the `Promise.all` completes, or collecting results with indices to preserve deterministic ordering.
Evidence trail:
bench/run.ts lines 227-264 at commit 630bd0c62165216ec01b4c5ed48a098eea0db153: Line 227 shows `const votes: VoteRecord[] = [];`, lines 231-264 show `await Promise.all(voters.map(async (voter) => { ... votes.push({...}); ... }))`. The shared array is mutated inside async callbacks that can complete in any order.
There was a problem hiding this comment.
Actionable comments posted: 11
🧹 Nitpick comments (7)
bench/open.ts (1)
1-19: ConsiderBun.$for shell invocation to stay idiomatic with the Bun runtime.
node:child_process.spawnSyncworks fine here sincedashboardPathis a static resolved path with no injection risk, but the project's Bun coding conventions favourBun.$for shell commands. Top-levelawaitis already available in ESM.♻️ Bun.$-based alternative
-import { spawnSync } from "node:child_process"; import { resolve } from "node:path"; const dashboardPath = resolve(import.meta.dir, "dashboard", "index.html"); -function openPath(target: string) { - if (process.platform === "darwin") { - return spawnSync("open", [target], { stdio: "inherit" }); - } - - if (process.platform === "win32") { - return spawnSync("cmd", ["/c", "start", "", target], { - stdio: "inherit", - shell: false, - }); - } - - return spawnSync("xdg-open", [target], { stdio: "inherit" }); -} - -const result = openPath(dashboardPath); -if (result.error) { - console.error(`Could not open dashboard: ${result.error.message}`); - process.exit(1); -} -if (typeof result.status === "number" && result.status !== 0) { - process.exit(result.status); -} +try { + if (process.platform === "darwin") { + await Bun.$`open ${dashboardPath}`; + } else if (process.platform === "win32") { + await Bun.$`start "" ${dashboardPath}`; + } else { + await Bun.$`xdg-open ${dashboardPath}`; + } +} catch (err) { + console.error(`Could not open dashboard: ${err instanceof Error ? err.message : err}`); + process.exit(1); +} console.log(`Opened Quipbench dashboard: ${dashboardPath}`);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bench/open.ts` around lines 1 - 19, Replace the spawnSync-based shell invocations in openPath (and remove unused spawnSync import) with Bun.$ calls to follow Bun idioms: keep the platform checks in openPath and call await Bun.$`open ${target}` on darwin, await Bun.$`xdg-open ${target}` on Linux, and for win32 use Bun.$`cmd /c start "" ${target}` (ensuring proper quoting/empty title). Also update usage of dashboardPath if needed and ensure openPath becomes async to await Bun.$.bench/dashboard/app.js (1)
1-1: Add anullguard before writingroot.innerHTML.If the
#appelement is ever missing (renamed, template changed), bothrenderEmpty()andrender()will throwTypeError: Cannot set properties of null. A one-line guard keeps it safe.♻️ Proposed fix
const root = document.getElementById("app"); +if (!root) throw new Error("Missing `#app` mount point"); const snapshot = window.__QUIPBENCH_LATEST__;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bench/dashboard/app.js` at line 1, The code currently grabs const root = document.getElementById("app") and later writes to root.innerHTML in renderEmpty() and render(), which will throw if root is null; add a one-line null guard right after obtaining root (e.g., if (!root) return; or throw a clear error) so both renderEmpty and render only run when root exists, or have those functions check root before setting innerHTML; reference the existing root variable and the renderEmpty()/render() functions when making the guard.bench/export.ts (1)
68-72: ReplacewriteFileSyncwithawait Bun.write().
writeFileSyncfromnode:fsblocks the event loop inside anasyncfunction and violates the project guideline to prefer Bun's native file APIs.♻️ Proposed refactor
-import { mkdirSync, writeFileSync } from "node:fs"; +import { mkdirSync } from "node:fs";- writeFileSync(latestJsonPath, JSON.stringify(snapshot, null, 2)); - writeFileSync( - latestJsPath, - `window.__QUIPBENCH_LATEST__ = ${JSON.stringify(snapshot, null, 2)};\n`, - ); + await Bun.write(latestJsonPath, JSON.stringify(snapshot, null, 2)); + await Bun.write( + latestJsPath, + `window.__QUIPBENCH_LATEST__ = ${JSON.stringify(snapshot, null, 2)};\n`, + );As per coding guidelines: "Prefer
Bun.fileover Node.jsnode:fsreadFile/writeFile for file operations."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bench/export.ts` around lines 68 - 72, Replace the blocking node:fs.writeFileSync calls with Bun's async file API: use await Bun.write(Bun.file(latestJsonPath), JSON.stringify(snapshot, null, 2)) and await Bun.write(Bun.file(latestJsPath), `window.__QUIPBENCH_LATEST__ = ${JSON.stringify(snapshot, null, 2)};\n`) (or Bun.write(path, data) per project style), and remove the node:fs import; update code that references writeFileSync to use these await Bun.write calls within the async function so the event loop is not blocked.bench/integration.test.ts (1)
2-2: ReplacereadFileSync+JSON.parsewithBun.file(...).json().Both test functions are
async, so the synchronousreadFileSyncis unnecessary and violates the guideline to prefer Bun's native file APIs overnode:fsread/write operations.♻️ Proposed refactor
-import { mkdtempSync, readFileSync } from "node:fs"; +import { mkdtempSync } from "node:fs";Then at lines 65–69:
- const latest = JSON.parse(readFileSync(result.snapshotPathJson, "utf8")) as { - runMeta: { runId: string }; - leaderboard: unknown[]; - chart: unknown[]; - }; + const latest = await Bun.file(result.snapshotPathJson).json() as { + runMeta: { runId: string }; + leaderboard: unknown[]; + chart: unknown[]; + };Same pattern for lines 116–118:
- const latest = JSON.parse(readFileSync(result.snapshotPathJson, "utf8")) as { - runMeta: { failures: number }; - }; + const latest = await Bun.file(result.snapshotPathJson).json() as { + runMeta: { failures: number }; + };As per coding guidelines: "Prefer
Bun.fileover Node.jsnode:fsreadFile/writeFile for file operations."Also applies to: 65-69, 116-119
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bench/integration.test.ts` at line 2, Replace synchronous JSON file reads that use readFileSync + JSON.parse with Bun's async API: call await Bun.file(path).json() in the async test functions instead of readFileSync and JSON.parse; remove readFileSync from the import list (keep mkdtempSync if used); update both occurrences where readFileSync is used (the blocks referencing JSON.parse) to use await Bun.file(filePath).json() and ensure the surrounding code awaits the result.bench/run.ts (2)
440-440:await writeLockat line 440 is redundant.
Promise.all(workers)at line 439 already awaits every worker to completion. Each worker unconditionallyawaits its ownserializeWritecall before looping, so by the timePromise.allresolves, all DB writes are committed andwriteLockis already settled.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bench/run.ts` at line 440, The standalone "await writeLock" is redundant because Promise.all(workers) already waits for each worker (which itself awaits serializeWrite) to finish; remove the extraneous "await writeLock" statement at the end of the run logic so completion relies on Promise.all(workers) and no extra awaiting of writeLock remains (refer to the writeLock variable, the workers array, Promise.all(workers), and each worker's serializeWrite call to locate the relevant code).
321-340: Stubaiobject is dead code.Lines 321-331 initialise
aito a placeholder that always throws, but theif (!options.ai)block immediately below (lines 335-340) unconditionally overwrites it with the live implementation. Because the two branches are mutually exclusive (options.aiprovided vs. not), the placeholder is never called. Drop it and assign directly:♻️ Proposed refactor
- let ai: QuipbenchAi = options.ai ?? { - async generatePrompt() { - throw new Error("Live AI is not loaded"); - }, - async generateAnswer() { - throw new Error("Live AI is not loaded"); - }, - async vote() { - throw new Error("Live AI is not loaded"); - }, - }; - let retry: RetryFn = defaultWithRetry; - let isRealStringFn = defaultIsRealString; - - if (!options.ai) { + let retry: RetryFn = defaultWithRetry; + let isRealStringFn = defaultIsRealString; + let ai: QuipbenchAi; + + if (options.ai) { + ai = options.ai; + } else { const live = await loadLiveAi(); ai = live.ai; retry = live.retry; isRealStringFn = live.isRealStringFn; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bench/run.ts` around lines 321 - 340, The placeholder QuipbenchAi that throws ("Live AI is not loaded") is dead code because the subsequent if (!options.ai) branch overwrites ai; remove the throwy stub and simplify initialization: declare ai (type QuipbenchAi), retry (RetryFn) and isRealStringFn, then if options.ai use options.ai/defaultWithRetry/defaultIsRealString, else await loadLiveAi() and assign ai, retry, isRealStringFn from the returned live object; update references to defaultWithRetry and defaultIsRealString only when options.ai is present and keep loadLiveAi, ai, retry, and isRealStringFn as the unique symbols to locate the change.bench/db.ts (1)
169-226:insertMatchre-prepares the statement on every invocation.
db.prepare(...)parses and compiles the SQL on each call.insertMatchis called once per round (100 by default) and is the hot path in the benchmark. The same applies toupdateRunProgress(lines 137–148), though it's less critical. Consider module-level or lazy caching of prepared statements, or restructure to accept a pre-compiled statement.A minimal approach is to memoize per
Databaseinstance:♻️ Proposed refactor: prepare once via a module-level WeakMap cache
+const insertMatchStmtCache = new WeakMap<Database, ReturnType<Database["prepare"]>>(); + export function insertMatch(db: Database, match: MatchRecord) { - const stmt = db.prepare(` + let stmt = insertMatchStmtCache.get(db); + if (!stmt) { + stmt = db.prepare(` INSERT INTO matches ( run_id, round_num, prompter_id, prompter_name, contestant_a_id, contestant_a_name, contestant_b_id, contestant_b_name, prompt, answer_a, answer_b, votes_a, votes_b, winner, error, payload_json ) VALUES ( $run_id, $round_num, $prompter_id, $prompter_name, $contestant_a_id, $contestant_a_name, $contestant_b_id, $contestant_b_name, $prompt, $answer_a, $answer_b, $votes_a, $votes_b, $winner, $error, $payload_json ) - `); + `); + insertMatchStmtCache.set(db, stmt); + } stmt.run({ ... }); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bench/db.ts` around lines 169 - 226, The insertMatch function currently calls db.prepare(...) on every invocation which re-parses the SQL and is hot — change it to reuse a prepared statement by caching it per Database instance (e.g., a module-level WeakMap<Database, Statement> keyed by db) and have insertMatch retrieve the cached statement (creating and storing it on first use) and then call stmt.run(...); do the same memoization for updateRunProgress (referencing the updateRunProgress function name) so both use precompiled statements instead of preparing on each call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bench/dashboard/app.js`:
- Around line 23-40: Add an HTML-escape helper (e.g., esc) and use it whenever
inserting untrusted strings into template HTML; specifically update rowHtml (use
esc(row.modelName) and any other string fields like row.*), and likewise replace
raw interpolations for meta.runId, champion.modelName, meta.startedAt,
meta.endedAt and other occurrences in the 190-269 region with esc(...). Ensure
esc converts &, <, >, and " to their HTML entities and apply it to any
String(...) values used inside template literals.
- Around line 108-117: The hardcoded 8-entry backgroundColor array used when
building the Chart.js dataset causes bars beyond index 7 to be transparent when
QUIPBENCH_MODELS has more than 8 entries; update the dataset creation (the
backgroundColor property) to generate an array of colors programmatically with
length equal to QUIPBENCH_MODELS.length (or the data array length) instead of
the fixed 8 values — e.g., produce colors via a helper that returns N distinct
hex/RGBA colors (or interpolates a HSL hue range) and assign that array to
backgroundColor where the existing literal currently appears.
In `@bench/dashboard/index.html`:
- Line 18: Update the Chart.js CDN <script> tag to include Subresource Integrity
and CORS: fetch the SRI hash for chart.js v4.4.6 from jsDelivr
(https://www.jsdelivr.com/package/npm/chart.js?version=4.4.6) and add an
integrity="..." attribute to the existing <script
src="https://cdn.jsdelivr.net/npm/chart.js@4.4.6/dist/chart.umd.min.js"></script>
tag, and also add crossorigin="anonymous" to that same <script> element to
enable proper SRI verification.
In `@bench/dashboard/styles.css`:
- Around line 93-97: The CSS rule .header-links a svg uses the fill value
"currentColor" which violates stylelint's value-keyword-case rule; change the
fill value to the lowercase "currentcolor" in the .header-links a svg selector
to conform to the linter.
- Around line 17-19: The CSS custom property --serif uses an uppercase font
keyword "Georgia" which triggers Stylelint's value-keyword-case rule; update the
--serif declaration in styles.css (the --serif variable assignment) to use the
lowercase font keyword georgia (i.e., "DM Serif Display", georgia, serif) so the
value-keyword-case linter error is resolved.
In `@bench/db.ts`:
- Around line 25-80: The foreign key constraints in initSchema are declared but
not enforced because SQLite defaults FK enforcement off; update the
database-creation path (openBenchDb where a Database instance is obtained) to
execute the PRAGMA that enables foreign keys on that connection (e.g., run
"PRAGMA foreign_keys = ON" via db.exec()) before running initSchema/any other
migrations so FK checks apply for matches and ratings referencing runs; ensure
this PRAGMA is set every time a Database is opened.
- Around line 228-272: The DELETE before the batched INSERT is executed outside
the transaction, risking lost data and race conditions; move the DELETE
statement (the db.prepare("DELETE FROM ratings WHERE run_id = $run_id").run({
$run_id: runId }) call) into the transaction callback used by tx so it runs
inside the same transaction as the INSERTs (i.e., call the DELETE at the start
of the tx callback before iterating rows and running stmt.run(...)), keeping
replaceRatings, stmt, tx, and the final tx(leaderboard) flow unchanged.
In `@bench/export.ts`:
- Around line 82-83: The dbPath/outputDir construction uses path.join which
mis-handles absolute args (join(process.cwd(), args.db) / join(process.cwd(),
args.out)); change both to use path.resolve so absolute inputs are preserved
(e.g., use path.resolve(process.cwd(), args.db) or path.resolve(args.db) as
appropriate) and ensure you import/use path.resolve from 'path'; keep the
fallback to DEFAULT_DB_PATH and DEFAULT_OUTPUT_DIR when args.db/args.out are
unset.
In `@bench/finalize-partial.ts`:
- Around line 1-32: The code currently seeds ratings from QUIPBENCH_MODELS which
breaks recovery for runs using custom models; instead query the DB for the
distinct contestant IDs for the run and build the ratings map from those IDs.
Replace the for-loop that iterates QUIPBENCH_MODELS with a DB query (using
db.query / run.id) like SELECT DISTINCT contestant_a_id AS id FROM matches WHERE
run_id = ? UNION SELECT DISTINCT contestant_b_id AS id FROM matches WHERE run_id
= ?; then for each returned id call QUIPBENCH_MODELS.find(m => m.id === id) to
get the model metadata (or create a minimal fallback { id }) and initialize
ratings.set(id, { model, elo: run.initial_elo, wins: 0, games: 0 }) so later
ratings.get(match.contestant_a_id) / ratings.get(match.contestant_b_id) will not
be undefined.
In `@bench/run.ts`:
- Around line 434-478: The code finalizes the run as "completed" before calling
exportLatestSnapshot, so if exportLatestSnapshot throws the catch block blindly
marks the run "failed" via finalizeRun(db, runId, "failed", ...); change the
flow to avoid double-finalization: either move finalizeRun(db, runId,
"completed", endedAt) to after exportLatestSnapshot completes, or in the catch
block check the current run status in the DB for runId and only call
finalizeRun(..., "failed", ...) if the status is not already "completed" (use
runId and db to query the run row or an existing helper), ensuring finalizeRun
is only used when appropriate.
- Around line 267-269: The code currently treats votesA===0 && votesB===0 as a
"TIE" which causes a bogus Elo update; change the logic around winner
determination to detect the all-voter-error case (votesA === 0 && votesB === 0)
and set winner = "ERROR" instead of "TIE", and then skip any Elo/rating updates
and avoid incrementing roundsCompleted for that round (the places to change are
the winner assignment logic and the subsequent Elo update/roundsCompleted code
paths); also ensure MatchRecord["winner"] and any consumers accept the "ERROR"
value or update its type if necessary.
---
Nitpick comments:
In `@bench/dashboard/app.js`:
- Line 1: The code currently grabs const root = document.getElementById("app")
and later writes to root.innerHTML in renderEmpty() and render(), which will
throw if root is null; add a one-line null guard right after obtaining root
(e.g., if (!root) return; or throw a clear error) so both renderEmpty and render
only run when root exists, or have those functions check root before setting
innerHTML; reference the existing root variable and the renderEmpty()/render()
functions when making the guard.
In `@bench/db.ts`:
- Around line 169-226: The insertMatch function currently calls db.prepare(...)
on every invocation which re-parses the SQL and is hot — change it to reuse a
prepared statement by caching it per Database instance (e.g., a module-level
WeakMap<Database, Statement> keyed by db) and have insertMatch retrieve the
cached statement (creating and storing it on first use) and then call
stmt.run(...); do the same memoization for updateRunProgress (referencing the
updateRunProgress function name) so both use precompiled statements instead of
preparing on each call.
In `@bench/export.ts`:
- Around line 68-72: Replace the blocking node:fs.writeFileSync calls with Bun's
async file API: use await Bun.write(Bun.file(latestJsonPath),
JSON.stringify(snapshot, null, 2)) and await Bun.write(Bun.file(latestJsPath),
`window.__QUIPBENCH_LATEST__ = ${JSON.stringify(snapshot, null, 2)};\n`) (or
Bun.write(path, data) per project style), and remove the node:fs import; update
code that references writeFileSync to use these await Bun.write calls within the
async function so the event loop is not blocked.
In `@bench/integration.test.ts`:
- Line 2: Replace synchronous JSON file reads that use readFileSync + JSON.parse
with Bun's async API: call await Bun.file(path).json() in the async test
functions instead of readFileSync and JSON.parse; remove readFileSync from the
import list (keep mkdtempSync if used); update both occurrences where
readFileSync is used (the blocks referencing JSON.parse) to use await
Bun.file(filePath).json() and ensure the surrounding code awaits the result.
In `@bench/open.ts`:
- Around line 1-19: Replace the spawnSync-based shell invocations in openPath
(and remove unused spawnSync import) with Bun.$ calls to follow Bun idioms: keep
the platform checks in openPath and call await Bun.$`open ${target}` on darwin,
await Bun.$`xdg-open ${target}` on Linux, and for win32 use Bun.$`cmd /c start
"" ${target}` (ensuring proper quoting/empty title). Also update usage of
dashboardPath if needed and ensure openPath becomes async to await Bun.$.
In `@bench/run.ts`:
- Line 440: The standalone "await writeLock" is redundant because
Promise.all(workers) already waits for each worker (which itself awaits
serializeWrite) to finish; remove the extraneous "await writeLock" statement at
the end of the run logic so completion relies on Promise.all(workers) and no
extra awaiting of writeLock remains (refer to the writeLock variable, the
workers array, Promise.all(workers), and each worker's serializeWrite call to
locate the relevant code).
- Around line 321-340: The placeholder QuipbenchAi that throws ("Live AI is not
loaded") is dead code because the subsequent if (!options.ai) branch overwrites
ai; remove the throwy stub and simplify initialization: declare ai (type
QuipbenchAi), retry (RetryFn) and isRealStringFn, then if options.ai use
options.ai/defaultWithRetry/defaultIsRealString, else await loadLiveAi() and
assign ai, retry, isRealStringFn from the returned live object; update
references to defaultWithRetry and defaultIsRealString only when options.ai is
present and keep loadLiveAi, ai, retry, and isRealStringFn as the unique symbols
to locate the change.
| function rowHtml(row) { | ||
| const logo = logoFor(row.modelName); | ||
| return ` | ||
| <tr> | ||
| <td class="mono rank">${row.rank}</td> | ||
| <td> | ||
| <div class="model-cell"> | ||
| ${logo ? `<img src="${logo}" alt="" />` : ""} | ||
| <span>${row.modelName}</span> | ||
| </div> | ||
| </td> | ||
| <td class="mono">${row.elo.toFixed(2)}</td> | ||
| <td class="mono">${row.wins}</td> | ||
| <td class="mono">${row.games}</td> | ||
| <td class="mono">${row.winRate.toFixed(2)}%</td> | ||
| </tr> | ||
| `; | ||
| } |
There was a problem hiding this comment.
Unescaped strings interpolated into innerHTML — consider sanitizing model name fields.
row.modelName, meta.runId, champion.modelName, meta.startedAt, and meta.endedAt are all interpolated raw into template-literal HTML. In practice the data is trusted (local SQLite, hardcoded model names), but if QUIPBENCH_MODELS ever includes a name with <, >, or " (e.g., a model whose display name contains HTML), it would be rendered as markup. A small escape helper is low-cost insurance:
♻️ Proposed escape helper
function esc(str) {
return String(str)
.replace(/&/g, "&")
.replace(/</g, "<")
.replace(/>/g, ">")
.replace(/"/g, """);
}Then wrap all string interpolations: ${esc(row.modelName)}, ${esc(meta.runId)}, etc.
Also applies to: 190-269
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bench/dashboard/app.js` around lines 23 - 40, Add an HTML-escape helper
(e.g., esc) and use it whenever inserting untrusted strings into template HTML;
specifically update rowHtml (use esc(row.modelName) and any other string fields
like row.*), and likewise replace raw interpolations for meta.runId,
champion.modelName, meta.startedAt, meta.endedAt and other occurrences in the
190-269 region with esc(...). Ensure esc converts &, <, >, and " to their HTML
entities and apply it to any String(...) values used inside template literals.
| backgroundColor: [ | ||
| "#e8ab97", | ||
| "#e09a81", | ||
| "#d98367", | ||
| "#d97757", | ||
| "#ca6b4b", | ||
| "#bc6141", | ||
| "#ae5637", | ||
| "#9f4b2d", | ||
| ], |
There was a problem hiding this comment.
Hardcoded 8-color backgroundColor array will produce transparent bars if model count exceeds 8.
There are 8 hardcoded colors but Chart.js won't cycle them for bar charts — additional bars beyond index 7 get undefined (transparent). As models are added to QUIPBENCH_MODELS, this silently breaks the chart. Generate colors programmatically based on the actual data length instead.
♻️ Proposed fix
- backgroundColor: [
- "#e8ab97",
- "#e09a81",
- "#d98367",
- "#d97757",
- "#ca6b4b",
- "#bc6141",
- "#ae5637",
- "#9f4b2d",
- ],
+ backgroundColor: sorted.map((_, i) => {
+ const t = sorted.length > 1 ? i / (sorted.length - 1) : 0;
+ // interpolate from light (`#e8ab97`) to dark (`#9f4b2d`)
+ const r = Math.round(0xe8 + t * (0x9f - 0xe8));
+ const g = Math.round(0xab + t * (0x4b - 0xab));
+ const b = Math.round(0x97 + t * (0x2d - 0x97));
+ return `rgb(${r},${g},${b})`;
+ }),🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bench/dashboard/app.js` around lines 108 - 117, The hardcoded 8-entry
backgroundColor array used when building the Chart.js dataset causes bars beyond
index 7 to be transparent when QUIPBENCH_MODELS has more than 8 entries; update
the dataset creation (the backgroundColor property) to generate an array of
colors programmatically with length equal to QUIPBENCH_MODELS.length (or the
data array length) instead of the fixed 8 values — e.g., produce colors via a
helper that returns N distinct hex/RGBA colors (or interpolates a HSL hue range)
and assign that array to backgroundColor where the existing literal currently
appears.
| <body> | ||
| <div id="app" class="app"></div> | ||
| <script src="../out/latest.js"></script> | ||
| <script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.6/dist/chart.umd.min.js"></script> |
There was a problem hiding this comment.
Add integrity and crossorigin to the Chart.js CDN <script> tag.
The CDN tag has no Subresource Integrity hash. Without SRI hashes, if a CDN is compromised or serves malicious code, your website will unknowingly execute that code, potentially exposing your users to attacks. Since the version is pinned to 4.4.6, jsDelivr provides SRI hashes — obtain it from https://www.jsdelivr.com/package/npm/chart.js?version=4.4.6 and add it alongside crossorigin="anonymous".
🛡️ Proposed fix
- <script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.6/dist/chart.umd.min.js"></script>
+ <script
+ src="https://cdn.jsdelivr.net/npm/chart.js@4.4.6/dist/chart.umd.min.js"
+ integrity="sha384-<HASH>"
+ crossorigin="anonymous"
+ ></script>🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bench/dashboard/index.html` at line 18, Update the Chart.js CDN <script> tag
to include Subresource Integrity and CORS: fetch the SRI hash for chart.js
v4.4.6 from jsDelivr
(https://www.jsdelivr.com/package/npm/chart.js?version=4.4.6) and add an
integrity="..." attribute to the existing <script
src="https://cdn.jsdelivr.net/npm/chart.js@4.4.6/dist/chart.umd.min.js"></script>
tag, and also add crossorigin="anonymous" to that same <script> element to
enable proper SRI verification.
| --mono: "JetBrains Mono", "SF Mono", monospace; | ||
| --sans: "Inter", -apple-system, sans-serif; | ||
| --serif: "DM Serif Display", Georgia, serif; |
There was a problem hiding this comment.
Stylelint value-keyword-case error: Georgia must be lowercase.
🛠️ Proposed fix
- --serif: "DM Serif Display", Georgia, serif;
+ --serif: "DM Serif Display", georgia, serif;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| --mono: "JetBrains Mono", "SF Mono", monospace; | |
| --sans: "Inter", -apple-system, sans-serif; | |
| --serif: "DM Serif Display", Georgia, serif; | |
| --mono: "JetBrains Mono", "SF Mono", monospace; | |
| --sans: "Inter", -apple-system, sans-serif; | |
| --serif: "DM Serif Display", georgia, serif; |
🧰 Tools
🪛 Stylelint (17.3.0)
[error] 19-19: Expected "Georgia" to be "georgia" (value-keyword-case)
(value-keyword-case)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bench/dashboard/styles.css` around lines 17 - 19, The CSS custom property
--serif uses an uppercase font keyword "Georgia" which triggers Stylelint's
value-keyword-case rule; update the --serif declaration in styles.css (the
--serif variable assignment) to use the lowercase font keyword georgia (i.e.,
"DM Serif Display", georgia, serif) so the value-keyword-case linter error is
resolved.
| .header-links a svg { | ||
| width: 13px; | ||
| height: 13px; | ||
| fill: currentColor; | ||
| } |
There was a problem hiding this comment.
Stylelint value-keyword-case error: currentColor must be currentcolor.
🛠️ Proposed fix
- fill: currentColor;
+ fill: currentcolor;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| .header-links a svg { | |
| width: 13px; | |
| height: 13px; | |
| fill: currentColor; | |
| } | |
| .header-links a svg { | |
| width: 13px; | |
| height: 13px; | |
| fill: currentcolor; | |
| } |
🧰 Tools
🪛 Stylelint (17.3.0)
[error] 96-96: Expected "currentColor" to be "currentcolor" (value-keyword-case)
(value-keyword-case)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bench/dashboard/styles.css` around lines 93 - 97, The CSS rule .header-links
a svg uses the fill value "currentColor" which violates stylelint's
value-keyword-case rule; change the fill value to the lowercase "currentcolor"
in the .header-links a svg selector to conform to the linter.
| export function replaceRatings( | ||
| db: Database, | ||
| runId: string, | ||
| leaderboard: LeaderboardRow[], | ||
| ) { | ||
| db.prepare("DELETE FROM ratings WHERE run_id = $run_id").run({ $run_id: runId }); | ||
|
|
||
| const stmt = db.prepare(` | ||
| INSERT INTO ratings ( | ||
| run_id, | ||
| model_id, | ||
| model_name, | ||
| elo, | ||
| wins, | ||
| games, | ||
| win_rate, | ||
| rank | ||
| ) VALUES ( | ||
| $run_id, | ||
| $model_id, | ||
| $model_name, | ||
| $elo, | ||
| $wins, | ||
| $games, | ||
| $win_rate, | ||
| $rank | ||
| ) | ||
| `); | ||
|
|
||
| const tx = db.transaction((rows: LeaderboardRow[]) => { | ||
| for (const row of rows) { | ||
| stmt.run({ | ||
| $run_id: runId, | ||
| $model_id: row.modelId, | ||
| $model_name: row.modelName, | ||
| $elo: row.elo, | ||
| $wins: row.wins, | ||
| $games: row.games, | ||
| $win_rate: row.winRate, | ||
| $rank: row.rank, | ||
| }); | ||
| } | ||
| }); | ||
|
|
||
| tx(leaderboard); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, check the actual file and line numbers
if [ -f "bench/db.ts" ]; then
wc -l bench/db.ts
echo "---"
sed -n '228,272p' bench/db.ts
fiRepository: T3-Content/quipslop
Length of output: 952
🏁 Script executed:
#!/bin/bash
# Search for other transaction() usage patterns in the codebase to understand the API
rg "db\.transaction" --type ts -A 10 -B 2Repository: T3-Content/quipslop
Length of output: 554
🏁 Script executed:
#!/bin/bash
# Search for the import statement and other db operations
rg "from.*bun:sqlite" --type tsRepository: T3-Content/quipslop
Length of output: 274
🌐 Web query:
bun:sqlite transaction API prepare statement inside callback
💡 Result:
In bun:sqlite, db.transaction(fn) wraps a synchronous function and runs it between BEGIN … COMMIT (or ROLLBACK on throw). The docs’ examples prepare statements outside the transaction callback and only .run() them inside. [1]
Can you .prepare() inside the transaction callback?
Yes, but it’s usually the wrong tradeoff:
db.prepare(sql)creates a freshStatementeach time (no caching), so doing it inside the callback will recompile every transaction execution. [2]- If you want to create/lookup a reusable statement from inside the callback, prefer
db.query(sql), which returns a cachedStatementfor the same SQL string (safe to reuse with different params). [2]
Recommended pattern
import { Database } from "bun:sqlite";
const db = new Database("app.db");
const insertUser = db.prepare("INSERT INTO users (name) VALUES (?)");
const insertMany = db.transaction((names: string[]) => {
for (const name of names) insertUser.run(name);
});
insertMany(["Alice", "Bob"]);This matches the intended usage shown in the Bun docs. [1]
If you share what error you hit (and a minimal snippet), I can pinpoint whether it’s due to (1) accidentally using async/await in the transaction body, (2) statement lifetime/finalize, or (3) preparing dynamic SQL each run.
🏁 Script executed:
#!/bin/bash
# Look for test files or integration tests that might show transaction usage
find . -name "*.test.ts" -o -name "*.test.js" | xargs grep -l "transaction" 2>/dev/null | head -5Repository: T3-Content/quipslop
Length of output: 45
🏁 Script executed:
#!/bin/bash
# Check if there are any integration tests using replaceRatings or transaction patterns
rg "replaceRatings\|\.transaction" --type ts bench/ -A 5 -B 2Repository: T3-Content/quipslop
Length of output: 45
🏁 Script executed:
#!/bin/bash
# Check how often replaceRatings is called to assess performance impact
rg "replaceRatings" --type ts -B 2 -A 2Repository: T3-Content/quipslop
Length of output: 1227
Move the DELETE statement inside the transaction to ensure atomicity and crash-safety.
Currently, the DELETE at line 233 executes outside the transaction, while the INSERT batch runs inside tx(leaderboard) at line 272. This creates two critical risks:
- Crash-safety: If the process dies between the DELETE and the INSERT transaction, the run's ratings are permanently deleted but never re-inserted. The run will show
status = 'completed'butgetRatingsForRunreturns empty, breaking any subsequent export. - Concurrent readers:
exportLatestSnapshot(called viaquipbench:exportfrom other processes) may readgetRatingsForRunduring the gap between DELETE and INSERT and throw"Run X has no ratings".
Bun:sqlite allows .prepare(...).run(...) inside the transaction callback, so the fix is straightforward: move the DELETE into the transaction:
Proposed fix
const tx = db.transaction((rows: LeaderboardRow[]) => {
+ db.prepare("DELETE FROM ratings WHERE run_id = $run_id").run({ $run_id: runId });
for (const row of rows) {
stmt.run({
$run_id: runId,
$model_id: row.modelId,
$model_name: row.modelName,
$elo: row.elo,
$wins: row.wins,
$games: row.games,
$win_rate: row.winRate,
$rank: row.rank,
});
}
});
tx(leaderboard);Remove the DELETE line before const stmt = db.prepare(...).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bench/db.ts` around lines 228 - 272, The DELETE before the batched INSERT is
executed outside the transaction, risking lost data and race conditions; move
the DELETE statement (the db.prepare("DELETE FROM ratings WHERE run_id =
$run_id").run({ $run_id: runId }) call) into the transaction callback used by tx
so it runs inside the same transaction as the INSERTs (i.e., call the DELETE at
the start of the tx callback before iterating rows and running stmt.run(...)),
keeping replaceRatings, stmt, tx, and the final tx(leaderboard) flow unchanged.
| const dbPath = args.db ? join(process.cwd(), args.db) : DEFAULT_DB_PATH; | ||
| const outputDir = args.out ? join(process.cwd(), args.out) : DEFAULT_OUTPUT_DIR; |
There was a problem hiding this comment.
path.join silently produces wrong paths when --db or --out are absolute.
path.join() does not discriminate between absolute and relative paths — it will combine the inputs and provide you with the joined path. So join(process.cwd(), '/absolute/path') produces ${cwd}/absolute/path instead of /absolute/path. path.resolve() resolves a sequence of paths into an absolute path, processing segments right-to-left until an absolute path is constructed, so it handles both relative and absolute inputs correctly.
🐛 Proposed fix
-import { join } from "node:path";
+import { join, resolve } from "node:path";- const dbPath = args.db ? join(process.cwd(), args.db) : DEFAULT_DB_PATH;
- const outputDir = args.out ? join(process.cwd(), args.out) : DEFAULT_OUTPUT_DIR;
+ const dbPath = args.db ? resolve(args.db) : DEFAULT_DB_PATH;
+ const outputDir = args.out ? resolve(args.out) : DEFAULT_OUTPUT_DIR;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const dbPath = args.db ? join(process.cwd(), args.db) : DEFAULT_DB_PATH; | |
| const outputDir = args.out ? join(process.cwd(), args.out) : DEFAULT_OUTPUT_DIR; | |
| const dbPath = args.db ? resolve(args.db) : DEFAULT_DB_PATH; | |
| const outputDir = args.out ? resolve(args.out) : DEFAULT_OUTPUT_DIR; |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bench/export.ts` around lines 82 - 83, The dbPath/outputDir construction uses
path.join which mis-handles absolute args (join(process.cwd(), args.db) /
join(process.cwd(), args.out)); change both to use path.resolve so absolute
inputs are preserved (e.g., use path.resolve(process.cwd(), args.db) or
path.resolve(args.db) as appropriate) and ensure you import/use path.resolve
from 'path'; keep the fallback to DEFAULT_DB_PATH and DEFAULT_OUTPUT_DIR when
args.db/args.out are unset.
| import { QUIPBENCH_MODELS } from "./models"; | ||
| import { updatePairElo } from "./elo"; | ||
| import { buildLeaderboard } from "./leaderboard"; | ||
| import { openBenchDb, replaceRatings } from "./db"; | ||
| import { DEFAULT_DB_PATH, DEFAULT_OUTPUT_DIR } from "./config"; | ||
| import { exportLatestSnapshot } from "./export"; | ||
| import type { RatingState } from "./types"; | ||
|
|
||
| const db = openBenchDb(DEFAULT_DB_PATH); | ||
|
|
||
| try { | ||
| const run = db.query(` | ||
| SELECT id, initial_elo, elo_k, rounds_requested | ||
| FROM runs | ||
| WHERE status = 'running' | ||
| ORDER BY datetime(started_at) DESC | ||
| LIMIT 1 | ||
| `).get() as { id: string; initial_elo: number; elo_k: number; rounds_requested: number } | null; | ||
|
|
||
| if (!run) { | ||
| throw new Error("No running Quipbench run found to finalize"); | ||
| } | ||
|
|
||
| const ratings = new Map<string, RatingState>(); | ||
| for (const model of QUIPBENCH_MODELS) { | ||
| ratings.set(model.id, { | ||
| model, | ||
| elo: run.initial_elo, | ||
| wins: 0, | ||
| games: 0, | ||
| }); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, examine bench/run.ts to see if --models flag is exposed
cat -n bench/run.ts | head -100Repository: T3-Content/quipslop
Length of output: 3231
🏁 Script executed:
# Check how finalize-partial.ts is invoked and if it's a CLI entry point
file bench/finalize-partial.ts && head -40 bench/finalize-partial.tsRepository: T3-Content/quipslop
Length of output: 107
🏁 Script executed:
# Verify the ratings.get() behavior at lines 49-50 mentioned in the review
sed -n '45,55p' bench/finalize-partial.tsRepository: T3-Content/quipslop
Length of output: 339
🏁 Script executed:
# Check the database schema to see if matches table has model_id
rg "model_id" bench/db.ts -A 2 -B 2 | head -50Repository: T3-Content/quipslop
Length of output: 859
🏁 Script executed:
# Check if --models is defined as a command-line argument anywhere
rg "\-\-models" bench/ --type tsRepository: T3-Content/quipslop
Length of output: 45
🏁 Script executed:
# Look for more of bench/run.ts to see CLI argument parsing
tail -150 bench/run.tsRepository: T3-Content/quipslop
Length of output: 4127
🏁 Script executed:
# Check if finalize-partial.ts is a CLI entry point and how it's structured
cat -n bench/finalize-partial.tsRepository: T3-Content/quipslop
Length of output: 3485
🏁 Script executed:
# Check how matches table is structured
rg "CREATE TABLE.*matches" bench/db.ts -A 15Repository: T3-Content/quipslop
Length of output: 627
🏁 Script executed:
# Search for any programmatic invocation of runQuipbench or finalize functions
rg "runQuipbench|finalize" --type ts -B 2 -A 2 | head -80Repository: T3-Content/quipslop
Length of output: 2673
🏁 Script executed:
# Check the RunQuipbenchOptions type definition
rg "type RunQuipbenchOptions" bench/run.ts -A 20Repository: T3-Content/quipslop
Length of output: 735
🏁 Script executed:
# Verify whether models parameter is actually used in runQuipbench function
sed -n '100,250p' bench/run.ts | grep -i "model" -A 2 -B 2Repository: T3-Content/quipslop
Length of output: 459
🏁 Script executed:
# Check if there's any way to recover the model list from matches or runs table
rg "CREATE TABLE.*runs" bench/db.ts -A 25Repository: T3-Content/quipslop
Length of output: 927
🏁 Script executed:
# Check what contestant_a_id and contestant_b_id look like in practice
rg "contestant_a_id|contestant_a_name" bench/ --type ts -B 2 -A 2 | head -60Repository: T3-Content/quipslop
Length of output: 1978
🏁 Script executed:
# Check whether contestant_a_id is derived from model.id
rg "model\.id" bench/run.ts -B 3 -A 3 | head -40Repository: T3-Content/quipslop
Length of output: 224
🏁 Script executed:
# Check if there's any documentation or comment about finalize-partial.ts usage
head -50 bench/finalize-partial.ts && grep -r "finalize-partial" bench/ --type tsRepository: T3-Content/quipslop
Length of output: 1633
finalize-partial.ts uses hardcoded QUIPBENCH_MODELS instead of recovering the actual models from the database.
This is a real issue for the recovery scenario: if runQuipbench() is invoked programmatically with custom models and the run is interrupted, calling finalize-partial.ts separately will initialize ratings from the hardcoded QUIPBENCH_MODELS instead of the models that were actually used. Subsequent ratings.get(match.contestant_a_id) calls on lines 49–50 will return undefined for any non-default models, causing the if (!a || !b) continue guard on line 51 to silently skip those matches.
(The CLI itself is safe—bench/run.ts main() doesn't expose a --models flag, so it always uses defaults. The risk only applies to programmatic usage of runQuipbench({ models: [...] }) followed by manual finalize-partial.ts invocation.)
Mitigation: Query the distinct contestant_a_id values from the matches table for the target run and reconstruct the ratings map dynamically rather than relying on the hardcoded constant. This ensures recovery works regardless of which model set was originally used.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bench/finalize-partial.ts` around lines 1 - 32, The code currently seeds
ratings from QUIPBENCH_MODELS which breaks recovery for runs using custom
models; instead query the DB for the distinct contestant IDs for the run and
build the ratings map from those IDs. Replace the for-loop that iterates
QUIPBENCH_MODELS with a DB query (using db.query / run.id) like SELECT DISTINCT
contestant_a_id AS id FROM matches WHERE run_id = ? UNION SELECT DISTINCT
contestant_b_id AS id FROM matches WHERE run_id = ?; then for each returned id
call QUIPBENCH_MODELS.find(m => m.id === id) to get the model metadata (or
create a minimal fallback { id }) and initialize ratings.set(id, { model, elo:
run.initial_elo, wins: 0, games: 0 }) so later
ratings.get(match.contestant_a_id) / ratings.get(match.contestant_b_id) will not
be undefined.
| let winner: MatchRecord["winner"] = "TIE"; | ||
| if (votesA > votesB) winner = "A"; | ||
| else if (votesB > votesA) winner = "B"; |
There was a problem hiding this comment.
All-voter-error case is silently counted as a TIE and updates Elo.
When every voter in a round throws (retry exhausted), votesA and votesB both remain 0, so winner defaults to "TIE" at line 267. The round is then counted as roundsCompleted, and both contestants receive a neutral 0.5 Elo update. A round where no valid judgment was rendered should arguably be treated as "ERROR" to avoid polluting the ratings.
🐛 Proposed fix: promote zero-vote rounds to ERROR
let winner: MatchRecord["winner"] = "TIE";
if (votesA > votesB) winner = "A";
else if (votesB > votesA) winner = "B";
+
+ if (winner === "TIE" && votesA === 0 && votes.every((v) => v.votedFor === null)) {
+ return {
+ match: {
+ ...baseMatch,
+ prompt,
+ answerA,
+ answerB,
+ votesA: 0,
+ votesB: 0,
+ votes,
+ winner: "ERROR",
+ error: "All voter decisions failed",
+ },
+ };
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let winner: MatchRecord["winner"] = "TIE"; | |
| if (votesA > votesB) winner = "A"; | |
| else if (votesB > votesA) winner = "B"; | |
| let winner: MatchRecord["winner"] = "TIE"; | |
| if (votesA > votesB) winner = "A"; | |
| else if (votesB > votesA) winner = "B"; | |
| if (winner === "TIE" && votesA === 0 && votes.every((v) => v.votedFor === null)) { | |
| return { | |
| match: { | |
| ...baseMatch, | |
| prompt, | |
| answerA, | |
| answerB, | |
| votesA: 0, | |
| votesB: 0, | |
| votes, | |
| winner: "ERROR", | |
| error: "All voter decisions failed", | |
| }, | |
| }; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bench/run.ts` around lines 267 - 269, The code currently treats votesA===0 &&
votesB===0 as a "TIE" which causes a bogus Elo update; change the logic around
winner determination to detect the all-voter-error case (votesA === 0 && votesB
=== 0) and set winner = "ERROR" instead of "TIE", and then skip any Elo/rating
updates and avoid incrementing roundsCompleted for that round (the places to
change are the winner assignment logic and the subsequent Elo
update/roundsCompleted code paths); also ensure MatchRecord["winner"] and any
consumers accept the "ERROR" value or update its type if necessary.
| try { | ||
| const workers = Array.from( | ||
| { length: Math.min(concurrency, rounds) }, | ||
| () => worker(), | ||
| ); | ||
| await Promise.all(workers); | ||
| await writeLock; | ||
|
|
||
| const leaderboard = buildLeaderboard(Array.from(ratings.values())); | ||
| replaceRatings(db, runId, leaderboard); | ||
|
|
||
| const endedAt = new Date().toISOString(); | ||
| finalizeRun(db, runId, "completed", endedAt); | ||
|
|
||
| const snapshotPaths = await exportLatestSnapshot({ | ||
| dbPath, | ||
| outputDir, | ||
| runId, | ||
| }); | ||
|
|
||
| process.stdout.write("\n"); | ||
|
|
||
| return { | ||
| runMeta: { | ||
| runId, | ||
| startedAt, | ||
| endedAt, | ||
| roundsRequested: rounds, | ||
| roundsCompleted, | ||
| failures, | ||
| concurrency, | ||
| eloK, | ||
| initialElo, | ||
| seed, | ||
| }, | ||
| leaderboard, | ||
| snapshotPathJson: snapshotPaths.latestJsonPath, | ||
| snapshotPathJs: snapshotPaths.latestJsPath, | ||
| }; | ||
| } catch (error) { | ||
| finalizeRun(db, runId, "failed", new Date().toISOString()); | ||
| throw error; | ||
| } finally { | ||
| db.close(); | ||
| } |
There was a problem hiding this comment.
Run finalized as "failed" even when rounds completed successfully if exportLatestSnapshot throws.
finalizeRun(db, runId, "completed", endedAt) is called at line 446, then exportLatestSnapshot is awaited at line 448. If the snapshot export throws (filesystem error, empty ratings, etc.), the catch block at line 474 overwrites the already-committed "completed" status with "failed". Because getLatestCompletedRunId filters on status = 'completed', the fully-computed run becomes invisible to subsequent exports and the dashboard.
🐛 Proposed fix: guard against double-finalization
+ let runFinalized = false;
try {
const workers = Array.from(
{ length: Math.min(concurrency, rounds) },
() => worker(),
);
await Promise.all(workers);
await writeLock;
const leaderboard = buildLeaderboard(Array.from(ratings.values()));
replaceRatings(db, runId, leaderboard);
const endedAt = new Date().toISOString();
finalizeRun(db, runId, "completed", endedAt);
+ runFinalized = true;
const snapshotPaths = await exportLatestSnapshot({
dbPath,
outputDir,
runId,
});
// ...
} catch (error) {
- finalizeRun(db, runId, "failed", new Date().toISOString());
+ if (!runFinalized) {
+ finalizeRun(db, runId, "failed", new Date().toISOString());
+ }
throw error;
} finally {
db.close();
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| try { | |
| const workers = Array.from( | |
| { length: Math.min(concurrency, rounds) }, | |
| () => worker(), | |
| ); | |
| await Promise.all(workers); | |
| await writeLock; | |
| const leaderboard = buildLeaderboard(Array.from(ratings.values())); | |
| replaceRatings(db, runId, leaderboard); | |
| const endedAt = new Date().toISOString(); | |
| finalizeRun(db, runId, "completed", endedAt); | |
| const snapshotPaths = await exportLatestSnapshot({ | |
| dbPath, | |
| outputDir, | |
| runId, | |
| }); | |
| process.stdout.write("\n"); | |
| return { | |
| runMeta: { | |
| runId, | |
| startedAt, | |
| endedAt, | |
| roundsRequested: rounds, | |
| roundsCompleted, | |
| failures, | |
| concurrency, | |
| eloK, | |
| initialElo, | |
| seed, | |
| }, | |
| leaderboard, | |
| snapshotPathJson: snapshotPaths.latestJsonPath, | |
| snapshotPathJs: snapshotPaths.latestJsPath, | |
| }; | |
| } catch (error) { | |
| finalizeRun(db, runId, "failed", new Date().toISOString()); | |
| throw error; | |
| } finally { | |
| db.close(); | |
| } | |
| let runFinalized = false; | |
| try { | |
| const workers = Array.from( | |
| { length: Math.min(concurrency, rounds) }, | |
| () => worker(), | |
| ); | |
| await Promise.all(workers); | |
| await writeLock; | |
| const leaderboard = buildLeaderboard(Array.from(ratings.values())); | |
| replaceRatings(db, runId, leaderboard); | |
| const endedAt = new Date().toISOString(); | |
| finalizeRun(db, runId, "completed", endedAt); | |
| runFinalized = true; | |
| const snapshotPaths = await exportLatestSnapshot({ | |
| dbPath, | |
| outputDir, | |
| runId, | |
| }); | |
| process.stdout.write("\n"); | |
| return { | |
| runMeta: { | |
| runId, | |
| startedAt, | |
| endedAt, | |
| roundsRequested: rounds, | |
| roundsCompleted, | |
| failures, | |
| concurrency, | |
| eloK, | |
| initialElo, | |
| seed, | |
| }, | |
| leaderboard, | |
| snapshotPathJson: snapshotPaths.latestJsonPath, | |
| snapshotPathJs: snapshotPaths.latestJsPath, | |
| }; | |
| } catch (error) { | |
| if (!runFinalized) { | |
| finalizeRun(db, runId, "failed", new Date().toISOString()); | |
| } | |
| throw error; | |
| } finally { | |
| db.close(); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bench/run.ts` around lines 434 - 478, The code finalizes the run as
"completed" before calling exportLatestSnapshot, so if exportLatestSnapshot
throws the catch block blindly marks the run "failed" via finalizeRun(db, runId,
"failed", ...); change the flow to avoid double-finalization: either move
finalizeRun(db, runId, "completed", endedAt) to after exportLatestSnapshot
completes, or in the catch block check the current run status in the DB for
runId and only call finalizeRun(..., "failed", ...) if the status is not already
"completed" (use runId and db to query the run row or an existing helper),
ensuring finalizeRun is only used when appropriate.
Summary
I added Quipbench, because I can :)
What’s included
bench/benchmark runner with live OpenRouter self-play roundswins,games,winRate) with SQLite persistencebench/out/latest.jsonandbench/out/latest.jsbench/dashboard/(no integration into app routes)package.json:quipbench:runquipbench:exportquipbench:openNotes
/,/history,/admin,/broadcast).Summary by CodeRabbit
Release Notes
New Features
quipbench:runto execute benchmarks,quipbench:exportto save results, andquipbench:opento view the dashboard.Documentation