feat: add experimental verbal sampling CoT prompt generation#36

Open
zeno205 wants to merge 1 commit into T3-Content:main from zeno205:feat/verbal-sample-cot

Conversation

@zeno205 zeno205 commented Feb 23, 2026

Verbalized Sampling CoT prompt generation experiment

Adds an additional prompt generation strategy: the model generates 5 candidate prompts with a verbal chain-of-thought reasoning step, then a second model call selects the best one (inspired by this paper).

Changes

  • llm-json-fixer.ts (new): Standalone JSON extraction/repair utility for LLM responses. Handles common failure modes: markdown fences, trailing text, unescaped quotes, and missing commas. Exposes extractJSON, parseJSON, and tryParseJSON (based on code from this repo).

  • game.ts — Adds buildVerbalSampleCotSystem() and callSelectBestPrompt(). callGeneratePrompt() now branches on the EXPERIMENTAL_VERBAL_SAMPLE_COT=1 env flag; existing behavior is unchanged when the flag is absent.

No existing behavior is affected without the flag.
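The flag-gated flow can be sketched roughly as follows. This is a minimal sketch, not the code in this PR: the `Candidate` shape and every helper body below are hypothetical stand-ins; only the env flag name and the generate-then-select shape come from the description above.

```typescript
// Hypothetical sketch of the flag-gated flow; helper bodies are stubs,
// not the real LLM calls in game.ts.
interface Candidate {
  joke: string;        // candidate prompt text
  probability: number; // model-verbalized probability, used as a fallback signal
}

async function legacySinglePrompt(): Promise<string> {
  return "single prompt"; // stands in for the existing single-prompt path
}

async function generateCandidates(): Promise<Candidate[]> {
  // the real flow asks the model for 5 JSON-formatted candidates
  return [
    { joke: "prompt A", probability: 0.2 },
    { joke: "prompt B", probability: 0.5 },
  ];
}

async function selectBestPrompt(jokes: string[]): Promise<string> {
  return jokes[0]; // stands in for the second model call
}

function pickByProbability(candidates: Candidate[]): string {
  // deterministic fallback: highest verbalized probability wins
  return candidates.reduce((best, c) => (c.probability > best.probability ? c : best)).joke;
}

async function generatePrompt(): Promise<string> {
  if (process.env.EXPERIMENTAL_VERBAL_SAMPLE_COT !== "1") {
    return legacySinglePrompt(); // existing behavior when the flag is absent
  }
  const candidates = await generateCandidates();
  try {
    const selected = await selectBestPrompt(candidates.map((c) => c.joke));
    const match = candidates.find((c) => c.joke === selected);
    if (match) return match.joke;
  } catch {
    // selection failed; fall through to the probability-based fallback
  }
  return pickByProbability(candidates);
}
```

The point of the shape is that the legacy path returns before any experimental code runs, so behavior with the flag unset is untouched.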

Summary by CodeRabbit

  • New Features

    • Introduced experimental Chain-Of-Thought prompt generation that creates multiple prompt candidates and automatically selects the best option when enabled. Standard single-prompt generation remains available as default.
  • Infrastructure

    • Added robust JSON parsing with automated error correction to improve handling of AI-generated responses.

coderabbitai bot commented Feb 23, 2026

📝 Walkthrough

Walkthrough

A new experimental Chain-Of-Thought prompt generation feature is introduced, which generates multiple JSON-formatted prompt candidates, parses them with error correction, validates them, and selects the best one. A complementary JSON parsing utility module is added to handle automated fixes for common JSON formatting issues.

Changes

  • Chain-of-Thought Prompt Generation — game.ts: Adds experimental EXPERIMENTAL_VERBAL_SAMPLE_COT feature that generates 5 JSON-formatted prompts via LLM, parses and validates them, then selects the best candidate. Introduces new callSelectBestPrompt export and integrates the JSON parsing utility. Maintains backward compatibility with the existing single-prompt flow when disabled.

  • JSON Parsing Utility — llm-json-fixer.ts: New module providing robust JSON parsing with automatic error recovery. Includes helpers for extracting JSON from markdown, removing trailing content, fixing unescaped quotes, and adding missing commas. Public API exports tryParseJSON, parseJSON, and extractJSON with diagnostic reporting.
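The retry-with-repairs idea behind such a utility can be illustrated with a minimal sketch. This is not the actual llm-json-fixer.ts code: the two fix steps, their order, and the `ParseResult` shape here are simplified assumptions.

```typescript
// Simplified sketch of a tryParseJSON-style repair loop: attempt JSON.parse,
// and after each failure apply one more repair before retrying.
interface ParseResult {
  data: unknown | null;
  fixes: string[]; // which repairs were applied before parsing succeeded
}

function stripFences(text: string): string {
  // remove a leading ```json fence and a trailing ``` fence, if present
  return text.replace(/^```(?:json)?\s*/m, "").replace(/```\s*$/m, "").trim();
}

function dropTrailingText(text: string): string {
  // keep everything up to the last closing brace/bracket
  const end = Math.max(text.lastIndexOf("}"), text.lastIndexOf("]"));
  return end === -1 ? text : text.slice(0, end + 1);
}

function tryParseJSONSketch(raw: string): ParseResult {
  const fixes: string[] = [];
  let text = raw;
  for (const [name, fix] of [
    ["strip markdown fences", stripFences],
    ["drop trailing text", dropTrailingText],
  ] as const) {
    try {
      return { data: JSON.parse(text), fixes };
    } catch {
      text = fix(text); // parse failed; apply the next repair and retry
      fixes.push(name);
    }
  }
  try {
    return { data: JSON.parse(text), fixes };
  } catch {
    return { data: null, fixes }; // unrecoverable input
  }
}
```

For example, a fenced response followed by chat text ("```json\n{...}\n```\nDone!") parses only after both repairs, and the `fixes` array doubles as the diagnostic report.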

Sequence Diagram

sequenceDiagram
    actor User
    participant callGeneratePrompt
    participant LLM
    participant tryParseJSON
    participant callSelectBestPrompt
    participant callSelectBestPrompt2 as LLM<br/>(Selection)

    User->>callGeneratePrompt: Request prompt (CoT enabled)
    callGeneratePrompt->>LLM: Generate 5 JSON-formatted prompts
    LLM-->>callGeneratePrompt: Raw JSON response
    callGeneratePrompt->>tryParseJSON: Parse JSON with fixes
    tryParseJSON-->>callGeneratePrompt: Parsed prompts + fixes applied
    callGeneratePrompt->>callSelectBestPrompt: Select best prompt
    callSelectBestPrompt->>callSelectBestPrompt2: LLM selection request
    callSelectBestPrompt2-->>callSelectBestPrompt: Selected prompt
    callSelectBestPrompt-->>callGeneratePrompt: Best prompt
    callGeneratePrompt-->>User: Final prompt

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A JSON chain of thought so bright,
Five prompts parsed with all our might,
Commas fixed and quotes aligned,
The very best prompt we will find!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title 'feat: add experimental verbal sampling CoT prompt generation' clearly and accurately describes the main change: introducing an experimental verbal sampling chain-of-thought approach for prompt generation. It is specific, concise, and directly reflects the primary functionality added in this changeset.


macroscopeapp bot commented Feb 23, 2026

Add experimental verbal sampling CoT prompt generation and branch game.callGeneratePrompt to request 5 JSON-formatted Quiplash prompts with probabilities when EXPERIMENTAL_VERBAL_SAMPLE_COT='1'

Introduce a feature flag that switches prompt generation to a JSON-based, 5-candidate flow with a selection step, and add tolerant JSON parsing for model responses.

📍Where to Start

Start with game.callGeneratePrompt in game.ts, then review buildVerbalSampleCotSystem and game.callSelectBestPrompt, followed by JSON parsing in llm-json-fixer.ts.
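For orientation, the candidate payload implied by the summaries above might look like the sketch below. The field names (`jokes`, `joke`, `probability`) are inferred from the review discussion, not copied from game.ts, and the example content is invented.

```typescript
// Hypothetical response shape for the verbal-sampling call; field names are
// inferred from the review comments, and the example values are illustrative.
interface VerbalSampleResponse {
  jokes: Array<{
    joke: string;        // candidate Quiplash prompt
    probability: number; // model-verbalized likelihood, usable as a fallback signal
  }>;
}

const example: VerbalSampleResponse = {
  jokes: [
    { joke: "The worst possible name for a pet goldfish", probability: 0.4 },
    { joke: "A rejected flavor of toothpaste", probability: 0.3 },
  ],
};
```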


Macroscope summarized 6b4ef7f.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
game.ts (1)

321-345: callSelectBestPrompt failure causes full re-generation of candidates.

If this API call throws (network error, rate limit, etc.), the error propagates up through callGeneratePrompt, which is wrapped in withRetry. This causes the entire 5-candidate generation to be re-executed. Consider catching the error here and falling back to the probability-based selection directly, since valid candidates already exist.

♻️ Suggested approach in callGeneratePrompt
-  const selected = await callSelectBestPrompt(
-    model,
-    candidates.map((c) => c.joke),
-  );
+  let selected: string | null = null;
+  try {
+    selected = await callSelectBestPrompt(
+      model,
+      candidates.map((c) => c.joke),
+    );
+  } catch (err) {
+    log("WARN", `prompt:${model.name}`, "callSelectBestPrompt failed, using fallback", {
+      error: err instanceof Error ? err.message : String(err),
+    });
+  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@game.ts` around lines 321 - 345, callSelectBestPrompt currently lets errors
from generateText bubble up (causing callGeneratePrompt wrapped in withRetry to
regenerate all candidates); modify callSelectBestPrompt to catch exceptions
thrown by generateText (or any part of the API call), log the error via
log("ERROR", ...) including model.id/name and the exception, and then return a
deterministic fallback choice (e.g., pick the highest-probability candidate by
index from the provided jokes array or use an existing probability-based
selector) so callers still receive a valid string; ensure the returned value is
passed through cleanResponse before returning and preserve types/signature of
callSelectBestPrompt.
llm-json-fixer.ts (1)

252-269: Redundant parse attempt when a fix is a no-op.

When a fix function returns the same string (lines 265–267), processed isn't updated, but the loop still runs JSON.parse on the unchanged input in the next iteration, producing a duplicate warning. Consider skipping the iteration when the fix is a no-op.

♻️ Suggested improvement
       warnings.push((error as Error).message);
       if (i === attempts.length) {
         break;
       }
       const next = attempts[i]!();
       if (next !== processed) {
         processed = next;
+      } else {
+        continue;
       }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@llm-json-fixer.ts` around lines 252 - 269, The loop currently pushes the same
JSON.parse error repeatedly when a fix is a no-op; in the catch block, only push
the error message to warnings if it differs from the last warning (e.g., check
warnings[warnings.length-1] !== (error as Error).message before pushing), and
after computing next = attempts[i]!(), if next === processed simply continue to
the next attempt without updating processed (so you avoid redundant parse
attempts and duplicate warnings). Update references in this block around
attempts, processed, warnings and the catch handling to implement these checks.

Comment on lines +267 to +269
  if (!Array.isArray(parsed.jokes) || parsed.jokes.length !== 5) {
    throw new Error("Invalid verbal sample CoT output: jokes must contain 5 items");
  }


⚠️ Potential issue | 🟠 Major

Strict === 5 check is brittle for LLM output.

LLMs don't always follow instructions exactly — they may return 4 or 6 items. The downstream map/filter at lines 271–295 already discards invalid entries and checks for at least one valid candidate. Consider relaxing this to a minimum-length check (e.g., < 1) rather than requiring exactly 5.

♻️ Suggested change
-  if (!Array.isArray(parsed.jokes) || parsed.jokes.length !== 5) {
-    throw new Error("Invalid verbal sample CoT output: jokes must contain 5 items");
+  if (!Array.isArray(parsed.jokes) || parsed.jokes.length === 0) {
+    throw new Error("Invalid verbal sample CoT output: jokes array is empty or missing");
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@game.ts` around lines 267 - 269, The strict check throws when
parsed.jokes.length !== 5; relax this to allow variable-length LLM output by
validating parsed.jokes is an array and has at least one item (e.g.,
parsed.jokes.length < 1) instead of requiring exactly 5; reference the existing
downstream validation in the mapping/filtering logic around the jokes processing
(the map/filter block following parsed.jokes) to ensure invalid entries are
still discarded and at least one valid candidate is present.

Comment on lines +297 to +305
  const selected = await callSelectBestPrompt(
    model,
    candidates.map((c) => c.joke),
  );

  const matched = candidates.find((c) => c.joke === selected);
  if (matched) {
    return matched.joke;
  }


⚠️ Potential issue | 🟠 Major

Exact-match comparison with LLM output is fragile.

callSelectBestPrompt asks the model to echo back the chosen prompt, then line 302 does an exact string match against the candidates. LLMs frequently introduce minor deviations — a leading number, trailing period, extra whitespace, or slight rewording — causing the match to silently fail and fall through to the probability-based fallback every time.

Consider a fuzzy match (e.g., normalized/trimmed comparison, or includes/Levenshtein) or have the model return just the index number instead:

♻️ Option A: Have the model return the index
-    prompt: `Choose exactly one of these Quiplash prompts and reply with ONLY the exact prompt text, nothing else:\n\n${jokes
+    prompt: `Choose exactly one of these Quiplash prompts and reply with ONLY the number (1-${jokes.length}), nothing else:\n\n${jokes
       .map((joke, i) => `${i + 1}. ${joke}`)
       .join("\n")}`,

Then parse the returned number and index into the candidates array.
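That index parsing can be done defensively with a small helper. This is a hypothetical helper, not part of the PR: it extracts the first integer from the model's reply and bounds-checks it.

```typescript
// Hypothetical helper for Option A: extract the first integer from the model's
// reply and bounds-check it against the candidate count. Returns null when the
// reply is unparsable so callers can use the probability-based fallback.
function parseSelectedIndex(reply: string, count: number): number | null {
  const match = reply.match(/\d+/);
  if (!match) return null;
  const index = Number(match[0]) - 1; // the prompt numbers options from 1
  return index >= 0 && index < count ? index : null;
}
```

A reply like "3." or "I pick option 3" then resolves to index 2, while "0" or free-form text yields null and falls through to the fallback path.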

♻️ Option B: Normalize before matching
-  const matched = candidates.find((c) => c.joke === selected);
+  const normalize = (s: string) => s.replace(/^\d+[\.\)]\s*/, "").trim().toLowerCase();
+  const normalizedSelected = normalize(selected);
+  const matched = candidates.find((c) => normalize(c.joke) === normalizedSelected);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@game.ts` around lines 297 - 305, callSelectBestPrompt returns model text that
may not exactly match candidate.joke, so the exact equality check (selected ===
c.joke) in the matched lookup is fragile; update selection handling in game.ts
to robustly map the model output to a candidate by either (A) changing the
prompt in callSelectBestPrompt to instruct the model to return a single index
and then parse that index to pick from candidates, or (B) perform a normalized
fuzzy match after receiving selected — e.g., trim/normalize whitespace and
punctuation and compare lowercased strings or use a small Levenshtein/similarity
threshold to find the closest candidate.joke before falling back to the
probability-based path; apply this logic where matched is computed so
matched.joke reliably resolves to the intended candidate.

Comment on lines +193 to +197
  for (const char of line) {
    if (char === "{" || char === "[") depth++;
    else if (char === "}" || char === "]") depth--;
  }
  continue;


⚠️ Potential issue | 🟡 Minor

Depth tracking counts brackets inside string values.

The depth counter iterates raw characters, so brackets within JSON string values (e.g., "What {thing}...") will incorrectly adjust depth, potentially causing spurious comma insertion or suppression. For the current use case (simple joke objects) this is unlikely to trigger, but worth noting as a known limitation.

Also applies to: 211-214
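A string-aware version of the depth loop might look like the sketch below. This is a standalone illustration of the fix described above, not the actual llm-json-fixer.ts code: it toggles an `inString` flag on unescaped quotes and only counts brackets outside strings.

```typescript
// Sketch of string-aware depth tracking: brackets inside quoted JSON strings
// no longer affect the depth counter. Backslash escapes are handled so an
// escaped quote (\") does not toggle the in-string state.
function bracketDepthDelta(line: string): number {
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (const char of line) {
    if (escaped) {
      escaped = false;      // previous char was a backslash; skip this one
    } else if (char === "\\") {
      escaped = true;
    } else if (char === '"') {
      inString = !inString; // toggle only on unescaped quotes
    } else if (!inString) {
      if (char === "{" || char === "[") depth++;
      else if (char === "}" || char === "]") depth--;
    }
  }
  return depth;
}
```

With this, a line like `{"q": "What {thing} would..."` contributes a delta of +1 rather than +2, so the brace inside the string value no longer skews comma insertion.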

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@llm-json-fixer.ts` around lines 193 - 197, The depth counter currently
increments/decrements for every brace/brace character in the raw line (variables
depth, line, char), which miscounts when those characters appear inside JSON
string values; update the loop to track whether we're inside a quoted string
(e.g., an inString boolean toggled when encountering an unescaped double-quote,
handling backslash escapes) and only modify depth for "{" "[" "}" "]" when
inString is false; apply the same change to the other matching loop referenced
around lines 211-214 so brackets inside strings do not affect depth.
