From 4ea665387f883634a1a748a453af5194c65b4353 Mon Sep 17 00:00:00 2001
From: ofir-frd
Date: Mon, 15 Dec 2025 13:36:00 +0200
Subject: [PATCH 1/2] docs: add GPT-5.2 benchmark results and remove O3/O4
 Mini entries from PR benchmark documentation

---
 docs/docs/pr_benchmark/index.md | 71 +++++++++++----------------------
 1 file changed, 24 insertions(+), 47 deletions(-)

diff --git a/docs/docs/pr_benchmark/index.md b/docs/docs/pr_benchmark/index.md
index 136f17a29..171ce9e65 100644
--- a/docs/docs/pr_benchmark/index.md
+++ b/docs/docs/pr_benchmark/index.md
@@ -40,6 +40,12 @@ A list of the models used for generating the baseline suggestions, and example r
         <td>medium</td>
         <td>80.8</td>
     </tr>
+    <tr>
+        <td>GPT-5.2</td>
+        <td>2025-12-11</td>
+        <td>low</td>
+        <td>79.1</td>
+    </tr>
     <tr>
         <td>GPT-5-pro</td>
         <td>2025-10-06</td>
@@ -166,24 +172,12 @@ A list of the models used for generating the baseline suggestions, and example r
         <td>32.8</td>
     </tr>
-    <tr>
-        <td>Claude-3.7-sonnet</td>
-        <td>2025-02-19</td>
-        <td></td>
-        <td>32.4</td>
-    </tr>
     <tr>
         <td>Claude-opus-4.5</td>
         <td>2025-11-01</td>
         <td>high</td>
         <td>30.3</td>
     </tr>
-    <tr>
-        <td>GPT-4.1</td>
-        <td>2025-04-14</td>
-        <td></td>
-        <td>26.5</td>
-    </tr>
@@ -205,6 +199,24 @@ Weaknesses:
 - **Occasional inaccurate or harmful fixes:** It sometimes introduces non-compiling code, speculative refactors, or misguided changes to unchanged lines, lowering reliability.
 - **Inconsistent guideline adherence:** A non-trivial set of replies adds off-scope edits, non-critical style nits, or empty suggestion lists when clear bugs exist, leading to avoidable downgrades.
 
+### GPT-5.2 ('low' thinking budget)
+
+Final score: **79.1**
+
+Strengths:
+
+- **Often spots multiple critical regressions:** In many cases the model is the only answer, or one of very few, that simultaneously catches several high-impact bugs (e.g. Examples 25, 55, 134, 206, 371).
+- **Produces concise, actionable patches:** Suggestions are usually well-scoped, supply minimal working code/YAML snippets, and respect the three-item limit, so reviewers can apply them quickly.
+- **Good rule compliance most of the time:** It generally limits itself to '+' lines, avoids stylistic nit-picking, and honours the output format and suggestion cap, which keeps responses focused.
+- **Broad language & domain coverage:** The model successfully reviews changes in many stacks (C/C++, Rust, Go, TS/JS, Python, Kotlin, SQL, CSS/MD/PO, CI scripts), showing solid cross-domain competence.
+
+Weaknesses:
+
+- **Misses higher-severity issues fairly often:** In a substantial fraction of examples it overlooks a more critical bug that other answers find (e.g. 6, 21, 30, 94, 310), lowering its relative rank.
+- **Occasional invented or non-critical advice:** Sometimes raises speculative, cosmetic, or out-of-scope points (17, 106, 171, 230, 390), violating the "critical bugs only" rule and hurting its ranking.
+- **Technical inaccuracies & unsafe fixes:** A number of replies introduce uncompilable code, wrong APIs, or contradictory edits (24, 341, 346, 375), indicating imperfect code-level precision.
+- **Inconsistency in restraint:** While usually concise, the model sporadically adds redundant or excessive suggestions, touches unchanged lines, or conflicts with its own fixes (238, 270, 330), showing uneven guideline adherence.
+
 ### GPT-5-pro
 
 Final score: **73.4**
 
 Strengths:
 
 Weaknesses:
 
 - **Formatting / guideline slips:** Sporadic duplication of suggestions, missing or empty `improved_code` blocks, or YAML mishaps undermine otherwise good answers.
 - **Uneven criticality judgement:** Some suggestions drift into low-impact territory while overlooking more severe problems, indicating inconsistent prioritisation.
 
-### O3
-
-Final score: **62.5**
-
-Strengths:
-
-- **High precision & compliance:** Generally respects task rules (limits, “added lines” scope, YAML schema) and avoids false-positive advice, often returning an empty list when appropriate.
-- **Clear, actionable output:** Suggestions are concise, well-explained, and include correct before/after patches, so reviewers can apply them directly.
-- **Good critical-bug detection rate:** Frequently spots compile-breakers or obvious runtime faults (nil / NPE, overflow, race, wrong selector, etc.), putting it at least on par with many peers.
-- **Consistent formatting:** Produces syntactically valid YAML with correct labels, making automated consumption easy.
-
-Weaknesses:
-
-- **Narrow coverage:** Tends to stop after 1-2 issues; regularly misses additional critical defects that better answers catch, so it is seldom the top-ranked review.
-- **Occasional inaccuracies:** A few replies introduce new bugs, give partial/duplicate fixes, or (rarely) violate rules (e.g., import suggestions), hurting trust.
-- **Conservative bias:** Prefers silence over risk; while this keeps precision high, it lowers recall and overall usefulness on larger diffs.
-- **Little added insight:** Rarely offers broader context, optimisations, or holistic improvements, causing it to rank only mid-tier in many comparisons.
-
-### O4 Mini ('medium' thinking tokens)
-
-Final score: **57.7**
-
-Strengths:
-
-- **Good rule adherence:** Most answers respect the “new-lines only”, 3-suggestion, and YAML-schema limits, and frequently choose the safe empty list when the diff truly adds no critical bug.
-- **Clear, minimal patches:** When the model does spot a defect it usually supplies terse, valid before/after snippets and short, targeted explanations, making fixes easy to read and apply.
-- **Language & domain breadth:** Demonstrates competence across many ecosystems (C/C++, Java, TS/JS, Go, Rust, Python, Bash, Markdown, YAML, SQL, CSS, translation files, etc.) and can detect both compile-time and runtime mistakes.
-- **Often competitive:** In a sizeable minority of cases the model ties for best or near-best answer, occasionally being the only response to catch a subtle crash or build blocker.
-
-Weaknesses:
-
-- **High miss rate:** In a large share of examples the model returns an empty list or only minor advice while other reviewers catch clear, high-impact bugs, indicating weak defect-detection recall.
-- **False or harmful fixes:** Several answers introduce new compilation errors, propose out-of-scope changes, or violate explicit rules (e.g., adding imports, version bumps, touching untouched lines), reducing trustworthiness.
-- **Shallow coverage:** Even when it identifies one real issue it often stops there, missing additional critical problems found by stronger peers; breadth and depth are inconsistent.
-
 ### Gemini-3-pro-review (high thinking budget)
 
 Final score: **57.3**

From e062f67dc70712470ac1c1a0b86615ae75c4e4af Mon Sep 17 00:00:00 2001
From: ofir-frd
Date: Mon, 15 Dec 2025 13:41:45 +0200
Subject: [PATCH 2/2] docs: update Qodo Merge default model from GPT-5 to
 GPT-5.2 in model documentation

---
 docs/docs/usage-guide/qodo_merge_models.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/docs/usage-guide/qodo_merge_models.md b/docs/docs/usage-guide/qodo_merge_models.md
index b150846c6..d6052d34a 100644
--- a/docs/docs/usage-guide/qodo_merge_models.md
+++ b/docs/docs/usage-guide/qodo_merge_models.md
@@ -1,5 +1,5 @@
-The default models used by Qodo Merge 💎 (December 2025) are a combination of GPT-5, Haiku-4.5, and Gemini 2.5 Pro.
+The default models used by Qodo Merge 💎 (December 2025) are a combination of GPT-5.2, Haiku-4.5, and Gemini 2.5 Pro.
 
 ### Selecting a Specific Model