kaito benchmark blog #5497

bangqipropel · 2025-12-08T13:12:07Z

No description provided.

Copilot

Pull request overview

This pull request introduces a comprehensive blog post about benchmarking KAITO's Retrieval-Augmented Generation (RAG) service on Azure Kubernetes Service (AKS). The post presents detailed performance comparisons between RAG and baseline LLM approaches across two distinct use cases: document question answering and code modification tasks.

Key Changes

Added new "benchmarking" tag to the blog taxonomy for performance testing content
Added new author "Bangqi Zhu" to the authors registry
Published detailed technical blog post with benchmark methodologies, results, and insights demonstrating RAG's 68% improvement on document Q&A and 20% improvement on code modification tasks

Reviewed changes

Copilot reviewed 3 out of 7 changed files in this pull request and generated 4 comments.

File	Description
website/blog/tags.yml	Added new "benchmarking" tag with label, permalink, and description for performance testing content
website/blog/authors.yml	Added author profile for Bangqi Zhu including LinkedIn, GitHub links, and profile image
website/blog/2025-12-08-kaito-rag-benchmarking/index.md	Comprehensive blog post covering RAG benchmarking methodology, results showing significant performance improvements, detailed implementation guides, and best practices for running benchmarks on AKS

Copilot · 2025-12-08T13:16:53Z

website/blog/2025-12-08-kaito-rag-benchmarking/index.md

+<!-- truncate -->
+
+## Why Benchmark RAG?


Missing hero image after the truncate marker. According to the blog post guidelines, a hero image should be placed immediately after `<!-- truncate -->` using the pattern `![Hero Image](./hero-image.png)`. The required structure is:

    <!-- truncate -->

    ![Hero Image Description](./hero-image.png)

    ## First Section

Consider adding a hero image that visually represents the blog post's main topic (benchmarking KAITO RAG).

Copilot · 2025-12-08T13:16:54Z

website/blog/2025-12-08-kaito-rag-benchmarking/index.md

+<!-- truncate -->
+
+## Why Benchmark RAG?


Missing hero image after the truncate marker. According to the blog post guidelines ("Hero image: Use `./hero-image.png` for same-directory assets"), the hero image should be placed immediately after `<!-- truncate -->`. The pattern should be:

    <!-- truncate -->

    ![Hero Image](./hero-image.png)

    ## Section 1: Problem/Context

Currently, the post jumps directly to section content without a hero image.

website/blog/2025-12-08-kaito-rag-benchmarking/index.md

Signed-off-by: Bangqi Zhu <bangqizhu@microsoft.com>

sdesai345 · 2025-12-09T16:23:56Z

website/blog/2025-12-08-kaito-rag-benchmarking/index.md

+
+- **How much does RAG improve answer quality?** Traditional LLMs rely solely on pre-trained knowledge, which can be outdated or incomplete for domain-specific queries.
+- **Is RAG cost-effective?** Token usage directly impacts operational costs at scale.
+- **Where does RAG struggle?** Understanding failure modes guides system improvements.


Suggested change

- **Where does RAG struggle?** Understanding failure modes guides system improvements.

- **Where does RAG experience bottlenecks?** Understanding different failure modes guides system improvements.

sdesai345 · 2025-12-09T16:24:44Z

website/blog/2025-12-08-kaito-rag-benchmarking/index.md

+
+Let's dive into each benchmark and its results.
+
+---


Suggested change

---

sdesai345 · 2025-12-09T16:27:05Z

website/blog/2025-12-08-kaito-rag-benchmarking/index.md

+
+Our comprehensive benchmarking reveals nuanced insights about RAG performance:
+
+✅ **RAG Excels At:**


Suggested change

✅ **RAG Excels At:**

**RAG Excels At:**

sdesai345 · 2025-12-09T16:27:17Z

website/blog/2025-12-08-kaito-rag-benchmarking/index.md

+- Automatic context retrieval with high precision
+- Reducing hallucination on domain-specific queries
+
+💡 **Optimization Insights:**


Suggested change

💡 **Optimization Insights:**

**Optimization Insights:**

sdesai345 · 2025-12-09T16:28:24Z

website/blog/2025-12-08-kaito-rag-benchmarking/index.md

+  --output comparison_report.json
+```
+
+---


Suggested change

---

sdesai345 · 2025-12-09T16:28:55Z

website/blog/2025-12-08-kaito-rag-benchmarking/index.md

+4. **Better coverage**: RAG considers all indexed files, not just obvious candidates
+
+:::tip Benchmark Validation
+RAG's 60% success validates the TOP-4 filtering approach! This proves that:


Suggested change

RAG's 60% success validates the TOP-4 filtering approach! This proves that:

RAG's 60% success validates the TOP-4 filtering approach, demonstrating that:

sdesai345 · 2025-12-09T16:29:54Z

website/blog/2025-12-08-kaito-rag-benchmarking/index.md

+Higher token usage with RAG is expected since we include retrieved context. The trade-off between accuracy gain and cost must be evaluated for your specific use case.
+:::
+
+---


Suggested change

---

sdesai345 · 2025-12-09T16:31:33Z

website/blog/2025-12-08-kaito-rag-benchmarking/index.md

+  --rag-url http://localhost:5000 \
+  --llm-url http://your-llm-api.com \
+  --judge-url http://your-llm-api.com \
+  --llm-model "deepseek-v3.1" \
+  --judge-model "deepseek-v3.1" \


Can you expand on the context of these inputs and how they're used for benchmarking against the chosen LLM?
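
For context on this question, one plausible reading of the quoted flags is sketched below in Python. It assumes `--rag-url` points at the RAG service, `--llm-url` and `--llm-model` select the baseline model that answers without retrieval, and `--judge-url` and `--judge-model` select the model that scores both answers on the post's 0-10 scale. The function names and the callable-based wiring are hypothetical and are not taken from the benchmark's code.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence

# Hypothetical stand-ins: each callable would wrap an HTTP call to one endpoint.
AnswerFn = Callable[[str], str]             # question -> answer
JudgeFn = Callable[[str, str, str], float]  # question, reference, answer -> score (0-10)


@dataclass
class Comparison:
    rag_avg: float
    llm_avg: float


def run_benchmark(
    questions: Sequence[str],
    references: Sequence[str],
    ask_rag: AnswerFn,   # assumed to target --rag-url
    ask_llm: AnswerFn,   # assumed to target --llm-url with --llm-model
    judge: JudgeFn,      # assumed to target --judge-url with --judge-model
) -> Comparison:
    """Score RAG and baseline answers to the same questions with one judge."""
    rag_scores, llm_scores = [], []
    for question, reference in zip(questions, references):
        rag_scores.append(judge(question, reference, ask_rag(question)))
        llm_scores.append(judge(question, reference, ask_llm(question)))
    return Comparison(rag_avg=mean(rag_scores), llm_avg=mean(llm_scores))
```

Under this reading, using the same model for `--llm-model` and `--judge-model` (as in the quoted command) means the baseline is graded by itself, which may be worth calling out in the post.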

sdesai345

Included feedback

Copilot

Pull request overview

Copilot reviewed 3 out of 7 changed files in this pull request and generated 5 comments.

Copilot · 2025-12-10T23:18:54Z

website/blog/2025-12-08-kaito-rag-benchmarking/index.md

+
+**Architecture Flow:**
+
+![Document Q&A Benchmark Flow](./document-benchmark-flow.png)


The image reference is missing descriptive alt text. According to the blog post guidelines, all images MUST have descriptive alt text for accessibility and SEO. The alt text should describe what the image shows, not just repeat the caption.

Suggested change

![Document Q&A Benchmark Flow](./document-benchmark-flow.png)

![Flow diagram showing the Document Q&A Benchmark process: documents are indexed, test questions are generated, both RAG and baseline LLM answer the questions, an LLM judge scores the answers, and results are analyzed.](./document-benchmark-flow.png)

Copilot · 2025-12-10T23:18:54Z

website/blog/2025-12-08-kaito-rag-benchmarking/index.md

+| **Overall Average** | 8.15/10 | 4.85/10 | **+68%** |
+| **Token Usage** | Variable | Baseline | Context-dependent |
+
+![Performance Comparison: Document Q&A vs Code Modification](./performance-comparison.png)


The image reference is missing descriptive alt text. According to the blog post guidelines, all images MUST have descriptive alt text for accessibility and SEO. The alt text should describe what the image shows, not just repeat the caption.

Suggested change

![Performance Comparison: Document Q&A vs Code Modification](./performance-comparison.png)

![Bar chart comparing RAG and pure LLM performance scores for closed and open document Q&A questions, showing RAG significantly outperforming pure LLM, especially on closed questions.](./performance-comparison.png)
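
As a quick check on the `+68%` figure in the quoted table row, the number is consistent with the relative gain of the RAG average over the baseline average. A minimal Python sketch, assuming that definition (the quoted excerpt does not state the formula):

```python
def relative_improvement(rag_score: float, baseline_score: float) -> float:
    """Percentage gain of rag_score over baseline_score."""
    return (rag_score - baseline_score) / baseline_score * 100.0


# Overall averages from the quoted table row: 8.15/10 (RAG) vs 4.85/10 (baseline).
gain = relative_improvement(8.15, 4.85)
print(f"{gain:.1f}%")  # 68.0%
```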

Copilot · 2025-12-10T23:18:54Z

website/blog/2025-12-08-kaito-rag-benchmarking/index.md

+
+### Architecture Flow
+
+![Code Modification Benchmark Flow](./code-benchmark-flow.png)


The image reference is missing descriptive alt text. According to the blog post guidelines, all images MUST have descriptive alt text for accessibility and SEO. The alt text should describe what the image shows, not just repeat the caption.

Suggested change

![Code Modification Benchmark Flow](./code-benchmark-flow.png)

![Diagram showing the KAITO RAG code modification benchmark flow, including user query, document retrieval, top-4 file selection, LLM response generation, and evaluation steps](./code-benchmark-flow.png)

Copilot · 2025-12-10T23:18:55Z

website/blog/2025-12-08-kaito-rag-benchmarking/index.md

+✓ Tests passed (3/5 issues succeed, 2/5 fail)
+```
+
+![TOP-4 Relevance Filtering Process](./top4-filtering-diagram.png)


The image reference is missing descriptive alt text. According to the blog post guidelines, all images MUST have descriptive alt text for accessibility and SEO. The alt text should describe what the image shows, not just repeat the caption.

Suggested change

![TOP-4 Relevance Filtering Process](./top4-filtering-diagram.png)

![Diagram showing how RAG filters retrieved files by relevance score and selects the top 4 files for code modification tasks](./top4-filtering-diagram.png)
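
The filtering step described in this alt text can be illustrated with a short Python sketch. It assumes each retrieved file carries a numeric relevance score and that the four highest-scoring files are kept; the file paths and scores are made up for illustration and do not come from the benchmark.

```python
import heapq
from typing import List, Tuple


def select_top_k(scored_files: List[Tuple[str, float]], k: int = 4) -> List[str]:
    """Return the k file paths with the highest relevance scores, best first."""
    best = heapq.nlargest(k, scored_files, key=lambda item: item[1])
    return [path for path, _ in best]


# Hypothetical retrieval results: (file path, relevance score).
retrieved = [
    ("pkg/a.go", 0.91),
    ("pkg/b.go", 0.42),
    ("pkg/c.go", 0.77),
    ("pkg/d.go", 0.65),
    ("pkg/e.go", 0.30),
    ("pkg/f.go", 0.58),
]
print(select_top_k(retrieved))  # ['pkg/a.go', 'pkg/c.go', 'pkg/d.go', 'pkg/f.go']
```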

Copilot · 2025-12-10T23:18:55Z

website/blog/2025-12-08-kaito-rag-benchmarking/index.md

+  --create-namespace
+```
+
+1. **Index Your Content**:


The numbered list formatting is inconsistent. Item 2 "RAG Engine Deployed" should be numbered as "2." to continue the sequence. However, there's a list item "1. Index Your Content" starting at line 297, which appears to be a third item but is numbered as 1. Please renumber this as "3." to maintain proper list sequence.

Suggested change

1. **Index Your Content**:

3. **Index Your Content**:

sabbour · 2025-12-10T23:26:58Z

website/blog/2025-12-08-kaito-rag-benchmarking/index.md

+
+### Prerequisites
+
+1. **AKS Cluster with KAITO**: Follow [KAITO installation guide](https://kaito-project.github.io/kaito/docs/installation)


Why is this referencing OSS installation instructions and not the KAITO add-on?

sabbour · 2025-12-10T23:28:15Z

website/blog/2025-12-08-kaito-rag-benchmarking/index.md

+
+---
+
+## Key Takeaways


Please remove the emojis. This screams "AI generated" at me :)

sabbour · 2025-12-10T23:29:09Z

website/blog/2025-12-08-kaito-rag-benchmarking/index.md

+- **Document Benchmark**: [`rag_benchmark_docs/`](https://github.com/kaito-project/kaito/tree/main/rag_benchmark_docs)
+  - Quick start: [`RAG_BENCHMARK_DOCS_README.md`](https://github.com/kaito-project/kaito/blob/main/rag_benchmark_docs/RAG_BENCHMARK_DOCS_README.md)
+  - Complete guide: [`RAG_BENCHMARK_DOCS_GUIDE.md`](https://github.com/kaito-project/kaito/blob/main/rag_benchmark_docs/RAG_BENCHMARK_DOCS_GUIDE.md)
+
+- **Code Benchmark**: [`code_benchmark/`](https://github.com/kaito-project/kaito/pull/1678) (PR pending merge)
+  - Quick start: [`GETTING_STARTED.md`](https://github.com/kaito-project/kaito/pull/1678/files#diff-d5b183b0a8f37a07a826b64ccfa966be89d3c80c948265bd66be8c53f7dd4f00)
+  - Complete guide: [`CODE_BENCHMARK_GUIDE.md`](https://github.com/kaito-project/kaito/pull/1678/files#diff-9a5ff0d2cd3c7b140aab1d0c9a6f4bfb0f3c91bf0e55fd31b57669289958056c)


These are not really descriptive link text.

bangqipropel requested review from a team, chzbrgr71, Copilot, robbiezhang and thomas1206 December 8, 2025 13:12

bangqipropel requested a review from palma21 as a code owner December 8, 2025 13:12

Copilot started reviewing on behalf of bangqipropel December 8, 2025 13:12

kaito benchmark blog

80e993c

Signed-off-by: Bangqi Zhu <bangqizhu@microsoft.com>

bangqipropel force-pushed the kaito_benchmark_blog branch from a19f855 to 80e993c December 8, 2025 18:59

sabbour requested a review from Copilot December 10, 2025 23:16

Copilot started reviewing on behalf of sabbour December 10, 2025 23:16
