Reproducibility Gap on Ryzen 7 8845HS

I am unable to reproduce the reported AMD results. Below are my evaluation results for GPT-4.1 and DeepSeek V3 (deepseek-chat), each with 10 attempts, under different RAG settings.
CPU: AMD Ryzen 7 8845HS
Attempts: N = 10
Models Tested: GPT-4.1, DeepSeek V3 (deepseek-chat)

GPT-4.1 Results (Attempt = 10)
| Configuration | Passed | Failed | Pass Rate (%) | Avg Vector Score | Avg Total Cycles |
|---|---|---|---|---|---|
| N = 10 (No RAG) | 75 | 27 | 73.53 | 1.517 | 38076.36 |
| N = 10, RAG = 1 | 64 | 38 | 62.75 | 2.117 | 35839.72 |
| N = 10, RAG = 2 | 69 | 33 | 67.65 | 3.228 | 27555.90 |
| N = 10, RAG = 3 | 69 | 33 | 67.65 | 2.661 | 28977.23 |
| N = 10, RAG = 5 | 69 | 33 | 67.65 | 2.786 | 34524.62 |

To stay within the token limit, I modified the RAG setup by removing some kernel comments from the prompt.
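
For reference, the comment removal was along the lines of the sketch below. It assumes the kernels are C-style sources with `//` and `/* ... */` comments; the function name `strip_c_comments` and the sample kernel are only illustrative and are not taken from the repository.

```python
import re

# One pattern that matches either a literal (kept) or a comment (dropped).
# Matching literals first prevents "//" inside a string from being treated
# as the start of a comment.
_TOKEN = re.compile(
    r'"(?:\\.|[^"\\])*"'      # double-quoted string literal
    r"|'(?:\\.|[^'\\])*'"     # character literal
    r"|//[^\n]*"              # line comment
    r"|/\*.*?\*/",            # block comment
    re.DOTALL,
)


def strip_c_comments(source: str) -> str:
    """Remove C-style comments, leaving string and char literals intact."""
    def _replace(match: re.Match) -> str:
        text = match.group(0)
        # Comments start with "/"; replace them with a space so that
        # tokens on either side of a block comment stay separated.
        return " " if text.startswith("/") else text

    return _TOKEN.sub(_replace, source)


# Hypothetical kernel snippet used only to show the effect.
kernel = (
    "int a = 1; // init\n"
    "/* block\n   comment */ int b = 2;\n"
    'const char *s = "// not a comment";\n'
)
stripped = strip_c_comments(kernel)
print(f"{len(kernel)} chars -> {len(stripped)} chars")
```

Character count is only a rough proxy for prompt tokens, but it gives a quick idea of how much the prompt shrinks.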

DeepSeek V3 Results (deepseek-chat)
| Configuration | Passed | Failed | Pass Rate (%) | Avg Vector Score | Avg Total Cycles |
|---|---|---|---|---|---|
| N = 10 (No RAG) | 57 | 45 | 55.88 | 4.237 | 28504.61 |
| N = 10, RAG = 3 | 53 | 49 | 51.96 | 2.978 | 18697.75 |
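
As a sanity check on the numbers above, the pass rates can be recomputed from the Passed and Failed counts. This is a minimal sketch with the counts copied by hand from the two tables. The helper name `pass_rate` is my own, and it assumes pass rate is simply passed / (passed + failed).

```python
# (passed, failed) counts copied from the GPT-4.1 and DeepSeek V3 tables.
results = {
    "GPT-4.1": {
        "No RAG": (75, 27),
        "RAG = 1": (64, 38),
        "RAG = 2": (69, 33),
        "RAG = 3": (69, 33),
        "RAG = 5": (69, 33),
    },
    "DeepSeek V3": {
        "No RAG": (57, 45),
        "RAG = 3": (53, 49),
    },
}


def pass_rate(passed: int, failed: int) -> float:
    """Return the pass rate as a percentage, rounded to two decimals."""
    return round(100.0 * passed / (passed + failed), 2)


for model, configs in results.items():
    for config, (passed, failed) in configs.items():
        total = passed + failed
        print(f"{model:12s} {config:8s} total={total} "
              f"pass_rate={pass_rate(passed, failed):.2f}%")
```

Every configuration sums to 102 kernels, so the pass rates are directly comparable across rows.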

Clarification:
1. Are there any other parameters that I can tune to improve reproducibility and performance?
2. Does shrinking the RAG prompt (by removing kernel comments) significantly affect performance?
