Description
I am unable to reproduce the reported AMD results. Below are my evaluation results for GPT-4.1 and DeepSeek V3 (deepseek-chat), each with 10 attempts, under different RAG settings.
CPU: AMD Ryzen 7 8845HS
Attempts: N = 10
Models Tested: GPT-4.1, DeepSeek V3 (deepseek-chat)
GPT-4.1 Results (Attempt = 10)
| Configuration | Passed | Failed | Pass Rate (%) | Avg Vector Score | Avg Total Cycles |
| --- | --- | --- | --- | --- | --- |
| N = 10 (No RAG) | 75 | 27 | 73.53 | 1.517 | 38076.36 |
| N = 10, RAG = 1 | 64 | 38 | 62.75 | 2.117 | 35839.72 |
| N = 10, RAG = 2 | 69 | 33 | 67.65 | 3.228 | 27555.90 |
| N = 10, RAG = 3 | 69 | 33 | 67.65 | 2.661 | 28977.23 |
| N = 10, RAG = 5 | 69 | 33 | 67.65 | 2.786 | 34524.62 |
To fit within token limits, I reduced the RAG prompt size by removing some kernel comments.
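For reference, the kind of comment removal I mean can be sketched roughly as below. This is only an illustration, not the exact script I used: it assumes the kernels use C-style `//` and `/* */` comments, and the function name `strip_comments` and the sample kernel are made up for the example. A regex like this will also mangle comment markers that appear inside string literals, so it is only suitable for simple kernels.

```python
import re

# Assumption: kernels use C-style comments. The pattern matches either a
# line comment (// to end of line) or a block comment (/* ... */).
_COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def strip_comments(source: str) -> str:
    """Remove // and /* */ comments, then drop lines that are left blank."""
    without_comments = _COMMENT_RE.sub("", source)
    lines = [line.rstrip() for line in without_comments.splitlines()]
    return "\n".join(line for line in lines if line.strip())


# Hypothetical kernel, used only to demonstrate the function.
kernel = """\
// add two vectors
void add(float *a, float *b, int n) {
    /* scalar loop,
       no vectorization */
    for (int i = 0; i < n; i++) a[i] += b[i];  // accumulate
}
"""

stripped = strip_comments(kernel)
print(stripped)
print(f"characters: {len(kernel)} -> {len(stripped)}")
```

Note that this also removes blank lines that were already in the source, which shortens the prompt further but changes the layout of the retrieved examples.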
DeepSeek V3 Results (deepseek-chat)
| Configuration | Passed | Failed | Pass Rate (%) | Avg Vector Score | Avg Total Cycles |
| --- | --- | --- | --- | --- | --- |
| N = 10 (No RAG) | 57 | 45 | 55.88 | 4.237 | 28504.61 |
| N = 10, RAG = 3 | 53 | 49 | 51.96 | 2.978 | 18697.75 |
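As a sanity check on the tables above, the pass rates can be recomputed from the passed and failed counts. The short sketch below assumes the pass rate is simply passed / (passed + failed) × 100, rounded to two decimals; the counts are copied from the tables, and the variable and function names are just for illustration. Every configuration sums to 102 kernels.

```python
# (label, passed, failed, reported pass rate in %), copied from the tables.
results = [
    ("GPT-4.1, No RAG", 75, 27, 73.53),
    ("GPT-4.1, RAG = 1", 64, 38, 62.75),
    ("GPT-4.1, RAG = 2", 69, 33, 67.65),
    ("GPT-4.1, RAG = 3", 69, 33, 67.65),
    ("GPT-4.1, RAG = 5", 69, 33, 67.65),
    ("DeepSeek V3, No RAG", 57, 45, 55.88),
    ("DeepSeek V3, RAG = 3", 53, 49, 51.96),
]


def pass_rate(passed: int, failed: int) -> float:
    """Pass rate in percent, rounded to two decimals (assumed definition)."""
    return round(100 * passed / (passed + failed), 2)


for label, passed, failed, reported in results:
    total = passed + failed
    computed = pass_rate(passed, failed)
    status = "ok" if computed == reported else "MISMATCH"
    print(f"{label}: total={total}, computed={computed}, reported={reported} [{status}]")
```

Under that assumed definition, the reported pass rates are consistent with the counts, so the gap to the published numbers is not a bookkeeping error on my side.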
Clarification:
- Are there any other parameters that I can tune to improve reproducibility and performance?
- Does reducing the RAG prompt significantly affect performance?