Reproducibility Gap on Ryzen 7 8845HS #7

@HemaSwetha25

I am unable to reproduce the reported AMD results. Below are my evaluation results for GPT-4.1 and DeepSeek V3 (deepseek-chat), each with 10 attempts, under different RAG settings.
CPU: AMD Ryzen 7 8845HS
Attempts: N = 10
Models Tested: GPT-4.1, DeepSeek V3 (deepseek-chat)

GPT-4.1 Results (Attempt = 10)
| Configuration | Passed | Failed | Pass Rate (%) | Avg Vector Score | Avg Total Cycles |
| --- | --- | --- | --- | --- | --- |
| N = 10 (No RAG) | 75 | 27 | 73.53 | 1.517 | 38076.36 |
| N = 10, RAG = 1 | 64 | 38 | 62.75 | 2.117 | 35839.72 |
| N = 10, RAG = 2 | 69 | 33 | 67.65 | 3.228 | 27555.90 |
| N = 10, RAG = 3 | 69 | 33 | 67.65 | 2.661 | 28977.23 |
| N = 10, RAG = 5 | 69 | 33 | 67.65 | 2.786 | 34524.62 |

To fit within token limits, I modified the RAG setup by removing some kernel comments to reduce prompt size.
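For anyone repeating this step, the comment removal can be sketched roughly as below. This is a minimal sketch, assuming the retrieved kernels are C-style sources using `//` and `/* */` comments; `strip_c_comments` is a hypothetical helper and not part of the repository.

```python
import re

# Match comments and literals in one pass, so that comment markers
# inside string or character literals are left untouched.
_TOKEN = re.compile(
    r"""
    //[^\n]*                # line comment
    | /\*.*?\*/             # block comment
    | "(?:\\.|[^"\\])*"     # string literal
    | '(?:\\.|[^'\\])*'     # character literal
    """,
    re.DOTALL | re.VERBOSE,
)


def strip_c_comments(source: str) -> str:
    """Remove C-style comments from source (hypothetical helper)."""

    def _keep_or_drop(match: re.Match) -> str:
        token = match.group(0)
        if token.startswith("//"):
            return ""   # drop line comment, keep the newline after it
        if token.startswith("/*"):
            return " "  # a space keeps neighbouring tokens separate
        return token    # string/char literal: keep unchanged

    return _TOKEN.sub(_keep_or_drop, source)


if __name__ == "__main__":
    kernel = 'int x = 1; // init\n/* block */int y = 2;\nchar *s = "// keep";\n'
    stripped = strip_c_comments(kernel)
    print(stripped)
    print(f"{len(kernel)} -> {len(stripped)} characters")
```

Since some comments may carry hints the model relies on, comparing prompt sizes before and after stripping helps judge how much context was actually removed.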

DeepSeek V3 Results (deepseek-chat)
| Configuration | Passed | Failed | Pass Rate (%) | Avg Vector Score | Avg Total Cycles |
| --- | --- | --- | --- | --- | --- |
| N = 10 (No RAG) | 57 | 45 | 55.88 | 4.237 | 28504.61 |
| N = 10, RAG = 3 | 53 | 49 | 51.96 | 2.978 | 18697.75 |
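The Pass Rate column in both tables is consistent with passed / (passed + failed) over 102 test cases per configuration. A minimal sketch that recomputes it from the counts above (the `pass_rate` function name is only illustrative):

```python
def pass_rate(passed: int, failed: int) -> float:
    """Return the pass rate as a percentage, rounded to two decimals."""
    total = passed + failed
    if total == 0:
        raise ValueError("no test cases")
    return round(100.0 * passed / total, 2)


# (passed, failed) counts copied from the two tables in this issue.
results = {
    ("GPT-4.1", "No RAG"): (75, 27),
    ("GPT-4.1", "RAG = 1"): (64, 38),
    ("GPT-4.1", "RAG = 2"): (69, 33),
    ("GPT-4.1", "RAG = 3"): (69, 33),
    ("GPT-4.1", "RAG = 5"): (69, 33),
    ("DeepSeek V3", "No RAG"): (57, 45),
    ("DeepSeek V3", "RAG = 3"): (53, 49),
}

if __name__ == "__main__":
    for (model, config), (passed, failed) in results.items():
        rate = pass_rate(passed, failed)
        print(f"{model:12s} {config:8s} {passed + failed:4d} cases  {rate:6.2f}%")
```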

Clarification:

  1. Are there any other parameters that I can tune to bring my results closer to the reported ones?
  2. Does shortening the RAG prompt (by removing kernel comments) significantly affect performance?
