add testing for Qwen3.5-0.8B to GSM8K eval #36291
puririshi98 wants to merge 15 commits into vllm-project:main from
Signed-off-by: Rishi Puri <riship@nvidia.com>
Code Review
This pull request introduces a new accuracy evaluation test for the deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B model. The test suite is comprehensive, leveraging lm-evaluation-harness to check for correctness across various precisions (bfloat16, float16, fp8) and on multiple benchmarks (gsm8k, arc_easy, hellaswag). It also includes sanity checks and a smoke test. My review found a couple of inconsistencies in the documentation within the new test file that should be addressed for clarity and correctness.
Covers:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
The docstring states that this test covers both the 1.5B and 7B versions of the DeepSeek-R1-Distill-Qwen model. However, the SMALL_DEEPSEEK_MODELS list (line 149) only includes the configuration for the 1.5B model. Please either add the configuration for the 7B model to the test suite or update the docstring to accurately reflect the tested models.
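A mismatch like this can be caught mechanically. The following is a minimal sketch, not code from the PR: the docstring text, the dictionary layout of `SMALL_DEEPSEEK_MODELS`, and the helper names `documented_models` and `coverage_mismatch` are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for the test module's docstring.
MODULE_DOC = """
Covers:
  - deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  - deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
"""

# Assumed shape: one dict per model config. The real list may differ.
SMALL_DEEPSEEK_MODELS = [
    {"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"},
]


def documented_models(doc: str) -> set[str]:
    """Collect bullet lines of the form '- org/model-name' from a docstring."""
    return set(re.findall(r"^\s*-\s+(\S+/\S+)\s*$", doc, flags=re.MULTILINE))


def coverage_mismatch(doc: str, configs: list[dict]) -> tuple[set[str], set[str]]:
    """Return (documented but not configured, configured but not documented)."""
    documented = documented_models(doc)
    configured = {cfg["model"] for cfg in configs}
    return documented - configured, configured - documented


missing_config, missing_doc = coverage_mismatch(MODULE_DOC, SMALL_DEEPSEEK_MODELS)
```

With the inputs above, `missing_config` holds the 7B model name and `missing_doc` is empty, which matches the inconsistency the review comment describes.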
- hellaswag (10-shot) – sentence completion

Usage:
pytest tests/lm_eval_correctness/test_deepseek_small_accuracy.py -v
The file path in the Usage example is incorrect. It points to tests/lm_eval_correctness/test_deepseek_small_accuracy.py, but the actual file path is tests/v1/attention/test-qwen-deepseek-1.5-accuracy.py. This should be updated to reflect the correct path to make the test runnable as documented.
Suggested change:
- pytest tests/lm_eval_correctness/test_deepseek_small_accuracy.py -v
+ pytest tests/v1/attention/test-qwen-deepseek-1.5-accuracy.py -v
Hi @puririshi98, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
…accuracy.py Signed-off-by: Rishi Puri <riship@nvidia.com>
- Test DeepSeek-R1-Distill-Qwen-1.5B across multiple precisions (bf16, fp16, fp8, auto)
- Evaluate on GSM8K (5-shot), ARC Easy (25-shot), and HellaSwag (10-shot)
- Use vLLM's native evaluate_gsm8k() harness instead of lm-eval for GSM8K
- Include FP8 regression test with 6% threshold
- Set default GPU memory utilization to 0.3 for test environments

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
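The FP8 regression check named in this commit message could look roughly like the sketch below. It is not the PR's code. The function name `fp8_within_threshold` is hypothetical, and the sketch assumes the 6% threshold is an absolute drop in accuracy relative to a higher-precision baseline; the commit message does not say whether the threshold is absolute or relative.

```python
def fp8_within_threshold(
    baseline_acc: float,
    fp8_acc: float,
    max_drop: float = 0.06,  # assumed: absolute accuracy points, not relative
) -> bool:
    """Return True if FP8 accuracy is at most `max_drop` below the baseline.

    Accuracies are fractions in [0, 1]. An FP8 score above the baseline
    always passes, since only a drop counts as a regression.
    """
    if not (0.0 <= baseline_acc <= 1.0 and 0.0 <= fp8_acc <= 1.0):
        raise ValueError("accuracies must be fractions in [0, 1]")
    return (baseline_acc - fp8_acc) <= max_drop


# Illustrative numbers only; these are not measured results.
small_drop_ok = fp8_within_threshold(baseline_acc=0.70, fp8_acc=0.66)
large_drop_ok = fp8_within_threshold(baseline_acc=0.70, fp8_acc=0.60)
```

An absolute threshold is simpler to reason about on small models, where a relative threshold becomes very tight when the baseline accuracy is low.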
/gemini review
Code Review
The pull request introduces a new configuration file for the Qwen3.5-0.8B model and updates the models-small.txt list to include it. The changes are straightforward and align with the stated objective of adding new model evaluations. No critical or high-severity issues were identified in the changes.
Note: Security Review is unavailable for this PR.
jasonlizhengjian left a comment
LGTM, please change the title to reflect the contents. Also consider adding a more descriptive PR description.
Part of #36264.
Adds testing for Qwen3.5-0.8B to the GSM8K eval.