add testing for Qwen3.5-0.8B to GSM8K eval #36291

Open
puririshi98 wants to merge 15 commits into vllm-project:main from puririshi98:patch-3

@puririshi98 puririshi98 commented Mar 6, 2026

Part of #36264.
Adds testing for Qwen3.5-0.8B to the GSM8K eval.

Signed-off-by: Rishi Puri <riship@nvidia.com>
mergify bot added the deepseek (Related to DeepSeek models), qwen (Related to Qwen models), and v1 labels on Mar 6, 2026
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new accuracy evaluation test for the deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B model. The test suite is comprehensive, leveraging lm-evaluation-harness to check for correctness across various precisions (bfloat16, float16, fp8) and on multiple benchmarks (gsm8k, arc_easy, hellaswag). It also includes sanity checks and a smoke test. My review found a couple of inconsistencies in the documentation within the new test file that should be addressed for clarity and correctness.

Comment on lines +4 to +6
Covers:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B

high

The docstring states that this test covers both the 1.5B and 7B versions of the DeepSeek-R1-Distill-Qwen model. However, the SMALL_DEEPSEEK_MODELS list (line 149) only includes the configuration for the 1.5B model. Please either add the configuration for the 7B model to the test suite or update the docstring to accurately reflect the tested models.

- hellaswag (10-shot) – sentence completion

Usage:
pytest tests/lm_eval_correctness/test_deepseek_small_accuracy.py -v

high

The file path in the Usage example is incorrect. It points to tests/lm_eval_correctness/test_deepseek_small_accuracy.py, but the actual file path is tests/v1/attention/test-qwen-deepseek-1.5-accuracy.py. This should be updated to reflect the correct path to make the test runnable as documented.

Suggested change
- pytest tests/lm_eval_correctness/test_deepseek_small_accuracy.py -v
+ pytest tests/v1/attention/test-qwen-deepseek-1.5-accuracy.py -v

mergify bot commented Mar 6, 2026

Hi @puririshi98, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

…accuracy.py

Signed-off-by: Rishi Puri <riship@nvidia.com>

mergify bot commented Mar 9, 2026

Hi @puririshi98, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

puririshi98 and others added 4 commits March 8, 2026 21:45
Signed-off-by: Rishi Puri <riship@nvidia.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>

puririshi98 and others added 6 commits March 9, 2026 05:14
- Test DeepSeek-R1-Distill-Qwen-1.5B across multiple precisions (bf16, fp16, fp8, auto)
- Evaluate on GSM8K (5-shot), ARC Easy (25-shot), and HellaSwag (10-shot)
- Use vLLM's native evaluate_gsm8k() harness instead of lm-eval for GSM8K
- Include FP8 regression test with 6% threshold
- Set default GPU memory utilization to 0.3 for test environments

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
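
The FP8 regression check listed in the commit message above could look roughly like the sketch below. This is only an illustration: the function name `fp8_within_threshold`, the use of a bf16 run as the baseline, and the reading of "6%" as an absolute drop in accuracy are all assumptions, not details taken from the PR's test file.

```python
def fp8_within_threshold(
    baseline_acc: float,
    fp8_acc: float,
    max_drop: float = 0.06,  # assumed meaning of the "6% threshold": absolute accuracy drop
) -> bool:
    """Return True if FP8 accuracy is at most `max_drop` below the baseline.

    Hypothetical helper: `baseline_acc` is assumed to come from a higher-precision
    run (for example bf16) of the same model on the same benchmark.
    """
    return (baseline_acc - fp8_acc) <= max_drop


if __name__ == "__main__":
    # Made-up accuracies, used only to exercise the check.
    print(fp8_within_threshold(baseline_acc=0.70, fp8_acc=0.65))
    print(fp8_within_threshold(baseline_acc=0.70, fp8_acc=0.60))
```

An absolute threshold is easier to reason about for small models with low baseline accuracy; a relative threshold (drop divided by baseline) would be an equally plausible reading of the commit message.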

Signed-off-by: Rishi Puri <riship@nvidia.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
@puririshi98 puririshi98 requested a review from mgoin as a code owner March 9, 2026 22:58
Signed-off-by: Rishi Puri <riship@nvidia.com>
@jasonlizhengjian

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request introduces a new configuration file for the Qwen3.5-0.8B model and updates the models-small.txt list to include it. The changes are straightforward and align with the stated objective of adding new model evaluations. No critical or high-severity issues were identified in the changes.

Note: Security Review is unavailable for this PR.

@jasonlizhengjian jasonlizhengjian left a comment

LGTM. Please change the title to reflect the contents. Also consider adding a more descriptive PR description.

puririshi98 changed the title from "Create test-qwen-deepseek-1.5-accuracy.py" to "add testing for Qwen3.5-0.8B to GSM8K eval" on Mar 10, 2026