[ci] Update rtol for test_classification#36556
Conversation
Code Review
This pull request updates the tolerance for a failing test in test_classification.py. The change involves adding an absolute tolerance (atol=1e-4) to the torch.allclose call and explicitly naming the relative tolerance (rtol). This is a standard approach to handle minor floating-point discrepancies that can occur across different hardware or library versions. The change is well-contained and the updated tolerance seems reasonable for this test case. No high or critical severity issues were found in this pull request.
Signed-off-by: angelayi <yiangela7@gmail.com>
@noooop any thoughts on this? We observed in experiments that the test is flaky on main.
To fix it, we raised the tolerances, which seems reasonable to me.
Will seed_everything and flaky(reruns=3) help, like in #32909? The recently merged #36385 modified this threshold, which may cause conflicts with this PR.
@noooop we can try. If seed_everything() doesn't fix the issue, would you be OK with us updating the tolerances? An alternative could also be to change the prompts.
Actually, I just read #36385. Changing the tolerances that way might make this problem go away too, so we'll test that as well.
I’m okay with changing the tolerances. The threshold is very tight and often causes flakiness when upgrading dependencies. We also frequently catch minor numerical issues from this case, so I’m not sure if it’s a bug or a feature. However, I’ve been very busy recently and don’t have time to dig into the deeper numerical precision issues.
Fixes pytorch/pytorch#175928
This test fails when updating torch to 2.11 on L4s, where the vllm output is
tensor([0.1023, 0.8977]) while the hf output is tensor([0.1024, 0.8976]). However, if we narrow the test down to just the failing prompt, it also fails on torch 2.10. After chatting with @zou3519 we decided to just update the tolerance.
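To see why the old tolerance fails on these outputs, here is a minimal pure-Python sketch of the elementwise check torch.allclose performs, |a - b| <= atol + rtol * |b|, applied to the values above. The default rtol/atol mirror torch.allclose's documented defaults; the helper name and list inputs are illustrative, not the test's actual code, and the exact rtol chosen in this PR may differ.

```python
# Illustrative re-implementation of torch.allclose's elementwise criterion:
# every pair must satisfy |a - b| <= atol + rtol * |b|.
def allclose(a, b, rtol=1e-5, atol=1e-8):  # torch.allclose defaults
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

vllm_out = [0.1023, 0.8977]  # vllm output from the failing test
hf_out = [0.1024, 0.8976]    # hf reference output

# With defaults the budget is ~1e-8 + 1e-5 * 0.1024 ~= 1e-6, far below the
# observed 1e-4 gap, so the comparison fails:
print(allclose(vllm_out, hf_out))              # False
# Adding an explicit atol=1e-4, as this PR does, absorbs the gap:
print(allclose(vllm_out, hf_out, atol=1e-4))   # True
```

This also shows why bumping atol (rather than only rtol) matters here: for values near 0.1, a purely relative tolerance would need to be about 1e-3 to cover an absolute difference of 1e-4.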