This repository was archived by the owner on Nov 13, 2025. It is now read-only.

Conversation

@connoraird
Contributor

Use mean instead of min, as it should be more reliable, and make the comparison less strict.

@connoraird self-assigned this Nov 12, 2025
@connoraird added the enhancement (New feature or request) label Nov 12, 2025
Member

@paddyroddy left a comment


Good idea. I assume 15% was required for this to pass? That seems a fairly high level of degradation.

@connoraird
Contributor Author

Good idea. I assume 15% was required for this to pass? That seems a fairly high level of degradation.

TBH the 15% is quite arbitrary. The benchmarks were very inconsistent when run on my local machine; the inconsistencies were usually less than 15%, but not always.

@paddyroddy
Member

Can the tolerance in the tests be changed?

@connoraird
Contributor Author

Can the tolerance in the tests be changed?

As in, on a per-test basis? Or do you mean not hard-coding it in nox and passing it in via posargs? Posargs would be easy; a sketch is below.
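
Something like this could do it (the session name, install line, and 15% default are illustrative, not the repository's actual setup; it leans on pytest-benchmark's --benchmark-compare-fail option):

```python
# Hypothetical noxfile.py sketch: read the failure threshold from posargs
# instead of hard-coding it, and apply it to the mean via pytest-benchmark's
# --benchmark-compare-fail option.
import nox


@nox.session
def benchmarks(session: nox.Session) -> None:
    session.install("-e", ".", "pytest", "pytest-benchmark")
    # Default to 15%, but allow e.g. `nox -s benchmarks -- 5` to tighten it.
    tolerance = session.posargs[0] if session.posargs else "15"
    session.run(
        "pytest",
        "--benchmark-only",
        "--benchmark-compare",
        f"--benchmark-compare-fail=mean:{tolerance}%",
    )
```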

@paddyroddy
Member

I had meant in the tests. But whatever you think is best. I just think 15% degradation seems too high.

@connoraird
Contributor Author

I had meant in the tests. But whatever you think is best. I just think 15% degradation seems too high.

I can't see any way to have a per-test comparison, unless we filtered the tests within nox via pytest -k ... A rough sketch of that is below.
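
For illustration only (the -k expressions and percentages are made up, not taken from the test suite):

```python
# Hypothetical sketch: give groups of benchmarks different tolerances by
# filtering with `pytest -k` and running each group as a separate invocation.
import nox


@nox.session
def benchmarks_grouped(session: nox.Session) -> None:
    session.install("-e", ".", "pytest", "pytest-benchmark")
    groups = {
        "broadcast": "25",      # noisier micro-benchmarks get more slack
        "not broadcast": "10",  # everything else is held to a tighter bound
    }
    for expression, tolerance in groups.items():
        session.run(
            "pytest",
            "-k", expression,
            "--benchmark-only",
            "--benchmark-compare",
            f"--benchmark-compare-fail=mean:{tolerance}%",
        )
```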

@connoraird
Contributor Author

Looking at the runs for this PR, most of the stats seem reasonable but you do get some strange ones. For example, this line has a standard deviation much larger than the mean

Name (time in us)                                                                                  Min                    Max                  Mean              StdDev                Median                 IQR            Outliers          OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_broadcast_first[numpy]                                                                    29.3750 (1.72)     43,014.5230 (661.64)      35.0362 (1.92)     382.9808 (134.39)      30.8480 (1.75)       0.8800 (2.31)       1;1213  28,541.9155 (0.52)      12597           1

@connoraird
Contributor Author

This snippet from pytest-benchmark docs seems relevant
[screenshot of a snippet from the pytest-benchmark documentation]

@paddyroddy
Member

Does asv have this issue? Or pytest-codspeed?

@connoraird
Contributor Author

Does asv have this issue? Or pytest-codspeed?

In both pytest-benchmark and pytest-codspeed, there is a "pedantic" mode in which you can specify how the stats should be collected. Perhaps that would be useful? A rough sketch is below.
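
A minimal sketch of pytest-benchmark's pedantic mode (the function under test and the rounds/iterations numbers are placeholders, not taken from this repository):

```python
import numpy as np


def work(x: np.ndarray) -> np.ndarray:
    # Stand-in for the real routine under test.
    return x @ x.T


def test_work_pedantic(benchmark):
    x = np.ones((200, 200))
    # Pin rounds/iterations/warmup instead of letting pytest-benchmark
    # calibrate them, which should make the collected stats more repeatable.
    benchmark.pedantic(work, args=(x,), rounds=100, iterations=5, warmup_rounds=2)
```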

Also we could try using a different timer function as suggested here

[screenshot of the pytest-benchmark documentation suggesting an alternative timer function]
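
Something along these lines might be a starting point (the session name, install line, and 15% threshold are illustrative; it assumes pytest-benchmark's --benchmark-timer option):

```python
# Hypothetical nox session running the benchmarks with a CPU-time timer,
# which should be less sensitive to other load on the machine/VM than the
# default wall-clock timer.
import nox


@nox.session
def benchmarks_cpu_time(session: nox.Session) -> None:
    session.install("-e", ".", "pytest", "pytest-benchmark")
    session.run(
        "pytest",
        "--benchmark-only",
        "--benchmark-timer=time.process_time",
        "--benchmark-compare",
        "--benchmark-compare-fail=mean:15%",
    )
```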

@connoraird
Contributor Author

Does asv have this issue? Or pytest-codspeed?

I can't see why there would be a difference in relation to this unreliability. In my opinion, the issue is that we cannot guarantee the state of the machine/VM we are running on, which will be the case no matter what tool we use.

@connoraird changed the title from "Attempt to make regression tests more reliable" to "gh-21: Attempt to make regression tests more reliable" Nov 13, 2025
@paddyroddy
Member

Also we could try using a different timer function as suggested here

Let's give this a go

@connoraird
Contributor Author

Closing, as this has moved to the glass repo: glass-dev/glass#780

@connoraird closed this Nov 13, 2025
@paddyroddy deleted the connor/make-benchmarks-more-reliable branch November 13, 2025 16:43