Flakiness in LB2SolrClientTest.testTwoServers #3937
Conversation
Adding you @dsmiley as you touched SolrJ a lot lately, but I don't think any of the recent work causes these test failures. Due to recent renames, the develocity history for this test is split across two class names. Older history can be seen as
More of a general question: how can we make our test platform not require super-"magic" changes like this on a per-class basis, and instead solve it more globally? Is there a way this could have been solved in our core code? Is it possible that in real life someone would hit the same issue of spinning up a Jetty and then immediately sending requests? Could we have a built-in retry or liveness-probe approach instead?
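For illustration only, a minimal sketch of what such a built-in liveness probe might look like: poll the new node with a cheap ping until it answers, with a bounded timeout, instead of sleeping for a fixed amount of time. The class, the method `probeAlive`, and the timeout value are hypothetical names for this sketch, not existing Solr or SolrJ utilities.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;

final class LivenessProbe {
  private static final long PROBE_TIMEOUT_MS = TimeUnit.SECONDS.toMillis(30);

  /** Blocks until the node answers a ping, or throws after the timeout. */
  static void probeAlive(SolrClient client) throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(PROBE_TIMEOUT_MS);
    while (true) {
      try {
        client.ping(); // cheap request; only succeeds once the node is actually serving
        return;
      } catch (SolrServerException | IOException e) {
        if (System.nanoTime() > deadline) {
          throw new AssertionError("node never became live within " + PROBE_TIMEOUT_MS + " ms", e);
        }
        Thread.sleep(100); // brief back-off before the next probe
      }
    }
  }
}
```

A probe like this keeps the happy path fast (a single successful ping) while giving slower machines as much time as they need.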
And the failing precommit due to Antora / Node is also quite annoying; I feel like I'm seeing it all the time. <BEGIN RANT> Also, could we somehow split the test suite into two tiers, where a "core" tier is what is run on every PR and with normal settings? But there is an elephant in this room: we have been lazy, and almost all our tests are integration tests, where perhaps half of them could have been written as unit tests with mocks etc. Our test suite should be possible for a normal drive-by contributor to run in 5-10 minutes, not 1.5 hours with half of the runs failing, as it is today.
I looked at this for 15 minutes just now. This may help a little, but I'm not optimistic; it's obviously just a band-aid and a test anti-pattern -- what I tell new engineers not to do. I wish we had better means of reproducing flaky tests other than beasting. In particular, I wonder if there's a technique and/or tool that can simulate the JVM slowing down a lot without actually burdening one's machine. I suspect the average test on Crave takes longer, but Crave is massively parallelized, running a crazy number of tests at once.
A better compromise without more sleeps on the happy path would be to execute the test that follows this call with
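As a sketch of the general idea only (the helper `retryQuery` and the retry parameters are hypothetical, not the exact mechanism being suggested here): retry the follow-up request a bounded number of times, so the common case pays no extra latency and only a slow run waits.

```java
import java.io.IOException;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.QueryResponse;

final class RetryOnFailure {
  /** Hypothetical helper: the happy path returns immediately; only failing attempts back off. */
  static QueryResponse retryQuery(SolrClient client, SolrQuery query, int attempts)
      throws SolrServerException, IOException, InterruptedException {
    for (int i = 1; ; i++) {
      try {
        return client.query(query); // usually succeeds on the first try, so no added latency
      } catch (SolrServerException | IOException e) {
        if (i >= attempts) throw e; // give up after the configured number of attempts
        Thread.sleep(250L * i);     // only the unlucky (slow) runs pay this back-off
      }
    }
  }
}
```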
Thanks for bringing attention to
dsmiley left a comment
Thanks.
Just some minor feedback.
solr/solrj/src/test/org/apache/solr/client/solrj/impl/LB2SolrClientTest.java
(cherry picked from commit 510cdcf)
On Crave runs, about one failure per week.
The failure did not reproduce locally even after beasting, which suggests a timing issue that may be more visible on the ultra-fast Crave hardware. The failure was analyzed with an AI, and a possible (though unproven) fix is to add a small sleep between observing the new Jetty start up (i.e., being added to liveNodes) and actually sending requests to it.
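For concreteness, a sketch of where such a grace period would sit. The class and method names, the query, and the sleep value are illustrative only; the real test's bookkeeping around live servers is not shown.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;

final class GracePeriod {
  /** Hypothetical band-aid: pause briefly after the node looks live, then send the first request. */
  static void queryAfterGrace(SolrClient client, long graceMs) throws Exception {
    Thread.sleep(graceMs);              // small grace period after the Jetty appears in liveNodes
    client.query(new SolrQuery("*:*")); // first real request routed to the freshly started node
  }
}
```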