
Conversation

@janhoy (Contributor) commented Dec 10, 2025

On Crave runs, this test fails about once per week.

> Task :solr:core:wipeTaskTemp
ERROR: The following test(s) have failed:
  - org.apache.solr.client.solrj.impl.LB2SolrClientTest.testTwoServers (:solr:solrj)
    Test history: 
https://develocity.apache.org/scans/tests?search.rootProjectNames=solr-root&tests.container=org.apache.solr.client.solrj.impl.LB2SolrClientTest&tests.test=testTwoServers 
http://fucit.org/solr-jenkins-reports/history-trend-of-recent-failures.html#series/org.apache.solr.client.solrj.impl.LB2SolrClientTest.testTwoServers
    Test output: /tmp/src/solr/solr/solrj/build/test-results/test/outputs/OUTPUT-org.apache.solr.client.solrj.impl.LB2SolrClientTest.txt
    Reproduce with: ./gradlew :solr:solrj:test --tests "org.apache.solr.client.solrj.impl.LB2SolrClientTest.testTwoServers" "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1 -XX:ReservedCodeCacheSize=120m" -Ptests.seed=BFC8AC5F327E55CE -Ptests.timeoutSuite=600000! -Ptests.useSecurityManager=true -Ptests.file.encoding=US-ASCII

The failure did not reproduce locally even after beasting, which suggests a timing issue that may be more visible on the ultra-fast Crave hardware. An AI analysis of the failure suggested a possible (though unproven) fix: add a small sleep between observing the new Jetty starting up (i.e. being added to liveNodes) and actually sending requests to it.
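For illustration, a hedged sketch of where such a sleep would sit in the test; the helper names, client variable, and assertion below are hypothetical stand-ins, not the actual LB2SolrClientTest code:

  // Hypothetical sketch, not the real test body: the suggested bandaid would go
  // between observing the restarted Jetty as live and sending the first request.
  startJetty(secondServer);                   // hypothetical helper: restart the node
  waitForNodeToAppearLive(secondServer);      // hypothetical helper: node shows up in liveNodes
  Thread.sleep(250);                          // small grace period before hitting the node
  QueryResponse rsp = lbClient.query(new SolrQuery("*:*"));    // SolrJ query via the LB client
  assertEquals(expectedDocs, rsp.getResults().getNumFound());  // hypothetical expectation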

@janhoy (Contributor, Author) commented Dec 10, 2025

Adding you, @dsmiley, as you have touched SolrJ a lot lately, though I don't think any of the recent work causes these test failures.

Due to recent renames, the Develocity history for this test is split across two class names. Older history can be seen under LBHttp2SolrClientIntegrationTest at https://develocity.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=solr-root&search.timeZoneId=Europe%2FOslo&tests.container=org.apache.solr.client.solrj.impl.LBHttp2SolrClientIntegrationTest, which confirms that this test has been flaky for a long time.

@epugh (Contributor) commented Dec 10, 2025

More of a general question: how can we make our test platform not require super-"magic" changes like this on a per-class basis, and solve it more globally? Is there a way this could have been solved in our core code? Is it possible that in real life someone would hit the same issue of spinning up a Jetty and then sending requests to it? Could we instead have a built-in retry or liveness-probe approach?
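To make the last question concrete, here is a minimal sketch of what a built-in, liveness-aware retry could look like; the StartupRetry class and withStartupRetry helper are purely illustrative and not an existing Solr API:

import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.Callable;
import org.apache.solr.client.solrj.SolrServerException;

// Illustrative only -- not an existing Solr API. A retry like this, built into the
// client or the test framework, could absorb the window where a freshly started
// Jetty is already registered as live but not yet answering requests.
final class StartupRetry {
  static <T> T withStartupRetry(Callable<T> request, Duration timeout) throws Exception {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      try {
        return request.call();
      } catch (SolrServerException | IOException e) {
        // Typically a connection refused while the node finishes starting up.
        if (System.nanoTime() > deadline) throw e;
        Thread.sleep(100); // brief backoff, then probe again
      }
    }
  }
}

The first request to a just-started node would then be wrapped in something like withStartupRetry(() -> client.query(q), Duration.ofSeconds(10)) instead of each test adding its own sleep.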

@janhoy (Contributor, Author) commented Dec 10, 2025

And the failing precommit due to Antora/Node is also quite annoying; I feel like I'm seeing it all the time:

* What went wrong:
Execution failed for task ':solr:solr-ref-guide:buildLocalAntoraSite'.
> Process 'command '/home/runner/work/solr/solr/solr/solr-ref-guide/.gradle/node/nodejs/node-v22.18.0-linux-x64/bin/npx'' finished with non-zero exit value 1

<BEGIN RANT>
Flaky tests are a productivity drain on the whole community. Seeing a bunch of solrbot PRs go red due to test flakiness delays dependency upgrades and reduces velocity.

Also, could we somehow split the test suite into two tiers, where a "core" tier runs on every PR and with a normal gradle test, and a second tier contains the long-running and flaky tests? We already have a nightly tier; perhaps we can just move more tests to nightly (see the sketch after this rant), dunno.

But there is an elephant in this room: we have been lazy, and almost all our tests are integration tests, where perhaps half of them could have been written as unit tests with mocks etc. A normal drive-by contributor should be able to run our test suite in 5-10 minutes, not, as today, in 1.5 hours with half the runs failing.
</END RANT>
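On the nightly idea mentioned above, the Lucene test framework already provides an annotation-based lever. A minimal sketch, assuming the @Nightly annotation lives under org.apache.lucene.tests.util.LuceneTestCase in the Lucene version Solr currently uses (the package path has moved between versions), with a made-up test class name:

import org.apache.lucene.tests.util.LuceneTestCase.Nightly; // assumed package path
import org.apache.solr.SolrTestCaseJ4;
import org.junit.Test;

// Sketch: a slow or flaky suite moved to the nightly tier. It is skipped on normal
// runs and executed only when nightly tests are enabled (tests.nightly=true).
@Nightly
public class SomeSlowIntegrationTest extends SolrTestCaseJ4 {

  @Test
  public void testSomethingSlow() throws Exception {
    // long-running or flaky integration scenario would live here
  }
}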

@dsmiley (Contributor) commented Dec 11, 2025

I looked at this for 15 minutes just now. This may help a little, but I'm not optimistic; obviously it is just a bandaid and a test anti-pattern -- what I tell new engineers not to do. I wish we had better means of reproducing flaky failures other than beasting. In particular, I wonder if there's a technique and/or tool that can simulate the JVM slowing down a lot without actually burdening one's machine.

I suspect the average test on Crave takes longer, but the build is massively parallelized, running a crazy number of tests at once.

@dsmiley (Contributor) commented Dec 11, 2025

A better compromise without more sleeps on the happy path would be to execute the test that follows this call with org.apache.solr.common.util.RetryUtil. That, I'd get behind.
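For illustration, a hedged sketch of that approach; the retryOnException(exceptionClass, timeoutMs, intervalMs, cmd) signature used below is my reading of org.apache.solr.common.util.RetryUtil and should be checked against the actual class, and the client, query, and expected count are stand-ins for the real test body:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.util.RetryUtil;

// Hedged sketch: retry the assertion for a bounded time instead of sleeping on the
// happy path. The RetryUtil signature is assumed here, not verified, and the
// exception class to retry on depends on how the test actually fails.
RetryUtil.retryOnException(
    AssertionError.class, // keep retrying while the freshly started Jetty is not serving yet
    10_000,               // give up after 10 seconds
    250,                  // poll every 250 ms
    () -> {
      QueryResponse rsp = client.query(new SolrQuery("*:*")); // stand-in query
      assertEquals(2, rsp.getResults().getNumFound());        // stand-in expectation
    });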

@janhoy (Contributor, Author) commented Dec 11, 2025

Thanks for bringing attention to RetryUtil; I hadn't seen it before. I switched to that strategy and improved RetryUtil's documentation at the same time.

@dsmiley (Contributor) left a comment

Thanks.
Just some minor feedback.

janhoy merged commit 510cdcf into apache:main Dec 14, 2025
4 of 6 checks passed
janhoy deleted the fix-test-instability-LB2SolrClientTest branch Dec 14, 2025 18:32
janhoy added a commit that referenced this pull request Dec 14, 2025
janhoy added a commit that referenced this pull request Dec 14, 2025