
Conversation

@janhoy (Contributor) commented Dec 10, 2025

On Crave runs, this test fails about once per week.

> Task :solr:core:wipeTaskTemp
ERROR: The following test(s) have failed:
  - org.apache.solr.client.solrj.impl.LB2SolrClientTest.testTwoServers (:solr:solrj)
    Test history: 
https://develocity.apache.org/scans/tests?search.rootProjectNames=solr-root&tests.container=org.apache.solr.client.solrj.impl.LB2SolrClientTest&tests.test=testTwoServers 
http://fucit.org/solr-jenkins-reports/history-trend-of-recent-failures.html#series/org.apache.solr.client.solrj.impl.LB2SolrClientTest.testTwoServers
    Test output: /tmp/src/solr/solr/solrj/build/test-results/test/outputs/OUTPUT-org.apache.solr.client.solrj.impl.LB2SolrClientTest.txt
    Reproduce with: ./gradlew :solr:solrj:test --tests "org.apache.solr.client.solrj.impl.LB2SolrClientTest.testTwoServers" "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1 -XX:ReservedCodeCacheSize=120m" -Ptests.seed=BFC8AC5F327E55CE -Ptests.timeoutSuite=600000! -Ptests.useSecurityManager=true -Ptests.file.encoding=US-ASCII

The failure did not reproduce locally even after beasting, which suggests a timing issue that may be more visible on the ultra-fast Crave hardware. An AI analysis of the failure suggested a possible (though unproven) fix: add a small sleep between observing the new Jetty starting up (i.e. being added to liveNodes) and actually sending requests to it.
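For illustration, a hedged sketch of where such a sleep would sit in the test; the helper names, client variable, and assertion below are hypothetical stand-ins, not the actual LB2SolrClientTest code:

  // Hypothetical sketch, not the real test body: the suggested bandaid would go
  // between observing the restarted Jetty as live and sending the first request.
  startJetty(secondServer);                   // hypothetical helper: restart the node
  waitForNodeToAppearLive(secondServer);      // hypothetical helper: node shows up in liveNodes
  Thread.sleep(250);                          // small grace period before hitting the node
  QueryResponse rsp = lbClient.query(new SolrQuery("*:*"));    // SolrJ query via the LB client
  assertEquals(expectedDocs, rsp.getResults().getNumFound());  // hypothetical expectation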

@janhoy (Contributor, Author) commented Dec 10, 2025

Adding you, @dsmiley, as you have touched SolrJ a lot lately, though I don't think any of the recent work causes these test failures.

Due to recent renames, the Develocity history for this test is split across two class names. Older history can be seen under LBHttp2SolrClientIntegrationTest at https://develocity.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=solr-root&search.timeZoneId=Europe%2FOslo&tests.container=org.apache.solr.client.solrj.impl.LBHttp2SolrClientIntegrationTest, which confirms that this test has been flaky for a long time.

@epugh (Contributor) commented Dec 10, 2025

More of a general question: how can we make our test platform not require super-"magic" changes like this on a per-class basis, and solve it more globally? Is there a way this could have been solved in our core code? Is it possible that in real life someone would hit the same issue of spinning up a Jetty and then sending requests to it? Could we instead have a built-in retry or liveness-probe approach?
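To make the last question concrete, here is a minimal sketch of what a built-in, liveness-aware retry could look like; the StartupRetry class and withStartupRetry helper are purely illustrative and not an existing Solr API:

import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.Callable;
import org.apache.solr.client.solrj.SolrServerException;

// Illustrative only -- not an existing Solr API. A retry like this, built into the
// client or the test framework, could absorb the window where a freshly started
// Jetty is already registered as live but not yet answering requests.
final class StartupRetry {
  static <T> T withStartupRetry(Callable<T> request, Duration timeout) throws Exception {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      try {
        return request.call();
      } catch (SolrServerException | IOException e) {
        // Typically a connection refused while the node finishes starting up.
        if (System.nanoTime() > deadline) throw e;
        Thread.sleep(100); // brief backoff, then probe again
      }
    }
  }
}

The first request to a just-started node would then be wrapped in something like withStartupRetry(() -> client.query(q), Duration.ofSeconds(10)) instead of each test adding its own sleep.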

@janhoy (Contributor, Author) commented Dec 10, 2025

And the failing precommit due to Antora/Node is also quite annoying; I feel like I'm seeing it all the time:

* What went wrong:
Execution failed for task ':solr:solr-ref-guide:buildLocalAntoraSite'.
> Process 'command '/home/runner/work/solr/solr/solr/solr-ref-guide/.gradle/node/nodejs/node-v22.18.0-linux-x64/bin/npx'' finished with non-zero exit value 1

<BEGIN RANT>
Flaky tests are a productivity drain on the whole community. Seeing a bunch of solrbot PRs go red due to test flakiness delays dependency upgrades and reduces velocity.

Also, could we somehow split the test suite into two tiers, where a "core" tier runs on every PR and with a normal gradle test, and a second tier contains the long-running and flaky tests? We already have a nightly tier; perhaps we can just move more tests to nightly (see the sketch after this rant), dunno.

But there is an elephant in this room: we have been lazy, and almost all our tests are integration tests, where perhaps half of them could have been written as unit tests with mocks etc. A normal drive-by contributor should be able to run our test suite in 5-10 minutes, not, as today, in 1.5 hours with half the runs failing.
</END RANT>
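On the nightly idea mentioned above, the Lucene test framework already provides an annotation-based lever. A minimal sketch, assuming the @Nightly annotation lives under org.apache.lucene.tests.util.LuceneTestCase in the Lucene version Solr currently uses (the package path has moved between versions), with a made-up test class name:

import org.apache.lucene.tests.util.LuceneTestCase.Nightly; // assumed package path
import org.apache.solr.SolrTestCaseJ4;
import org.junit.Test;

// Sketch: a slow or flaky suite moved to the nightly tier. It is skipped on normal
// runs and executed only when nightly tests are enabled (tests.nightly=true).
@Nightly
public class SomeSlowIntegrationTest extends SolrTestCaseJ4 {

  @Test
  public void testSomethingSlow() throws Exception {
    // long-running or flaky integration scenario would live here
  }
}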

@dsmiley (Contributor) commented Dec 11, 2025

I looked at this for 15 minutes just now. This may help a little, but I'm not optimistic; obviously it is just a bandaid and a test anti-pattern -- what I tell new engineers not to do. I wish we had better means of reproducing flaky failures other than beasting. In particular, I wonder if there's a technique and/or tool that can simulate the JVM slowing down a lot without actually burdening one's machine.

I suspect the average test on Crave takes longer, but the build is massively parallelized, running a crazy number of tests at once.

@dsmiley (Contributor) commented Dec 11, 2025

A better compromise without more sleeps on the happy path would be to execute the test that follows this call with org.apache.solr.common.util.RetryUtil. That, I'd get behind.
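For illustration, a hedged sketch of that approach; the retryOnException(exceptionClass, timeoutMs, intervalMs, cmd) signature used below is my reading of org.apache.solr.common.util.RetryUtil and should be checked against the actual class, and the client, query, and expected count are stand-ins for the real test body:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.util.RetryUtil;

// Hedged sketch: retry the assertion for a bounded time instead of sleeping on the
// happy path. The RetryUtil signature is assumed here, not verified, and the
// exception class to retry on depends on how the test actually fails.
RetryUtil.retryOnException(
    AssertionError.class, // keep retrying while the freshly started Jetty is not serving yet
    10_000,               // give up after 10 seconds
    250,                  // poll every 250 ms
    () -> {
      QueryResponse rsp = client.query(new SolrQuery("*:*")); // stand-in query
      assertEquals(2, rsp.getResults().getNumFound());        // stand-in expectation
    });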

@janhoy (Contributor, Author) commented Dec 11, 2025

Thanks for bringing attention to RetryUtil; I hadn't seen it before. I switched to that strategy and improved RetryUtil's documentation at the same time.

@dsmiley (Contributor) left a comment

Thanks.
Just some minor feedback.

janhoy merged commit 510cdcf into apache:main Dec 14, 2025
4 of 6 checks passed
janhoy deleted the fix-test-instability-LB2SolrClientTest branch Dec 14, 2025 18:32
janhoy added a commit that referenced this pull request Dec 14, 2025
janhoy added a commit that referenced this pull request Dec 14, 2025