[SPARK-24266][K8S][3.0] Restart the watcher when we receive a version changed from k8s #29533
jkleckner wants to merge 1 commit into apache:branch-3.0 from jkleckner/backport-SPARK-24266-to-branch-3.0
Conversation
|
ok to test |
|
cc @holdenk |
|
Test build #127914 has finished for PR 29533 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
We also see this problem with Spark 2.4.X so I would very much like to see this landed for both versions (I was redirected here from the other PR). |
|
Ok sounds like something to keep in mind. |
|
This comment exists merely to link this PR with PR #29496 intended for Spark 2.4 branch |
|
Any review update, @holdenk ? |
|
Jenkins retest this please. |
|
I haven’t had a chance to look, but given it’s a backport, don’t feel like you need to wait for me. |
|
Test build #128023 has finished for PR 29533 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Is there some way to get the expanded test information for this single failed R integration test, as described in the message? @stijndehaes, do you have any insight into the test suite changes and what made it succeed on the master branch, as discussed in #28423? |
|
Test build #128053 has finished for PR 29533 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #128054 has finished for PR 29533 at commit
|
|
Test build #128055 has finished for PR 29533 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Does the integration test flakiness described by @holdenk in SPARK-32354 [1] apply to this build? |
|
Huh did someone re-enable the R tests? |
|
Kubernetes integration test status success |
|
@holdenk Ok, integration tests pass. It was just a two-word deletion... |
|
@dongjoon-hyun Is this ok to merge? |
  with BeforeAndAfterAll with BeforeAndAfter with BasicTestsSuite with SecretsTestsSuite
  with PythonTestsSuite with ClientModeTestsSuite with PodTemplateSuite with PVTestsSuite
- with DepsTestsSuite with RTestsSuite with Logging with Eventually with Matchers {
+ with DepsTestsSuite with Logging with Eventually with Matchers {
Please revert this. branch-3.0 doesn't have the R test issue.
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- Run SparkR on simple dataframe.R example
Run completed in 8 minutes, 12 seconds.
Total number of tests run: 19
Suites: completed 2, aborted 0
Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
@dongjoon-hyun We have a review loop.
@holdenk asked me to turn off the SparkR tests - see above.
And @holdenk created a story for master to make the integration tests work properly for SparkR here [1].
[1] https://issues.apache.org/jira/projects/SPARK/issues/SPARK-32354
dongjoon-hyun
left a comment
Hi, @jkleckner .
I took a brief look. The change to KubernetesSuite.scala should be reverted from this PR.
|
@jkleckner I tried this patch in production but it does not seem to work.
Distribution:
Image creation:
When I execute my long-running spark application I get
I'm not sure if I made a mistake or there's a problem in the patch. Thanks |
Your rebase of the patch looks correct when I tried it out. Unfortunately, there must be something in the patches to the k8s code after 2.4 and before the fix #28423 was merged. My interest in landing this patch has been to unblock #29496, due to the backporting policies for Spark 2.4, which is what we use; I don't have a setup to test this for the 3.0 branch. @redsk, if you would like to look into that, it would be helpful. The candidate patch sets in the k8s code can be viewed with something like this: @stijndehaes, can you guide us as to which of these might affect this backport? |
|
@redsk @jkleckner The error line you are seeing comes from the class
These also look like driver logs? The fix here is in the spark-submit application, so you should watch the logs of that instead of the driver. Can you tell me whether these were driver logs or spark-submit logs? |
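As a rough illustration of the spark-submit-side behavior being discussed: a minimal sketch, not the actual Spark source. `PodStatusWatcher` and `resourceTooOld` are hypothetical names; only the fabric8 `Watcher` callbacks and the HTTP 410 (Gone) check follow the pattern described in fabric8io/kubernetes-client#1075.

```scala
import java.net.HttpURLConnection.HTTP_GONE

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}
import io.fabric8.kubernetes.client.Watcher.Action

// Illustrative pod watcher: records whether the watch closed because the
// server no longer has our resource version (HTTP 410 Gone), so the
// caller knows to restart the watch instead of treating it as fatal.
class PodStatusWatcher extends Watcher[Pod] {
  @volatile var resourceTooOld = false

  override def eventReceived(action: Action, pod: Pod): Unit = {
    // spark-submit logs pod status transitions here
  }

  override def onClose(e: KubernetesClientException): Unit = {
    if (e != null && e.getCode == HTTP_GONE) {
      resourceTooOld = true // resource version changed; re-watch needed
    }
  }
}
```

The key point of the fix is that an `onClose` with code 410 is a recoverable condition, not an application failure.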
|
Hi, @jkleckner and all. |
|
@dongjoon-hyun Since I don't have a test environment to try out or observe the problem mentioned by @redsk, it will have to be taken up by someone like @redsk. I am fully on a critical path for the next many weeks, and the backport to 2.4 is working fine for us, so I can't spend any more time on this now. As mentioned by @stijndehaes, this could very well be a separate issue from the fix that SPARK-24266 addresses. I did spend some time last week and found that the patch sets that upgrade fabric8 contain fixes that might plausibly explain the problem seen by @redsk. I could revert the one-line change that you want reverted (it was requested by @holdenk), but I feel caught in the middle and would prefer that you both agree. |
…ged from k8s

Restart the watcher when it fails with an HTTP_GONE code from the kubernetes api, which means a resource version has changed. For more relevant information see here: fabric8io/kubernetes-client#1075

No

Running spark-submit to a k8s cluster. Not sure how to make an automated test for this. If someone can help me out that would be great.

Closes apache#28423 from stijndehaes/bugfix/k8s-submit-resource-version-change.

Address review comment to fully qualify import scala.util.control
Rebase on branch-3.0 to fix SparkR integration test.
|
I reverted the SparkR change per your instruction and also rebased onto branch-3.0. |
|
Thanks, @jkleckner . For the SparkR part, we can ignore it in this PR if it fails again. And,
|
|
Test build #130558 has finished for PR 29533 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
dongjoon-hyun
left a comment
+1, LGTM. Thank you, all!
Merged to branch-3.0 for Apache Spark 3.0.2.
I also hit the same ExecutorPodsWatchSnapshotSource issue reported by @redsk, without this patch, on the latest branch-3.0. I agree that it is another watcher issue.
… changed from k8s

### What changes were proposed in this pull request?

This is a straight application of #28423 onto branch-3.0.

Restart the watcher when it fails with an HTTP_GONE code from the kubernetes api, which means a resource version has changed. For more relevant information see here: fabric8io/kubernetes-client#1075

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This was tested in #28423 by running spark-submit to a k8s cluster.

Closes #29533 from jkleckner/backport-SPARK-24266-to-branch-3.0.

Authored-by: Stijn De Haes <stijndehaes@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
|
@redsk Could you file a new JIRA for your observation, please? |
|
@dongjoon-hyun Thank you for the merge and hanging in there. Hopefully @shockdm will revive the 2.4 branch fix kicked off in #29496 |
|
@dongjoon-hyun I've created SPARK-33349 as requested. Thanks |
|
Thank you so much, @redsk ! |
… changed from k8s This is a backport of apache#29533 from master. It includes shockdm/pull/1, which has been squashed, with the import review comment included. It has also been rebased onto branch-2.4.
… changed from k8s This is a backport of apache#29533 from master. It includes shockdm/pull/1, which has been squashed, with the import review comment included. It has also been rebased onto branch-2.4. Address review comments.
What changes were proposed in this pull request?
This is a straight application of #28423 onto branch-3.0
Restart the watcher when it fails with an HTTP_GONE code from the kubernetes api, which means a resource version has changed.
For more relevant information see here: fabric8io/kubernetes-client#1075
Does this PR introduce any user-facing change?
No
How was this patch tested?
This was tested in #28423 by running spark-submit to a k8s cluster.
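The restart behavior this PR describes can be sketched as a loop around the watch. This is a hypothetical, self-contained simplification rather than the Spark implementation: `openWatchAndAwaitClose` stands in for opening a fabric8 watch and blocking until it closes, and the simulated close causes are made up for illustration; only the retry-on-HTTP 410 logic reflects the patch.

```scala
import java.net.HttpURLConnection.HTTP_GONE

object WatchRestartSketch {
  // Simulated close causes: the first watch dies with 410 Gone (resource
  // version changed), the second closes cleanly (None). In real code this
  // would block on the k8s pod watch and report why it closed.
  private val closes = Iterator[Option[Int]](Some(HTTP_GONE), None)
  private def openWatchAndAwaitClose(): Option[Int] = closes.next()

  def main(args: Array[String]): Unit = {
    var restarts = 0
    var finished = false
    while (!finished) {
      openWatchAndAwaitClose() match {
        case Some(HTTP_GONE) =>
          restarts += 1   // resource version changed: open a fresh watch
        case _ =>
          finished = true // application done, or a non-recoverable failure
      }
    }
    println(s"watch restarted $restarts time(s)")
  }
}
```

The design point is that 410 Gone is the only close cause that triggers a re-watch; any other close ends the loop as before the patch.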