[improve][broker] If there is a deadlock in the service, the probe should return a failure because the service may be unavailable #23634
@yyj8 Please add the following content to your PR description and select a checkbox:
lhotari left a comment:
There's already a deadlock check in the health check:
It also contains an example of how to check deadlocks.
pulsar-broker-common/src/main/java/org/apache/pulsar/common/configuration/VipStatus.java
@yyj8 By the way, when you add commits to the PR, it's useful to make each commit title describe the change instead of copying the PR title into the follow-up commits. When the PR is merged, all commits are squashed, so the individual titles won't end up in the final merged commit. The benefit of descriptive commit messages in the PR commits is that the reviewer can follow the changes.
…e should return a failure because the service may be unavailable. Add lastPrintThreadDumpTimestamp field to control the interval time for printing complete thread stack information.
…e should return a failure because the service may be unavailable. Add unit testing code.
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@              Coverage Diff               @@
##             master    #23634       +/-   ##
==============================================
+ Coverage     38.30%    74.27%    +35.97%
- Complexity      100     33281    +33181
==============================================
  Files          1844      1901       +57
  Lines        144273    148403     +4130
  Branches      16726     17204      +478
==============================================
+ Hits          55262    110227    +54965
+ Misses        81479     29400    -52079
- Partials       7532      8776     +1244
…ould return a failure because the service may be unavailable (apache#23634) Co-authored-by: Lari Hotari <lhotari@apache.org> Co-authored-by: Lari Hotari <lhotari@users.noreply.github.com>
…ould return a failure because the service may be unavailable (apache#23634) Co-authored-by: Lari Hotari <lhotari@apache.org> Co-authored-by: Lari Hotari <lhotari@users.noreply.github.com> (cherry picked from commit d833b8b) (cherry picked from commit cb223f7)
…ould return a failure because the service may be unavailable (apache#23634) Co-authored-by: Lari Hotari <lhotari@apache.org> Co-authored-by: Lari Hotari <lhotari@users.noreply.github.com> (cherry picked from commit d833b8b) (cherry picked from commit e199b24)
…ould return a failure because the service may be unavailable (apache#23634) Co-authored-by: Lari Hotari <lhotari@apache.org> Co-authored-by: Lari Hotari <lhotari@users.noreply.github.com> (cherry picked from commit d833b8b)
Fixes #23635
Motivation
In some scenarios, a broker that has deadlocked needs to recover automatically instead of waiting for manual intervention. For example, when the service is deployed in a customer environment, we cannot manage it directly. If the service has a deadlock, the Kubernetes (k8s) probe should return a failure because the service may be unavailable. The probe failure then triggers a restart of the broker pod, which resolves the deadlock.
Modifications
Add deadlock detection to the probe. If a deadlock exists, print the thread stacks and fail the probe with a service-unavailable exception.
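For readers unfamiliar with JVM deadlock detection, the following is a minimal sketch of the idea described above. It is not the code from this PR. It uses the standard `ThreadMXBean.findDeadlockedThreads()` call, throttles the thread dump with a `lastPrintThreadDumpTimestamp` field (the field name comes from the commit message; the interval value is an assumption), and throws a plain `IllegalStateException` as a stand-in for whatever service-unavailable exception the real probe endpoint returns. The class name `DeadlockProbeSketch`, the method names, and `DUMP_INTERVAL_MS` are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the VipStatus implementation from this PR.
final class DeadlockProbeSketch {
    // Assumed interval between full thread dumps, to avoid flooding the log.
    static final long DUMP_INTERVAL_MS = TimeUnit.MINUTES.toMillis(1);

    // Time (epoch millis) at which the last thread dump was printed.
    private static long lastPrintThreadDumpTimestamp = 0L;

    // Returns the ids of deadlocked threads, or an empty array if there are none.
    static long[] findDeadlockedThreadIds() {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        long[] ids = threadBean.findDeadlockedThreads(); // null when no deadlock
        return ids == null ? new long[0] : ids;
    }

    // Returns true at most once per DUMP_INTERVAL_MS and records the time.
    static synchronized boolean shouldPrintDump(long nowMs) {
        if (nowMs - lastPrintThreadDumpTimestamp >= DUMP_INTERVAL_MS) {
            lastPrintThreadDumpTimestamp = nowMs;
            return true;
        }
        return false;
    }

    // Probe body: throws if a deadlock is detected, returns normally otherwise.
    static void check() {
        long[] ids = findDeadlockedThreadIds();
        if (ids.length == 0) {
            return;
        }
        if (shouldPrintDump(System.currentTimeMillis())) {
            ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
            // Include locked monitors and synchronizers in the report.
            for (ThreadInfo info : threadBean.getThreadInfo(ids, true, true)) {
                if (info != null) { // a thread may have exited in the meantime
                    System.err.print(info);
                }
            }
        }
        // Stand-in for a service-unavailable response from the probe endpoint.
        throw new IllegalStateException(
                "Deadlock detected in " + ids.length + " threads; service may be unavailable");
    }

    public static void main(String[] args) {
        check();
        System.out.println("probe ok");
    }
}
```

In a Kubernetes deployment, a liveness probe that hits an endpoint backed by a check like this would start failing once a deadlock appears, and the kubelet would restart the pod after the configured failure threshold.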
Documentation
doc
doc-required
doc-not-needed
doc-complete
Matching PR in forked repository
PR in forked repository:
yyj8#10