HDDS-13108. Refactor StorageVolume to use SlidingWindow#8843
HDDS-13108. Refactor StorageVolume to use SlidingWindow#8843ptlrs wants to merge 13 commits intoapache:masterfrom
Conversation
… scanner failures
…e-failed-volume-checks-to-one-sliding-window
…g window mechanism
|
Hi @errose28 @Tejaskriya @adoroszlai can you please review this PR? |
Tejaskriya
left a comment
There was a problem hiding this comment.
Thanks for working on this @ptlrs , please find a suggestion below
...ner-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
Outdated
Show resolved
Hide resolved
|
Thanks for the review @Tejaskriya. I have added the configuration. |
Tejaskriya
left a comment
There was a problem hiding this comment.
Looks good to me, @errose28 could you please take a look?
|
This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days. |
|
Thank you for your contribution. This PR is being closed due to inactivity. If needed, feel free to reopen it. |
|
Hi @errose28, could you please reopen this PR? |
…e-failed-volume-checks-to-one-sliding-window # Conflicts: # hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java # hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/HddsVolume.java
|
Hi @errose28, the conflicts have been resolved for this PR. Could you please take a look. |
|
This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days. |
…failed-volume-checks-to-one-sliding-window
adoroszlai
left a comment
There was a problem hiding this comment.
Thanks @ptlrs for the patch.
...tainer-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/HddsVolume.java
Outdated
Show resolved
Hide resolved
...test/java/org/apache/hadoop/ozone/container/common/volume/TestStorageVolumeHealthChecks.java
Outdated
Show resolved
Hide resolved
| + " failure result stored in the sliding window will expire." | ||
| + " Unit could be defined with postfix (ns,ms,s,m,h,d)." | ||
| ) | ||
| private Duration diskCheckSlidingWindowTimeout = DISK_CHECK_SLIDING_WINDOW_TIMEOUT_DEFAULT; |
There was a problem hiding this comment.
What's the recommendation of value relationship between this new property and PERIODIC_DISK_CHECK_INTERVAL_MINUTES_DEFAULT? Say if user reconfigured PERIODIC_DISK_CHECK_INTERVAL_MINUTES_DEFAULT to 2h, or 30m, shall we suggest user to reconfigure this property too?
There was a problem hiding this comment.
This is a good question.
The period disk check currently runs every 1 hour and the sliding window coverage is also for 1 hour.
The window should likely cover the period between and inclusive of the two periodic checks such that if two periodic checks fail, then they get counted in the same window.
Otherwise the only opportunity for a failure will be by a combination of periodic and on-demand checks and never due to two periodic checks.
I have updated the sliding window to be as long as the periodic disk check interval plus the timeout value for the disk check.
...ner-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java
Show resolved
Hide resolved
| } | ||
|
|
||
| LOG.debug("IO test results for volume {}: encountered {} out of {} tolerated failures", | ||
| this, ioTestSlidingWindow.getNumEventsInWindow(), ioTestSlidingWindow.getWindowSize()); |
There was a problem hiding this comment.
Shall we call getNumEvents() here, instead of getNumEventsInWindow()?
There was a problem hiding this comment.
getNumEventsInWindow is correct here. It will show how many failures are in the actual window. The window size is the number of failures that are allowed.
This log line is logged every time in debug mode. Since our queue is larger than the window, logging getNumEvents may give us the false impression that we got more failures as it will include the errors outside the window as well.
We are using getNumEvents only for logging in tests.
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/utils/SlidingWindow.java
Outdated
Show resolved
Hide resolved
Co-authored-by: Doroszlai, Attila <6454655+adoroszlai@users.noreply.github.com>
Please describe your PR in detail:
This PR uses the new sliding window implementation.
It migrates all existing checks to detect a failed volume to use the new time-based sliding window utility.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13108
How was this patch tested?
CI:https://github.com/ptlrs/ozone/actions/runs/16436635030