HDDS-13108. Refactor StorageVolume to use SlidingWindow #8843

Draft
ptlrs wants to merge 13 commits into apache:master from ptlrs:HDDS-13108-Migrate-failed-volume-checks-to-one-sliding-window

Conversation

@ptlrs ptlrs commented Jul 22, 2025

Please describe your PR in detail:

This PR adopts the new sliding window implementation.
It migrates all existing checks that detect a failed volume to the new time-based sliding window utility.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13108

How was this patch tested?

CI: https://github.com/ptlrs/ozone/actions/runs/16436635030

@ptlrs ptlrs marked this pull request as draft July 22, 2025 06:38
ptlrs commented Jul 22, 2025

Hi @errose28 @Tejaskriya @adoroszlai can you please review this PR?

@errose28 errose28 added the scanners label (Changes related to datanode container and volume scanners) Jul 22, 2025
@Tejaskriya Tejaskriya self-requested a review July 23, 2025 08:31
@errose28 errose28 self-requested a review July 23, 2025 18:46
@Tejaskriya Tejaskriya left a comment

Thanks for working on this, @ptlrs. Please find a suggestion below.

ptlrs commented Jul 31, 2025

Thanks for the review @Tejaskriya. I have added the configuration.

@ptlrs ptlrs requested a review from Tejaskriya July 31, 2025 06:38
@Tejaskriya Tejaskriya left a comment

Looks good to me. @errose28, could you please take a look?

@github-actions

This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.

@github-actions github-actions bot added the stale label Nov 11, 2025
@github-actions

Thank you for your contribution. This PR is being closed due to inactivity. If needed, feel free to reopen it.

@github-actions github-actions bot closed this Nov 25, 2025
ptlrs commented Feb 9, 2026

Hi @errose28, could you please reopen this PR?

@errose28 errose28 reopened this Feb 9, 2026
@github-actions github-actions bot removed the stale label Feb 10, 2026
…e-failed-volume-checks-to-one-sliding-window

# Conflicts:
#	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
#	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/HddsVolume.java
ptlrs commented Feb 12, 2026

Hi @errose28, the conflicts have been resolved for this PR. Could you please take a look?

github-actions bot commented Mar 6, 2026

This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.

@github-actions github-actions bot added the stale label Mar 6, 2026
@adoroszlai adoroszlai left a comment

Thanks @ptlrs for the patch.

@adoroszlai adoroszlai removed the stale label Mar 9, 2026
@adoroszlai adoroszlai changed the title from "HDDS-13108. Migrate failed volume checks to one sliding window" to "HDDS-13108. Refactor StorageVolume to use SlidingWindow" Mar 9, 2026
+ " failure result stored in the sliding window will expire."
+ " Unit could be defined with postfix (ns,ms,s,m,h,d)."
)
private Duration diskCheckSlidingWindowTimeout = DISK_CHECK_SLIDING_WINDOW_TIMEOUT_DEFAULT;
Contributor

What is the recommended relationship between the value of this new property and PERIODIC_DISK_CHECK_INTERVAL_MINUTES_DEFAULT? For example, if a user reconfigures the periodic disk check interval to 2h or 30m, should we suggest that they reconfigure this property too?

Contributor Author

This is a good question.

The periodic disk check currently runs every hour, and the sliding window also covers one hour.

The window should likely cover the span between two consecutive periodic checks, inclusive of both, so that if two periodic checks fail, they are counted in the same window.

Otherwise, a volume could only be failed by a combination of periodic and on-demand checks, and never by two periodic checks alone.

I have updated the sliding window to be as long as the periodic disk check interval plus the timeout value for the disk check.
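
The sizing rule above can be illustrated with a small standalone sketch. This is not code from the patch: the class and method names (`WindowSizingSketch`, `windowFor`, `inSameWindow`) are hypothetical, and the 10-minute disk check timeout is an assumed value chosen only for the example.

```java
import java.time.Duration;

// Hypothetical illustration only; not part of the Ozone code base.
final class WindowSizingSketch {

  // Window length = periodic check interval + disk check timeout,
  // so that two consecutive periodic checks can land in one window.
  static Duration windowFor(Duration checkInterval, Duration checkTimeout) {
    return checkInterval.plus(checkTimeout);
  }

  // True if two failures, given as offsets from a common start,
  // are no further apart than the window length.
  static boolean inSameWindow(Duration first, Duration second, Duration window) {
    return second.minus(first).compareTo(window) <= 0;
  }

  public static void main(String[] args) {
    Duration interval = Duration.ofHours(1);
    Duration timeout = Duration.ofMinutes(10); // assumed value for the example
    Duration window = windowFor(interval, timeout);

    // The second periodic check starts one interval after the first and
    // reports its failure a few minutes later (still within the timeout).
    Duration firstFailure = Duration.ZERO;
    Duration secondFailure = interval.plus(Duration.ofMinutes(5));

    System.out.println("window = interval + timeout: "
        + inSameWindow(firstFailure, secondFailure, window));
    System.out.println("window = interval only: "
        + inSameWindow(firstFailure, secondFailure, interval));
  }
}
```

With a window equal to the interval alone, the second periodic failure in this example arrives after the first has already aged out, which is the situation described above. If the interval is reconfigured to 2h or 30m, the same rule gives a window of the new interval plus the timeout.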

}

LOG.debug("IO test results for volume {}: encountered {} out of {} tolerated failures",
this, ioTestSlidingWindow.getNumEventsInWindow(), ioTestSlidingWindow.getWindowSize());
Contributor

Shall we call getNumEvents() here, instead of getNumEventsInWindow()?

Contributor Author

getNumEventsInWindow is correct here. It will show how many failures are in the actual window. The window size is the number of failures that are allowed.

This line is logged every time when debug logging is enabled. Since our queue is larger than the window, logging getNumEvents could give the false impression that more failures occurred, because it also includes the errors outside the window.

We are using getNumEvents only for logging in tests.
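
To make the distinction between the two counters concrete, here is a minimal sketch under stated assumptions. It is not the SlidingWindow class from this PR: `WindowCountSketch` is a hypothetical class, the current time is passed in explicitly to keep the example deterministic, and keeping one entry more than the window size is an assumption about how a queue could be "larger than the window".

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch; the real SlidingWindow in Ozone may differ.
final class WindowCountSketch {
  private final int windowSize;      // number of tolerated failures
  private final long expiryMillis;   // age after which a failure no longer counts
  private final Deque<Long> timestamps = new ArrayDeque<>();

  WindowCountSketch(int windowSize, Duration expiry) {
    this.windowSize = windowSize;
    this.expiryMillis = expiry.toMillis();
  }

  // Record a failure. The queue keeps at most windowSize + 1 entries
  // (an assumption), dropping the oldest one when it is full.
  void add(long nowMillis) {
    if (timestamps.size() > windowSize) {
      timestamps.removeFirst();
    }
    timestamps.addLast(nowMillis);
  }

  // Everything still held in the queue, including expired entries.
  int getNumEvents() {
    return timestamps.size();
  }

  // Only the failures that are younger than the expiry duration.
  int getNumEventsInWindow(long nowMillis) {
    int count = 0;
    for (long t : timestamps) {
      if (nowMillis - t < expiryMillis) {
        count++;
      }
    }
    return count;
  }

  int getWindowSize() {
    return windowSize;
  }

  public static void main(String[] args) {
    WindowCountSketch window = new WindowCountSketch(2, Duration.ofHours(1));
    long tenMinutes = Duration.ofMinutes(10).toMillis();
    long ninetyMinutes = Duration.ofMinutes(90).toMillis();

    window.add(0L);
    window.add(tenMinutes);
    window.add(ninetyMinutes);

    System.out.println("queued events: " + window.getNumEvents());
    System.out.println("events in window: " + window.getNumEventsInWindow(ninetyMinutes));
    System.out.println("tolerated failures: " + window.getWindowSize());
  }
}
```

In this sketch the two older failures are still queued at the 90-minute mark but have expired, so the queue-based count is higher than the in-window count. That gap is what would make a debug line based on the queue size misleading.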

Co-authored-by: Doroszlai, Attila <6454655+adoroszlai@users.noreply.github.com>