HDDS-14751. Add basic ZDU flow in acceptance tests by dombizita · Pull Request #9877 · apache/ozone

dombizita · 2026-03-06T23:51:21Z

What changes were proposed in this pull request?

This change adds a new rolling-upgrade suite to the upgrade acceptance tests. The initial version covers a simple upgrade scenario: it goes over all the services one by one, stopping each and starting it again with the new image. After each docker container stop it generates data, and once the service is restarted it validates the generated data. I moved some general methods from the non-rolling-upgrade suite into testlib.sh so they can be reused. A small typo in the readme is also fixed.

In follow-up patches I'll add the downgrade and pre-finalization tests (I added TODO comments for them, since this is on a feature branch). The test run is commented out for now, until the necessary ZDU changes are merged.
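The per-service flow described above can be sketched roughly as follows. This is only an illustration: the function names, the stub bodies, and the service list in the usage note are hypothetical and are not taken from driver.sh or testlib.sh.

```shell
# Stubs standing in for the real docker and test steps (hypothetical names).
stop_service()             { echo "stop $1"; }
generate_data()            { echo "generate $1"; }
start_service_with_image() { echo "start $1 image=$2"; }
validate_data()            { echo "validate $1"; }

# Restart every given service, one at a time, on the new image.
# Usage: rolling_upgrade NEW_IMAGE SERVICE...
rolling_upgrade() {
  local new_image="$1"
  shift
  local service
  for service in "$@"; do
    stop_service "$service"
    generate_data "$service"      # write data while the service is down
    start_service_with_image "$service" "$new_image"
    validate_data "$service"      # check the data once the service is back
  done
}
```

A call such as `rolling_upgrade 2.2.0 scm1 scm2 scm3 om1 om2 om3` would walk the services in the order given; the order used by the actual suite is not stated here and is an assumption.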

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-14751

How was this patch tested?

Successful CI run with the rolling upgrade test enabled (not commented out): https://github.com/dombizita/ozone/actions/runs/22766179609

errose28

Thanks @dombizita for getting this started. Left some minor comments.

errose28 · 2026-03-20T16:43:31Z

hadoop-ozone/dist/src/main/compose/upgrade/compose/ha/docker-compose.yaml

+  extra_hosts:
+    - "om1:10.9.0.11"
+    - "om2:10.9.0.12"
+    - "om3:10.9.0.13"
+    - "scm1.org:10.9.0.14"
+    - "scm2.org:10.9.0.15"
+    - "scm3.org:10.9.0.16"


Is this required? We already have the ipv4_address fields set for each service.

dombizita · Mar 25, 2026

Without this I saw test failures with key creation timeouts:

/Users/zitadombi/git_repos/ozone/hadoop-ozone/dist/target/ozone-2.2.0-SNAPSHOT/compose/upgrade fails like this:

--- RESTARTING scm1 WITH IMAGE 2.2.0 ---
--- STOPPING scm1 ---
--- STOPPED scm1 ---
--- SCM BEFORE: scm3 ---
--- SCM AFTER: scm3 ---
--- CALLING before_service_restart with scm1 ---
Using Docker Compose v2
==============================================================================
2.2.0-2.2.0-2-scm1-generate-generate-scm1 :: Generate data
==============================================================================
Create a volume and bucket | PASS |
------------------------------------------------------------------------------
Create key | FAIL |
Test timeout 5 minutes exceeded.
------------------------------------------------------------------------------
Create a bucket in s3v volume | PASS |
------------------------------------------------------------------------------
Create key in the bucket in s3v volume | FAIL |
Test timeout 5 minutes exceeded.
------------------------------------------------------------------------------
Try to create a bucket using S3 API | PASS |
------------------------------------------------------------------------------
Create key using S3 API | FAIL |
Test timeout 5 minutes exceeded.
------------------------------------------------------------------------------
2.2.0-2.2.0-2-scm1-generate-generate-scm1 :: Generate data | FAIL |
6 tests, 3 passed, 3 failed
==============================================================================

In this test, when scm1 was stopped, DNS resolution for scm1.org fell back to public DNS, and key creation timed out while clients kept retrying on the bad address:

Address change detected. Old: scm1.org/10.9.0.14 New: scm1.org/208.91.197.27

extra_hosts forces deterministic in-cluster resolution even when a node is down, so HA client retries stay on the intended private IPs. extra_hosts is also used in other docker-compose YAML files where we have HA and stop containers one by one (e.g. debug tools, decommissioning).

Cursor response while debugging: "Most likely root cause in your run: scm1.org resolves to a public IP while scm1 is intentionally down, and OM/SCM clients get stuck retrying that bad address. scm1 is the hostname that collides with real DNS (scm1.org), so when container DNS entry disappears during stop, resolver falls back to public DNS. Then Java caches/keeps retrying that bad endpoint long enough to hit your 5-minute test timeout."
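One way to spot this failure mode in service logs is to scan for "Address change detected" lines whose new address lies outside the compose network. The sketch below assumes the cluster uses 10.9.0.x addresses, as in the extra_hosts entries above; the helper names are hypothetical and not part of this PR.

```shell
# Succeed if the IPv4 address looks like one of the compose network's
# 10.9.0.x addresses (assumed range, inferred from the extra_hosts entries).
is_cluster_ip() {
  case "$1" in
    10.9.0.*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Read log text on stdin and print "host -> ip" for every address change
# whose new address is outside the cluster range.
find_external_address_changes() {
  local line new host ip
  while IFS= read -r line; do
    case "$line" in
      *"Address change detected."*"New: "*)
        new="${line##*New: }"   # e.g. scm1.org/208.91.197.27
        new="${new%% *}"        # drop anything after the address
        host="${new%%/*}"
        ip="${new##*/}"
        is_cluster_ip "$ip" || echo "$host -> $ip"
        ;;
    esac
  done
}
```

Piping a log file through `find_external_address_changes` would print one line per hostname that resolved to a non-cluster address, which is the symptom seen for scm1.org here.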


errose28 · 2026-03-20T17:07:36Z

hadoop-ozone/dist/src/main/compose/upgrade/upgrades/rolling-upgrade/driver.sh

+done
+
+# OMs with upgrade arg
+export OM_HA_ARGS='--upgrade'


The --upgrade flag takes the OM out of prepare mode, which won't happen during rolling upgrade. We can remove this flag.

dombizita · Mar 25, 2026

Great, removed this flag.

errose28 · 2026-03-20T17:12:45Z

hadoop-ozone/dist/src/main/compose/upgrade/upgrades/rolling-upgrade/driver.sh

This file needs execute permissions to run. These are tracked by git. After changing the permissions:

$ git diff
diff --git a/hadoop-ozone/dist/src/main/compose/upgrade/upgrades/rolling-upgrade/driver.sh b/hadoop-ozone/dist/src/main/compose/upgrade/upgrades/rolling-upgrade/driver.sh
old mode 100644
new mode 100755

Note that callback.sh does not need this since it is only sourced.

dombizita · Mar 25, 2026

Nice, thanks, I changed it.
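A small check along these lines could catch a missing execute bit before a test run. It is a sketch only: `find_non_executable` is a hypothetical helper, not something in this PR, and the commented commands show one common way to record the mode change in git.

```shell
# Print every given script that lacks the execute bit.
# Usage: find_non_executable FILE...
find_non_executable() {
  local f
  for f in "$@"; do
    [ -x "$f" ] || echo "$f"
  done
}

# Fixing a script that is run directly (path shortened for illustration):
#   chmod +x upgrades/rolling-upgrade/driver.sh
#   git add upgrades/rolling-upgrade/driver.sh    # stages the 100644 -> 100755 mode change
# Where the filesystem does not preserve modes, the index can be updated directly:
#   git update-index --chmod=+x upgrades/rolling-upgrade/driver.sh
```

A file that is only sourced (`. callback.sh`) is read by the calling shell rather than executed as its own process, so read permission is enough for it.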

dombizita added 2 commits March 7, 2026 00:03

HDDS-14751. Add basic ZDU flow in acceptance tests

68bd62d

Comment out rolling upgrade run_test

a3bec30

dombizita added the zdu label (Pull requests for Zero Downtime Upgrade (ZDU), https://issues.apache.org/jira/browse/HDDS-14496) on Mar 6, 2026

dombizita requested review from adoroszlai and errose28 March 9, 2026 07:17

errose28 reviewed Mar 20, 2026


Address review comments

f9b0822

dombizita wants to merge 3 commits into apache:HDDS-14496-zdu from dombizita:HDDS-14751-zdu
