HDDS-14751. Add basic ZDU flow in acceptance tests #9877

Open
dombizita wants to merge 3 commits into apache:HDDS-14496-zdu from dombizita:HDDS-14751-zdu

@dombizita
Contributor

What changes were proposed in this pull request?

This change adds a new rolling-upgrade suite to the upgrade acceptance tests. The initial version covers a simple upgrade scenario: it iterates over all services and restarts them one by one, stopping each container and starting it again with the new image. After each container stop the suite generates data, and once the service is back up it validates that data. I moved some general-purpose methods from the non-rolling-upgrade suite into testlib.sh so they can be reused. A small typo in the readme is also fixed.

In follow-up patches I'll add the downgrade and pre-finalization tests (I added TODO comments for them, since this is on a feature branch). The test run is commented out for now, until the necessary ZDU changes are merged.
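
The restart loop described above can be sketched roughly as follows. This is only an illustration of the order of steps: the function names (`stop_service`, `generate_data`, `start_service`, `validate_data`, `rolling_upgrade`) are hypothetical placeholders that just print what they would do, and they are not the helpers in testlib.sh or the callbacks in this patch.

```shell
#!/bin/sh
# Hypothetical sketch of the rolling-upgrade order of operations.
# Every function here is a placeholder that only prints the step it stands for.
set -eu

stop_service()  { echo "stop $1"; }
generate_data() { echo "generate data after stopping $1"; }
start_service() { echo "start $1 with image $2"; }
validate_data() { echo "validate data after restarting $1"; }

# Usage: rolling_upgrade <new-image> <service>...
rolling_upgrade() {
  new_image="$1"
  shift
  for service in "$@"; do
    stop_service "$service"
    generate_data "$service"              # written while the service is down
    start_service "$service" "$new_image"
    validate_data "$service"              # read back once the service is up
  done
}

rolling_upgrade 2.2.0 scm1 om1
```

Each service goes through the full stop, generate, start, validate cycle before the next one is touched, so at most one instance is down at a time.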

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14751

How was this patch tested?

Successful CI run with the rolling upgrade test enabled (not commented out): https://github.com/dombizita/ozone/actions/runs/22766179609

@dombizita dombizita added the zdu Pull requests for Zero Downtime Upgrade (ZDU) https://issues.apache.org/jira/browse/HDDS-14496 label Mar 6, 2026
@dombizita dombizita requested review from adoroszlai and errose28 March 9, 2026 07:17
Contributor

@errose28 errose28 left a comment

Thanks @dombizita for getting this started. Left some minor comments.

Comment on lines +25 to +31
extra_hosts:
- "om1:10.9.0.11"
- "om2:10.9.0.12"
- "om3:10.9.0.13"
- "scm1.org:10.9.0.14"
- "scm2.org:10.9.0.15"
- "scm3.org:10.9.0.16"
Contributor

Is this required? We already have the ipv4_address fields set for each service.

Contributor Author

Without this I saw test failures with key creation timeouts:

/Users/zitadombi/git_repos/ozone/hadoop-ozone/dist/target/ozone-2.2.0-SNAPSHOT/compose/upgrade fails like this:
"--- RESTARTING scm1 WITH IMAGE 2.2.0 ---
--- STOPPING scm1 ---
--- STOPPED scm1 ---
--- SCM BEFORE: scm3 ---
--- SCM AFTER: scm3 ---
--- CALLING before_service_restart with scm1 ---
Using Docker Compose v2
==============================================================================
2.2.0-2.2.0-2-scm1-generate-generate-scm1 :: Generate data                    
==============================================================================
Create a volume and bucket                                            | PASS |
------------------------------------------------------------------------------
Create key                                                            | FAIL |
Test timeout 5 minutes exceeded.
------------------------------------------------------------------------------
Create a bucket in s3v volume                                         | PASS |
------------------------------------------------------------------------------
Create key in the bucket in s3v volume                                | FAIL |
Test timeout 5 minutes exceeded.
------------------------------------------------------------------------------
Try to create a bucket using S3 API                                   | PASS |
------------------------------------------------------------------------------
Create key using S3 API                                               | FAIL |
Test timeout 5 minutes exceeded.
------------------------------------------------------------------------------
2.2.0-2.2.0-2-scm1-generate-generate-scm1 :: Generate data            | FAIL |
6 tests, 3 passed, 3 failed
=============================================================================="

In this test, once scm1 was stopped, DNS resolution for scm1.org fell back to public DNS, and key creation timed out while clients kept retrying the wrong address:

Address change detected. Old: scm1.org/10.9.0.14 New: scm1.org/208.91.197.27

extra_hosts forces deterministic in-cluster resolution even when a node is down, so HA client retries stay on the intended private IPs. extra_hosts is also used in other docker compose YAML files that run HA and stop containers one by one (e.g. debug tools, decommissioning).

Cursor response while debugging: "Most likely root cause in your run: scm1.org resolves to a public IP while scm1 is intentionally down, and OM/SCM clients get stuck retrying that bad address. scm1 is the hostname that collides with real DNS (scm1.org), so when container DNS entry disappears during stop, resolver falls back to public DNS. Then Java caches/keeps retrying that bad endpoint long enough to hit your 5-minute test timeout."
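
One way to confirm this kind of resolution problem is to check, from inside a container, whether a hostname still resolves to a cluster address while its node is stopped. The sketch below is illustrative only: `is_cluster_address` and `check_host` are hypothetical helpers, the `10.9.0.` prefix is inferred from the extra_hosts entries above, and it assumes `getent` and `awk` are available in the image.

```shell
# Sketch: check that a hostname resolves inside the compose network.
# Assumption: cluster addresses share the 10.9.0. prefix seen in extra_hosts.
is_cluster_address() {
  case "$1" in
    10.9.0.*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Assumption: getent and awk exist in the container image.
check_host() {
  addr="$(getent hosts "$1" | awk '{print $1; exit}')"
  if is_cluster_address "$addr"; then
    echo "$1 -> $addr (cluster address)"
  else
    echo "$1 -> ${addr:-unresolved} (NOT a cluster address)" >&2
    return 1
  fi
}
```

Running `check_host scm1.org` in an OM container while scm1 is stopped would be expected to fail without extra_hosts and succeed with it, if the diagnosis above is right.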

done

# OMs with upgrade arg
export OM_HA_ARGS='--upgrade'
Contributor

The --upgrade flag takes the OM out of prepare mode, which won't happen during rolling upgrade. We can remove this flag.

Contributor Author

Great, removed this flag.

Contributor

This file needs execute permissions to run. These are tracked by git. After changing the permissions:

$ git diff  
diff --git a/hadoop-ozone/dist/src/main/compose/upgrade/upgrades/rolling-upgrade/driver.sh b/hadoop-ozone/dist/src/main/compose/upgrade/upgrades/rolling-upgrade/driver.sh
old mode 100644
new mode 100755

Note that callback.sh does not need this since it is only sourced.
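
For reference, a minimal sketch of setting the bit so that git records it. `mark_executable` is a hypothetical helper name; `git update-index --chmod=+x` is used in addition to `chmod` so the mode is stored in the index even on checkouts where `core.fileMode` is disabled.

```shell
# Sketch: make a script executable and record the mode change in the git index.
# mark_executable is a hypothetical helper; run it from inside the repository.
mark_executable() {
  file="$1"
  chmod +x "$file"                      # executable bit in the working tree
  git add "$file"                       # make sure the file is in the index
  git update-index --chmod=+x "$file"   # force mode 100755 in the index
}

# Example: mark_executable hadoop-ozone/dist/src/main/compose/upgrade/upgrades/rolling-upgrade/driver.sh
```

Afterwards, `git ls-files -s <file>` should list the file with mode 100755.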

Contributor Author

Nice, thanks, I changed it
