HDDS-14751. Add basic ZDU flow in acceptance tests #9877
dombizita wants to merge 3 commits into apache:HDDS-14496-zdu from
errose28
left a comment
Thanks @dombizita for getting this started. Left some minor comments.
extra_hosts:
  - "om1:10.9.0.11"
  - "om2:10.9.0.12"
  - "om3:10.9.0.13"
  - "scm1.org:10.9.0.14"
  - "scm2.org:10.9.0.15"
  - "scm3.org:10.9.0.16"
Is this required? We already have the ipv4_address fields set for each service.
Without this I saw test failures with key creation timeouts:
/Users/zitadombi/git_repos/ozone/hadoop-ozone/dist/target/ozone-2.2.0-SNAPSHOT/compose/upgrade fails like this:
"--- RESTARTING scm1 WITH IMAGE 2.2.0 ---
--- STOPPING scm1 ---
--- STOPPED scm1 ---
--- SCM BEFORE: scm3 ---
--- SCM AFTER: scm3 ---
--- CALLING before_service_restart with scm1 ---
Using Docker Compose v2
==============================================================================
2.2.0-2.2.0-2-scm1-generate-generate-scm1 :: Generate data
==============================================================================
Create a volume and bucket | PASS |
------------------------------------------------------------------------------
Create key | FAIL |
Test timeout 5 minutes exceeded.
------------------------------------------------------------------------------
Create a bucket in s3v volume | PASS |
------------------------------------------------------------------------------
Create key in the bucket in s3v volume | FAIL |
Test timeout 5 minutes exceeded.
------------------------------------------------------------------------------
Try to create a bucket using S3 API | PASS |
------------------------------------------------------------------------------
Create key using S3 API | FAIL |
Test timeout 5 minutes exceeded.
------------------------------------------------------------------------------
2.2.0-2.2.0-2-scm1-generate-generate-scm1 :: Generate data | FAIL |
6 tests, 3 passed, 3 failed
=============================================================================="
In this test, when scm1 was stopped, DNS resolution for scm1.org fell back to public DNS, and key creation timed out while the client kept retrying the bad address:
Address change detected. Old: scm1.org/10.9.0.14 New: scm1.org/208.91.197.27
extra_hosts forces deterministic in-cluster resolution even when a node is down, so HA client retries stay on the intended private IPs. The same extra_hosts setting is already used in other Docker Compose YAML files where we run HA and stop containers one by one (e.g. debug tools, decommissioning).
Cursor response while debugging: "Most likely root cause in your run: scm1.org resolves to a public IP while scm1 is intentionally down, and OM/SCM clients get stuck retrying that bad address. scm1 is the hostname that collides with real DNS (scm1.org), so when container DNS entry disappears during stop, resolver falls back to public DNS. Then Java caches/keeps retrying that bad endpoint long enough to hit your 5-minute test timeout."
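To confirm which address a client would pick up, a check along these lines could be run inside a container. This is only a sketch: is_cluster_ip and resolve_host are hypothetical helper names, the 10.9.0.x prefix is assumed from the addresses in the compose file, and getent may not be available in every image.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only for addresses in the assumed 10.9.0.x cluster range.
is_cluster_ip() {
  case "$1" in
    10.9.0.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: prints the first address the local resolver returns for a host.
# Assumes getent is installed in the container.
resolve_host() {
  getent hosts "$1" | awk '{ print $1; exit }'
}

# Classify the two addresses from the "Address change detected" log line.
if is_cluster_ip "10.9.0.14"; then old_status="in-cluster"; else old_status="external"; fi
if is_cluster_ip "208.91.197.27"; then new_status="in-cluster"; else new_status="external"; fi
echo "old: $old_status, new: $new_status"
```

With extra_hosts in place, is_cluster_ip "$(resolve_host scm1.org)" would be expected to keep succeeding even while scm1 is stopped.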
hadoop-ozone/dist/src/main/compose/upgrade/upgrades/rolling-upgrade/driver.sh
done

# OMs with upgrade arg
export OM_HA_ARGS='--upgrade'
The --upgrade flag takes the OM out of prepare mode, but the OM is not put into prepare mode during a rolling upgrade. We can remove this flag.
Great, removed this flag.
This file needs execute permissions to run. These are tracked by git. After changing the permissions:
$ git diff
diff --git a/hadoop-ozone/dist/src/main/compose/upgrade/upgrades/rolling-upgrade/driver.sh b/hadoop-ozone/dist/src/main/compose/upgrade/upgrades/rolling-upgrade/driver.sh
old mode 100644
new mode 100755
Note that callback.sh does not need this since it is only sourced.
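The difference between sourcing and executing can be seen with a small standalone sketch. The file names lib.sh and run.sh are hypothetical stand-ins for callback.sh and driver.sh, and the sketch assumes the temporary directory is not mounted noexec.

```shell
#!/bin/sh
# Create two throwaway scripts without the execute bit (mode 644).
workdir="$(mktemp -d)"
printf 'greet() { echo "sourced ok"; }\n' > "$workdir/lib.sh"
printf '#!/bin/sh\necho "executed ok"\n' > "$workdir/run.sh"
chmod 644 "$workdir/lib.sh" "$workdir/run.sh"

# Sourcing only needs read permission, so this works at mode 644.
. "$workdir/lib.sh"
sourced_result="$(greet)"

# Running the script directly fails until the execute bit is set.
if "$workdir/run.sh" >/dev/null 2>&1; then before="ran"; else before="denied"; fi

chmod 755 "$workdir/run.sh"
after="$("$workdir/run.sh")"

echo "source: $sourced_result, before chmod: $before, after chmod: $after"
rm -rf "$workdir"
```

Since git tracks the execute bit, the mode change shows up in git diff as in the output above and has to be committed like any other change.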
Nice, thanks, I changed it
What changes were proposed in this pull request?
In this change I added a new rolling-upgrade suite to the upgrade acceptance tests. This initial version covers a simple upgrade scenario: it goes over all the services one by one, stopping each and starting it again with the new image. After each Docker container is stopped, it generates data, and once the service is restarted, it runs the validation. I moved some general methods from the non-rolling-upgrade suite into testlib.sh so they can be reused. A small typo in the readme is also fixed.
In follow-up patches I'll add the downgrade and pre-finalization tests (I added TODO comments for them, since this is on a feature branch). The test run is commented out for now, until the necessary ZDU changes are merged.
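Roughly, the per-service flow described above could be sketched as follows. This is not the code in driver.sh or testlib.sh: only before_service_restart is a name that appears in the test log, while rolling_restart, stop_service, start_service_with_image and after_service_restart are placeholder names, and the stubs merely record the order of steps instead of touching Docker.

```shell
#!/bin/sh
# Space-separated record of the steps, in the order they ran.
events=""
record() { events="${events:+$events }$1"; }

# Stub steps. A real suite would stop/start containers and run the tests here.
stop_service()             { record "stop:$1"; }       # placeholder name
before_service_restart()   { record "generate:$1"; }   # callback name seen in the test log
start_service_with_image() { record "start:$1:$2"; }   # placeholder name
after_service_restart()    { record "validate:$1"; }   # placeholder name

# Restart every given service with the new image, one at a time.
rolling_restart() {
  image="$1"
  shift
  for service in "$@"; do
    stop_service "$service"
    before_service_restart "$service"     # data generation while the service is down
    start_service_with_image "$service" "$image"
    after_service_restart "$service"      # validation once the service is back
  done
}

rolling_restart 2.2.0 scm1 scm2
echo "$events"
```

Keeping the generate and validate steps in callbacks means the same loop could later be reused for the downgrade and pre-finalization scenarios with different callbacks.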
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14751
How was this patch tested?
Successful CI run with the rolling upgrade test not commented out: https://github.com/dombizita/ozone/actions/runs/22766179609