2 changes: 1 addition & 1 deletion hadoop-ozone/dist/src/main/compose/upgrade/README.md
@@ -91,7 +91,7 @@ Docker compose cluster definitions to be used in upgrade testing are defined in

- Tests that should run for all upgrades, regardless of the version being tested, can be added to *compose/upgrade/\<upgrade-type>/common/callback.sh*.

- Tests that should run only for an upgrade to a specific version can be added to *compose/upgrade/\<upgrade-type>/\<ending-upgrade-version>/callback.sh*.
- Tests that should run only for an upgrade to a specific version can be added to *compose/upgrade/upgrades/\<upgrade-type>/\<ending-upgrade-version>/callback.sh*.

- Add commands in the callback function when they should be run. Each callback file will have access to the following environment variables:
- `OZONE_UPGRADE_FROM`: The version of ozone being upgraded from.
@@ -22,6 +22,13 @@ x-common-config:
- ../../../common/security.conf
image: ${OZONE_TEST_IMAGE}
dns_search: .
extra_hosts:
- "om1:10.9.0.11"
- "om2:10.9.0.12"
- "om3:10.9.0.13"
- "scm1.org:10.9.0.14"
- "scm2.org:10.9.0.15"
- "scm3.org:10.9.0.16"
Comment on lines +25 to +31
Contributor

Is this required? We already have the ipv4_address fields set for each service.

Contributor Author

Without this I saw test failures with key creation timeouts:

Running the test from /Users/zitadombi/git_repos/ozone/hadoop-ozone/dist/target/ozone-2.2.0-SNAPSHOT/compose/upgrade fails like this: "--- RESTARTING scm1 WITH IMAGE 2.2.0 ---
--- STOPPING scm1 ---
--- STOPPED scm1 ---
--- SCM BEFORE: scm3 ---
--- SCM AFTER: scm3 ---
--- CALLING before_service_restart with scm1 ---
Using Docker Compose v2
==============================================================================
2.2.0-2.2.0-2-scm1-generate-generate-scm1 :: Generate data                    
==============================================================================
Create a volume and bucket                                            | PASS |
------------------------------------------------------------------------------
Create key                                                            | FAIL |
Test timeout 5 minutes exceeded.
------------------------------------------------------------------------------
Create a bucket in s3v volume                                         | PASS |
------------------------------------------------------------------------------
Create key in the bucket in s3v volume                                | FAIL |
Test timeout 5 minutes exceeded.
------------------------------------------------------------------------------
Try to create a bucket using S3 API                                   | PASS |
------------------------------------------------------------------------------
Create key using S3 API                                               | FAIL |
Test timeout 5 minutes exceeded.
------------------------------------------------------------------------------
2.2.0-2.2.0-2-scm1-generate-generate-scm1 :: Generate data            | FAIL |
6 tests, 3 passed, 3 failed
=============================================================================="

In this test, when scm1 was stopped, DNS resolution for scm1.org fell back to public DNS, and the clients kept retrying the wrong address until key creation timed out:

Address change detected. Old: scm1.org/10.9.0.14 New: scm1.org/208.91.197.27

extra_hosts forces deterministic in-cluster resolution even when a node is down, so HA client retries stay on the intended private IPs. The same extra_hosts setting is already used in other Docker Compose YAML files where we run HA and stop containers one by one (e.g. debug tools, decommissioning).

Cursor response while debugging: "Most likely root cause in your run: scm1.org resolves to a public IP while scm1 is intentionally down, and OM/SCM clients get stuck retrying that bad address. scm1 is the hostname that collides with real DNS (scm1.org), so when container DNS entry disappears during stop, resolver falls back to public DNS. Then Java caches/keeps retrying that bad endpoint long enough to hit your 5-minute test timeout."
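
To make the diagnosis above easier to repeat, here is a minimal sketch that pulls the new address out of an "Address change detected" log line and checks whether it is still inside the compose network. Both function names are hypothetical and not part of this PR, and the 10.9.0.0/24 range is an assumption based on the addresses listed in extra_hosts.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the IP that follows "New: host/" in an
# "Address change detected" log line.
new_address() {
  sed -n 's|.*New: [^/]*/\([0-9.]*\).*|\1|p' <<<"$1"
}

# Hypothetical helper: succeed only for addresses in 10.9.0.0/24
# (assumed to be the compose network used by the HA cluster).
is_cluster_ip() {
  [[ "$1" =~ ^10\.9\.0\.([0-9]{1,3})$ ]] && (( 10#${BASH_REMATCH[1]} <= 255 ))
}

line='Address change detected. Old: scm1.org/10.9.0.14 New: scm1.org/208.91.197.27'
ip="$(new_address "$line")"

if is_cluster_ip "$ip"; then
  echo "scm1.org still resolves inside the cluster: $ip"
else
  echo "scm1.org resolved outside the cluster: $ip"
fi
```

If `getent` is available in the image, running `getent hosts scm1.org` inside a client container while scm1 is stopped should show the same thing directly, with and without the extra_hosts entries.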


x-environment:
&environment
3 changes: 3 additions & 0 deletions hadoop-ozone/dist/src/main/compose/upgrade/test.sh
@@ -42,6 +42,9 @@ run_test ha non-rolling-upgrade 2.1.0 "$OZONE_CURRENT_VERSION"
# run_test ha non-rolling-upgrade 1.2.1 "$OZONE_CURRENT_VERSION"
# run_test om-ha non-rolling-upgrade 1.1.0 "$OZONE_CURRENT_VERSION"

# Rolling upgrade test, commented out for now
# run_test ha rolling-upgrade "$OZONE_CURRENT_VERSION" "$OZONE_CURRENT_VERSION"

generate_report "upgrade" "$ALL_RESULT_DIR"

exit "$RESULT"
18 changes: 18 additions & 0 deletions hadoop-ozone/dist/src/main/compose/upgrade/testlib.sh
@@ -112,3 +112,21 @@ run_test() {

copy_results "$execution_dir" "$ALL_RESULT_DIR"
}

### CALLBACK HELPER METHODS ###

## @description Generates data on the cluster.
## @param The prefix to use for data generated.
## @param All parameters after the first one are passed directly to the robot command,
## see https://robotframework.org/robotframework/latest/RobotFrameworkUserGuide.html#all-command-line-options
generate() {
execute_robot_test "$SCM" -N "${OUTPUT_NAME}-generate-${1}" -v PREFIX:"$1" ${@:2} upgrade/generate.robot
}

## @description Validates that data exists on the cluster.
## @param The prefix of the data to be validated.
## @param All parameters after the first one are passed directly to the robot command,
## see https://robotframework.org/robotframework/latest/RobotFrameworkUserGuide.html#all-command-line-options
validate() {
execute_robot_test "$SCM" -N "${OUTPUT_NAME}-validate-${1}" -v PREFIX:"$1" ${@:2} upgrade/validate.robot
}
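
The way these helpers forward extra arguments to robot can be sketched with a stand-in for `execute_robot_test`. The stub below only prints what it receives; the real function lives in the shared test library and is not reproduced here. The `SCM` and `OUTPUT_NAME` values are made-up examples. One deliberate difference from the helper in the diff: the sketch quotes `"${@:2}"`, so an argument containing spaces reaches robot as a single word.

```shell
#!/usr/bin/env bash

# Stand-in for the real execute_robot_test (assumption for illustration):
# print every argument on its own line so the forwarding is visible.
execute_robot_test() {
  printf '%s\n' "$@"
}

# Example values only, not taken from a real run.
SCM=scm1
OUTPUT_NAME=2.1.0-2.2.0-1-original

# Same shape as the helper in testlib.sh, but with "${@:2}" quoted.
generate() {
  execute_robot_test "$SCM" -N "${OUTPUT_NAME}-generate-${1}" -v PREFIX:"$1" "${@:2}" upgrade/generate.robot
}

# The first parameter is the data prefix; everything after it goes to robot.
generate old -v "MSG:two words"
```

With the unquoted form, `MSG:two words` would be split into two arguments before robot sees it.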
@@ -17,24 +17,6 @@

source "$TEST_DIR"/testlib.sh

### HELPER METHODS ###

## @description Generates data on the cluster.
## @param The prefix to use for data generated.
## @param All parameters after the first one are passed directly to the robot command,
## see https://robotframework.org/robotframework/latest/RobotFrameworkUserGuide.html#all-command-line-options
generate() {
execute_robot_test "$SCM" -N "${OUTPUT_NAME}-generate-${1}" -v PREFIX:"$1" ${@:2} upgrade/generate.robot
}

## @description Validates that data exists on the cluster.
## @param The prefix of the data to be validated.
## @param All parameters after the first one are passed directly to the robot command,
## see https://robotframework.org/robotframework/latest/RobotFrameworkUserGuide.html#all-command-line-options
validate() {
execute_robot_test "$SCM" -N "${OUTPUT_NAME}-validate-${1}" -v PREFIX:"$1" ${@:2} upgrade/validate.robot
}

### CALLBACKS ###

with_old_version() {
@@ -0,0 +1,28 @@
#!/usr/bin/env bash
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

source "$TEST_DIR"/testlib.sh

### CALLBACKS ###

before_service_restart() {
generate "generate-${SERVICE}"
}

after_service_restart() {
validate "generate-${SERVICE}"
}
Contributor

This file needs execute permission to run. Git tracks the executable bit, so after changing the permissions the diff shows:

$ git diff  
diff --git a/hadoop-ozone/dist/src/main/compose/upgrade/upgrades/rolling-upgrade/driver.sh b/hadoop-ozone/dist/src/main/compose/upgrade/upgrades/rolling-upgrade/driver.sh
old mode 100644
new mode 100755

Note that callback.sh does not need this since it is only sourced.
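
A quick way to spot scripts that are missing the bit, as a rough sketch (the function name is hypothetical, and the check only looks at the owner's execute bit):

```shell
#!/usr/bin/env bash

# Hypothetical helper: list *.sh files under a directory whose owner
# execute bit is not set.
non_executable_scripts() {
  find "$1" -type f -name '*.sh' ! -perm -100
}
```

Called as `non_executable_scripts hadoop-ozone/dist/src/main/compose/upgrade`, this would also list files such as callback.sh that are only sourced, so the output is a prompt for review rather than a list of errors. The mode Git has recorded can be inspected with `git ls-files -s <path>` and set with `git update-index --chmod=+x <path>`.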

Contributor Author

Nice, thanks, I changed it.

@@ -0,0 +1,132 @@
#!/usr/bin/env bash
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This script tests upgrade from a previous release to the current
# binaries. Docker image with Ozone binaries is required for the
# initial version, while the snapshot version uses Ozone runner image.

set -e -o pipefail

# Fail if required vars are not set.
set -u
: "${OZONE_UPGRADE_FROM}"
: "${OZONE_UPGRADE_TO}"
: "${TEST_DIR}"
: "${SCM}"
: "${OZONE_CURRENT_VERSION}"
set +u

echo "--- RUNNING ROLLING UPGRADE TEST FROM $OZONE_UPGRADE_FROM TO $OZONE_UPGRADE_TO ---"

source "$TEST_DIR"/testlib.sh

# Restart one service with the target image.
rolling_restart_service() {
SERVICE="$1"

echo "--- RESTARTING ${SERVICE} WITH IMAGE ${OZONE_UPGRADE_TO} ---"

# Stop service
stop_containers "${SERVICE}"

# During a rolling upgrade the services are stopped and started one by one, and we want to run
# write/read tests while one service is unavailable. If the SCM container that runs the generate
# and validate robot tests is the one that was just stopped, fall back to another SCM.
if [[ "$(docker inspect -f '{{.State.Running}}' "ha-${SCM}-1" 2>/dev/null)" != "true" ]]; then
local fallback_scm
fallback_scm="$(docker-compose --project-directory="$TEST_DIR/compose/ha" config --services | grep scm | grep -v "^${SCM}$" | head -n1)"
if [[ -n "$fallback_scm" ]]; then
export SCM="$fallback_scm"
fi
fi

# Data generation/validation runs S3 API tests, so skip it while the S3 gateway itself is being restarted.
# TODO find a better solution
if [[ ${SERVICE} != "s3g" ]]; then
callback before_service_restart
fi

# Restart service with new image.
prepare_for_image "${OZONE_UPGRADE_TO}"
create_containers "${SERVICE}"

# Data generation/validation runs S3 API tests, so skip it while the S3 gateway itself is being restarted.
if [[ ${SERVICE} != "s3g" ]]; then
callback after_service_restart
fi

# Service-specific readiness checks.
case "${SERVICE}" in
om*)
wait_for_port "${SERVICE}" 9862 120
;;
scm*)
# SCM hostnames in this compose are scmX.org
wait_for_port "${SERVICE}.org" 9876 120
;;
dn*)
wait_for_port "${SERVICE}" 9882 120
;;
esac
}

echo "--- SETTING UP OLD VERSION $OZONE_UPGRADE_FROM ---"
OUTPUT_NAME="${OZONE_UPGRADE_FROM}-${OZONE_UPGRADE_TO}-1-original"
export OM_HA_ARGS='--'
prepare_for_image "$OZONE_UPGRADE_FROM"

echo "--- RUNNING WITH OLD VERSION $OZONE_UPGRADE_FROM ---"
start_docker_env

# TODO Add old data generation

echo "--- ROLLING UPGRADE TO $OZONE_UPGRADE_TO PRE-FINALIZED ---"

# SCMs first
for s in scm2 scm1 scm3; do
OUTPUT_NAME="${OZONE_UPGRADE_FROM}-${OZONE_UPGRADE_TO}-2-${s}"
rolling_restart_service "$s" "$OZONE_UPGRADE_TO"
done

# Recon
OUTPUT_NAME="${OZONE_UPGRADE_FROM}-${OZONE_UPGRADE_TO}-2-recon"
rolling_restart_service "recon" "$OZONE_UPGRADE_TO"

# DNs
for s in dn1 dn2 dn3 dn4 dn5; do
OUTPUT_NAME="${OZONE_UPGRADE_FROM}-${OZONE_UPGRADE_TO}-2-${s}"
rolling_restart_service "$s" "$OZONE_UPGRADE_TO"
done

for s in om1 om2 om3; do
OUTPUT_NAME="${OZONE_UPGRADE_FROM}-${OZONE_UPGRADE_TO}-2-${s}"
rolling_restart_service "$s" "$OZONE_UPGRADE_TO"
done

# S3 Gateway
OUTPUT_NAME="${OZONE_UPGRADE_FROM}-${OZONE_UPGRADE_TO}-2-s3g"
rolling_restart_service "s3g" "$OZONE_UPGRADE_TO"

# TODO Add downgrade scenario

echo "--- RUNNING WITH NEW VERSION $OZONE_UPGRADE_TO FINALIZED ---"
OUTPUT_NAME="${OZONE_UPGRADE_FROM}-${OZONE_UPGRADE_TO}-3-finalized"

# TODO Add validation for pre-finalized state

# Sends commands to finalize OM and SCM.
execute_robot_test "$SCM" -N "${OUTPUT_NAME}-finalize" upgrade/finalize.robot
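
One observation on the fallback SCM selection in `rolling_restart_service`: the script runs with `set -e -o pipefail`, so if the `grep` pipeline finds no other SCM, the assignment returns non-zero and the script may abort instead of leaving `fallback_scm` empty. A possible alternative, sketched below with a hypothetical helper name, does the selection in pure bash and always returns success. It matches names that start with `scm`, which is slightly stricter than `grep scm`.

```shell
#!/usr/bin/env bash
set -e -o pipefail

# Hypothetical helper: print the first service whose name starts with "scm"
# and differs from the stopped one. Prints nothing, and still succeeds,
# when there is no alternative.
pick_fallback_scm() {
  local stopped="$1" svc
  shift
  for svc in "$@"; do
    if [[ "$svc" == scm* && "$svc" != "$stopped" ]]; then
      printf '%s\n' "$svc"
      return 0
    fi
  done
  return 0
}

# Example service list; in the driver it would come from the compose config.
pick_fallback_scm scm1 om1 scm1 scm2 scm3   # prints scm2
```

In the driver the service list could be passed in from the existing `config --services` call, keeping the `if [[ -n "$fallback_scm" ]]` check as it is.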