
RHSTOR-7656 - Test generation of MDS alert for xattr latency#14045

Open
posharma1 wants to merge 11 commits into red-hat-storage:master from posharma1:extented_attributes

Conversation


@posharma1 posharma1 commented Jan 7, 2026

Happy path automation for RHSTOR-7656 - The test verifies generation of the CephXattrSetLatency alert during intensive extended attribute operations on the MDS, along with high CPU utilization in the cluster.

Signed-off-by: Poonam Sharma <posharma@redhat.com>
@posharma1 posharma1 requested review from a team as code owners January 7, 2026 05:52

openshift-ci bot commented Jan 7, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: posharma1

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pull-request-size pull-request-size bot added the size/L PR that changes 100-499 lines label Jan 7, 2026

# Example usage
files = create_files("testfile_", file_count, directory)
print(f"Creation of {len(files)} files in '{directory}' directory completed.")

Can we use logging instead of print?

"""
Create `count` empty files named <prefix>1, <prefix>2, ... in `directory`.
Returns a list of created file paths (strings).
"""

WCA suggested the following optimised code:

if count < 1:
    return []
dir_path = Path(directory)
dir_path.mkdir(parents=True, exist_ok=True)
created = [str(dir_path / f"{prefix}{i}") for i in range(1, count + 1)]
return created
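Folding the two review suggestions together (a module logger instead of print, plus the WCA-optimised body), one possible self-contained sketch of the helper follows. The signature is inferred from the example usage above, and note that the WCA snippet only builds the path list, so a `touch()` call is added here so the files actually exist on disk:

```python
import logging
from pathlib import Path

log = logging.getLogger(__name__)


def create_files(prefix, count, directory):
    """
    Create `count` empty files named <prefix>1, <prefix>2, ... in `directory`.
    Returns a list of created file paths (strings).
    """
    if count < 1:
        return []
    dir_path = Path(directory)
    dir_path.mkdir(parents=True, exist_ok=True)
    created = []
    for i in range(1, count + 1):
        file_path = dir_path / f"{prefix}{i}"
        file_path.touch()  # actually create the empty file on disk
        created.append(str(file_path))
    # Log via the module logger instead of print, per the review comment
    log.info(
        "Creation of %d files in '%s' directory completed.", len(created), directory
    )
    return created
```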

"Setting extended attributes and file creation IO started in the background."
" Script will look for CephXattrSetLatency alert"
)
assert MDSxattr_alert_values(threading_lock)

@posharma1
Since the MDSxattr_alert_values function is expected to return only True, do you think we should enforce this behavior using an assert statement?

Signed-off-by: Poonam Sharma <posharma@redhat.com>
@pull-request-size pull-request-size bot added size/XL and removed size/L PR that changes 100-499 lines labels Feb 2, 2026
Signed-off-by: Poonam Sharma <posharma@redhat.com>
[active_mds_node.name], constants.STATUS_READY, timeout=420
)

assert MDSxattr_alert_values(threading_lock, timeout=1200)


Hi Poonam,
I took Bob's help in reviewing this PR. Below are the comments.
We can take the upgrade test in a follow-up PR.

PR Review: RHSTOR-7656 - Test generation of MDS alert for xattr latency

Executive Summary

This PR implements automated tests for the new MDS (Metadata Server) xattr latency alert feature in ODF 4.21. Based on my analysis of the test cases spreadsheet, feature documentation, and the PR code visible in the browser, I've identified several critical issues and recommendations.

Test Coverage Analysis

✅ Tests Implemented (6 out of 7):

  1. test_mds_xattr_alert_triggered - Basic alert generation and validation
  2. test_alert_triggered_by_restarting_operator_and_metrics_pods - Pod restart scenarios
  3. test_alert_after_recovering_prometheus_from_failures - Prometheus failure recovery
  4. test_alert_after_active_mds_scaledown - Active MDS scale down/up
  5. test_alert_with_both_mds_scaledown - Both active and standby MDS scale down/up
  6. test_alert_with_mds_running_node_restart - Node restart scenario

❌ Missing or Incomplete Tests:

  1. Upgrade test (4.20 to 4.21) - NOT IMPLEMENTED
  2. Runbook validation test - PARTIALLY IMPLEMENTED (only URL check, no mitigation steps validation)

MUST IMPLEMENT

1. Missing Upgrade Test

File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py
Severity: CRITICAL
Issue: Test case "RHSTOR-7656 - verifies the mds alert generation post upgrade from version 4.20 to 4.21" is completely missing from the PR.
Expected: A test method that:

  • Upgrades cluster from 4.20 to 4.21
  • Sets up CPU load with fio pods
  • Updates xattr values
  • Validates alert generation post-upgrade
    Impact: This is a critical test case for ensuring backward compatibility and feature availability after upgrade. Without this, there's no validation that the alert works correctly in upgraded environments.

2. Incomplete Runbook Validation

File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py (Line ~160-163)
Severity: CRITICAL
Issue: The test only validates that the runbook URL is accessible but doesn't verify:

  • The runbook content is correct
  • Mitigation steps can be followed
  • Alert clears after executing runbook steps
    Current Code:
runbook = (
    "https://github.com/openshift/runbooks/blob/master/alerts/"
    "openshift-container-storage-operator/CephXattrSetLatency.md"
)

Expected: Test should:

  1. Access the runbook URL
  2. Parse and validate mitigation steps
  3. Execute mitigation steps (increase RAM/CPU for MDS or scale MDS servers)
  4. Verify alert is cleared after mitigation
  5. Verify latency returns to valid range (<30ms)

3. Missing Alert Clearance Validation

File: All test methods
Severity: CRITICAL
Issue: None of the tests validate that the alert clears after the condition is resolved. Tests only check alert generation but not alert resolution.
Expected: Each test should include:

# After resolving the condition
assert MDSxattr_alert_values(threading_lock, timeout=300) is False
log.info("Alert cleared successfully after condition resolved")

Impact: Without this, we cannot confirm the alert lifecycle is working correctly. Alerts that don't clear create alert fatigue and operational issues.

4. Hard-coded Timeout Values

Files: Multiple locations (Lines 230, 239, 277, 291, 310, 333, 345, 381, 393, 403)
Severity: HIGH
Issue: Timeout values are hard-coded (30, 60, 1200 seconds) throughout the code without explanation or configuration.
Current Code:

assert MDSxattr_alert_values(threading_lock, timeout=30) is False
assert MDSxattr_alert_values(threading_lock, timeout=1200)

Expected:

  • Define timeout constants at module level with clear names
  • Add comments explaining why specific timeouts are chosen
  • Consider making timeouts configurable via test parameters
ALERT_GENERATION_TIMEOUT = 1200  # 20 minutes for alert to trigger under load
ALERT_CLEARANCE_TIMEOUT = 300    # 5 minutes for alert to clear
INITIAL_CHECK_TIMEOUT = 30       # Initial check that alert is not present

5. Missing CPU Load Threshold Documentation

File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py (Line ~95-96)
Severity: HIGH
Issue: Test case spreadsheet mentions "CPU load and threshold details will be added after testing" but the PR doesn't document what values were determined.
Current Code:

log.info(
    "Setting up cephfs stress job for increasing CPU utilization in the cluster"
)

Expected: Clear documentation of:

  • What CPU utilization percentage triggers the alert
  • What xattr latency threshold (>30ms mentioned in feature doc)
  • How the fio parameters were determined
  • Rationale for m_factor="1,2,3,4", parallelism=5, completions=5

6. No Validation of Alert Message Content

File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py (Line ~155-160)
Severity: HIGH
Issue: Test validates alert exists but doesn't verify the alert message content matches the expected description.
Current Code:

message = (
    "There is a latency in setting the 'xattr' values for Ceph Metadata Servers."
)
description = (
    "This latency can be caused by different factors like high CPU usage or network"
    " related issues etc. Please see the runbook URL link to get further help on mitigating the issue."
)

Expected: Add assertions to verify these values match what Prometheus returns:

alert_data = prometheus.check_alert_list(label=MDSxattr_alert, ...)
assert alert_data['message'] == message
assert alert_data['description'] == description
assert alert_data['severity'] == "warning"

NICE TO IMPLEMENT

7. Missing Negative Test Cases

Severity: MEDIUM
Issue: No tests verify that the alert does NOT trigger under normal conditions or when latency is below threshold.
Recommendation: Add test cases for:

  • Normal operations without high CPU load
  • Xattr operations with latency <30ms
  • Alert should not trigger in these scenarios

8. No Concurrent Test Execution Consideration

File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py
Severity: MEDIUM
Issue: Tests create resources with fixed names that could conflict if run concurrently.
Current Code:

file = constants.EXTENDED_ATTRIBUTES
stress_mgr = CephFSStressTestManager(namespace=constants.DEFAULT_NAMESPACE)

Recommendation: Use unique identifiers or test-specific namespaces:

test_namespace = f"test-mds-xattr-{uuid.uuid4().hex[:8]}"
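As a sketch of that recommendation, a tiny helper can generate the collision-resistant name; the helper name and prefix here are illustrative, not from the PR:

```python
import uuid


def unique_test_namespace(prefix="test-mds-xattr"):
    # An 8-hex-char suffix keeps names short while making collisions
    # between concurrent test runs very unlikely.
    return f"{prefix}-{uuid.uuid4().hex[:8]}"
```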

9. Insufficient Logging for Debugging

Severity: MEDIUM
Issue: Limited logging of intermediate states makes debugging failures difficult.
Recommendation: Add more detailed logging:

  • Current xattr latency values
  • CPU utilization levels
  • Number of files being processed
  • MDS pod states and transitions
  • Alert state transitions with timestamps

10. Missing Performance Metrics Collection

Severity: MEDIUM
Issue: Tests don't collect performance metrics that could be valuable for analysis.
Recommendation: Collect and log:

  • Actual latency values from Prometheus queries
  • Time taken for alert to trigger
  • Time taken for alert to clear
  • CPU utilization during test execution

11. No Cleanup Verification

File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py (Line ~134-144)
Severity: MEDIUM
Issue: Finalizer deletes resources but doesn't verify cleanup was successful.
Current Code:

def finalizer():
    delete_deployment_pods(pod_obj)
    job_obj = OCP(kind="Job", namespace=constants.DEFAULT_NAMESPACE)
    job_obj.delete(resource_name="cephfs-stress-job")

Recommendation: Add verification:

def finalizer():
    delete_deployment_pods(pod_obj)
    job_obj = OCP(kind="Job", namespace=constants.DEFAULT_NAMESPACE)
    job_obj.delete(resource_name="cephfs-stress-job")
    # Verify cleanup
    assert not job_obj.get(resource_name="cephfs-stress-job", dont_raise=True)
    log.info("Cleanup verified successfully")

12. Missing Test for Multiple Concurrent Xattr Operations

Severity: MEDIUM
Issue: Tests use a single script execution. Real-world scenarios might have multiple concurrent xattr operations.
Recommendation: Add test variation with multiple pods executing xattr operations simultaneously.


COSMETIC

13. Inconsistent Naming Convention

File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py
Issue: Function names mix different naming styles.

  • set_xattr_with_high_cpu_usage (snake_case - correct)
  • MDSxattr_alert_values (mixed case - inconsistent)
  • initiate_alert_clearanace (typo: "clearanace" should be "clearance")

Recommendation: Standardize to snake_case:

def mds_xattr_alert_values(threading_lock, timeout):
def initiate_alert_clearance():

14. Magic Numbers in Code

File: Multiple locations
Issue: Numbers like 7000, 200, 5, 16 appear without explanation.
Examples:

base_file_count=7000,
size="200",
parallelism=5,
threads=16,

Recommendation: Define as named constants with comments explaining their purpose.
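For example (the numeric values are copied from the PR snippets above; the rationale comments are assumptions the author would need to confirm):

```python
# Workload sizing for the xattr stress job.
BASE_FILE_COUNT = 7000  # files per iteration; assumed to give enough xattr targets
FILE_SIZE = "200"       # per-file size argument passed to the IO helper
JOB_PARALLELISM = 5     # concurrent stress-job pods
IO_THREADS = 16         # worker threads per pod
```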

15. Inconsistent String Formatting

File: Throughout the test file
Issue: Mix of f-strings, .format(), and string concatenation.
Recommendation: Standardize on f-strings for consistency:

log.info(f"Scale up {deployment_name} to 1")
log.info(f"Initial ocs-metrics-exporter pod: {metrics_pod_name}")

16. Missing Docstring Details

File: All test methods
Issue: Docstrings are minimal and don't include:

  • Test prerequisites
  • Expected behavior
  • Cleanup actions
    Recommendation: Enhance docstrings:
def test_mds_xattr_alert_triggered(self, set_xattr_with_high_cpu_usage, threading_lock):
    """
    Test MDS xattr latency alert generation.
    
    Prerequisites:
        - ODF 4.21+ cluster
        - CephFS filesystem available
        - Sufficient cluster resources for CPU stress
    
    Test Steps:
        1. Create high CPU load using fio pods
        2. Execute xattr operations on MDS
        3. Wait for alert generation (max 20 minutes)
        4. Validate alert properties
    
    Expected Result:
        - Alert 'CephXattrSetLatency' triggers
        - Alert has correct severity, message, and runbook URL
    
    Cleanup:
        - Deletes fio pods and jobs
        - Clears alert condition
    """

17. Commented Code

File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py (Line ~254)
Issue: Commented code should be removed:

# ]
# metrics_pods.delete(force=True)

Recommendation: Remove commented code or add explanation if it's intentionally kept for reference.

18. Long Lines

File: Multiple locations
Issue: Several lines exceed 100 characters, reducing readability.
Example (Line ~191):

params='{"spec": {"resources": {"mds": {"limits": {"cpu": "16"}, "requests": {"cpu": "16"}}}}}',

Recommendation: Break into multiple lines:

params=(
    '{"spec": {"resources": {"mds": '
    '{"limits": {"cpu": "16"}, "requests": {"cpu": "16"}}}}}' 
)

Summary Statistics

  • Total Test Cases Expected: 7
  • Test Cases Implemented: 6
  • Test Cases Missing or Incomplete: 2
  • Must Implement Issues: 6
  • Nice to Implement Issues: 6
  • Cosmetic Issues: 6
  • Total Issues: 18

Recommendation

DO NOT MERGE until:

  1. Complete runbook validation with mitigation steps is added
  2. Alert clearance validation is added to all tests
  3. Timeout values are properly documented and configured
  4. CPU load thresholds are documented

The PR shows good test structure and covers most scenarios, but the missing upgrade test and incomplete runbook validation are critical gaps that must be addressed before merge.

FILE_CREATOR_IO = os.path.join(
    TEMPLATE_WORKLOAD_DIR, "helper_scripts/file_creator_io.py"
)
EXTENTDED_ATTRIBUTES = os.path.join(

typo: “Extentded” → “Extended”

return False


def initiate_alert_clearanace():

typo: initiate_alert_clearanace → initiate_alert_clearance


@magenta_squad
@tier2
class TestMdsXattrAlerts(E2ETest):

Add @pytest.mark.polarion_id(...) to each new test

initiate_alert_clearanace()
# waiting for sometime for load distribution
time.sleep(600)
assert MDSxattr_alert_values(threading_lock, timeout=30) is False

MDSxattr_alert_values returns False when the alert is absent (e.g. after a timeout and a failed check_alert_list). That works but is harder to read; I would suggest using PrometheusAPI.check_alert_cleared or api.wait_for_alert, so that we can make sure the alerts have been cleared or not.
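If relying on ocs-ci helpers is undesirable, an explicit polling helper also reads clearly; `alert_is_firing` here is a hypothetical callable wrapping the Prometheus query, not part of the PR:

```python
import time


def wait_for_alert_cleared(alert_is_firing, timeout=300, interval=15):
    """Poll until the alert stops firing; raise TimeoutError if it never clears."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not alert_is_firing():
            return  # alert has cleared
        time.sleep(interval)
    raise TimeoutError(f"Alert still firing after {timeout}s")
```

Compared to asserting `MDSxattr_alert_values(...) is False`, the intent (wait for clearance, fail loudly on timeout) is explicit.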

)


@magenta_squad

Since it is related to alerts/monitoring, I believe this should fall under blue_squad.

Comment on lines +252 to +255
# metrics_pods = OCP_POD_OBJ.get(selector="app.kubernetes.io/name=ocs-metrics-exporter")[
# "items"
# ]
# metrics_pods.delete(force=True)

Missed removing this commented code?

Comment on lines +347 to +365
active_mds = cluster.get_active_mds_info()["mds_daemon"]
standby_mds = cluster.get_mds_standby_replay_info()["mds_daemon"]
active_mds_d = "rook-ceph-mds-" + active_mds
standby_mds_d = "rook-ceph-mds-" + standby_mds
active_mds_pod = cluster.get_active_mds_info()["active_pod"]
standby_mds_pod = cluster.get_mds_standby_replay_info()["standby_replay_pod"]
mds_dc_pods = [active_mds_d, standby_mds_d]

log.info(f"Scale down {active_mds_d} to 0")
helpers.modify_deployment_replica_count(
    deployment_name=active_mds_d, replica_count=0
)
OCP_POD_OBJ.wait_for_delete(resource_name=active_mds_pod)

log.info(f"Scale down {standby_mds_d} to 0")
helpers.modify_deployment_replica_count(
    deployment_name=standby_mds_d, replica_count=0
)
OCP_POD_OBJ.wait_for_delete(resource_name=standby_mds_pod)

Both the MDS pods are being deleted, so you can use

def get_deployments_having_label(label, namespace):

sa_name = helpers.create_serviceaccount(pvc_obj.project.namespace)

helpers.add_scc_policy(sa_name=sa_name.name, namespace=pvc_obj.project.namespace)
pod_obj = helpers.create_pod(

You can use deployment_pod_factory which will also avoid handling pod deletion in the teardown.
pod_obj = deployment_pod_factory(
    interface=constants.CEPHFILESYSTEM,
    pvc=pvc_obj,
    service_account=sa_name,
    node_name=active_mds_node_name,
)

@PrasadDesala PrasadDesala added the team/e2e E2E team related issues/PRs label Feb 17, 2026
Signed-off-by: Poonam Sharma <posharma@redhat.com>
Signed-off-by: Poonam Sharma <posharma@redhat.com>
Signed-off-by: Poonam Sharma <posharma@redhat.com>
Signed-off-by: Poonam Sharma <posharma@redhat.com>
Signed-off-by: Poonam Sharma <posharma@redhat.com>

@ocs-ci ocs-ci left a comment


PR validation on existing cluster

Cluster Name: posharma-ibm18
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py
Additional Test Params:
OCP VERSION: 4.21
OCS VERSION: 4.21
tested against branch: master

Job UNSTABLE (some or all tests failed).

Signed-off-by: Poonam Sharma <posharma@redhat.com>

@ocs-ci ocs-ci left a comment


PR validation on existing cluster

Cluster Name: posharma-ibm18
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py::TestMdsXattrAlerts::test_mds_xattr_alert_triggered tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py::TestMdsXattrAlerts::test_alert_with_mds_running_node_restart
Additional Test Params:
OCP VERSION: 4.21
OCS VERSION: 4.21
tested against branch: master

Job PASSED.


Labels

size/XL Squad/Magenta team/e2e E2E team related issues/PRs
