
RHSTOR-7656 - Test generation of MDS alert for xattr latency#14045

Open
posharma1 wants to merge 11 commits into red-hat-storage:master from posharma1:extented_attributes

Conversation


@posharma1 posharma1 commented Jan 7, 2026

Happy path automation for RHSTOR-7656 - The test verifies generation of the CephXattrSetLatency alert during intensive extended attribute operations on the MDS, along with high CPU utilization in the cluster.

Signed-off-by: Poonam Sharma <posharma@redhat.com>
@posharma1 posharma1 requested review from a team as code owners January 7, 2026 05:52

openshift-ci bot commented Jan 7, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: posharma1

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pull-request-size pull-request-size bot added the size/L PR that changes 100-499 lines label Jan 7, 2026

# Example usage
files = create_files("testfile_", file_count, directory)
print(f"Creation of {len(files)} files in '{directory}' directory completed.")

Can we use logging instead of print?

"""
Create `count` empty files named <prefix>1, <prefix>2, ... in `directory`.
Returns a list of created file paths (strings).
"""

WCA suggested the following optimised code:

if count < 1:
    return []
dir_path = Path(directory)
dir_path.mkdir(parents=True, exist_ok=True)
created = [str(dir_path / f"{prefix}{i}") for i in range(1, count + 1)]
return created
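Folding the two review suggestions together (a module logger instead of print, plus the WCA-optimised body), one possible self-contained sketch of the helper follows. The signature is inferred from the example usage above, and note that the WCA snippet only builds the path list, so a `touch()` call is added here so the files actually exist on disk:

```python
import logging
from pathlib import Path

log = logging.getLogger(__name__)


def create_files(prefix, count, directory):
    """
    Create `count` empty files named <prefix>1, <prefix>2, ... in `directory`.
    Returns a list of created file paths (strings).
    """
    if count < 1:
        return []
    dir_path = Path(directory)
    dir_path.mkdir(parents=True, exist_ok=True)
    created = []
    for i in range(1, count + 1):
        file_path = dir_path / f"{prefix}{i}"
        file_path.touch()  # actually create the empty file on disk
        created.append(str(file_path))
    # Log via the module logger instead of print, per the review comment
    log.info(
        "Creation of %d files in '%s' directory completed.", len(created), directory
    )
    return created
```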

"Setting extended attributes and file creation IO started in the background."
" Script will look for CephXattrSetLatency alert"
)
assert MDSxattr_alert_values(threading_lock)

@posharma1
Since the MDSxattr_alert_values function is expected to return only True, do you think we should enforce this behavior using an assert statement?

Signed-off-by: Poonam Sharma <posharma@redhat.com>
@pull-request-size pull-request-size bot added size/XL and removed size/L PR that changes 100-499 lines labels Feb 2, 2026
Signed-off-by: Poonam Sharma <posharma@redhat.com>
[active_mds_node.name], constants.STATUS_READY, timeout=420
)

assert MDSxattr_alert_values(threading_lock, timeout=1200)


Hi Poonam,
I took Bob's help in reviewing this PR. Below are the comments.
We can take the upgrade test in a follow-up PR.

PR Review: RHSTOR-7656 - Test generation of MDS alert for xattr latency

Executive Summary

This PR implements automated tests for the new MDS (Metadata Server) xattr latency alert feature in ODF 4.21. Based on my analysis of the test cases spreadsheet, feature documentation, and the PR code visible in the browser, I've identified several critical issues and recommendations.

Test Coverage Analysis

✅ Tests Implemented (6 out of 7):

  1. test_mds_xattr_alert_triggered - Basic alert generation and validation
  2. test_alert_triggered_by_restarting_operator_and_metrics_pods - Pod restart scenarios
  3. test_alert_after_recovering_prometheus_from_failures - Prometheus failure recovery
  4. test_alert_after_active_mds_scaledown - Active MDS scale down/up
  5. test_alert_with_both_mds_scaledown - Both active and standby MDS scale down/up
  6. test_alert_with_mds_running_node_restart - Node restart scenario

❌ Missing or Incomplete Tests:

  1. Upgrade test (4.20 to 4.21) - NOT IMPLEMENTED
  2. Runbook validation test - PARTIALLY IMPLEMENTED (only URL check, no mitigation steps validation)

MUST IMPLEMENT

1. Missing Upgrade Test

File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py
Severity: CRITICAL
Issue: Test case "RHSTOR-7656 - verifies the mds alert generation post upgrade from version 4.20 to 4.21" is completely missing from the PR.
Expected: A test method that:

  • Upgrades cluster from 4.20 to 4.21
  • Sets up CPU load with fio pods
  • Updates xattr values
  • Validates alert generation post-upgrade
    Impact: This is a critical test case for ensuring backward compatibility and feature availability after upgrade. Without this, there's no validation that the alert works correctly in upgraded environments.

2. Incomplete Runbook Validation

File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py (Line ~160-163)
Severity: CRITICAL
Issue: The test only validates that the runbook URL is accessible but doesn't verify:

  • The runbook content is correct
  • Mitigation steps can be followed
  • Alert clears after executing runbook steps
    Current Code:
runbook = (
    "https://github.com/openshift/runbooks/blob/master/alerts/"
    "openshift-container-storage-operator/CephXattrSetLatency.md"
)

Expected: Test should:

  1. Access the runbook URL
  2. Parse and validate mitigation steps
  3. Execute mitigation steps (increase RAM/CPU for MDS or scale MDS servers)
  4. Verify alert is cleared after mitigation
  5. Verify latency returns to valid range (<30ms)

3. Missing Alert Clearance Validation

File: All test methods
Severity: CRITICAL
Issue: None of the tests validate that the alert clears after the condition is resolved. Tests only check alert generation but not alert resolution.
Expected: Each test should include:

# After resolving the condition
assert MDSxattr_alert_values(threading_lock, timeout=300) is False
log.info("Alert cleared successfully after condition resolved")

Impact: Without this, we cannot confirm the alert lifecycle is working correctly. Alerts that don't clear create alert fatigue and operational issues.

4. Hard-coded Timeout Values

Files: Multiple locations (Lines 230, 239, 277, 291, 310, 333, 345, 381, 393, 403)
Severity: HIGH
Issue: Timeout values are hard-coded (30, 60, 1200 seconds) throughout the code without explanation or configuration.
Current Code:

assert MDSxattr_alert_values(threading_lock, timeout=30) is False
assert MDSxattr_alert_values(threading_lock, timeout=1200)

Expected:

  • Define timeout constants at module level with clear names
  • Add comments explaining why specific timeouts are chosen
  • Consider making timeouts configurable via test parameters
ALERT_GENERATION_TIMEOUT = 1200  # 20 minutes for alert to trigger under load
ALERT_CLEARANCE_TIMEOUT = 300    # 5 minutes for alert to clear
INITIAL_CHECK_TIMEOUT = 30       # Initial check that alert is not present

5. Missing CPU Load Threshold Documentation

File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py (Line ~95-96)
Severity: HIGH
Issue: Test case spreadsheet mentions "CPU load and threshold details will be added after testing" but the PR doesn't document what values were determined.
Current Code:

log.info(
    "Setting up cephfs stress job for increasing CPU utilization in the cluster"
)

Expected: Clear documentation of:

  • What CPU utilization percentage triggers the alert
  • What xattr latency threshold (>30ms mentioned in feature doc)
  • How the fio parameters were determined
  • Rationale for m_factor="1,2,3,4", parallelism=5, completions=5

6. No Validation of Alert Message Content

File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py (Line ~155-160)
Severity: HIGH
Issue: Test validates alert exists but doesn't verify the alert message content matches the expected description.
Current Code:

message = (
    "There is a latency in setting the 'xattr' values for Ceph Metadata Servers."
)
description = (
    "This latency can be caused by different factors like high CPU usage or network"
    " related issues etc. Please see the runbook URL link to get further help on mitigating the issue."
)

Expected: Add assertions to verify these values match what Prometheus returns:

alert_data = prometheus.check_alert_list(label=MDSxattr_alert, ...)
assert alert_data['message'] == message
assert alert_data['description'] == description
assert alert_data['severity'] == "warning"

NICE TO IMPLEMENT

7. Missing Negative Test Cases

Severity: MEDIUM
Issue: No tests verify that the alert does NOT trigger under normal conditions or when latency is below threshold.
Recommendation: Add test cases for:

  • Normal operations without high CPU load
  • Xattr operations with latency <30ms
  • Alert should not trigger in these scenarios

8. No Concurrent Test Execution Consideration

File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py
Severity: MEDIUM
Issue: Tests create resources with fixed names that could conflict if run concurrently.
Current Code:

file = constants.EXTENDED_ATTRIBUTES
stress_mgr = CephFSStressTestManager(namespace=constants.DEFAULT_NAMESPACE)

Recommendation: Use unique identifiers or test-specific namespaces:

test_namespace = f"test-mds-xattr-{uuid.uuid4().hex[:8]}"
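As a sketch of that recommendation, a tiny helper can generate the collision-resistant name; the helper name and prefix here are illustrative, not from the PR:

```python
import uuid


def unique_test_namespace(prefix="test-mds-xattr"):
    # An 8-hex-char suffix keeps names short while making collisions
    # between concurrent test runs very unlikely.
    return f"{prefix}-{uuid.uuid4().hex[:8]}"
```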

9. Insufficient Logging for Debugging

Severity: MEDIUM
Issue: Limited logging of intermediate states makes debugging failures difficult.
Recommendation: Add more detailed logging:

  • Current xattr latency values
  • CPU utilization levels
  • Number of files being processed
  • MDS pod states and transitions
  • Alert state transitions with timestamps

10. Missing Performance Metrics Collection

Severity: MEDIUM
Issue: Tests don't collect performance metrics that could be valuable for analysis.
Recommendation: Collect and log:

  • Actual latency values from Prometheus queries
  • Time taken for alert to trigger
  • Time taken for alert to clear
  • CPU utilization during test execution

11. No Cleanup Verification

File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py (Line ~134-144)
Severity: MEDIUM
Issue: Finalizer deletes resources but doesn't verify cleanup was successful.
Current Code:

def finalizer():
    delete_deployment_pods(pod_obj)
    job_obj = OCP(kind="Job", namespace=constants.DEFAULT_NAMESPACE)
    job_obj.delete(resource_name="cephfs-stress-job")

Recommendation: Add verification:

def finalizer():
    delete_deployment_pods(pod_obj)
    job_obj = OCP(kind="Job", namespace=constants.DEFAULT_NAMESPACE)
    job_obj.delete(resource_name="cephfs-stress-job")
    # Verify cleanup
    assert not job_obj.get(resource_name="cephfs-stress-job", dont_raise=True)
    log.info("Cleanup verified successfully")

12. Missing Test for Multiple Concurrent Xattr Operations

Severity: MEDIUM
Issue: Tests use a single script execution. Real-world scenarios might have multiple concurrent xattr operations.
Recommendation: Add test variation with multiple pods executing xattr operations simultaneously.


COSMETIC

13. Inconsistent Naming Convention

File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py
Issue: Function names mix different naming styles.

  • set_xattr_with_high_cpu_usage (snake_case - correct)
  • MDSxattr_alert_values (mixed case - inconsistent)
  • initiate_alert_clearanace (typo: "clearanace" should be "clearance")

Recommendation: Standardize to snake_case:

def mds_xattr_alert_values(threading_lock, timeout):
def initiate_alert_clearance():

14. Magic Numbers in Code

File: Multiple locations
Issue: Numbers like 7000, 200, 5, 16 appear without explanation.
Examples:

base_file_count=7000,
size="200",
parallelism=5,
threads=16,

Recommendation: Define as named constants with comments explaining their purpose.
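For example (the numeric values are copied from the PR snippets above; the rationale comments are assumptions the author would need to confirm):

```python
# Workload sizing for the xattr stress job.
BASE_FILE_COUNT = 7000  # files per iteration; assumed to give enough xattr targets
FILE_SIZE = "200"       # per-file size argument passed to the IO helper
JOB_PARALLELISM = 5     # concurrent stress-job pods
IO_THREADS = 16         # worker threads per pod
```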

15. Inconsistent String Formatting

File: Throughout the test file
Issue: Mix of f-strings, .format(), and string concatenation.
Recommendation: Standardize on f-strings for consistency:

log.info(f"Scale up {deployment_name} to 1")
log.info(f"Initial ocs-metrics-exporter pod: {metrics_pod_name}")

16. Missing Docstring Details

File: All test methods
Issue: Docstrings are minimal and don't include:

  • Test prerequisites
  • Expected behavior
  • Cleanup actions
    Recommendation: Enhance docstrings:
def test_mds_xattr_alert_triggered(self, set_xattr_with_high_cpu_usage, threading_lock):
    """
    Test MDS xattr latency alert generation.
    
    Prerequisites:
        - ODF 4.21+ cluster
        - CephFS filesystem available
        - Sufficient cluster resources for CPU stress
    
    Test Steps:
        1. Create high CPU load using fio pods
        2. Execute xattr operations on MDS
        3. Wait for alert generation (max 20 minutes)
        4. Validate alert properties
    
    Expected Result:
        - Alert 'CephXattrSetLatency' triggers
        - Alert has correct severity, message, and runbook URL
    
    Cleanup:
        - Deletes fio pods and jobs
        - Clears alert condition
    """

17. Commented Code

File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py (Line ~254)
Issue: Commented code should be removed:

# ]
# metrics_pods.delete(force=True)

Recommendation: Remove commented code or add explanation if it's intentionally kept for reference.

18. Long Lines

File: Multiple locations
Issue: Several lines exceed 100 characters, reducing readability.
Example (Line ~191):

params='{"spec": {"resources": {"mds": {"limits": {"cpu": "16"}, "requests": {"cpu": "16"}}}}}',

Recommendation: Break into multiple lines:

params=(
    '{"spec": {"resources": {"mds": '
    '{"limits": {"cpu": "16"}, "requests": {"cpu": "16"}}}}}' 
)

Summary Statistics

  • Total Test Cases Expected: 7
  • Test Cases Implemented: 6
  • Test Cases Missing or Incomplete: 2
  • Must Implement Issues: 6
  • Nice to Implement Issues: 6
  • Cosmetic Issues: 6
  • Total Issues: 18

Recommendation

DO NOT MERGE until:

  1. Complete runbook validation with mitigation steps is added
  2. Alert clearance validation is added to all tests
  3. Timeout values are properly documented and configured
  4. CPU load thresholds are documented

The PR shows good test structure and covers most scenarios, but the missing upgrade test and incomplete runbook validation are critical gaps that must be addressed before merge.

FILE_CREATOR_IO = os.path.join(
    TEMPLATE_WORKLOAD_DIR, "helper_scripts/file_creator_io.py"
)
EXTENTDED_ATTRIBUTES = os.path.join(

typo: “Extentded” → “Extended”

return False


def initiate_alert_clearanace():

typo: initiate_alert_clearanace → initiate_alert_clearance


@magenta_squad
@tier2
class TestMdsXattrAlerts(E2ETest):

Add @pytest.mark.polarion_id(...) to each new test

initiate_alert_clearanace()
# waiting for sometime for load distribution
time.sleep(600)
assert MDSxattr_alert_values(threading_lock, timeout=30) is False

MDSxattr_alert_values returns False when the alert is absent (e.g. after a timeout and a failed check_alert_list). That works but is harder to read; I would suggest using PrometheusAPI.check_alert_cleared or api.wait_for_alert, so that we can make sure the alerts have been cleared or not.
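If relying on ocs-ci helpers is undesirable, an explicit polling helper also reads clearly; `alert_is_firing` here is a hypothetical callable wrapping the Prometheus query, not part of the PR:

```python
import time


def wait_for_alert_cleared(alert_is_firing, timeout=300, interval=15):
    """Poll until the alert stops firing; raise TimeoutError if it never clears."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not alert_is_firing():
            return  # alert has cleared
        time.sleep(interval)
    raise TimeoutError(f"Alert still firing after {timeout}s")
```

Compared to asserting `MDSxattr_alert_values(...) is False`, the intent (wait for clearance, fail loudly on timeout) is explicit.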

)


@magenta_squad

Since it is related to alerts/monitoring, I believe this should fall under blue_squad.

Comment on lines +252 to +255
# metrics_pods = OCP_POD_OBJ.get(selector="app.kubernetes.io/name=ocs-metrics-exporter")[
# "items"
# ]
# metrics_pods.delete(force=True)

Missed removing this commented code?

Comment on lines +347 to +365
active_mds = cluster.get_active_mds_info()["mds_daemon"]
standby_mds = cluster.get_mds_standby_replay_info()["mds_daemon"]
active_mds_d = "rook-ceph-mds-" + active_mds
standby_mds_d = "rook-ceph-mds-" + standby_mds
active_mds_pod = cluster.get_active_mds_info()["active_pod"]
standby_mds_pod = cluster.get_mds_standby_replay_info()["standby_replay_pod"]
mds_dc_pods = [active_mds_d, standby_mds_d]

log.info(f"Scale down {active_mds_d} to 0")
helpers.modify_deployment_replica_count(
    deployment_name=active_mds_d, replica_count=0
)
OCP_POD_OBJ.wait_for_delete(resource_name=active_mds_pod)

log.info(f"Scale down {standby_mds_d} to 0")
helpers.modify_deployment_replica_count(
    deployment_name=standby_mds_d, replica_count=0
)
OCP_POD_OBJ.wait_for_delete(resource_name=standby_mds_pod)

Both the MDS pods are being deleted, so you can use

def get_deployments_having_label(label, namespace):

sa_name = helpers.create_serviceaccount(pvc_obj.project.namespace)

helpers.add_scc_policy(sa_name=sa_name.name, namespace=pvc_obj.project.namespace)
pod_obj = helpers.create_pod(

You can use deployment_pod_factory which will also avoid handling pod deletion in the teardown.
pod_obj = deployment_pod_factory(
    interface=constants.CEPHFILESYSTEM,
    pvc=pvc_obj,
    service_account=sa_name,
    node_name=active_mds_node_name,
)

@PrasadDesala PrasadDesala added the team/e2e E2E team related issues/PRs label Feb 17, 2026
Signed-off-by: Poonam Sharma <posharma@redhat.com>
Signed-off-by: Poonam Sharma <posharma@redhat.com>
Signed-off-by: Poonam Sharma <posharma@redhat.com>
Signed-off-by: Poonam Sharma <posharma@redhat.com>
Signed-off-by: Poonam Sharma <posharma@redhat.com>

@ocs-ci ocs-ci left a comment


PR validation on existing cluster

Cluster Name: posharma-ibm18
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py
Additional Test Params:
OCP VERSION: 4.21
OCS VERSION: 4.21
tested against branch: master

Job UNSTABLE (some or all tests failed).

Signed-off-by: Poonam Sharma <posharma@redhat.com>

@ocs-ci ocs-ci left a comment


PR validation on existing cluster

Cluster Name: posharma-ibm18
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py::TestMdsXattrAlerts::test_mds_xattr_alert_triggered tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py::TestMdsXattrAlerts::test_alert_with_mds_running_node_restart
Additional Test Params:
OCP VERSION: 4.21
OCS VERSION: 4.21
tested against branch: master

Job PASSED.


Labels

size/XL Squad/Magenta team/e2e E2E team related issues/PRs
