RHSTOR-7656 - Test generation of MDS alert for xattr latency #14045
posharma1 wants to merge 11 commits into red-hat-storage:master from
Conversation
Signed-off-by: Poonam Sharma <posharma@redhat.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: posharma1
The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
# Example usage
files = create_files("testfile_", file_count, directory)
print(f"Creation of {len(files)} files in '{directory}' directory completed.")

"""
Create `count` empty files named <prefix>1, <prefix>2, ... in `directory`.
Returns a list of created file paths (strings).
"""
WCA suggested the following optimised code (shown here with the missing import and the actual file creation added, so it matches the docstring):

from pathlib import Path

def create_files(prefix, count, directory):
    if count < 1:
        return []
    dir_path = Path(directory)
    dir_path.mkdir(parents=True, exist_ok=True)
    created = [str(dir_path / f"{prefix}{i}") for i in range(1, count + 1)]
    for path in created:
        Path(path).touch()  # the original suggestion only built paths; create the files too
    return created
"Setting extended attributes and file creation IO started in the background."
" Script will look for CephXattrSetLatency alert"
)
assert MDSxattr_alert_values(threading_lock)
@posharma1
Since the MDSxattr_alert_values function is expected to return only True, do you think we should enforce this behavior using an assert statement?
[active_mds_node.name], constants.STATUS_READY, timeout=420
)

assert MDSxattr_alert_values(threading_lock, timeout=1200)
Hi Poonam,
I took Bob's help in reviewing this PR. Below are the comments.
We can take up the upgrade test in the next PR.
PR Review: RHSTOR-7656 - Test generation of MDS alert for xattr latency
Executive Summary
This PR implements automated tests for the new MDS (Metadata Server) xattr latency alert feature in ODF 4.21. Based on my analysis of the test cases spreadsheet, feature documentation, and the PR code visible in the browser, I've identified several critical issues and recommendations.
Test Coverage Analysis
✅ Tests Implemented (5 out of 7):
- test_mds_xattr_alert_triggered - Basic alert generation and validation
- test_alert_triggered_by_restarting_operator_and_metrics_pods - Pod restart scenarios
- test_alert_after_recovering_prometheus_from_failures - Prometheus failure recovery
- test_alert_after_active_mds_scaledown - Active MDS scale down/up
- test_alert_with_both_mds_scaledown - Both active and standby MDS scale down/up
- test_alert_with_mds_running_node_restart - Node restart scenario
❌ Missing Tests (2 out of 7):
- Upgrade test (4.20 to 4.21) - NOT IMPLEMENTED
- Runbook validation test - PARTIALLY IMPLEMENTED (only URL check, no mitigation steps validation)
MUST IMPLEMENT
1. Missing Upgrade Test
File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py
Severity: CRITICAL
Issue: Test case "RHSTOR-7656 - verifies the mds alert generation post upgrade from version 4.20 to 4.21" is completely missing from the PR.
Expected: A test method that:
- Upgrades cluster from 4.20 to 4.21
- Sets up CPU load with fio pods
- Updates xattr values
- Validates alert generation post-upgrade
Impact: This is a critical test case for ensuring backward compatibility and feature availability after upgrade. Without this, there's no validation that the alert works correctly in upgraded environments.
2. Incomplete Runbook Validation
File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py (Line ~160-163)
Severity: CRITICAL
Issue: The test only validates that the runbook URL is accessible but doesn't verify:
- The runbook content is correct
- Mitigation steps can be followed
- Alert clears after executing runbook steps
Current Code:
runbook = (
"https://github.com/openshift/runbooks/blob/master/alerts/"
"openshift-container-storage-operator/CephXattrSetLatency.md"
)

Expected: Test should:
- Access the runbook URL
- Parse and validate mitigation steps
- Execute mitigation steps (increase RAM/CPU for MDS or scale MDS servers)
- Verify alert is cleared after mitigation
- Verify latency returns to valid range (<30ms)
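A minimal, self-contained sketch of what parsing the runbook for mitigation steps could look like. Fetching the URL is omitted and `runbook_md` is a stand-in for the downloaded content; `extract_section` is a hypothetical helper, not an existing ocs-ci API:

```python
import re

def extract_section(md_text, heading):
    """Return the body of the first markdown section whose heading contains
    `heading` (case-insensitive), or None if no such section exists."""
    pattern = re.compile(
        rf"^#+\s*.*{re.escape(heading)}.*$", re.IGNORECASE | re.MULTILINE
    )
    match = pattern.search(md_text)
    if match is None:
        return None
    # Body runs from the end of the heading line to the next heading (or EOF).
    rest = md_text[match.end():]
    nxt = re.search(r"^#+\s", rest, re.MULTILINE)
    return rest[: nxt.start()] if nxt else rest

# Stand-in for the fetched CephXattrSetLatency.md content.
runbook_md = """
# CephXattrSetLatency
## Meaning
MDS xattr set latency exceeded the threshold.
## Mitigation
Increase MDS CPU/RAM or scale out MDS servers.
"""

mitigation = extract_section(runbook_md, "Mitigation")
assert mitigation is not None and mitigation.strip(), "runbook lacks mitigation steps"
```

The test could then act on the extracted steps (scale MDS, raise resource limits) and finally assert the alert clears.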
3. Missing Alert Clearance Validation
File: All test methods
Severity: CRITICAL
Issue: None of the tests validate that the alert clears after the condition is resolved. Tests only check alert generation but not alert resolution.
Expected: Each test should include:
# After resolving the condition
assert MDSxattr_alert_values(threading_lock, timeout=300) is False
log.info("Alert cleared successfully after condition resolved")

Impact: Without this, we cannot confirm the alert lifecycle is working correctly. Alerts that don't clear create alert fatigue and operational issues.
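The clearance check boils down to polling until a condition flips. A self-contained sketch of such a poller; in the real tests the predicate would be a Prometheus alert query, which is assumed here:

```python
import time

def wait_until(predicate, timeout, interval=0.1):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.
    Returns True on success, False if the deadline passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# Illustrative predicate: the "alert" clears after three polls.
state = {"polls": 0}
def alert_cleared():
    state["polls"] += 1
    return state["polls"] >= 3

assert wait_until(alert_cleared, timeout=5) is True
```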
4. Hard-coded Timeout Values
Files: Multiple locations (Lines 230, 239, 277, 291, 310, 333, 345, 381, 393, 403)
Severity: HIGH
Issue: Timeout values are hard-coded (30, 60, 1200 seconds) throughout the code without explanation or configuration.
Current Code:
assert MDSxattr_alert_values(threading_lock, timeout=30) is False
assert MDSxattr_alert_values(threading_lock, timeout=1200)

Expected:
- Define timeout constants at module level with clear names
- Add comments explaining why specific timeouts are chosen
- Consider making timeouts configurable via test parameters
ALERT_GENERATION_TIMEOUT = 1200 # 20 minutes for alert to trigger under load
ALERT_CLEARANCE_TIMEOUT = 300 # 5 minutes for alert to clear
INITIAL_CHECK_TIMEOUT = 30  # Initial check that alert is not present

5. Missing CPU Load Threshold Documentation
File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py (Line ~95-96)
Severity: HIGH
Issue: Test case spreadsheet mentions "CPU load and threshold details will be added after testing" but the PR doesn't document what values were determined.
Current Code:
log.info(
"Setting up cephfs stress job for increasing CPU utilization in the cluster"
)

Expected: Clear documentation of:
- What CPU utilization percentage triggers the alert
- What xattr latency threshold (>30ms mentioned in feature doc)
- How the fio parameters were determined
- Rationale for m_factor="1,2,3,4", parallelism=5, completions=5
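To make that documentation concrete, module-level constants could look like the sketch below. Only the 30 ms threshold is taken from the feature doc quoted above; the CPU percentage and the helper function are illustrative assumptions:

```python
# Only the 30 ms latency threshold comes from the feature doc; the CPU figure
# and fio parameters are placeholders to be replaced with the values
# determined during testing.
XATTR_LATENCY_THRESHOLD_MS = 30   # alert fires when set-xattr latency exceeds this
TARGET_CPU_UTILIZATION_PCT = 80   # assumed load level needed to induce latency
FIO_M_FACTOR = "1,2,3,4"          # from the PR; rationale to be documented
FIO_PARALLELISM = 5               # from the PR; rationale to be documented

def latency_exceeds_threshold(latency_ms, threshold_ms=XATTR_LATENCY_THRESHOLD_MS):
    """True when an observed set-xattr latency breaches the alert threshold."""
    return latency_ms > threshold_ms

assert latency_exceeds_threshold(45)
assert not latency_exceeds_threshold(12)
```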
6. No Validation of Alert Message Content
File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py (Line ~155-160)
Severity: HIGH
Issue: Test validates alert exists but doesn't verify the alert message content matches the expected description.
Current Code:
message = (
"There is a latency in setting the 'xattr' values for Ceph Metadata Servers."
)
description = (
"This latency can be caused by different factors like high CPU usage or network"
" related issues etc. Please see the runbook URL link to get further help on mitigating the issue."
)Expected: Add assertions to verify these values match what Prometheus returns:
alert_data = prometheus.check_alert_list(label=MDSxattr_alert, ...)
assert alert_data['message'] == message
assert alert_data['description'] == description
assert alert_data['severity'] == "warning"

NICE TO IMPLEMENT
7. Missing Negative Test Cases
Severity: MEDIUM
Issue: No tests verify that the alert does NOT trigger under normal conditions or when latency is below threshold.
Recommendation: Add test cases for:
- Normal operations without high CPU load
- Xattr operations with latency <30ms
- Alert should not trigger in these scenarios
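A negative test needs an explicit rule for when the alert should fire. The sketch below models a simplified version of that rule; the 30 ms threshold comes from the feature doc, the helper itself is illustrative:

```python
XATTR_LATENCY_THRESHOLD_MS = 30  # from the feature doc

def should_alert(latency_samples_ms, threshold_ms=XATTR_LATENCY_THRESHOLD_MS):
    """Alert only when every sample in the window breaches the threshold,
    mirroring a Prometheus rule with a `for:` clause (simplified)."""
    return bool(latency_samples_ms) and all(
        s > threshold_ms for s in latency_samples_ms
    )

# Negative path: normal operation, latency well under 30 ms -> no alert.
assert should_alert([5, 8, 12, 7]) is False
# Positive path for contrast: sustained breach -> alert.
assert should_alert([41, 38, 52]) is True
```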
8. No Concurrent Test Execution Consideration
File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py
Severity: MEDIUM
Issue: Tests create resources with fixed names that could conflict if run concurrently.
Current Code:
file = constants.EXTENDED_ATTRIBUTES
stress_mgr = CephFSStressTestManager(namespace=constants.DEFAULT_NAMESPACE)

Recommendation: Use unique identifiers or test-specific namespaces:
test_namespace = f"test-mds-xattr-{uuid.uuid4().hex[:8]}"

9. Insufficient Logging for Debugging
Severity: MEDIUM
Issue: Limited logging of intermediate states makes debugging failures difficult.
Recommendation: Add more detailed logging:
- Current xattr latency values
- CPU utilization levels
- Number of files being processed
- MDS pod states and transitions
- Alert state transitions with timestamps
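One way to make state transitions greppable is a tiny helper that emits a uniform, timestamped line. This is an illustrative sketch, not an existing ocs-ci utility:

```python
import logging
import time

log = logging.getLogger(__name__)

def log_transition(entity, old_state, new_state):
    """Build and emit one timestamped state-transition line; returning the
    string also makes the helper easy to unit-test."""
    msg = (
        f"[{time.strftime('%Y-%m-%dT%H:%M:%S')}] "
        f"{entity}: {old_state} -> {new_state}"
    )
    log.info(msg)
    return msg

line = log_transition("CephXattrSetLatency", "pending", "firing")
assert "pending -> firing" in line
```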
10. Missing Performance Metrics Collection
Severity: MEDIUM
Issue: Tests don't collect performance metrics that could be valuable for analysis.
Recommendation: Collect and log:
- Actual latency values from Prometheus queries
- Time taken for alert to trigger
- Time taken for alert to clear
- CPU utilization during test execution
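Measuring time-to-trigger only requires the poll loop to report its elapsed time alongside the result. An illustrative, self-contained sketch:

```python
import time

def timed_wait(predicate, timeout, interval=0.05):
    """Poll `predicate` and report (fired, elapsed_seconds) so the test can
    log how long the alert took to trigger, not just whether it did."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if predicate():
            return True, time.monotonic() - start
        time.sleep(interval)
    return False, time.monotonic() - start

fired, elapsed = timed_wait(lambda: True, timeout=1)
assert fired and elapsed < 1
```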
11. No Cleanup Verification
File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py (Line ~134-144)
Severity: MEDIUM
Issue: Finalizer deletes resources but doesn't verify cleanup was successful.
Current Code:
def finalizer():
delete_deployment_pods(pod_obj)
job_obj = OCP(kind="Job", namespace=constants.DEFAULT_NAMESPACE)
job_obj.delete(resource_name="cephfs-stress-job")

Recommendation: Add verification:
def finalizer():
delete_deployment_pods(pod_obj)
job_obj = OCP(kind="Job", namespace=constants.DEFAULT_NAMESPACE)
job_obj.delete(resource_name="cephfs-stress-job")
# Verify cleanup
assert not job_obj.get(resource_name="cephfs-stress-job", dont_raise=True)
log.info("Cleanup verified successfully")

12. Missing Test for Multiple Concurrent Xattr Operations
Severity: MEDIUM
Issue: Tests use a single script execution. Real-world scenarios might have multiple concurrent xattr operations.
Recommendation: Add test variation with multiple pods executing xattr operations simultaneously.
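The concurrent variation could be prototyped with a thread pool before wiring it into pods. The sketch below simulates the xattr store with a dict; the real test would run setfattr in several pods in parallel, and that plumbing is assumed:

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# Simulated xattr store standing in for the filesystem.
xattrs = {}
lock = threading.Lock()

def set_xattrs(worker_id, count):
    """Each worker sets `count` attributes with worker-unique names."""
    for i in range(count):
        with lock:
            xattrs[f"user.attr-{worker_id}-{i}"] = b"value"
    return count

# Four concurrent workers, 100 xattr operations each.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda w: set_xattrs(w, 100), range(4)))

assert sum(results) == 400
assert len(xattrs) == 400
```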
COSMETIC
13. Inconsistent Naming Convention
File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py
Issue: Function names mix different naming styles.
- set_xattr_with_high_cpu_usage (snake_case - correct)
- MDSxattr_alert_values (mixed case - inconsistent)
- initiate_alert_clearanace (typo: "clearanace" should be "clearance")
Recommendation: Standardize to snake_case:
def mds_xattr_alert_values(threading_lock, timeout):
def initiate_alert_clearance():

14. Magic Numbers in Code
File: Multiple locations
Issue: Numbers like 7000, 200, 5, 16 appear without explanation.
Examples:
base_file_count=7000,
size="200",
parallelism=5,
threads=16,

Recommendation: Define as named constants with comments explaining their purpose.
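As a sketch, those magic numbers could become module constants like this; the rationale comments and the MB unit are placeholders the test author would need to fill in:

```python
# Named constants replacing the magic numbers. The rationales below are
# assumptions and should be replaced by the actual reasons.
BASE_FILE_COUNT = 7000   # files created per iteration to generate xattr load
FILE_SIZE_MB = 200       # per-file size used by the stress workload (unit assumed)
JOB_PARALLELISM = 5      # concurrent stress-job pods
IO_THREADS = 16          # IO threads per pod

def total_files(iterations):
    """Total files the workload touches across `iterations` runs."""
    return BASE_FILE_COUNT * iterations

assert total_files(3) == 21000
```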
15. Inconsistent String Formatting
File: Throughout the test file
Issue: Mix of f-strings, .format(), and string concatenation.
Recommendation: Standardize on f-strings for consistency:
log.info(f"Scale up {deployment_name} to 1")
log.info(f"Initial ocs-metrics-exporter pod: {metrics_pod_name}")

16. Missing Docstring Details
File: All test methods
Issue: Docstrings are minimal and don't include:
- Test prerequisites
- Expected behavior
- Cleanup actions
Recommendation: Enhance docstrings:
def test_mds_xattr_alert_triggered(self, set_xattr_with_high_cpu_usage, threading_lock):
"""
Test MDS xattr latency alert generation.
Prerequisites:
- ODF 4.21+ cluster
- CephFS filesystem available
- Sufficient cluster resources for CPU stress
Test Steps:
1. Create high CPU load using fio pods
2. Execute xattr operations on MDS
3. Wait for alert generation (max 20 minutes)
4. Validate alert properties
Expected Result:
- Alert 'CephXattrSetLatency' triggers
- Alert has correct severity, message, and runbook URL
Cleanup:
- Deletes fio pods and jobs
- Clears alert condition
"""

17. Commented Code
File: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py (Line ~254)
Issue: Commented code should be removed:
# ]
# metrics_pods.delete(force=True)

Recommendation: Remove commented code or add explanation if it's intentionally kept for reference.
18. Long Lines
File: Multiple locations
Issue: Several lines exceed 100 characters, reducing readability.
Example (Line ~191):
params='{"spec": {"resources": {"mds": {"limits": {"cpu": "16"}, "requests": {"cpu": "16"}}}}}',

Recommendation: Break into multiple lines:
params=(
'{"spec": {"resources": {"mds": '
'{"limits": {"cpu": "16"}, "requests": {"cpu": "16"}}}}}'
)

Summary Statistics
- Total Test Cases Expected: 7
- Test Cases Implemented: 5
- Test Cases Missing: 2
- Must Implement Issues: 6
- Nice to Implement Issues: 6
- Cosmetic Issues: 6
- Total Issues: 18
Recommendation
DO NOT MERGE until:
- Complete runbook validation with mitigation steps is added
- Alert clearance validation is added to all tests
- Timeout values are properly documented and configured
- CPU load thresholds are documented
The PR shows good test structure and covers most scenarios, but the missing upgrade test and incomplete runbook validation are critical gaps that must be addressed before merge.
ocs_ci/ocs/constants.py
Outdated
FILE_CREATOR_IO = os.path.join(
    TEMPLATE_WORKLOAD_DIR, "helper_scripts/file_creator_io.py"
)
EXTENTDED_ATTRIBUTES = os.path.join(
typo: “Extentded” → “Extended”
return False

def initiate_alert_clearanace():
typo: initiate_alert_clearanace → initiate_alert_clearance
@magenta_squad
@tier2
class TestMdsXattrAlerts(E2ETest):
Add @pytest.mark.polarion_id(...) to each new test
initiate_alert_clearanace()
# waiting for sometime for load distribution
time.sleep(600)
assert MDSxattr_alert_values(threading_lock, timeout=30) is False
MDSxattr_alert_values returning False when the alert is absent (e.g. after a timeout and a failed check_alert_list) works, but it is harder to read. I would suggest using PrometheusAPI.check_alert_cleared or api.wait_for_alert, so we can be sure whether the alerts have been cleared.
)

@magenta_squad
Since it is related to alerts/monitoring, I believe this should fall under blue_squad.
# metrics_pods = OCP_POD_OBJ.get(selector="app.kubernetes.io/name=ocs-metrics-exporter")[
#     "items"
# ]
# metrics_pods.delete(force=True)
Did you miss removing this commented code?
active_mds = cluster.get_active_mds_info()["mds_daemon"]
standby_mds = cluster.get_mds_standby_replay_info()["mds_daemon"]
active_mds_d = "rook-ceph-mds-" + active_mds
standby_mds_d = "rook-ceph-mds-" + standby_mds
active_mds_pod = cluster.get_active_mds_info()["active_pod"]
standby_mds_pod = cluster.get_mds_standby_replay_info()["standby_replay_pod"]
mds_dc_pods = [active_mds_d, standby_mds_d]

log.info(f"Scale down {active_mds_d} to 0")
helpers.modify_deployment_replica_count(
    deployment_name=active_mds_d, replica_count=0
)
OCP_POD_OBJ.wait_for_delete(resource_name=active_mds_pod)

log.info(f"Scale down {standby_mds_d} to 0")
helpers.modify_deployment_replica_count(
    deployment_name=standby_mds_d, replica_count=0
)
OCP_POD_OBJ.wait_for_delete(resource_name=standby_mds_pod)
Both the mgr pods are being deleted, so you can use:
ocs-ci/ocs_ci/ocs/resources/deployment.py, Line 147 in 5e53bcb
sa_name = helpers.create_serviceaccount(pvc_obj.project.namespace)

helpers.add_scc_policy(sa_name=sa_name.name, namespace=pvc_obj.project.namespace)
pod_obj = helpers.create_pod(
You can use deployment_pod_factory which will also avoid handling pod deletion in the teardown.
pod_obj = deployment_pod_factory(
interface=constants.CEPHFILESYSTEM,
pvc=pvc_obj,
service_account=sa_name,
node_name=active_mds_node_name,
)
ocs-ci
left a comment
PR validation on existing cluster
Cluster Name: posharma-ibm18
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py
Additional Test Params:
OCP VERSION: 4.21
OCS VERSION: 4.21
tested against branch: master
Job UNSTABLE (some or all tests failed).
ocs-ci
left a comment
PR validation on existing cluster
Cluster Name: posharma-ibm18
Cluster Configuration:
PR Test Suite: tier2
PR Test Path: tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py::TestMdsXattrAlerts::test_mds_xattr_alert_triggered tests/functional/monitoring/prometheus/alerts/test_alert_mds_xattr_latency.py::TestMdsXattrAlerts::test_alert_with_mds_running_node_restart
Additional Test Params:
OCP VERSION: 4.21
OCS VERSION: 4.21
tested against branch: master
Happy path automation for RHSTOR-7656 - The test will verify generation of the CephXattrSetLatency alert during intensive extended-attribute operations on MDS along with high CPU utilization in the cluster.