CNF-19737: Add test for observability remoteWrite #970
base: main

Conversation
If the user configures the optional remoteWrite capability for their observability configuration it must begin streaming prometheus metrics to that endpoint. Validate that this configuration is compatible with other Core RDS monitoring configuration. Signed-off-by: Ian Miller <imiller@redhat.com>
📝 Walkthrough: Adds a system test that deploys an HTTP test server, patches the cluster monitoring configuration to add a remoteWrite endpoint, and verifies metrics are delivered to that endpoint.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
⚠️ Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 golangci-lint (2.5.0): Command failed
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go (1)
196-236: Redundant time.Sleep inside Eventually loop.

Line 198 has time.Sleep(checkInterval) inside the Eventually function body, while line 236 already configures .WithPolling(checkInterval). This results in each iteration waiting approximately double the intended interval (sleep + polling delay). Remove the sleep since Eventually's polling already provides the interval:

 Eventually(func() bool {
-	// Wait for the check interval
-	time.Sleep(checkInterval)
-
 	currentStats, err := getHTTPStats(httpServerPod, remoteWriteTestPodName)
 	if err != nil {
 		klog.V(rdscoreparams.RDSCoreLogLevel).Infof("Error getting stats during rate check: %v", err)
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go (1 hunks)
tests/system-tests/rdscore/tests/00_validate_top_level.go (1 hunks)
🔇 Additional comments (5)
tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go (5)
1-35: LGTM! Package structure, imports, and constants are well-organized. Using constants for resource names and configuration values is a good practice.
37-41: LGTM! Simple and appropriate struct for the HTTP server statistics response.
327-390: LGTM! The DeferCleanup implementation properly handles ConfigMap restoration, including edge cases where the config structure may have changed. Silent failure with logging is appropriate for cleanup code.
433-458: LGTM! The getHTTPStats helper function includes a sensible fallback from curl to Python's urllib in case curl is not available in the container image.
128-143: I encountered an issue accessing the repository to verify the container configuration concern. Since the repository cannot be cloned, I cannot inspect the pod builder implementation or examine usage patterns in the codebase. However, based on the review comment provided, the concern raised requires examining the actual pod builder API implementation to determine whether pod.NewBuilder creates a default container that conflicts with WithAdditionalContainer.

Verify the pod container configuration pattern.

The code creates a container with NewContainerBuilder (lines 128-130), then uses pod.NewBuilder with the same image (line 135), and finally adds the configured container via WithAdditionalContainer (line 137). This may result in duplicate containers in the pod; verify whether NewBuilder implicitly creates a default container and whether WithAdditionalContainer is the correct method to use, or if the configured container should be set as the primary container instead.
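For illustration only, a minimal self-contained sketch of the guard this question implies: after the builder assembles the pod definition, the spec should contain exactly one container (the configured HTTP test server), not a builder default plus an additional copy. This uses plain corev1 types; the function name is hypothetical and nothing here reflects the eco-goinfra builder API itself.

package rdscorecommon

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// verifySingleContainer is an illustrative check of the expected result:
// the assembled test-server pod spec should have exactly one container.
func verifySingleContainer(podSpec corev1.PodSpec) error {
	if len(podSpec.Containers) != 1 {
		return fmt.Errorf("expected exactly 1 container in the test-server pod, found %d",
			len(podSpec.Containers))
	}

	return nil
}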
| It("Verifies cluster monitoring configuration with remoteWrite", | ||
| Label("monitoring", "monitoring-remote-write"), | ||
| rdscorecommon.VerifyMonitoringConfigRemoteWrite) |
Missing reportxml.ID() and SpecTimeout for test tracking and safety.
All other test cases in this file include a reportxml.ID() for test tracking/reporting. This test is also potentially long-running (10+ minutes for stabilization alone), so consider adding a SpecTimeout.
Additionally, if you fix the SpecContext issue in VerifyMonitoringConfigRemoteWrite, this call will need to be updated.
It("Verifies cluster monitoring configuration with remoteWrite",
Label("monitoring", "monitoring-remote-write"),
- rdscorecommon.VerifyMonitoringConfigRemoteWrite)
+ reportxml.ID("XXXXX"), SpecTimeout(15*time.Minute),
+ rdscorecommon.VerifyMonitoringConfigRemoteWrite)Replace XXXXX with the appropriate test case ID from your tracking system.
🤖 Prompt for AI Agents
In tests/system-tests/rdscore/tests/00_validate_top_level.go around lines 118 to
120, this It() test is missing a reportxml.ID(...) for test tracking and a
SpecTimeout to guard against long-running stabilization; add
reportxml.ID("XXXXX") (replace XXXXX with the correct test ID) and a SpecTimeout
option (e.g., a suitable duration such as 10–20 minutes) to the It() call
options. Also confirm the signature of
rdscorecommon.VerifyMonitoringConfigRemoteWrite — if you change it to accept a
SpecContext, update this call to pass the appropriate SpecContext argument
instead of the current function value.
Additional resilience for pod as part of deployment. Fix code analysis warning on uninitialized ctx. Signed-off-by: Ian Miller <imiller@redhat.com>
Actionable comments posted: 0
🧹 Nitpick comments (6)
tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go (6)
28-38: Confirm intent to run test workloads in openshift-monitoring

remoteWriteTestNamespace is hard-coded to openshift-monitoring, so the HTTP test Deployment/Service are created in the same system namespace as the cluster monitoring stack. That's probably fine, but it does mean:
- potential name collisions if someone happens to create resources with the same names, and
- mixing test-only workloads into a cluster-managed namespace.

If isolation is preferable, consider running the test server in a dedicated namespace (while still reading cluster-monitoring-config from openshift-monitoring and pointing remoteWrite.url at the cross-namespace Service).

Also applies to: 61-69
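If the dedicated-namespace option were taken, the remoteWrite URL would simply target the Service's cluster DNS name across namespaces. A hedged sketch, where the namespace name and port are invented placeholders and only remoteWriteTestServiceName comes from the test (assumes the file's existing fmt import):

// isolatedRemoteWriteURL shows the cross-namespace form of the endpoint URL.
// The namespace and port below are placeholders, not values from the PR.
func isolatedRemoteWriteURL(serviceName string) string {
	const (
		isolatedNamespace = "rds-remote-write-test" // hypothetical dedicated namespace
		servicePort       = 8080                    // hypothetical service port
	)

	return fmt.Sprintf("http://%s.%s.svc.cluster.local:%d",
		serviceName, isolatedNamespace, servicePort)
}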
156-163: CreateAndWaitUntilReady inside Eventually adds complexity without clear benefit

CreateAndWaitUntilReady(2 * time.Minute) is itself a blocking wait. Wrapping it in an Eventually with its own WithPolling and WithTimeout layers gives you nested timeouts and potential repeated create attempts, which can make the behavior harder to reason about. If you only need a single create+wait, calling CreateAndWaitUntilReady once (with a suitably generous timeout) would be simpler; if you really want retries, a lighter pattern like Create() inside Eventually plus a separate readiness wait would be clearer, as in the sketch below.
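A sketch of the lighter pattern, under the assumption that the deployment builder exposes Create() and IsReady(timeout); method names should be checked against the eco-goinfra version in use, and deployBuilder plus the durations are placeholders:

// Retry only the create, then do a single explicit readiness wait.
Eventually(func() error {
	_, err := deployBuilder.Create()

	return err
}).WithContext(ctx).
	WithTimeout(1*time.Minute).
	WithPolling(5*time.Second).
	Should(Succeed(), "failed to create the remoteWrite test-server deployment")

// One readiness wait with its own, clearly visible timeout.
Expect(deployBuilder.IsReady(2*time.Minute)).To(BeTrue(),
	"remoteWrite test-server deployment never became ready")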
213-265: Avoid double-sleeping in the rate-stabilization loop

Inside the Eventually you call time.Sleep(checkInterval) and also configure WithPolling(checkInterval). That effectively doubles the interval between checks, and the inner Sleep isn't cancellable via ctx, which can make the timing less predictable. You can simplify by either:
- removing the explicit time.Sleep and relying solely on WithPolling(checkInterval) (see the fragment after this list), or
- keeping the Sleep inside the function and using a very small (or default) polling interval on Eventually.
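A fragment showing the first option, with the Sleep removed and WithPolling alone pacing the loop; the stability comparison is illustrative only, and previousStats, checkInterval, and stabilizationTimeout are assumed to exist in the enclosing scope as they do in the test:

Eventually(func() bool {
	currentStats, err := getHTTPStats(httpServerPod, remoteWriteTestPodName)
	if err != nil {
		klog.V(rdscoreparams.RDSCoreLogLevel).Infof("Error getting stats during rate check: %v", err)

		return false
	}

	// Illustrative stability check: stop once the counters stop growing.
	stable := currentStats.Connections == previousStats.Connections
	previousStats = currentStats

	return stable
}).WithContext(ctx).
	WithTimeout(stabilizationTimeout).
	WithPolling(checkInterval).
	Should(BeTrue(), "pre-update traffic rate never stabilized")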
275-351: Consider handling ConfigMap update conflicts more robustly

The Eventually around cmBuilder.Update() retries the same Update() call, but never re-pulls the ConfigMap. If something else is mutating cluster-monitoring-config and you hit a resourceVersion conflict, all retries will likely fail with the same conflict. A more robust pattern would be to, inside the Eventually function:
- Pull the ConfigMap,
- re-apply the remoteWrite mutation to the fresh object, then
- Update().

That way conflicts can be resolved by retrying with the latest version instead of spinning on a stale builder.
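A sketch of that pull-mutate-update loop, assuming configmap.Pull(apiClient, name, namespace) and Builder.Update() behave as they do elsewhere in this file; apiClient, serviceURL, and the addRemoteWriteEndpoint helper are placeholders standing in for the existing code:

Eventually(func() error {
	// Re-pull so a resourceVersion conflict is resolved on the next attempt.
	freshCM, err := configmap.Pull(apiClient, monitoringConfigMapName, remoteWriteTestNamespace)
	if err != nil {
		return err
	}

	// Re-apply the remoteWrite mutation to the freshly pulled object.
	if err := addRemoteWriteEndpoint(freshCM, serviceURL); err != nil {
		return err
	}

	_, err = freshCM.Update()

	return err
}).WithContext(ctx).
	WithTimeout(2*time.Minute).
	WithPolling(10*time.Second).
	Should(Succeed(), "failed to add remoteWrite endpoint to cluster-monitoring-config")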
354-418: Tighten matching when removing the test remoteWrite endpoint

Cleanup currently removes any remoteWrite entry whose url matches strings.Contains(url, remoteWriteTestServiceName). This is very unlikely to collide with user config, but in theory it could drop an unrelated endpoint with that substring. Matching on the exact serviceURL used when you appended the endpoint (or at least on the full http://<svc>.<ns>.svc.cluster.local host) would make the cleanup more precise and future-proof.
488-515: Minor simplification opportunity in getPodFromDeployment

After the explicit len(podList) == 0 check that returns an error, the final if len(podList) > 0 { return podList[0], nil } is always true when reached. You can simplify to:

	// After the for-loop:
	return podList[0], nil

(or collapse the zero-length check + fallback into a single structure) to reduce a bit of dead-branch noise.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go (1)
tests/system-tests/rdscore/internal/rdscoreparams/const.go (1)
RDSCoreLogLevel(8-8)
🔇 Additional comments (2)
tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go (2)
46-59: Context propagation and overall test flow look solid

Accepting ctx SpecContext and feeding it through all Eventually(...).WithContext(ctx) calls fixes the earlier nil-context risk, and the step-wise flow (namespace, Deployment/Service, ConfigMap mutation, verification, cleanup) is coherent and easy to follow.

462-486: getHTTPStats helper is nicely robust

The curl → python fallback and JSON unmarshalling into HTTPStats match the server's /stats response shape and should be resilient to minimal tooling differences in the image. No changes needed here.
		monitoringConfigMapName, monitoringConfigYAMLKey))

	configYAML, keyExists := cmBuilder.Object.Data[monitoringConfigYAMLKey]
	if !keyExists {
If the key is missing then the test must fail or be skipped. The idea is that the cluster is configured with the policies and the test evaluates the config. Updating the CM on a live cluster doesn't verify that the source CR from the telco-reference repo was correctly processed and applied to the cluster.
This test is specifically adding the remoteWrite to the existing (Core RDS applied) configuration in this configmap. This is optional (not included by default) configuration that we want to include as "acceptable" user variation. Typically this would be included in the policy, but the goal is to ensure that this option can be added to the RDS-applied configuration without causing issues, resulting in a stream of metrics to the indicated endpoint. I'll update it to fail the test, but I would expect that if the RDS configuration isn't getting applied correctly it would be caught earlier than this. That said, this will serve as a fall-back check that this specific piece is correctly applied.
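A small fragment of what "fail the test" could look like at that point, reusing the identifiers from the excerpt above; the failure message wording is illustrative:

	configYAML, keyExists := cmBuilder.Object.Data[monitoringConfigYAMLKey]
	// Fail (rather than skip) when the expected key is absent, since the
	// RDS-applied configuration should already contain it.
	Expect(keyExists).To(BeTrue(),
		fmt.Sprintf("key %q missing from ConfigMap %q; RDS monitoring configuration was not applied",
			monitoringConfigYAMLKey, monitoringConfigMapName))
	// configYAML is then parsed and mutated as before.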
Signed-off-by: Ian Miller <imiller@redhat.com>
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go (2)
395-459: Consider exact URL matching for cleanup robustness.

The cleanup logic at line 443 uses strings.Contains(url, remoteWriteTestServiceName) to identify the test endpoint. While this works for the current scenario, exact URL matching would be more robust and prevent accidental removal of endpoints that might contain the same service name substring. Consider storing the exact service URL in a variable and matching against it:

 // Remove the test endpoint (the one with our service URL).
+// Use exact match against the service URL we added.
 filteredSlice := []interface{}{}
 for _, endpoint := range remoteWriteSlice {
 	endpointMap, ok := endpoint.(map[interface{}]interface{})
 	if !ok {
 		continue
 	}

 	url, ok := endpointMap["url"].(string)
-	if !ok || !strings.Contains(url, remoteWriteTestServiceName) {
+	if !ok || url != serviceURL {
 		filteredSlice = append(filteredSlice, endpoint)
 	}
 }

Note: You would need to capture serviceURL in the cleanup closure or reconstruct it.
502-529: LGTM with optional readability improvement.

The getHTTPStats function correctly implements a fallback mechanism (curl → Python) for fetching statistics. The logic is sound and error handling is appropriate. For slightly better readability, consider extracting the Python command to a constant or breaking it into multiple lines, though the current implementation is functional:

const pythonStatsTemplate = `import urllib.request; import json; print(json.dumps(json.loads(urllib.request.urlopen('%s').read().decode())))`

This is purely a style suggestion; the existing code works correctly.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 2bef8eb and e3f1c605baa569a37462d5a55d504ad5a5ffdb14.
📒 Files selected for processing (1)
tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go (1)
tests/internal/oran-subscriber/subscriber.go (1)
WithTimeout(219-223)
🔇 Additional comments (7)
tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go (7)
1-44: LGTM! The imports, constants, and HTTPStats struct are well-defined and appropriate for this Kubernetes integration test.

46-101: LGTM! The Python HTTP server script is well-structured with proper thread-safe statistics tracking and appropriate endpoint handlers for the remoteWrite test scenario.

103-139: LGTM! The security context configuration is appropriate for OpenShift environments, allowing the platform to assign UID/GID from the allowed range while maintaining security restrictions. The container builder usage is correct.

141-182: LGTM! The deployment creation properly uses Eventually with context for retries, and the service creation is straightforward. The use of CreateAndWaitUntilReady ensures the deployment is ready before proceeding.

187-253: LGTM! The test setup is well-structured with proper resource creation, early cleanup registration, and appropriate error handling. The use of Eventually for async operations is correct.

317-394: LGTM! The ConfigMap update logic properly handles YAML parsing, ensures the prometheusK8s section exists, appends the new remoteWrite endpoint, and uses retry logic for both pulling and updating the ConfigMap.

465-500: LGTM! The final verification logic properly uses Eventually to poll for data arrival, calculates the increases relative to the pre-update baseline, and asserts appropriate thresholds (>50 connections, >100KB) to confirm Prometheus is sending data to the remoteWrite endpoint.
// getPodFromDeployment gets a running pod from a deployment by label selector.
func getPodFromDeployment(apiClient *clients.Settings, deploymentName, namespace string) (*pod.Builder, error) {
	// Get pods by the deployment's label selector.
	podList, err := pod.List(apiClient, namespace, metav1.ListOptions{
		LabelSelector: fmt.Sprintf("app=%s", deploymentName),
	})
	if err != nil {
		return nil, fmt.Errorf("failed to list pods: %w", err)
	}

	if len(podList) == 0 {
		return nil, fmt.Errorf("no pods found for deployment %s", deploymentName)
	}

	// Find a running pod.
	for _, p := range podList {
		if p.Object.Status.Phase == corev1.PodRunning && p.Object.DeletionTimestamp == nil {
			return p, nil
		}
	}

	// If no running pod found, return the first one (it might be starting).
	if len(podList) > 0 {
		return podList[0], nil
	}

	return nil, fmt.Errorf("no pods found for deployment %s", deploymentName)
}
Return error if no running pod found instead of returning a non-running pod.
Lines 553-555 return the first pod even if it's not in Running state. While the comment suggests "it might be starting," returning a non-running pod can cause failures when ExecCommand is called on it later in getHTTPStats. This creates fragile behavior where the test might intermittently fail depending on pod startup timing.
Apply this diff to return an error if no running pod is found:
 // Find a running pod.
 for _, p := range podList {
 	if p.Object.Status.Phase == corev1.PodRunning && p.Object.DeletionTimestamp == nil {
 		return p, nil
 	}
 }

-// If no running pod found, return the first one (it might be starting).
-if len(podList) > 0 {
-	return podList[0], nil
-}
-
-return nil, fmt.Errorf("no pods found for deployment %s", deploymentName)
+return nil, fmt.Errorf("no running pods found for deployment %s", deploymentName)

The caller already wraps this in an Eventually block (line 215), so returning an error when no running pod exists will trigger retries until a running pod is available, providing more reliable behavior.
🤖 Prompt for AI Agents
In
tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go
around lines 531-558, the function currently returns the first pod when no
running pod is found which can cause failures later when ExecCommand is called;
instead, change the logic to return an error when no pod in corev1.PodRunning
state with no DeletionTimestamp is found (remove the fallback that returns
podList[0] and return fmt.Errorf("no running pod found for deployment %s",
deploymentName)), so the caller's Eventually retry loop can wait for a true
running pod.
Force-pushed from e3f1c60 to 48a07c5
Move Deployment creation to helper functions to simplify the Verify.... function Signed-off-by: Ian Miller <imiller@redhat.com>
Force-pushed from 48a07c5 to 27e7ab6
Actionable comments posted: 0
♻️ Duplicate comments (1)
tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go (1)
320-322: Avoid the hard-coded 30s sleep before polling for remoteWrite traffic.

The fixed time.Sleep(30 * time.Second) adds latency, but the following Eventually (5-minute timeout, 10-second polling) already handles waiting for Prometheus to reload and start sending data. You can safely remove or greatly reduce this sleep to tighten test runtime and reduce unnecessary delay. This was already discussed in a prior review; it still applies here.

 // Step 7: Wait for Prometheus to pick up the config and start sending data.
 By("Waiting for Prometheus to start sending data to remoteWrite endpoint")
-time.Sleep(30 * time.Second) // Give Prometheus time to reload config and send data.
+// Rely on the Eventually block below to wait for Prometheus to reload and send data.
🧹 Nitpick comments (5)
tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go (5)
324-356: Hard-coded connection/byte thresholds may be too strict for low-traffic clusters.

Requiring > preUpdateStats.Connections+50 and > 100KB within 5 minutes could be flaky on small or lightly loaded clusters, even if remoteWrite is functioning. Consider loosening these thresholds or making them relative (e.g., > 0 but with a guard on at least a few KB of data), or making them configurable via test parameters so the test remains reliable across different environments.
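One way to make the thresholds tunable, sketched with hypothetical names; the HTTPStats field names and types are assumed from the review discussion (a connection counter plus a byte counter) and may differ from the actual struct:

// Overridable defaults instead of hard-coded literals in the assertion.
var (
	remoteWriteMinNewConnections int64 = 50         // extra connections expected after the config change
	remoteWriteMinNewBytes       int64 = 100 * 1024 // extra bytes expected after the config change
)

// remoteWriteTrafficObserved reports whether post-update stats grew enough
// relative to the pre-update baseline.
func remoteWriteTrafficObserved(pre, post HTTPStats) bool {
	return post.Connections > pre.Connections+remoteWriteMinNewConnections &&
		post.BytesReceived > pre.BytesReceived+remoteWriteMinNewBytes
}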
485-507: Reconsider wrapping CreateAndWaitUntilReady in an Eventually loop.

CreateAndWaitUntilReady typically encapsulates its own polling/wait logic. Wrapping it in an Eventually that repeatedly calls CreateAndWaitUntilReady can surface non-idempotent behavior (e.g., repeated create attempts hitting AlreadyExists) and makes the overall timeout behavior harder to reason about. It may be simpler and more robust to call CreateAndWaitUntilReady once and fail fast if it cannot bring the deployment up within its own timeout. Please double-check eco-goinfra's CreateAndWaitUntilReady semantics in your version; if it's already doing retries internally, simplifying this to a single call would be preferable.
509-526: Make service creation idempotent to better tolerate reruns.

service.NewBuilder(...).Create() will typically fail if the Service already exists (e.g., from a previous test run that didn't complete cleanup). Since you already handle the namespace-exists case, consider making the Service creation similarly tolerant: either by checking .Exists() first, handling AlreadyExists errors specifically, or using a CreateOrUpdate helper if eco-goinfra provides one. Please confirm the exact behavior of service.NewBuilder(...).Create() on existing Services in your eco-goinfra version and adjust the logic accordingly if it's not idempotent.
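For the AlreadyExists route, a hedged fragment; it assumes the builder's Create() surfaces the underlying API error in a form that apierrors.IsAlreadyExists (from k8s.io/apimachinery/pkg/api/errors) recognizes, which should be verified, and svcBuilder is a placeholder for the existing builder:

// Tolerate a Service left over from a previous, partially cleaned-up run.
if _, err := svcBuilder.Create(); err != nil {
	Expect(apierrors.IsAlreadyExists(err)).To(BeTrue(),
		fmt.Sprintf("failed to create remoteWrite test Service: %v", err))

	klog.V(rdscoreparams.RDSCoreLogLevel).Infof(
		"Service %s already exists, reusing it", remoteWriteTestServiceName)
}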
528-549: Clarify error message when no running pods are found.

The final error path returns fmt.Errorf("no pods found for deployment %s", deploymentName) even when pods exist but none are in Running state. It's a minor nit, but changing the message to "no running pods found for deployment ..." would better reflect the actual condition and make debugging easier when the pod is Pending/CrashLoopBackOff.

-	return nil, fmt.Errorf("no pods found for deployment %s", deploymentName)
+	return nil, fmt.Errorf("no running pods found for deployment %s", deploymentName)
37-38: Consider pinning the Python image tag for reproducible test behavior.

Using pythonHTTPServerImage = "...:latest" means the test's runtime environment can change as the image is updated, which can introduce subtle, time-dependent flakes. For long-lived system tests, it's usually safer to pin to a specific tag or digest (e.g., a known-good UBI9 Python image) and update intentionally when needed. Please check your organization's guidance on image pinning for test workloads (e.g., recommended tags or digests for UBI9 Python) and align this constant accordingly.

Also applies to: 390-445
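The pinned form would look like the following; the registry path is only an example and the digest is a placeholder, not a real reference:

// Pin by digest (or a specific tag) so test behavior only changes when this
// constant is deliberately updated. The digest below is a placeholder.
const pythonHTTPServerImage = "registry.access.redhat.com/ubi9/python-311@sha256:<known-good-digest>"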
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 48a07c58676204dbd6132535762f3f5bc67b0a0a and 27e7ab6.
📒 Files selected for processing (1)
tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go (1)
tests/system-tests/rdscore/internal/rdscoreparams/const.go (1)
RDSCoreLogLevel(8-8)
🔇 Additional comments (1)
tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go (1)
46-175: Overall remoteWrite validation flow and cleanup look solid.

The end-to-end flow (spinning up a dedicated in-cluster HTTP server, wiring it via remoteWrite, stabilizing pre-update stats, and then asserting post-update traffic) looks coherent and matches the PR objective. Good use of DeferCleanup for both the deployment/service and ConfigMap mutation to avoid leaving cluster state dirty, and nice use of Eventually(...).WithContext(ctx) around all the external calls.