@bazem8 bazem8 commented Aug 20, 2025

Summary by CodeRabbit

  • Tests
    • Added BDD-style Cluster Health and comprehensive Node Health test suites validating node discovery, readiness, pressure, resource, kubelet, and condition checks with ordered, failure-aware flows.
    • Emits JUnit-compatible reports, supports label-based selective runs, and produces failure-time diagnostic dumps for CI troubleshooting.
  • Documentation
    • Added README documenting the Node Health test suite, configuration defaults, running instructions, and customization guidance.
  • Chores
    • Exposed configurable thresholds and log verbosity to aid debugging and reporting.


@klaskosk klaskosk left a comment


a more general note on comments (since I know gemini has a tendency to add way too many 🙂): you only need them to explain why, not what. The linter will complain about missing comments on exported types/consts/vars/etc., so those comments may state the obvious, but inside a function you shouldn't need that many.

This is a good example, where the comment explains why we can set clusterName := strings.Split(cluster.Server, ".")[1]. Without the comment, it's not clear why that works unless you know the format of the input.
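For illustration, the pattern looks something like this (the URL format here is an assumption for the example):

```go
// Server URLs look like https://api.<cluster-name>.<base-domain>:6443,
// so the second dot-separated segment is the cluster name.
clusterName := strings.Split(cluster.Server, ".")[1]
```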


this should be typed as glog.Level but it's unused so that didn't show up

as for the actual number, the convention so far is to leave 100 for eco-goinfra logs and then use lower numbers for eco-gotests. In cnf/ran we usually do 90 on the test suite level and 80 on the cnf/ran level, although we always set verbosity to 100 in CI
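For reference, a typed declaration under that convention might look like this (a sketch matching the params.go described in this PR):

```go
import "github.com/golang/glog"

// LogLevel is the glog verbosity for this suite: 90 at the test-suite
// level per the cnf/ran convention; 100 stays reserved for eco-goinfra.
const LogLevel glog.Level = 90
```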


for Expect(err).ToNot(HaveOccurred(), ...), the error will be printed regardless, so we don't need to specify it in the format string. Although it's good to be aware that you can format like that
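In other words, this is enough, since Gomega includes the error value in the failure output on its own:

```go
// No custom message needed; the error is printed automatically on failure.
Expect(err).ToNot(HaveOccurred())
```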


like above, you can format directly in the matcher; there's no need for fmt.Sprintf
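A small before/after sketch of that (the variable names are made up):

```go
// Pre-formatting the message is redundant:
Expect(ready).To(BeTrue(), fmt.Sprintf("node %s is not ready", nodeName))

// Gomega treats the optional description as a format string itself:
Expect(ready).To(BeTrue(), "node %s is not ready", nodeName)
```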

Comment on lines +37 to +46

the eco-goinfra/pkg/nodes.Builder type has an IsReady method, so it would be good to reuse that method if possible

this is a good implementation of it though

Comment on lines 49 to 53

The Jenkins console will get all log messages either way, so the biggest advantage of GinkgoWriter is that its output shows up in the exported JUnit. But unless you have a specific reason for that, it'd be preferred to use glog.V(tsparams.LogLevel).Info() or Infof. Plus, glog adds the newline automatically.
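For example, assuming the tsparams package from this PR:

```go
// Preferred: logs at the suite's configured verbosity; glog appends
// the trailing newline automatically.
glog.V(tsparams.LogLevel).Infof("Node %s is Ready", node.Name)

// Rather than writing to GinkgoWriter directly:
fmt.Fprintf(GinkgoWriter, "Node %s is Ready\n", node.Name)
```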

@klaskosk

Another thing, since we just moved the repo: you'll need to rebase these changes (git pull --rebase upstream main) and update the imports to github.com/rh-ecosystem-edge/...

@bazem8 force-pushed the b-ecogo-health-test branch from 7909402 to 40e4f89 on August 27, 2025 at 13:21

coderabbitai bot commented Aug 27, 2025

Walkthrough

Adds two Ginkgo/Gomega test suites and supporting packages: a minimal "Cluster Health Check" healthcheck suite and a comprehensive "Node Health CNF RAN" suite. Introduces test params, initialization helpers, reporting hooks (JUnit/XML), multiple node-health validations, and documentation for the node-health tests.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Healthcheck suite wiring**<br>`tests/cnf/ran/healthcheck/healthcheck_suite_test.go` | New Ginkgo suite entrypoint that sets the JUnit report path via RANConfig, registers the fail handler, and runs specs filtered by `tsparams.LabelHealthCheckTestCases`. |
| **Healthcheck params**<br>`tests/cnf/ran/healthcheck/internal/tsparams/params.go` | Adds `LabelHealthCheckTestCases = "healthcheck"` and `LogLevel glog.Level = 90` constants for labeling and glog verbosity. |
| **Healthcheck test case**<br>`tests/cnf/ran/healthcheck/tests/healthcheck.go` | Adds a spec that lists cluster nodes via Spoke1APIClient, asserts the list is non-empty, and verifies each node has `NodeReady == ConditionTrue`, logging readiness. |
| **Node Health suite wiring**<br>`tests/cnf/ran/node-health/node_health_suite_test.go` | New Ginkgo suite entrypoint for node-health: sets JUnit/XML reporting paths, registers the fail handler, applies label filters from `nodehealthparams.Labels`, adds a JustAfterEach reporter dump on failure, and creates the XML report in AfterSuite. |
| **Node Health params & reporter config**<br>`tests/cnf/ran/node-health/internal/nodehealthparams/const.go` | Adds many constants (labels, thresholds, timeouts, selectors), a `Labels` slice, a `ReporterNamespacesToDump` map, and a `ReporterCRDsToDump` list used by reporter hooks. |
| **Node Health init helper**<br>`tests/cnf/ran/node-health/internal/nodehealthinittools/nodehealthinittools.go` | Exposes `APIClient *clients.Settings` initialized from internal `inittools.APIClient` for consumers of the package. |
| **Node Health tests (validation suite)**<br>`tests/cnf/ran/node-health/tests/node_health_validation.go` | Adds a comprehensive, ordered test suite that lists nodes in BeforeAll, then validates readiness, pressure, resource fields, kubelet pod status, and node conditions (including timestamps), and performs short continuous monitoring. Uses report IDs and logging. |
| **Documentation**<br>`tests/cnf/ran/node-health/README.md` | New README describing suite purpose, structure, categories, thresholds, running instructions, config guidance, dependencies, and contribution notes. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant Suite as Healthcheck Suite
  participant Config as RANConfig
  participant Spec as Healthcheck Spec
  participant API as Kubernetes API (Spoke1APIClient)

  Suite->>Config: GetJunitReportPath(currentFile)
  Config-->>Suite: JUnit path
  Suite->>Suite: RegisterFailHandler(Fail)\nRunSpecs(filter: "healthcheck")
  Spec->>API: List Nodes (context.TODO())
  API-->>Spec: NodeList or error
  alt error
    Spec->>Suite: Fail (error)
  else success
    Spec->>Spec: Expect nodeList non-empty
    loop each node
      Spec->>Spec: Check NodeReady == ConditionTrue
      alt not ready
        Spec->>Suite: Fail (node not ready)
      else ready
        Spec->>Spec: Log readiness
      end
    end
  end
```
```mermaid
sequenceDiagram
  autonumber
  participant Suite as NodeHealth Suite
  participant Init as InitTools/APIClient
  participant Reporter as Reporter/ReportXML
  participant API as Kubernetes API

  Note over Suite: Suite startup
  Suite->>Init: resolve currentFile, get APIClient
  Suite->>Reporter: configure JUnit/XML paths
  Suite->>Suite: RegisterFailHandler(Fail)\nRunSpecs(filter: nodehealthparams.Labels)

  Note over Suite,API: Test execution (high level)
  Suite->>API: BeforeAll: List Nodes -> nodesList
  Suite->>API: Ordered Contexts: Readiness, Pressure, Resources, Kubelet, Conditions, Monitoring
  API-->>Suite: Node and Pod data responses
  alt any spec failed
    Suite->>Reporter: JustAfterEach -> ReportIfFailed(CurrentSpecReport(), currentFile, namespaces, crds)
  end

  Note over Suite: AfterSuite
  Suite->>Reporter: reportxml.Create(report, GeneralConfig.GetReportPath(), GeneralConfig.TCPrefix)
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (3)
tests/cnf/ran/healthcheck/internal/tsparams/params.go (1)

7-7: Log-level convention ack (90 for suite-level; 100 reserved for eco-goinfra)

This aligns with the convention described earlier; keep as-is for cnf/ran suite.

tests/cnf/ran/healthcheck/tests/healthcheck.go (2)

30-31: Drop custom message for err matcher; Gomega will print the error

Cleaner failure output; previous review mentioned this as well.

-		Expect(err).ToNot(HaveOccurred(), "Failed to list cluster nodes")
+		Expect(err).ToNot(HaveOccurred())

34-51: Optional: reuse eco-goinfra’s Node builder IsReady helper

Reduces custom condition-walking and aligns with library semantics.

-		By("Verifying all nodes are in Ready state")
-		// Iterate through each node retrieved from the cluster.
-		for _, node := range nodeList {
-			// Initialize a flag to track if the node is Ready.
-			isReady := false
-			// Iterate through the conditions reported by the node.
-			for _, condition := range node.Status.Conditions {
-				// Check for the "Ready" condition type and ensure its status is "True".
-				if condition.Type == corev1.NodeReady && condition.Status == corev1.ConditionTrue {
-					isReady = true
-					break // Found the Ready condition, no need to check further conditions for this node.
-				}
-			}
-			// Assert that the node is ready. If not, the test will fail and report the node's name.
-			Expect(isReady).To(BeTrue(), "Node %s is not in Ready state. Current conditions: %+v", node.Name, node.Status.Conditions)
-			// Print a message to the glog output for successful nodes.
-			glog.V(tsparams.LogLevel).Infof("Node %s is Ready.", node.Name)
-		}
+		By("Verifying all nodes are in Ready state")
+		for _, node := range nodeList {
+			ready, err := nodes.NewBuilder(Spoke1APIClient, node.Name).IsReady(context.TODO())
+			Expect(err).ToNot(HaveOccurred())
+			Expect(ready).To(BeTrue(), "Node %s is not in Ready state. Current conditions: %+v", node.Name, node.Status.Conditions)
+			glog.V(tsparams.LogLevel).Infof("Node %s is Ready.", node.Name)
+		}
🧹 Nitpick comments (2)
tests/cnf/ran/healthcheck/internal/tsparams/params.go (1)

3-7: Polish exported comments to satisfy linters

Start comments with the identifier names.

-// This label will be used in the Jenkins job's TEST_TYPE parameter.
+// LabelHealthCheckTestCases is used in the Jenkins job's TEST_TYPE parameter.
 const LabelHealthCheckTestCases = "healthcheck"
 
-// defines the verbosity level for glog.
+// LogLevel defines the default verbosity level for glog.
 const LogLevel glog.Level = 90
tests/cnf/ran/healthcheck/tests/healthcheck.go (1)

25-31: Optional: use a timeout context for API calls

Prevents hangs on slow/unreachable clusters.

-		nodeList, err := nodes.List(Spoke1APIClient, context.TODO())
+		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+		defer cancel()
+		nodeList, err := nodes.List(Spoke1APIClient, ctx)

Add import if adopting:

-import (
+import (
 	"context"
+	"time"
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 8f38185 and 40e4f89.

📒 Files selected for processing (3)
  • tests/cnf/ran/healthcheck/healthcheck_suite_test.go (1 hunks)
  • tests/cnf/ran/healthcheck/internal/tsparams/params.go (1 hunks)
  • tests/cnf/ran/healthcheck/tests/healthcheck.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/cnf/ran/healthcheck/healthcheck_suite_test.go (1)
tests/cnf/ran/healthcheck/internal/tsparams/params.go (1)
  • LabelHealthCheckTestCases (4-4)
tests/cnf/ran/healthcheck/tests/healthcheck.go (2)
tests/cnf/ran/healthcheck/internal/tsparams/params.go (2)
  • LabelHealthCheckTestCases (4-4)
  • LogLevel (7-7)
tests/cnf/ran/internal/raninittools/raninittools.go (1)
  • Spoke1APIClient (14-14)
🔇 Additional comments (2)
tests/cnf/ran/healthcheck/healthcheck_suite_test.go (1)

21-22: RunSpecs invocation is consistent with repository conventions; no changes required

After reviewing the RunSpecs calls across all suite tests (e.g. vcore, spk, RDS, O-Cloud, RAN O-RAN, TALM, etc.), every test uses the pattern:

RunSpecs(t, "<Suite Name>", Label(...), reporterConfig)

Your Cluster Health Check suite does the same:

RunSpecs(t, "Cluster Health Check", Label(tsparams.LabelHealthCheckTestCases), reporterConfig)

This matches the established Ginkgo v2 usage in our codebase, so no update is necessary here.

tests/cnf/ran/healthcheck/tests/healthcheck.go (1)

53-53: LGTM: concise summary logging

Final readiness summary at the configured verbosity is helpful and consistent.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (10)
tests/cnf/ran/node-health/README.md (1)

27-38: Add language specifiers to fenced code blocks (markdownlint MD040)

Add a language to the directory tree and log examples to satisfy linters.

-```
+```text
 ...
-```
+```
-```
+```text
 INFO: Found 3 nodes for health monitoring
 ...
-```
+```

Also applies to: 141-150

tests/cnf/ran/node-health/internal/nodehealthinittools/nodehealthinittools.go (2)

13-17: Avoid recommending dot-imports; export a stable API instead

Dot-imports reduce clarity and can cause identifier collisions. The comment suggests dot-importing this package, but only APIClient is exported anyway.

-// init loads all variables automatically when this package is imported. Once package is imported a user has full
-// access to all vars within init function. It is recommended to import this package using dot import.
+// init exposes shared clients when this package is imported.
+// Import this package normally and reference nodehealthinittools.APIClient explicitly.

8-17: Remove or consolidate the nodehealthinittools re-export
No references to nodehealthinittools.APIClient were found—tests still dot-import tests/internal/inittools. Either switch tests to import this package for APIClient or drop it to eliminate redundancy.

tests/cnf/ran/node-health/tests/node_health_validation.go (3)

156-181: Disk “usage” test doesn’t assert usage; it only logs image sizes

Currently this only checks Capacity/Allocatable presence and logs image sizes. If thresholds are intended (per README), add real assertions from metrics or kubelet stats API.

I can wire this to metrics.k8s.io (if enabled) or node allocatable vs. requested via schedulable resources.
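As a rough sketch of the allocatable-based direction (DiskUsageThreshold is a hypothetical constant here, and capacity minus allocatable is only a proxy for real usage):

```go
// Proxy check: how much ephemeral storage the node reserves away from
// workloads, until kubelet stats or metrics.k8s.io are wired in.
capacity := nodeObj.Status.Capacity[corev1.ResourceEphemeralStorage]
allocatable := nodeObj.Status.Allocatable[corev1.ResourceEphemeralStorage]
Expect(capacity.Value()).To(BeNumerically(">", 0),
	"node %s reports no ephemeral-storage capacity", nodeObj.Name)

reservedPercent := 100 * (1 - float64(allocatable.Value())/float64(capacity.Value()))
// DiskUsageThreshold is a hypothetical percentage constant in nodehealthparams.
Expect(reservedPercent).To(BeNumerically("<", nodehealthparams.DiskUsageThreshold),
	"node %s reserves %.1f%% of ephemeral storage", nodeObj.Name, reservedPercent)
```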


372-411: Long-running monitoring in an It block; parameterize or gate

A hardcoded 2 minutes with 30s sleeps will slow suites and may hit CI timeouts.

  • Read duration/interval from env (e.g., ECO_NODE_MONITOR_DURATION, ECO_NODE_MONITOR_INTERVAL) with sensible defaults, as sketched after this list.
  • Skip or shorten when GeneralConfig.DryRun is true.
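A minimal sketch of that pattern (the env var names above are suggestions, not existing config):

```go
import (
	"os"
	"time"
)

// envDuration reads a time.Duration from the environment, falling back
// to the default when the variable is unset or unparsable.
func envDuration(key string, fallback time.Duration) time.Duration {
	if raw := os.Getenv(key); raw != "" {
		if d, err := time.ParseDuration(raw); err == nil {
			return d
		}
	}

	return fallback
}
```

The It block would then use monitorDuration := envDuration("ECO_NODE_MONITOR_DURATION", 2*time.Minute) instead of the hardcoded two minutes.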

3-20: Prefer avoiding dot-imports for inittools

Dot-importing inittools pulls many globals into this package’s scope. Prefer a named import and explicit references for readability.

- . "github.com/rh-ecosystem-edge/eco-gotests/tests/internal/inittools"
+ inittools "github.com/rh-ecosystem-edge/eco-gotests/tests/internal/inittools"

Then reference inittools.GeneralConfig and inittools.APIClient explicitly.

tests/cnf/ran/node-health/internal/nodehealthparams/const.go (4)

42-50: Use time.Duration for timeouts/intervals to avoid unit ambiguity.

Seconds as bare ints are error-prone; prefer typed durations.

Apply:

-	// KubeletHealthCheckTimeout is the timeout for kubelet health checks.
-	KubeletHealthCheckTimeout = 30
+	// KubeletHealthCheckTimeout is the timeout for kubelet health checks.
+	KubeletHealthCheckTimeout = 30 * time.Second
@@
-	// NodeConditionCheckTimeout is the timeout for node condition checks.
-	NodeConditionCheckTimeout = 60
+	// NodeConditionCheckTimeout is the timeout for node condition checks.
+	NodeConditionCheckTimeout = 60 * time.Second
@@
-	// ResourceCheckInterval is the interval between resource checks.
-	ResourceCheckInterval = 10
+	// ResourceCheckInterval is the interval between resource checks.
+	ResourceCheckInterval = 10 * time.Second

Also add the import:

import (
    k8sreporter "github.com/rh-ecosystem-edge/eco-goinfra/pkg/k8sreporter"
    corev1 "k8s.io/api/core/v1"
    "time"
)

51-59: Mirror Kubernetes constants instead of hardcoding strings.

Prevents drift and typos while keeping the same exported names.

Apply:

-	// ConditionTypeReadyString constant to fix linter warning.
-	ConditionTypeReadyString = "Ready"
-
-	// ConstantTrueString constant to fix linter warning.
-	ConstantTrueString = "True"
-
-	// ConstantFalseString constant to fix linter warning.
-	ConstantFalseString = "False"
+	// ConditionTypeReadyString mirrors corev1.NodeReady.
+	ConditionTypeReadyString = string(corev1.NodeReady)
+
+	// ConstantTrueString mirrors corev1.ConditionTrue.
+	ConstantTrueString = string(corev1.ConditionTrue)
+
+	// ConstantFalseString mirrors corev1.ConditionFalse.
+	ConstantFalseString = string(corev1.ConditionFalse)

1-2: Add a package doc comment for lint compliance.

Apply:

+// Package nodehealthparams centralizes labels, thresholds, and reporter settings
+// used by the Node Health CNF tests.
 package nodehealthparams

83-95: Reporter defaults are sensible.

Namespaces and CRDs chosen are useful for triage. Consider adding tuned (openshift-cluster-node-tuning-operator) if node tuning issues are in scope, but optional.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 896a4af and 3955b21.

📒 Files selected for processing (5)
  • tests/cnf/ran/node-health/README.md (1 hunks)
  • tests/cnf/ran/node-health/internal/nodehealthinittools/nodehealthinittools.go (1 hunks)
  • tests/cnf/ran/node-health/internal/nodehealthparams/const.go (1 hunks)
  • tests/cnf/ran/node-health/node_health_suite_test.go (1 hunks)
  • tests/cnf/ran/node-health/tests/node_health_validation.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/cnf/ran/node-health/node_health_suite_test.go (3)
tests/internal/config/config.go (1)
  • GeneralConfig (21-42)
tests/cnf/ran/node-health/internal/nodehealthparams/const.go (4)
  • Label (10-10)
  • Labels (74-81)
  • ReporterNamespacesToDump (84-88)
  • ReporterCRDsToDump (91-95)
tests/internal/reporter/reporter.go (1)
  • ReportIfFailed (51-57)
tests/cnf/ran/node-health/tests/node_health_validation.go (2)
tests/cnf/ran/node-health/internal/nodehealthparams/const.go (2)
  • KubeletNamespace (61-61)
  • KubeletPodSelector (64-64)
tests/cnf/ran/node-health/internal/nodehealthinittools/nodehealthinittools.go (1)
  • APIClient (10-10)
🪛 markdownlint-cli2 (0.17.2)
tests/cnf/ran/node-health/README.md

27-27: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


141-141: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (6)
tests/cnf/ran/node-health/README.md (1)

80-99: Documented thresholds/timeouts aren’t enforced by the code

Threshold and timeout constants are mentioned but not used in node_health_validation.go. Either wire them into the tests or remove from README to avoid drift.

Do you want me to wire these into the tests (CPU/memory/disk checks and polling intervals)?

tests/cnf/ran/node-health/node_health_suite_test.go (2)

27-35: LGTM: failure diagnostics and XML report hook

The JustAfterEach failure dump and ReportAfterSuite XML generation align with existing eco-gotests patterns.


20-25: No action required: RunSpecs invocation is correct
Passing Label(...) and the mutated reporterConfig into RunSpecs is consistent with all other suites in this repo and aligns with Ginkgo v2’s DSL (labels as []string, reporter overrides via ReporterConfig).

tests/cnf/ran/node-health/tests/node_health_validation.go (1)

32-40: LGTM: node discovery and preconditions

Listing nodes once in BeforeAll and asserting non-empty cluster is fine for this suite.

tests/cnf/ran/node-health/internal/nodehealthparams/const.go (2)

66-71: MCD constants look correct.

Label and container name match typical MCO deployments.


60-65: Validate OpenShift kubelet selector
In tests/cnf/ran/node-health/internal/nodehealthparams/const.go:60-65, KubeletPodSelector = "k8s-app=kubelet" won’t match any pods on OpenShift (kubelet runs as a systemd service). Manually confirm with:

oc get pods -A -l k8s-app=kubelet

If no pods are returned, replace pod-based checks with node conditions (Ready) or query systemd status via the machine-config-daemon in the openshift-machine-config-operator namespace.

Comment on lines +3 to +6
```go
import (
	"github.com/openshift-kni/k8sreporter"
	corev1 "k8s.io/api/core/v1"
)
```

⚠️ Potential issue

Fix moved module import path (build will fail after repo move).

Update k8sreporter to the new module path per PR comment about the repo move.

Apply:

-import (
-	"github.com/openshift-kni/k8sreporter"
-	corev1 "k8s.io/api/core/v1"
-)
+import (
+	k8sreporter "github.com/rh-ecosystem-edge/eco-goinfra/pkg/k8sreporter"
+	corev1 "k8s.io/api/core/v1"
+)
🤖 Prompt for AI Agents
In tests/cnf/ran/node-health/internal/nodehealthparams/const.go around lines
3–6, the import path for k8sreporter needs to be updated to the repo's new
module path after the move; replace "github.com/openshift-kni/k8sreporter" with
the new module path specified in the PR (adjusting any trailing /vX versioned
path if required), then run go mod tidy to update dependencies and fix any
import usages accordingly.

Comment on lines +8 to +14
```go
const (
	// Label is used to select tests for Node Health monitoring.
	Label = "node-health"

	// NodeHealthLogLevel configures logging level for Node Health related tests.
	NodeHealthLogLevel = 90
```


💡 Verification agent

🧩 Analysis chain

Confirm intended log level scale (90).

Value 90 is unusually high; ensure it matches your logging facade’s expectations (range, semantics). If unused, drop it.

Run:


🏁 Script executed:

```bash
#!/bin/bash
rg -n "NodeHealthLogLevel"
```

Length of output: 258


Drop the unused NodeHealthLogLevel constant. It’s declared at tests/cnf/ran/node-health/internal/nodehealthparams/const.go:13 but never referenced; remove it if there’s no intended usage.

🤖 Prompt for AI Agents
In tests/cnf/ran/node-health/internal/nodehealthparams/const.go around lines 8
to 14, the constant NodeHealthLogLevel is declared but unused; remove the
NodeHealthLogLevel constant definition from this file (and any trailing blank
line if needed) so only the used Label constant remains; if the log level was
intended to be configurable, instead add a TODO and a clear comment or implement
its usage where logging is configured.

Comment on lines +60 to +66
### 4. Kubelet Status Validation
- **Test ID**: `node-health-009` to `node-health-010`
- **Purpose**: Ensure kubelet service health essential for CNF pod scheduling
- **Labels**: `kubelet-pod-status`, `kubelet-service-check`
- **Description**: Verifies kubelet pods are running and ready on all nodes

### 5. Node Conditions Validation

⚠️ Potential issue

Kubelet “pod” wording is incorrect for OpenShift; tests will fail as written

On OpenShift/OKD, kubelet runs as a systemd service on the node, not as a Pod in kube-system. The current description (and matching tests) expect kubelet Pods and will consistently find none.

Consider updating both docs and tests to validate kubelet health via Node conditions (Ready) and/or kubelet health endpoint on the node, rather than listing Pods. If you keep a kubelet status section, reword to avoid “kubelet pods.”


Comment on lines +121 to +126
```bash
# Run tests with specific labels
go test ./tests/cnf/ran/node-health/ -v -ginkgo.label-filter="node-health-readiness"
go test ./tests/cnf/ran/node-health/ -v -ginkgo.label-filter="node-health-pressure"
go test ./tests/cnf/ran/node-health/ -v -ginkgo.label-filter="node-health-kubelet"
```

🛠️ Refactor suggestion

Fix label names in run examples

Examples use non-existent labels. Replace them with labels actually present in the code.

Apply:

-# Run tests with specific labels
-go test ./tests/cnf/ran/node-health/ -v -ginkgo.label-filter="node-health-readiness"
-go test ./tests/cnf/ran/node-health/ -v -ginkgo.label-filter="node-health-pressure"
-go test ./tests/cnf/ran/node-health/ -v -ginkgo.label-filter="node-health-kubelet"
+# Run tests with specific labels
+go test ./tests/cnf/ran/node-health/ -v -ginkgo.label-filter="node-readiness"
+go test ./tests/cnf/ran/node-health/ -v -ginkgo.label-filter="node-pressure"
+go test ./tests/cnf/ran/node-health/ -v -ginkgo.label-filter="kubelet-status"
🤖 Prompt for AI Agents
In tests/cnf/ran/node-health/README.md around lines 121 to 126, the example go
test commands reference nonexistent labels; update those three example lines to
use the actual label names used in the test suite: inspect the test files
(ginkgo Describe/It labels or any label metadata) to determine the correct label
keys/values (e.g., readiness, pressure, kubelet may be named differently) and
replace "node-health-readiness", "node-health-pressure", and
"node-health-kubelet" with the exact labels found; ensure the updated commands
preserve the same flags and quoting and run a quick local go test with one
corrected label to verify correctness.

Comment on lines +210 to +216
### Debug Mode
Enable verbose logging by setting the log level:

```bash
export GLOMAXLEVEL=5
go test ./tests/cnf/ran/node-health/ -v
```

⚠️ Potential issue

Wrong env var for verbosity

Use ECO_VERBOSE_LEVEL (per tests/internal/config/config.go), not GLOMAXLEVEL.

-export GLOMAXLEVEL=5
+export ECO_VERBOSE_LEVEL=5
 go test ./tests/cnf/ran/node-health/ -v

🤖 Prompt for AI Agents
In tests/cnf/ran/node-health/README.md around lines 210 to 216, the README shows
the wrong environment variable for test verbosity; update the instructions to
use ECO_VERBOSE_LEVEL (as defined in tests/internal/config/config.go) instead of
GLOMAXLEVEL and demonstrate exporting a numeric level (e.g., export
ECO_VERBOSE_LEVEL=5) before running the go test command to enable verbose
logging.

Comment on lines +231 to +299
Context("Kubelet Status Validation", Label("kubelet-status"), func() {
It("Verify kubelet pods are running on all nodes",
Label("kubelet-pod-status"),
reportxml.ID("node-health-009"),
func() {
for _, nodeBuilder := range nodesList {
nodeObj := nodeBuilder.Object
glog.Infof("Checking kubelet status on node: %s", nodeObj.Name)

// Check if kubelet pod is running on this node
podList, err := APIClient.CoreV1Interface.Pods(nodehealthparams.KubeletNamespace).List(
context.TODO(),
metav1.ListOptions{
FieldSelector: fmt.Sprintf("spec.nodeName=%s", nodeObj.Name),
LabelSelector: nodehealthparams.KubeletPodSelector,
},
)
Expect(err).NotTo(HaveOccurred(), "Failed to list kubelet pods on node %s", nodeObj.Name)

// Verify at least one kubelet pod is running
hasRunningKubelet := false
for _, pod := range podList.Items {
if pod.Status.Phase == corev1.PodRunning {
hasRunningKubelet = true
break
}
}

Expect(hasRunningKubelet).To(BeTrue(), "No running kubelet pod found on node %s", nodeObj.Name)
glog.Infof("Kubelet pod is running on node %s", nodeObj.Name)
}
})

It("Verify kubelet service is responding",
Label("kubelet-service-check"),
reportxml.ID("node-health-010"),
func() {
// This test would typically check kubelet health endpoint
// For now, we'll verify the kubelet pods are healthy
for _, nodeBuilder := range nodesList {
nodeObj := nodeBuilder.Object
glog.Infof("Checking kubelet service health on node: %s", nodeObj.Name)

// Check kubelet pod readiness
podList, err := APIClient.CoreV1Interface.Pods(nodehealthparams.KubeletNamespace).List(
context.TODO(),
metav1.ListOptions{
FieldSelector: fmt.Sprintf("spec.nodeName=%s", nodeObj.Name),
LabelSelector: nodehealthparams.KubeletPodSelector,
},
)
Expect(err).NotTo(HaveOccurred(), "Failed to list kubelet pods on node %s", nodeObj.Name)

for _, pod := range podList.Items {
// Check if pod is ready
isReady := false
for _, condition := range pod.Status.Conditions {
if condition.Type == corev1.PodReady {
isReady = condition.Status == corev1.ConditionTrue
break
}
}

Expect(isReady).To(BeTrue(), "Kubelet pod %s on node %s is not ready", pod.Name, nodeObj.Name)
glog.Infof("Kubelet pod %s on node %s is ready", pod.Name, nodeObj.Name)
}
}
})
})

⚠️ Potential issue

Kubelet checks assume kubelet Pods exist; this will fail on OpenShift

Kubelet isn’t a Pod on OpenShift/OKD. Listing kubelet Pods in kube-system with label k8s-app=kubelet will return nothing and the tests will fail.

Replace both kubelet tests with condition-based checks (or node heartbeat), e.g.:

- It("Verify kubelet pods are running on all nodes",
-   Label("kubelet-pod-status"),
-   reportxml.ID("node-health-009"),
-   func() {
-     ...
-     podList, err := APIClient.CoreV1Interface.Pods(nodehealthparams.KubeletNamespace).List(...)
-     ...
-     hasRunningKubelet := false
-     for _, pod := range podList.Items {
-       if pod.Status.Phase == corev1.PodRunning { hasRunningKubelet = true; break }
-     }
-     Expect(hasRunningKubelet).To(BeTrue(), "No running kubelet pod found on node %s", nodeObj.Name)
-   })
+ It("Verify kubelet is reporting Ready on all nodes",
+   Label("kubelet-service-check"),
+   reportxml.ID("node-health-009"),
+   func() {
+     for _, nodeBuilder := range nodesList {
+       nodeObj := nodeBuilder.Object
+       var readyCond *corev1.NodeCondition
+       for i := range nodeObj.Status.Conditions {
+         if nodeObj.Status.Conditions[i].Type == corev1.NodeReady {
+           readyCond = &nodeObj.Status.Conditions[i]
+           break
+         }
+       }
+       Expect(readyCond).NotTo(BeNil(), "Node %s has no Ready condition", nodeObj.Name)
+       Expect(readyCond.Status).To(Equal(corev1.ConditionTrue), "Node %s kubelet is not Ready", nodeObj.Name)
+     }
+   })
 
- It("Verify kubelet service is responding",
-   Label("kubelet-service-check"),
-   reportxml.ID("node-health-010"),
-   func() {
-     // Lists kubelet pods and checks readiness
-     ...
-   })
+ // Optionally, add a heartbeat freshness check by inspecting LastHeartbeatTime if desired.

Comment on lines +316 to +320
```go
if condition.Type == corev1.NodeNetworkUnavailable &&
	strings.Contains(nodeObj.Name, "master") {
	// Master nodes might have NetworkUnavailable=True during initial setup
	continue
}
```

🛠️ Refactor suggestion

Don’t infer control-plane nodes from name substrings

Checking strings.Contains(nodeObj.Name, "master") is brittle. Use role labels.

- if condition.Type == corev1.NodeNetworkUnavailable &&
-   strings.Contains(nodeObj.Name, "master") {
+ if condition.Type == corev1.NodeNetworkUnavailable && isControlPlaneNode(nodeObj) {
    continue
 }

Add helper (outside this hunk):

```go
func isControlPlaneNode(n *corev1.Node) bool {
	if n == nil {
		return false
	}

	_, hasCP := n.Labels["node-role.kubernetes.io/control-plane"]
	_, hasMaster := n.Labels["node-role.kubernetes.io/master"]

	return hasCP || hasMaster
}
```
🤖 Prompt for AI Agents
In tests/cnf/ran/node-health/tests/node_health_validation.go around lines 316 to
320, the code infers control-plane nodes by checking if nodeObj.Name contains
"master", which is brittle; add a helper function (placed outside this hunk)
named isControlPlaneNode that returns true if the node has either
node-role.kubernetes.io/control-plane or node-role.kubernetes.io/master label,
then replace the strings.Contains(nodeObj.Name, "master") check with a call to
isControlPlaneNode(&nodeObj) so the NetworkUnavailable exception is based on
role labels rather than name substrings.

Comment on lines +343 to +370
It("Verify node last transition times are recent",
Label("node-transition-time-check"),
reportxml.ID("node-health-012"),
func() {
for _, nodeBuilder := range nodesList {
nodeObj := nodeBuilder.Object
glog.Infof("Checking transition times on node: %s", nodeObj.Name)

// Check if node conditions have recent transition times
for _, condition := range nodeObj.Status.Conditions {
if condition.LastTransitionTime.IsZero() {
continue // Skip conditions without transition time
}

// Check if transition time is within reasonable bounds (not too old)
timeSinceTransition := time.Since(condition.LastTransitionTime.Time)
Expect(timeSinceTransition).To(BeNumerically("<", 24*time.Hour),
"Node %s condition %s has very old transition time: %s",
nodeObj.Name, condition.Type, condition.LastTransitionTime)

glog.Infof("Node %s condition %s transition time: %s (age: %v)",
nodeObj.Name, condition.Type, condition.LastTransitionTime, timeSinceTransition)
}

glog.Infof("Transition times on node %s are recent", nodeObj.Name)
}
})
})

🛠️ Refactor suggestion

Transition time < 24h will flake on stable clusters

Many clusters have not transitioned conditions for days/weeks. The 24h bound will cause false failures.

- Expect(timeSinceTransition).To(BeNumerically("<", 24*time.Hour),
+ Expect(timeSinceTransition).To(BeNumerically("<", 30*24*time.Hour),
   "Node %s condition %s has very old transition time: %s",
   nodeObj.Name, condition.Type, condition.LastTransitionTime)

Optionally, make this threshold configurable (env or nodehealthparams) and only warn/log when exceeded.
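The warn-only variant could look like this (transitionThreshold assumed to come from env or nodehealthparams):

```go
// Log a warning instead of failing the spec when a condition is stale.
if timeSinceTransition > transitionThreshold {
	glog.Warningf("node %s condition %s last transitioned %v ago (threshold %v)",
		nodeObj.Name, condition.Type, timeSinceTransition, transitionThreshold)
}
```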

🤖 Prompt for AI Agents
In tests/cnf/ran/node-health/tests/node_health_validation.go around lines 343 to
370, the hardcoded 24-hour threshold for LastTransitionTime causes flakes
because many clusters don't transition conditions within 24h; replace the
hardcoded value with a configurable duration and stop failing the test on
exceedance—log or warn instead. Concretely: introduce a configurable threshold
(env var like NODE_HEALTH_TRANSITION_THRESHOLD or a nodeHealthParams field)
parsed as a time.Duration with a sensible default (e.g., 7*24*time.Hour),
replace the BeNumerically("<", 24*time.Hour) Expect with a conditional that if
timeSinceTransition > threshold then glog.Warningf (or Ginkgo Log/Warn) with the
node, condition, transition time and age (do not call Expect to fail), otherwise
keep logging the recent transition; ensure parsing the env var falls back to the
default and add a short comment explaining the change.
