
Conversation

@memodi
Member

@memodi memodi commented Oct 14, 2025

Description

NETOBSERV-2443: fix bug, improve cleanup and file writing

With Claude's help, I was able to trace the flakiness back to pty handling and made a number of improvements, listed below:

  • Complete output capture: all lines are captured, with no race conditions.
  • Proper timeout handling: API calls respect the polling context timeouts.
  • Reliable cleanup: the cleanup ignores SIGHUP so deletion completes.
  • Absolute paths are used for file reads.
  • The output/flow directory is cleaned up after every test, so the next test won't read from a stale file.
  • Output files for the collector and the cleanup command are named using the OCP-XXXX and test-label combination.

After several runs, the CLI tests are now much more stable.
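For illustration, here is a minimal Go sketch of the cleanup idea described above; the helper names, file layout, and `oc` invocation are assumptions, not the repo's actual e2e helpers.

```go
// Minimal sketch, assuming hypothetical helper names and paths; the real
// e2e helpers live in e2e/common.go and may differ.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"os/signal"
	"path/filepath"
	"syscall"
)

// runCleanup ignores SIGHUP while the cleanup command runs, so the deletion
// of the capture daemonset completes even if the spawning pty goes away.
// The output file is named from the OCP test ID and test label so that
// consecutive tests never share a log file.
func runCleanup(testID, label string) error {
	signal.Ignore(syscall.SIGHUP)

	logPath := filepath.Join(os.TempDir(), fmt.Sprintf("%s-%s-cleanup.log", testID, label))
	logFile, err := os.Create(logPath)
	if err != nil {
		return err
	}
	defer logFile.Close()

	cmd := exec.Command("oc", "netobserv", "cleanup") // assumed CLI invocation
	cmd.Stdout = logFile
	cmd.Stderr = logFile
	return cmd.Run()
}

// removeFlowOutput wipes the flow output directory (via an absolute path)
// after each test so the next test cannot read a stale capture file.
func removeFlowOutput(outputDir string) error {
	absDir, err := filepath.Abs(outputDir)
	if err != nil {
		return err
	}
	return os.RemoveAll(absDir)
}

func main() {
	if err := runCleanup("OCP-XXXX", "flow-capture"); err != nil {
		fmt.Fprintln(os.Stderr, "cleanup failed:", err)
	}
	if err := removeFlowOutput("output/flow"); err != nil {
		fmt.Fprintln(os.Stderr, "output cleanup failed:", err)
	}
}
```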

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user-facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@memodi memodi added the no-qe label Oct 14, 2025
@codecov

codecov bot commented Oct 14, 2025

Codecov Report

❌ Patch coverage is 0% with 102 lines in your changes missing coverage. Please review.
✅ Project coverage is 13.54%. Comparing base (1654142) to head (ceb7f6a).
⚠️ Report is 6 commits behind head on main.

Files with missing lines            Patch %   Lines
e2e/common.go                       0.00%     64 Missing ⚠️
e2e/integration-tests/cli.go        0.00%     26 Missing ⚠️
e2e/integration-tests/cluster.go    0.00%     12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #404      +/-   ##
==========================================
- Coverage   13.84%   13.54%   -0.30%     
==========================================
  Files          18       18              
  Lines        2731     2326     -405     
==========================================
- Hits          378      315      -63     
+ Misses       2329     1987     -342     
  Partials       24       24              
Flag        Coverage Δ
unittests   13.54% <0.00%> (-0.30%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines            Coverage Δ
e2e/integration-tests/cluster.go    0.00% <0.00%> (ø)
e2e/integration-tests/cli.go        0.00% <0.00%> (ø)
e2e/common.go                       0.00% <0.00%> (ø)

... and 13 files with indirect coverage changes


@memodi
Member Author

memodi commented Oct 14, 2025

/test ?

@openshift-ci

openshift-ci bot commented Oct 14, 2025

@memodi: The following commands are available to trigger required jobs:

/test images
/test integration-tests

Use /test all to run all jobs.

Details

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@memodi
Member Author

memodi commented Oct 14, 2025

/test integration-tests

@memodi memodi requested a review from jpinsonneau October 16, 2025 21:00
@memodi
Member Author

memodi commented Oct 17, 2025

/test integration-tests

@memodi
Member Author

memodi commented Oct 17, 2025

Integration tests are failing because, for some reason, the CI cluster is taking too long to pull images.

/test integration-tests

@memodi
Member Author

memodi commented Oct 21, 2025

/test integration-tests

@memodi
Member Author

memodi commented Oct 24, 2025

/test integration-tests

memodi and others added 4 commits October 24, 2025 11:45
- Increase waitDaemonset timeout from 50s to 5 minutes (30×10s)
  * CI environments often have slow image pulls
  * Previous timeout was too aggressive for registry operations

- Add comprehensive diagnostic output on pod startup failure:
  * Pod status with node placement (get pods -o wide)
  * Recent events to identify ImagePullBackOff, etc.
  * Pod event details from describe output
  * Daemonset logs if containers started

This helps diagnose ContainerCreating issues in CI where pods
fail to start due to image pull problems or resource constraints.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
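A hedged sketch of what a wait-with-diagnostics loop like the one described in this commit can look like; the function name, namespace, object names, and jsonpath are assumptions, not the repo's actual code.

```go
// Illustrative only: poll the daemonset for up to 5 minutes (30 x 10s) and
// dump pod status, events, and describe output if it never becomes ready.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

// run executes an oc subcommand and returns its combined output, ignoring
// errors because the output is only used for diagnostics here.
func run(args ...string) string {
	out, _ := exec.Command("oc", args...).CombinedOutput()
	return string(out)
}

func waitDaemonset(ns, name string) error {
	const attempts = 30
	const interval = 10 * time.Second // generous for slow CI image pulls
	for i := 0; i < attempts; i++ {
		status := run("get", "daemonset", name, "-n", ns,
			"-o", "jsonpath={.status.numberReady}/{.status.desiredNumberScheduled}")
		var ready, desired int
		if _, err := fmt.Sscanf(status, "%d/%d", &ready, &desired); err == nil && desired > 0 && ready == desired {
			return nil
		}
		time.Sleep(interval)
	}
	// Diagnostics on failure: node placement, recent events, pod details.
	fmt.Fprintln(os.Stderr, run("get", "pods", "-n", ns, "-o", "wide"))
	fmt.Fprintln(os.Stderr, run("get", "events", "-n", ns, "--sort-by=.lastTimestamp"))
	fmt.Fprintln(os.Stderr, run("describe", "daemonset", name, "-n", ns))
	return fmt.Errorf("daemonset %s/%s not ready after %d attempts", ns, name, attempts)
}

func main() {
	if err := waitDaemonset("netobserv-cli", "netobserv-cli"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```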
CI runs showed 4/6 pods ready with 5 minute timeout, indicating
image pulls need more time. Increasing to 10 minutes (60×10s) to
accommodate slower CI registry pulls and pod scheduling.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
In E2E test mode, the bash script's waitDaemonset() could exit with
error after 10 minutes while the Go test's isDaemonsetReady() was
still polling. This created a race where:

1. Go test calls StartCommand() which runs bash script async
2. Bash script calls waitDaemonset() and waits 10 mins
3. Go test calls isDaemonsetReady() and waits 10 mins
4. If bash times out first, it calls exit 1, killing the process
5. Go test is left polling a dead command

Solution: When isE2E=true, skip the bash-level wait since the Go
test framework handles pod readiness checking via isDaemonsetReady().

For manual CLI usage (isE2E=false), the wait still runs as before
to provide user feedback.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
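A rough Go-side sketch of the flow after this change; startCLICapture, isDaemonsetReady, the isE2E environment variable, and the namespace are illustrative assumptions that mirror the description above rather than the repo's exact signatures. The point is that only the Go poller owns the wait, so a bash-level timeout can no longer kill the process mid-poll.

```go
// Hedged sketch: the test launches the capture asynchronously and polls the
// daemonset itself under a single context timeout.
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func startCLICapture(ctx context.Context) (*exec.Cmd, error) {
	cmd := exec.CommandContext(ctx, "oc", "netobserv", "flows", "--background")
	// Assumed mechanism: the script checks an env var and skips its own
	// waitDaemonset() loop when driven by the e2e framework.
	cmd.Env = append(cmd.Environ(), "isE2E=true")
	return cmd, cmd.Start()
}

func isDaemonsetReady(ctx context.Context) error {
	for {
		out, _ := exec.CommandContext(ctx, "oc", "get", "daemonset", "netobserv-cli",
			"-n", "netobserv-cli", "-o", "jsonpath={.status.numberReady}").CombinedOutput()
		if s := string(out); s != "" && s != "0" {
			return nil // at least one agent pod is ready; real code checks all of them
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(10 * time.Second):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()
	if _, err := startCLICapture(ctx); err != nil {
		fmt.Println("start failed:", err)
		return
	}
	fmt.Println("ready:", isDaemonsetReady(ctx))
}
```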
Tests were failing because:
1. Commands ran with --max-time=1m in foreground mode
2. After 1 minute, capture finished and auto-cleanup ran
3. Cleanup deleted the daemonset
4. isDaemonsetReady() was polling for a deleted daemonset
5. Test failed with context deadline exceeded

Using --background mode prevents automatic cleanup when the
capture finishes, allowing the test to verify daemonset
privilege settings before cleanup runs.
Also, check that the CLI is running instead of just the daemonset.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
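For example, with the capture kept alive via --background, the agent daemonset is still present, so the test can inspect its privileged setting before any cleanup runs. The jsonpath and object names below are assumptions used only to illustrate the check.

```go
// Illustrative check only: query the daemonset's securityContext while the
// background capture keeps it from being cleaned up.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	out, err := exec.Command("oc", "get", "daemonset", "netobserv-cli", "-n", "netobserv-cli",
		"-o", "jsonpath={.spec.template.spec.containers[0].securityContext.privileged}").CombinedOutput()
	if err != nil {
		fmt.Println("query failed:", err, string(out))
		return
	}
	fmt.Println("agent privileged:", strings.TrimSpace(string(out)) == "true")
}
```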
@memodi
Member Author

memodi commented Nov 3, 2025

/test integration-tests

@memodi
Member Author

memodi commented Nov 4, 2025

/test integration-tests

@memodi
Member Author

memodi commented Nov 5, 2025

/test integration-tests

@memodi
Member Author

memodi commented Nov 5, 2025

/needs-review

@memodi
Member Author

memodi commented Nov 5, 2025

@jpinsonneau - any idea why e2e tests are failing?

@memodi memodi added the needs-review Tells that the PR needs a review label Nov 5, 2025
@jpinsonneau
Contributor

jpinsonneau commented Nov 6, 2025

@jpinsonneau - any idea why e2e tests are failing?

If we expect "command terminated" in the output, we should rely on the RunCommandAndTerminate function.

I'm having issues with my local kind cluster, so I can't test that right now. Trying to fix that ASAP.

working locally: ceb7f6a
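A rough sketch of how a test might use such a helper; the signature and behavior of RunCommandAndTerminate are assumed here, not taken from the repo.

```go
// Hedged sketch only: start the capture, let it run briefly, terminate it,
// and assert on the "command terminated" line from the collected output
// instead of racing a background process.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"syscall"
	"time"
)

// runCommandAndTerminate is a stand-in for the repo's RunCommandAndTerminate.
func runCommandAndTerminate(name string, args ...string) (string, error) {
	cmd := exec.Command(name, args...)
	var out strings.Builder
	cmd.Stdout = &out
	cmd.Stderr = &out
	if err := cmd.Start(); err != nil {
		return "", err
	}
	time.Sleep(30 * time.Second)            // let the capture collect some flows
	_ = cmd.Process.Signal(syscall.SIGTERM) // ask the CLI to shut down cleanly
	_ = cmd.Wait()                          // reap; a non-zero exit is expected here
	return out.String(), nil
}

func main() {
	output, err := runCommandAndTerminate("oc", "netobserv", "flows")
	if err != nil {
		fmt.Println("start failed:", err)
		return
	}
	fmt.Println("terminated message seen:", strings.Contains(output, "command terminated"))
}
```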

@Amoghrd
Member

Amoghrd commented Nov 7, 2025

LGTM
Will wait for @oliver-smakal @kapjain-rh to review as well

@kapjain-rh
Member

/lgtm

@openshift-ci

openshift-ci bot commented Nov 7, 2025

@kapjain-rh: changing LGTM is restricted to collaborators

Details

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@oliver-smakal

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Nov 10, 2025
@memodi memodi removed the needs-review Tells that the PR needs a review label Nov 10, 2025
@memodi
Member Author

memodi commented Nov 10, 2025

@jpinsonneau - is this okay to merge? Not sure if you had chance to review.

Contributor

@jpinsonneau jpinsonneau left a comment


That looks good to me! Thanks @memodi!

@openshift-ci

openshift-ci bot commented Nov 14, 2025

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details
Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit bd46a3e into netobserv:main Nov 14, 2025
12 checks passed
