Skip to content

Conversation

@mramdgh
Copy link
Contributor

@mramdgh mramdgh commented Nov 3, 2025

Summary

Comprehensive CI/CD improvements including QEMU-based integration testing, unit test coverage, and test execution metrics. Replaces direct deployment testing with isolated VM-based
tests and adds unit tests to the CI pipeline.

Motivation and Context

The previous CI approach directly modified the runner system, which was risky and could cause failures. This PR introduces isolated QEMU VM testing for safer integration tests and
adds unit test coverage to catch regressions early.

Changes Made

QEMU Integration Testing

  • Created qemu-disk-test.sh script for VM-based testing with 8 NVMe devices
  • Automated VM lifecycle (create, test, cleanup) for each test run
  • Implemented Ubuntu cloud image caching to speed up CI runs
  • Added structured YAML result parsing with yq for pass/fail validation
  • Accepts multiple config files for comprehensive test coverage

Test Metrics & Reporting

  • Added overall duration tracking to bloom test command
  • Includes both millisecond precision (duration_ms) and human-readable format
  • Per-config duration tracking for performance monitoring
  • Changed completed steps output to use step IDs instead of names

Unit Tests

  • Added unit test execution to CI workflow (runs before integration tests)
  • Fixed 6 failing unit tests to work in containerized environments:
    • TestSetupAndCheckRocmStep - Check Skip() before Action()
    • TestCreateMetalLBConfigStep - Check Skip() before Action()
    • TestSetupKubeConfig - Check Skip() before Action()
    • TestPrepareLonghornDisksStep - Use NO_DISKS_FOR_CLUSTER flag
    • TestInotifyInstancesStep - Skip when not running as root
    • TestGenerateNodeLabels - Remove subtests requiring /etc/rancher paths
  • Adjusted TestOIDCConfigTemplate expectations to match actual code

Installation Improvements

  • Moved CreateBloomConfigMapStep after WaitForClusterReady for better reliability
  • Removed unnecessary 10-second sleep from CreateBloomConfigMapStep
  • Added skip logic to UpdateModprobeStep for non-GPU nodes

Type of Change

  • New feature (QEMU testing infrastructure)
  • Bug fix (unit test failures)
  • CI/CD improvements
  • Performance improvement (image caching)

Testing

  • Unit tests pass locally
  • Integration tests run in QEMU VM
  • CI workflow validates test results

Test Coverage

All unit tests in ./pkg now pass without requiring root privileges or modifying the host system.

Testing Instructions

  1. Run unit tests: go test ./pkg
  2. Run QEMU integration test:
    bash .github/workflows/qemu-disk-test.sh \
      nvme-test-vm \
      ./cluster-bloom \
      ./config1.yaml \
      ./config2.yaml
  3. Check results: cat nvme-test-vm-test-results.yaml

Breaking Changes

None

Checklist

  • Code follows project style guidelines
  • Self-review performed
  • Tests added for new features
  • All tests pass locally
  • CI workflow updated

Performance Impact

  • Image caching: Reduces QEMU test startup from ~5 minutes to ~30 seconds
  • Test duration metrics: Now tracked for performance regression detection

CI/CD Changes

  • New step: "Run unit tests" (runs before integration tests)
  • Updated: "QEMU Disk Test" replaces direct deployment
  • Added: Result validation step with yq parsing
  • Improved: Test failure detection and reporting

Additional Notes

The QEMU-based approach provides true isolation for integration testing, allowing safe execution on shared CI runners without risk of system modification.

      Add duration metrics to test command overall_summary output, including
      both millisecond precision (duration_ms) and human-readable format.

      Also improve installation flow by:
      - Moving CreateBloomConfigMapStep after WaitForClusterReady to ensure
        cluster is fully ready before creating the ConfigMap
      - Removing unnecessary 10-second sleep from CreateBloomConfigMapStep
        since it now runs after explicit cluster ready check
      - Adding skip logic to UpdateModprobeStep for non-GPU nodes to avoid
        unnecessary operations
      Replace the direct deployment test with QEMU VM-based testing to avoid
      modifying the CI host system. The new approach:

      - Creates isolated QEMU VM with 8 NVMe devices for realistic testing
      - Accepts bloom binary path and multiple config file paths as arguments
      - Runs bloom test inside the VM and copies results back to host
      - Automatically cleans up VM after test completion
      - Moved qemu-disk-test.sh to .github/workflows for CI integration
      - Updated run-tests.yml to use QEMU test instead of direct deployment

      This allows safe integration testing without system-level changes to the
      CI runner, while still validating disk detection and configuration steps.
@mramdgh mramdgh marked this pull request as ready for review November 5, 2025 07:37
@mramdgh mramdgh requested a review from a team November 5, 2025 07:37
Copy link
Contributor

@punasusi punasusi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@mramdgh mramdgh merged commit 3a47ec4 into main Nov 5, 2025
3 checks passed
@mramdgh mramdgh deleted the qemu-ci branch November 5, 2025 08:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants