feat: add SOS report script #2065

almaslennikov · 2026-01-23T12:03:16Z

No description provided.

scripts/sosreport/kubectl-netop_sosreport

heyvister1 · 2026-01-25T11:40:07Z

scripts/sosreport/kubectl-netop_sosreport

+    log_info "Collecting all component workloads..."
+
+    # Component definitions: name, label, type
+    declare -A components=(


I suggest to make this more generic. Maybe get these components out of release.yaml?

Identify which release is used by customer (helm list ...)

Fetch release.yaml respectively

Build components map dynamically

When I think about, I suggest building network-operator workloads and CRD maps per each network-operator release. I mean, generate these maps and embed them under SOS report script.
In this case customers can pull SOS report which is adapted to their network-operator release (also need to make sure its backward compatible with previous releases e.g. don't throw errors for missing CRDs/workload if they are not part of network-operator release).

We also need to make sure that SOS script can work in a disconnected environment, therefore pulling release.yaml from public GH could be blocked, so resource maps can be generated part of network-operator release.

Added a script that generates the list of CRDs and components to collect. If a CRD is not present, the script will not fail, so backward compatibility should be fine

scripts/sosreport/test-sosreport.sh

scripts/sosreport/kubectl-netop_sosreport

greptile-apps · 2026-01-26T08:46:16Z

Greptile Overview

Greptile Summary

Added comprehensive SOS report collection script for troubleshooting Network Operator deployments. The implementation includes a main collection script (kubectl-netop_sosreport), automated generation of CRD/component maps, comprehensive test suite, and integration with the build/release process.

Key additions:

Main script collects CRDs, operator resources, component workloads, node info, and diagnostic commands
Automatic namespace detection with multiple fallback methods
Auto-generation of CRD and component lists from manifests via generate-maps.sh
Test suite validating script syntax, functions, and coverage
Makefile targets for testing and map generation/verification
CI/CD integration to keep maps current during releases
Backward compatibility symlink for existing workflows

Implementation quality:

Proper error handling with logging and statistics tracking
Safety checks for potentially dangerous operations (e.g., rm -rf with variable validation)
Graceful degradation when resources are missing or inaccessible
Comprehensive documentation with usage examples

The PR addresses a real operational need for collecting diagnostic data in a standardized way.

Confidence Score: 4/5

Safe to merge with minor style improvements recommended but not blocking
Well-implemented feature with proper error handling, safety checks, tests, and documentation. The previous review comments have been addressed (safety checks exist for rm -rf). Minor style issues remain around shell command execution that were already flagged in previous threads but don't pose critical security risks in this context.
No files require special attention - all implementations are solid with appropriate safeguards

Important Files Changed

Filename	Overview
scripts/sosreport/kubectl-netop_sosreport	Added comprehensive SOS report collection script (1093 lines) with proper error handling and safety checks; has adequate guards for `rm -rf` but command execution via `sh -c` could be safer
scripts/sosreport/generate-maps.sh	Added script to generate CRD and component maps from manifests and embed them into the sosreport script - well structured with verification capability
scripts/sosreport/test-sosreport.sh	Added comprehensive test suite for sosreport script covering syntax, functions, CRDs, and diagnostic commands - good test coverage
Makefile	Added targets for sosreport: `test-sosreport`, `generate-sosreport-maps`, and `verify-sosreport-maps`; integrated sosreport tests into main test target
.github/workflows/release.yaml	Added `make generate-sosreport-maps` to release workflow and included sosreport scripts in git commits during releases

Sequence Diagram

sequenceDiagram
    participant User
    participant Script as kubectl-netop_sosreport
    participant Kubectl
    participant K8sCluster as Kubernetes Cluster
    participant FileSystem as File System

    User->>Script: Execute with options
    Script->>Script: parse_args()
    Script->>Kubectl: check_prerequisites()
    Kubectl->>K8sCluster: Test connectivity
    K8sCluster-->>Kubectl: Connection OK
    Kubectl-->>Script: Prerequisites verified
    
    Script->>Kubectl: detect_operator_namespace()
    Kubectl->>K8sCluster: Search for operator deployment
    K8sCluster-->>Kubectl: Namespace found
    Kubectl-->>Script: Operator namespace detected
    
    Script->>FileSystem: setup_output_dir()
    FileSystem-->>Script: Directory created
    
    Script->>Script: collect_metadata()
    Script->>Kubectl: Get cluster version, namespaces
    Kubectl->>K8sCluster: Query resources
    K8sCluster-->>Kubectl: Return data
    Kubectl-->>Script: Metadata collected
    Script->>FileSystem: Write metadata files
    
    Script->>Script: collect_crd_definitions()
    loop For each CRD
        Script->>Kubectl: Get CRD definition
        Kubectl->>K8sCluster: Query CRD
        K8sCluster-->>Kubectl: CRD YAML
        Kubectl-->>Script: CRD data
        Script->>FileSystem: Write CRD file
    end
    
    Script->>Script: collect_crd_instances()
    loop For each CR type
        Script->>Kubectl: Get CR instances
        Kubectl->>K8sCluster: Query instances
        K8sCluster-->>Kubectl: Instance data
        Kubectl-->>Script: Instances
        Script->>FileSystem: Write instance files
    end
    
    Script->>Script: collect_operator_resources()
    Script->>Kubectl: Get operator deployment, pods, RBAC
    Kubectl->>K8sCluster: Query operator resources
    K8sCluster-->>Kubectl: Resource data
    Kubectl-->>Script: Resources collected
    Script->>FileSystem: Write operator files
    
    Script->>Script: collect_all_components()
    loop For each component (15 total)
        Script->>Kubectl: Check if component exists
        Kubectl->>K8sCluster: Query component workload
        K8sCluster-->>Kubectl: Component found/not found
        alt Component exists
            Script->>Kubectl: Get pods, logs, services
            Kubectl->>K8sCluster: Query component details
            K8sCluster-->>Kubectl: Component data
            Kubectl-->>Script: Details collected
            Script->>FileSystem: Write component files
        end
    end
    
    Script->>Script: collect_diagnostic_commands()
    alt Not skipped
        Script->>Kubectl: Find OFED pods
        Kubectl->>K8sCluster: Query OFED pods
        K8sCluster-->>Kubectl: Pod list
        loop For each OFED pod
            Script->>Kubectl: kubectl exec diagnostic commands
            Kubectl->>K8sCluster: Execute in pod (lsmod, ibstat, etc)
            K8sCluster-->>Kubectl: Command output
            Kubectl-->>Script: Diagnostic data
            Script->>FileSystem: Write diagnostic files
        end
    end
    
    Script->>Script: collect_node_info()
    Script->>Kubectl: Get nodes with labels and resources
    Kubectl->>K8sCluster: Query nodes
    K8sCluster-->>Kubectl: Node data
    Kubectl-->>Script: Nodes collected
    Script->>FileSystem: Write node files
    
    Script->>Script: cleanup_empty_artifacts()
    Script->>FileSystem: Remove empty files/directories
    FileSystem-->>Script: Cleanup complete
    
    Script->>Script: generate_summary()
    Script->>FileSystem: Write diagnostic summary
    
    Script->>Script: create_archive()
    alt No compress flag not set
        Script->>FileSystem: Create tar.gz archive
        FileSystem-->>Script: Archive created
        Script->>FileSystem: Generate SHA256 checksum
        Script->>FileSystem: Remove temporary directory
    end
    
    Script-->>User: Collection complete with statistics

greptile-apps

_{7 files reviewed, 4 comments}

_{Edit Code Review Agent Settings | Greptile}

scripts/README.md

scripts/sosreport/kubectl-netop_sosreport

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>

greptile-apps

_{5 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

rollandf · 2026-01-29T07:46:09Z

scripts/sosreport/kubectl-netop_sosreport

+    fi
+
+    # Method 4: Check common namespaces for any network operator deployment
+    for ns in network-operator nvidia-network-operator openshift-network-operator kube-system; do


openshift-network-operator is used by OCP Cluster Network Operator:

https://docs.redhat.com/en/documentation/openshift_container_platform/4.8/html/networking/cluster-network-operator#nw-cluster-network-operator_cluster-network-operator

rollandf · 2026-01-29T07:46:54Z

scripts/sosreport/kubectl-netop_sosreport

+
+    # Collection info
+    cat > "$metadata_dir/collection-info.txt" <<EOF
+Network Operator SOS-Report


NVIDIA Network Operator SOS-Report

heyvister1 reviewed Jan 25, 2026

View reviewed changes

scripts/sosreport/kubectl-netop_sosreport Outdated Show resolved Hide resolved

heyvister1 reviewed Jan 25, 2026

View reviewed changes

scripts/sosreport/test-sosreport.sh Show resolved Hide resolved

rollandf reviewed Jan 26, 2026

View reviewed changes

scripts/sosreport/kubectl-netop_sosreport Outdated Show resolved Hide resolved

scripts/sosreport/kubectl-netop_sosreport Outdated Show resolved Hide resolved

rollandf reviewed Jan 26, 2026

View reviewed changes

scripts/sosreport/kubectl-netop_sosreport Outdated Show resolved Hide resolved

greptile-apps bot reviewed Jan 26, 2026

View reviewed changes

scripts/README.md Outdated Show resolved Hide resolved

scripts/sosreport/kubectl-netop_sosreport Outdated Show resolved Hide resolved

scripts/sosreport/kubectl-netop_sosreport Outdated Show resolved Hide resolved

scripts/sosreport/kubectl-netop_sosreport Outdated Show resolved Hide resolved

feat: add SOS report script

135c64f

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>

almaslennikov force-pushed the sos-report branch from 166eacd to 135c64f Compare January 28, 2026 11:00

greptile-apps bot reviewed Jan 28, 2026

View reviewed changes

rollandf reviewed Jan 29, 2026

View reviewed changes

rollandf approved these changes Jan 29, 2026

View reviewed changes

feat: add SOS report script #2065

Are you sure you want to change the base?

feat: add SOS report script #2065

Uh oh!

Conversation

almaslennikov commented Jan 23, 2026

Uh oh!

Uh oh!

heyvister1 Jan 25, 2026

Choose a reason for hiding this comment

Uh oh!

heyvister1 Jan 26, 2026

Choose a reason for hiding this comment

Uh oh!

almaslennikov Jan 28, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps bot commented Jan 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Overview

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

rollandf Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

rollandf Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

greptile-apps bot commented Jan 26, 2026 •

edited

Loading