Skip to content

Conversation

@almaslennikov
Copy link
Collaborator

No description provided.

log_info "Collecting all component workloads..."

# Component definitions: name, label, type
declare -A components=(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to make this more generic. Maybe get these components out of release.yaml?

  1. Identify which release is used by customer (helm list ...)
  2. Fetch release.yaml respectively
  3. Build components map dynamically

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I think about, I suggest building network-operator workloads and CRD maps per each network-operator release. I mean, generate these maps and embed them under SOS report script.
In this case customers can pull SOS report which is adapted to their network-operator release (also need to make sure its backward compatible with previous releases e.g. don't throw errors for missing CRDs/workload if they are not part of network-operator release).

We also need to make sure that SOS script can work in a disconnected environment, therefore pulling release.yaml from public GH could be blocked, so resource maps can be generated part of network-operator release.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a script that generates the list of CRDs and components to collect. If a CRD is not present, the script will not fail, so backward compatibility should be fine

@greptile-apps
Copy link

greptile-apps bot commented Jan 26, 2026

Greptile Overview

Greptile Summary

Added comprehensive SOS report collection script for troubleshooting Network Operator deployments. The implementation includes a main collection script (kubectl-netop_sosreport), automated generation of CRD/component maps, comprehensive test suite, and integration with the build/release process.

Key additions:

  • Main script collects CRDs, operator resources, component workloads, node info, and diagnostic commands
  • Automatic namespace detection with multiple fallback methods
  • Auto-generation of CRD and component lists from manifests via generate-maps.sh
  • Test suite validating script syntax, functions, and coverage
  • Makefile targets for testing and map generation/verification
  • CI/CD integration to keep maps current during releases
  • Backward compatibility symlink for existing workflows

Implementation quality:

  • Proper error handling with logging and statistics tracking
  • Safety checks for potentially dangerous operations (e.g., rm -rf with variable validation)
  • Graceful degradation when resources are missing or inaccessible
  • Comprehensive documentation with usage examples

The PR addresses a real operational need for collecting diagnostic data in a standardized way.

Confidence Score: 4/5

  • Safe to merge with minor style improvements recommended but not blocking
  • Well-implemented feature with proper error handling, safety checks, tests, and documentation. The previous review comments have been addressed (safety checks exist for rm -rf). Minor style issues remain around shell command execution that were already flagged in previous threads but don't pose critical security risks in this context.
  • No files require special attention - all implementations are solid with appropriate safeguards

Important Files Changed

Filename Overview
scripts/sosreport/kubectl-netop_sosreport Added comprehensive SOS report collection script (1093 lines) with proper error handling and safety checks; has adequate guards for rm -rf but command execution via sh -c could be safer
scripts/sosreport/generate-maps.sh Added script to generate CRD and component maps from manifests and embed them into the sosreport script - well structured with verification capability
scripts/sosreport/test-sosreport.sh Added comprehensive test suite for sosreport script covering syntax, functions, CRDs, and diagnostic commands - good test coverage
Makefile Added targets for sosreport: test-sosreport, generate-sosreport-maps, and verify-sosreport-maps; integrated sosreport tests into main test target
.github/workflows/release.yaml Added make generate-sosreport-maps to release workflow and included sosreport scripts in git commits during releases

Sequence Diagram

sequenceDiagram
    participant User
    participant Script as kubectl-netop_sosreport
    participant Kubectl
    participant K8sCluster as Kubernetes Cluster
    participant FileSystem as File System

    User->>Script: Execute with options
    Script->>Script: parse_args()
    Script->>Kubectl: check_prerequisites()
    Kubectl->>K8sCluster: Test connectivity
    K8sCluster-->>Kubectl: Connection OK
    Kubectl-->>Script: Prerequisites verified
    
    Script->>Kubectl: detect_operator_namespace()
    Kubectl->>K8sCluster: Search for operator deployment
    K8sCluster-->>Kubectl: Namespace found
    Kubectl-->>Script: Operator namespace detected
    
    Script->>FileSystem: setup_output_dir()
    FileSystem-->>Script: Directory created
    
    Script->>Script: collect_metadata()
    Script->>Kubectl: Get cluster version, namespaces
    Kubectl->>K8sCluster: Query resources
    K8sCluster-->>Kubectl: Return data
    Kubectl-->>Script: Metadata collected
    Script->>FileSystem: Write metadata files
    
    Script->>Script: collect_crd_definitions()
    loop For each CRD
        Script->>Kubectl: Get CRD definition
        Kubectl->>K8sCluster: Query CRD
        K8sCluster-->>Kubectl: CRD YAML
        Kubectl-->>Script: CRD data
        Script->>FileSystem: Write CRD file
    end
    
    Script->>Script: collect_crd_instances()
    loop For each CR type
        Script->>Kubectl: Get CR instances
        Kubectl->>K8sCluster: Query instances
        K8sCluster-->>Kubectl: Instance data
        Kubectl-->>Script: Instances
        Script->>FileSystem: Write instance files
    end
    
    Script->>Script: collect_operator_resources()
    Script->>Kubectl: Get operator deployment, pods, RBAC
    Kubectl->>K8sCluster: Query operator resources
    K8sCluster-->>Kubectl: Resource data
    Kubectl-->>Script: Resources collected
    Script->>FileSystem: Write operator files
    
    Script->>Script: collect_all_components()
    loop For each component (15 total)
        Script->>Kubectl: Check if component exists
        Kubectl->>K8sCluster: Query component workload
        K8sCluster-->>Kubectl: Component found/not found
        alt Component exists
            Script->>Kubectl: Get pods, logs, services
            Kubectl->>K8sCluster: Query component details
            K8sCluster-->>Kubectl: Component data
            Kubectl-->>Script: Details collected
            Script->>FileSystem: Write component files
        end
    end
    
    Script->>Script: collect_diagnostic_commands()
    alt Not skipped
        Script->>Kubectl: Find OFED pods
        Kubectl->>K8sCluster: Query OFED pods
        K8sCluster-->>Kubectl: Pod list
        loop For each OFED pod
            Script->>Kubectl: kubectl exec diagnostic commands
            Kubectl->>K8sCluster: Execute in pod (lsmod, ibstat, etc)
            K8sCluster-->>Kubectl: Command output
            Kubectl-->>Script: Diagnostic data
            Script->>FileSystem: Write diagnostic files
        end
    end
    
    Script->>Script: collect_node_info()
    Script->>Kubectl: Get nodes with labels and resources
    Kubectl->>K8sCluster: Query nodes
    K8sCluster-->>Kubectl: Node data
    Kubectl-->>Script: Nodes collected
    Script->>FileSystem: Write node files
    
    Script->>Script: cleanup_empty_artifacts()
    Script->>FileSystem: Remove empty files/directories
    FileSystem-->>Script: Cleanup complete
    
    Script->>Script: generate_summary()
    Script->>FileSystem: Write diagnostic summary
    
    Script->>Script: create_archive()
    alt No compress flag not set
        Script->>FileSystem: Create tar.gz archive
        FileSystem-->>Script: Archive created
        Script->>FileSystem: Generate SHA256 checksum
        Script->>FileSystem: Remove temporary directory
    end
    
    Script-->>User: Collection complete with statistics
Loading

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

7 files reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

fi

# Method 4: Check common namespaces for any network operator deployment
for ns in network-operator nvidia-network-operator openshift-network-operator kube-system; do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


# Collection info
cat > "$metadata_dir/collection-info.txt" <<EOF
Network Operator SOS-Report
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NVIDIA Network Operator SOS-Report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants