feat: add SOS report script #2065
base: master
log_info "Collecting all component workloads..."

# Component definitions: name, label, type
declare -A components=(
I suggest making this more generic. Maybe derive these components from release.yaml?
- Identify which release the customer is using (helm list ...)
- Fetch the corresponding release.yaml
- Build the components map dynamically
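The first step of this suggestion could be sketched as follows. This is only an illustration: it assumes the Helm chart name starts with `network-operator`, that `jq` is available, and the helper names `detect_release` and `chart_version` are hypothetical. Fetching and parsing release.yaml is left out because its location and layout are not shown in this thread.

```bash
# Sketch only: the chart name prefix "network-operator" is an assumption.

# Print "<namespace> <chart>" for the first Helm release whose chart name
# starts with "network-operator". Requires helm and jq on the PATH.
detect_release() {
    helm list --all-namespaces --output json |
        jq -r '.[] | select(.chart | startswith("network-operator")) | "\(.namespace) \(.chart)"' |
        head -n 1
}

# Strip the assumed chart name prefix, leaving only the version string,
# e.g. "network-operator-<version>" becomes "<version>".
chart_version() {
    printf '%s\n' "${1#network-operator-}"
}
```

The version string could then select which release.yaml (or which pre-generated component list) the script uses.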
On second thought, I suggest building the network-operator workload and CRD maps per network-operator release, i.e. generating these maps and embedding them in the SOS report script.
That way customers can pull an SOS report script adapted to their network-operator release. We also need to make sure it is backward compatible with previous releases, e.g. it must not throw errors for CRDs or workloads that are not part of a given network-operator release.
We also need to make sure that the SOS script works in a disconnected environment. Pulling release.yaml from public GitHub could be blocked there, so the resource maps should be generated as part of the network-operator release.
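One possible shape for such a release-time generator is sketched below. The `name|label|type` row format, the `COMPONENTS` variable, and the function name `generate_embedded_map` are assumptions made for illustration, not the format used in this PR. A plain line-based list is used instead of an associative array so that the fragment has no bash 4 requirement.

```bash
# Sketch only: row format and variable name are hypothetical.

# Read "name|label|type" rows on stdin and print a shell fragment that
# defines COMPONENTS. The fragment can be pasted into (or sourced by) the
# SOS report script at release time, so nothing is fetched at run time.
# Assumes rows contain no single quotes.
generate_embedded_map() {
    local name label type
    printf "COMPONENTS='\n"
    while IFS='|' read -r name label type; do
        [ -n "$name" ] || continue          # skip blank rows
        printf '%s|%s|%s\n' "$name" "$label" "$type"
    done
    printf "'\n"
}
```

Because the map is baked into the script, the collector works the same way in a disconnected cluster.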
Added a script that generates the list of CRDs and components to collect. If a CRD is not present, the script does not fail, so backward compatibility should be fine.
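The "do not fail on a missing CRD" behaviour might look roughly like this. It is a sketch rather than the code in this PR: the function name `collect_crds` and the `KUBECTL` override variable are assumptions.

```bash
# Sketch only: function and variable names are hypothetical.
# KUBECTL can be overridden (for example in tests); it defaults to kubectl.
: "${KUBECTL:=kubectl}"

# collect_crds OUT_DIR CRD...
# Write one YAML file per CRD that exists in the cluster. A CRD that is
# absent (for example because this release does not ship it) is skipped
# with a note on stderr instead of aborting the whole collection.
collect_crds() {
    local out_dir=$1 crd
    shift
    mkdir -p "$out_dir"
    for crd in "$@"; do
        if "$KUBECTL" get crd "$crd" -o yaml > "$out_dir/$crd.yaml" 2>/dev/null; then
            echo "collected $crd"
        else
            rm -f "$out_dir/$crd.yaml"      # drop the empty or partial file
            echo "skipping $crd (not present in this cluster)" >&2
        fi
    done
    return 0
}
```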
Greptile Overview
Greptile Summary
Added comprehensive SOS report collection script for troubleshooting Network Operator deployments. The implementation includes a main collection script (
Key additions:
Implementation quality:
The PR addresses a real operational need for collecting diagnostic data in a standardized way.
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant Script as kubectl-netop_sosreport
participant Kubectl
participant K8sCluster as Kubernetes Cluster
participant FileSystem as File System
User->>Script: Execute with options
Script->>Script: parse_args()
Script->>Kubectl: check_prerequisites()
Kubectl->>K8sCluster: Test connectivity
K8sCluster-->>Kubectl: Connection OK
Kubectl-->>Script: Prerequisites verified
Script->>Kubectl: detect_operator_namespace()
Kubectl->>K8sCluster: Search for operator deployment
K8sCluster-->>Kubectl: Namespace found
Kubectl-->>Script: Operator namespace detected
Script->>FileSystem: setup_output_dir()
FileSystem-->>Script: Directory created
Script->>Script: collect_metadata()
Script->>Kubectl: Get cluster version, namespaces
Kubectl->>K8sCluster: Query resources
K8sCluster-->>Kubectl: Return data
Kubectl-->>Script: Metadata collected
Script->>FileSystem: Write metadata files
Script->>Script: collect_crd_definitions()
loop For each CRD
Script->>Kubectl: Get CRD definition
Kubectl->>K8sCluster: Query CRD
K8sCluster-->>Kubectl: CRD YAML
Kubectl-->>Script: CRD data
Script->>FileSystem: Write CRD file
end
Script->>Script: collect_crd_instances()
loop For each CR type
Script->>Kubectl: Get CR instances
Kubectl->>K8sCluster: Query instances
K8sCluster-->>Kubectl: Instance data
Kubectl-->>Script: Instances
Script->>FileSystem: Write instance files
end
Script->>Script: collect_operator_resources()
Script->>Kubectl: Get operator deployment, pods, RBAC
Kubectl->>K8sCluster: Query operator resources
K8sCluster-->>Kubectl: Resource data
Kubectl-->>Script: Resources collected
Script->>FileSystem: Write operator files
Script->>Script: collect_all_components()
loop For each component (15 total)
Script->>Kubectl: Check if component exists
Kubectl->>K8sCluster: Query component workload
K8sCluster-->>Kubectl: Component found/not found
alt Component exists
Script->>Kubectl: Get pods, logs, services
Kubectl->>K8sCluster: Query component details
K8sCluster-->>Kubectl: Component data
Kubectl-->>Script: Details collected
Script->>FileSystem: Write component files
end
end
Script->>Script: collect_diagnostic_commands()
alt Not skipped
Script->>Kubectl: Find OFED pods
Kubectl->>K8sCluster: Query OFED pods
K8sCluster-->>Kubectl: Pod list
loop For each OFED pod
Script->>Kubectl: kubectl exec diagnostic commands
Kubectl->>K8sCluster: Execute in pod (lsmod, ibstat, etc)
K8sCluster-->>Kubectl: Command output
Kubectl-->>Script: Diagnostic data
Script->>FileSystem: Write diagnostic files
end
end
Script->>Script: collect_node_info()
Script->>Kubectl: Get nodes with labels and resources
Kubectl->>K8sCluster: Query nodes
K8sCluster-->>Kubectl: Node data
Kubectl-->>Script: Nodes collected
Script->>FileSystem: Write node files
Script->>Script: cleanup_empty_artifacts()
Script->>FileSystem: Remove empty files/directories
FileSystem-->>Script: Cleanup complete
Script->>Script: generate_summary()
Script->>FileSystem: Write diagnostic summary
Script->>Script: create_archive()
alt Compression not disabled
Script->>FileSystem: Create tar.gz archive
FileSystem-->>Script: Archive created
Script->>FileSystem: Generate SHA256 checksum
Script->>FileSystem: Remove temporary directory
end
Script-->>User: Collection complete with statistics
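The last steps of the diagram (removing empty artifacts, creating the archive, writing the checksum, and removing the temporary directory) could be sketched as below. This is an illustration under assumptions: the function name `finalize_report` is hypothetical, and it relies on `find -empty -delete` and `sha256sum` being available (on macOS, `shasum -a 256` would be the usual substitute).

```bash
# Sketch only: not the implementation from this PR.

# finalize_report DIR
# Prune empty files and directories, pack DIR into DIR.tar.gz, write
# DIR.tar.gz.sha256 next to it, then remove the working directory.
finalize_report() {
    local dir=$1
    local parent base archive
    parent=$(dirname "$dir")
    base=$(basename "$dir")
    archive="$base.tar.gz"

    # -delete implies depth-first traversal, so nested empty directories
    # are removed bottom-up in a single pass.
    find "$dir" -type f -empty -delete
    find "$dir" -mindepth 1 -type d -empty -delete

    tar -czf "$parent/$archive" -C "$parent" "$base"
    # Record a relative file name so "sha256sum -c" works after the
    # archive and checksum are copied elsewhere together.
    ( cd "$parent" && sha256sum "$archive" > "$archive.sha256" )
    rm -rf "$dir"
}
```

The recipient can verify the upload with `sha256sum -c <archive>.sha256` in the directory that holds both files.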
7 files reviewed, 4 comments
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
166eacd to 135c64f
5 files reviewed, no comments
fi

# Method 4: Check common namespaces for any network operator deployment
for ns in network-operator nvidia-network-operator openshift-network-operator kube-system; do
openshift-network-operator is used by the OCP Cluster Network Operator.
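Assuming the intent of this comment is to drop that namespace from the fallback list, the loop might be reworked along these lines. The function name, the `KUBECTL` override, and matching deployments by the substring `network-operator` are all assumptions for the sake of the sketch.

```bash
# Sketch only: not the code from this PR.
# KUBECTL can be overridden (for example in tests); it defaults to kubectl.
: "${KUBECTL:=kubectl}"

# Print the first candidate namespace that contains a deployment whose
# name includes "network-operator". openshift-network-operator is left
# out on purpose because it belongs to the OCP Cluster Network Operator.
find_operator_namespace() {
    local ns
    for ns in network-operator nvidia-network-operator kube-system; do
        if "$KUBECTL" get deployments -n "$ns" -o name 2>/dev/null |
            grep -q 'network-operator'; then
            printf '%s\n' "$ns"
            return 0
        fi
    done
    return 1
}
```

A name match in kube-system is still a weak signal, so a real implementation would probably prefer a label selector specific to the NVIDIA operator.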

# Collection info
cat > "$metadata_dir/collection-info.txt" <<EOF
Network Operator SOS-Report
NVIDIA Network Operator SOS-Report
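With the suggested title applied, the heredoc could read roughly as follows. Only the first line comes from this thread; the function wrapper and the timestamp field are assumptions added to make the sketch self-contained.

```bash
# Sketch only: the "Collected" field and the function name are hypothetical.

# write_collection_info METADATA_DIR
# Write collection-info.txt with the product-qualified report title.
write_collection_info() {
    local metadata_dir=$1
    mkdir -p "$metadata_dir"
    cat > "$metadata_dir/collection-info.txt" <<EOF
NVIDIA Network Operator SOS-Report
Collected: $(date -u +%Y-%m-%dT%H:%M:%SZ)
EOF
}
```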
No description provided.