
Add automated diagnostics convergence workflow for failure pattern analysis #1459

Open
Copilot wants to merge 87 commits into main from copilot/add-diagnostics-convergence-workflow

Contributor

Copilot AI commented Feb 18, 2026

Implementation Plan: Diagnostics Convergence Workflow

  • Create .github/workflows/diagnostics-convergence.yml workflow file

    • Add workflow_run trigger for "Virtual Integration" workflow (on non-success completion)
    • Add workflow_dispatch trigger with optional inputs (branch, event, runs, jobs)
    • Set minimal permissions (actions: read, contents: read)
    • Implement GitHub API query via actions/github-script to build manifest
    • Download and extract artifacts
    • Run convergence analysis script
    • Upload convergence reports as artifacts
    • Add comment clarifying core is injected by actions/github-script
  • Create ci/diagnostics_convergence.py Python script

    • Implement JSON manifest parsing
    • Scan extracted artifacts for diagnostics_full_*.json files
    • Parse diagnostics JSON and extract failure data
    • Generate stable failure signatures with normalization:
      • Pod errors: normalize timestamps, IPs, UUIDs, image digests, pod RS suffixes
      • Deployment not ready: normalize reasons
      • ArgoCD unhealthy: create signatures from health/sync/phase
      • PVC not bound: create signatures from status
    • Aggregate across runs: frequency counts, example runs, co-occurrence analysis
    • Generate convergence.json (machine-readable with metadata)
    • Generate convergence.md (human-readable report with per-job tables)
    • Fix job name extraction to handle hyphenated job names correctly
  • Add documentation

    • Create docs/diagnostics-convergence.md explaining convergence workflow usage
  • Test and validate

    • Verify workflow syntax
    • Test Python script with sample data
    • Test with hyphenated job names (deploy-kind, deploy-on-prem, deploy-oxm-profile)
    • Run code review and address feedback
    • Run security scan (no vulnerabilities found)
    • Rebased onto latest main branch (multiple times)
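
The aggregation step in the plan above (frequency counts plus co-occurrence) can be sketched roughly as follows. This is a minimal illustration, not the code in ci/diagnostics_convergence.py: the function name `aggregate`, the `(run_id, job, signatures)` input shape, and the output keys are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import permutations


def aggregate(observations, top_n=3):
    """Count signatures per job and rank which signatures appear together.

    `observations` is an iterable of (run_id, job, signatures) tuples.
    This input shape is hypothetical and chosen only for the sketch.
    """
    freq = defaultdict(Counter)  # job -> signature -> number of run+job observations
    cooc = defaultdict(Counter)  # (job, signature) -> other signature -> shared observations
    for _run_id, job, signatures in observations:
        unique = sorted(set(signatures))  # count each signature once per run+job
        freq[job].update(unique)
        for a, b in permutations(unique, 2):
            cooc[(job, a)][b] += 1
    return {
        job: {
            sig: {
                "count": count,
                "co_occurs_with": [s for s, _ in cooc[(job, sig)].most_common(top_n)],
            }
            for sig, count in counter.most_common()
        }
        for job, counter in freq.items()
    }
```

Co-occurrence is counted only within the same run and job, which matches the "same run+job" rule in the original prompt.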

Recent Changes

Rebased onto latest main (6977565): Branch has been successfully rebased onto the latest main branch (commit 6977565 - "Release EMT OS 3.0. 20260310"). This rebase includes 7 new commits from main since the last rebase. All conflicts have been resolved and all previous functionality is preserved.

Previous fixes maintained:

  • Job name extraction correctly handles hyphenated job names (deploy-kind, deploy-on-prem, deploy-oxm-profile)
  • Core injection clarification comment in github-script block
  • Improved normalization patterns for stable failure signatures
  • All files validated successfully after rebase
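
For illustration, one way to recover a hyphenated job name from an artifact name is to anchor on the trailing fields rather than splitting on the first hyphen. The sketch below assumes the artifact naming pattern quoted in the original prompt (diagnostics-<job>-<event>-<PR or run number>-<run attempt>) and assumes event names contain only lowercase letters and underscores. The regex and the function name `extract_job_name` are hypothetical and may differ from what the PR actually does.

```python
import re

# Assumed artifact name layout:
#   diagnostics-<job>-<event>-<pr_or_run_number>-<run_attempt>
# The job may contain hyphens, so the three trailing fields are matched
# explicitly and everything before them is treated as the job name.
_ARTIFACT_RE = re.compile(
    r"^diagnostics-(?P<job>.+)-(?P<event>[a-z_]+)-(?P<number>\d+)-(?P<attempt>\d+)$"
)


def extract_job_name(artifact_name):
    """Return the job portion of an artifact name, or None if it does not match."""
    match = _ARTIFACT_RE.match(artifact_name)
    return match.group("job") if match else None
```

A naive `name.split("-")[1]` would return `deploy` for all three default jobs, which is the kind of collision this fix avoids.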
Original prompt

Create an automated diagnostics convergence workflow for the open-edge-platform/edge-manageability-framework repo.

Context:

  • The repo already has a composite action .github/actions/collect_diagnostics/action.yaml that runs ci/orch_k8s_diagnostics.py and uploads artifacts including diagnostics_full_*.json under an artifact name like diagnostics-${{ github.job }}-${{ github.event_name }}-${{ github.event.pull_request.number || github.run_number }}-${{ github.run_attempt }}.
  • The main workflow .github/workflows/virtual-integration.yml runs jobs including deploy-kind, deploy-on-prem, and deploy-oxm-profile, each invoking the collect_diagnostics action with --output-json.
  • User requirement: add a new convergence mechanism that analyzes diagnostics across multiple runs to converge on distinct failure signatures. It should run automatically only when the upstream workflow conclusion is not success, but keep branch and event filtering options open (optional).

Requested change:

  1. Add a new GitHub Actions workflow triggered by workflow_run completion of the Virtual Integration workflow:

    • on: workflow_run for workflows: ["Virtual Integration"], types: [completed].
    • The job(s) must run only when github.event.workflow_run.conclusion != 'success'.
    • Provide a workflow_dispatch entry too (manual trigger) to allow ad-hoc runs and optional filters.
    • Defaults:
      • last N runs: 20
      • jobs to include: deploy-kind, deploy-on-prem, deploy-oxm-profile
    • Optional inputs/filters:
      • branch (optional; if empty, aggregate across all branches)
      • event (optional; if empty, aggregate across all event types)
      • runs (default 20)
      • jobs (default csv of the 3 jobs)
  2. Use actions/github-script to query GitHub API:

    • Resolve workflow id for Virtual Integration.
    • List the last N workflow runs (respecting optional branch/event filters if provided; otherwise omit those filters).
    • For each run, list artifacts and select those whose name starts with diagnostics- and match selected jobs (artifact name contains diagnostics-<job>-).
    • Create a manifest JSON containing run metadata (run_id, run_attempt if available, html_url, head_sha, head_branch, event, conclusion, created_at) and artifact metadata (id, name, archive_download_url).
  3. Download artifact zips based on the manifest and extract them.

  4. Add a small Python script (e.g., ci/diagnostics_convergence.py) that:

    • Scans extracted artifact folders for diagnostics_full_*.json.
    • Parses each JSON and generates stable failure signatures grouped by workflow job.
    • Signature rules (minimum viable):
      • Pod errors (summary.pods_w_errors): signature using reason if present else status, and a normalized hash of message or last_event (normalize to strip timestamps, IPs, UUIDs, image digests, and pod RS suffixes).
      • Deployments not ready (summary.deployments_not_ready): signature deploy_not_ready::<namespace>::<name>::<reason> (normalize reason).
      • ArgoCD unhealthy (summary.argocd_apps_unhealthy): signature argocd::<namespace>::<name>::<health>::<sync>::<operation_phase> plus optional normalized message hash.
      • PVC not bound (summary.pvc_not_bound): pvc::<namespace>::<pvc>::<status>.
    • Aggregates across runs:
      • frequency counts per signature per job
      • list of example runs for each signature (include run html_url)
      • co-occurrence top 3 per signature (based on being present in the same run+job)
    • Outputs:
      • convergence.json (machine-readable, includes metadata like runs_scanned, filters, generated_at)
      • convergence.md (human readable report with per-job tables/rankings)
  5. Upload convergence.json and convergence.md as artifacts in the convergence workflow.

  6. Permissions:

    • Set workflow permissions minimally to allow listing workflow runs and downloading artifacts: actions: read, contents: read.
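
The signature rules in step 4 could look roughly like the sketch below. It is a hedged illustration under several assumptions: the dictionary keys (`reason`, `status`, `message`, `last_event`, `namespace`, `name`, `pvc`) are inferred from the rule descriptions, the `pod::` prefix and 12-character hash length are invented for the example, and the regular expressions are heuristics (the ReplicaSet suffix pattern in particular can produce false positives on ordinary hyphenated words).

```python
import hashlib
import re

# Volatile tokens replaced with placeholders before hashing. Order matters:
# digests, UUIDs and timestamps are replaced before the looser patterns run.
_PATTERNS = [
    (re.compile(r"sha256:[0-9a-f]{64}"), "<digest>"),
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I), "<uuid>"),
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?"), "<ts>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<ip>"),
    # Heuristic for "<name>-<replicaset hash>-<pod suffix>" pod names.
    (re.compile(r"-[a-z0-9]{8,10}-[a-z0-9]{5}\b"), "-<rs>"),
]


def normalize(text):
    """Replace volatile tokens with placeholders and collapse whitespace."""
    for pattern, token in _PATTERNS:
        text = pattern.sub(token, text)
    return " ".join(text.split())


def short_hash(text):
    """Short, stable hash of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()[:12]


def pod_signature(pod):
    # Use reason if present, else status; hash message, else last_event.
    kind = pod.get("reason") or pod.get("status") or "unknown"
    detail = pod.get("message") or pod.get("last_event") or ""
    return f"pod::{kind}::{short_hash(detail)}"


def deployment_signature(dep):
    reason = normalize(dep.get("reason") or "unknown")
    return f"deploy_not_ready::{dep.get('namespace')}::{dep.get('name')}::{reason}"


def pvc_signature(pvc):
    return f"pvc::{pvc.get('namespace')}::{pvc.get('pvc')}::{pvc.get('status')}"
```

Two pod errors that differ only in pod suffix, IP, timestamp, or image digest then produce the same signature, which is what lets counts converge across runs.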

Implementation notes:

  • The workflow should work when triggered via workflow_run (use info from github.event.workflow_run to infer repo/owner) and when manually triggered via workflow_dispatch.
  • When triggered by workflow_run, inputs are not available, so the default filters should be empty (aggregate across all branches and events). The failing run's head_branch could serve as an optional default, but only if that doesn't violate the "keep options open" requirement; prefer leaving the run unfiltered by default.
  • Ensure the job filter matches the three jobs by default.
  • Make the workflow robust to missing artifacts (skip runs without diagnostics artifacts).
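
The job filter and the "skip runs without diagnostics artifacts" rule can be illustrated with a small pure function. The prompt asks for this logic to live in an actions/github-script step (JavaScript); the Python below is only a language-neutral sketch of the selection logic, and `build_manifest` and its `artifacts_by_run` argument are hypothetical names. The field names follow the manifest fields listed in step 2.

```python
DEFAULT_JOBS = ("deploy-kind", "deploy-on-prem", "deploy-oxm-profile")


def build_manifest(runs, artifacts_by_run, jobs=DEFAULT_JOBS):
    """Select diagnostics artifacts for the chosen jobs, one entry per run.

    `runs` is a list of workflow-run dicts and `artifacts_by_run` maps a
    run id to its artifact dicts (both shapes are assumed for this sketch).
    Runs with no matching artifacts are skipped rather than treated as errors.
    """
    prefixes = tuple(f"diagnostics-{job}-" for job in jobs)
    manifest = []
    for run in runs:
        selected = [
            {
                "id": artifact["id"],
                "name": artifact["name"],
                "archive_download_url": artifact.get("archive_download_url"),
            }
            for artifact in artifacts_by_run.get(run["id"], [])
            if artifact["name"].startswith(prefixes)
        ]
        if not selected:
            continue  # no diagnostics for this run: skip it
        manifest.append({
            "run_id": run["id"],
            "run_attempt": run.get("run_attempt"),
            "html_url": run.get("html_url"),
            "head_sha": run.get("head_sha"),
            "head_branch": run.get("head_branch"),
            "event": run.get("event"),
            "conclusion": run.get("conclusion"),
            "created_at": run.get("created_at"),
            "artifacts": selected,
        })
    return manifest
```

Matching on the full `diagnostics-<job>-` prefix satisfies both conditions in step 2 (the name starts with diagnostics- and contains diagnostics-<job>-) in a single check.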

Deliverables in PR:

  • New workflow file under .github/workflows/ (name appropriately, e.g. diagnostics-convergence.yml).
  • New Python script under ci/diagnostics_convergence.py (plus any helper module if needed).
  • Update README or add a short markdown doc (optional) describing how to run the convergence m...

This pull request was created from Copilot chat.



Copilot AI and others added 4 commits February 18, 2026 14:42
Co-authored-by: hwindlas <108932456+hwindlas@users.noreply.github.com>
Co-authored-by: hwindlas <108932456+hwindlas@users.noreply.github.com>
Co-authored-by: hwindlas <108932456+hwindlas@users.noreply.github.com>
Co-authored-by: hwindlas <108932456+hwindlas@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add automated diagnostics convergence workflow" to "Add automated diagnostics convergence workflow for failure pattern analysis" on Feb 18, 2026
Copilot AI requested a review from hwindlas February 18, 2026 14:49
@hwindlas hwindlas marked this pull request as ready for review February 18, 2026 15:49
dependabot bot and others added 13 commits March 11, 2026 09:58
…t-merge.yml from 2026.0.15 to 2026.0.17 (#1572)

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…1570)

Co-authored-by: Sonia Bhattacharya <sonia.bhattacharya@intel.com>
Co-authored-by: rranjan3 <rajeev2.ranjan@intel.com>
Co-authored-by: ppanigra <106302035+ppanigra@users.noreply.github.com>
Co-authored-by: Hardik Windlass <108932456+hwindlas@users.noreply.github.com>
Co-authored-by: SYS-EMF <sys-emf-github@intel.com>
Co-authored-by: Sonia Bhattacharya <sonia.bhattacharya@intel.com>
Co-authored-by: Sonia Bhattacharya <sonia.bhattacharya@intel.com>
Co-authored-by: Vignesh Chandrasekharan <vignesh.chandrasekharan@intel.com>
Co-authored-by: Vignesh Chandrasekharan <vignesh.chandrasekharan@intel.com>
Co-authored-by: Sunil Parida <sunil.kumar.parida@intel.com>
Co-authored-by: PalashGoelIntel <palash.goel@intel.com>
Co-authored-by: Sunil Parida <sunil.kumar.parida@intel.com>
Co-authored-by: Sonia Bhattacharya <sonia.bhattacharya@intel.com>
Co-authored-by: johnoloughlin <26579487+johnoloughlin@users.noreply.github.com>
Contributor Author

Copilot AI commented Mar 11, 2026

@copilot rebase

Rebase completed in commit fa2423b. The branch has been successfully rebased onto the latest main branch (8d74c3e - "Bump chart versions following orch-utils updates"). All previous changes including job name extraction fix and normalization improvements are preserved.

@johnoloughlin
Contributor

@copilot sort the conflicts

PalashGoelIntel and others added 11 commits March 12, 2026 11:05
Co-authored-by: Sunil Parida <sunil.kumar.parida@intel.com>
Co-authored-by: John O'Loughlin <john.oloughlin@intel.com>
Co-authored-by: SYS-EMF <sys-emf-github@intel.com>
Co-authored-by: Hardik Windlass <108932456+hwindlas@users.noreply.github.com>
Co-authored-by: Andrei Palade <andrei.palade@intel.com>
Co-authored-by: SYS-EMF <sys-emf-github@intel.com>
Co-authored-by: Sonia Bhattacharya <sonia.bhattacharya@intel.com>
…uration (#1559)

Co-authored-by: SYS-EMF <sys-emf-github@intel.com>
Co-authored-by: Sonia Bhattacharya <sonia.bhattacharya@intel.com>
Co-authored-by: Sunil Parida <sunil.kumar.parida@intel.com>
Co-authored-by: Hardik Windlass <108932456+hwindlas@users.noreply.github.com>
Signed-off-by: Goyal, Pankaj <pankaj.goyal@intel.com>
Co-authored-by: hemapriya <hema.senthilkumar@outlook.com>
Co-authored-by: Loughnane, Gary <gary.loughnane@intel.com>
Co-authored-by: mengalax <madhulathax.engalathuru@intel.com>
Co-authored-by: Sunil Parida <sunil.kumar.parida@intel.com>
Co-authored-by: Pankaj Goyal <pankaj.goyal@intel.com>
Co-authored-by: Sonia Bhattacharya <sonia.bhattacharya@intel.com>
Signed-off-by: Rajeev Ranjan <rajeev2.ranjan@intel.com>
Co-authored-by: johnoloughlin <26579487+johnoloughlin@users.noreply.github.com>
Contributor Author

Copilot AI commented Mar 12, 2026

@copilot sort the conflicts

Conflicts resolved in commit 1143c6c. Successfully rebased onto latest main (commit 6977565 - "Release EMT OS 3.0. 20260310"). The branch now includes 7 additional commits from main since the previous rebase. All files validated and all previous fixes preserved (job name extraction, core injection comment, normalization patterns).
