
feat: support token usage diff and multiple comparison runs in audit diff command #24544

Merged
pelikhan merged 5 commits into main from copilot/support-audit-diff-token-usage
Apr 4, 2026
Contributor

Copilot AI commented Apr 4, 2026

Summary

Enhances the audit diff command with two improvements:

1. Detailed token usage diff

Adds a TokenUsageDiff struct populated from RunSummary.TokenUsage (the firewall proxy token-usage.jsonl data) with a per-field breakdown:

  • Input tokens, output tokens, cache read/write tokens, effective tokens
  • Pre-computed percentage change strings for each field
  • Total API requests and change count
  • Cache efficiency per run

This is surfaced in both markdown (table) and pretty console output. The existing Run.TokenUsage (int) field in RunMetricsDiff is preserved for backward compatibility.
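As a rough illustration of the "pre-computed percentage change strings" above, a per-field entry could be built along these lines. This is only a sketch: `fieldDiff`, `percentChange`, and `newFieldDiff` are hypothetical names that do not come from the PR, and the handling of a zero baseline (`"n/a"`) is an assumption rather than the behavior of `computeTokenUsageDiff`.

```go
package main

import "fmt"

// fieldDiff is a hypothetical per-field entry: the value from each run
// plus a pre-computed, human-readable change string.
type fieldDiff struct {
	Run1   int    `json:"run1"`
	Run2   int    `json:"run2"`
	Change string `json:"change,omitempty"`
}

// percentChange formats the relative change from before to after as a
// signed percentage string. Assumption: an unchanged zero yields an empty
// string (omitted from JSON) and a zero baseline yields "n/a".
func percentChange(before, after int) string {
	switch {
	case before == 0 && after == 0:
		return ""
	case before == 0:
		return "n/a"
	}
	pct := float64(after-before) / float64(before) * 100
	return fmt.Sprintf("%+.0f%%", pct)
}

// newFieldDiff builds one entry, e.g. for input tokens or output tokens.
func newFieldDiff(run1, run2 int) fieldDiff {
	return fieldDiff{Run1: run1, Run2: run2, Change: percentChange(run1, run2)}
}
```

Computing the string once at diff time means the markdown, pretty, and JSON renderers can all display the same value without repeating the arithmetic.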

2. Multiple comparison runs from a single base

Changes the command signature from diff <run-id-1> <run-id-2> to diff <base-run-id> <compare-run-id-1> [<compare-run-id-2>...]:

  • The base run summary is downloaded/loaded once and reused for all comparisons
  • Each comparison run is diffed against the base independently
  • Self-comparison and duplicate run IDs are rejected with clear error messages
  • For single comparisons: JSON output remains a single object (backward compatible)
  • For multiple comparisons: pretty/markdown outputs are separated with dividers; JSON outputs an array
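The validation rules listed above (at least two IDs, no self-comparison, no duplicates) might look roughly like the following. `parseRunIDs` is a hypothetical helper written for illustration; the actual parsing in `audit_diff_command.go` and its error wording may differ.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseRunIDs is a hypothetical sketch: it treats the first argument as the
// base run ID and the rest as comparison run IDs, rejecting self-comparison
// and duplicate comparison IDs.
func parseRunIDs(args []string) (int64, []int64, error) {
	if len(args) < 2 {
		return 0, nil, fmt.Errorf("expected a base run ID and at least one comparison run ID, got %d argument(s)", len(args))
	}
	base, err := strconv.ParseInt(args[0], 10, 64)
	if err != nil {
		return 0, nil, fmt.Errorf("invalid base run ID %q: %w", args[0], err)
	}
	seen := make(map[int64]bool, len(args)-1)
	compare := make([]int64, 0, len(args)-1)
	for _, arg := range args[1:] {
		id, err := strconv.ParseInt(arg, 10, 64)
		if err != nil {
			return 0, nil, fmt.Errorf("invalid comparison run ID %q: %w", arg, err)
		}
		if id == base {
			return 0, nil, fmt.Errorf("cannot compare run %d with itself", id)
		}
		if seen[id] {
			return 0, nil, fmt.Errorf("duplicate comparison run ID %d", id)
		}
		seen[id] = true
		compare = append(compare, id)
	}
	return base, compare, nil
}
```

Validating all IDs before any download keeps a typo in the last argument from wasting the work done on the earlier runs.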

Changes

  • pkg/cli/audit_diff.go — New TokenUsageDiff struct + field on RunMetricsDiff; computeTokenUsageDiff; updated computeRunMetricsDiff
  • pkg/cli/audit_diff_command.go — MinimumNArgs(2), multi-run parsing with validation, base-run loaded once
  • pkg/cli/audit_diff_render.go — Multi-diff render functions, token usage detail sections in markdown/pretty
  • pkg/cli/audit_diff_test.go — Tests for computeTokenUsageDiff, RunMetricsDiff with token details, and multi-run diff

Examples

# Two-run diff (backward compatible)
gh aw audit diff 12345 12346

# Base vs multiple runs
gh aw audit diff 12345 12346 12347 12348

# Markdown output with token details
gh aw audit diff 12345 12346 --format markdown
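
For the two invocation shapes above, the JSON output differs: a two-run diff stays a single object, while a base-versus-many diff becomes an array. A minimal sketch of that branching, assuming a placeholder `diffResult` type and a hypothetical `marshalDiffs` helper (neither name is from the PR):

```go
package main

import "encoding/json"

// diffResult is a hypothetical placeholder for the real diff type.
type diffResult struct {
	Run1ID int64 `json:"run1_id"`
	Run2ID int64 `json:"run2_id"`
}

// marshalDiffs keeps the single-comparison output backward compatible by
// emitting one JSON object, and emits a JSON array for any other count.
func marshalDiffs(diffs []diffResult) ([]byte, error) {
	if len(diffs) == 1 {
		return json.MarshalIndent(diffs[0], "", "  ")
	}
	return json.MarshalIndent(diffs, "", "  ")
}
```

Consumers that script against the JSON output would need to handle both shapes if they pass a variable number of run IDs.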

…diff command

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/c2cbe169-434c-4f94-8c73-7af8797b7c5d

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Collaborator

pelikhan commented Apr 4, 2026

@copilot use console pkg helpers

… labels in audit diff pretty render

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/2e45f51d-dc34-4f35-a127-3c42a1b0ea44

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 4, 2026

@copilot use console pkg helpers

Updated in dceb448. All pretty section renderers now use console.FormatSectionHeader() for section headings (e.g. "Firewall Changes", "MCP Tool Changes", "Run Metrics", "Token Usage Details") and console.FormatInfoMessage() for sub-section labels (e.g. "New Domains (2):", "Changed Tools (1):"), consistent with the pattern in audit_report_render.go.

Collaborator

pelikhan commented Apr 4, 2026

@copilot collect all diff data in tables and use console table rendering helpers

Contributor Author

Copilot AI commented Apr 4, 2026

@copilot collect all diff data in tables and use console table rendering helpers

Done in 453a36b. All diff sections in the pretty console renderer now use console.RenderTable:

  • Firewall: separate tables for new domains (Domain/Status/Requests/Anomaly), removed domains (Domain/Previous Status/Previous Requests), status changes (Domain/Before/After/Anomaly), and volume changes (Domain/Requests before/Requests after/Change)
  • MCP Tools: separate tables for new tools (Server/Tool/Calls/Anomaly), removed tools (Server/Tool/Previous Calls), and changed tools (Server/Tool/Calls before/Calls after/Change/Errors before/Errors after/Anomaly)
  • Run Metrics: table with Metric/Run#N/Run#M/Change rows
  • Token Usage Details: table with Token Type/Run#N/Run#M/Change rows

Contributor

Copilot AI left a comment


Pull request overview

Enhances the audit diff command to provide a richer token-usage breakdown (from firewall proxy token-usage.jsonl summaries) and to support diffing multiple comparison runs against a single base run in one invocation.

Changes:

  • Added a detailed TokenUsageDiff and integrated it into RunMetricsDiff computation.
  • Updated CLI command parsing/execution to accept base-run-id + 1..N comparison run IDs and reuse the base summary across comparisons.
  • Updated pretty/markdown/JSON renderers to support rendering multiple diffs and to display token-usage detail sections; added tests covering the new diff logic.
Summary per file:

| File | Description |
| --- | --- |
| pkg/cli/audit_diff.go | Adds TokenUsageDiff and computes detailed token usage diffs as part of run metrics. |
| pkg/cli/audit_diff_command.go | Changes command signature and loads the base run once for multiple comparisons. |
| pkg/cli/audit_diff_render.go | Updates JSON/markdown/pretty output to handle multiple diffs and render token usage details. |
| pkg/cli/audit_diff_test.go | Adds unit tests for token usage diffing and for metrics diff integration. |

Copilot's findings

  • Files reviewed: 4/4 changed files
  • Comments generated: 2

EffectiveTokensChange string `json:"effective_tokens_change,omitempty"`
Run1TotalRequests int `json:"run1_total_requests"`
Run2TotalRequests int `json:"run2_total_requests"`
RequestsChange string `json:"requests_change,omitempty"`

Copilot AI Apr 4, 2026


RequestsChange is formatted via formatCountChange (absolute delta like "+4") while the other "Change" fields in TokenUsageDiff are percentage strings (e.g. "+50%"). This mixes units in the same diff/table output and can be misleading; consider either formatting requests as a percentage (consistent with token fields) or renaming the field/output to make it explicit that it's an absolute request delta.

Suggested change
- RequestsChange string `json:"requests_change,omitempty"`
+ RequestsDelta string `json:"requests_delta,omitempty"` // Absolute request-count delta, e.g. "+4"

Comment on lines +284 to +286
Run1CacheEfficiency float64 `json:"run1_cache_efficiency"`
Run2CacheEfficiency float64 `json:"run2_cache_efficiency"`
}

Copilot AI Apr 4, 2026


TokenUsageDiff includes per-run cache efficiency values, but there is no corresponding change field, so renderers currently leave the "Change" column blank for that row. If cache efficiency is intended to be part of the diff (per PR description), consider adding a CacheEfficiencyChange (percentage-point delta or relative %) or rendering it in a way that doesn't imply a missing value.

Collaborator

pelikhan commented Apr 4, 2026

@copilot review comments

Contributor Author

Copilot AI commented Apr 4, 2026

@copilot review comments

Both review issues addressed in 3f36c01:

  1. RequestsChange → RequestsDelta — renamed field and JSON tag (requests_delta) to make it explicit that it's an absolute count delta (e.g. +4), not a relative percentage like the other *Change fields.

  2. CacheEfficiencyChange — added field computed as a percentage-point delta via a new formatPercentagePointChange helper (e.g. -1.5pp). The Change column in the cache efficiency table row is now populated in both markdown and pretty output.
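
To make the unit distinction in the two fixes concrete, here is a small sketch of the two formats. `percentagePointChange` and `countDelta` are hypothetical stand-ins, not the PR's `formatPercentagePointChange` or `formatCountChange`, and the sketch assumes cache efficiency is already expressed as a percentage (e.g. 40.0 for 40%), which the thread does not state.

```go
package main

import "fmt"

// percentagePointChange formats the difference between two percentages as
// a signed percentage-point string such as "-1.5pp".
// Assumption: both inputs are percentages, not fractions in [0, 1].
func percentagePointChange(before, after float64) string {
	return fmt.Sprintf("%+.1fpp", after-before)
}

// countDelta formats an absolute, signed difference between two counts,
// such as "+4", with no percent sign so it is not read as a relative change.
func countDelta(before, after int) string {
	return fmt.Sprintf("%+d", after-before)
}
```

Percentage points avoid the ambiguity of a relative percentage of a percentage: moving from 40.0% to 38.5% is -1.5pp, whereas the relative change would be about -3.8%.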

Collaborator

pelikhan commented Apr 4, 2026

Contributor Author

Copilot AI commented Apr 4, 2026

@copilot fix https://github.com/github/gh-aw/actions/runs/23985383782/job/69956569649#step:10:1

Fixed in 87402be. TestArgumentNamingConventions was failing because the Use string "diff <base-run-id> <compare-run-id> [<compare-run-id-2>...]" left a <compare-run-id-2> token (after the test's prefix/suffix stripping) that didn't match ^[a-z][a-z0-9-]*$. Changed to the standard variadic form "diff <base-run-id> <compare-run-id>..." consistent with add_command.go and others.
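
A naming check of this kind can be sketched as follows. Only the pattern `^[a-z][a-z0-9-]*$` comes from the comment above; `invalidArgNames` and its stripping order are assumptions made for illustration and are not the logic of TestArgumentNamingConventions.

```go
package main

import (
	"regexp"
	"strings"
)

// argNamePattern is the naming rule quoted in the fix description.
var argNamePattern = regexp.MustCompile(`^[a-z][a-z0-9-]*$`)

// invalidArgNames is a hypothetical checker: it skips the command name,
// strips one layer of "...", "[ ]", and "< >" decoration from each remaining
// token, and returns the tokens that still fail the naming pattern.
func invalidArgNames(use string) []string {
	fields := strings.Fields(use)
	if len(fields) < 2 {
		return nil
	}
	var bad []string
	for _, f := range fields[1:] {
		name := strings.TrimSuffix(f, "...")
		name = strings.TrimSuffix(strings.TrimPrefix(name, "["), "]")
		name = strings.TrimSuffix(strings.TrimPrefix(name, "<"), ">")
		if !argNamePattern.MatchString(name) {
			bad = append(bad, name)
		}
	}
	return bad
}
```

Under this sketch the nested form `[<compare-run-id-2>...]` leaves stray decoration after stripping and is flagged, while the variadic form `<compare-run-id>...` reduces to a plain name and passes.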

pelikhan merged commit cc344d1 into main Apr 4, 2026
pelikhan deleted the copilot/support-audit-diff-token-usage branch April 4, 2026 19:13
github-actions bot added a commit that referenced this pull request Apr 4, 2026
- Update gh aw audit diff signature to reflect multi-run support:
  accepts one base run and one or more comparison runs (from #24544)
- Document token usage diff breakdown (input/output/cache tokens,
  API requests, cache efficiency) added in #24544
- Document JSON array output and divider behavior for multiple diffs
- Add lock file header section to workflow-structure.md documenting
  the gh-aw-metadata first-line placement and the Secrets/Custom
  actions manifest sections added in #24495

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>