Conversation
- Introduce `enableMetrics` boolean to workflow schema and docs
- Track CPU and memory usage for tasks (host, container, k8s)
- Show metrics in web UI and expose via API
- Add `types.Metrics` struct and plumbing throughout codebase
Pull Request Overview
This PR adds metrics collection functionality to track resource usage (CPU and memory) for tasks during workflow execution. The feature is controlled by a new `enableMetrics` option that defaults to enabled.
- Introduces a new Metrics type to capture CPU and memory usage
- Implements resource metrics collection for different process types (host, container, Kubernetes)
- Adds metrics display in the web UI and HTTP API endpoint for accessing metrics
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| schema/workflow.schema.json | Adds enableMetrics boolean field to workflow schema |
| internal/types/spec.go | Adds EnableMetrics field and getter method with default true value |
| internal/types/metrics.go | Defines new Metrics struct for CPU and memory tracking |
| internal/task_node.go | Adds Metrics field to TaskNode structure |
| internal/server.go | Implements HTTP endpoint to retrieve task metrics |
| internal/run.go | Adds periodic metrics collection goroutine for running tasks |
| internal/proc/proc.go | Adds GetMetrics method to process interface |
| internal/proc/noop.go | Implements GetMetrics returning empty metrics |
| internal/proc/kubernetes.go | Implements kubectl-based metrics collection for Kubernetes pods |
| internal/proc/host.go | Implements ps-based metrics collection for host processes |
| internal/proc/container.go | Implements Docker stats-based metrics collection for containers |
| internal/index.html | Updates UI to display metrics and improves code formatting |
| docs/reference/workflow.md | Documents new enableMetrics configuration option |
| docs/reference/workflow-defs-workflow.md | Adds enableMetrics to workflow definition documentation |
| docs/reference/workflow-defs-workflow-properties-enablemetrics.md | Adds detailed documentation for enableMetrics property |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Alex Collins <alexec@users.noreply.github.com>
Actionable comments posted: 1
♻️ Duplicate comments (6)
internal/types/metrics.go (1)
4-5: Clarify CPU unit semantics (millicores, aggregated across cores).
Document that CPU is in millicores aggregated across all cores (1000m = 1 core; values may exceed 1000 on multi‑core) to keep backends/UI consistent.
Apply this diff:

```diff
-	CPU uint64 `json:"cpu"` // CPU usage in millicores
+	CPU uint64 `json:"cpu"` // CPU usage in millicores aggregated across cores (1000m = 1 core; can exceed 1000m)
```

internal/proc/kubernetes.go (4)
43-47: Track pods by namespace/name and guard with a mutex.
Using only names risks collisions across namespaces and introduces races. Store ns+name and protect with RWMutex.
```diff
 type k8s struct {
-	log      *log.Logger
-	spec     types.Spec
-	name     string
-	podNames []string
+	log  *log.Logger
+	spec types.Spec
+	name string
+	mu   sync.RWMutex
+	pods []struct{ ns, name string }
 	types.Task
 }
```
276-279: Deduplicate under lock using namespace/name; avoid races.

```diff
-	if !slices.Contains(k.podNames, pod.Name) {
-		k.podNames = append(k.podNames, pod.Name)
-	}
+	k.mu.Lock()
+	seen := false
+	for _, p := range k.pods {
+		if p.ns == pod.Namespace && p.name == pod.Name {
+			seen = true
+			break
+		}
+	}
+	if !seen {
+		k.pods = append(k.pods, struct{ ns, name string }{ns: pod.Namespace, name: pod.Name})
+	}
+	k.mu.Unlock()
```
435-446: Aggregate with snapshot under RLock; tolerate per‑pod errors.
One failing pod should not zero the whole read; also avoid iterating while mutating.
```diff
 func (k *k8s) GetMetrics(ctx context.Context) (*types.Metrics, error) {
-	sum := &types.Metrics{}
-	for _, podName := range k.podNames {
-		metrics, err := k.getMetrics(ctx, podName)
-		if err != nil {
-			return nil, err
-		}
-		sum.CPU += metrics.CPU
-		sum.Mem += metrics.Mem
-	}
-	return sum, nil
+	sum := &types.Metrics{}
+	k.mu.RLock()
+	pods := append([]struct{ ns, name string }{}, k.pods...) // snapshot
+	k.mu.RUnlock()
+	for _, p := range pods {
+		metrics, err := k.getMetrics(ctx, p.ns, p.name)
+		if err != nil {
+			k.log.Printf("kubectl top failed for %s/%s: %v", p.ns, p.name, err)
+			continue
+		}
+		sum.CPU += metrics.CPU
+		sum.Mem += metrics.Mem
+	}
+	return sum, nil
 }
```
448-456: Pass namespace to kubectl top.
Without `-n`, results are wrong outside the default namespace.
```diff
-func (k *k8s) getMetrics(ctx context.Context, podName string) (*types.Metrics, error) {
-	cmd := exec.CommandContext(ctx, "kubectl", "top", "pod", podName, "--no-headers")
+func (k *k8s) getMetrics(ctx context.Context, ns, podName string) (*types.Metrics, error) {
+	cmd := exec.CommandContext(ctx, "kubectl", "top", "pod", podName, "-n", ns, "--no-headers")
```

internal/proc/container.go (1)
335-339: Prefer inactive_file on cgroup v2 for more accurate RSS‑like memory.
Fall back to cache only if inactive_file is missing or zero.
```diff
-	memoryBytes := dockerStats.MemoryStats.Usage
-	if dockerStats.MemoryStats.Stats["cache"] != 0 {
-		memoryBytes -= dockerStats.MemoryStats.Stats["cache"]
-	}
+	memoryBytes := dockerStats.MemoryStats.Usage
+	if v := dockerStats.MemoryStats.Stats["inactive_file"]; v != 0 {
+		memoryBytes -= v
+	} else if v := dockerStats.MemoryStats.Stats["cache"]; v != 0 {
+		memoryBytes -= v
+	}
```
🧹 Nitpick comments (2)
internal/proc/kubernetes.go (1)
458-465: Handle multi-line output defensively.
kubectl may return multiple lines if the selector broadens; trim to the first non-empty line before parsing.
```diff
-	fields := strings.Fields(strings.TrimSpace(output))
+	lines := strings.Split(strings.TrimSpace(output), "\n")
+	if len(lines) == 0 {
+		return nil, fmt.Errorf("unexpected kubectl top output format")
+	}
+	fields := strings.Fields(lines[0])
```
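A defensive parser along these lines might look like the following sketch; the unit handling is deliberately minimal and assumes kubectl's common `m` (millicore) and `Mi` suffixes, whereas real output can use other units:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseTopLine parses the first non-empty line of
// `kubectl top pod --no-headers` output, e.g. "my-pod 12m 34Mi".
func parseTopLine(output string) (cpuMillicores, memMiB uint64, err error) {
	var line string
	for _, l := range strings.Split(strings.TrimSpace(output), "\n") {
		if strings.TrimSpace(l) != "" {
			line = l
			break
		}
	}
	fields := strings.Fields(line)
	if len(fields) < 3 {
		return 0, 0, fmt.Errorf("unexpected kubectl top output: %q", output)
	}
	cpuMillicores, err = strconv.ParseUint(strings.TrimSuffix(fields[1], "m"), 10, 64)
	if err != nil {
		return 0, 0, err
	}
	memMiB, err = strconv.ParseUint(strings.TrimSuffix(fields[2], "Mi"), 10, 64)
	return cpuMillicores, memMiB, err
}

func main() {
	cpu, mem, _ := parseTopLine("my-pod 12m 34Mi\nanother-pod 5m 8Mi")
	fmt.Println(cpu, mem) // 12 34
}
```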
313-318: Consider reusing a Docker client for metrics polling.
Creating/closing a client per poll adds overhead. Cache a client on the struct or inject one; handle Close() on shutdown.
Would you like a follow‑up patch to add a lazy‑init client with a small mutex and lifecycle tied to Run/stop?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- internal/proc/container.go (5 hunks)
- internal/proc/host.go (4 hunks)
- internal/proc/kubernetes.go (4 hunks)
- internal/types/metrics.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
- internal/proc/host.go (1)
  - internal/types/metrics.go: Metrics (lines 3-6)
- internal/proc/container.go (1)
  - internal/types/metrics.go: Metrics (lines 3-6)
- internal/proc/kubernetes.go (1)
  - internal/types/metrics.go: Metrics (lines 3-6)
🔇 Additional comments (2)
internal/proc/container.go (2)
178-179: Persisting containerID is correct.
341-350: CPU millicores conversion is off by 100× (inflated values).
cpuPercent here is in percent of a single core (100% = 1 core). To get millicores, multiply by 10, not 1000.
```diff
-	cpuPercent := (float64(cpuDelta) / float64(systemDelta)) * float64(len(dockerStats.CPUStats.CPUUsage.PercpuUsage))
-	cpuMillicores = uint64(cpuPercent * 1000) // Convert to millicores
+	cpuPercent := (float64(cpuDelta) / float64(systemDelta)) * float64(len(dockerStats.CPUStats.CPUUsage.PercpuUsage)) * 100.0
+	cpuMillicores = uint64(cpuPercent * 10) // 100% = 1000m
```

Likely an incorrect or invalid review comment.
Pull Request Overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
```go
	return nil, fmt.Errorf("failed to parse CPU usage '%s': %w", fields[0], err)
}

cpuMillicores := cpuPercentage * 10 // Convert percentage to millicores (1% = 10 millicores)
```
The CPU percentage to millicores conversion is incorrect. 1% CPU usage should equal 10 millicores, but this assumes a single-core system. On multi-core systems, CPU percentage can exceed 100%. The conversion should account for the number of CPU cores or use a different approach to get absolute CPU usage.
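The single-core assumption this comment flags can be seen in the mapping itself: ps-style %CPU is percent of one core, so multi-threaded processes on multi-core hosts can report values above 100 and map past 1000m. A minimal sketch of the conversion:

```go
package main

import (
	"fmt"
	"strconv"
)

// psPercentToMillicores converts a %CPU value as printed by `ps` into
// millicores. ps reports percent of a single core (1% = 10m, 100% = 1000m),
// so values above 100 legitimately map past one core.
func psPercentToMillicores(pct string) (uint64, error) {
	p, err := strconv.ParseFloat(pct, 64)
	if err != nil {
		return 0, err
	}
	return uint64(p * 10), nil
}

func main() {
	m, _ := psPercentToMillicores("150.0")
	fmt.Println(m) // 1500 (1.5 cores)
}
```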