Skip to content

feat: Add metrics tracking#96

Merged
alexec merged 8 commits intomainfrom
metrics
Sep 15, 2025
Merged

feat: Add metrics tracking#96
alexec merged 8 commits intomainfrom
metrics

Conversation

@alexec
Copy link
Collaborator

@alexec alexec commented Sep 14, 2025

Summary by CodeRabbit

  • New Features
    • Introduced a two-pane, resizable interface: DAG viewer (left) and live logs (right).
    • Real-time updates to node status, grouping, and re-parenting via events.
    • Node selection updates header (name, metrics, message) and recenters the graph.
    • Per-node status icons; dark-themed, ANSI-colored logs with links and line numbers.
    • Displays per-task CPU and memory metrics with periodic updates while running.
    • Auto-scroll toggle and follow control, click-to-scroll, and hover cues.
    • Responsive layout with draggable resizer and window resize handling.

- Introduce `enableMetrics` boolean to workflow schema and docs - Track
CPU and memory usage for tasks (host, container, k8s) - Show metrics in
web UI and expose via API - Add types.Metrics struct and plumbing
throughout codebase
@alexec alexec requested a review from Copilot September 14, 2025 20:40
@coderabbitai

This comment was marked as outdated.

@alexec alexec marked this pull request as ready for review September 14, 2025 20:41
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds metrics collection functionality to track resource usage (CPU and memory) for tasks during workflow execution. The feature is controlled by a new enableMetrics option that defaults to enabled.

  • Introduces a new Metrics type to capture CPU and memory usage
  • Implements resource metrics collection for different process types (host, container, Kubernetes)
  • Adds metrics display in the web UI and HTTP API endpoint for accessing metrics

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
schema/workflow.schema.json Adds enableMetrics boolean field to workflow schema
internal/types/spec.go Adds EnableMetrics field and getter method with default true value
internal/types/metrics.go Defines new Metrics struct for CPU and memory tracking
internal/task_node.go Adds Metrics field to TaskNode structure
internal/server.go Implements HTTP endpoint to retrieve task metrics
internal/run.go Adds periodic metrics collection goroutine for running tasks
internal/proc/proc.go Adds GetMetrics method to process interface
internal/proc/noop.go Implements GetMetrics returning empty metrics
internal/proc/kubernetes.go Implements kubectl-based metrics collection for Kubernetes pods
internal/proc/host.go Implements ps-based metrics collection for host processes
internal/proc/container.go Implements Docker stats-based metrics collection for containers
internal/index.html Updates UI to display metrics and improves code formatting
docs/reference/workflow.md Documents new enableMetrics configuration option
docs/reference/workflow-defs-workflow.md Adds enableMetrics to workflow definition documentation
docs/reference/workflow-defs-workflow-properties-enablemetrics.md Adds detailed documentation for enableMetrics property

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

alexec and others added 3 commits September 14, 2025 13:42
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Alex Collins <alexec@users.noreply.github.com>
coderabbitai[bot]

This comment was marked as outdated.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (6)
internal/types/metrics.go (1)

4-5: Clarify CPU unit semantics (millicores, aggregated across cores).

Document that CPU is in millicores aggregated across all cores (1000m = 1 core; values may exceed 1000 on multi‑core) to keep backends/UI consistent.

Apply this diff:

-	CPU uint64 `json:"cpu"` // CPU usage in millicores
+	CPU uint64 `json:"cpu"` // CPU usage in millicores aggregated across cores (1000m = 1 core; can exceed 1000m)
internal/proc/kubernetes.go (4)

43-47: Track pods by namespace/name and guard with a mutex.

Using only names risks collisions across namespaces and introduces races. Store ns+name and protect with RWMutex.

 type k8s struct {
-	log      *log.Logger
-	spec     types.Spec
-	name     string
-	podNames []string
+	log  *log.Logger
+	spec types.Spec
+	name string
+	mu   sync.RWMutex
+	pods []struct{ ns, name string }
 	types.Task
 }

276-279: Deduplicate under lock using namespace/name; avoid races.

-		if !slices.Contains(k.podNames, pod.Name) {
-			k.podNames = append(k.podNames, pod.Name)
-		}
+		k.mu.Lock()
+		seen := false
+		for _, p := range k.pods {
+			if p.ns == pod.Namespace && p.name == pod.Name {
+				seen = true
+				break
+			}
+		}
+		if !seen {
+			k.pods = append(k.pods, struct{ ns, name string }{ns: pod.Namespace, name: pod.Name})
+		}
+		k.mu.Unlock()

435-446: Aggregate with snapshot under RLock; tolerate per‑pod errors.

One failing pod should not zero the whole read; also avoid iterating while mutating.

 func (k *k8s) GetMetrics(ctx context.Context) (*types.Metrics, error) {
-	sum := &types.Metrics{}
-	for _, podName := range k.podNames {
-		metrics, err := k.getMetrics(ctx, podName)
-		if err != nil {
-			return nil, err
-		}
-		sum.CPU += metrics.CPU
-		sum.Mem += metrics.Mem
-	}
-	return sum, nil
+	sum := &types.Metrics{}
+	k.mu.RLock()
+	pods := append([]struct{ ns, name string }{}, k.pods...) // snapshot
+	k.mu.RUnlock()
+	for _, p := range pods {
+		metrics, err := k.getMetrics(ctx, p.ns, p.name)
+		if err != nil {
+			k.log.Printf("kubectl top failed for %s/%s: %v", p.ns, p.name, err)
+			continue
+		}
+		sum.CPU += metrics.CPU
+		sum.Mem += metrics.Mem
+	}
+	return sum, nil
 }

448-456: Pass namespace to kubectl top.

Without -n, results are wrong outside the default namespace.

-func (k *k8s) getMetrics(ctx context.Context, podName string) (*types.Metrics, error) {
-	cmd := exec.CommandContext(ctx, "kubectl", "top", "pod", podName, "--no-headers")
+func (k *k8s) getMetrics(ctx context.Context, ns, podName string) (*types.Metrics, error) {
+	cmd := exec.CommandContext(ctx, "kubectl", "top", "pod", podName, "-n", ns, "--no-headers")
internal/proc/container.go (1)

335-339: Prefer inactive_file on cgroup v2 for more accurate RSS‑like memory.

Fallback to cache only if inactive_file is missing/zero.

-	memoryBytes := dockerStats.MemoryStats.Usage
-	if dockerStats.MemoryStats.Stats["cache"] != 0 {
-		memoryBytes -= dockerStats.MemoryStats.Stats["cache"]
-	}
+	memoryBytes := dockerStats.MemoryStats.Usage
+	if v := dockerStats.MemoryStats.Stats["inactive_file"]; v != 0 {
+		memoryBytes -= v
+	} else if v := dockerStats.MemoryStats.Stats["cache"]; v != 0 {
+		memoryBytes -= v
+	}
🧹 Nitpick comments (2)
internal/proc/kubernetes.go (1)

458-465: Handle multi-line output defensively.

kubectl may return multiple lines if the selector broadens; trim to the first non-empty line before parsing.

-	fields := strings.Fields(strings.TrimSpace(output))
+	lines := strings.Split(strings.TrimSpace(output), "\n")
+	if len(lines) == 0 {
+		return nil, fmt.Errorf("unexpected kubectl top output format")
+	}
+	fields := strings.Fields(lines[0])
internal/proc/container.go (1)

313-318: Consider reusing a Docker client for metrics polling.

Creating/closing a client per poll adds overhead. Cache a client on the struct or inject one; handle Close() on shutdown.

Would you like a follow‑up patch to add a lazy‑init client with a small mutex and lifecycle tied to Run/stop?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f42f28e and 01f282c.

📒 Files selected for processing (4)
  • internal/proc/container.go (5 hunks)
  • internal/proc/host.go (4 hunks)
  • internal/proc/kubernetes.go (4 hunks)
  • internal/types/metrics.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
internal/proc/host.go (1)
internal/types/metrics.go (1)
  • Metrics (3-6)
internal/proc/container.go (1)
internal/types/metrics.go (1)
  • Metrics (3-6)
internal/proc/kubernetes.go (1)
internal/types/metrics.go (1)
  • Metrics (3-6)
🔇 Additional comments (2)
internal/proc/container.go (2)

178-179: Persisting containerID is correct.


341-350: CPU millicores conversion is off by 100× (inflated values).

cpuPercent here is in percent of a single core (100% = 1 core). To get millicores, multiply by 10, not 1000.

-			cpuPercent := (float64(cpuDelta) / float64(systemDelta)) * float64(len(dockerStats.CPUStats.CPUUsage.PercpuUsage))
-			cpuMillicores = uint64(cpuPercent * 1000) // Convert to millicores
+			cpuPercent := (float64(cpuDelta) / float64(systemDelta)) * float64(len(dockerStats.CPUStats.CPUUsage.PercpuUsage)) * 100.0
+			cpuMillicores = uint64(cpuPercent * 10) // 100% = 1000m

Likely an incorrect or invalid review comment.

@alexec alexec changed the title feat: Add enableMetrics option and resource metrics tracking feat: Add metrics tracking Sep 14, 2025
@alexec alexec requested a review from Copilot September 14, 2025 21:04
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

return nil, fmt.Errorf("failed to parse CPU usage '%s': %w", fields[0], err)
}

cpuMillicores := cpuPercentage * 10 // Convert percentage to millicores (1% = 10 millicores)
Copy link

Copilot AI Sep 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CPU percentage to millicores conversion is incorrect. 1% CPU usage should equal 10 millicores, but this assumes a single-core system. On multi-core systems, CPU percentage can exceed 100%. The conversion should account for the number of CPU cores or use a different approach to get absolute CPU usage.

Copilot uses AI. Check for mistakes.
@alexec alexec requested a review from Copilot September 14, 2025 21:13
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@alexec alexec merged commit cdb8753 into main Sep 15, 2025
2 checks passed
@alexec alexec deleted the metrics branch September 15, 2025 00:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants