
telemetry for real usage detection #390

@gjkim42

Description


Plan: Phone-Home Telemetry for Real Usage Detection (#390)

Context

Axon currently has Prometheus metrics for in-cluster monitoring, but no way for the project maintainer to understand real-world adoption — how many installations exist, which agent types are popular, what features are used. This change adds anonymous phone-home telemetry via a Kubernetes CronJob (inspired by Linkerd's linkerd-heartbeat) so the maintainer can detect real usage patterns.

User sentiment: Phone-home telemetry can be controversial. The CronJob approach maximizes transparency: operators can see it via kubectl get cronjobs, inspect its logs, and opt out simply by deleting the CronJob. The data collected is strictly anonymous aggregates — no PII, repo URLs, prompts, or secrets.

Architecture

A Kubernetes CronJob runs daily using the existing controller image with a --telemetry-report flag. This flag switches the binary to one-shot mode: collect data from the K8s API, send an HTTP POST, then exit.

CronJob (daily) → axon-controller --telemetry-report
                       ↓
              Query K8s API for aggregates
                       ↓
              Get/create installation ID (ConfigMap)
                       ↓
              HTTP POST to telemetry endpoint
                       ↓
              Exit 0

Opt-out: Delete the CronJob (kubectl delete cronjob axon-telemetry -n axon-system) or remove it from install.yaml before applying.

Implementation

1. New package: internal/telemetry/telemetry.go

Report data structures and collection/sending logic.

type Report struct {
    InstallationID string        `json:"installationId"`
    Version        string        `json:"version"`
    K8sVersion     string        `json:"k8sVersion"`
    Timestamp      time.Time     `json:"timestamp"`
    Tasks          TaskReport    `json:"tasks"`
    Features       FeatureReport `json:"features"`
    Scale          ScaleReport   `json:"scale"`
    Usage          UsageReport   `json:"usage"`
}

type TaskReport struct {
    Total   int            `json:"total"`
    ByType  map[string]int `json:"byType"`  // claude-code: 50, codex: 10, ...
    ByPhase map[string]int `json:"byPhase"` // Succeeded: 45, Failed: 5, ...
}

type FeatureReport struct {
    TaskSpawners int      `json:"taskSpawners"`
    AgentConfigs int      `json:"agentConfigs"`
    Workspaces   int      `json:"workspaces"`
    SourceTypes  []string `json:"sourceTypes"` // ["github", "cron", "jira"]
}

type ScaleReport struct {
    Namespaces int `json:"namespaces"` // distinct namespaces with axon resources
}

type UsageReport struct {
    TotalCostUSD      float64 `json:"totalCostUsd"`
    TotalInputTokens  float64 `json:"totalInputTokens"`
    TotalOutputTokens float64 `json:"totalOutputTokens"`
}

Key functions:

  • Run(ctx context.Context, c client.Client, clientset kubernetes.Interface, endpoint string) error — one-shot: collect → log → send → return
  • collect(ctx context.Context, c client.Client, clientset kubernetes.Interface) (*Report, error) — queries K8s API
  • send(ctx context.Context, endpoint string, report *Report) error — HTTP POST with 10s timeout
  • getOrCreateInstallationID(ctx context.Context, c client.Client, namespace string) (string, error) — manages ConfigMap axon-telemetry with a persistent UUID

Data collection (all via K8s API):

  • List all Tasks across namespaces → count by .spec.type and .status.phase, sum cost/tokens from .status.results
  • List TaskSpawners → count, extract source types from .spec.when (github, cron, jira)
  • List AgentConfigs → count
  • List Workspaces → count
  • Distinct namespaces from all listed resources
  • K8s server version via clientset.Discovery().ServerVersion()
  • Version from internal/version.Version
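The per-task aggregation is plain counting over the listed items. A dependency-free sketch of the core loop, using a simplified stand-in for the Task type (the real type comes from Axon's API package):

```go
package main

// taskInfo is a simplified stand-in for Axon's Task type.
type taskInfo struct {
	Type    string  // .spec.type, e.g. "claude-code"
	Phase   string  // .status.phase, e.g. "Succeeded"
	CostUSD float64 // from .status.results
}

// taskAggregate mirrors the TaskReport/UsageReport fields.
type taskAggregate struct {
	Total        int
	ByType       map[string]int
	ByPhase      map[string]int
	TotalCostUSD float64
}

// aggregateTasks counts tasks by type and phase and sums cost.
func aggregateTasks(tasks []taskInfo) taskAggregate {
	agg := taskAggregate{ByType: map[string]int{}, ByPhase: map[string]int{}}
	for _, t := range tasks {
		agg.Total++
		agg.ByType[t.Type]++
		agg.ByPhase[t.Phase]++
		agg.TotalCostUSD += t.CostUSD
	}
	return agg
}
```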

2. New file: internal/telemetry/telemetry_test.go

Unit tests using controller-runtime's fake client and httptest.NewServer:

  • TestCollect — populate the fake client with Tasks, TaskSpawners, etc., and verify the report aggregates
  • TestCollectEmpty — empty cluster produces zero counts
  • TestSend — verify HTTP POST body, Content-Type header, User-Agent
  • TestSendFailure — verify errors are returned but don't panic
  • TestGetOrCreateInstallationID — verify ConfigMap creation and idempotent reads
  • TestSourceTypeExtraction — verify TaskSpawner.spec.when → sourceTypes mapping
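The source-type extraction under test can be as simple as collecting the distinct trigger kinds configured in each spawner's .spec.when. The field names below are illustrative stand-ins, since the actual TaskSpawner schema defines them:

```go
package main

import "sort"

// whenSpec is an illustrative stand-in for TaskSpawner's .spec.when;
// a non-nil field means that trigger is configured.
type whenSpec struct {
	GitHub *struct{}
	Cron   *struct{}
	Jira   *struct{}
}

// sourceTypes returns the sorted, distinct source types across spawners.
func sourceTypes(specs []whenSpec) []string {
	seen := map[string]bool{}
	for _, w := range specs {
		if w.GitHub != nil {
			seen["github"] = true
		}
		if w.Cron != nil {
			seen["cron"] = true
		}
		if w.Jira != nil {
			seen["jira"] = true
		}
	}
	out := make([]string, 0, len(seen))
	for s := range seen {
		out = append(out, s)
	}
	sort.Strings(out)
	return out
}
```

Sorting keeps the report field deterministic, which also makes the unit test a plain slice comparison.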

3. Modify: cmd/axon-controller/main.go

Add --telemetry-report and --telemetry-endpoint flags:

var telemetryReport bool
var telemetryEndpoint string
flag.BoolVar(&telemetryReport, "telemetry-report", false,
    "Run a one-shot telemetry report and exit.")
flag.StringVar(&telemetryEndpoint, "telemetry-endpoint",
    "https://telemetry.axon.dev/v1/report",
    "The endpoint to send telemetry reports to.")

After flag parsing, before manager setup:

if telemetryReport {
    // One-shot mode: collect and send telemetry, then exit
    // Set up minimal K8s client (no manager needed)
    // Call telemetry.Run(...)
    os.Exit(0)
}

The one-shot path creates a lightweight K8s client (no controller manager overhead) and calls telemetry.Run().

4. Modify: install.yaml

Add CronJob resource after the Deployment:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: axon-telemetry
  namespace: axon-system
  labels:
    app.kubernetes.io/name: axon
    app.kubernetes.io/component: telemetry
spec:
  schedule: "0 6 * * *"  # Daily at 6 AM UTC
  concurrencyPolicy: Replace
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 3
      template:
        spec:
          serviceAccountName: axon-controller  # reuse existing SA
          restartPolicy: OnFailure
          containers:
            - name: telemetry
              image: gjkim42/axon-controller:latest
              args:
                - --telemetry-report
              securityContext:
                allowPrivilegeEscalation: false
                capabilities:
                  drop: ["ALL"]

Confirm ConfigMap RBAC on the existing axon-leader-election-role Role: it already grants ConfigMap/Lease access in the axon-system namespace, so this likely only needs verification rather than new rules.

Files to Create/Modify

| File | Action |
| --- | --- |
| internal/telemetry/telemetry.go | Create — Report types, collect, send, installation ID |
| internal/telemetry/telemetry_test.go | Create — Unit tests |
| cmd/axon-controller/main.go | Modify — Add --telemetry-report and --telemetry-endpoint flags, one-shot mode |
| install.yaml | Modify — Add CronJob, verify RBAC covers ConfigMap access |

Existing Code to Reuse

  • internal/version.Version (internal/version/version.go) — report version field
  • github.com/google/uuid (already in go.mod indirect) — installation ID generation
  • k8s.io/client-go/kubernetes Clientset — K8s server version discovery
  • sigs.k8s.io/controller-runtime/pkg/client — listing CRDs
  • Existing axon-controller ServiceAccount and RBAC — reuse for CronJob (already has list access to Tasks, TaskSpawners, etc.)
  • Leader election Role in install.yaml — already has ConfigMap access in axon-system

Verification

  1. make verify — lint/fmt/vet pass
  2. make test — new unit tests in internal/telemetry/ pass
  3. Manual: run go run ./cmd/axon-controller --telemetry-report --telemetry-endpoint=http://localhost:8888 against a local cluster, verify logged payload and HTTP request
  4. Opt-out: confirm the CronJob can be deleted without affecting controller operation
