Plan: Phone-Home Telemetry for Real Usage Detection (#390)
Context
Axon currently has Prometheus metrics for in-cluster monitoring, but no way for the project maintainer to understand real-world adoption — how many installations exist, which agent types are popular, what features are used. This change adds anonymous phone-home telemetry via a Kubernetes CronJob (inspired by Linkerd's linkerd-heartbeat) so the maintainer can detect real usage patterns.
User sentiment: Phone-home telemetry can be controversial. The CronJob approach maximizes transparency: operators can see it via `kubectl get cronjobs`, inspect its logs, and opt out by simply deleting the CronJob. The data collected consists strictly of anonymous aggregates — no PII, repo URLs, prompts, or secrets.
Architecture
A Kubernetes CronJob runs daily using the existing controller image with a --telemetry-report flag. This flag switches the binary to one-shot mode: collect data from the K8s API, send an HTTP POST, then exit.
```
CronJob (daily) → axon-controller --telemetry-report
        ↓
  Query K8s API for aggregates
        ↓
  Get/create installation ID (ConfigMap)
        ↓
  HTTP POST to telemetry endpoint
        ↓
  Exit 0
```
Opt-out: Delete the CronJob (`kubectl delete cronjob axon-telemetry -n axon-system`) or remove it from install.yaml before applying.
Implementation
1. New package: internal/telemetry/telemetry.go
Report data structures and collection/sending logic.
```go
type Report struct {
	InstallationID string        `json:"installationId"`
	Version        string        `json:"version"`
	K8sVersion     string        `json:"k8sVersion"`
	Timestamp      time.Time     `json:"timestamp"`
	Tasks          TaskReport    `json:"tasks"`
	Features       FeatureReport `json:"features"`
	Scale          ScaleReport   `json:"scale"`
	Usage          UsageReport   `json:"usage"`
}

type TaskReport struct {
	Total   int            `json:"total"`
	ByType  map[string]int `json:"byType"`  // claude-code: 50, codex: 10, ...
	ByPhase map[string]int `json:"byPhase"` // Succeeded: 45, Failed: 5, ...
}

type FeatureReport struct {
	TaskSpawners int      `json:"taskSpawners"`
	AgentConfigs int      `json:"agentConfigs"`
	Workspaces   int      `json:"workspaces"`
	SourceTypes  []string `json:"sourceTypes"` // ["github", "cron", "jira"]
}

type ScaleReport struct {
	Namespaces int `json:"namespaces"` // distinct namespaces with axon resources
}

type UsageReport struct {
	TotalCostUSD      float64 `json:"totalCostUsd"`
	TotalInputTokens  float64 `json:"totalInputTokens"`
	TotalOutputTokens float64 `json:"totalOutputTokens"`
}
```

Key functions:
- `Run(ctx context.Context, c client.Client, clientset kubernetes.Interface, endpoint string) error` — one-shot: collect → log → send → return
- `collect(ctx context.Context, c client.Client, clientset kubernetes.Interface) (*Report, error)` — queries the K8s API
- `send(ctx context.Context, endpoint string, report *Report) error` — HTTP POST with a 10s timeout
- `getOrCreateInstallationID(ctx context.Context, c client.Client, namespace string) (string, error)` — manages a ConfigMap `axon-telemetry` with a persistent UUID
Data collection (all via K8s API):
- List all Tasks across namespaces → count by `.spec.type` and `.status.phase`, sum cost/tokens from `.status.results`
- List TaskSpawners → count, extract source types from `.spec.when` (github, cron, jira)
- List AgentConfigs → count
- List Workspaces → count
- Distinct namespaces from all listed resources
- K8s server version via `clientset.Discovery().ServerVersion()`
- Version from `internal/version.Version`
2. New file: internal/telemetry/telemetry_test.go
Unit tests using controller-runtime's fake client and `httptest.NewServer`:
- `TestCollect` — populate the fake client with Tasks/TaskSpawners/etc., verify report aggregates
- `TestCollectEmpty` — empty cluster produces zero counts
- `TestSend` — verify HTTP POST body, Content-Type header, User-Agent
- `TestSendFailure` — verify errors are returned but don't panic
- `TestGetOrCreateInstallationID` — verify ConfigMap creation and idempotent reads
- `TestSourceTypeExtraction` — verify TaskSpawner `.spec.when` → sourceTypes mapping
3. Modify: cmd/axon-controller/main.go
Add `--telemetry-report` and `--telemetry-endpoint` flags:

```go
var telemetryReport bool
var telemetryEndpoint string

flag.BoolVar(&telemetryReport, "telemetry-report", false,
	"Run a one-shot telemetry report and exit.")
flag.StringVar(&telemetryEndpoint, "telemetry-endpoint",
	"https://telemetry.axon.dev/v1/report",
	"The endpoint to send telemetry reports to.")
```

After flag parsing, before manager setup:
```go
if telemetryReport {
	// One-shot mode: collect and send telemetry, then exit
	// Set up minimal K8s client (no manager needed)
	// Call telemetry.Run(...)
	os.Exit(0)
}
```

The one-shot path creates a lightweight K8s client (no controller manager overhead) and calls `telemetry.Run()`.
4. Modify: install.yaml
Add CronJob resource after the Deployment:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: axon-telemetry
  namespace: axon-system
  labels:
    app.kubernetes.io/name: axon
    app.kubernetes.io/component: telemetry
spec:
  schedule: "0 6 * * *" # Daily at 6 AM UTC
  concurrencyPolicy: Replace
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 3
      template:
        spec:
          serviceAccountName: axon-controller # reuse existing SA
          restartPolicy: OnFailure
          containers:
            - name: telemetry
              image: gjkim42/axon-controller:latest
              args:
                - --telemetry-report
              securityContext:
                allowPrivilegeEscalation: false
                capabilities:
                  drop: ["ALL"]
```

Add ConfigMap RBAC to the existing `axon-leader-election-role` Role (it already has ConfigMap/Lease access in the `axon-system` namespace, so this may only require confirming coverage).
Files to Create/Modify
| File | Action |
|---|---|
| `internal/telemetry/telemetry.go` | Create — Report types, collect, send, installation ID |
| `internal/telemetry/telemetry_test.go` | Create — Unit tests |
| `cmd/axon-controller/main.go` | Modify — Add `--telemetry-report` and `--telemetry-endpoint` flags, one-shot mode |
| `install.yaml` | Modify — Add CronJob, verify RBAC covers ConfigMap access |
Existing Code to Reuse
- `internal/version.Version` (`internal/version/version.go`) — report version field
- `github.com/google/uuid` (already in go.mod, indirect) — installation ID generation
- `k8s.io/client-go/kubernetes` Clientset — K8s server version discovery
- `sigs.k8s.io/controller-runtime/pkg/client` — listing CRDs
- Existing `axon-controller` ServiceAccount and RBAC — reuse for the CronJob (already has list access to Tasks, TaskSpawners, etc.)
- Leader election Role in `install.yaml` — already has ConfigMap access in `axon-system`
Verification
- `make verify` — lint/fmt/vet pass
- `make test` — new unit tests in `internal/telemetry/` pass
- Manual: run `go run ./cmd/axon-controller --telemetry-report --telemetry-endpoint=http://localhost:8888` against a local cluster, verify the logged payload and HTTP request
- Opt-out: confirm the CronJob can be deleted without affecting controller operation