Symptoms: Prometheus shows no `cronjob_monitor_*` metrics.

Possible causes:

- **Operator not running**

  ```bash
  kubectl get pods -n monitoring -l app.kubernetes.io/name=onax
  ```

  Check that the pod is in the `Running` state.

- **Prometheus not scraping**

  - Check Prometheus targets: Status > Targets
  - Look for the `onax` target
  - If missing, check the ServiceMonitor or scrape config

- **No CronJobs in cluster**

  ```bash
  kubectl get cronjobs --all-namespaces
  ```

  Metrics only appear when CronJobs exist.

- **RBAC issues**

  ```bash
  kubectl logs -n monitoring -l app.kubernetes.io/name=onax
  ```

  Look for permission errors.
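If the operator pod is healthy but Prometheus still shows nothing, you can rule out the scrape path by reading the metrics endpoint directly. A minimal sketch, assuming the chart exposes a Service named `onax` in the `monitoring` namespace with metrics on port 8080 (both are assumptions — check your install):

```shell
# Forward the operator's metrics port to localhost.
# Service name "onax" and port 8080 are assumptions -- verify with
# "kubectl get svc -n monitoring" first.
kubectl port-forward -n monitoring svc/onax 8080:8080 &

# Fetch the raw metrics and filter for the cronjob_monitor_* families.
curl -s http://localhost:8080/metrics | grep cronjob_monitor_
```

If metrics appear here but not in Prometheus, the problem is on the scrape side (ServiceMonitor or labels); if they are absent here too, check RBAC and whether any CronJobs exist.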
Alert: CronJobFailed (severity: warning)

Runbook:

- Check CronJob status:

  ```bash
  kubectl describe cronjob <name> -n <namespace>
  ```

- Check recent Job logs:

  ```bash
  kubectl get jobs -n <namespace> | grep <cronjob-name>
  kubectl logs job/<job-name> -n <namespace>
  ```

- Check Pod events:

  ```bash
  kubectl get events -n <namespace> --field-selector involvedObject.kind=Job
  ```

- Common failure reasons:
  - Image pull errors
  - Resource limits exceeded
  - Command/script errors
  - Missing secrets/configmaps
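The steps above can be narrowed with standard `kubectl` selectors: Pods that ended in a `Failed` phase cover the image-pull and OOM cases, and a crashed container's previous logs usually hold the actual error. `<namespace>` and `<pod-name>` are placeholders, as elsewhere in this runbook:

```shell
# List Pods that ended in a Failed phase in the CronJob's namespace.
kubectl get pods -n <namespace> --field-selector=status.phase=Failed

# If a container crashed and restarted, its previous logs hold the error:
kubectl logs <pod-name> -n <namespace> --previous
```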
Alert: CronJobMissedSchedule (severity: warning)

Runbook:

- Check CronJob events:

  ```bash
  kubectl describe cronjob <name> -n <namespace>
  ```

  Look for "Cannot determine if job needs to be started" or similar.

- Common causes:
  - Cluster was down during scheduled time
  - CronJob controller issues
  - Starting deadline too short
  - Too many concurrent jobs

- Fix: Increase `startingDeadlineSeconds`:

  ```yaml
  spec:
    startingDeadlineSeconds: 300
  ```
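Rather than editing the manifest by hand, the deadline can be applied in place with a merge patch; this is plain `kubectl patch` against the CronJob spec, not an onax feature:

```shell
# Allow up to 5 minutes of slack before a run counts as missed.
kubectl patch cronjob <name> -n <namespace> --type merge \
  -p '{"spec": {"startingDeadlineSeconds": 300}}'
```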
Alert: CronJobSlowExecution (severity: warning)

Runbook:

- Check current execution duration:

  ```promql
  cronjob_monitor_execution_duration_seconds{cronjob="<name>"}
  ```

- Common causes:
  - Increased data volume
  - External service slowdown
  - Resource contention
  - Network issues

- Check Pod resource usage:

  ```bash
  kubectl top pod -n <namespace> | grep <cronjob-name>
  ```

- Consider setting `activeDeadlineSeconds`:

  ```yaml
  spec:
    jobTemplate:
      spec:
        activeDeadlineSeconds: 3600
  ```
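As with the starting deadline, this can be applied in place with a merge patch; `activeDeadlineSeconds` lives on the Job template's spec, so the patch path is one level deeper:

```shell
# Kill any run that exceeds 1 hour so it cannot block later schedules.
kubectl patch cronjob <name> -n <namespace> --type merge \
  -p '{"spec": {"jobTemplate": {"spec": {"activeDeadlineSeconds": 3600}}}}'
```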
Alert: CronJobLowSuccessRate (severity: critical)

Runbook:

- Check failure patterns:

  ```promql
  rate(cronjob_monitor_executions_total{cronjob="<name>", status="failed"}[24h])
  ```

- Review recent failures:

  ```bash
  kubectl get jobs -n <namespace> | grep <cronjob-name>
  # Check failed jobs
  kubectl describe job <failed-job-name> -n <namespace>
  ```

- Look for patterns:
  - Time-based (certain hours)
  - Resource-based (memory/CPU spikes)
  - External dependency failures
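To see the failure ratio rather than the raw failure rate, the same counter can be divided by the total across all statuses. A sketch, assuming the `status` label on `cronjob_monitor_executions_total` distinguishes failed from successful runs (only `status="failed"` is confirmed above):

```promql
sum(rate(cronjob_monitor_executions_total{cronjob="<name>", status="failed"}[24h]))
/
sum(rate(cronjob_monitor_executions_total{cronjob="<name>"}[24h]))
```

A result near 0 is healthy; a result near 1 means almost every run fails. The `sum()` wrappers are needed so the division matches a single series on each side.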
Alert: CronJobNoRecentSuccess (severity: critical)

Runbook:

- Check if the CronJob is suspended:

  ```bash
  kubectl get cronjob <name> -n <namespace> -o jsonpath='{.spec.suspend}'
  ```

- Check the concurrency policy:

  ```bash
  kubectl get cronjob <name> -n <namespace> -o jsonpath='{.spec.concurrencyPolicy}'
  ```

  If `Forbid`, a running job may block new ones.

- Check the last success timestamp:

  ```promql
  cronjob_monitor_last_success_timestamp{cronjob="<name>"}
  ```

- Check for missed schedules:

  ```promql
  increase(cronjob_monitor_missed_schedules_total{cronjob="<name>"}[24h])
  ```

- Check cluster resources:

  ```bash
  kubectl describe nodes | grep -A 5 "Allocated resources"
  ```
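If the CronJob turns out to be suspended, or you want to confirm the workload itself still succeeds without waiting for the next schedule, both can be done with standard `kubectl`:

```shell
# Resume a suspended CronJob.
kubectl patch cronjob <name> -n <namespace> --type merge \
  -p '{"spec": {"suspend": false}}'

# Trigger a one-off run now; the Job inherits the CronJob's template.
kubectl create job --from=cronjob/<name> <name>-manual -n <namespace>
```

Watching the manually created Job succeed or fail is often the fastest way to separate scheduling problems from workload problems.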
Enable debug logging for more verbose output:

```bash
helm upgrade onax oci://ghcr.io/varaxlabs/charts/onax \
  --set logging.level=debug
```

- Check GitHub Issues
- Search existing discussions
- Open a new issue with:
  - Kubernetes version
  - onax version
  - Relevant logs
  - Steps to reproduce