Splunk Observability Cloud dashboards for monitoring the Secure AI Factory (SAIF) Platform.
This repository contains dashboard specifications, chart definitions, and import scripts for the SAIF Platform Splunk Observability dashboards. These dashboards provide real-time visibility into infrastructure health, AI workload performance, and security posture across all clusters.
Cluster resource monitoring including node CPU/memory utilization, storage capacity, and UCS hardware health metrics.
- Location: dashboards/infrastructure/
- Data source: Splunk OTEL Collector (DaemonSet metrics)
GPU utilization, NIM inference performance, model latency, and throughput metrics for AI/ML workloads.
- Location: dashboards/ai-workloads/
- Data source: DCGM Exporter metrics via Splunk OTEL Collector
Main overview dashboard with status tiles, Cilium/Hubble network flow metrics, Tetragon security event counts, and platform health summary.
- Location: dashboards/secure-ai-factory/
- Data source: Splunk OTEL Collector, CronJob text chart updates
saif-splunk-dashboard/
├── charts/ # Individual chart JSON definitions
├── dashboards/
│ ├── infrastructure/ # Infrastructure dashboard spec + import script
│ ├── ai-workloads/ # AI workloads dashboard
│ ├── secure-ai-factory/ # Main overview dashboard
│ └── isovalent-redesign/ # Cilium/Hubble-focused dashboard
├── docs/
│ ├── ARCHITECTURE.md # Dashboard structure and data sources
│ └── EXECUTION_PLAN.md # Implementation phases
└── README.md
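Because charts/ holds individual chart JSON definitions, a quick sanity check before running an import script could look like the sketch below. It assumes the definitions are `*.json` files sitting directly under charts/; the helper name `load_chart_definitions` is hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def load_chart_definitions(charts_dir):
    """Parse every *.json file directly under charts_dir.

    Returns a dict mapping file name to the parsed JSON document.
    Raises json.JSONDecodeError if any file is not valid JSON.
    """
    charts = {}
    for path in sorted(Path(charts_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            charts[path.name] = json.load(handle)
    return charts


if __name__ == "__main__":
    # Assumed layout: run from the repository root.
    for name in load_chart_definitions("charts"):
        print(f"parsed {name}")
```

Failing early on a malformed file keeps a broken definition from being pushed partway through an import.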
The dashboards are populated by two data paths:
- Metrics: Splunk OTEL Collector (deployed as a DaemonSet on each cluster) scrapes Prometheus endpoints and forwards metrics to Splunk Observability Cloud.
- Text charts: A CronJob queries cluster state and updates Splunk text charts via the Splunk API.
Both components are deployed via ArgoCD from the saif-gitops repository:
- apps/splunk-otel/ -- OTEL Collector configuration
- apps/splunk-reporter/ -- CronJob for text chart updates
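The text chart path can be illustrated with a minimal sketch of the kind of request the CronJob might issue. This is not the code deployed from saif-gitops. It assumes the Splunk Observability Cloud chart endpoint (`PUT /v2/chart/{id}` on `api.<realm>.signalfx.com`, authenticated with an `X-SF-TOKEN` header) and a text chart payload with `options.type` set to `Text` and the body in `options.markdown`. The function names, realm, chart ID, and status fields are all hypothetical. The request is built but never sent.

```python
import json
import urllib.request


def render_status(nodes_ready, nodes_total):
    """Format a hypothetical cluster summary as Markdown for a text chart."""
    return f"**Nodes ready:** {nodes_ready}/{nodes_total}"


def build_text_chart_update(realm, token, chart_id, name, markdown):
    """Build (but do not send) a PUT request that replaces a text chart's body.

    The URL and payload shape are assumptions about the Splunk
    Observability Cloud chart API; check them against the API reference.
    """
    url = f"https://api.{realm}.signalfx.com/v2/chart/{chart_id}"
    payload = {
        "name": name,
        "options": {"type": "Text", "markdown": markdown},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json", "X-SF-TOKEN": token},
    )


if __name__ == "__main__":
    # Placeholder values; a real job would read these from its environment.
    request = build_text_chart_update(
        "us1", "<token>", "<chart-id>", "Platform health", render_status(3, 3)
    )
    print(request.get_method(), request.full_url)
```

Separating request construction from sending makes the payload easy to inspect or unit test without network access; sending it would be a call to `urllib.request.urlopen(request)`.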
| Repository | Relationship |
|---|---|
| saif-gitops | CronJob and OTEL Collector deployment |
| saif-platform | Platform orchestration and SBOM |
- Architecture -- Dashboard structure, data sources, chart reference
- Execution Plan -- Implementation phases and tasks
This project is licensed under the Cisco Sample Code License, Version 1.1. See LICENSE for details.