A comprehensive Grafana dashboard designed to identify oversized and undersized Kubernetes deployments, enabling data-driven cost optimization decisions for infrastructure teams and business stakeholders.
Maintainer: dhruvimehta228@gmail.com
- Overview
- Features
- Quick Start
- Prerequisites
- Installation
- Configuration
- Usage Guide
- Methodology
- Troubleshooting
- Cost Optimization Workflow
- Limitations
- Contributing
This dashboard helps organizations optimize their Kubernetes resource allocation by:
- Identifying oversized deployments that waste money by requesting more resources than needed
- Detecting undersized deployments that may suffer performance issues due to resource constraints
- Calculating potential cost savings from right-sizing resources
- Providing actionable insights through intuitive visualizations for both technical and non-technical users
- Oversized: Deployments using < 20% of requested CPU/memory
- Undersized: Deployments using > 80% of requested CPU/memory
- Optimal: Deployments with 20-80% resource utilization
- Resource Status Overview
  - Pie chart showing deployment distribution across utilization categories
  - Instant overview of optimization opportunities with color-coded segments
- Detailed Analysis Table
  - Deployment-level resource usage and recommendations
  - Color-coded status indicators for quick decision making
  - Sortable by impact and savings potential
- Trend Analysis
  - Time-series charts showing top resource consumers
  - Historical patterns to validate optimization decisions
- Cost Impact Summary
  - Monthly savings potential from oversized deployments
  - Real-time counts of deployments requiring attention
- Non-technical friendly: Emoji icons and clear status indicators
- Color-coded backgrounds: Green (optimal), Orange (oversized), Red (undersized)
- Real-time updates: 30-second refresh interval
- Executive dashboard: Perfect for both technical teams and business stakeholders
- Download the resource-optimization-dashboard.json file
- Open Grafana → + (Plus) → Import
- Upload JSON file or paste content
- Select Prometheus datasource
- Save dashboard
Install the full monitoring stack using Helm:
# Add Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack (includes Prometheus, Grafana, exporters)
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set grafana.adminPassword=admin123
- Kubernetes cluster (v1.16+)
- Prometheus (v2.20+) with proper configuration
- Grafana (v7.0+)
- kube-state-metrics (v2.0+)
- cAdvisor (usually bundled with kubelet)
The dashboard requires these Prometheus metrics to be available:
# Container resource usage
container_cpu_usage_seconds_total
container_memory_working_set_bytes
# Resource requests
kube_pod_container_resource_requests
# Pod metadata
kube_pod_info
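As a quick sanity check, you can run ad-hoc queries like the ones below in the Prometheus UI. These are illustrative availability checks, not queries used by the dashboard itself:

```promql
# Non-zero result means cAdvisor container metrics are being scraped
count(container_cpu_usage_seconds_total{container!="POD",container!=""})

# Non-zero result means kube-state-metrics is exposing resource requests
count(kube_pod_container_resource_requests{resource="cpu"})

# Non-zero result means pod metadata from kube-state-metrics is available
count(kube_pod_info)
```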
Ensure your monitoring setup can access:
- Pod metrics and metadata
- ReplicaSet information
- Resource requests and limits
The dashboard uses default cloud pricing estimates. To customize:
- CPU Pricing (default: $0.024 per CPU-hour):
# Find this in the dashboard queries and modify:
kube_pod_container_resource_requests{resource="cpu"} * 1000 * 0.024
# Change 0.024 to your actual CPU cost per core-hour
- Memory Pricing (default: $0.012 per GB-hour):
# Modify this multiplier:
kube_pod_container_resource_requests{resource="memory"} / 1024 / 1024 / 1024 * 0.012
# Change 0.012 to your actual memory cost per GB-hour
To modify the utilization thresholds:
- Oversized threshold (default: < 20%): change the < 20 comparison at the end of the relevant dashboard queries to your preferred percentage (a complete example query is sketched below)
- Undersized threshold (default: > 80%): change the > 80 comparison at the end of the relevant dashboard queries to your preferred percentage
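For reference, the sketch below shows one way to express a complete oversized-CPU check end to end. It is an illustration rather than the literal panel query from the dashboard JSON; it assumes kube-state-metrics reports CPU requests in cores and aggregates both sides to the pod level so the division matches:

```promql
# CPU utilization as a percentage of the request, per pod;
# results below 20 indicate potentially oversized workloads
(
  sum by (namespace, pod) (
    rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])
  )
  /
  sum by (namespace, pod) (
    kube_pod_container_resource_requests{resource="cpu"}
  )
) * 100 < 20
```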
- Quick Assessment
  - Look at the pie chart at the top for an immediate overview
  - Focus on orange (oversized) sections for cost-saving opportunities
  - Red sections indicate potential performance risks
- Priority Actions
  - Review the analysis table sorted by potential impact
  - Focus on deployments with the highest dollar impact first
  - Use the emoji indicators for quick status understanding
- Decision Making
  - ✅ Optimal: No action needed
  - ⚠️ Oversized: Safe to reduce resource requests
  - 🚨 Undersized: Needs more resources or investigation
- Deep Analysis
  - Use the detailed table to see exact utilization percentages
  - Review trend charts to understand usage patterns over time
  - Cross-reference with application performance metrics
- Implementation Planning
  - Start with highest-impact oversized deployments
  - Make gradual adjustments (10-20% reductions)
  - Monitor for 1-2 weeks before further optimization
- Validation Process
  - Use time-series charts to confirm usage patterns
  - Check if low utilization is due to recent deployment or genuine over-provisioning
  - Consider business requirements and SLA needs
| Panel | Purpose | Action Items |
|---|---|---|
| 🎯 Resource Status Overview | Executive summary of resource distribution | Identify overall optimization opportunity |
| 💰 Deployments Needing Attention | Detailed per-deployment metrics | Prioritize optimization efforts |
| 📊 Resource Usage Trends | Historical usage patterns | Validate optimization decisions |
| 💸 Cost Savings Potential | Financial impact metrics | Report ROI to stakeholders |
|  | Quick status indicators | Monitor optimization progress |
CPU Utilization:
(rate(container_cpu_usage_seconds_total[5m]) * 100) /
(kube_pod_container_resource_requests{resource="cpu"} * 1000)
Memory Utilization:
(container_memory_working_set_bytes) /
(kube_pod_container_resource_requests{resource="memory"})
The dashboard specifically targets Kubernetes Deployments by:
- Filtering for pods created by ReplicaSets (created_by_kind="ReplicaSet")
- Extracting deployment names from the ReplicaSet naming convention (sketched below)
- Excluding StatefulSets, DaemonSets, Jobs, and CronJobs
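One common way to implement this is label_replace over kube_pod_info, stripping the pod-template hash that Kubernetes appends to ReplicaSet names. The sketch below illustrates the idea; the deployment label name is chosen here for readability and may differ from the dashboard's internal queries:

```promql
# Derive a "deployment" label from the owning ReplicaSet's name,
# e.g. "my-app-5d4c7b9f6d" becomes "my-app"
label_replace(
  kube_pod_info{created_by_kind="ReplicaSet"},
  "deployment", "$1",
  "created_by_name", "(.+)-[^-]+"
)
```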
Monthly Cost Estimate:
- CPU: CPU_cores × hours_per_month × cost_per_core_hour
- Memory: Memory_GB × hours_per_month × cost_per_GB_hour
- Hours per month: 720 (24 × 30)
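As a worked example, a deployment requesting 2 CPU cores and 4 GB of memory would be estimated at 2 × 720 × $0.024 ≈ $34.56 per month for CPU plus 4 × 720 × $0.012 ≈ $34.56 per month for memory. The same arithmetic can be sketched cluster-wide in PromQL (actual panel queries may differ):

```promql
# Estimated monthly CPU cost per namespace:
# requested cores × 720 hours × $0.024 per core-hour
sum by (namespace) (
  kube_pod_container_resource_requests{resource="cpu"}
) * 720 * 0.024
```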
Savings Calculation:
- Identifies resources above/below optimal thresholds
- Calculates potential reduction for oversized resources
- Estimates monthly savings based on resource pricing
Symptoms:
- All panels show "No data"
- Queries return empty results
Solutions:
# Check Prometheus connectivity
# (the service name depends on your install; a kube-prometheus-stack
# release typically exposes svc/prometheus-operated on port 9090)
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Verify metrics are being scraped
curl 'http://localhost:9090/api/v1/query?query=up{job="kubelet"}'
# Check kube-state-metrics
kubectl get pods -n monitoring | grep kube-state-metrics
kubectl logs -n monitoring deployment/kube-state-metrics
Symptoms:
- Expected deployments missing from table
- Lower counts than expected
Diagnosis:
# Check if pods have resource requests
kubectl describe deployment <deployment-name> | grep -A 10 "requests"
# Verify ReplicaSet labeling
kubectl get replicasets -o custom-columns="NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].name"
# Check how pods were created (owner kind should be ReplicaSet for Deployments)
kubectl get pods -o custom-columns="NAME:.metadata.name,OWNER_KIND:.metadata.ownerReferences[0].kind"
Solutions:
- Ensure deployments have resource requests defined (the query sketched below can help find pods without them)
- Verify ReplicaSet naming follows standard convention
- Check if workloads are actually Deployments (not StatefulSets, etc.)
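To find workloads the dashboard will skip because no CPU request is defined, a diagnostic query along these lines can help (a sketch, not part of the dashboard):

```promql
# Pods that expose metadata but have no CPU request defined
kube_pod_info
  unless on (namespace, pod)
kube_pod_container_resource_requests{resource="cpu"}
```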
Symptoms:
- Dashboard loads slowly
- Grafana becomes unresponsive
Solutions:
- Increase query intervals from 30s to 1m or 5m
- Add namespace filters to reduce query scope
- Implement Prometheus recording rules for complex calculations
Test individual components:
# Basic container metrics
container_cpu_usage_seconds_total{container!="POD",container!=""}
# Resource requests
kube_pod_container_resource_requests{resource="cpu"}
# Pod-to-deployment mapping
kube_pod_info{created_by_kind="ReplicaSet"}
# Complete utilization calculation
(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m]) * 100) /
(kube_pod_container_resource_requests{resource="cpu"} * 1000)
Assessment:
- Install and configure the dashboard
- Collect baseline data for 2 weeks minimum
- Identify patterns in resource utilization
- Document findings and create optimization plan
Planning:
- Prioritize deployments by potential savings
- Assess business impact of each optimization
- Create rollback plans for critical applications
- Schedule optimization windows during low-traffic periods
Implementation:
- Start with highest-impact, lowest-risk deployments
- Make incremental changes (10-20% adjustments); the headroom query sketched after this list can help size them
- Monitor application performance closely
- Wait 3-7 days between optimization rounds
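To gauge how much headroom exists before reducing a request, compare peak usage over the observation window with the current request. The subquery below is one way to sketch this; adjust the 7-day window and 5-minute resolution to your retention and scrape settings:

```promql
# Peak CPU usage over the past 7 days as a fraction of the current request;
# values well below 1 suggest room for a gradual reduction
max_over_time(
  (
    sum by (namespace, pod) (
      rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])
    )
  )[7d:5m]
)
/
sum by (namespace, pod) (
  kube_pod_container_resource_requests{resource="cpu"}
)
```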
Validation:
- Track cost savings using dashboard metrics
- Monitor application health and performance
- Document lessons learned for future optimizations
- Repeat cycle quarterly or as needed
Safety guidelines:
- Never optimize during peak business hours
- Always have rollback procedures ready
- Coordinate with application owners
- Monitor for at least 48 hours after changes
- Document all changes for compliance
- Workload Patterns
  - May not account for seasonal or cyclical usage patterns
  - Short-term spikes might not be captured in 5-minute averages
  - Cold start effects can skew metrics for new deployments
- Kubernetes Scope
  - Only covers Deployment workloads (excludes StatefulSets, DaemonSets, Jobs)
  - Requires resource requests to be defined
  - Multi-container pods are aggregated, potentially masking individual issues
- Cost Accuracy
  - Uses estimated cloud pricing, not actual billing
  - Doesn't account for reserved instances or volume discounts
  - No consideration for networking, storage, or other costs
- Context Awareness
  - Cannot determine business criticality of applications
  - No awareness of SLA requirements or compliance needs
  - May suggest optimizations that conflict with disaster recovery plans
- Performance Correlation
  - Doesn't directly measure application performance impact
  - Can't predict performance degradation from resource reductions
  - No integration with APM or user experience metrics
When reporting issues, please include:
- Kubernetes version and distribution
- Prometheus and Grafana versions
- Dashboard JSON version
- Error messages or unexpected behavior
- Steps to reproduce the issue
We welcome suggestions for:
- Additional metrics and calculations
- New visualization types
- Integration with other monitoring tools
- Cost model improvements
- Fork the repository
- Set up test environment with minikube or kind
- Install monitoring stack using provided instructions
- Test changes against multiple deployment patterns
- Submit pull request with detailed description
This project is licensed under the MIT License - see the LICENSE file for details.
- Maintainer: dhruvimehta228@gmail.com
- Documentation: Check this README and inline comments
- Issues: Use GitHub Issues for bug reports and feature requests
Made with ❤️ for the Kubernetes community
Help us improve this dashboard by sharing your feedback and optimization success stories!