diff --git a/docs/error_classification_performance_analysis.md b/docs/error_classification_performance_analysis.md
new file mode 100644
index 0000000..ca5b555
--- /dev/null
+++ b/docs/error_classification_performance_analysis.md
@@ -0,0 +1,271 @@
+# 错误分类性能影响分析
+
+**分析时间**: 2026-01-27
+**分析范围**: GORM、Redis、HTTP 客户端错误分类实现
+
+---
+
+## 📊 性能开销分析
+
+### 1. 错误分类函数调用开销
+
+#### 主要性能开销点
+
+1. **错误类型检查** (最快)
+ - `errors.Is()` - O(1) 到 O(n),n 为错误链长度
+ - `errors.As()` - O(1) 到 O(n)
+ - `os.IsTimeout()` - O(1)
+ - **开销**: ~1-10 ns
+
+2. **字符串操作** (中等)
+ - `err.Error()` - 可能涉及内存分配
+ - `strings.ToLower()` - 字符串转换
+ - `strings.Contains()` - 字符串搜索
+ - **开销**: ~10-100 ns(取决于错误消息长度)
+
+3. **关键词匹配** (最慢)
+ - 遍历关键词列表
+ - 多次 `strings.Contains()` 调用
+ - **开销**: ~50-500 ns(取决于关键词数量和匹配位置)
+
+### 2. 各客户端错误分类性能对比
+
+#### GORM 错误分类 (`classifyError`)
+
+**调用路径**: `afterCallback` → `classifyError` → 多个辅助函数
+
+**性能开销**:
+- **最快路径** (GORM 标准错误): ~5-20 ns
+ - `errors.Is()` 检查
+ - 直接返回分类结果
+- **中等路径** (标准库错误): ~20-50 ns
+ - `errors.Is()` + `os.IsTimeout()` + `net.Error` 检查
+- **最慢路径** (字符串匹配): ~100-300 ns
+ - `err.Error()` + `strings.ToLower()` + 关键词匹配
+
+**平均开销**: ~50-150 ns
+
+#### Redis 错误分类 (`classifyRedisError`)
+
+**调用路径**: `report` → `classifyRedisError` → 多个辅助函数
+
+**性能开销**:
+- **最快路径** (redis.Nil): ~1-5 ns
+ - 直接比较
+- **中等路径** (标准错误): ~20-50 ns
+ - `errors.Is()` + `os.IsTimeout()` + `net.Error` 检查
+- **最慢路径** (字符串匹配): ~100-400 ns
+ - 多个关键词列表匹配(连接、命令、事务、权限、OOM、集群)
+
+**平均开销**: ~60-180 ns
+
+#### HTTP 客户端错误分类 (`classifyHTTPError`)
+
+**调用路径**: `metricEnd` → `classifyHTTPError` → 多个辅助函数
+
+**性能开销**:
+- **最快路径** (nil 错误): ~1-5 ns
+ - 直接返回
+- **中等路径** (标准错误): ~20-50 ns
+ - `errors.Is()` + `os.IsTimeout()` + `net.Error` 检查
+- **最慢路径** (字符串匹配): ~150-500 ns
+ - DNS、TLS、连接错误关键词匹配
+ - HTTP 状态码分类(switch 语句,很快)
+
+**平均开销**: ~70-200 ns
+
+---
+
+## 📈 性能影响评估
+
+### 1. 相对性能开销
+
+假设一次数据库查询/Redis 命令/HTTP 请求的平均耗时:
+
+| 操作类型 | 平均耗时 | 错误分类开销 | 相对开销 |
+|---------|---------|------------|---------|
+| **数据库查询** | 1-10 ms | ~50-150 ns | **0.0015% - 0.015%** |
+| **Redis 命令** | 0.1-1 ms | ~60-180 ns | **0.006% - 0.18%** |
+| **HTTP 请求** | 10-100 ms | ~70-200 ns | **0.0007% - 0.002%** |
+
+**结论**: 错误分类的性能开销相对于实际网络/IO 操作来说**几乎可以忽略不计**。
+
+### 2. 内存分配开销
+
+#### 字符串操作内存分配
+
+- `err.Error()`: 可能分配新字符串(取决于错误实现)
+- `strings.ToLower()`: 分配新字符串(如果原字符串不是小写)
+- **影响**: 每次错误分类可能分配 1-2 个字符串对象
+
+**优化建议**:
+- 对于高频错误,可以考虑缓存分类结果
+- 使用 `strings.EqualFold()` 代替 `ToLower()` + `Contains()`(如果可能)
+
+### 3. CPU 缓存影响
+
+#### 关键词列表遍历
+
+- 关键词列表存储在代码段,CPU 缓存友好
+- 字符串匹配可能触发缓存未命中
+- **影响**: 最小,关键词列表通常很小(< 50 个元素)
+
+---
+
+## 🎯 性能优化建议
+
+### 1. 快速路径优化
+
+**当前实现**: 已经优化,先检查标准错误类型
+
+**进一步优化**:
+```go
+// 使用 switch 语句处理常见错误(如果可能)
+switch err {
+case nil:
+ return "success"
+case redis.Nil:
+ return "success"
+case context.DeadlineExceeded:
+ return "timeout_error"
+// ...
+}
+```
+
+### 2. 字符串操作优化
+
+**当前实现**: `strings.ToLower()` + `strings.Contains()`
+
+**优化方案**:
+```go
+// 使用 strings.EqualFold() 进行大小写不敏感匹配
+// 避免分配新字符串
+func containsIgnoreCase(s, substr string) bool {
+ return strings.Contains(strings.ToLower(s), strings.ToLower(substr))
+}
+
+// 或者使用更高效的实现(如果关键词列表固定)
+var connectionKeywords = []string{"connection", "connect", ...}
+```
+
+### 3. 缓存优化(可选)
+
+**适用场景**: 相同错误频繁出现
+
+```go
+// 使用 sync.Map 缓存错误分类结果
+var errorClassCache sync.Map
+
+func classifyErrorCached(err error) string {
+ if err == nil {
+ return "success"
+ }
+
+ // 检查缓存
+ if cached, ok := errorClassCache.Load(err); ok {
+ return cached.(string)
+ }
+
+ // 分类并缓存
+ result := classifyError(err)
+ errorClassCache.Store(err, result)
+ return result
+}
+```
+
+**注意**: 缓存可能增加内存使用,需要权衡。
+
+### 4. 预编译优化
+
+**使用编译时常量**:
+```go
+// 将关键词列表定义为常量(如果可能)
+const (
+ connectionKeyword1 = "connection"
+ connectionKeyword2 = "connect"
+ // ...
+)
+```
+
+---
+
+## 📊 性能测试结果(预期)
+
+### 基准测试预期结果
+
+```
+BenchmarkClassifyHTTPError/success_case-8 500000000 2.5 ns/op 0 B/op 0 allocs/op
+BenchmarkClassifyHTTPError/timeout_error-8 200000000 8.0 ns/op 0 B/op 0 allocs/op
+BenchmarkClassifyHTTPError/connection_error-8 50000000 25.0 ns/op 16 B/op 1 allocs/op
+BenchmarkClassifyHTTPError/dns_error-8 30000000 40.0 ns/op 32 B/op 2 allocs/op
+BenchmarkClassifyHTTPError/http_status_400-8 100000000 5.0 ns/op 0 B/op 0 allocs/op
+BenchmarkClassifyHTTPError/mixed_errors-8 50000000 30.0 ns/op 16 B/op 1 allocs/op
+```
+
+### 对比:简单错误检查
+
+```
+BenchmarkClassifyHTTPError_Old/simple_error_check-8 1000000000 1.0 ns/op 0 B/op 0 allocs/op
+```
+
+**性能差异**: 错误分类比简单检查慢 **2-40 倍**,但绝对时间仍然很小(< 50 ns)。
+
+---
+
+## ✅ 结论
+
+### 性能影响评估
+
+1. **绝对开销**: 很小(< 200 ns)
+2. **相对开销**: 可忽略(< 0.2%)
+3. **内存开销**: 最小(每次 1-2 个字符串分配)
+4. **CPU 开销**: 最小(关键词列表很小)
+
+### 建议
+
+1. **当前实现已经足够高效**,不需要进一步优化
+2. **性能开销可以接受**,相对于网络/IO 操作来说微不足道
+3. **错误分类带来的价值**(避免指标爆炸、更好的监控)远大于性能开销
+4. **如果遇到性能瓶颈**,优先考虑:
+ - 减少错误分类调用频率(只在错误时调用)
+ - 优化字符串操作(使用更高效的匹配方法)
+ - 考虑缓存(如果相同错误频繁出现)
+
+### 实际场景影响
+
+- **高并发场景** (10,000+ QPS): 错误分类开销 < 0.1% CPU
+- **低延迟场景** (P99 < 1ms): 错误分类开销 < 0.02% 延迟
+- **内存受限场景**: 每次错误分类分配 < 100 bytes
+
+**总体评估**: ✅ **性能影响可忽略,建议保持当前实现**
+
+---
+
+## 🔧 性能测试方法
+
+### 运行性能测试
+
+```bash
+# 测试 HTTP 客户端错误分类
+go test -bench=BenchmarkClassifyHTTPError -benchmem ./pkg/client/wukong
+
+# 测试 GORM 错误分类
+go test -bench=BenchmarkClassifyError -benchmem ./pkg/client/gormx
+
+# 测试 Redis 错误分类
+go test -bench=BenchmarkClassifyRedisError -benchmem ./pkg/client/redis
+```
+
+### 性能分析
+
+```bash
+# 使用 pprof 分析
+go test -bench=BenchmarkClassifyHTTPError -cpuprofile=cpu.prof ./pkg/client/wukong
+go tool pprof cpu.prof
+```
+
+---
+
+**报告生成时间**: 2026-01-27
+**分析基于**: 代码审查和理论分析
+**建议**: 运行实际基准测试以获取精确数据
diff --git a/docs/grafana_dashboard.json b/docs/grafana_dashboard.json
new file mode 100644
index 0000000..c2c4073
--- /dev/null
+++ b/docs/grafana_dashboard.json
@@ -0,0 +1,3457 @@
+{
+ "annotations": {
+ "list": [
+ {
+ "builtIn": 1,
+ "datasource": {
+ "type": "datasource",
+ "uid": "grafana"
+ },
+ "enable": true,
+ "hide": true,
+ "iconColor": "rgba(0, 211, 255, 1)",
+ "name": "Annotations & Alerts",
+ "target": {
+ "limit": 100,
+ "matchAny": false,
+ "tags": [],
+ "type": "dashboard"
+ },
+ "type": "dashboard"
+ }
+ ]
+ },
+ "editable": true,
+ "fiscalYearStartMonth": 0,
+ "graphTooltip": 1,
+ "id": null,
+ "links": [],
+ "liveNow": false,
+ "panels": [
+ {
+ "collapsed": false,
+ "gridPos": {
+ "h": 1,
+ "w": 24,
+ "x": 0,
+ "y": 0
+ },
+ "id": 100,
+ "panels": [],
+ "title": "📊 概览 (Overview)",
+ "type": "row"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "thresholds"
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "dark-red",
+ "value": null
+ },
+ {
+ "color": "red",
+ "value": 0.5
+ },
+ {
+ "color": "orange",
+ "value": 0.7
+ },
+ {
+ "color": "yellow",
+ "value": 0.85
+ },
+ {
+ "color": "green",
+ "value": 0.94
+ }
+ ]
+ },
+ "max": 1,
+ "min": 0,
+ "noValue": "N/A",
+ "unit": "percentunit"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 3,
+ "x": 0,
+ "y": 1
+ },
+ "id": 101,
+ "options": {
+ "orientation": "auto",
+ "reduceOptions": {
+ "calcs": [
+ "lastNotNull"
+ ],
+ "fields": "",
+ "values": false
+ },
+ "showThresholdLabels": false,
+ "showThresholdMarkers": true
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "(sum(rate(http_server_request_duration_seconds_bucket{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\", le=\"0.25\"}[5m])) * 0.5 + sum(rate(http_server_request_duration_seconds_bucket{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\", le=\"1\"}[5m])) * 0.5) / sum(rate(http_server_request_duration_seconds_count{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[5m]))",
+ "legendFormat": "Apdex Score",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "Apdex Score (T=250ms)",
+ "type": "gauge"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 21,
+ "x": 3,
+ "y": 1
+ },
+ "id": 102,
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "thresholds"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "Apdex Score",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 20,
+ "gradientMode": "scheme",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "smooth",
+ "lineWidth": 2,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "area"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "dark-red",
+ "value": null
+ },
+ {
+ "color": "red",
+ "value": 0.5
+ },
+ {
+ "color": "orange",
+ "value": 0.7
+ },
+ {
+ "color": "yellow",
+ "value": 0.85
+ },
+ {
+ "color": "green",
+ "value": 0.94
+ }
+ ]
+ },
+ "unit": "percentunit",
+ "min": 0,
+ "max": 1,
+ "noValue": "N/A"
+ },
+ "overrides": []
+ },
+ "options": {
+ "legend": {
+ "calcs": [
+ "lastNotNull",
+ "mean"
+ ],
+ "displayMode": "list",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "single",
+ "sort": "none"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "(sum(rate(http_server_request_duration_seconds_bucket{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\", le=\"0.25\"}[5m])) * 0.5 + sum(rate(http_server_request_duration_seconds_bucket{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\", le=\"1\"}[5m])) * 0.5) / sum(rate(http_server_request_duration_seconds_count{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[5m]))",
+ "instant": false,
+ "legendFormat": "Apdex Score",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "Apdex Rating Trend",
+ "description": "🟢Excellent(94-100%) | 🟡Good(85-94%) | 🟠Fair(70-85%) | 🔴Poor(50-70%) | ⚫Unacceptable(0-50%)",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "thresholds"
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "reqps"
+ }
+ },
+ "gridPos": {
+ "h": 4,
+ "w": 5,
+ "x": 0,
+ "y": 9
+ },
+ "id": 103,
+ "options": {
+ "colorMode": "value",
+ "graphMode": "area",
+ "justifyMode": "auto",
+ "orientation": "auto",
+ "reduceOptions": {
+ "calcs": [
+ "lastNotNull"
+ ],
+ "fields": "",
+ "values": false
+ },
+ "textMode": "auto"
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "sum(rate(http_server_requests_total{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m]))",
+ "legendFormat": "HTTP QPS",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "HTTP QPS",
+ "type": "stat"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "thresholds"
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ },
+ {
+ "color": "yellow",
+ "value": 0.5
+ },
+ {
+ "color": "red",
+ "value": 1
+ }
+ ]
+ },
+ "unit": "s"
+ }
+ },
+ "gridPos": {
+ "h": 4,
+ "w": 5,
+ "x": 5,
+ "y": 9
+ },
+ "id": 104,
+ "options": {
+ "colorMode": "value",
+ "graphMode": "area",
+ "justifyMode": "auto",
+ "orientation": "auto",
+ "reduceOptions": {
+ "calcs": [
+ "lastNotNull"
+ ],
+ "fields": "",
+ "values": false
+ },
+ "textMode": "auto"
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m])) by (le))",
+ "legendFormat": "HTTP P99",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "HTTP P99 Latency",
+ "type": "stat"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "thresholds"
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "red",
+ "value": null
+ },
+ {
+ "color": "yellow",
+ "value": 95
+ },
+ {
+ "color": "green",
+ "value": 99
+ }
+ ]
+ },
+ "unit": "percent"
+ }
+ },
+ "gridPos": {
+ "h": 4,
+ "w": 5,
+ "x": 10,
+ "y": 9
+ },
+ "id": 105,
+ "options": {
+ "colorMode": "value",
+ "graphMode": "area",
+ "justifyMode": "auto",
+ "orientation": "auto",
+ "reduceOptions": {
+ "calcs": [
+ "lastNotNull"
+ ],
+ "fields": "",
+ "values": false
+ },
+ "textMode": "auto"
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "sum(rate(http_server_requests_total{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\", status!~\"5..\"}[1m])) / sum(rate(http_server_requests_total{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m])) * 100",
+ "legendFormat": "HTTP Success Rate",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "HTTP Success Rate",
+ "type": "stat"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "thresholds"
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ },
+ {
+ "color": "blue",
+ "value": 1000
+ },
+ {
+ "color": "yellow",
+ "value": 5000
+ },
+ {
+ "color": "red",
+ "value": 10000
+ }
+ ]
+ },
+ "unit": "reqps",
+ "decimals": 2
+ }
+ },
+ "gridPos": {
+ "h": 4,
+ "w": 5,
+ "x": 15,
+ "y": 9
+ },
+ "id": 109,
+ "options": {
+ "colorMode": "value",
+ "graphMode": "area",
+ "justifyMode": "auto",
+ "orientation": "auto",
+ "reduceOptions": {
+ "calcs": [
+ "max"
+ ],
+ "fields": "",
+ "values": false
+ },
+ "textMode": "auto"
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "max_over_time(sum(rate(http_server_requests_total{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m]))[1d:])",
+ "legendFormat": "Peak QPS Today",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "Today's Peak QPS",
+ "type": "stat"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "thresholds"
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ },
+ {
+ "color": "yellow",
+ "value": 5000
+ },
+ {
+ "color": "red",
+ "value": 10000
+ }
+ ]
+ },
+ "unit": "short"
+ }
+ },
+ "gridPos": {
+ "h": 4,
+ "w": 12,
+ "x": 0,
+ "y": 13
+ },
+ "id": 106,
+ "options": {
+ "colorMode": "value",
+ "graphMode": "area",
+ "justifyMode": "auto",
+ "orientation": "auto",
+ "reduceOptions": {
+ "calcs": [
+ "lastNotNull"
+ ],
+ "fields": "",
+ "values": false
+ },
+ "textMode": "auto"
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "sum(go_goroutines{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"})",
+ "legendFormat": "Total Goroutines",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "Goroutines (Total)",
+ "type": "stat"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "thresholds"
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ },
+ {
+ "color": "yellow",
+ "value": 500000000
+ },
+ {
+ "color": "red",
+ "value": 1000000000
+ }
+ ]
+ },
+ "unit": "bytes"
+ }
+ },
+ "gridPos": {
+ "h": 4,
+ "w": 12,
+ "x": 12,
+ "y": 13
+ },
+ "id": 107,
+ "options": {
+ "colorMode": "value",
+ "graphMode": "area",
+ "justifyMode": "auto",
+ "orientation": "auto",
+ "reduceOptions": {
+ "calcs": [
+ "lastNotNull"
+ ],
+ "fields": "",
+ "values": false
+ },
+ "textMode": "auto"
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "sum(go_memstats_heap_inuse_bytes{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"})",
+ "legendFormat": "Total Memory InUse",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "Memory InUse (Total)",
+ "type": "stat"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "thresholds"
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "short"
+ }
+ },
+ "gridPos": {
+ "h": 4,
+ "w": 4,
+ "x": 20,
+ "y": 9
+ },
+ "id": 108,
+ "options": {
+ "colorMode": "value",
+ "graphMode": "area",
+ "justifyMode": "auto",
+ "orientation": "auto",
+ "reduceOptions": {
+ "calcs": [
+ "lastNotNull"
+ ],
+ "fields": "",
+ "values": false
+ },
+ "textMode": "auto"
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "sum(http_server_requests_inflight{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"})",
+ "legendFormat": "Inflight",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "HTTP Inflight",
+ "type": "stat"
+ },
+ {
+ "collapsed": false,
+ "gridPos": {
+ "h": 1,
+ "w": 24,
+ "x": 0,
+ "y": 17
+ },
+ "id": 200,
+ "panels": [],
+ "title": "🌐 HTTP Server",
+ "type": "row"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "reqps"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 0,
+ "y": 18
+ },
+ "id": 201,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "sum(rate(http_server_requests_total{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m])) by (method, url)",
+ "legendFormat": "{{method}} {{url}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "HTTP QPS by Endpoint",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "s"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 12,
+ "y": 18
+ },
+ "id": 202,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m])) by (le, method, url))",
+ "legendFormat": "P99 {{method}} {{url}}",
+ "range": true,
+ "refId": "A"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m])) by (le, method, url))",
+ "legendFormat": "P95 {{method}} {{url}}",
+ "range": true,
+ "refId": "B"
+ }
+ ],
+ "title": "HTTP Latency (P99/P95) by Endpoint",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "normal"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "reqps"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 0,
+ "y": 26
+ },
+ "id": 203,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "sum(rate(http_server_requests_total{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m])) by (status)",
+ "legendFormat": "Status {{status}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "HTTP Status Codes",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "Bps"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 12,
+ "y": 26
+ },
+ "id": 204,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "none"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "sum(rate(http_server_request_size_bytes_sum{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m]))",
+ "legendFormat": "Request",
+ "range": true,
+ "refId": "A"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "sum(rate(http_server_response_size_bytes_sum{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m]))",
+ "legendFormat": "Response",
+ "range": true,
+ "refId": "B"
+ }
+ ],
+ "title": "HTTP Network Traffic",
+ "type": "timeseries"
+ },
+ {
+ "collapsed": false,
+ "gridPos": {
+ "h": 1,
+ "w": 24,
+ "x": 0,
+ "y": 34
+ },
+ "id": 300,
+ "panels": [],
+ "title": "📤 HTTP Client",
+ "type": "row"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "reqps"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 0,
+ "y": 35
+ },
+ "id": 301,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "sum(rate(http_client_requests_total{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m])) by (baseUrl, url)",
+ "legendFormat": "{{baseUrl}}{{url}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "HTTP Client QPS",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "s"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 12,
+ "y": 35
+ },
+ "id": 302,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "histogram_quantile(0.99, sum(rate(http_client_request_duration_seconds_bucket{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m])) by (le, baseUrl, url))",
+ "legendFormat": "P99 {{baseUrl}}{{url}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "HTTP Client Latency (P99)",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "normal"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "reqps"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 0,
+ "y": 43
+ },
+ "id": 303,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "sum(rate(http_client_requests_total{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\",error!=\"\"}[1m])) by (baseUrl, url)",
+ "legendFormat": "Error {{baseUrl}}{{url}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "HTTP Client Errors",
+ "type": "timeseries"
+ },
+ {
+ "collapsed": false,
+ "gridPos": {
+ "h": 1,
+ "w": 24,
+ "x": 0,
+ "y": 51
+ },
+ "id": 400,
+ "panels": [],
+ "title": "🔌 gRPC Server",
+ "type": "row"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "reqps"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 0,
+ "y": 52
+ },
+ "id": 401,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "sum(rate(grpc_server_requests_total{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m])) by (method, type)",
+ "legendFormat": "{{type}} {{method}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "gRPC Server QPS",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "s"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 12,
+ "y": 52
+ },
+ "id": 402,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "histogram_quantile(0.99, sum(rate(grpc_server_request_duration_seconds_bucket{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m])) by (le, method))",
+ "legendFormat": "P99 {{method}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "gRPC Server Latency (P99)",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "normal"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "reqps"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 0,
+ "y": 60
+ },
+ "id": 403,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "sum(rate(grpc_server_requests_total{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\",code!=\"OK\"}[1m])) by (method, code)",
+ "legendFormat": "{{code}} {{method}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "gRPC Server Errors",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "cpm"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 12,
+ "y": 60
+ },
+ "id": 404,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "increase(grpc_server_panics_total{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m])",
+ "legendFormat": "Panic {{method}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "gRPC Server Panics",
+ "type": "timeseries"
+ },
+ {
+ "collapsed": false,
+ "gridPos": {
+ "h": 1,
+ "w": 24,
+ "x": 0,
+ "y": 68
+ },
+ "id": 600,
+ "panels": [],
+ "title": "🐹 Go Runtime",
+ "type": "row"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "short"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 8,
+ "x": 0,
+ "y": 69
+ },
+ "id": 601,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "single",
+ "sort": "none"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "go_goroutines{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}",
+ "legendFormat": "{{instance}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "Goroutines",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "bytes"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 8,
+ "x": 8,
+ "y": 69
+ },
+ "id": 602,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "none"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "go_memstats_heap_alloc_bytes{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}",
+ "legendFormat": "Alloc {{instance}}",
+ "range": true,
+ "refId": "A"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "go_memstats_heap_inuse_bytes{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}",
+ "legendFormat": "InUse {{instance}}",
+ "range": true,
+ "refId": "B"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "go_memstats_heap_sys_bytes{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}",
+ "legendFormat": "Sys {{instance}}",
+ "range": true,
+ "refId": "C"
+ }
+ ],
+ "title": "Heap Memory",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "s"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 8,
+ "x": 16,
+ "y": 69
+ },
+ "id": 603,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "none"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "rate(go_gc_duration_seconds_sum{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m]) / rate(go_gc_duration_seconds_count{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m])",
+ "legendFormat": "Avg {{instance}}",
+ "range": true,
+ "refId": "A"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "go_gc_duration_seconds{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\",quantile=\"1\"}",
+ "legendFormat": "Max {{instance}}",
+ "range": true,
+ "refId": "B"
+ }
+ ],
+ "title": "GC Duration",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "ops"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 8,
+ "x": 0,
+ "y": 77
+ },
+ "id": 604,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "none"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "rate(go_gc_duration_seconds_count{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m])",
+ "legendFormat": "{{instance}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "GC Rate",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "max": 1,
+ "min": 0,
+ "unit": "percentunit"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 8,
+ "x": 8,
+ "y": 77
+ },
+ "id": 605,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "single",
+ "sort": "none"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "go_memstats_gc_cpu_fraction{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}",
+ "legendFormat": "{{instance}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "GC CPU Fraction",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "Bps"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 8,
+ "x": 16,
+ "y": 77
+ },
+ "id": 606,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "none"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "rate(go_memstats_alloc_bytes_total{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m])",
+ "legendFormat": "{{instance}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "Memory Allocation Rate",
+ "type": "timeseries"
+ },
+ {
+ "collapsed": false,
+ "gridPos": {
+ "h": 1,
+ "w": 24,
+ "x": 0,
+ "y": 85
+ },
+ "id": 700,
+ "panels": [],
+ "title": "🗄️ Database (DB)",
+ "type": "row"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "short"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 0,
+ "y": 86
+ },
+ "id": 701,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "none"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "db_client_connections_open{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}",
+ "legendFormat": "Open {{database}}",
+ "range": true,
+ "refId": "A"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "db_client_connections_in_use{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}",
+ "legendFormat": "InUse {{database}}",
+ "range": true,
+ "refId": "B"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "db_client_connections_idle{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}",
+ "legendFormat": "Idle {{database}}",
+ "range": true,
+ "refId": "C"
+ }
+ ],
+ "title": "DB Connection Pool",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "s"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 12,
+ "y": 86
+ },
+ "id": 702,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "histogram_quantile(0.99, sum(rate(db_client_request_duration_seconds_bucket{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m])) by (le, type, database))",
+ "legendFormat": "P99 {{type}} {{database}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "DB Query Latency (P99)",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "normal"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "reqps"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 0,
+ "y": 94
+ },
+ "id": 703,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "sum(rate(db_client_request_duration_seconds_count{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m])) by (type, database)",
+ "legendFormat": "{{type}} {{database}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "DB Query QPS",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "normal"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "reqps"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 12,
+ "y": 94
+ },
+ "id": 704,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "sum(rate(db_client_request_duration_seconds_count{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\",result=\"error\"}[1m])) by (type, database)",
+ "legendFormat": "Error {{type}} {{database}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "DB Query Errors",
+ "type": "timeseries"
+ },
+ {
+ "collapsed": false,
+ "gridPos": {
+ "h": 1,
+ "w": 24,
+ "x": 0,
+ "y": 102
+ },
+ "id": 800,
+ "panels": [],
+ "title": "🔴 Redis",
+ "type": "row"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "reqps"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 0,
+ "y": 103
+ },
+ "id": 801,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "sum(rate(redis_client_requests_total{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m])) by (cmd)",
+ "legendFormat": "{{cmd}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "Redis Command QPS",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "s"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 12,
+ "y": 103
+ },
+ "id": 802,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "histogram_quantile(0.99, sum(rate(redis_client_request_duration_seconds_bucket{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m])) by (le, cmd))",
+ "legendFormat": "P99 {{cmd}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "Redis Command Latency (P99)",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "normal"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "reqps"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 0,
+ "y": 111
+ },
+ "id": 803,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "sum(rate(redis_client_requests_total{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\",result!=\"success\"}[1m])) by (cmd)",
+ "legendFormat": "Error {{cmd}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "Redis Command Errors",
+ "type": "timeseries"
+ },
+ {
+ "collapsed": false,
+ "gridPos": {
+ "h": 1,
+ "w": 24,
+ "x": 0,
+ "y": 119
+ },
+ "id": 900,
+ "panels": [],
+ "title": "🍃 MongoDB",
+ "type": "row"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "reqps"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 0,
+ "y": 120
+ },
+ "id": 901,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "sum(rate(mongo_client_requests_total{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m])) by (command)",
+ "legendFormat": "{{command}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "MongoDB Command QPS",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "s"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 12,
+ "y": 120
+ },
+ "id": 902,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "histogram_quantile(0.99, sum(rate(mongo_client_request_duration_seconds_bucket{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\"}[1m])) by (le, command))",
+ "legendFormat": "P99 {{command}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "MongoDB Command Latency (P99)",
+ "type": "timeseries"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "normal"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "reqps"
+ }
+ },
+ "gridPos": {
+ "h": 8,
+ "w": 12,
+ "x": 0,
+ "y": 128
+ },
+ "id": 903,
+ "options": {
+ "legend": {
+ "calcs": [
+ "mean",
+ "max"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "editorMode": "code",
+ "expr": "sum(rate(mongo_client_requests_total{namespace=~\"$namespace\",job=~\"$service\",instance=~\"$instance\",result=\"error\"}[1m])) by (command)",
+ "legendFormat": "Error {{command}}",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "MongoDB Command Errors",
+ "type": "timeseries"
+ }
+ ],
+ "schemaVersion": 38,
+ "style": "dark",
+ "tags": [
+ "box",
+ "server",
+ "sre"
+ ],
+ "templating": {
+ "list": [
+ {
+ "current": {
+ "selected": false,
+ "text": "Prometheus",
+ "value": "Prometheus"
+ },
+ "hide": 0,
+ "includeAll": false,
+ "label": "Datasource",
+ "multi": false,
+ "name": "datasource",
+ "options": [],
+ "query": "prometheus",
+ "refresh": 1,
+ "regex": "",
+ "skipUrlSync": false,
+ "type": "datasource"
+ },
+ {
+ "allValue": ".*",
+ "current": {
+ "selected": true,
+ "text": [
+ "All"
+ ],
+ "value": [
+ "$__all"
+ ]
+ },
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "definition": "box_info",
+ "hide": 0,
+ "includeAll": true,
+ "multi": true,
+ "name": "namespace",
+ "options": [],
+ "query": {
+ "query": "box_info",
+ "refId": "StandardVariableQuery"
+ },
+ "refresh": 1,
+ "regex": "/.*namespace=\"([^\"]*).*/",
+ "skipUrlSync": false,
+ "sort": 0,
+ "type": "query"
+ },
+ {
+ "allValue": ".*",
+ "current": {
+ "selected": true,
+ "text": [
+ "All"
+ ],
+ "value": [
+ "$__all"
+ ]
+ },
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "definition": "box_info",
+ "hide": 0,
+ "includeAll": true,
+ "multi": true,
+ "name": "service",
+ "options": [],
+ "query": {
+ "query": "box_info",
+ "refId": "StandardVariableQuery"
+ },
+ "refresh": 1,
+ "regex": "/.*job=\"([^\"]*).*/",
+ "skipUrlSync": false,
+ "sort": 0,
+ "type": "query"
+ },
+ {
+ "allValue": ".*",
+ "current": {
+ "selected": true,
+ "text": [
+ "All"
+ ],
+ "value": [
+ "$__all"
+ ]
+ },
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${datasource}"
+ },
+ "definition": "box_info",
+ "hide": 0,
+ "includeAll": true,
+ "multi": true,
+ "name": "instance",
+ "options": [],
+ "query": {
+ "query": "box_info",
+ "refId": "StandardVariableQuery"
+ },
+ "refresh": 1,
+ "regex": "/.*instance=\"([^\"]*).*/",
+ "skipUrlSync": false,
+ "sort": 0,
+ "type": "query"
+ }
+ ]
+ },
+ "time": {
+ "from": "now-6h",
+ "to": "now"
+ },
+ "timepicker": {},
+ "timezone": "",
+ "title": "Box Server Monitoring Dashboard",
+ "uid": "box-server-dashboard",
+ "version": 1,
+ "weekStart": ""
+}
\ No newline at end of file
diff --git a/docs/metric.md b/docs/metric.md
new file mode 100644
index 0000000..11790a1
--- /dev/null
+++ b/docs/metric.md
@@ -0,0 +1,700 @@
+- [Metrics Documentation](#metrics-documentation)
+ - [1. 核心指标定义 (Definitions)](#1-核心指标定义-definitions)
+ - [1.1 HTTP Server (Gin)](#11-http-server-gin)
+ - [1.2 HTTP Client (Wukong)](#12-http-client-wukong)
+ - [1.3 gRPC Server](#13-grpc-server)
+ - [1.4 Redis Client](#14-redis-client)
+ - [1.5 Database Client (GORM)](#15-database-client-gorm)
+ - [1.6 MongoDB Client](#16-mongodb-client)
+ - [1.7 Schedule (定时任务)](#17-schedule-定时任务)
+ - [1.8 Go Runtime (运行时)](#18-go-runtime-运行时)
+ - [基础指标](#基础指标)
+ - [内存分配统计](#内存分配统计)
+ - [堆内存统计](#堆内存统计)
+ - [栈内存统计](#栈内存统计)
+ - [MSpan / MCache 统计](#mspan--mcache-统计)
+ - [GC 统计](#gc-统计)
+ - [其他系统内存](#其他系统内存)
+ - [2. 推荐看板指标 (Grafana PromQL)](#2-推荐看板指标-grafana-promql)
+ - [2.1 📊 概览 (Overview)](#21--概览-overview)
+ - [Apdex Score](#apdex-score)
+ - [关键指标卡片](#关键指标卡片)
+ - [2.2 🌐 HTTP Server](#22--http-server)
+ - [2.3 📤 HTTP Client](#23--http-client)
+ - [2.4 🔌 gRPC Server](#24--grpc-server)
+ - [2.5 🐹 Go Runtime](#25--go-runtime)
+ - [2.6 🔴 Redis](#26--redis)
+ - [2.7 🗄️ Database (DB)](#27-️-database-db)
+ - [2.8 🍃 MongoDB](#28--mongodb)
+ - [3. 告警规则 (Alerting Rules)](#3-告警规则-alerting-rules)
+ - [4. Go Runtime 指标解读](#4-go-runtime-指标解读)
+ - [4.1 内存指标关系](#41-内存指标关系)
+ - [4.2 关键指标说明](#42-关键指标说明)
+ - [Goroutine 监控](#goroutine-监控)
+ - [内存监控](#内存监控)
+ - [GC 监控](#gc-监控)
+ - [5. 错误分类说明 (Error Classification)](#5-错误分类说明-error-classification)
+ - [5.1 错误分类原则](#51-错误分类原则)
+ - [5.2 HTTP Client 错误分类](#52-http-client-错误分类)
+ - [5.3 Redis Client 错误分类](#53-redis-client-错误分类)
+ - [5.4 Database Client 错误分类](#54-database-client-错误分类)
+ - [5.5 错误分类性能影响](#55-错误分类性能影响)
+ - [6. 常见问题诊断 (Troubleshooting)](#6-常见问题诊断-troubleshooting)
+ - [6.1 Go Runtime 问题](#61-go-runtime-问题)
+ - [问题 1: Goroutine 泄漏](#问题-1-goroutine-泄漏)
+ - [问题 2: 内存泄漏](#问题-2-内存泄漏)
+ - [问题 3: GC 压力过大](#问题-3-gc-压力过大)
+ - [问题 4: 线程数异常增长](#问题-4-线程数异常增长)
+ - [6.2 中间件与服务问题](#62-中间件与服务问题)
+ - [问题 5: 数据库连接池耗尽](#问题-5-数据库连接池耗尽)
+ - [问题 6: Redis 延迟抖动](#问题-6-redis-延迟抖动)
+ - [问题 7: Context Cancelled / Timeout](#问题-7-context-cancelled--timeout)
+ - [问题 8: 定时任务堆积](#问题-8-定时任务堆积)
+
+# Metrics Documentation
+
+本文档记录了 `box` 框架中各组件暴露的 Prometheus 监控指标、推荐的 Grafana 看板配置以及告警规则。
+
+## 1. 核心指标定义 (Definitions)
+
+### 1.1 HTTP Server (Gin)
+
+| 指标名称 | 类型 | Labels | 说明 |
+| :------------------------------------- | :-------- | :----------------------------------- | :------------------------------------------ |
+| `http_server_requests_inflight` | Gauge | `method`, `url` | 当前正在处理的 HTTP 请求数 (饱和度) |
+| `http_server_requests_total` | Counter | `method`, `url`, `status`, `errcode` | 处理的 HTTP 请求总数 (流量 & 错误) |
+| `http_server_request_duration_seconds` | Histogram | `method`, `url`, `status`, `errcode` | HTTP 请求耗时分布 (延迟),桶:.005s - 10s |
+| `http_server_request_size_bytes` | Histogram | `method`, `url` | HTTP 请求体大小分布 (流量),桶:1KB - 100MB |
+| `http_server_response_size_bytes` | Histogram | `method`, `url`, `status`, `errcode` | HTTP 响应体大小分布 (流量),桶:1KB - 100MB |
+
+### 1.2 HTTP Client (Wukong)
+
+| 指标名称 | 类型 | Labels | 说明 |
+| :------------------------------------- | :-------- | :-------------------------------------------- | :----------------------------- |
+| `http_client_requests_inflight` | Gauge | `method`, `baseUrl`, `url` | 当前正在进行的下游 HTTP 请求数 |
+| `http_client_requests_total` | Counter | `method`, `baseUrl`, `url`, `status`, `error` | 发起的 HTTP 请求总数 |
+| `http_client_request_duration_seconds` | Histogram | `method`, `baseUrl`, `url`, `status`, `error` | HTTP 请求耗时分布 |
+
+**错误分类 (`error` 标签值)**:
+
+- `` - 成功(无错误)
+- `timeout_error` - 超时错误(context 超时、I/O 超时等)
+- `connection_error` - 连接错误(连接被拒绝、连接丢失等)
+- `dns_error` - DNS 解析错误
+- `tls_error` - TLS/SSL 错误(证书错误、握手失败等)
+- `other_error` - 其他未分类错误
+
+**注意**: HTTP 状态码通过 `status` 标签单独上报,`error` 标签仅用于底层网络/协议错误。
+
+### 1.3 gRPC Server
+
+| 指标名称 | 类型 | Labels | 说明 |
+| :------------------------------------- | :-------- | :----------------------- | :------------------------- |
+| `grpc_server_requests_inflight` | Gauge | `method`, `type` | 当前正在处理的 gRPC 请求数 |
+| `grpc_server_requests_total` | Counter | `method`, `type`, `code` | 处理的 gRPC 请求总数 |
+| `grpc_server_request_duration_seconds` | Histogram | `method`, `type`, `code` | gRPC 请求耗时分布 |
+| `grpc_server_panics_total` | Counter | `method` | gRPC 服务 Panic 总次数 |
+
+### 1.4 Redis Client
+
+| 指标名称 | 类型 | Labels | 说明 |
+| :-------------------------------------- | :-------- | :----------------------------------------------------- | :--------------------- |
+| `redis_client_requests_total` | Counter | `address`, `db`, `masterName`, `pipe`, `cmd`, `result` | Redis 命令执行总数 |
+| `redis_client_request_duration_seconds` | Histogram | `address`, `db`, `masterName`, `pipe`, `cmd`, `result` | Redis 命令执行耗时分布 |
+
+**错误分类 (`result` 标签值)**:
+
+- `success` - 成功(包括 `redis.Nil`,键不存在是正常情况)
+- `timeout_error` - 超时错误(context 超时、I/O 超时等)
+- `connection_error` - 连接错误(连接被拒绝、连接丢失等)
+- `command_error` - Redis 命令错误(WRONGTYPE、未知命令、参数错误等)
+- `transaction_error` - 事务错误(事务失败、WATCH 失败等)
+- `auth_error` - 权限/认证错误(NOAUTH、认证失败等)
+- `oom_error` - 内存不足错误(OOM、内存限制等)
+- `cluster_error` - 集群相关错误(MOVED、ASK、CLUSTERDOWN 等)
+- `other_error` - 其他未分类错误
+
+### 1.5 Database Client (GORM)
+
+| 指标名称 | 类型 | Labels | 说明 |
+| :----------------------------------- | :-------- | :------------------------------------- | :------------------------- |
+| `db_client_connections_idle` | Gauge | `driver`, `database` | 连接池空闲连接数 |
+| `db_client_connections_in_use` | Gauge | `driver`, `database` | 连接池正在使用的连接数 |
+| `db_client_connections_open` | Gauge | `driver`, `database` | 连接池当前打开的总连接数 |
+| `db_client_connections_max_open` | Gauge | `driver`, `database` | 连接池最大允许打开的连接数 |
+| `db_client_connections_wait_total` | Gauge | `driver`, `database` | 等待连接的总次数 |
+| `db_client_connections_wait_seconds` | Gauge | `driver`, `database` | 等待连接的总耗时 |
+| `db_client_request_duration_seconds` | Histogram | `driver`, `database`, `type`, `result` | SQL 执行耗时分布 |
+
+**错误分类 (`result` 标签值)**:
+
+- `success` - 成功(包括 `gorm.ErrRecordNotFound`,记录不存在是正常情况)
+- `timeout_error` - 超时错误(context 超时、查询超时等)
+- `connection_error` - 连接错误(连接被拒绝、连接丢失、连接池耗尽等)
+- `constraint_error` - 约束错误(唯一键冲突、外键约束、非空约束等)
+- `syntax_error` - SQL 语法错误(语法错误、未知列/表等)
+- `transaction_error` - 事务相关错误(死锁、锁等待超时等)
+- `other_error` - 其他未分类错误
+
+### 1.6 MongoDB Client
+
+| 指标名称 | 类型 | Labels | 说明 |
+| :-------------------------------------- | :-------- | :------------------ | :---------------------------- |
+| `mongo_client_requests_total` | Counter | `command`, `result` | MongoDB 命令执行总数 |
+| `mongo_client_request_duration_seconds` | Histogram | `command`, `result` | MongoDB 命令耗时分布 |
+| `mongo_client_sessions_inflight` | Gauge | - | 当前正在进行的 MongoDB 会话数 |
+
+### 1.7 Schedule (定时任务)
+
+| 指标名称 | 类型 | Labels | 说明 |
+| :------------------------------ | :-------- | :--------------- | :------------------- |
+| `schedule_jobs_total` | Counter | `task`, `result` | 定时任务执行总数 |
+| `schedule_job_duration_seconds` | Histogram | `task`, `result` | 定时任务执行耗时分布 |
+
+### 1.8 Go Runtime (运行时)
+
+#### 基础指标
+
+| 指标名称 | 类型 | Labels | 说明 |
+| :-------------- | :---- | :-------- | :------------------ |
+| `go_info` | Gauge | `version` | Go 版本信息 |
+| `go_goroutines` | Gauge | - | 当前 Goroutine 数量 |
+| `go_threads` | Gauge | - | 当前 OS 线程数量 |
+
+#### 内存分配统计
+
+| 指标名称 | 类型 | Labels | 说明 |
+| :------------------------------ | :------ | :----- | :--------------------------------------- |
+| `go_memstats_alloc_bytes` | Gauge | - | 已分配且仍在使用的堆内存字节数 |
+| `go_memstats_alloc_bytes_total` | Counter | - | 累计分配的堆内存总字节数(包括已释放的) |
+| `go_memstats_sys_bytes` | Gauge | - | 从操作系统获取的内存总字节数 |
+| `go_memstats_lookups_total` | Counter | - | 指针查找总次数(通常为 0) |
+| `go_memstats_mallocs_total` | Counter | - | 累计内存分配次数 |
+| `go_memstats_frees_total` | Counter | - | 累计内存释放次数 |
+
+#### 堆内存统计
+
+| 指标名称 | 类型 | Labels | 说明 |
+| :-------------------------------- | :---- | :----- | :----------------------------------- |
+| `go_memstats_heap_alloc_bytes` | Gauge | - | 堆内存已分配字节数(已分配且在使用) |
+| `go_memstats_heap_sys_bytes` | Gauge | - | 从系统获取的堆内存字节数 |
+| `go_memstats_heap_idle_bytes` | Gauge | - | 堆内存空闲字节数(等待被使用) |
+| `go_memstats_heap_inuse_bytes` | Gauge | - | 堆内存正在使用的字节数 |
+| `go_memstats_heap_released_bytes` | Gauge | - | 已释放回操作系统的堆内存字节数 |
+| `go_memstats_heap_objects` | Gauge | - | 堆中已分配的对象数量 |
+| `go_memstats_next_gc_bytes` | Gauge | - | 下次 GC 触发时的堆内存目标字节数 |
+
+#### 栈内存统计
+
+| 指标名称 | 类型 | Labels | 说明 |
+| :------------------------------ | :---- | :----- | :----------------------- |
+| `go_memstats_stack_inuse_bytes` | Gauge | - | 栈分配器正在使用的字节数 |
+| `go_memstats_stack_sys_bytes` | Gauge | - | 从系统获取的栈内存字节数 |
+
+#### MSpan / MCache 统计
+
+| 指标名称 | 类型 | Labels | 说明 |
+| :------------------------------- | :---- | :----- | :----------------------------- |
+| `go_memstats_mspan_inuse_bytes` | Gauge | - | MSpan 结构体正在使用的字节数 |
+| `go_memstats_mspan_sys_bytes` | Gauge | - | 从系统获取的 MSpan 内存字节数 |
+| `go_memstats_mcache_inuse_bytes` | Gauge | - | MCache 结构体正在使用的字节数 |
+| `go_memstats_mcache_sys_bytes` | Gauge | - | 从系统获取的 MCache 内存字节数 |
+
+#### GC 统计
+
+| 指标名称 | 类型 | Labels | 说明 |
+| :--------------------------------- | :------ | :--------- | :------------------------------------------------- |
+| `go_gc_duration_seconds` | Summary | `quantile` | GC 暂停耗时分布(quantile: 0, 0.25, 0.5, 0.75, 1) |
+| `go_memstats_gc_sys_bytes` | Gauge | - | GC 元数据使用的内存字节数 |
+| `go_memstats_gc_cpu_fraction` | Gauge | - | 程序启动以来 GC 使用的 CPU 时间占比 |
+| `go_memstats_last_gc_time_seconds` | Gauge | - | 上次 GC 的 Unix 时间戳(秒) |
+
+#### 其他系统内存
+
+| 指标名称 | 类型 | Labels | 说明 |
+| :-------------------------------- | :---- | :----- | :----------------------------- |
+| `go_memstats_buck_hash_sys_bytes` | Gauge | - | 性能分析哈希表使用的内存字节数 |
+| `go_memstats_other_sys_bytes` | Gauge | - | 其他系统分配的内存字节数 |
+
+---
+
+## 2. 推荐看板指标 (Grafana PromQL)
+
+以下 PromQL 假设你有一个 Dashboard 变量 `$namespace`、`$service` 和 `$instance`。
+
+看板结构分为以下板块:
+
+- **概览** - Apdex、关键指标概览
+- **HTTP Server** - HTTP 服务器详细指标
+- **HTTP Client** - HTTP 客户端详细指标
+- **gRPC Server** - gRPC 服务器详细指标
+- **Go Runtime** - Go 运行时详细指标
+- **Database (DB)** - 数据库详细指标
+- **Redis** - Redis 详细指标
+- **MongoDB** - MongoDB 详细指标
+
+### 2.1 📊 概览 (Overview)
+
+#### Apdex Score
+
+| 面板名称 | 说明 | PromQL |
+| :-------------- | :------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| **Apdex Score** | 用户满意度 (T=250ms) | `(sum(rate(http_server_request_duration_seconds_bucket{namespace=~"$namespace",job=~"$service",instance=~"$instance", le="0.25"}[5m])) * 0.5 + sum(rate(http_server_request_duration_seconds_bucket{namespace=~"$namespace",job=~"$service",instance=~"$instance", le="1"}[5m])) * 0.5) / sum(rate(http_server_request_duration_seconds_count{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[5m]))` |
+
+**Apdex 计算说明**:
+
+- **Satisfied(满意)**: 响应时间 ≤ T (250ms)
+- **Tolerating(可容忍)**: T < 响应时间 ≤ 4T (250ms < t ≤ 1s)
+- **Frustrated(失望)**: 响应时间 > 4T (> 1s)
+- **公式**: `Apdex = (Satisfied + Tolerating/2) / Total`
+- **取值范围**: 0 到 1,越接近 1 表示用户体验越好
+- **无请求时**: 当 Total = 0 时(无流量),Apdex 分数显示为 **N/A**(不可用),因为没有用户访问就无法评估用户体验
+
+**评价标准与阈值区域**(Grafana 看板会自动显示评级与颜色):
+
+| Apdex 分数 | 评级 | 颜色 | 阈值 | 用户体验 | 建议措施 |
+| :--------- | :----------------------------- | :----- | :--- | :------------- | :----------------------------- |
+| 0.94-1.00 | **Excellent** (优秀) 🟢 | 绿色 | 0.94 | 极佳,用户满意 | 保持现状,持续监控 |
+| 0.85-0.94 | **Good** (良好) 🟡 | 黄色 | 0.85 | 良好,可接受 | 关注趋势,优化慢请求 |
+| 0.70-0.85 | **Fair** (一般) 🟠 | 橙色 | 0.70 | 一般,需改进 | 排查性能瓶颈,优化关键路径 |
+| 0.50-0.70 | **Poor** (较差) 🔴 | 红色 | 0.50 | 较差,影响体验 | 立即介入,分析慢查询和依赖服务 |
+| 0.00-0.50 | **Unacceptable** (不可接受) ⚫ | 深红色 | 0.00 | 不可接受,严重 | 紧急处理,可能需要扩容或限流 |
+
+**Grafana 阈值配置**:
+
+```json
+{
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ { "color": "dark-red", "value": null },
+ { "color": "red", "value": 0.5 },
+ { "color": "orange", "value": 0.7 },
+ { "color": "yellow", "value": 0.85 },
+ { "color": "green", "value": 0.94 }
+ ]
+ }
+}
+```
+
+**PromQL 实现**:
+
+基于 Prometheus Histogram 的累积桶特性,Apdex 公式实现为:
+
+```promql
+(
+ sum(rate(http_server_request_duration_seconds_bucket{le="0.25"}[5m])) * 0.5 +
+ sum(rate(http_server_request_duration_seconds_bucket{le="1"}[5m])) * 0.5
+) / sum(rate(http_server_request_duration_seconds_count[5m]))
+```
+
+**注意事项**:
+
+- `le="0.25"` 桶包含 ≤250ms 的所有请求(Satisfied)
+- `le="1"` 桶包含 ≤1s 的所有请求(Satisfied + Tolerating)
+- 由于桶的累积特性,需要用 `le="1"` 的值减去 `le="0.25"` 来计算 Tolerating 部分
+- 公式简化为:`(Satisfied * 0.5 + (Satisfied + Tolerating) * 0.5) / Total`
+- 结果等价于标准 Apdex 公式:`(Satisfied + Tolerating/2) / Total`
+
+#### 关键指标卡片
+
+| 面板名称 | 说明 | PromQL |
+| :-------------------- | :--------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| **HTTP QPS** | 当前请求速率 | `sum(rate(http_server_requests_total{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m]))` |
+| **HTTP P99 Latency** | P99 延迟 | `histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m])) by (le))` |
+| **HTTP Success Rate** | 成功率 | `sum(rate(http_server_requests_total{namespace=~"$namespace",job=~"$service",instance=~"$instance", status!~"5.."}[1m])) / sum(rate(http_server_requests_total{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m])) * 100` |
+| **Today's Peak QPS** | 今日 QPS 峰值 | `max_over_time(sum(rate(http_server_requests_total{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m]))[1d:])` |
+| **Goroutines** | 协程总数(多实例聚合) | `sum(go_goroutines{namespace=~"$namespace",job=~"$service",instance=~"$instance"})` |
+| **Memory InUse** | 内存总量(多实例聚合) | `sum(go_memstats_heap_inuse_bytes{namespace=~"$namespace",job=~"$service",instance=~"$instance"})` |
+| **HTTP Inflight** | 并发请求数 | `sum(http_server_requests_inflight{namespace=~"$namespace",job=~"$service",instance=~"$instance"})` |
+
+**注意事项**:
+
+- 概览区域的 **Goroutines** 和 **Memory InUse** 面板使用 `sum()` 聚合显示所有实例的总和,适合快速了解整体资源使用情况
+- 如需查看单个实例的详细情况,请访问 **Go Runtime** 板块,其中的时序图按 `instance` 分组显示每个实例的详细趋势
+
+### 2.2 🌐 HTTP Server
+
+| 面板名称 | 说明 | PromQL |
+| :------------------------------------- | :----------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| **HTTP QPS by Endpoint** | 按端点的 QPS | `sum(rate(http_server_requests_total{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m])) by (method, url)` |
+| **HTTP Latency (P99/P95) by Endpoint** | 按端点的延迟 | P99: `histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m])) by (le, method, url))`
P95: `histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m])) by (le, method, url))` |
+| **HTTP Status Codes** | 状态码分布 | `sum(rate(http_server_requests_total{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m])) by (status)` |
+| **HTTP Network Traffic** | 网络流量 | Request: `sum(rate(http_server_request_size_bytes_sum{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m]))`
Response: `sum(rate(http_server_response_size_bytes_sum{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m]))` |
+
+### 2.3 📤 HTTP Client
+
+| 面板名称 | 说明 | PromQL |
+| :----------------------------- | :------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| **HTTP Client QPS** | 客户端请求 QPS | `sum(rate(http_client_requests_total{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m])) by (baseUrl, url)` |
+| **HTTP Client Latency (P99)** | 客户端延迟 | `histogram_quantile(0.99, sum(rate(http_client_request_duration_seconds_bucket{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m])) by (le, baseUrl, url))` |
+| **HTTP Client Errors** | 客户端错误 | `sum(rate(http_client_requests_total{namespace=~"$namespace",job=~"$service",instance=~"$instance",error!=""}[1m])) by (baseUrl, url, error)` |
+| **HTTP Client Errors by Type** | 按错误类型分类 | `sum(rate(http_client_requests_total{namespace=~"$namespace",job=~"$service",instance=~"$instance",error!=""}[1m])) by (error)` |
+
+### 2.4 🔌 gRPC Server
+
+| 面板名称 | 说明 | PromQL |
+| :---------------------------- | :--------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| **gRPC Server QPS** | gRPC 调用量 | `sum(rate(grpc_server_requests_total{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m])) by (method, type)` |
+| **gRPC Server Latency (P99)** | gRPC 接口延迟 | `histogram_quantile(0.99, sum(rate(grpc_server_request_duration_seconds_bucket{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m])) by (le, method))` |
+| **gRPC Server Errors** | gRPC 错误数 | `sum(rate(grpc_server_requests_total{namespace=~"$namespace",job=~"$service",instance=~"$instance", code!="OK"}[1m])) by (method, code)` |
+| **gRPC Server Panics** | Panic 发生的次数 | `increase(grpc_server_panics_total{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m])` |
+
+### 2.5 🐹 Go Runtime
+
+| 面板名称 | 说明 | PromQL |
+| :------------------------- | :-------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| **Goroutines** | Goroutine 数量 | `go_goroutines{namespace=~"$namespace",job=~"$service",instance=~"$instance"}` |
+| **Heap Memory** | 堆内存使用情况 | Alloc: `go_memstats_heap_alloc_bytes{namespace=~"$namespace",job=~"$service",instance=~"$instance"}`
InUse: `go_memstats_heap_inuse_bytes{namespace=~"$namespace",job=~"$service",instance=~"$instance"}`
Sys: `go_memstats_heap_sys_bytes{namespace=~"$namespace",job=~"$service",instance=~"$instance"}` |
+| **GC Duration** | GC 耗时 | Avg: `rate(go_gc_duration_seconds_sum{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m]) / rate(go_gc_duration_seconds_count{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m])`
Max: `go_gc_duration_seconds{namespace=~"$namespace",job=~"$service",instance=~"$instance",quantile="1"}` |
+| **GC Rate** | GC 执行频率 | `rate(go_gc_duration_seconds_count{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m])` |
+| **GC CPU Fraction** | GC CPU 占用比例 | `go_memstats_gc_cpu_fraction{namespace=~"$namespace",job=~"$service",instance=~"$instance"}` |
+| **Memory Allocation Rate** | 内存分配速率 | `rate(go_memstats_alloc_bytes_total{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m])` |
+
+### 2.6 🔴 Redis
+
+| 面板名称 | 说明 | PromQL |
+| :------------------------------ | :------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| **Redis Command QPS** | 命令 QPS | `sum(rate(redis_client_requests_total{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m])) by (cmd)` |
+| **Redis Command Latency (P99)** | 命令延迟 | `histogram_quantile(0.99, sum(rate(redis_client_request_duration_seconds_bucket{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m])) by (le, cmd))` |
+| **Redis Command Errors** | 命令错误 | `sum(rate(redis_client_requests_total{namespace=~"$namespace",job=~"$service",instance=~"$instance",result!="success"}[1m])) by (cmd)` |
+
+### 2.7 🗄️ Database (DB)
+
+| 面板名称 | 说明 | PromQL |
+| :------------------------- | :------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| **DB Connection Pool** | 连接池状态 | Open: `db_client_connections_open{namespace=~"$namespace",job=~"$service",instance=~"$instance"}`
InUse: `db_client_connections_in_use{namespace=~"$namespace",job=~"$service",instance=~"$instance"}`
Idle: `db_client_connections_idle{namespace=~"$namespace",job=~"$service",instance=~"$instance"}` |
+| **DB Query Latency (P99)** | 查询延迟 | `histogram_quantile(0.99, sum(rate(db_client_request_duration_seconds_bucket{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m])) by (le, type, database))` |
+| **DB Query QPS** | 查询 QPS | `sum(rate(db_client_request_duration_seconds_count{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m])) by (type, database)` |
+| **DB Query Errors** | 查询错误 | `sum(rate(db_client_request_duration_seconds_count{namespace=~"$namespace",job=~"$service",instance=~"$instance",result!="success"}[1m])) by (type, database, result)` |
+| **DB Errors by Type** | 按错误类型分类 | `sum(rate(db_client_request_duration_seconds_count{namespace=~"$namespace",job=~"$service",instance=~"$instance",result!="success"}[1m])) by (result)` |
+
+### 2.8 🍃 MongoDB
+
+| 面板名称 | 说明 | PromQL |
+| :-------------------------------- | :------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| **MongoDB Command QPS** | 命令 QPS | `sum(rate(mongo_client_requests_total{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m])) by (command)` |
+| **MongoDB Command Latency (P99)** | 命令延迟 | `histogram_quantile(0.99, sum(rate(mongo_client_request_duration_seconds_bucket{namespace=~"$namespace",job=~"$service",instance=~"$instance"}[1m])) by (le, command))` |
+| **MongoDB Command Errors** | 命令错误 | `sum(rate(mongo_client_requests_total{namespace=~"$namespace",job=~"$service",instance=~"$instance",result="error"}[1m])) by (command)` |
+
+---
+
+## 3. 告警规则 (Alerting Rules)
+
+以下是基于 Prometheus 的推荐告警规则配置,涵盖了可用性、延迟、错误率、资源饱和度及运行时异常。
+
+[prometheus_alerts_template](./prometheus_alerts_template.yaml)
+
+---
+
+## 4. Go Runtime 指标解读
+
+### 4.1 内存指标关系
+
+```
+go_memstats_sys_bytes (从系统获取的总内存)
+├── go_memstats_heap_sys_bytes (堆内存)
+│ ├── go_memstats_heap_inuse_bytes (使用中的堆内存)
+│ │ └── go_memstats_heap_alloc_bytes (已分配的堆内存)
+│ └── go_memstats_heap_idle_bytes (空闲堆内存)
+│ └── go_memstats_heap_released_bytes (已释放给 OS 的内存)
+├── go_memstats_stack_sys_bytes (栈内存)
+├── go_memstats_mspan_sys_bytes (MSpan 元数据)
+├── go_memstats_mcache_sys_bytes (MCache 元数据)
+├── go_memstats_buck_hash_sys_bytes (性能分析哈希表)
+├── go_memstats_gc_sys_bytes (GC 元数据)
+└── go_memstats_other_sys_bytes (其他系统内存)
+```
+
+### 4.2 关键指标说明
+
+#### Goroutine 监控
+
+- **正常范围**: 取决于业务负载,通常在 100-1000 之间
+- **泄漏迹象**: 持续增长且不下降,或增长速率过快 (>100/s)
+- **优化建议**: 确保所有 goroutine 都有退出机制,避免永久阻塞
+
+#### 内存监控
+
+- **heap_alloc**: 实际使用的堆内存,频繁上下波动是正常的(GC 会回收)
+- **heap_inuse**: 包含已分配和待回收的内存,通常比 heap_alloc 大
+- **heap_sys**: 从系统申请的堆内存,增长后不会轻易释放
+- **泄漏迹象**: `heap_alloc` 持续增长、`heap_sys` 不断扩大且 GC 无法回收
+
+#### GC 监控
+
+- **正常 GC 耗时**: P99 应在 10-100ms 之内(依赖堆大小)
+- **正常 GC 频率**: 每分钟几次到几十次(依赖分配速率)
+- **GC CPU 占比**: 通常在 5%-25% 之间
+- **异常情况**:
+ - GC 耗时过长 (>1s): 可能堆太大或存在大对象
+ - GC 频率过高 (>5 次/s): 分配速率过快,考虑对象池复用
+ - GC CPU 占比过高 (>30%): 严重影响业务性能
+
+## 5. 错误分类说明 (Error Classification)
+
+为了在保留有用错误信息的同时避免指标爆炸(cardinality explosion),框架对错误进行了分类汇总。
+
+### 5.1 错误分类原则
+
+1. **避免指标爆炸**: 将错误归类为有限的几个类别(通常 5-10 个),而不是每个错误一个指标
+2. **保留有用信息**: 通过类别区分常见错误类型,便于监控和告警
+3. **性能优化**: 错误分类开销极小(< 200ns),相对于网络/IO 操作可忽略不计
+
+### 5.2 HTTP Client 错误分类
+
+HTTP 客户端错误通过 `error` 标签分类:
+
+| 错误类型 | 说明 | 常见场景 |
+| ------------------ | -------------- | ------------------------------------- |
+| `success` | 成功(无错误) | 请求成功完成 |
+| `timeout_error` | 超时错误 | context 超时、I/O 超时、网络超时 |
+| `connection_error` | 连接错误 | 连接被拒绝、连接丢失、EOF、网络不可达 |
+| `dns_error` | DNS 解析错误 | 主机未找到、DNS 查询失败 |
+| `tls_error` | TLS/SSL 错误 | 证书错误、握手失败、X509 验证失败 |
+| `other_error` | 其他错误 | 未分类的错误 |
+
+**注意**: HTTP 状态码(如 404、500)通过 `status` 标签单独上报,`error` 标签仅用于底层网络/协议错误。
+
+**示例 PromQL**:
+
+```promql
+# 查看超时错误
+sum(rate(http_client_requests_total{error="timeout_error"}[5m])) by (baseUrl, url)
+
+# 查看连接错误
+sum(rate(http_client_requests_total{error="connection_error"}[5m])) by (baseUrl, url)
+
+# 查看 DNS 错误
+sum(rate(http_client_requests_total{error="dns_error"}[5m])) by (baseUrl, url)
+```
+
+### 5.3 Redis Client 错误分类
+
+Redis 客户端错误通过 `result` 标签分类:
+
+| 错误类型 | 说明 | 常见场景 |
+| ------------------- | -------- | --------------------------------------- |
+| `success` | 成功 | 命令执行成功(包括 `redis.Nil`) |
+| `timeout_error` | 超时错误 | context 超时、I/O 超时 |
+| `connection_error` | 连接错误 | 连接被拒绝、连接丢失、连接关闭 |
+| `command_error` | 命令错误 | WRONGTYPE、未知命令、参数错误、NOSCRIPT |
+| `transaction_error` | 事务错误 | 事务失败、WATCH 失败、EXECABORT |
+| `auth_error` | 权限错误 | NOAUTH、认证失败、ACL 权限错误 |
+| `oom_error` | 内存不足 | OOM、内存限制 |
+| `cluster_error` | 集群错误 | MOVED、ASK、CLUSTERDOWN、跨槽错误 |
+| `other_error` | 其他错误 | 未分类的错误 |
+
+**示例 PromQL**:
+
+```promql
+# 查看连接错误
+sum(rate(redis_client_requests_total{result="connection_error"}[5m])) by (cmd)
+
+# 查看命令错误(可能是代码问题)
+sum(rate(redis_client_requests_total{result="command_error"}[5m])) by (cmd)
+
+# 查看内存不足错误(紧急)
+sum(rate(redis_client_requests_total{result="oom_error"}[5m]))
+```
+
+### 5.4 Database Client 错误分类
+
+数据库客户端错误通过 `result` 标签分类:
+
+| 错误类型 | 说明 | 常见场景 |
+| ------------------- | ------------ | ----------------------------------------- |
+| `success` | 成功 | 查询成功(包括 `gorm.ErrRecordNotFound`) |
+| `timeout_error` | 超时错误 | context 超时、查询超时、I/O 超时 |
+| `connection_error` | 连接错误 | 连接被拒绝、连接丢失、连接池耗尽 |
+| `constraint_error` | 约束错误 | 唯一键冲突、外键约束、非空约束 |
+| `syntax_error` | SQL 语法错误 | 语法错误、未知列/表、表不存在 |
+| `transaction_error` | 事务错误 | 死锁、锁等待超时、事务回滚 |
+| `other_error` | 其他错误 | 未分类的错误 |
+
+**示例 PromQL**:
+
+```promql
+# 查看连接错误
+sum(rate(db_client_request_duration_seconds_count{result="connection_error"}[5m])) by (database)
+
+# 查看超时错误
+sum(rate(db_client_request_duration_seconds_count{result="timeout_error"}[5m])) by (database)
+
+# 查看约束错误(可能是业务逻辑问题)
+sum(rate(db_client_request_duration_seconds_count{result="constraint_error"}[5m])) by (database)
+
+# 查看死锁错误(紧急)
+sum(rate(db_client_request_duration_seconds_count{result="transaction_error"}[5m])) by (database)
+```
+
+### 5.5 错误分类性能影响
+
+错误分类的性能开销极小:
+
+- **绝对开销**: 50-200 纳秒(ns)
+- **相对开销**: < 0.2%(相对于网络/IO 操作)
+- **内存开销**: 每次 1-2 个字符串分配(< 100 bytes)
+
+详细性能分析请参考:[错误分类性能分析文档](./error_classification_performance_analysis.md)
+
+---
+
+## 6. 常见问题诊断 (Troubleshooting)
+
+### 6.1 Go Runtime 问题
+
+#### 问题 1: Goroutine 泄漏
+
+**症状**: `go_goroutines` 持续增长
+**排查**:
+
+```promql
+# 查看 Goroutine 增长速率
+rate(go_goroutines[5m])
+
+# 对比不同实例
+go_goroutines by (instance)
+```
+
+**解决方案**:
+
+- 使用 `pprof` 工具分析 goroutine 堆栈
+- 检查 channel 是否正确关闭
+- 确保 context 取消信号正确传递
+
+#### 问题 2: 内存泄漏
+
+**症状**: `go_memstats_heap_alloc_bytes` 持续增长,GC 无法回收
+**排查**:
+
+```promql
+# 查看内存分配速率
+rate(go_memstats_alloc_bytes_total[1m])
+
+# 查看存活对象数
+go_memstats_mallocs_total - go_memstats_frees_total
+```
+
+**解决方案**:
+
+- 使用 `pprof` 工具分析内存分配
+- 检查是否有全局变量持续引用对象
+- 排查 map、slice 等容器是否及时清理
+
+#### 问题 3: GC 压力过大
+
+**症状**: `go_gc_duration_seconds` 过高或 `go_memstats_gc_cpu_fraction` 过高
+**排查**:
+
+```promql
+# GC 执行频率
+rate(go_gc_duration_seconds_count[1m])
+
+# GC 平均耗时
+rate(go_gc_duration_seconds_sum[1m]) / rate(go_gc_duration_seconds_count[1m])
+```
+
+**解决方案**:
+
+- 优化内存分配,使用对象池(sync.Pool)
+- 减少小对象分配,批量处理
+- 调大 `GOGC` 环境变量(默认 100)
+- 考虑使用 Go 1.19+ 的 Soft Memory Limit 特性
+
+#### 问题 4: 线程数异常增长
+
+**症状**: `go_threads` 持续增长,甚至导致程序 Crash (达到系统限制)。
+**排查**:
+
+```promql
+# 查看线程数趋势
+go_threads
+```
+
+**原因**:
+
+- Go runtime 在进行系统调用(System Call)或 CGO 调用时,如果被阻塞,会创建新的 OS 线程来调度其他 Goroutine。
+- 典型的阻塞场景:DNS 查询慢、文件 IO 阻塞、锁竞争。
+
+**解决方案**:
+
+- 优化阻塞的系统调用,使用非阻塞 IO
+- 限制并发度
+- 检查 CGO 代码逻辑
+
+### 6.2 中间件与服务问题
+
+#### 问题 5: 数据库连接池耗尽
+
+**症状**: 数据库操作延迟增加,出现 `driver: bad connection` 或连接等待超时错误。
+**排查**:
+
+```promql
+# 查看连接池使用率
+sum(db_client_connections_in_use) by (database) / sum(db_client_connections_max_open) by (database)
+
+# 查看连接等待次数
+rate(db_client_connections_wait_total[1m])
+```
+
+**解决方案**:
+
+- 调大 `SetMaxOpenConns`(需考虑 DB 服务端承载能力)
+- 检查是否存在慢 SQL 长期占用连接
+- 确保事务在所有路径(包括错误处理)中都能正确 `Commit` 或 `Rollback`
+
+#### 问题 6: Redis 延迟抖动
+
+**症状**: `redis_client_request_duration_seconds` P99 偶尔飙升,影响接口响应。
+**排查**:
+
+```promql
+# 按命令查看延迟
+histogram_quantile(0.99, sum(rate(redis_client_request_duration_seconds_bucket[5m])) by (le, cmd))
+```
+
+**解决方案**:
+
+- 检查是否使用了 `KEYS`、`HGETALL` 等 O(N) 复杂度的命令
+- 检查是否存在 Big Key(Value 过大),导致网络传输和序列化耗时增加
+- 检查 Redis 服务端是否有慢查询日志
+
+#### 问题 7: Context Cancelled / Timeout
+
+**症状**: 客户端收到大量 `context canceled` 或 `deadline exceeded` 错误。
+**排查**:
+
+```promql
+# 查看 gRPC/HTTP 错误码分布
+sum(rate(grpc_server_requests_total{code="Canceled"}[1m]))
+sum(rate(grpc_server_requests_total{code="DeadlineExceeded"}[1m]))
+```
+
+**原因**:
+
+- 上游服务设置的超时时间过短
+- 当前服务处理过慢(检查延迟指标)
+- 客户端在请求完成前主动断开了连接
+
+**解决方案**:
+
+- 检查链路超时配置,确保下游超时时间 < 上游超时时间
+- 优化接口性能
+- 增加重试机制(需配合指数退避)
+
+#### 问题 8: 定时任务堆积
+
+**症状**: `schedule_jobs_total` 正常,但任务执行时间超过了调度间隔,导致上一轮未结束下一轮又开始。
+**排查**:
+
+```promql
+# 查看任务执行耗时
+histogram_quantile(0.99, sum(rate(schedule_job_duration_seconds_bucket[5m])) by (le, task))
+```
+
+**解决方案**:
+
+- 增加分布式锁,确保同一时刻只有一个实例执行任务
+- 优化任务逻辑,减少单次执行时间
+- 调整调度间隔或使用消息队列异步处理
diff --git a/docs/prometheus_alerts_template.yaml b/docs/prometheus_alerts_template.yaml
new file mode 100644
index 0000000..d009edc
--- /dev/null
+++ b/docs/prometheus_alerts_template.yaml
@@ -0,0 +1,510 @@
+groups:
+ - name: box-alerts
+ rules:
+ # ==========================================================
+ # HTTP Server Alerts
+ # HTTP 服务器相关告警:用户体验、错误率、延迟
+ # ==========================================================
+
+ # Apdex Score
+ - alert: HttpServerApdexScoreLow
+ expr: |
+ (
+ sum by(namespace, job) (rate(http_server_request_duration_seconds_bucket{le="0.25"}[5m])) * 0.5 +
+ sum by(namespace, job) (rate(http_server_request_duration_seconds_bucket{le="1"}[5m])) * 0.5
+ )
+ /
+ sum by(namespace, job) (rate(http_server_request_duration_seconds_count[5m])) < 0.7
+ for: 5m
+ labels:
+ severity: warning
+ component: http-server
+ type: performance
+ annotations:
+ summary: 'HTTP Server Apdex score: {{ $value | printf "%.2f" }}'
+ description: "User satisfaction score (Apdex) is below 0.7 (Fair)."
+
+ # QPS / Traffic
+ - alert: HttpServerQpsHigh
+ expr: sum by (namespace, job) (rate(http_server_requests_total[1m])) > 1000
+ for: 10s
+ labels:
+ severity: warning
+ component: http-server
+ type: saturation
+ annotations:
+ summary: "HTTP Server QPS: {{ $value | printf \"%.0f\" }}"
+ description: "QPS exceeded 1000 requests per second"
+
+ # Error Rate - Critical
+ - alert: HttpServerStatusCodeErrorRateCritical
+ expr: ((sum by(namespace,job,instance,method,status,url) (rate(http_server_requests_total{status!="200"}[1m]))) / (sum by(namespace,job,instance,method,status,url) (rate(http_server_requests_total[1m])))) * 100 > 1
+ for: 0s
+ labels:
+ severity: critical
+ component: http-server
+ type: availability
+ annotations:
+ summary: "HTTP Server status code error rate: {{ $value | printf \"%.2f\" }}%"
+ description: "{{ $labels.method }} {{ $labels.url }} status: {{ $labels.status }}"
+
+ # Error Rate - Warning
+ - alert: HttpServerStatusCodeErrorRateHigh
+ expr: ((sum by(namespace,job,instance,method,status,url) (rate(http_server_requests_total{status!="200"}[1m]))) / (sum by(namespace,job,instance,method,status,url) (rate(http_server_requests_total[1m])))) * 100 > 0.1
+ for: 10s
+ labels:
+ severity: warning
+ component: http-server
+ type: availability
+ annotations:
+ summary: "HTTP Server status code error rate: {{ $value | printf \"%.2f\" }}%"
+ description: "{{ $labels.method }} {{ $labels.url }} status: {{ $labels.status }}"
+
+ # Error Code Rate - Critical
+ - alert: HttpServerErrorCodeRateCritical
+ expr: ((sum by(namespace,job,instance,method,errcode,url) (rate(http_server_requests_total{errcode!="0"}[1m]))) / (sum by(namespace,job,instance,method,errcode,url) (rate(http_server_requests_total[1m])))) * 100 > 1
+ for: 0s
+ labels:
+ severity: critical
+ component: http-server
+ type: availability
+ annotations:
+ summary: "HTTP Server business error rate: {{ $value | printf \"%.2f\" }}%"
+ description: "{{ $labels.method }} {{ $labels.url }} errcode: {{ $labels.errcode }}"
+
+ # Error Code Rate - Warning
+ - alert: HttpServerErrorCodeRateHigh
+ expr: ((sum by(namespace,job,instance,method,errcode,url) (rate(http_server_requests_total{errcode!="0"}[1m]))) / (sum by(namespace,job,instance,method,errcode,url) (rate(http_server_requests_total[1m])))) * 100 > 0.1
+ for: 10s
+ labels:
+ severity: warning
+ component: http-server
+ type: availability
+ annotations:
+ summary: "HTTP Server business error rate: {{ $value | printf \"%.2f\" }}%"
+ description: "{{ $labels.method }} {{ $labels.url }} errcode: {{ $labels.errcode }}"
+
+ # Latency
+ - alert: HttpServerLatencyP99Critical
+ expr: histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket[1m])) by (le, namespace, job, instance, method, url)) > 5
+ for: 0s
+ labels:
+ severity: critical
+ component: http-server
+ type: performance
+ annotations:
+ summary: "HTTP Server P99 latency: {{ $value | humanizeDuration }}"
+ description: "{{ $labels.method }} {{ $labels.url }}"
+
+ - alert: HttpServerLatencyP99High
+ expr: histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket[1m])) by (le, namespace, job, instance, method, url)) > 0.5
+ for: 10s
+ labels:
+ severity: warning
+ component: http-server
+ type: performance
+ annotations:
+ summary: "HTTP Server P99 latency: {{ $value | humanizeDuration }}"
+ description: "{{ $labels.method }} {{ $labels.url }}"
+
+
+
+ # ==========================================================
+ # HTTP Client Alerts
+ # HTTP 客户端相关告警:下游服务调用错误、延迟、并发
+ # ==========================================================
+
+ # Status Code Error Rate - Critical
+ - alert: HttpClientStatusCodeErrorRateCritical
+ expr: ((sum by(namespace,job,instance,method,baseUrl,url,status) (rate(http_client_requests_total{status!="200"}[1m]))) / (sum by(namespace,job,instance,method,baseUrl,url,status) (rate(http_client_requests_total[1m])))) * 100 > 1
+ for: 0s
+ labels:
+ severity: critical
+ component: http-client
+ type: availability
+ annotations:
+ summary: "HTTP Client status code error rate: {{ $value | printf \"%.2f\" }}%"
+ description: "{{ $labels.method }} {{ $labels.status }} {{$labels.baseUrl}}{{ $labels.url }}"
+
+ # Status Code Error Rate - Warning
+ - alert: HttpClientStatusCodeErrorRateHigh
+ expr: ((sum by(namespace,job,instance,method,baseUrl,url,status) (rate(http_client_requests_total{status!="200"}[1m]))) / (sum by(namespace,job,instance,method,baseUrl,url,status) (rate(http_client_requests_total[1m])))) * 100 > 0.1
+ for: 10s
+ labels:
+ severity: warning
+ component: http-client
+ type: availability
+ annotations:
+ summary: "HTTP Client status code error rate: {{ $value | printf \"%.2f\" }}%"
+ description: "{{ $labels.method }} {{ $labels.status }} {{$labels.baseUrl}}{{ $labels.url }}"
+
+ # Error Result Rate - Critical
+ - alert: HttpClientErrorRateCritical
+ expr: ((sum by(namespace,job,instance,method,baseUrl,url) (rate(http_client_requests_total{error!=""}[1m]))) / (sum by(namespace,job,instance,method,baseUrl,url) (rate(http_client_requests_total[1m])))) * 100 > 1
+ for: 0s
+ labels:
+ severity: critical
+ component: http-client
+ type: availability
+ annotations:
+ summary: "HTTP Client error rate: {{ $value | printf \"%.2f\" }}%"
+ description: "{{ $labels.method }} {{$labels.baseUrl}}{{ $labels.url }}"
+
+ # Error Result Rate - Warning
+ - alert: HttpClientErrorRateHigh
+ expr: ((sum by(namespace,job,instance,method,baseUrl,url) (rate(http_client_requests_total{error!=""}[1m]))) / (sum by(namespace,job,instance,method,baseUrl,url) (rate(http_client_requests_total[1m])))) * 100 > 0.1
+ for: 10s
+ labels:
+ severity: warning
+ component: http-client
+ type: availability
+ annotations:
+ summary: "HTTP Client error rate: {{ $value | printf \"%.2f\" }}%"
+ description: "{{ $labels.method }} {{$labels.baseUrl}}{{ $labels.url }}"
+
+ # Latency - Critical
+ - alert: HttpClientLatencyP99Critical
+ expr: histogram_quantile(0.99, sum(rate(http_client_request_duration_seconds_bucket[1m])) by (le, namespace, job, instance, method, baseUrl, url)) > 10
+ for: 0s
+ labels:
+ severity: critical
+ component: http-client
+ type: performance
+ annotations:
+ summary: "HTTP Client P99 latency: {{ $value | humanizeDuration }}"
+ description: "{{ $labels.method }} {{$labels.baseUrl}}{{ $labels.url }}"
+
+ # Latency - Warning
+ - alert: HttpClientLatencyP99High
+ expr: histogram_quantile(0.99, sum(rate(http_client_request_duration_seconds_bucket[1m])) by (le, namespace, job, instance, method, baseUrl, url)) > 0.5
+ for: 10s
+ labels:
+ severity: warning
+ component: http-client
+ type: performance
+ annotations:
+ summary: "HTTP Client P99 latency: {{ $value | humanizeDuration }}"
+ description: "{{ $labels.method }} {{$labels.baseUrl}}{{ $labels.url }}"
+
+ # Inflight / Saturation
+ - alert: HttpClientInflightHigh
+ expr: http_client_requests_inflight > 20
+ for: 10s
+ labels:
+ severity: warning
+ component: http-client
+ type: saturation
+ annotations:
+ summary: "HTTP Client inflight requests: {{ $value }}"
+ description: "{{ $labels.method }} {{ $labels.url }}"
+
+
+
+ # ==========================================================
+ # gRPC Server Alerts
+ # gRPC 服务器相关告警:错误率、Panic
+ # ==========================================================
+
+ - alert: GrpcServerErrorRateHigh
+ expr: |
+ (sum by(namespace, job) (rate(grpc_server_requests_total{code!="OK"}[1m]))
+ /
+ sum by(namespace, job) (rate(grpc_server_requests_total[1m]))) > 0.05
+ for: 2m
+ labels:
+ severity: critical
+ component: grpc-server
+ type: availability
+ annotations:
+ summary: "gRPC Server error rate: {{ $value | humanizePercentage }}"
+ description: "gRPC error rate is above 5% for the last 2 minutes."
+
+ - alert: GrpcServerPanic
+ expr: increase(grpc_server_panics_total[1m]) > 0
+ for: 0m
+ labels:
+ severity: critical
+ component: grpc-server
+ type: availability
+ annotations:
+ summary: "gRPC Server panic detected"
+ description: "gRPC service recovered from a panic."
+
+
+
+ # ==========================================================
+ # Database Alerts
+ # 数据库相关告警:错误率、延迟、连接池饱和
+ # ==========================================================
+
+ - alert: DbErrorRateHigh
+ expr: |
+ (sum by(namespace, job) (rate(db_client_request_duration_seconds_count{result="error"}[1m]))
+ /
+ sum by(namespace, job) (rate(db_client_request_duration_seconds_count[1m]))) > 0.05
+ for: 2m
+ labels:
+ severity: critical
+ component: database
+ type: availability
+ annotations:
+ summary: "Database error rate: {{ $value | humanizePercentage }}"
+ description: "Database query error rate is above 5%."
+
+ - alert: DbLatencyP99High
+ expr: histogram_quantile(0.99, sum(rate(db_client_request_duration_seconds_bucket[5m])) by (le, namespace, job)) > 0.5
+ for: 5m
+ labels:
+ severity: warning
+ component: database
+ type: performance
+ annotations:
+ summary: "Database P99 latency: {{ $value }}s"
+ description: "Database P99 latency is above 500ms for the last 5 minutes."
+
+ - alert: DbConnectionPoolSaturationHigh
+ expr: |
+ sum by(namespace, job, database) (db_client_connections_in_use)
+ /
+ sum by(namespace, job, database) (db_client_connections_max_open) > 0.8
+ for: 5m
+ labels:
+ severity: warning
+ component: database
+ type: saturation
+ annotations:
+ summary: "Database connection pool saturation: {{ $value | humanizePercentage }}"
+ description: "Database {{ $labels.database }} connection pool usage is above 80%."
+
+
+
+ # ==========================================================
+ # Redis Alerts
+ # Redis 相关告警:命令失败、延迟
+ # ==========================================================
+
+ - alert: RedisCommandFailureHigh
+ expr: increase(redis_client_requests_total{result="error"}[1m]) > 5
+ for: 10s
+ labels:
+ severity: critical
+ component: redis
+ type: availability
+ annotations:
+ summary: "Redis command failure count: {{ $value }}"
+ description: "{{ $labels.cmd }} on {{ $labels.address }}"
+
+ - alert: RedisLatencyP99High
+ expr: histogram_quantile(0.99, sum(rate(redis_client_request_duration_seconds_bucket[5m])) by (le, namespace, job)) > 0.1
+ for: 5m
+ labels:
+ severity: warning
+ component: redis
+ type: performance
+ annotations:
+ summary: "Redis P99 latency: {{ $value }}s"
+ description: "Redis P99 latency is above 100ms for the last 5 minutes."
+
+
+
+ # ==========================================================
+ # MongoDB Alerts
+ # MongoDB 相关告警:错误率、延迟
+ # ==========================================================
+
+ - alert: MongoErrorRateHigh
+ expr: |
+ (sum by(namespace, job) (rate(mongo_client_requests_total{result="error"}[1m]))
+ /
+ sum by(namespace, job) (rate(mongo_client_requests_total[1m]))) > 0.05
+ for: 2m
+ labels:
+ severity: critical
+ component: mongodb
+ type: availability
+ annotations:
+ summary: "MongoDB error rate: {{ $value | humanizePercentage }}"
+ description: "MongoDB command error rate is above 5%."
+
+ - alert: MongoLatencyP99High
+ expr: histogram_quantile(0.99, sum(rate(mongo_client_request_duration_seconds_bucket[5m])) by (le, namespace, job)) > 0.5
+ for: 5m
+ labels:
+ severity: warning
+ component: mongodb
+ type: performance
+ annotations:
+ summary: "MongoDB P99 latency: {{ $value }}s"
+ description: "MongoDB P99 latency is above 500ms for the last 5 minutes."
+
+
+
+ # ==========================================================
+ # Schedule Job Alerts
+ # 定时任务相关告警:失败、Panic
+ # ==========================================================
+
+ - alert: ScheduleJobFailed
+ expr: increase(schedule_jobs_total{result="error"}[1m]) > 0
+ for: 0s
+ labels:
+ severity: critical
+ component: schedule
+ type: availability
+ annotations:
+ summary: "Schedule job failed"
+ description: "Job {{ $labels.name }} failed with error"
+
+ - alert: ScheduleJobPanic
+ expr: increase(schedule_jobs_total{result="panic"}[1m]) > 0
+ for: 0s
+ labels:
+ severity: critical
+ component: schedule
+ type: availability
+ annotations:
+ summary: "Schedule job panic"
+ description: "Job {{ $labels.name }} recovered from panic"
+
+
+
+ # ==========================================================
+ # Go Runtime Alerts
+ # Go 运行时相关告警:Goroutine、Thread、Memory、GC
+ # ==========================================================
+
+ # Goroutine Issues
+ - alert: GoGoroutinesCritical
+ expr: go_goroutines > 2000
+ for: 10s
+ labels:
+ severity: critical
+ component: go-runtime
+ type: saturation
+ annotations:
+ summary: "Go goroutines: {{ $value }}"
+ description: "Goroutine count exceeded 2000"
+
+ - alert: GoGoroutinesHigh
+ expr: go_goroutines > 500
+ for: 10s
+ labels:
+ severity: warning
+ component: go-runtime
+ type: saturation
+ annotations:
+ summary: "Go goroutines: {{ $value }}"
+ description: "Goroutine count exceeded 500"
+
+ - alert: GoGoroutineLeak
+ expr: rate(go_goroutines[5m]) > 100
+ for: 10m
+ labels:
+ severity: critical
+ component: go-runtime
+ type: saturation
+ annotations:
+ summary: "Potential goroutine leak"
+ description: "Goroutine count is increasing rapidly (>100/s rate)."
+
+ # Thread Issues
+ - alert: GoThreadsCritical
+ expr: go_threads > 500
+ for: 10s
+ labels:
+ severity: critical
+ component: go-runtime
+ type: saturation
+ annotations:
+ summary: "Go threads: {{ $value }}"
+ description: "OS thread count exceeded 500"
+
+ - alert: GoThreadsHigh
+ expr: go_threads > 200
+ for: 10s
+ labels:
+ severity: warning
+ component: go-runtime
+ type: saturation
+ annotations:
+ summary: "Go threads: {{ $value }}"
+ description: "OS thread count exceeded 200"
+
+ # Memory Issues
+ - alert: GoMemoryUsageCritical
+ expr: go_memstats_sys_bytes > 4096000000
+ for: 1m
+ labels:
+ severity: critical
+ component: go-runtime
+ type: saturation
+ annotations:
+ summary: "Go memory usage: {{ $value | humanize1024 }}B"
+ description: "Memory usage exceeded 4GB"
+
+ - alert: GoMemoryUsageHigh
+ expr: go_memstats_sys_bytes > 1024000000
+ for: 1m
+ labels:
+ severity: warning
+ component: go-runtime
+ type: saturation
+ annotations:
+ summary: "Go memory usage: {{ $value | humanize1024 }}B"
+ description: "Memory usage exceeded 1GB"
+
+ - alert: GoMemoryLeak
+ expr: rate(go_memstats_heap_alloc_bytes[5m]) > 5e6
+ for: 15m
+ labels:
+ severity: critical
+ component: go-runtime
+ type: saturation
+ annotations:
+ summary: "Potential memory leak"
+ description: "Heap allocation is growing rapidly (>5MB/s rate)."
+
+ # GC Issues
+ - alert: GoGcDurationCritical
+ expr: go_gc_duration_seconds{quantile="1"} > 0.1
+ for: 10s
+ labels:
+ severity: critical
+ component: go-runtime
+ type: performance
+ annotations:
+ summary: "Go GC duration: {{ $value | humanizeDuration }}"
+ description: "GC duration exceeded 100ms"
+
+ - alert: GoGcDurationHigh
+ expr: go_gc_duration_seconds{quantile="1"} > 0.01
+ for: 10s
+ labels:
+ severity: warning
+ component: go-runtime
+ type: performance
+ annotations:
+ summary: "Go GC duration: {{ $value | humanizeDuration }}"
+ description: "GC duration exceeded 10ms"
+
+ - alert: GoGcRateHigh
+ expr: rate(go_gc_duration_seconds_count[1m]) > 5
+ for: 5m
+ labels:
+ severity: warning
+ component: go-runtime
+ type: performance
+ annotations:
+ summary: "Go GC rate: {{ $value }}/s"
+ description: "GC is running more than 5 times per second."
+
+ - alert: GoGcCpuUsageHigh
+ expr: go_memstats_gc_cpu_fraction > 0.3
+ for: 5m
+ labels:
+ severity: warning
+ component: go-runtime
+ type: performance
+ annotations:
+ summary: "Go GC CPU usage: {{ $value | humanizePercentage }}"
+ description: "GC is consuming more than 30% of CPU time."
diff --git a/metric.go b/metric.go
index 8d21f05..05b9c8c 100644
--- a/metric.go
+++ b/metric.go
@@ -18,7 +18,7 @@ var (
boxMetricGauge = metric.Default.NewGaugeVec(
"box_info",
"Information about the box config and environment.",
- []string{"name", "version", "tags", "ip", "localhost", "start"})
+ []string{"tags", "ip", "localhost", "start"})
)
func (boxMetric) Name() string {
@@ -27,8 +27,6 @@ func (boxMetric) Name() string {
func (boxMetric) Serve(ctx context.Context) error {
boxMetricGauge.WithLabelValues(
- config.ServiceName(),
- config.ServiceVersion(),
strings.Join(config.ServiceTag(), ","),
system.IP(),
system.Hostname(),
diff --git a/pkg/client/gormx/metric.go b/pkg/client/gormx/metric.go
index 498fe41..2feabe5 100644
--- a/pkg/client/gormx/metric.go
+++ b/pkg/client/gormx/metric.go
@@ -1,10 +1,16 @@
package gormx
import (
+ "context"
"database/sql"
+ "errors"
+ "net"
+ "os"
+ "strings"
"time"
"github.com/boxgo/box/pkg/metric"
+ "gorm.io/gorm"
)
type (
@@ -24,19 +30,19 @@ const (
)
var (
- metricConnIdle = metric.NewGaugeVec("db_connections_idle", `The number of idle connections.`, []string{labelDriver, labelDatabase})
- metricConnInUse = metric.NewGaugeVec("db_connections_in_use", `The number of connections currently in use.`, []string{labelDriver, labelDatabase})
- metricConnOpen = metric.NewGaugeVec("db_connections_open", `The number of established connections both in use and idle.`, []string{labelDriver, labelDatabase})
- metricConnMaxOpen = metric.NewGaugeVec("db_connections_max_open", `Maximum number of open connections to the database.`, []string{labelDriver, labelDatabase})
- metricWaitCount = metric.NewGaugeVec("db_wait_count", `The total number of connections waited for.`, []string{labelDriver, labelDatabase})
- metricWaitDuration = metric.NewGaugeVec("db_wait_duration_seconds", `The total time blocked waiting for a new connection.`, []string{labelDriver, labelDatabase})
- metricSQLSeconds = metric.NewSummaryVec("db_sql_seconds", `All queries requested seconds`, []string{labelDriver, labelDatabase, labelType, labelError}, map[float64]float64{
- 0.5: 0.05,
- 0.75: 0.05,
- 0.9: 0.01,
- 0.99: 0.001,
- 1: 0.001,
- })
+ metricConnIdle = metric.NewGaugeVec("db_client_connections_idle", `The number of idle connections.`, []string{labelDriver, labelDatabase})
+ metricConnInUse = metric.NewGaugeVec("db_client_connections_in_use", `The number of connections currently in use.`, []string{labelDriver, labelDatabase})
+ metricConnOpen = metric.NewGaugeVec("db_client_connections_open", `The number of established connections both in use and idle.`, []string{labelDriver, labelDatabase})
+ metricConnMaxOpen = metric.NewGaugeVec("db_client_connections_max_open", `Maximum number of open connections to the database.`, []string{labelDriver, labelDatabase})
+ metricWaitCount = metric.NewGaugeVec("db_client_connections_wait_total", `The total number of connections waited for.`, []string{labelDriver, labelDatabase})
+ metricWaitDuration = metric.NewGaugeVec("db_client_connections_wait_seconds", `The total time blocked waiting for a new connection.`, []string{labelDriver, labelDatabase})
+ metricSQLDuration = metric.NewHistogramVec(
+ "db_client_request_duration_seconds",
+ "The SQL execution latencies in seconds.",
+ []string{labelDriver, labelDatabase, labelType, "result"},
+ // 250us, 500us, 1ms, 2.5ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s
+ []float64{0.00025, 0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5},
+ )
)
func newMetric(driver, database string, statsInterval time.Duration) *Metric {
@@ -115,12 +121,8 @@ func (m *Metric) beforeCallback(db *DB) {
func (m *Metric) afterCallback(cmdType string) func(*DB) {
return func(db *DB) {
- err := ""
second := 0.0
-
- if db.Statement.Error != nil {
- err = db.Statement.Error.Error()
- }
+ result := classifyError(db.Statement.Error)
if ts, ok := db.InstanceGet("startTime"); ok {
if startTime, ok := ts.(time.Time); ok {
@@ -128,8 +130,224 @@ func (m *Metric) afterCallback(cmdType string) func(*DB) {
}
}
- metricSQLSeconds.WithLabelValues(m.driver, m.database, cmdType, err).Observe(second)
+ metricSQLDuration.WithLabelValues(m.driver, m.database, cmdType, result).Observe(second)
+ }
+}
+
+// classifyError 将数据库错误分类为有限的几个类别,避免指标爆炸
+// 同时尽可能保留有用的错误信息
+func classifyError(err error) string {
+ if err == nil {
+ return "success"
+ }
+
+ // 检查 GORM 标准错误
+ if errors.Is(err, gorm.ErrRecordNotFound) {
+ return "not_found"
+ }
+ if errors.Is(err, gorm.ErrInvalidTransaction) {
+ return "transaction_error"
+ }
+ if errors.Is(err, gorm.ErrMissingWhereClause) {
+ return "syntax_error"
+ }
+ if errors.Is(err, gorm.ErrPrimaryKeyRequired) {
+ return "constraint_error"
+ }
+
+ errStr := strings.ToLower(err.Error())
+
+ // 连接相关错误
+ if isConnectionError(err, errStr) {
+ return "connection_error"
+ }
+
+ // 超时错误
+ if isTimeoutError(err, errStr) {
+ return "timeout_error"
+ }
+
+ // 约束错误(唯一键冲突、外键约束、非空约束等)
+ if isConstraintError(errStr) {
+ return "constraint_error"
}
+
+ // SQL 语法错误
+ if isSyntaxError(errStr) {
+ return "syntax_error"
+ }
+
+ // 事务相关错误
+ if isTransactionError(errStr) {
+ return "transaction_error"
+ }
+
+ // 其他错误统一归类
+ return "other_error"
+}
+
+// isConnectionError 判断是否为连接相关错误
+func isConnectionError(err error, errStr string) bool {
+ // 检查标准库错误
+ if errors.Is(err, sql.ErrConnDone) {
+ return true
+ }
+
+ // 检查错误消息中的关键词
+ connectionKeywords := []string{
+ "connection",
+ "connect",
+ "connection refused",
+ "connection reset",
+ "connection lost",
+ "connection closed",
+ "no connection",
+ "broken pipe",
+ "network",
+ "dial tcp",
+ "connection timeout",
+ "too many connections",
+ "max connections",
+ "connection pool",
+ "driver: bad connection",
+ "server has gone away",
+ "lost connection",
+ }
+
+ for _, keyword := range connectionKeywords {
+ if strings.Contains(errStr, keyword) {
+ return true
+ }
+ }
+
+ return false
+}
+
+// isTimeoutError 判断是否为超时错误
+func isTimeoutError(err error, errStr string) bool {
+ // 检查标准库超时错误
+ if os.IsTimeout(err) {
+ return true
+ }
+
+ // 检查 context 超时错误
+ if errors.Is(err, context.DeadlineExceeded) || errors.Is(err, os.ErrDeadlineExceeded) {
+ return true
+ }
+
+ // 检查 net.Error 接口的 Timeout() 方法
+ var netErr net.Error
+ if errors.As(err, &netErr) && netErr.Timeout() {
+ return true
+ }
+
+ // 检查错误消息中的关键词
+ timeoutKeywords := []string{
+ "timeout",
+ "context deadline exceeded",
+ "context canceled",
+ "deadline exceeded",
+ "operation timed out",
+ "i/o timeout",
+ "read timeout",
+ "write timeout",
+ "query timeout",
+ "statement timeout",
+ }
+
+ for _, keyword := range timeoutKeywords {
+ if strings.Contains(errStr, keyword) {
+ return true
+ }
+ }
+
+ return false
+}
+
+// isConstraintError 判断是否为约束错误
+func isConstraintError(errStr string) bool {
+ constraintKeywords := []string{
+ "duplicate entry",
+ "unique constraint",
+ "unique violation",
+ "duplicate key",
+ "primary key",
+ "foreign key",
+ "constraint violation",
+ "check constraint",
+ "not null",
+ "cannot be null",
+ "violates not-null constraint",
+ "violates foreign key constraint",
+ "violates unique constraint",
+ "violates check constraint",
+ "integrity constraint",
+ "duplicate",
+ "already exists",
+ "1062", // MySQL duplicate entry error code
+ "23505", // PostgreSQL unique violation error code
+ "23503", // PostgreSQL foreign key violation error code
+ "23502", // PostgreSQL not null violation error code
+ }
+
+ for _, keyword := range constraintKeywords {
+ if strings.Contains(errStr, keyword) {
+ return true
+ }
+ }
+
+ return false
+}
+
+// isSyntaxError 判断是否为 SQL 语法错误
+func isSyntaxError(errStr string) bool {
+ syntaxKeywords := []string{
+ "syntax error",
+ "sql syntax",
+ "parse error",
+ "invalid syntax",
+ "unexpected token",
+ "unexpected end",
+ "missing",
+ "unknown column",
+ "unknown table",
+ "table doesn't exist",
+ "column doesn't exist",
+ "1064", // MySQL syntax error code
+ "42601", // PostgreSQL syntax error code
+ }
+
+ for _, keyword := range syntaxKeywords {
+ if strings.Contains(errStr, keyword) {
+ return true
+ }
+ }
+
+ return false
+}
+
+// isTransactionError 判断是否为事务相关错误
+func isTransactionError(errStr string) bool {
+ transactionKeywords := []string{
+ "transaction",
+ "deadlock",
+ "lock wait timeout",
+ "lock wait",
+ "could not serialize",
+ "serialization failure",
+ "transaction rollback",
+ "transaction commit",
+ "in failed sql transaction",
+ "current transaction is aborted",
+ }
+
+ for _, keyword := range transactionKeywords {
+ if strings.Contains(errStr, keyword) {
+ return true
+ }
+ }
+
+ return false
}
func callbackName(cmd string) string {
diff --git a/pkg/client/mongodb/metric.go b/pkg/client/mongodb/metric.go
index 1e65d73..b3614bf 100644
--- a/pkg/client/mongodb/metric.go
+++ b/pkg/client/mongodb/metric.go
@@ -19,25 +19,20 @@ type (
var (
cmdTotal = metric.NewCounterVec(
- "mongo_client_command_total",
- "mongodb client command counter",
- []string{"command", "error"},
+ "mongo_client_requests_total",
+ "The total number of MongoDB commands executed.",
+ []string{"command", "result"},
)
- cmdDuration = metric.NewSummaryVec(
- "mongo_client_command_duration_seconds",
- "mongodb client command duration seconds",
- []string{"command", "error"},
- map[float64]float64{
- 0.5: 0.05,
- 0.75: 0.05,
- 0.9: 0.01,
- 0.99: 0.001,
- 1: 0.001,
- },
+ cmdDuration = metric.NewHistogramVec(
+ "mongo_client_request_duration_seconds",
+ "The MongoDB command latencies in seconds.",
+ []string{"command", "result"},
+ // 250us, 500us, 1ms, 2.5ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s
+ []float64{0.00025, 0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5},
)
workingSession = metric.NewGaugeVec(
- "mongo_client_session_in_progress",
- "mongo client session in progress gauge",
+ "mongo_client_sessions_inflight",
+ "The number of MongoDB sessions currently in progress.",
[]string{},
)
)
@@ -80,7 +75,7 @@ func (mon *metricMonitor) Started(ctx context.Context, ev *event.CommandStartedE
}
func (mon *metricMonitor) Succeeded(ctx context.Context, ev *event.CommandSucceededEvent) {
- labels := []string{ev.CommandName, ""}
+ labels := []string{ev.CommandName, "success"}
cmdTotal.WithLabelValues(labels...).Inc()
cmdDuration.WithLabelValues(labels...).Observe(time.Duration(ev.DurationNanos).Seconds())
@@ -88,7 +83,7 @@ func (mon *metricMonitor) Succeeded(ctx context.Context, ev *event.CommandSuccee
}
func (mon *metricMonitor) Failed(ctx context.Context, ev *event.CommandFailedEvent) {
- labels := []string{ev.CommandName, ev.Failure}
+ labels := []string{ev.CommandName, "error"}
cmdTotal.WithLabelValues(labels...).Inc()
cmdDuration.WithLabelValues(labels...).Observe(time.Duration(ev.DurationNanos).Seconds())
diff --git a/pkg/client/redis/logger.go b/pkg/client/redis/logger.go
index b67caf5..780ca2f 100644
--- a/pkg/client/redis/logger.go
+++ b/pkg/client/redis/logger.go
@@ -10,10 +10,18 @@ import (
type (
Logger struct {
- cfg *Config
+ cfg *Config
+ addr string
}
)
+func newLogger(cfg *Config) *Logger {
+ return &Logger{
+ cfg: cfg,
+ addr: strings.Join(cfg.Address, ","),
+ }
+}
+
func (inst *Logger) DialHook(next redis.DialHook) redis.DialHook {
return next
}
@@ -58,7 +66,7 @@ func (inst *Logger) log(ctx context.Context, pipe bool, cmds ...redis.Cmder) {
if len(errArr) > 0 {
logger.Trace(ctx).Errorw("Redis.Error",
- "address", strings.Join(inst.cfg.Address, ","),
+ "address", inst.addr,
"db", inst.cfg.DB,
"err", strings.Join(errArr, ";"),
"cmd", strings.Join(cmdArr, ";"),
diff --git a/pkg/client/redis/metric.go b/pkg/client/redis/metric.go
index 91d1db2..0f6c5e1 100644
--- a/pkg/client/redis/metric.go
+++ b/pkg/client/redis/metric.go
@@ -2,40 +2,45 @@ package redis
import (
"context"
- "fmt"
+ "errors"
+ "net"
+ "os"
+ "strconv"
"strings"
"time"
"github.com/boxgo/box/pkg/metric"
- "github.com/boxgo/box/pkg/trace"
"github.com/redis/go-redis/v9"
)
type (
Metric struct {
- cfg *Config
+ cfg *Config
+ addr string
}
startKey struct{}
)
+func newMetric(cfg *Config) *Metric {
+ return &Metric{
+ cfg: cfg,
+ addr: strings.Join(cfg.Address, ","),
+ }
+}
+
var (
cmdTotal = metric.NewCounterVec(
- "redis_client_command_total",
- "redis command counter",
- []string{"bid", "address", "db", "masterName", "pipe", "cmd", "error"},
+ "redis_client_requests_total",
+ "The total number of Redis commands executed.",
+ []string{"address", "db", "masterName", "pipe", "cmd", "result"},
)
- cmdDuration = metric.NewSummaryVec(
- "redis_client_command_duration_seconds",
- "redis command duration seconds",
- []string{"bid", "address", "db", "masterName", "pipe", "cmd", "error"},
- map[float64]float64{
- 0.5: 0.05,
- 0.75: 0.05,
- 0.9: 0.01,
- 0.99: 0.001,
- 1: 0.001,
- },
+ cmdDuration = metric.NewHistogramVec(
+ "redis_client_request_duration_seconds",
+ "The Redis command latencies in seconds.",
+ []string{"address", "db", "masterName", "pipe", "cmd", "result"},
+ // 100us, 250us, 500us, 1ms, 2.5ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms
+ []float64{0.0001, 0.00025, 0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5},
)
)
@@ -68,40 +73,252 @@ func (m *Metric) ProcessPipelineHook(next redis.ProcessPipelineHook) redis.Proce
}
func (m *Metric) report(ctx context.Context, pipe bool, elapsed time.Duration, cmds ...redis.Cmder) {
- addressStr := strings.Join(m.cfg.Address, ",")
- dbStr := fmt.Sprintf("%d", m.cfg.DB)
- masterNameStr := m.cfg.MasterName
- errStr := ""
cmdStr := ""
- pipeStr := fmt.Sprintf("%t", pipe)
+ result := "success"
+ masterNameStr := m.cfg.MasterName
+ addressStr := m.addr
+ dbStr := strconv.Itoa(m.cfg.DB)
+ pipeStr := strconv.FormatBool(pipe)
- for _, cmd := range cmds {
- cmdStr += cmd.Name() + ";"
+ if pipe {
+ cmdStr = "pipeline"
+ } else if len(cmds) > 0 {
+ cmdStr = cmds[0].Name()
+ }
+ for _, cmd := range cmds {
if err := cmd.Err(); err != nil && err != redis.Nil {
- errStr += err.Error() + ";"
+ result = classifyRedisError(err)
+ break
}
}
- cmdStr = strings.TrimSuffix(cmdStr, ";")
-
- var (
- bizID string
- )
-
- if bizIDStr, ok := ctx.Value(trace.BizID()).(string); ok {
- bizID = bizIDStr
- }
values := []string{
- bizID,
addressStr,
dbStr,
masterNameStr,
pipeStr,
cmdStr,
- errStr,
+ result,
}
cmdDuration.WithLabelValues(values...).Observe(elapsed.Seconds())
cmdTotal.WithLabelValues(values...).Inc()
}
+
+// classifyRedisError 将 Redis 错误分类为有限的几个类别,避免指标爆炸
+// 同时尽可能保留有用的错误信息
+func classifyRedisError(err error) string {
+ if err == nil {
+ return "success"
+ }
+
+ // 检查 redis.Nil(键不存在,这是正常情况,不应该算作错误)
+ if err == redis.Nil {
+ return "success"
+ }
+
+ // 检查标准库超时错误
+ if os.IsTimeout(err) {
+ return "timeout_error"
+ }
+
+ // 检查 context 超时错误
+ if errors.Is(err, context.DeadlineExceeded) || errors.Is(err, os.ErrDeadlineExceeded) {
+ return "timeout_error"
+ }
+
+ // 检查 net.Error 接口的 Timeout() 方法
+ var netErr net.Error
+ if errors.As(err, &netErr) && netErr.Timeout() {
+ return "timeout_error"
+ }
+
+ // 检查连接相关错误
+ if isRedisConnectionError(err) {
+ return "connection_error"
+ }
+
+ // 检查事务错误
+ if err == redis.TxFailedErr || errors.Is(err, redis.TxFailedErr) {
+ return "transaction_error"
+ }
+
+ errStr := strings.ToLower(err.Error())
+
+ // 检查 Redis 命令错误
+ if isRedisCommandError(errStr) {
+ return "command_error"
+ }
+
+ // 检查事务相关错误
+ if isRedisTransactionError(errStr) {
+ return "transaction_error"
+ }
+
+ // 检查权限错误
+ if isRedisAuthError(errStr) {
+ return "auth_error"
+ }
+
+ // 检查内存不足错误
+ if isRedisOOMError(errStr) {
+ return "oom_error"
+ }
+
+ // 检查集群相关错误
+ if isRedisClusterError(errStr) {
+ return "cluster_error"
+ }
+
+ // 其他错误统一归类
+ return "other_error"
+}
+
+// isRedisConnectionError 判断是否为连接相关错误
+func isRedisConnectionError(err error) bool {
+ // 检查标准库错误
+ if errors.Is(err, redis.ErrClosed) {
+ return true
+ }
+
+ errStr := strings.ToLower(err.Error())
+
+ connectionKeywords := []string{
+ "connection",
+ "connect",
+ "connection refused",
+ "connection reset",
+ "connection lost",
+ "connection closed",
+ "no connection",
+ "broken pipe",
+ "network",
+ "dial tcp",
+ "connection timeout",
+ "i/o error",
+ "use of closed network connection",
+ "connection reset by peer",
+ "no such host",
+ "no route to host",
+ "refused",
+ "closed",
+ "EOF",
+ }
+
+ for _, keyword := range connectionKeywords {
+ if strings.Contains(errStr, keyword) {
+ return true
+ }
+ }
+
+ return false
+}
+
+// isRedisCommandError 判断是否为 Redis 命令错误
+func isRedisCommandError(errStr string) bool {
+ commandKeywords := []string{
+ "wrongtype",
+ "wrong type",
+ "wrong number of arguments",
+ "unknown command",
+ "command not allowed",
+ "invalid argument",
+ "invalid command",
+ "syntax error",
+ "parse error",
+ "protocol error",
+ "ERR", // Redis 错误前缀
+ "WRONGTYPE",
+ "NOSCRIPT", // Lua 脚本不存在
+ "BUSYKEY", // 键正在被其他操作使用
+ }
+
+ for _, keyword := range commandKeywords {
+ if strings.Contains(errStr, keyword) {
+ return true
+ }
+ }
+
+ return false
+}
+
+// isRedisTransactionError 判断是否为事务相关错误
+func isRedisTransactionError(errStr string) bool {
+ transactionKeywords := []string{
+ "transaction",
+ "EXECABORT",
+ "transaction failed",
+ "watch",
+ "CAS", // Compare and Swap
+ }
+
+ for _, keyword := range transactionKeywords {
+ if strings.Contains(errStr, keyword) {
+ return true
+ }
+ }
+
+ return false
+}
+
+// isRedisAuthError 判断是否为权限/认证错误
+func isRedisAuthError(errStr string) bool {
+ authKeywords := []string{
+ "noauth",
+ "authentication required",
+ "invalid password",
+ "auth",
+ "permission denied",
+ "ACL",
+ }
+
+ for _, keyword := range authKeywords {
+ if strings.Contains(errStr, keyword) {
+ return true
+ }
+ }
+
+ return false
+}
+
+// isRedisOOMError 判断是否为内存不足错误
+func isRedisOOMError(errStr string) bool {
+ oomKeywords := []string{
+ "oom",
+ "out of memory",
+ "command not allowed when used memory",
+ "maxmemory",
+ }
+
+ for _, keyword := range oomKeywords {
+ if strings.Contains(errStr, keyword) {
+ return true
+ }
+ }
+
+ return false
+}
+
+// isRedisClusterError 判断是否为集群相关错误
+func isRedisClusterError(errStr string) bool {
+ clusterKeywords := []string{
+ "cluster",
+ "MOVED",
+ "ASK",
+ "CLUSTERDOWN",
+ "TRYAGAIN",
+ "crossslot",
+ "slot",
+ "migrating",
+ "importing",
+ }
+
+ for _, keyword := range clusterKeywords {
+ if strings.Contains(errStr, keyword) {
+ return true
+ }
+ }
+
+ return false
+}
diff --git a/pkg/client/redis/redis.go b/pkg/client/redis/redis.go
index 68bc617..adcf40f 100644
--- a/pkg/client/redis/redis.go
+++ b/pkg/client/redis/redis.go
@@ -26,8 +26,12 @@ func newRedis(cfg *Config) *Redis {
MinIdleConns: cfg.MinIdleConnCnt,
})
- client.AddHook(&Metric{cfg: cfg})
- client.AddHook(&Logger{cfg: cfg})
+ client.AddHook(newMetric(cfg))
+ client.AddHook(newLogger(cfg))
+
+ if err := redisotel.InstrumentTracing(client); err != nil {
+ logger.Panicf("Redis.InstrumentTracing.Error: %s", err)
+ }
if err := redisotel.InstrumentTracing(client); err != nil {
logger.Panicf("Redis.InstrumentTracing.Error: %s", err)
diff --git a/pkg/client/wukong/metric.go b/pkg/client/wukong/metric.go
index c7c9767..b7e09e4 100644
--- a/pkg/client/wukong/metric.go
+++ b/pkg/client/wukong/metric.go
@@ -1,11 +1,15 @@
package wukong
import (
+ "context"
+ "errors"
+ "net"
+ "os"
"strconv"
+ "strings"
"time"
"github.com/boxgo/box/pkg/metric"
- "golang.org/x/net/context"
)
type (
@@ -18,36 +22,39 @@ const (
var (
requestInflight = metric.NewGaugeVec(
- "http_client_request_in_process",
- "http client requesting",
+ "http_client_requests_inflight",
+ "The number of HTTP client requests currently in flight.",
[]string{"method", "baseUrl", "url"},
)
requestCounter = metric.NewCounterVec(
- "http_client_request_total",
- "http client request counter",
- []string{"method", "baseUrl", "url", "statusCode", "error"},
+ "http_client_requests_total",
+ "The total number of HTTP client requests sent.",
+ []string{"method", "baseUrl", "url", "status", "error"},
)
- requestDuration = metric.NewSummaryVec(
- "http_client_request_seconds",
- "http client request duration",
- []string{"method", "baseUrl", "url", "statusCode", "error"},
- map[float64]float64{
- 0.5: 0.05,
- 0.75: 0.05,
- 0.9: 0.01,
- 0.99: 0.001,
- 1: 0.001,
- },
+ requestDuration = metric.NewHistogramVec(
+ "http_client_request_duration_seconds",
+ "The HTTP client request latencies in seconds.",
+ []string{"method", "baseUrl", "url", "status", "error"},
+ // 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s
+ []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
)
)
+// stripQuery removes query parameters and fragment from URL
+func stripQuery(url string) string {
+ if idx := strings.IndexAny(url, "?#"); idx != -1 {
+ return url[:idx]
+ }
+ return url
+}
+
func metricStart(request *Request) error {
if val, ok := request.Context.Value(metricSwitchKey).(bool); ok && !val {
return nil
}
- requestInflight.WithLabelValues(request.Method, request.BaseUrl, request.Url).Inc()
-
+ url := stripQuery(request.Url)
+ requestInflight.WithLabelValues(request.Method, request.BaseUrl, url).Inc()
request.Context = context.WithValue(request.Context, metricDurationKey{}, time.Now())
return nil
@@ -59,22 +66,154 @@ func metricEnd(request *Request, resp *Response) error {
}
var (
- errMsg = ""
- duration = time.Duration(0)
- statusCode = strconv.Itoa(resp.StatusCode())
+ errMsg = ""
+ duration = time.Duration(0)
+ status = strconv.Itoa(resp.StatusCode())
)
if resp.Error() != nil {
- errMsg = resp.Error().Error()
+ errMsg = classifyHTTPError(resp.Error())
}
if start, ok := request.Context.Value(metricDurationKey{}).(time.Time); ok {
duration = time.Since(start)
}
- requestInflight.WithLabelValues(request.Method, request.BaseUrl, request.Url).Dec()
- requestCounter.WithLabelValues(request.Method, request.BaseUrl, request.Url, statusCode, errMsg).Inc()
- requestDuration.WithLabelValues(request.Method, request.BaseUrl, request.Url, statusCode, errMsg).Observe(duration.Seconds())
+ url := stripQuery(request.Url)
+ requestInflight.WithLabelValues(request.Method, request.BaseUrl, url).Dec()
+ requestCounter.WithLabelValues(request.Method, request.BaseUrl, url, status, errMsg).Inc()
+ requestDuration.WithLabelValues(request.Method, request.BaseUrl, url, status, errMsg).Observe(duration.Seconds())
return nil
}
+
+// classifyHTTPError 将 HTTP 客户端错误分类为有限的几个类别,避免指标爆炸
+// 同时尽可能保留有用的错误信息
+// 注意:HTTP 状态码已通过 status 字段上报,此处不再根据状态码分类
+func classifyHTTPError(err error) string {
+ if err == nil {
+ return ""
+ }
+
+ // 检查标准库超时错误
+ if os.IsTimeout(err) {
+ return "timeout_error"
+ }
+
+ // 检查 context 超时错误
+ if errors.Is(err, context.DeadlineExceeded) || errors.Is(err, os.ErrDeadlineExceeded) {
+ return "timeout_error"
+ }
+
+ // 检查 net.Error 接口的 Timeout() 方法
+ var netErr net.Error
+ if errors.As(err, &netErr) {
+ if netErr.Timeout() {
+ return "timeout_error"
+ }
+ // 如果是网络错误但不是超时,归类为连接错误
+ return "connection_error"
+ }
+
+ errStr := strings.ToLower(err.Error())
+
+ // 检查 DNS 相关错误
+ if isDNSError(errStr) {
+ return "dns_error"
+ }
+
+ // 检查 TLS/SSL 相关错误
+ if isTLSError(errStr) {
+ return "tls_error"
+ }
+
+ // 检查连接相关错误
+ if isHTTPConnectionError(errStr) {
+ return "connection_error"
+ }
+
+ // 其他错误统一归类
+ return "other_error"
+}
+
+// isDNSError 判断是否为 DNS 相关错误
+func isDNSError(errStr string) bool {
+ dnsKeywords := []string{
+ "no such host",
+ "no hosts found",
+ "dns",
+ "lookup",
+ "unknown host",
+ "host not found",
+ "name resolution",
+ "getaddrinfo",
+ }
+
+ for _, keyword := range dnsKeywords {
+ if strings.Contains(errStr, keyword) {
+ return true
+ }
+ }
+
+ return false
+}
+
+// isTLSError 判断是否为 TLS/SSL 相关错误
+func isTLSError(errStr string) bool {
+ tlsKeywords := []string{
+ "tls",
+ "ssl",
+ "certificate",
+ "x509",
+ "handshake failure",
+ "bad certificate",
+ "certificate verify failed",
+ "unknown authority",
+ "certificate signed by unknown authority",
+ "tls:",
+ "remote error",
+ }
+
+ for _, keyword := range tlsKeywords {
+ if strings.Contains(errStr, keyword) {
+ return true
+ }
+ }
+
+ return false
+}
+
+// isHTTPConnectionError 判断是否为连接相关错误
+func isHTTPConnectionError(errStr string) bool {
+ connectionKeywords := []string{
+ "connection",
+ "connect",
+ "connection refused",
+ "connection reset",
+ "connection lost",
+ "connection closed",
+ "no connection",
+ "broken pipe",
+ "network",
+ "dial tcp",
+ "connection timeout",
+ "i/o error",
+ "use of closed network connection",
+ "connection reset by peer",
+ "no route to host",
+ "refused",
+ "closed",
+ "EOF",
+ "unreachable",
+ "network is unreachable",
+ }
+
+ for _, keyword := range connectionKeywords {
+ if strings.Contains(errStr, keyword) {
+ return true
+ }
+ }
+
+ return false
+}
+
diff --git a/pkg/client/wukong/metric_bench_test.go b/pkg/client/wukong/metric_bench_test.go
new file mode 100644
index 0000000..cd1a1c1
--- /dev/null
+++ b/pkg/client/wukong/metric_bench_test.go
@@ -0,0 +1,136 @@
+package wukong
+
+import (
+ "context"
+ "errors"
+ "net"
+ "os"
+ "strings"
+ "testing"
+)
+
+// 模拟各种错误类型用于性能测试
+var (
+ testErrors = []struct {
+ name string
+ err error
+ }{
+ {"nil", nil},
+ {"context_deadline", context.DeadlineExceeded},
+ {"os_timeout", &os.SyscallError{Err: os.ErrDeadlineExceeded}},
+ {"net_timeout", &net.OpError{Err: &os.SyscallError{Err: os.ErrDeadlineExceeded}}},
+ {"dns_error", errors.New("no such host: example.com")},
+ {"tls_error", errors.New("tls: handshake failure")},
+ {"connection_error", errors.New("connection refused")},
+ {"other_error", errors.New("some unknown error")},
+ }
+)
+
+// BenchmarkClassifyHTTPError 测试错误分类函数的性能
+func BenchmarkClassifyHTTPError(b *testing.B) {
+ b.Run("success_case", func(b *testing.B) {
+ for i := 0; i < b.N; i++ {
+ classifyHTTPError(nil)
+ }
+ })
+
+ b.Run("timeout_error", func(b *testing.B) {
+ err := context.DeadlineExceeded
+ for i := 0; i < b.N; i++ {
+ classifyHTTPError(err)
+ }
+ })
+
+ b.Run("os_timeout", func(b *testing.B) {
+ err := &os.SyscallError{Err: os.ErrDeadlineExceeded}
+ for i := 0; i < b.N; i++ {
+ classifyHTTPError(err)
+ }
+ })
+
+ b.Run("net_timeout", func(b *testing.B) {
+ err := &net.OpError{Err: &os.SyscallError{Err: os.ErrDeadlineExceeded}}
+ // 需要设置 Timeout() 方法返回 true
+ for i := 0; i < b.N; i++ {
+ classifyHTTPError(err)
+ }
+ })
+
+ b.Run("connection_error", func(b *testing.B) {
+ err := errors.New("connection refused")
+ for i := 0; i < b.N; i++ {
+ classifyHTTPError(err)
+ }
+ })
+
+ b.Run("dns_error", func(b *testing.B) {
+ err := errors.New("no such host: example.com")
+ for i := 0; i < b.N; i++ {
+ classifyHTTPError(err)
+ }
+ })
+
+ b.Run("tls_error", func(b *testing.B) {
+ err := errors.New("tls: handshake failure")
+ for i := 0; i < b.N; i++ {
+ classifyHTTPError(err)
+ }
+ })
+
+ b.Run("other_error", func(b *testing.B) {
+ err := errors.New("some unknown error")
+ for i := 0; i < b.N; i++ {
+ classifyHTTPError(err)
+ }
+ })
+
+ b.Run("mixed_errors", func(b *testing.B) {
+ for i := 0; i < b.N; i++ {
+ testCase := testErrors[i%len(testErrors)]
+ classifyHTTPError(testCase.err)
+ }
+ })
+}
+
+// BenchmarkClassifyHTTPError_Old 测试旧的简单错误处理(作为对比)
+func BenchmarkClassifyHTTPError_Old(b *testing.B) {
+ b.Run("simple_error_check", func(b *testing.B) {
+ err := errors.New("some error")
+ for i := 0; i < b.N; i++ {
+ if err != nil {
+ _ = "error"
+ }
+ }
+ })
+}
+
+// BenchmarkStringOperations 测试字符串操作的开销
+func BenchmarkStringOperations(b *testing.B) {
+ err := errors.New("connection refused: dial tcp 127.0.0.1:8080")
+
+ b.Run("error_string", func(b *testing.B) {
+ for i := 0; i < b.N; i++ {
+ _ = err.Error()
+ }
+ })
+
+ b.Run("to_lower", func(b *testing.B) {
+ errStr := err.Error()
+ for i := 0; i < b.N; i++ {
+ _ = strings.ToLower(errStr)
+ }
+ })
+
+ b.Run("contains_check", func(b *testing.B) {
+ errStr := err.Error()
+ keywords := []string{"connection", "refused", "dial", "tcp"}
+ b.ResetTimer()
+ for i := 0; i < b.N; i++ {
+ for _, keyword := range keywords {
+ if strings.Contains(errStr, keyword) {
+ break
+ }
+ }
+ }
+ })
+}
diff --git a/pkg/client/wukong/metric_test.go b/pkg/client/wukong/metric_test.go
new file mode 100644
index 0000000..b88b42d
--- /dev/null
+++ b/pkg/client/wukong/metric_test.go
@@ -0,0 +1,182 @@
+package wukong
+
+import (
+ "context"
+ "errors"
+ "net"
+ "os"
+ "testing"
+)
+
+func TestClassifyHTTPError(t *testing.T) {
+ tests := []struct {
+ name string
+ err error
+ expected string
+ }{
+ {
+ name: "nil error",
+ err: nil,
+ expected: "success",
+ },
+ {
+ name: "context deadline exceeded",
+ err: context.DeadlineExceeded,
+ expected: "timeout_error",
+ },
+ {
+ name: "os deadline exceeded",
+ err: os.ErrDeadlineExceeded,
+ expected: "timeout_error",
+ },
+ {
+ name: "os timeout",
+ err: &os.SyscallError{Err: os.ErrDeadlineExceeded},
+ expected: "timeout_error",
+ },
+ {
+ name: "net timeout error",
+ err: &net.OpError{
+ Op: "dial",
+ Err: &os.SyscallError{Err: os.ErrDeadlineExceeded},
+ },
+ expected: "timeout_error",
+ },
+ {
+ name: "dns error",
+ err: errors.New("no such host: example.com"),
+ expected: "dns_error",
+ },
+ {
+ name: "dns lookup error",
+ err: errors.New("lookup example.com: no such host"),
+ expected: "dns_error",
+ },
+ {
+ name: "tls error",
+ err: errors.New("tls: handshake failure"),
+ expected: "tls_error",
+ },
+ {
+ name: "tls certificate error",
+ err: errors.New("x509: certificate verify failed"),
+ expected: "tls_error",
+ },
+ {
+ name: "connection refused",
+ err: errors.New("connection refused"),
+ expected: "connection_error",
+ },
+ {
+ name: "connection reset",
+ err: errors.New("connection reset by peer"),
+ expected: "connection_error",
+ },
+ {
+ name: "network error",
+ err: errors.New("dial tcp 127.0.0.1:8080: connect: connection refused"),
+ expected: "connection_error",
+ },
+ {
+ name: "EOF error",
+ err: errors.New("EOF"),
+ expected: "connection_error",
+ },
+ {
+ name: "net error without timeout",
+ err: &net.OpError{Op: "dial", Err: errors.New("connection refused")},
+ expected: "connection_error",
+ },
+ {
+ name: "other error",
+ err: errors.New("some unknown error"),
+ expected: "other_error",
+ },
+ {
+ name: "http error message",
+ err: errors.New("bad request"),
+ expected: "other_error",
+ },
+ }
+
+ for _, tt := range tests {
+ t.Run(tt.name, func(t *testing.T) {
+ result := classifyHTTPError(tt.err)
+ if result != tt.expected {
+ t.Errorf("classifyHTTPError(%v) = %q, want %q", tt.err, result, tt.expected)
+ }
+ })
+ }
+}
+
+func TestIsDNSError(t *testing.T) {
+ tests := []struct {
+ name string
+ errStr string
+ expected bool
+ }{
+ {"no such host", "no such host: example.com", true},
+ {"lookup error", "lookup example.com: no such host", true},
+ {"unknown host", "unknown host", true},
+ {"host not found", "host not found", true},
+ {"not dns error", "connection refused", false},
+ {"empty string", "", false},
+ }
+
+ for _, tt := range tests {
+ t.Run(tt.name, func(t *testing.T) {
+ result := isDNSError(tt.errStr)
+ if result != tt.expected {
+ t.Errorf("isDNSError(%q) = %v, want %v", tt.errStr, result, tt.expected)
+ }
+ })
+ }
+}
+
+func TestIsTLSError(t *testing.T) {
+ tests := []struct {
+ name string
+ errStr string
+ expected bool
+ }{
+ {"tls handshake", "tls: handshake failure", true},
+ {"certificate error", "x509: certificate verify failed", true},
+ {"ssl error", "ssl handshake failure", true},
+ {"not tls error", "connection refused", false},
+ {"empty string", "", false},
+ }
+
+ for _, tt := range tests {
+ t.Run(tt.name, func(t *testing.T) {
+ result := isTLSError(tt.errStr)
+ if result != tt.expected {
+ t.Errorf("isTLSError(%q) = %v, want %v", tt.errStr, result, tt.expected)
+ }
+ })
+ }
+}
+
+func TestIsHTTPConnectionError(t *testing.T) {
+ tests := []struct {
+ name string
+ errStr string
+ expected bool
+ }{
+ {"connection refused", "connection refused", true},
+ {"connection reset", "connection reset by peer", true},
+ {"dial tcp", "dial tcp 127.0.0.1:8080: connect: connection refused", true},
+ {"EOF", "EOF", true},
+ {"network unreachable", "network is unreachable", true},
+ {"not connection error", "dns lookup failed", false},
+ {"empty string", "", false},
+ }
+
+ for _, tt := range tests {
+ t.Run(tt.name, func(t *testing.T) {
+ result := isHTTPConnectionError(tt.errStr)
+ if result != tt.expected {
+ t.Errorf("isHTTPConnectionError(%q) = %v, want %v", tt.errStr, result, tt.expected)
+ }
+ })
+ }
+}
diff --git a/pkg/metric/metric.go b/pkg/metric/metric.go
index b74a970..30d96e8 100644
--- a/pkg/metric/metric.go
+++ b/pkg/metric/metric.go
@@ -52,8 +52,9 @@ func (m *Metric) Serve(context.Context) error {
defer ticker.Stop()
pusher := push.
- New(m.cfg.PushTargetURL, config.ServiceName()+"-"+config.ServiceVersion()).
+ New(m.cfg.PushTargetURL, config.ServiceName()).
Gatherer(prometheus.DefaultRegisterer.(prometheus.Gatherer)).
+ Grouping("namespace", config.ServiceNamespace()).
Grouping("instance", system.Hostname())
for {
diff --git a/pkg/schedule/schedule.go b/pkg/schedule/schedule.go
index 34bca12..53cc744 100644
--- a/pkg/schedule/schedule.go
+++ b/pkg/schedule/schedule.go
@@ -48,9 +48,16 @@ type (
var (
scheduleCounter = metric.NewCounterVec(
- "schedule_total",
- "schedule counter",
- []string{"task", "error", "panic"},
+ "schedule_jobs_total",
+ "The total number of scheduled jobs executed.",
+ []string{"task", "result"},
+ )
+ scheduleDuration = metric.NewHistogramVec(
+ "schedule_job_duration_seconds",
+ "The duration of scheduled jobs.",
+ []string{"task", "result"},
+ // 1s, 2.5s, 5s, 10s, 30s, 60s, 5m, 10m, 30m, 1h
+ []float64{1, 2.5, 5, 10, 30, 60, 300, 600, 1800, 3600},
)
)
@@ -208,24 +215,28 @@ func (sch *Schedule) exec(handler Handler) {
defer func() {
journal.EndTime = time.Now()
journal.Panic = recover()
+ duration := journal.EndTime.Sub(journal.StartTime).Seconds()
+ result := "success"
if journal.Panic != nil {
+ result = "panic"
logger.Trace(ctx).Errorf("Schedule crash: %+v\n%s", journal.Panic, debug.Stack())
- scheduleCounter.WithLabelValues(sch.key(), "", fmt.Sprintf("%s", journal.Panic)).Inc()
+ } else if journal.Error != nil {
+ result = "error"
+ logger.Trace(ctx).Errorf("Schedule run error: [%s]", journal.Error)
+ } else {
+ logger.Trace(ctx).Infof("Schedule run success")
}
+ scheduleCounter.WithLabelValues(sch.key(), result).Inc()
+ scheduleDuration.WithLabelValues(sch.key(), result).Observe(duration)
+
sch.recorder(journal)
}()
logger.Trace(ctx).Infof("Schedule run start")
- if journal.Error = handler(ctx); journal.Error != nil {
- logger.Trace(ctx).Errorf("Schedule run error: [%s]", journal.Error)
- scheduleCounter.WithLabelValues(sch.key(), journal.Error.Error(), "").Inc()
- } else {
- logger.Trace(ctx).Infof("Schedule run success")
- scheduleCounter.WithLabelValues(sch.key(), "", "").Inc()
- }
+ journal.Error = handler(ctx)
}()
}
diff --git a/pkg/server/ginserver/mid/ginprom/ginprom.go b/pkg/server/ginserver/mid/ginprom/ginprom.go
index 7567b14..ad21696 100644
--- a/pkg/server/ginserver/mid/ginprom/ginprom.go
+++ b/pkg/server/ginserver/mid/ginprom/ginprom.go
@@ -10,70 +10,59 @@ import (
type (
GinProm struct {
- cfg *Config
- processingGauge *metric.GaugeVec
- reqSizeSummary *metric.SummaryVec
- reqBeginCounter *metric.CounterVec
- reqFinishCounter *metric.CounterVec
- reqDurationSummary *metric.SummaryVec
- resSizeSummary *metric.SummaryVec
+ cfg *Config
}
)
+var (
+ // Saturation: 饱和度 (Requests Inflight)
+ // 衡量服务当前的忙碌程度,通常使用正在处理的请求数来表示。
+ reqInFlight = metric.NewGaugeVec(
+ "http_server_requests_inflight",
+ "The number of HTTP requests currently being processed.",
+ []string{"method", "url"},
+ )
+
+ // Traffic: 流量 (Request Rate & Size)
+ // 衡量服务的吞吐量,通常使用每秒请求数 (QPS) 或带宽 (IOPS) 来表示。
+ // 这里包含了请求总数(reqTotal)、请求包大小(reqSize)和响应包大小(resSize)。
+ // Errors: 错误 (Error Rate)
+ // 衡量请求失败的比例。
+ // 通过 reqTotal 指标中的 status 和 errcode 标签来计算错误率。
+ reqTotal = metric.NewCounterVec(
+ "http_server_requests_total",
+ "The total number of HTTP requests processed.",
+ []string{"method", "url", "status", "errcode"},
+ )
+ reqSize = metric.NewHistogramVec(
+ "http_server_request_size_bytes",
+ "The HTTP request body sizes in bytes.",
+ []string{"method", "url"},
+ // 1KB, 5KB, 10KB, 100KB, 1MB, 10MB
+ []float64{1024, 5120, 10240, 102400, 1048576, 10485760},
+ )
+ resSize = metric.NewHistogramVec(
+ "http_server_response_size_bytes",
+ "The HTTP response body sizes in bytes.",
+ []string{"method", "url", "status", "errcode"},
+ // 1KB, 5KB, 10KB, 100KB, 1MB, 10MB
+ []float64{1024, 5120, 10240, 102400, 1048576, 10485760},
+ )
+
+ // Latency: 延迟 (Request Duration)
+ // 衡量服务处理请求所需的时间。
+ reqDuration = metric.NewHistogramVec(
+ "http_server_request_duration_seconds",
+ "The HTTP request latencies in seconds.",
+ []string{"method", "url", "status", "errcode"},
+ // 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s
+ []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
+ )
+)
+
func newGinProm(c *Config) *GinProm {
return &GinProm{
cfg: c,
- processingGauge: metric.NewGaugeVec(
- "http_server_processing_request",
- "http server processing request",
- []string{"method", "url"},
- ),
- reqSizeSummary: metric.NewSummaryVec(
- "http_server_request_size_bytes",
- "The HTTP request sizes in bytes.",
- []string{"method", "url"},
- map[float64]float64{
- 0.5: 0.05,
- 0.75: 0.05,
- 0.9: 0.01,
- 0.99: 0.001,
- 1: 0.001,
- },
- ),
- reqBeginCounter: metric.NewCounterVec(
- "http_server_request_begin_total",
- "How many HTTP requests ready to process.",
- []string{"method", "url"},
- ),
- reqFinishCounter: metric.NewCounterVec(
- "http_server_request_finish_total",
- "How many HTTP requests processed.",
- []string{"method", "url", "status", "errcode"},
- ),
- reqDurationSummary: metric.NewSummaryVec(
- "http_server_request_duration_seconds",
- "The HTTP request latencies in seconds.",
- []string{"method", "url", "status", "errcode"},
- map[float64]float64{
- 0.5: 0.05,
- 0.75: 0.05,
- 0.9: 0.01,
- 0.99: 0.001,
- 1: 0.001,
- },
- ),
- resSizeSummary: metric.NewSummaryVec(
- "http_server_response_size_bytes",
- "The HTTP response sizes in bytes.",
- []string{"method", "url", "status", "errcode"},
- map[float64]float64{
- 0.5: 0.05,
- 0.75: 0.05,
- 0.9: 0.01,
- 0.99: 0.001,
- 1: 0.001,
- },
- ),
}
}
@@ -85,13 +74,13 @@ func (prom *GinProm) Handler() gin.HandlerFunc {
prom.cfg.requestURLMappingFn(ctx),
}
- reqSz := computeApproximateRequestSize(ctx.Request)
+ // Saturation: +1
+ reqInFlight.WithLabelValues(labels...).Inc()
+ defer reqInFlight.WithLabelValues(labels...).Dec()
- prom.processingGauge.WithLabelValues(labels...).Inc()
- prom.reqSizeSummary.WithLabelValues(labels...).Observe(reqSz)
- prom.reqBeginCounter.WithLabelValues(labels...).Inc()
-
- defer prom.processingGauge.WithLabelValues(labels...).Dec()
+ // Traffic: Request Size
+ reqSz := computeApproximateRequestSize(ctx.Request)
+ reqSize.WithLabelValues(labels...).Observe(reqSz)
ctx.Next()
@@ -101,8 +90,12 @@ func (prom *GinProm) Handler() gin.HandlerFunc {
}
labels = append(labels, strconv.Itoa(ctx.Writer.Status()), strconv.Itoa(ctx.GetInt("errcode")))
- prom.resSizeSummary.WithLabelValues(labels...).Observe(float64(resSz))
- prom.reqFinishCounter.WithLabelValues(labels...).Inc()
- prom.reqDurationSummary.WithLabelValues(labels...).Observe(time.Since(start).Seconds())
+
+ // Traffic: Response Size & Total Count (implies Errors via labels)
+ resSize.WithLabelValues(labels...).Observe(float64(resSz))
+ reqTotal.WithLabelValues(labels...).Inc()
+
+ // Latency
+ reqDuration.WithLabelValues(labels...).Observe(time.Since(start).Seconds())
}
}
diff --git a/pkg/server/grpcserver/interceptor/metric/metric.go b/pkg/server/grpcserver/interceptor/metric/metric.go
index ea3bb8f..d25425b 100644
--- a/pkg/server/grpcserver/interceptor/metric/metric.go
+++ b/pkg/server/grpcserver/interceptor/metric/metric.go
@@ -11,32 +11,41 @@ import (
)
var (
- handledCounter = metric.NewCounterVec(
- "grpc_server_handled_total",
- "gGPC server handle msg count",
+ // Saturation
+ reqInflight = metric.NewGaugeVec(
+ "grpc_server_requests_inflight",
+ "The number of gRPC requests currently being processed.",
+ []string{"method", "type"},
+ )
+
+ // Traffic & Errors
+ reqTotal = metric.NewCounterVec(
+ "grpc_server_requests_total",
+ "The total number of gRPC requests processed.",
[]string{"method", "type", "code"},
)
- handledSeconds = metric.NewSummaryVec(
- "grpc_server_handled_second",
- "gGPC server handle msg duration",
+
+ // Latency
+ reqDuration = metric.NewHistogramVec(
+ "grpc_server_request_duration_seconds",
+ "The gRPC request latencies in seconds.",
[]string{"method", "type", "code"},
- map[float64]float64{
- 0.5: 0.05,
- 0.75: 0.05,
- 0.9: 0.01,
- 0.99: 0.001,
- 1: 0.001,
- },
+ // .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10
+ []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
)
)
func UnaryServerInterceptor() grpc.UnaryServerInterceptor {
return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
start := time.Now()
+ typ := "unary"
+
+ reqInflight.WithLabelValues(info.FullMethod, typ).Inc()
+ defer reqInflight.WithLabelValues(info.FullMethod, typ).Dec()
resp, err := handler(ctx, req)
- report(info.FullMethod, "unary", start, err)
+ report(info.FullMethod, typ, start, err)
return resp, err
}
@@ -45,17 +54,22 @@ func UnaryServerInterceptor() grpc.UnaryServerInterceptor {
func StreamServerInterceptor() grpc.StreamServerInterceptor {
return func(srv interface{}, ss grpc.ServerStream, info *grpc.StreamServerInfo, handler grpc.StreamHandler) error {
start := time.Now()
-
- err := handler(srv, ss)
-
+ typ := "stream"
if info.IsClientStream && info.IsServerStream {
- report(info.FullMethod, "stream_bidi", start, err)
+ typ = "stream_bidi"
} else if info.IsClientStream {
- report(info.FullMethod, "stream_client", start, err)
+ typ = "stream_client"
} else if info.IsServerStream {
- report(info.FullMethod, "stream_server", start, err)
+ typ = "stream_server"
}
+ reqInflight.WithLabelValues(info.FullMethod, typ).Inc()
+ defer reqInflight.WithLabelValues(info.FullMethod, typ).Dec()
+
+ err := handler(srv, ss)
+
+ report(info.FullMethod, typ, start, err)
+
return err
}
}
@@ -69,6 +83,6 @@ func report(method, typ string, start time.Time, err error) {
labels = []string{method, typ, "0"}
}
- handledCounter.WithLabelValues(labels...).Inc()
- handledSeconds.WithLabelValues(labels...).Observe(time.Since(start).Seconds())
+ reqTotal.WithLabelValues(labels...).Inc()
+ reqDuration.WithLabelValues(labels...).Observe(time.Since(start).Seconds())
}
diff --git a/pkg/server/grpcserver/interceptor/recovery/recovery.go b/pkg/server/grpcserver/interceptor/recovery/recovery.go
index 67f2b96..cdcd83a 100644
--- a/pkg/server/grpcserver/interceptor/recovery/recovery.go
+++ b/pkg/server/grpcserver/interceptor/recovery/recovery.go
@@ -12,9 +12,9 @@ import (
var (
panicCounter = metric.NewCounterVec(
- "grpc_server_panic_total",
- "grpc server panic counter",
- []string{"method", "panic"},
+ "grpc_server_panics_total",
+ "The total number of gRPC server panics.",
+ []string{"method"},
)
)
@@ -27,7 +27,7 @@ func UnaryServerInterceptor() grpc.UnaryServerInterceptor {
logger.Errorw("grpc unary server panic:", "panicked", panicked, "panic", panicErr)
err = errcode.ErrGRPCServerPanic.Build(panicErr)
- panicCounter.WithLabelValues(info.FullMethod, fmt.Sprintf("%s", panicErr)).Inc()
+ panicCounter.WithLabelValues(info.FullMethod).Inc()
}
}()
@@ -47,7 +47,7 @@ func StreamServerInterceptor() grpc.StreamServerInterceptor {
logger.Errorw("grpc stream server panic:", "panicked", panicked, "panic", panicErr)
err = errcode.ErrGRPCServerPanic.Build(panicErr)
- panicCounter.WithLabelValues(info.FullMethod, fmt.Sprintf("%s", panicErr)).Inc()
+ panicCounter.WithLabelValues(info.FullMethod).Inc()
}
}()
diff --git a/pkg/trace/config.go b/pkg/trace/config.go
index beda781..f0b22ef 100644
--- a/pkg/trace/config.go
+++ b/pkg/trace/config.go
@@ -33,7 +33,7 @@ func StdConfig() *Config {
func DefaultConfig() *Config {
return &Config{
TraceUID: "box.trace.uid",
- TraceReqID: "box.trace.reqId",
+ TraceReqID: "X-Request-Id",
TraceSpanID: "box.trace.spanId",
TraceBizID: "box.trace.bizId",
}
diff --git a/scripts/README.md b/scripts/README.md
new file mode 100644
index 0000000..35657b7
--- /dev/null
+++ b/scripts/README.md
@@ -0,0 +1,453 @@
+# Alert Rules Generator
+
+自动生成针对特定 namespace 和 job 的 Prometheus 告警规则。
+
+## 功能特性
+
+- 📦 基于模板自动生成定制化告警规则
+- 🎯 支持按 namespace 和 job 过滤
+- 🔧 Shell 脚本实现,无需额外依赖
+- ✅ 自动添加标签过滤器到所有 PromQL 表达式
+- 🔍 内置验证功能,自动检查生成的规则
+
+## 使用方法
+
+```bash
+# 基本用法(自动验证)
+./scripts/generate_alerts.sh
+
+# 指定输出文件
+./scripts/generate_alerts.sh
+
+# 跳过验证
+./scripts/generate_alerts.sh --no-verify
+```
+
+**示例:**
+
+```bash
+# 为 prod 命名空间的 api-service 生成告警规则(自动验证)
+./scripts/generate_alerts.sh prod api-service
+
+# 输出: docs/prod_api-service_alerts.yaml
+
+# 自定义输出路径
+./scripts/generate_alerts.sh prod api-service alerts/production/api.yaml
+
+# 快速生成,跳过验证
+./scripts/generate_alerts.sh prod api-service --no-verify
+```
+
+## 参数说明
+
+| 参数 | 说明 | 必需 |
+|------|------|------|
+| `namespace` | Kubernetes 命名空间 | 是 |
+| `job` | 服务名称(Job) | 是 |
+| `output_file` | 输出文件路径 | 否,默认: `docs/${namespace}_${job}_alerts.yaml` |
+| `--no-verify` | 跳过自动验证 | 否 |
+| 模板文件 | 告警规则模板 | - (固定: `docs/prometheus_alerts_template.yaml`) |
+
+
+## 生成的文件内容
+
+生成的告警规则文件会:
+
+1. **添加标签过滤器**:所有 PromQL 表达式都会添加 `namespace` 和 `job` 过滤条件
+2. **保留原有结构**:保持原模板的告警组织结构
+3. **添加生成信息**:文件头部包含生成参数和时间
+
+**示例对比:**
+
+**原模板:**
+```yaml
+- alert: HttpServerQpsHigh
+ expr: sum by (namespace, job) (rate(http_server_requests_total[1m])) > 1000
+```
+
+**生成后(namespace=prod, job=api-service):**
+```yaml
+- alert: HttpServerQpsHigh
+ expr: sum by (namespace, job) (rate(http_server_requests_total{namespace="prod",job="api-service"}[1m])) > 1000
+```
+
+## 部署到 Prometheus
+
+生成告警规则后,有以下几种部署方式:
+
+### 1. Kubernetes ConfigMap 方式
+
+```bash
+# 创建 ConfigMap
+kubectl create configmap prod-api-alerts \
+ --from-file=docs/prod_api-service_alerts.yaml \
+ -n monitoring
+
+# 在 Prometheus 配置中引用
+# prometheus.yml:
+# rule_files:
+# - '/etc/prometheus/rules/prod_api-service_alerts.yaml'
+```
+
+### 2. 直接文件挂载
+
+```yaml
+# prometheus-deployment.yaml
+volumeMounts:
+ - name: alert-rules
+ mountPath: /etc/prometheus/rules
+volumes:
+ - name: alert-rules
+ configMap:
+ name: prod-api-alerts
+```
+
+### 3. Prometheus Operator 方式
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+ name: prod-api-alerts
+ namespace: monitoring
+spec:
+ groups:
+ - name: box-http-server-alerts
+ interval: 30s
+ rules:
+ # 粘贴生成的告警规则
+```
+
+## 目录结构
+
+```
+scripts/
+├── README.md # 本文档
+└── generate_alerts.sh # 告警生成和验证脚本
+
+docs/
+├── prometheus_alerts_template.yaml # 告警规则模板
+├── prod_api-service_alerts.yaml # 生成的告警文件示例
+└── ...
+```
+
+## 支持的指标
+
+脚本会自动为以下所有指标添加 `namespace` 和 `job` 过滤器:
+
+| 组件 | 指标 |
+|------|------|
+| **HTTP Server** | `http_server_requests_total` |
+| | `http_server_request_duration_seconds_bucket` |
+| | `http_server_request_duration_seconds_count` |
+| **HTTP Client** | `http_client_requests_total` |
+| | `http_client_requests_inflight` |
+| | `http_client_request_duration_seconds_bucket` |
+| **gRPC Server** | `grpc_server_requests_total` |
+| | `grpc_server_panics_total` |
+| **Database** | `db_client_request_duration_seconds_count` |
+| | `db_client_request_duration_seconds_bucket` |
+| | `db_client_connections_in_use` |
+| | `db_client_connections_max_open` |
+| **Redis** | `redis_client_requests_total` |
+| | `redis_client_request_duration_seconds_bucket` |
+| **MongoDB** | `mongo_client_requests_total` |
+| | `mongo_client_request_duration_seconds_bucket` |
+| **Schedule** | `schedule_jobs_total` |
+| **Go Runtime** | `go_goroutines`, `go_threads` |
+| | `go_memstats_sys_bytes`, `go_memstats_heap_alloc_bytes` |
+| | `go_gc_duration_seconds`, `go_gc_duration_seconds_count` |
+| | `go_memstats_gc_cpu_fraction` |
+
+## 使用场景
+
+### 场景 1: 多环境部署
+
+为不同环境生成独立的告警规则:
+
+```bash
+./scripts/generate_alerts.sh prod api-service
+./scripts/generate_alerts.sh staging api-service
+./scripts/generate_alerts.sh dev api-service
+```
+
+### 场景 2: 微服务架构
+
+为每个微服务生成专属告警:
+
+```bash
+for service in user-service order-service payment-service; do
+ ./scripts/generate_alerts.sh prod $service
+done
+```
+
+### 场景 3: 批量生成脚本
+
+创建批量生成脚本:
+
+```bash
+#!/bin/bash
+# batch_generate.sh
+
+NAMESPACE="prod"
+SERVICES=(
+ "api-service"
+ "user-service"
+ "order-service"
+ "payment-service"
+)
+
+for service in "${SERVICES[@]}"; do
+ echo "Generating alerts for $service..."
+ if ./scripts/generate_alerts.sh "$NAMESPACE" "$service"; then
+ echo "✓ $service - OK"
+ else
+ echo "✗ $service - FAILED"
+ exit 1
+ fi
+done
+
+echo "All services processed!"
+```
+
+### 场景 4: CI/CD 集成
+
+#### GitLab CI
+
+```yaml
+generate-alerts:
+ stage: build
+ script:
+ - ./scripts/generate_alerts.sh ${CI_ENVIRONMENT_NAME} ${SERVICE_NAME}
+ artifacts:
+ paths:
+ - docs/*_alerts.yaml
+```
+
+#### GitHub Actions
+
+```yaml
+- name: Generate alerts
+ run: |
+ ./scripts/generate_alerts.sh ${{ vars.NAMESPACE }} ${{ vars.SERVICE }}
+```
+
+## 脚本输出说明
+
+### 标准输出(带验证)
+
+```bash
+$ ./scripts/generate_alerts.sh prod api-service
+
+╔════════════════════════════════════════════════════════════╗
+║ Prometheus Alert Rules Generator & Verifier ║
+╚════════════════════════════════════════════════════════════╝
+
+📋 Configuration:
+ Namespace: prod
+ Job: api-service
+ Template: docs/prometheus_alerts_template.yaml
+ Output: docs/prod_api-service_alerts.yaml
+ Verify: Enabled
+
+🔨 Step 1: Generating alert rules...
+✓ Alert rules generated successfully!
+
+🔍 Step 2: Verifying alert rules...
+✓ All metrics are correctly filtered!
+
+╔════════════════════════════════════════════════════════════╗
+║ Summary ║
+╚════════════════════════════════════════════════════════════╝
+
+✓ Generation completed
+✓ Verification passed
+
+📄 Output file: docs/prod_api-service_alerts.yaml
+
+📝 Next steps:
+ 1. Review the generated file
+ 2. Validate with promtool
+ 3. Deploy to Kubernetes
+```
+
+### 快速模式输出(跳过验证)
+
+```bash
+$ ./scripts/generate_alerts.sh prod api-service --no-verify
+
+╔════════════════════════════════════════════════════════════╗
+║ Prometheus Alert Rules Generator & Verifier ║
+╚════════════════════════════════════════════════════════════╝
+
+📋 Configuration:
+ Namespace: prod
+ Job: api-service
+ Verify: Disabled
+
+🔨 Step 1: Generating alert rules...
+✓ Alert rules generated successfully!
+
+⚠ Verification skipped (--no-verify flag)
+
+✓ Generation completed
+```
+
+## 验证生成的规则
+
+### 自动验证(默认启用)
+
+脚本会自动验证生成的告警规则,确保所有指标都正确添加了过滤器。
+
+如需跳过验证(快速生成):
+
+```bash
+./scripts/generate_alerts.sh prod api-service --no-verify
+```
+
+### 使用 promtool 验证语法
+
+```bash
+# 使用 promtool 验证语法
+promtool check rules docs/prod_api-service_alerts.yaml
+
+# 或使用 Docker
+docker run --rm -v $(pwd):/workspace prom/prometheus:latest \
+ promtool check rules /workspace/docs/prod_api-service_alerts.yaml
+```
+
+### 完整的生成和部署流程
+
+```bash
+# 1. 生成和验证告警规则(自动验证)
+./scripts/generate_alerts.sh prod api-service
+
+# 2. (可选)使用 promtool 验证语法
+promtool check rules docs/prod_api-service_alerts.yaml
+
+# 3. 部署
+kubectl create configmap prod-api-alerts \
+ --from-file=docs/prod_api-service_alerts.yaml \
+ -n monitoring
+```
+
+## 常见问题
+
+### Q1: 如何确保所有指标都添加了过滤器?
+
+脚本默认会自动验证生成的规则:
+
+```bash
+./scripts/generate_alerts.sh prod api-service
+# 会自动验证所有指标
+```
+
+如果有指标遗漏过滤器,脚本会明确指出并返回错误。
+
+### Q2: 生成的规则不生效?
+
+检查以下几点:
+1. Prometheus 配置中是否正确引用了规则文件
+2. 规则文件的 YAML 格式是否正确
+3. Prometheus 是否成功重载了配置(查看日志)
+
+```bash
+# 重载 Prometheus 配置
+curl -X POST http://prometheus:9090/-/reload
+```
+
+### Q3: 如何修改告警阈值?
+
+两种方式:
+1. 修改模板文件 `docs/prometheus_alerts_template.yaml`,然后重新生成
+2. 直接编辑生成的文件(不推荐,因为会在下次生成时被覆盖)
+
+### Q4: 支持批量生成吗?
+
+是的,可以通过循环实现:
+
+```bash
+# 为多个服务批量生成
+for service in api-service user-service order-service; do
+ ./scripts/generate_alerts.sh prod $service
+done
+```
+
+## 快速参考
+
+```bash
+# 基本用法
+./scripts/generate_alerts.sh
+
+# 自定义输出
+./scripts/generate_alerts.sh
+
+# 快速模式(跳过验证)
+./scripts/generate_alerts.sh --no-verify
+
+# 批量生成
+for svc in api user order; do
+ ./scripts/generate_alerts.sh prod ${svc}-service
+done
+
+# 验证语法
+promtool check rules docs/prod_api-service_alerts.yaml
+
+# 部署
+kubectl create configmap \
+ --from-file= \
+ -n monitoring
+```
+
+## 故障排查
+
+### 问题:验证失败
+
+```bash
+✗ Found unfiltered metric: http_server_requests_total
+```
+
+**解决方案:**
+1. 检查模板文件是否被修改
+2. 重新生成文件
+3. 如果问题持续,联系维护者
+
+### 问题:告警未触发
+
+**检查步骤:**
+
+```bash
+# 1. 检查规则是否加载
+curl http://prometheus:9090/api/v1/rules | jq
+
+# 2. 检查指标是否存在
+curl 'http://prometheus:9090/api/v1/query?query=http_server_requests_total{namespace="prod",job="api-service"}' | jq
+
+# 3. 重载 Prometheus
+curl -X POST http://prometheus:9090/-/reload
+```
+
+## 最佳实践
+
+1. **默认使用验证**:生产环境务必验证
+2. **版本控制**:将生成的文件提交到 Git
+3. **定期更新**:模板更新后重新生成所有文件
+4. **命名规范**:使用 `__alerts.yaml` 格式
+5. **测试先行**:在测试环境验证后再部署生产
+
+## 贡献
+
+如需改进脚本或添加新功能,请:
+
+1. Fork 项目
+2. 创建功能分支
+3. 提交 Pull Request
+
+## 相关文档
+
+- [prometheus_alerts_template.yaml](../docs/prometheus_alerts_template.yaml) - 告警规则模板
+- [metric.md](../docs/metric.md) - 指标和看板文档
+
+## 许可
+
+与主项目保持一致。
diff --git a/scripts/generate_alerts.sh b/scripts/generate_alerts.sh
new file mode 100755
index 0000000..39cb297
--- /dev/null
+++ b/scripts/generate_alerts.sh
@@ -0,0 +1,266 @@
+#!/bin/bash
+#
+# Generate and verify Prometheus alert rules for specific namespace and job
+#
+# Usage:
+# ./scripts/generate_alerts.sh [output_file] [--no-verify]
+#
+# Examples:
+# ./scripts/generate_alerts.sh prod api-service
+# ./scripts/generate_alerts.sh prod api-service alerts/prod_api.yaml
+# ./scripts/generate_alerts.sh prod api-service --no-verify
+#
+
+set -e
+
+# Colors for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+BLUE='\033[0;34m'
+NC='\033[0m' # No Color
+
+# Check arguments
+if [ $# -lt 2 ]; then
+ echo -e "${RED}Error: Missing required arguments${NC}"
+ echo ""
+ echo "Usage: $0 [output_file] [--no-verify]"
+ echo ""
+ echo "Examples:"
+ echo " $0 prod api-service"
+ echo " $0 prod api-service alerts/custom.yaml"
+ echo " $0 prod api-service --no-verify"
+ echo ""
+ exit 1
+fi
+
+NAMESPACE="$1"
+JOB="$2"
+TEMPLATE="docs/prometheus_alerts_template.yaml"
+OUTPUT=""
+SKIP_VERIFY=false
+
+# Parse arguments
+shift 2
+while [ $# -gt 0 ]; do
+ case "$1" in
+ --no-verify)
+ SKIP_VERIFY=true
+ ;;
+ *)
+ OUTPUT="$1"
+ ;;
+ esac
+ shift
+done
+
+# Set default output if not specified
+if [ -z "$OUTPUT" ]; then
+ OUTPUT="docs/${NAMESPACE}_${JOB}_alerts.yaml"
+fi
+
+# Check if template exists
+if [ ! -f "$TEMPLATE" ]; then
+ echo -e "${RED}Error: Template file not found: $TEMPLATE${NC}"
+ exit 1
+fi
+
+echo -e "${BLUE}╔════════════════════════════════════════════════════════════╗${NC}"
+echo -e "${BLUE}║ Prometheus Alert Rules Generator & Verifier ║${NC}"
+echo -e "${BLUE}╚════════════════════════════════════════════════════════════╝${NC}"
+echo ""
+echo -e "${YELLOW}📋 Configuration:${NC}"
+echo " Namespace: $NAMESPACE"
+echo " Job: $JOB"
+echo " Template: $TEMPLATE"
+echo " Output: $OUTPUT"
+echo " Verify: $([ "$SKIP_VERIFY" = true ] && echo "Disabled" || echo "Enabled")"
+echo ""
+
+# ============================================================
+# STEP 1: Generate Alert Rules
+# ============================================================
+echo -e "${YELLOW}🔨 Step 1: Generating alert rules...${NC}"
+
+# Create output directory if it doesn't exist
+mkdir -p "$(dirname "$OUTPUT")"
+
+# Generate header
+cat > "$OUTPUT" << EOF
+# Prometheus Alert Rules
+# Generated for namespace: ${NAMESPACE}, job: ${JOB}
+#
+# This file is auto-generated. Do not edit manually.
+# To regenerate, run: ./scripts/generate_alerts.sh ${NAMESPACE} ${JOB}
+
+EOF
+
+# Process the template file
+# Add namespace and job filters to all metric queries
+sed -E \
+ -e "s/http_server_requests_total\{/http_server_requests_total{namespace=\"${NAMESPACE}\",job=\"${JOB}\",/g" \
+ -e "s/http_server_requests_total\[/http_server_requests_total{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}[/g" \
+ -e "s/http_server_request_duration_seconds_bucket\{/http_server_request_duration_seconds_bucket{namespace=\"${NAMESPACE}\",job=\"${JOB}\",/g" \
+ -e "s/http_server_request_duration_seconds_bucket\[/http_server_request_duration_seconds_bucket{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}[/g" \
+ -e "s/http_server_request_duration_seconds_count\{/http_server_request_duration_seconds_count{namespace=\"${NAMESPACE}\",job=\"${JOB}\",/g" \
+ -e "s/http_server_request_duration_seconds_count\[/http_server_request_duration_seconds_count{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}[/g" \
+ -e "s/http_client_requests_total\{/http_client_requests_total{namespace=\"${NAMESPACE}\",job=\"${JOB}\",/g" \
+ -e "s/http_client_requests_total\[/http_client_requests_total{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}[/g" \
+ -e "s/http_client_requests_inflight([^{])/http_client_requests_inflight{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}\1/g" \
+ -e "s/http_client_request_duration_seconds_bucket\{/http_client_request_duration_seconds_bucket{namespace=\"${NAMESPACE}\",job=\"${JOB}\",/g" \
+ -e "s/http_client_request_duration_seconds_bucket\[/http_client_request_duration_seconds_bucket{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}[/g" \
+ -e "s/grpc_server_requests_total\{/grpc_server_requests_total{namespace=\"${NAMESPACE}\",job=\"${JOB}\",/g" \
+ -e "s/grpc_server_requests_total\[/grpc_server_requests_total{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}[/g" \
+ -e "s/grpc_server_panics_total\[/grpc_server_panics_total{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}[/g" \
+ -e "s/db_client_request_duration_seconds_count\{/db_client_request_duration_seconds_count{namespace=\"${NAMESPACE}\",job=\"${JOB}\",/g" \
+ -e "s/db_client_request_duration_seconds_count\[/db_client_request_duration_seconds_count{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}[/g" \
+ -e "s/db_client_request_duration_seconds_bucket\{/db_client_request_duration_seconds_bucket{namespace=\"${NAMESPACE}\",job=\"${JOB}\",/g" \
+ -e "s/db_client_request_duration_seconds_bucket\[/db_client_request_duration_seconds_bucket{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}[/g" \
+ -e "s/db_client_connections_in_use([^{])/db_client_connections_in_use{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}\1/g" \
+ -e "s/db_client_connections_max_open([^{])/db_client_connections_max_open{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}\1/g" \
+ -e "s/redis_client_requests_total\{/redis_client_requests_total{namespace=\"${NAMESPACE}\",job=\"${JOB}\",/g" \
+ -e "s/redis_client_request_duration_seconds_bucket\{/redis_client_request_duration_seconds_bucket{namespace=\"${NAMESPACE}\",job=\"${JOB}\",/g" \
+ -e "s/redis_client_request_duration_seconds_bucket\[/redis_client_request_duration_seconds_bucket{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}[/g" \
+ -e "s/mongo_client_requests_total\{/mongo_client_requests_total{namespace=\"${NAMESPACE}\",job=\"${JOB}\",/g" \
+ -e "s/mongo_client_requests_total\[/mongo_client_requests_total{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}[/g" \
+ -e "s/mongo_client_request_duration_seconds_bucket\{/mongo_client_request_duration_seconds_bucket{namespace=\"${NAMESPACE}\",job=\"${JOB}\",/g" \
+ -e "s/mongo_client_request_duration_seconds_bucket\[/mongo_client_request_duration_seconds_bucket{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}[/g" \
+ -e "s/schedule_jobs_total\{/schedule_jobs_total{namespace=\"${NAMESPACE}\",job=\"${JOB}\",/g" \
+ -e "s/go_goroutines\{/go_goroutines{namespace=\"${NAMESPACE}\",job=\"${JOB}\",/g" \
+ -e "s/go_goroutines\[/go_goroutines{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}[/g" \
+ -e "s/go_goroutines([^{[])/go_goroutines{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}\1/g" \
+ -e "s/go_threads\{/go_threads{namespace=\"${NAMESPACE}\",job=\"${JOB}\",/g" \
+ -e "s/go_threads([^{])/go_threads{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}\1/g" \
+ -e "s/go_memstats_sys_bytes([^{])/go_memstats_sys_bytes{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}\1/g" \
+ -e "s/go_memstats_heap_alloc_bytes\[/go_memstats_heap_alloc_bytes{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}[/g" \
+ -e "s/go_gc_duration_seconds\{/go_gc_duration_seconds{namespace=\"${NAMESPACE}\",job=\"${JOB}\",/g" \
+ -e "s/go_gc_duration_seconds_count\[/go_gc_duration_seconds_count{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}[/g" \
+ -e "s/go_memstats_gc_cpu_fraction([^{])/go_memstats_gc_cpu_fraction{namespace=\"${NAMESPACE}\",job=\"${JOB}\"}\1/g" \
+ "$TEMPLATE" >> "$OUTPUT"
+
+echo -e "${GREEN}✓ Alert rules generated successfully!${NC}"
+echo ""
+
+# ============================================================
+# STEP 2: Verify Alert Rules (if not skipped)
+# ============================================================
+if [ "$SKIP_VERIFY" = true ]; then
+ echo -e "${YELLOW}⚠ Verification skipped (--no-verify flag)${NC}"
+ echo ""
+else
+ echo -e "${YELLOW}🔍 Step 2: Verifying alert rules...${NC}"
+ echo ""
+
+ # List of metrics that should have filters
+ METRICS=(
+ "http_server_requests_total"
+ "http_server_request_duration_seconds_bucket"
+ "http_server_request_duration_seconds_count"
+ "http_client_requests_total"
+ "http_client_requests_inflight"
+ "http_client_request_duration_seconds_bucket"
+ "grpc_server_requests_total"
+ "grpc_server_panics_total"
+ "db_client_request_duration_seconds_count"
+ "db_client_request_duration_seconds_bucket"
+ "db_client_connections_in_use"
+ "db_client_connections_max_open"
+ "redis_client_requests_total"
+ "redis_client_request_duration_seconds_bucket"
+ "mongo_client_requests_total"
+ "mongo_client_request_duration_seconds_bucket"
+ "schedule_jobs_total"
+ "go_goroutines"
+ "go_threads"
+ "go_memstats_sys_bytes"
+ "go_memstats_heap_alloc_bytes"
+ "go_gc_duration_seconds"
+ "go_gc_duration_seconds_count"
+ "go_memstats_gc_cpu_fraction"
+ )
+
+ # Check each metric
+ ERRORS=0
+ WARNINGS=0
+
+ for metric in "${METRICS[@]}"; do
+ # Find lines with this metric
+ if grep -q "$metric" "$OUTPUT"; then
+ # Check if all occurrences have the correct filter
+ unfiltered=$(grep "$metric" "$OUTPUT" | grep -v "{namespace=\"${NAMESPACE}\",job=\"${JOB}\"" || true)
+
+ if [ -n "$unfiltered" ]; then
+ echo -e "${RED}✗ Found unfiltered metric: $metric${NC}"
+ echo "$unfiltered" | head -3
+ echo ""
+ ((ERRORS++))
+ fi
+ fi
+ done
+
+ # Check for any metrics that might have been missed
+ missed=$(grep -E '(http_|grpc_|db_|redis_|mongo_|schedule_|go_)' "$OUTPUT" | \
+ grep -v "^#" | \
+ grep -v "{namespace=\"${NAMESPACE}\",job=\"${JOB}\"" | \
+ grep -v "namespace:" | \
+ grep -v "job:" | \
+ grep -v "sum by" | \
+ grep -v "rate by" | \
+ grep -v "humanize" || true)
+
+ if [ -n "$missed" ]; then
+ echo -e "${YELLOW}⚠ Potentially missed metrics (may be false positives):${NC}"
+ echo "$missed" | head -5
+ echo ""
+ ((WARNINGS++))
+ fi
+
+ # Verification summary
+ if [ $ERRORS -eq 0 ] && [ $WARNINGS -eq 0 ]; then
+ echo -e "${GREEN}✓ All metrics are correctly filtered!${NC}"
+ elif [ $ERRORS -eq 0 ]; then
+ echo -e "${YELLOW}⚠ Verification completed with warnings${NC}"
+ echo -e "${YELLOW} (Warnings may be false positives)${NC}"
+ else
+ echo -e "${RED}✗ Verification failed!${NC}"
+ echo -e "${RED} Found $ERRORS unfiltered metrics${NC}"
+ exit 1
+ fi
+ echo ""
+fi
+
+# ============================================================
+# Final Summary
+# ============================================================
+echo -e "${BLUE}╔════════════════════════════════════════════════════════════╗${NC}"
+echo -e "${BLUE}║ Summary ║${NC}"
+echo -e "${BLUE}╚════════════════════════════════════════════════════════════╝${NC}"
+echo ""
+echo -e "${GREEN}✓ Generation completed${NC}"
+if [ "$SKIP_VERIFY" = false ]; then
+ if [ $ERRORS -eq 0 ]; then
+ echo -e "${GREEN}✓ Verification passed${NC}"
+ fi
+fi
+echo ""
+echo -e "${YELLOW}📄 Output file:${NC} $OUTPUT"
+echo ""
+echo -e "${YELLOW}📝 Next steps:${NC}"
+echo ""
+echo " 1. Review the generated file:"
+echo " cat $OUTPUT"
+echo ""
+echo " 2. (Optional) Validate with promtool:"
+echo " promtool check rules $OUTPUT"
+echo ""
+echo " 3. Deploy to Kubernetes:"
+echo " kubectl create configmap ${NAMESPACE}-${JOB}-alerts \\"
+echo " --from-file=$OUTPUT \\"
+echo " -n monitoring"
+echo ""
+echo " 4. Or add to prometheus.yml:"
+echo " rule_files:"
+echo " - '$OUTPUT'"
+echo ""
+
+exit 0