
Commit 5ad6478

craig[bot], dhartunian, yuzefovich, andy-kimball, and iskettaneh committed
146225: metric: add essential label to metrics r=angles-n-daemons a=dhartunian

This commit adds the `Essential: true` label to metrics that are on the essential metrics docs page. The page can be viewed at: https://www.cockroachlabs.com/docs/stable/essential-metrics-self-hosted

Appropriate categories are chosen for metrics where available. The `HowToUse` field is populated with text from the docs page. These fields are now available in the generated docs YAML file.

Resolves: cockroachdb#142571

Release note: None

146389: sql: propagate TestingKnobs.ForceProductionValues to remote nodes r=yuzefovich a=yuzefovich

This commit fixes an oversight in how we handle `eval.Context.TestingKnobs.ForceProductionValues`: previously we forgot to propagate this information from the gateway to remote nodes, so the latter would not respect this knob. This is now fixed.

Fixes: cockroachdb#146350.

Release note: None

146511: cspann: cleanup and fixes for quantizers and search set r=drewkimball a=andy-kimball

#### cspann: ensure unit vectors in quantizers when required

Construct a unit vector centroid in the quantizers when using the InnerProduct or Cosine distance metric. This is required for accurate distances in the Cosine case. In the InnerProduct case, it avoids partition centroids with high magnitudes that "attract" vectors simply because of their magnitude.

In test builds, check that all query and data vectors are unit vectors when using the Cosine distance metric. Add more testing for Cosine and InnerProduct distance metrics, as used by the quantizers.

#### cspann: remove centroid distances from quantized and search sets

Remove GetCentroidDistances() from QuantizedVectorSet and remove CentroidDistance from SearchResult. The centroid distances were never being used.

#### cspann: rename SearchResult.QuerySquaredDistance to QueryDistance

Now that we're introducing alternative distance metrics, the search result no longer always returns a squared L2 distance. Rename the field to QueryDistance to be agnostic to the distance metric that's used.

146561: kvserver: reduce replica mu contention in updateProposalQuotaRaftMuLocked r=iskettaneh a=iskettaneh

This PR refactors updateProposalQuotaRaftMuLocked() to reduce replica mutex contention, especially in the normal case (the QuotaPool is disabled and leadership is stable). Changes done by this commit:

1. Create a short path for the case where we are not the leader.
2. Create a short path for the case where the QuotaPool is not enabled. We only take the write lock in cases where we already had proposal quota that needs to be released, or if the flow control level causes replicaFlowControlIntegration.onRaftTicked to be a no-op (see the sketch after this list).
3. Untangle the rest of the cases where the QuotaPool is enabled.
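A minimal sketch of that fast-path/slow-path shape, assuming a simplified, hypothetical `replica` type (this is not the actual `updateProposalQuotaRaftMuLocked` implementation): the cheap conditions are checked under the read lock, and the write lock is taken only when there is state to mutate.

```go
package main

import "sync"

// replica is a toy stand-in for the real kvserver replica; the field names
// here are illustrative only.
type replica struct {
	mu struct {
		sync.RWMutex
		isLeader         bool
		quotaPoolEnabled bool
		quotaToRelease   int64
	}
}

// updateProposalQuota sketches the fast-path/slow-path split: the common
// cases (not the leader, or quota pool disabled with nothing to release)
// are answered under the read lock, so they no longer contend with writers.
func (r *replica) updateProposalQuota() {
	r.mu.RLock()
	notLeader := !r.mu.isLeader
	disabled := !r.mu.quotaPoolEnabled
	toRelease := r.mu.quotaToRelease
	r.mu.RUnlock()

	if notLeader || (disabled && toRelease == 0) {
		return // fast path: no state to mutate, write lock avoided
	}

	// Slow path: only now do we pay for the exclusive lock.
	r.mu.Lock()
	defer r.mu.Unlock()
	r.mu.quotaToRelease = 0
	// ... release quota and update flow-control bookkeeping here ...
}

func main() {
	var r replica
	r.updateProposalQuota()
}
```

The real function also coordinates with raftMu and the flow-control integration, which the sketch omits.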
benchdiff results:

```
rm -f /tmp/*.pb.gz && env GODEBUG=runtimecontentionstacks=1 benchdiff -b --old e0393de6614ddad81a8209733945be46e967e993 -c 10 ./pkg/sql/tests -r BenchmarkParallelSysbench/SQL/3node/oltp_read_write --mutexprofile -d=3000x

name                                           old time/op    new time/op    delta
ParallelSysbench/SQL/3node/oltp_read_write-24    1.15ms ± 7%    1.15ms ± 4%    ~     (p=1.000 n=10+9)

name                                           old errs/op    new errs/op    delta
ParallelSysbench/SQL/3node/oltp_read_write-24      0.01 ±46%      0.01 ±32%    ~     (p=0.469 n=10+10)

name                                           old alloc/op   new alloc/op   delta
ParallelSysbench/SQL/3node/oltp_read_write-24    2.05MB ± 1%    2.06MB ± 1%    ~     (p=0.052 n=10+10)

name                                           old allocs/op  new allocs/op  delta
ParallelSysbench/SQL/3node/oltp_read_write-24     8.18k ± 1%     8.20k ± 1%    ~     (p=0.101 n=10+10)
```

Mutex contention graph after running the microbenchmark on my GCE worker:

```
rm -f /tmp/*.pb.gz && env GODEBUG=runtimecontentionstacks=1 go test ./pkg/sql/tests -run - -bench BenchmarkParallelSysbench/SQL/3node/oltp_read_write -v -test.benchtime=3000x -test.outputdir=/tmp -test.mutexprofile=mutex.pb.gz -test.mutexprofilefraction=100 -test.timeout 25s 2>&1
```

Before:
<img width="1906" alt="Screenshot 2025-05-12 at 3 40 08 PM" src="https://github.com/user-attachments/assets/0059db5d-fc17-4344-ae1e-7d37770fba0c" />

After:
<img width="1908" alt="Screenshot 2025-05-12 at 3 39 48 PM" src="https://github.com/user-attachments/assets/f3ee3869-8b46-45bc-bf63-ba2af958cb84" />

References: cockroachdb#140235

Release note: None

146594: tests: de-flake a couple of tests due to buffered writes r=yuzefovich a=yuzefovich

We recently enabled buffered writes metamorphically in tests. Previously, we also added several lock reliability cluster settings whose defaults are changed metamorphically. We've now seen two failures where an expected contention event didn't occur due to an unfortunate combination of these two facts: namely, if we metamorphically disable the `kv.lock_table.unreplicated_lock_reliability.split.enabled` cluster setting while enabling buffered writes, multi-server unit tests might not observe the expected contention.

As of right now two tests like this have been identified, so they are adjusted to always override the split reliability setting to `true`. The reasoning is that if we only have a handful of tests that are susceptible to flakes due to this poor combination of metamorphic settings, we'd rather adjust the tests themselves. If we find more cases like this, we'll consider making some settings non-metamorphic.

Fixes: cockroachdb#146387.
Fixes: cockroachdb#146412.

Release note: None

146642: workload: reduce RNG allocations r=mgartner a=mgartner

PR cockroachdb#143626 removed `golang.org/x/exp/rand` in favor of `math/rand/v2`. The number of heap allocations grew dramatically due to new calls to `rand.New` and `rand.NewPCG`. This commit eliminates these allocations by reusing `rand.Rand` and `rand.PCG` allocations (see the sketch below).
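A minimal sketch of the reuse pattern, using `math/rand/v2` directly; the `rngHolder` helper below is hypothetical and not the workload package's actual API. The idea is to allocate the `rand.PCG` source and `rand.Rand` wrapper once and re-seed the source in place, rather than calling `rand.NewPCG`/`rand.New` per row. The benchmark results from the PR description follow.

```go
package main

import (
	"fmt"
	"math/rand/v2"
)

// rngHolder reuses a single PCG source and Rand wrapper. Re-seeding the
// existing PCG avoids allocating a new source and generator for every row.
type rngHolder struct {
	src *rand.PCG
	rng *rand.Rand
}

func newRNGHolder() *rngHolder {
	src := rand.NewPCG(0, 0)
	return &rngHolder{src: src, rng: rand.New(src)}
}

// reseed repositions the existing generator deterministically instead of
// allocating a new one.
func (h *rngHolder) reseed(seed uint64) *rand.Rand {
	h.src.Seed(seed, 0)
	return h.rng
}

func main() {
	h := newRNGHolder()
	for row := uint64(0); row < 3; row++ {
		rng := h.reseed(row) // deterministic per row, no new allocations
		fmt.Println(rng.IntN(100))
	}
}
```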
```
name                           old time/op    new time/op    delta
InitialData/tpcc/warehouses=1     103ms ± 0%      91ms ± 0%  -11.96%  (p=0.016 n=4+5)

name                           old speed      new speed      delta
InitialData/tpcc/warehouses=1  2.13GB/s ± 0%  2.50GB/s ± 0%  +17.58%  (p=0.016 n=4+5)

name                           old alloc/op   new alloc/op   delta
InitialData/tpcc/warehouses=1    10.3MB ± 0%     0.1MB ± 0%  -99.24%  (p=0.008 n=5+5)

name                           old allocs/op  new allocs/op  delta
InitialData/tpcc/warehouses=1      640k ± 0%        0k ± 0%  -99.97%  (p=0.008 n=5+5)
```

Release note: None

146645: util/mon,colexec/colexecdisk: reduce allocations r=mgartner a=mgartner

#### colexec/colexecdisk: lazily construct monitor name strings

Monitor name strings are now built only when an out-of-memory error is encountered, rather than always built when a disk spiller is created.

Release note: None

#### util/mon: use array for disk spiller monitor names

A fixed-size array is now used instead of a slice for storing `mon.Name`s in `diskSpillerBase`, reducing heap allocations. An empty `mon.Name` is used for the second element in the array if there is only one name. (A sketch of both patterns follows the co-author trailers below.)

Release note: None

Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Andrew Kimball <andyk@cockroachlabs.com>
Co-authored-by: Ibrahim Kettaneh <ibrahim.kettaneh@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
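As referenced above, here is a minimal sketch of both allocation-saving patterns from 146645, with illustrative types only (the real code lives in `util/mon` and `colexec/colexecdisk` and differs in detail): the name string is built lazily on the error path, and at most two names are held in a fixed-size array rather than a slice.

```go
package main

import (
	"errors"
	"fmt"
)

// name mimics the idea of mon.Name; the types and fields here are
// illustrative only, not the util/mon package's actual API.
type name struct{ s string }

func (n name) String() string { return n.s }

// diskSpiller stores at most two monitor names in a fixed-size array rather
// than a slice, so constructing one performs no heap allocation for the
// names. An empty name stands in when there is only one.
type diskSpiller struct {
	names [2]name
}

// oomError builds the human-readable monitor-name string lazily, only on the
// error path, rather than eagerly when the disk spiller is created.
func (d *diskSpiller) oomError() error {
	full := d.names[0].String()
	if d.names[1].s != "" {
		full += "-" + d.names[1].String()
	}
	return errors.New("memory budget exceeded for monitor " + full)
}

func main() {
	d := &diskSpiller{names: [2]name{{s: "hash-joiner"}, {}}}
	fmt.Println(d.oomError())
}
```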
8 parents: 7803d33 + 2e7118f + f04f48c + ada2236 + eb5193b + d91b3b4 + b17f4b5 + d6abe7d


79 files changed (+2734, −1761 lines)


docs/generated/metrics/metrics.yaml

Lines changed: 1300 additions & 1072 deletions
Large diffs are not rendered by default.

pkg/backup/schedule_exec.go

Lines changed: 16 additions & 0 deletions
@@ -574,6 +574,7 @@ func init() {
 		tree.ScheduledBackupExecutor.InternalName(),
 		func() (jobs.ScheduledJobExecutor, error) {
 			m := jobs.MakeExecutorMetrics(tree.ScheduledBackupExecutor.UserName())
+
 			pm := jobs.MakeExecutorPTSMetrics(tree.ScheduledBackupExecutor.UserName())
 			return &scheduledBackupExecutor{
 				metrics: backupMetrics{
@@ -584,6 +585,21 @@ func init() {
 					Help: "The unix timestamp of the most recently completed backup by a schedule specified as maintaining this metric",
 					Measurement: "Jobs",
 					Unit: metric.Unit_TIMESTAMP_SEC,
+					Essential: true,
+					Category: metric.Metadata_SQL,
+					HowToUse: `Monitor this metric to ensure that backups are
+meeting the recovery point objective (RPO). Each node
+exports the time that it last completed a backup on behalf
+of the schedule. If a node is restarted, it will report 0
+until it completes a backup. If all nodes are restarted,
+max() is 0 until a node completes a backup.
+
+To make use of this metric, first, from each node, take the maximum
+over a rolling window equal to or greater than the backup frequency,
+and then take the maximum of those values across nodes. For example
+with a backup frequency of 60 minutes, monitor time() -
+max_across_nodes(max_over_time(schedules_BACKUP_last_completed_time,
+60min)).`,
 				}),
 				RpoTenantMetric: metric.NewExportedGaugeVec(metric.Metadata{
 					Name: "schedules.BACKUP.last-completed-time-by-virtual_cluster",

pkg/base/license.go

Lines changed: 6 additions & 0 deletions
@@ -41,13 +41,19 @@ var LicenseTTLMetadata = metric.Metadata{
 	Help: "Seconds until license expiry (0 if no license present)",
 	Measurement: "Seconds",
 	Unit: metric.Unit_SECONDS,
+	Essential: true,
+	Category: metric.Metadata_EXPIRATIONS,
+	HowToUse: "See Description.",
 }
 
 var AdditionalLicenseTTLMetadata = metric.Metadata{
 	Name: "seconds_until_license_expiry",
 	Help: "Seconds until license expiry (0 if no license present)",
 	Measurement: "Seconds",
 	Unit: metric.Unit_SECONDS,
+	Essential: true,
+	Category: metric.Metadata_EXPIRATIONS,
+	HowToUse: "See Description.",
 }
 
 // GetLicenseTTL is a function which returns the TTL for the active cluster.

pkg/ccl/changefeedccl/metrics.go

Lines changed: 18 additions & 0 deletions
@@ -707,12 +707,18 @@ var (
 		Help: "Total retryable errors encountered by all changefeeds",
 		Measurement: "Errors",
 		Unit: metric.Unit_COUNT,
+		Essential: true,
+		Category: metric.Metadata_CHANGEFEEDS,
+		HowToUse: `This metric tracks transient changefeed errors. Alert on "too many" errors, such as 50 retries in 15 minutes. For example, during a rolling upgrade this counter will increase because the changefeed jobs will restart following node restarts. There is an exponential backoff, up to 10 minutes. But if there is no rolling upgrade in process or other cluster maintenance, and the error rate is high, investigate the changefeed job.`,
 	}
 	metaChangefeedFailures = metric.Metadata{
 		Name: "changefeed.failures",
 		Help: "Total number of changefeed jobs which have failed",
 		Measurement: "Errors",
 		Unit: metric.Unit_COUNT,
+		Essential: true,
+		Category: metric.Metadata_CHANGEFEEDS,
+		HowToUse: `This metric tracks the permanent changefeed job failures that the jobs system will not try to restart. Any increase in this counter should be investigated. An alert on this metric is recommended.`,
 	}
 
 	metaEventQueueTime = metric.Metadata{
@@ -791,6 +797,9 @@ func newAggregateMetrics(histogramWindow time.Duration, lookup *cidr.Lookup) *Ag
 		Help: "Messages emitted by all feeds",
 		Measurement: "Messages",
 		Unit: metric.Unit_COUNT,
+		Essential: true,
+		Category: metric.Metadata_CHANGEFEEDS,
+		HowToUse: `This metric provides a useful context when assessing the state of changefeeds. This metric characterizes the rate of changes being streamed from the CockroachDB cluster.`,
 	}
 	metaChangefeedEmittedBatchSizes := metric.Metadata{
 		Name: "changefeed.emitted_batch_sizes",
@@ -811,6 +820,9 @@ func newAggregateMetrics(histogramWindow time.Duration, lookup *cidr.Lookup) *Ag
 		Help: "Bytes emitted by all feeds",
 		Measurement: "Bytes",
 		Unit: metric.Unit_BYTES,
+		Essential: true,
+		Category: metric.Metadata_CHANGEFEEDS,
+		HowToUse: `This metric provides a useful context when assessing the state of changefeeds. This metric characterizes the throughput bytes being streamed from the CockroachDB cluster.`,
 	}
 	metaChangefeedFlushedBytes := metric.Metadata{
 		Name: "changefeed.flushed_bytes",
@@ -850,6 +862,9 @@ func newAggregateMetrics(histogramWindow time.Duration, lookup *cidr.Lookup) *Ag
 			"Excludes latency during backfill",
 		Measurement: "Nanoseconds",
 		Unit: metric.Unit_NANOSECONDS,
+		Essential: true,
+		Category: metric.Metadata_CHANGEFEEDS,
+		HowToUse: `This metric provides a useful context when assessing the state of changefeeds. This metric characterizes the end-to-end lag between a committed change and that change applied at the destination.`,
 	}
 	metaAdmitLatency := metric.Metadata{
 		Name: "changefeed.admit_latency",
@@ -878,6 +893,9 @@ func newAggregateMetrics(histogramWindow time.Duration, lookup *cidr.Lookup) *Ag
 		Help: "Number of currently running changefeeds, including sinkless",
 		Measurement: "Changefeeds",
 		Unit: metric.Unit_COUNT,
+		Essential: true,
+		Category: metric.Metadata_CHANGEFEEDS,
+		HowToUse: `This metric tracks the total number of all running changefeeds.`,
 	}
 	metaMessageSize := metric.Metadata{
 		Name: "changefeed.message_size_hist",

pkg/ccl/serverccl/statusccl/tenant_status_test.go

Lines changed: 4 additions & 2 deletions
@@ -51,7 +51,6 @@ func TestTenantStatusAPI(t *testing.T) {
 	defer s.Close(t)
 	defer s.SetupSingleFileLogging()()
 
-	skip.WithIssue(t, 146387)
 	// The liveness session might expire before the stress race can finish.
 	skip.UnderRace(t, "expensive tests")
 
@@ -73,6 +72,10 @@
 	tdb.Exec(t, "SET CLUSTER SETTING kv.closed_timestamp.target_duration = '10ms'")
 	tdb.Exec(t, "SET CLUSTER SETTING kv.closed_timestamp.side_transport_interval = '10 ms'")
 	tdb.Exec(t, "SET CLUSTER SETTING kv.rangefeed.closed_timestamp_refresh_interval = '10 ms'")
+	// If we happen to enable buffered writes metamorphically, we must have the
+	// split lock reliability enabled (which can be tweaked metamorphically too,
+	// #146387).
+	tdb.Exec(t, "SET CLUSTER SETTING kv.lock_table.unreplicated_lock_reliability.split.enabled = true")
 
 	t.Run("reset_sql_stats", func(t *testing.T) {
 		skip.UnderDeadlockWithIssue(t, 99559)
@@ -884,7 +887,6 @@ WHERE tablename = 'test' AND indexname = $1`
 		requireAfter(t, &resp.Statistics[0].Statistics.Stats.LastRead, &timePreRead)
 		indexName := resp.Statistics[0].IndexName
 		createStmt := cluster.TenantConn(0).QueryStr(t, getCreateStmtQuery, indexName)[0][0]
-		print(createStmt)
 		require.Equal(t, resp.Statistics[0].CreateStatement, createStmt)
 		requireBetween(t, timePreCreate, resp.Statistics[0].CreatedAt, timePreRead)
 	})

pkg/jobs/metrics.go

Lines changed: 115 additions & 30 deletions
@@ -76,18 +76,45 @@ type JobTypeMetrics struct {
 // MetricStruct implements the metric.Struct interface.
 func (JobTypeMetrics) MetricStruct() {}
 
-func makeMetaCurrentlyRunning(typeStr string) metric.Metadata {
-	return metric.Metadata{
+func typeToString(jobType jobspb.Type) string {
+	return strings.ToLower(strings.Replace(jobType.String(), " ", "_", -1))
+}
+
+func makeMetaCurrentlyRunning(jt jobspb.Type) metric.Metadata {
+	typeStr := typeToString(jt)
+	m := metric.Metadata{
 		Name: fmt.Sprintf("jobs.%s.currently_running", typeStr),
 		Help: fmt.Sprintf("Number of %s jobs currently running in Resume or OnFailOrCancel state",
 			typeStr),
 		Measurement: "jobs",
 		Unit: metric.Unit_COUNT,
 		MetricType: io_prometheus_client.MetricType_GAUGE,
 	}
+
+	switch jt {
+	case jobspb.TypeAutoCreateStats:
+		m.Essential = true
+		m.Category = metric.Metadata_SQL
+		m.HowToUse = `This metric tracks the number of active automatically generated statistics jobs that could also be consuming resources. Ensure that foreground SQL traffic is not impacted by correlating this metric with SQL latency and query volume metrics.`
+	case jobspb.TypeCreateStats:
+		m.Essential = true
+		m.Category = metric.Metadata_SQL
+		m.HowToUse = `This metric tracks the number of active create statistics jobs that may be consuming resources. Ensure that foreground SQL traffic is not impacted by correlating this metric with SQL latency and query volume metrics.`
+	case jobspb.TypeBackup:
+		m.Essential = true
+		m.Category = metric.Metadata_SQL
+		m.HowToUse = `See Description.`
+	case jobspb.TypeRowLevelTTL:
+		m.Essential = true
+		m.Category = metric.Metadata_TTL
+		m.HowToUse = `Monitor this metric to ensure there are not too many Row Level TTL jobs running at the same time. Generally, this metric should be in the low single digits.`
+	}
+
+	return m
 }
 
-func makeMetaCurrentlyIdle(typeStr string) metric.Metadata {
+func makeMetaCurrentlyIdle(jt jobspb.Type) metric.Metadata {
+	typeStr := typeToString(jt)
 	return metric.Metadata{
 		Name: fmt.Sprintf("jobs.%s.currently_idle", typeStr),
 		Help: fmt.Sprintf("Number of %s jobs currently considered Idle and can be freely shut down",
@@ -98,29 +125,59 @@ func makeMetaCurrentlyIdle(typeStr string) metric.Metadata {
 	}
 }
 
-func makeMetaCurrentlyPaused(typeStr string) metric.Metadata {
-	return metric.Metadata{
+func makeMetaCurrentlyPaused(jt jobspb.Type) metric.Metadata {
+	typeStr := typeToString(jt)
+	m := metric.Metadata{
 		Name: fmt.Sprintf("jobs.%s.currently_paused", typeStr),
 		Help: fmt.Sprintf("Number of %s jobs currently considered Paused",
 			typeStr),
 		Measurement: "jobs",
 		Unit: metric.Unit_COUNT,
 		MetricType: io_prometheus_client.MetricType_GAUGE,
 	}
+	switch jt {
+	case jobspb.TypeAutoCreateStats:
+		m.Essential = true
+		m.Category = metric.Metadata_SQL
+		m.HowToUse = `This metric is a high-level indicator that automatically generated statistics jobs are paused which can lead to the query optimizer running with stale statistics. Stale statistics can cause suboptimal query plans to be selected leading to poor query performance.`
+	case jobspb.TypeBackup:
+		m.Essential = true
+		m.Category = metric.Metadata_SQL
+		m.HowToUse = `Monitor and alert on this metric to safeguard against an inadvertent operational error of leaving a backup job in a paused state for an extended period of time. In functional areas, a paused job can hold resources or have concurrency impact or some other negative consequence. Paused backup may break the recovery point objective (RPO).`
+	case jobspb.TypeChangefeed:
+		m.Essential = true
+		m.Category = metric.Metadata_CHANGEFEEDS
+		m.HowToUse = `Monitor and alert on this metric to safeguard against an inadvertent operational error of leaving a changefeed job in a paused state for an extended period of time. Changefeed jobs should not be paused for a long time because the protected timestamp prevents garbage collection.`
+	case jobspb.TypeRowLevelTTL:
+		m.Essential = true
+		m.Category = metric.Metadata_TTL
+		m.HowToUse = `Monitor this metric to ensure the Row Level TTL job does not remain paused inadvertently for an extended period.`
+	}
+	return m
 }
 
-func makeMetaResumeCompeted(typeStr string) metric.Metadata {
-	return metric.Metadata{
+func makeMetaResumeCompeted(jt jobspb.Type) metric.Metadata {
+	typeStr := typeToString(jt)
+	m := metric.Metadata{
 		Name: fmt.Sprintf("jobs.%s.resume_completed", typeStr),
 		Help: fmt.Sprintf("Number of %s jobs which successfully resumed to completion",
 			typeStr),
 		Measurement: "jobs",
 		Unit: metric.Unit_COUNT,
 		MetricType: io_prometheus_client.MetricType_GAUGE,
 	}
+
+	switch jt {
+	case jobspb.TypeRowLevelTTL:
+		m.Essential = true
+		m.Category = metric.Metadata_TTL
+		m.HowToUse = `If Row Level TTL is enabled, this metric should be nonzero and correspond to the ttl_cron setting that was chosen. If this metric is zero, it means the job is not running`
+	}
+	return m
 }
 
-func makeMetaResumeRetryError(typeStr string) metric.Metadata {
+func makeMetaResumeRetryError(jt jobspb.Type) metric.Metadata {
+	typeStr := typeToString(jt)
 	return metric.Metadata{
 		Name: fmt.Sprintf("jobs.%s.resume_retry_error", typeStr),
 		Help: fmt.Sprintf("Number of %s jobs which failed with a retriable error",
@@ -131,18 +188,32 @@ func makeMetaResumeRetryError(typeStr string) metric.Metadata {
 	}
 }
 
-func makeMetaResumeFailed(typeStr string) metric.Metadata {
-	return metric.Metadata{
+func makeMetaResumeFailed(jt jobspb.Type) metric.Metadata {
+	typeStr := typeToString(jt)
+	m := metric.Metadata{
 		Name: fmt.Sprintf("jobs.%s.resume_failed", typeStr),
 		Help: fmt.Sprintf("Number of %s jobs which failed with a non-retriable error",
 			typeStr),
 		Measurement: "jobs",
 		Unit: metric.Unit_COUNT,
 		MetricType: io_prometheus_client.MetricType_GAUGE,
 	}
+
+	switch jt {
+	case jobspb.TypeAutoCreateStats:
+		m.Essential = true
+		m.Category = metric.Metadata_SQL
+		m.HowToUse = `This metric is a high-level indicator that automatically generated table statistics is failing. Failed statistic creation can lead to the query optimizer running with stale statistics. Stale statistics can cause suboptimal query plans to be selected leading to poor query performance.`
+	case jobspb.TypeRowLevelTTL:
+		m.Essential = true
+		m.Category = metric.Metadata_TTL
+		m.HowToUse = `This metric should remain at zero. Repeated errors means the Row Level TTL job is not deleting data.`
+	}
+	return m
 }
 
-func makeMetaFailOrCancelCompeted(typeStr string) metric.Metadata {
+func makeMetaFailOrCancelCompeted(jt jobspb.Type) metric.Metadata {
+	typeStr := typeToString(jt)
 	return metric.Metadata{
 		Name: fmt.Sprintf("jobs.%s.fail_or_cancel_completed", typeStr),
 		Help: fmt.Sprintf("Number of %s jobs which successfully completed "+
@@ -154,7 +225,8 @@ func makeMetaFailOrCancelCompeted(typeStr string) metric.Metadata {
 	}
 }
 
-func makeMetaFailOrCancelRetryError(typeStr string) metric.Metadata {
+func makeMetaFailOrCancelRetryError(jt jobspb.Type) metric.Metadata {
+	typeStr := typeToString(jt)
 	return metric.Metadata{
 		Name: fmt.Sprintf("jobs.%s.fail_or_cancel_retry_error", typeStr),
 		Help: fmt.Sprintf("Number of %s jobs which failed with a retriable "+
@@ -166,7 +238,8 @@ func makeMetaFailOrCancelRetryError(typeStr string) metric.Metadata {
 	}
 }
 
-func makeMetaFailOrCancelFailed(typeStr string) metric.Metadata {
+func makeMetaFailOrCancelFailed(jt jobspb.Type) metric.Metadata {
+	typeStr := typeToString(jt)
 	return metric.Metadata{
 		Name: fmt.Sprintf("jobs.%s.fail_or_cancel_failed", typeStr),
 		Help: fmt.Sprintf("Number of %s jobs which failed with a "+
@@ -178,7 +251,8 @@ func makeMetaFailOrCancelFailed(typeStr string) metric.Metadata {
 	}
 }
 
-func makeMetaProtectedCount(typeStr string) metric.Metadata {
+func makeMetaProtectedCount(jt jobspb.Type) metric.Metadata {
+	typeStr := typeToString(jt)
 	return metric.Metadata{
 		Name: fmt.Sprintf("jobs.%s.protected_record_count", typeStr),
 		Help: fmt.Sprintf("Number of protected timestamp records held by %s jobs", typeStr),
@@ -188,17 +262,28 @@ func makeMetaProtectedCount(typeStr string) metric.Metadata {
 	}
 }
 
-func makeMetaProtectedAge(typeStr string) metric.Metadata {
-	return metric.Metadata{
+func makeMetaProtectedAge(jt jobspb.Type) metric.Metadata {
+	typeStr := typeToString(jt)
+	m := metric.Metadata{
 		Name: fmt.Sprintf("jobs.%s.protected_age_sec", typeStr),
 		Help: fmt.Sprintf("The age of the oldest PTS record protected by %s jobs", typeStr),
 		Measurement: "seconds",
 		Unit: metric.Unit_SECONDS,
 		MetricType: io_prometheus_client.MetricType_GAUGE,
 	}
+
+	switch jt {
+	case jobspb.TypeChangefeed:
+		m.Essential = true
+		m.Category = metric.Metadata_CHANGEFEEDS
+		m.HowToUse = `Changefeeds use protected timestamps to protect the data from being garbage collected. Ensure the protected timestamp age does not significantly exceed the GC TTL zone configuration. Alert on this metric if the protected timestamp age is greater than 3 times the GC TTL.`
+	}
+
+	return m
 }
 
-func makeMetaExpiredPTS(typeStr string) metric.Metadata {
+func makeMetaExpiredPTS(jt jobspb.Type) metric.Metadata {
+	typeStr := typeToString(jt)
 	return metric.Metadata{
 		Name: fmt.Sprintf("jobs.%s.expired_pts_records", typeStr),
 		Help: fmt.Sprintf("Number of expired protected timestamp records owned by %s jobs", typeStr),
@@ -271,21 +356,21 @@ func (m *Metrics) init(histogramWindowInterval time.Duration, lookup *cidr.Looku
 		if jt == jobspb.TypeUnspecified { // do not track TypeUnspecified
 			continue
 		}
-		typeStr := strings.ToLower(strings.Replace(jt.String(), " ", "_", -1))
 		m.JobMetrics[jt] = &JobTypeMetrics{
-			CurrentlyRunning: metric.NewGauge(makeMetaCurrentlyRunning(typeStr)),
-			CurrentlyIdle: metric.NewGauge(makeMetaCurrentlyIdle(typeStr)),
-			CurrentlyPaused: metric.NewGauge(makeMetaCurrentlyPaused(typeStr)),
-			ResumeCompleted: metric.NewCounter(makeMetaResumeCompeted(typeStr)),
-			ResumeRetryError: metric.NewCounter(makeMetaResumeRetryError(typeStr)),
-			ResumeFailed: metric.NewCounter(makeMetaResumeFailed(typeStr)),
-			FailOrCancelCompleted: metric.NewCounter(makeMetaFailOrCancelCompeted(typeStr)),
-			FailOrCancelRetryError: metric.NewCounter(makeMetaFailOrCancelRetryError(typeStr)),
-			FailOrCancelFailed: metric.NewCounter(makeMetaFailOrCancelFailed(typeStr)),
-			NumJobsWithPTS: metric.NewGauge(makeMetaProtectedCount(typeStr)),
-			ExpiredPTS: metric.NewCounter(makeMetaExpiredPTS(typeStr)),
-			ProtectedAge: metric.NewGauge(makeMetaProtectedAge(typeStr)),
+			CurrentlyRunning: metric.NewGauge(makeMetaCurrentlyRunning(jt)),
+			CurrentlyIdle: metric.NewGauge(makeMetaCurrentlyIdle(jt)),
+			CurrentlyPaused: metric.NewGauge(makeMetaCurrentlyPaused(jt)),
+			ResumeCompleted: metric.NewCounter(makeMetaResumeCompeted(jt)),
+			ResumeRetryError: metric.NewCounter(makeMetaResumeRetryError(jt)),
+			ResumeFailed: metric.NewCounter(makeMetaResumeFailed(jt)),
+			FailOrCancelCompleted: metric.NewCounter(makeMetaFailOrCancelCompeted(jt)),
+			FailOrCancelRetryError: metric.NewCounter(makeMetaFailOrCancelRetryError(jt)),
+			FailOrCancelFailed: metric.NewCounter(makeMetaFailOrCancelFailed(jt)),
+			NumJobsWithPTS: metric.NewGauge(makeMetaProtectedCount(jt)),
+			ExpiredPTS: metric.NewCounter(makeMetaExpiredPTS(jt)),
+			ProtectedAge: metric.NewGauge(makeMetaProtectedAge(jt)),
 		}
+
 		if opts, ok := getRegisterOptions(jt); ok {
 			if opts.metrics != nil {
 				m.JobSpecificMetrics[jt] = opts.metrics