diff --git a/src/disco/metrics/metrics.xml b/src/disco/metrics/metrics.xml
index 7b783807b45..469ceb06ea7 100644
--- a/src/disco/metrics/metrics.xml
+++ b/src/disco/metrics/metrics.xml
@@ -6,20 +6,138 @@ categories.
 These metrics must be backwards compatible and you should not change
 existing metric names. Instead they should be deprecated and a new
 metric introduced.
+
+=============================================================================
+ AI WRITTEN METRIC NAMING CONVENTIONS
+=============================================================================
+
+These conventions are written with help from AI mostly to help ensure
+consistent naming conventions, or flag bad ones during review. If
+you're unsure how to name something, just ask AI to write it following
+these conventions.
+
+The main principle is to align with Prometheus best practices
+(https://prometheus.io/docs/practices/naming/). Names in this file use
+PascalCase but for export to Prometheus are converted to
+lower_snake_case.
+
+METRIC NAME STRUCTURE
+=====================
+Metric names follow a consistent pattern of
+
+  <Domain><Subject><Action><Unit>
+
+  - Domain: Broad system area within tile (Gre, Log, Pkt, Cpu)
+  - Subject: Countable domain object being measured (Pkt, Txn, Frag)
+  - Action: What is being done (Rx, Tx, Abandoned)
+  - Unit: Physical unit of measurement (Bytes, Seconds)
+
+Units are optional. Some things are countable objects and do not need
+a unit. For example, transactions, fragments, and packets are countable
+and have no units.
+
+Actions should be past participles describing what happened to the
+subject. A past participle is the verb form used in passive voice
+("was created", "were dropped") - use the form that fits "It was ___".
+
+  GOOD: ConnectionsCreated, PacketsReceived, TransactionsOverrun
+  BAD:  ConnectionCreates, PacketReceive, TransactionOverruns
+
+The pattern is "<Subject> was/were <Action>":
+
+  - "Connections were created" -> ConnectionsCreated
+  - "Packets were dropped" -> PacketsDropped
+  - "Transactions were overrun" -> TransactionsOverrun (not Overran)
+
+The Prometheus exporter automatically appends "_total" to counters on
+export. Do not include suffixes like "Total" or "Count" in names in
+this file.
+
+SINGULAR VS PLURAL
+==================
+Counters and gauges use plural subjects, for example ConnectionsFailed
+or StreamsOpened, or ConnectionsActive for a gauge.
+
+For compound subjects like "connection error" or "handshake error", the
+plural goes on the thing being counted:
+
+  - ConnectionErrorsNoSlots (counting connection errors)
+  - HandshakeErrorsAllocFail (counting handshake errors)
+  - NOT: ConnectionsErrorNoSlots or HandshakesErrorAllocFail
+
+Histograms use singular subjects. For example, FileOpenDurationSeconds or
+ConnectionLatencyNanos.
+
+Units should always be plural: Bytes, Seconds, Nanos.
+
+BASE UNITS (Prometheus Standard)
+================================
+Always use Prometheus base units. Never use derived units. Only these
+are considered "units" for naming purposes:
+
+  | Family | Base Unit | Notes                |
+  |- - - - |- - - - - -|- - - - - - - - - - - |
+  | Time   | seconds   | Use for durations.   |
+  | Data   | bytes     | Not bits, KB, MB, GB |
+
+Things like transactions, connections, and packets are NOT units; they
+are subjects and follow different rules.
+
+Exception: Use Nanos when sub-microsecond precision is needed and
+converting to seconds would lose meaningful precision.
+
+ENUMS (labels)
+==============
+Use enums to differentiate characteristics of the same metric. Do NOT
+encode enum values in the metric name.
+
+  GOOD: TransactionResult with enum values {Success, Failed, Timeout}
+  BAD:  TransactionResultSuccess, TransactionResultFailed, ...
+
+Enum names: PascalCase describing the dimension (TransactionResult)
+Enum values: PascalCase, self-descriptive (Success, ParseFailed)
+
+SUMMARIES
+=========
+Every metric MUST have a summary that describes what is measured in
+plain English and mentions units if not obvious from the name.
+Summaries should be simple and concise and should not duplicate
+information. For example,
+
+  GOOD: "Packets received"
+  BAD:  "Total count of packets received over the network interface"
+
+Summaries should begin with a capital letter, end without a period, and
+be a human-readable label, not a code identifier. A code identifier or
+reference may appear in parentheses if needed for clarity.
+
+EXAMPLES
+========
+
+  GOOD: PktRxBytes
+  BAD:  RxBytes, RxPktBytes, ReceivedPktBytes, TotalPktRxBytes
+
+  GOOD: StreamsOpened
+  BAD:  StreamOpenCount, OpenedStreamsTotal, StreamOpened
+
+  GOOD: TxnRx
+  BAD:  TransactionsRx, TxnsRx, ReceivedTxns, RxTxn
+
+=============================================================================
 -->
[removed and added element lines omitted: XML markup not recoverable]
@@ -44,55 +162,56 @@ metric introduced.
[removed and added element lines omitted: XML markup not recoverable]
@@ -105,13 +224,13 @@ metric introduced.
[removed and added element lines omitted: XML markup not recoverable]
@@ -200,50 +319,52 @@ metric introduced.
[removed and added element lines omitted: XML markup not recoverable]
 Duration spent in service
@@ -252,16 +373,16 @@ metric introduced.
 Duration spent processing packets
[removed and added element lines omitted: XML markup not recoverable]
@@ -309,18 +430,18 @@ metric introduced.
[removed and added element lines omitted: XML markup not recoverable]
@@ -329,37 +450,37 @@ metric introduced.
[removed and added element lines omitted: XML markup not recoverable]
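
Review note: the export rule stated in the conventions (PascalCase names converted to lower_snake_case, with "_total" appended to counters) can be sketched as below. This is a minimal illustration only, not the repository's exporter. The function names `to_prometheus_name` and `has_banned_suffix` are hypothetical, and the sketch assumes every word in a name starts with one capital letter followed by lowercase letters or digits, as in the examples in this file.

```python
import re

# Assumption: each word is one capital letter followed by lowercase letters
# or digits (PktRxBytes -> Pkt, Rx, Bytes). All-caps acronyms would be split
# letter by letter, so this sketch does not handle them.
_WORD = re.compile(r"[A-Z][a-z0-9]*")


def to_prometheus_name(name: str, is_counter: bool) -> str:
    """Convert a PascalCase metric name to lower_snake_case.

    Hypothetical helper: appends "_total" for counters, mirroring the
    behaviour the conventions attribute to the exporter.
    """
    snake = "_".join(word.lower() for word in _WORD.findall(name))
    return snake + "_total" if is_counter else snake


def has_banned_suffix(name: str) -> bool:
    """Flag names ending in "Total" or "Count", which the conventions forbid."""
    return name.endswith(("Total", "Count"))


if __name__ == "__main__":
    print(to_prometheus_name("PktRxBytes", is_counter=True))
    print(to_prometheus_name("ConnectionsActive", is_counter=False))
    print(has_banned_suffix("StreamOpenCount"))
```

A check like `has_banned_suffix` could run during review to catch names that would otherwise export with a doubled suffix such as "_total_total".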