diff --git a/src/pages/guide/observability/builder-observability.mdx b/src/pages/guide/observability/builder-observability.mdx
new file mode 100644
index 0000000..24aca45
--- /dev/null
+++ b/src/pages/guide/observability/builder-observability.mdx
@@ -0,0 +1,95 @@
+---
+title: Builder Observability
+description: Internal tooling and data sources for investigating Tempo block building and validation performance, outliers, and execution breakdowns.
+---
+
+# Builder Observability
+
+Tooling and workflows for investigating Tempo block building and validation performance. The goal is to quickly identify outliers (slow builds, timed-out proposals, slow validations) and drill into per-node execution breakdowns.
+
+## Workflow
+
+The investigation loop:
+
+1. **Spot an outlier**: monitor [ValScope](#valscope) or Grafana dashboards for slow proposals, builds, or validations
+2. **Inspect the network view**: open the block/view in ValScope to see per-validator timelines across the network
+3. **Drill into execution**: jump to [BlockScope](#blockscope) for a detailed per-node execution breakdown (traces, spans, timeline)
+
+## Tools
+
+### ValScope
+
+Real-time validator monitoring dashboard. Ingests consensus and execution logs from all validators, correlates events into per-block timelines, and serves a live web UI.
+
+- **Repo:** [tempoxyz/valscope](https://github.com/tempoxyz/valscope)
+- **Testnet:** `dev-joshie:3004` (Tailscale)
+- **Mainnet:** `dev-joshie:3005` (Tailscale)
+
+**What it shows:**
+- Live block and view tables with validator health stats
+- Per-block swim-lane timelines showing events across all validators
+- Consensus analytics: gas vs quorum scatter, quorum latency, receive delay heatmap
+- Execution analytics: gas vs build time, build time dumbbell, persistence metrics
+- Nullified (failed) consensus views
+
+**Key pages:**
+| Page | Route | Description |
+|---|---|---|
+| Overview | `/` | Live block + view tables, validator health |
+| Consensus | `/consensus` | Quorum latency, receive delays |
+| Execution | `/execution` | Build times, persistence metrics |
+| Block Detail | `/blocks/:height` | Full event timeline for a committed block |
+| View Detail | `/epoch/:epoch/views/:view` | Full event timeline for a consensus view |
+
+**Validator configs:**
+- [Testnet validators](https://github.com/tempoxyz/valscope/blob/main/apps/api/validators.toml)
+- [Mainnet validators](https://github.com/tempoxyz/valscope/blob/main/apps/api/validators-mainnet.toml)
+
+### BlockScope
+
+Execution-level dashboard for comparing block processing across clients. Shows per-block trace breakdowns, execution timelines, and mempool overlap analysis.
+
+- **Repo:** [tempoxyz/blockscope](https://github.com/tempoxyz/blockscope)
+- **Current deploy:** `dev-alexey:5173` (Tailscale, port-forwarded; being migrated)
+
+**What it shows:**
+- Block-by-block comparison across execution clients (reth, nethermind, ethrex)
+- Per-block execution trace timeline (state root, sub-blocks, EVM execution)
+- Mempool overlap analysis: how much of each block was in the local txpool
+- Per-builder block history with overlap stats
+
+**Key pages:**
+| Page | Route | Description |
+|---|---|---|
+| Overview | `/` | Block comparison table across clients |
+| Block Detail | `/blocks/:height` | Execution breakdown with trace timeline |
+| Mempool | `/mempool` | Gas usage vs overlap scatter plot |
+| Builder Detail | `/builder/:name` | Per-builder block history |
+
+## Data Sources
+
+All endpoints are internal Tailscale hostnames; access requires being on the Tempo tailnet.
+
+| Service | Env Var | What it does | Testnet | Mainnet |
+|---|---|---|---|---|
+| External VLogs | `VLOGS_URL` | Logs from partner/external validators (VictoriaLogs) | `dev-euw-vl-partners.tail388b2e.ts.net` | _(none)_ |
+| Internal VLogs | `VLOGS_INTERNAL_URL` | Logs from Tempo's own nodes: structured reth output during build/validation (VictoriaLogs) | `stg-nae-vl-internal.tail388b2e.ts.net` | `prd-nae-vl-internal.tail388b2e.ts.net` |
+| VM External | `VM_EXTERNAL_URL` | Prometheus-style metrics (block times, gas, peers) from partner nodes (VictoriaMetrics) | `dev-euw-vm-partners.tail388b2e.ts.net` | same (namespace-filtered) |
+| VM Internal | `VM_INTERNAL_URL` | Prometheus-style metrics (CPU, memory, block processing) from Tempo's own nodes (VictoriaMetrics) | `stg-nae-vm-internal.tail388b2e.ts.net` | `prd-nae-vm-internal.tail388b2e.ts.net` |
+| Tempo Traces | `TEMPO_URL` | Distributed traces/spans that power execution timeline breakdowns. **Internal nodes only.** (Grafana Tempo) | `stg-nae-grafana-tempo.tail388b2e.ts.net` | `prd-nae-grafana-tempo.tail388b2e.ts.net` |
+| Namespace | `NETWORK` | Cluster/namespace selector | `moderato-stable` | `tempo-mainnet-stable` |
+
+## Known Outlier Patterns
+
+Issues surfaced through monitoring:
+
+- **Execution cache mutex contention**: `Updated execution cache` blocked for 400ms+ during fork/reorg scenarios. Tracked in [RETH-498](https://linear.app/tempoxyz/issue/RETH-498)
+- **Late build start**: building starts after the view has already begun, reducing available build time. See [tempo#2952](https://github.com/tempoxyz/tempo/pull/2952)
+- **Persistence during building**: disk persistence overlapping with block building, observed on memory-constrained machines
+- **Long-running newPayload**: inability to cancel an in-progress `newPayload` execution
+
+## Limitations
+
+- **External validators have no traces/spans**: only logs and metrics are available for partner nodes. Detailed execution breakdowns (Grafana Tempo) are internal-only.
+- **New instrumentation requires a release**: adding new spans or logs to testnet/mainnet requires shipping a new Tempo version. Existing instrumentation must be used until then.
+- **ValScope log parsing limitations**: ValScope currently parses log lines with regex, which can be slow and sometimes misses events that need timestamp-based correlation.
diff --git a/vocs.config.ts b/vocs.config.ts
index 9ebdd60..f5ee342 100644
--- a/vocs.config.ts
+++ b/vocs.config.ts
@@ -551,6 +551,16 @@ export default defineConfig({
           },
         ],
       },
+      {
+        text: 'Observability',
+        collapsed: true,
+        items: [
+          {
+            text: 'Builder Observability',
+            link: '/guide/observability/builder-observability',
+          },
+        ],
+      },
       // {
       //   text: 'Infrastructure & Tooling',
       //   items: [