Purpose: Single place for what data goes to which backend, why, and rough sizing when many users share one Grafana stack. Detail on Loki labels and events: docs/GRAFANA-LOGGING.md. Scaling patterns: docs/observability-scaling.md. SDK variable names: docs/IRACING-TELEMETRY.md. When/where iRacing exposes data (live vs replay vs post-results): docs/IRACING-DATA-AVAILABILITY.md. iRacing SDK-specific metrics vs events schema: docs/IRACING-OBSERVABILITY-STRATEGY.md.
These are the intended paths for SimSteward observability data. Implementation may lag; this doc is the target architecture.
| Decision | Choice |
|---|---|
| Domain events (incidents, actions, session summaries, digests, lifecycle) | Loki (structured logs / NDJSON → HTTP push). |
| Low-rate resource / health (host_resource_sample, plugin lifecycle) | Loki today; optional Prometheus/Mimir gauges later for native SLO-style alerts on CPU/memory. |
| High-frequency / per-tick car telemetry (e.g. 60 Hz channels, per-car arrays) | Not primary path to Loki. OpenTelemetry metrics (in-process SDK → OTLP) → Prometheus-compatible backend (Prometheus or Grafana Mimir). |
| Traces (spans across plugin + backend) | Optional OTel traces → Tempo/Jaeger — not required for v1. |
OpenTelemetry here means instrumentation and export (SDK + OTLP), not a storage product. Storage is whatever backs metrics (Prometheus, Mimir, etc.) and optionally Tempo/Loki depending on collector config.
Grafana is the visualization layer; it is not the scaling bottleneck.
Local stack (metrics path): The repo’s Docker Compose stack runs OpenTelemetry Collector (OTLP gRPC/HTTP → Prometheus scrape endpoint) and Prometheus alongside Loki and Grafana. The SimHub plugin sends OTLP metrics when OTEL_EXPORTER_OTLP_ENDPOINT or SIMSTEWARD_OTLP_ENDPOINT is set (no in-process /metrics HTTP server — avoids HttpListener constraints). End-to-end steps and ports: docs/observability-local.md § Metrics.
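The opt-in behaviour described above (metrics are exported only when one of the two environment variables is set) can be sketched as follows. This is an illustration in Python, not the plugin's actual code. The function name `resolve_otlp_endpoint` is hypothetical, and the precedence between the two variables is an assumption, since the text does not say which one wins when both are set.

```python
import os
from typing import Mapping, Optional


def resolve_otlp_endpoint(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the OTLP endpoint to export to, or None to leave metrics export off."""
    # Assumption: the plugin-specific variable takes precedence over the generic one.
    for name in ("SIMSTEWARD_OTLP_ENDPOINT", "OTEL_EXPORTER_OTLP_ENDPOINT"):
        value = env.get(name, "").strip()
        if value:
            return value
    # Neither variable is set: no exporter is created and no metrics are sent.
    return None


if __name__ == "__main__":
    endpoint = resolve_otlp_endpoint()
    print("OTLP export:", endpoint if endpoint else "disabled")
```

Treating an empty or whitespace-only value as unset keeps a stray blank variable from enabling export by accident.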
| Data class | Examples | Target | Rationale |
|---|---|---|---|
| Domain events | incident_detected, action_result, session_end_datapoints_*, session_digest | Loki | Irregular, rich JSON; query by event + body fields. |
| Low-rate resource / health | host_resource_sample, plugin_ready | Loki; optional duplicate as gauges in Prom/Mimir | Loki is fine at ~1/min; Prom better for alert rules on % CPU / memory. |
| Car / sim telemetry (high rate) | Per-tick Speed, RPM, tire temps, CarIdx* arrays at 60 Hz | OTel metrics → Prometheus/Mimir | Time-series shape; aggregation, downsampling, rates; not log lines. |
| Traces | Optional spans | OTel → Tempo | Adds complexity; defer until needed. |
flowchart TB
subgraph clients [Per user machine]
Plugin[SimSteward plugin]
OTelCol[OTel Collector optional]
Plugin -->|structured events NDJSON| Loki[Loki]
Plugin -->|OTLP metrics low agg| OTelCol
end
OTelCol --> MetricsDB[Prometheus or Mimir]
Loki --> Grafana[Grafana]
MetricsDB --> Grafana
Per-user log budget (order of magnitude) — from docs/GRAFANA-LOGGING.md volume table (~2 h session):
- ~0.23 MB per session (event-driven; no per-tick logging).
- If a heavy user runs ~30 sessions/month → ~7 MB/user/month (0.23 × 30).
~1k users (all active, similar usage):
- ~7 GB/month ingested logs (1000 × 7 MB), order-of-magnitude before deduplication, sampling, or varying usage.
- Ingestion rate: Batches are small (< 20 KB typical); total MB/s depends on how many users peak at once — size the tenant (ingestion cap, storage, retention) to users × peak sessions × batch frequency, not average only.
- Streams: With the 4-label schema ( docs/GRAFANA-LOGGING.md ), active streams stay ≪ 5,000; the risk is total GB/month and MB/s, not stream count.
Grafana Cloud free tier (indicative: ~50 GB/month, 5 MB/s, 14-day retention; see docs/GRAFANA-LOGGING.md): at the stated per-session budget, ~1k full-time users (~7 GB/month) sit under the monthly volume cap on paper, but concurrent peaks against the MB/s limit, heavier-than-estimated usage, and the short retention window can still exceed “hobby” comfort without a sizing pass. Treat paid Grafana Cloud capacity or self-hosted Loki as a billing/ops decision once user counts and session rates are known.
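The sizing arithmetic above can be reproduced with a short calculation. The per-session size, session count, user count, and free-tier caps come from the text. The peak-rate inputs (every user online at once, one 20 KB batch every 5 seconds) are hypothetical worst-case assumptions chosen only to show the shape of the estimate.

```python
MB_PER_SESSION = 0.23       # from the volume table (~2 h session)
SESSIONS_PER_MONTH = 30     # heavy user
USERS = 1_000

FREE_TIER_GB_PER_MONTH = 50  # indicative figures quoted in the text
FREE_TIER_MB_PER_S = 5


def monthly_ingest_gb(users: int, sessions_per_month: int, mb_per_session: float) -> float:
    """Average monthly log volume in decimal GB."""
    return users * sessions_per_month * mb_per_session / 1000


def peak_ingest_mb_s(concurrent_users: int, batch_kb: float, batches_per_second: float) -> float:
    """Peak ingestion rate in MB/s when many users push batches at the same time."""
    return concurrent_users * batch_kb * batches_per_second / 1000


monthly = monthly_ingest_gb(USERS, SESSIONS_PER_MONTH, MB_PER_SESSION)

# Hypothetical peak: all users online, 20 KB batches, one batch every 5 s each.
peak = peak_ingest_mb_s(concurrent_users=USERS, batch_kb=20, batches_per_second=0.2)

print(f"monthly: {monthly:.1f} GB of {FREE_TIER_GB_PER_MONTH} GB")
print(f"peak:    {peak:.1f} MB/s of {FREE_TIER_MB_PER_S} MB/s")
```

Under these assumptions the average volume is comfortable while the peak rate sits close to the cap, which is why the tenant should be sized on peak concurrency rather than on the monthly average alone.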
Risk at scale: Cardinality — not Grafana UI.
- Do not put user_id, session_id, or car_idx on every series as labels → millions of series → cost and slow queries.
- Do: use low-cardinality labels (env, region, app, tier); aggregate at the OTel Collector or edge; put high-cardinality IDs in exemplars, logs (Loki), or recording rules.
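To see why per-user or per-car labels are the risk, note that the worst-case series count for a single metric is the product of the number of distinct values of each label. A minimal sketch follows. All of the cardinalities are hypothetical, including the 64 car slots, which is only an illustrative upper bound.

```python
from math import prod
from typing import Mapping


def series_count(label_cardinalities: Mapping[str, int]) -> int:
    """Worst-case series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical low-cardinality label set.
low = {"env": 2, "region": 3, "app": 1, "tier": 2}

# Same metric with per-user and per-car labels added (illustrative values).
high = {**low, "user_id": 1_000, "car_idx": 64}

print(series_count(low))   # 12
print(series_count(high))  # 768000
```

That is for one metric name; a few dozen metrics labeled this way reach tens of millions of series, before any session_id label is added.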
| Anti-pattern | Effect |
|---|---|
| Per-tick car telemetry as Loki log lines | Volume and query cost explode; fights Loki’s model. |
| High-cardinality labels on metrics | Prometheus/Mimir series explosion. |
| Assuming free tier limits = production headroom | May need paid or self-hosted before launch at volume. |
Rule of thumb: Anything sampled or emitted every tick (~60 Hz) or per-car per-tick belongs in the OTel metrics → Prometheus/Mimir path when exported — not as a high-rate Loki log stream.
Candidate signals (names are illustrative; align with OTel semantic conventions when implementing):
| Category | SDK variables (see docs/IRACING-TELEMETRY.md; availability docs/IRACING-DATA-AVAILABILITY.md) | Notes |
|---|---|---|
| Motion / driver | Speed, RPM, Throttle, Brake, Clutch, Gear, SteeringWheelAngle, LatAccel, LongAccel, VertAccel, Yaw/Pitch/Roll/YawRate… | Export aggregated or downsampled (e.g. 1–5 Hz) with low-cardinality labels only. |
| Tires | LFtempCL/CM/CR, …, {c}pressure, {c}wear*, {c}rideHeight | Per-corner; avoid per-session labels on every series. |
| Engine / fuel | FuelLevel, FuelLevelPct, FuelUsePerHour, OilTemp, WaterTemp, ManifoldPress, … | Gauges / rates. |
| Lap / position | LapDistPct, Lap, LapCurrentLapTime, LapLastLapTime, LapBestLapTime | Histograms or gauges depending on use case. |
| Per-car arrays | CarIdxLap, CarIdxPosition, CarIdxRPM, CarIdxGear, CarIdxLapDistPct, … | High cardinality if labeled per car per user; prefer aggregation (e.g. leader lap, player car only) or sampled subset. |
Implementation guardrails: Max export rate (e.g. 1–5 Hz), bounded metric count, no session_id as a required label on every series.
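One way to enforce a maximum export rate is to fold per-tick samples into a single aggregate per export interval before anything reaches the exporter. The sketch below is illustrative only: the `Downsampler` class, its field names, and the choice of mean/max/count as aggregates are assumptions, not the plugin's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Downsampler:
    """Collapse per-tick samples into one aggregate per export interval."""
    interval_s: float = 1.0              # 1.0 s window gives a 1 Hz export rate
    _window_start: Optional[float] = None
    _sum: float = 0.0
    _max: float = float("-inf")
    _count: int = 0

    def add(self, t: float, value: float) -> Optional[dict]:
        """Record a sample at time t (seconds); return an aggregate when a window closes."""
        out = None
        if self._window_start is None:
            self._window_start = t
        elif t - self._window_start >= self.interval_s:
            # The window is full: emit its aggregate and start a new one at t.
            out = {"mean": self._sum / self._count, "max": self._max, "count": self._count}
            self._window_start = t
            self._sum, self._max, self._count = 0.0, float("-inf"), 0
        self._sum += value
        self._max = max(self._max, value)
        self._count += 1
        return out


if __name__ == "__main__":
    ds = Downsampler(interval_s=1.0)
    for tick in range(181):              # three seconds of 60 Hz samples
        agg = ds.add(tick / 60, float(tick))
        if agg is not None:
            print(agg)                   # one line per second instead of 60
```

Only the returned aggregates would be handed to the metrics SDK, so the export rate is bounded by `interval_s` regardless of the tick rate.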
These stay events or throttled snapshots in structured logs — not a mirror of full 60 Hz telemetry:
| Kind | Examples already in docs/GRAFANA-LOGGING.md |
|---|---|
| Incidents | incident_detected — YAML delta, replay context in JSON body. |
| Session / results | session_end_datapoints_session, session_end_datapoints_results, session_digest, session_summary_captured. |
| Actions | action_dispatched, action_result, dashboard_ui_event (when enabled). |
| Health | host_resource_sample (~1/min). |
| Telemetry in logs | Single snapshot fields on session-end events (e.g. telemetry_* at capture) — not continuous per-tick streams. |
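For orientation, the events in the table above travel as JSON log lines in the body of Loki's HTTP push endpoint (`POST /loki/api/v1/push`). The sketch below only builds that body and does not send it. The label names and the event fields other than `event` are hypothetical placeholders; the real 4-label schema is defined in docs/GRAFANA-LOGGING.md.

```python
import json
import time
from typing import Iterable, Mapping, Optional


def loki_push_payload(labels: Mapping[str, str],
                      events: Iterable[Mapping[str, object]],
                      now_ns: Optional[int] = None) -> dict:
    """Build the JSON body for Loki's push API from a batch of event dicts."""
    ts = now_ns if now_ns is not None else time.time_ns()
    # Each value is [timestamp in ns as a string, log line]; one JSON object per line.
    # The +i offset keeps timestamps distinct within the batch.
    values = [
        [str(ts + i), json.dumps(event, separators=(",", ":"))]
        for i, event in enumerate(events)
    ]
    return {"streams": [{"stream": dict(labels), "values": values}]}


if __name__ == "__main__":
    # Hypothetical labels and fields, for illustration only.
    body = loki_push_payload(
        {"app": "sim-steward", "env": "local"},
        [{"event": "incident_detected", "car_idx": 12},
         {"event": "action_result", "ok": True}],
    )
    print(json.dumps(body, indent=2))
```

High-cardinality values such as `car_idx` stay inside the JSON line rather than in the stream labels, which keeps the stream count low.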
| Telemetry style | Backend |
|---|---|
| Continuous high-rate channels | OTel metrics → Prometheus/Mimir (future). |
| Events, boundaries, chunked session results, resource samples | Loki (current design). |
- Define metric names, units, and max export rate for the first OTel slice (e.g. player car only).
- Choose Mimir vs Prometheus for long-term multi-tenant scale.
- Add budget alerts on Loki ingestion GB/day before stepping up user counts.
- docs/observability-local.md — Local Grafana/Loki/Prometheus compose, OTLP env, smoke queries.
- docs/GRAFANA-LOGGING.md — Loki schema, volume table, events.
- docs/IRACING-OBSERVABILITY-STRATEGY.md — iRacing SDK telemetry mapping to Prometheus metrics & Loki events.
- docs/observability-scaling.md — Many users, central Loki, label rules.
- docs/IRACING-TELEMETRY.md — SDK variables and categories.
- docs/IRACING-DATA-AVAILABILITY.md — when/where data exists (telemetry vs YAML vs REST).
- Grafana: Loki label best practices, Prometheus cardinality.
| Spec | Doc ID |
|---|---|
| iRacing Observability Strategy | c54019c3-4e79-461a-a9b6-eb533a2c5e44 |
| Observability — Scaling | 99bd9e71-2b08-4eea-b2d4-f7bb22b38af0 |
| Grafana Loki (summary) | 58a20aaf-bdde-4318-88f7-1ec8ec44377b |
| iRacing Telemetry — SDK Variable Reference | 42ab06d4-9ed3-43a1-996c-bd0250ecdf6e |
| iRacing Data Availability Reference | TBD after ContextStream project(ingest_local) — update this row with doc ID from index |
| Architecture & Data Structures | c453dd83-dfd9-4002-b8a2-2e0c8a4d032c |
| Troubleshooting | 88274879-cd2d-4d86-9766-c86b88f95cfe |