Data routing: OTel vs Loki vs Prometheus (and scale at ~1k users)

Purpose: Single place for what data goes to which backend, why, and rough sizing when many users share one Grafana stack. Detail on Loki labels and events: docs/GRAFANA-LOGGING.md. Scaling patterns: docs/observability-scaling.md. SDK variable names: docs/IRACING-TELEMETRY.md. When/where iRacing exposes data (live vs replay vs post-results): docs/IRACING-DATA-AVAILABILITY.md. iRacing SDK specific metrics vs events schema: docs/IRACING-OBSERVABILITY-STRATEGY.md.

1. Routing decisions (approved)

These are the intended paths for SimSteward observability data. Implementation may lag; this doc is the target architecture.

Decision	Choice
Domain events (incidents, actions, session summaries, digests, lifecycle)	Loki (structured logs / NDJSON → HTTP push).
Low-rate resource / health (`host_resource_sample`, plugin lifecycle)	Loki today; optional Prometheus/Mimir gauges later for native SLO-style alerts on CPU/memory.
High-frequency / per-tick car telemetry (e.g. 60 Hz channels, per-car arrays)	Not primary path to Loki. OpenTelemetry metrics (in-process SDK → OTLP) → Prometheus-compatible backend (Prometheus or Grafana Mimir).
Traces (spans across plugin + backend)	Optional OTel traces → Tempo/Jaeger — not required for v1.

OpenTelemetry here means instrumentation and export (SDK + OTLP), not a storage product. Storage is whatever backs metrics (Prometheus, Mimir, etc.) and optionally Tempo/Loki depending on collector config.

Grafana is the visualization layer; it is not the scaling bottleneck.

Local stack (metrics path): The repo’s Docker Compose stack runs OpenTelemetry Collector (OTLP gRPC/HTTP → Prometheus scrape endpoint) and Prometheus alongside Loki and Grafana. The SimHub plugin sends OTLP metrics when OTEL_EXPORTER_OTLP_ENDPOINT or SIMSTEWARD_OTLP_ENDPOINT is set (no in-process /metrics HTTP server — avoids HttpListener constraints). End-to-end steps and ports: docs/observability-local.md § Metrics.

2. Decision matrix

Data class	Examples	Target	Rationale
Domain events	`incident_detected`, `action_result`, `session_end_datapoints_*`, `session_digest`	Loki	Irregular, rich JSON; query by `event` + body fields.
Low-rate resource / health	`host_resource_sample`, `plugin_ready`	Loki; optional duplicate as gauges in Prom/Mimir	Loki is fine at ~1/min; Prom better for alert rules on % CPU / memory.
Car / sim telemetry (high rate)	Per-tick `Speed`, `RPM`, tire temps, `CarIdx*` arrays at 60 Hz	OTel metrics → Prometheus/Mimir	Time-series shape; aggregation, downsampling, rates — not log lines.
Traces	Optional spans	OTel → Tempo	Adds complexity; defer until needed.

3. Architecture sketch

flowchart TB
  subgraph clients [Per user machine]
    Plugin[SimSteward plugin]
    OTelCol[OTel Collector optional]
    Plugin -->|structured events NDJSON| Loki[Loki]
    Plugin -->|OTLP metrics low agg| OTelCol
  end
  OTelCol --> MetricsDB[Prometheus or Mimir]
  Loki --> Grafana[Grafana]
  MetricsDB --> Grafana

4. Rough sizing at ~1k users

4.1 Loki (logs)

Per-user log budget (order of magnitude) — from docs/GRAFANA-LOGGING.md volume table (~2 h session):

~0.23 MB per session (event-driven; no per-tick logging).
If a heavy user runs ~30 sessions/month → ~7 MB/user/month (0.23 × 30).

~1k users (all active, similar usage):

~7 GB/month ingested logs (1000 × 7 MB), order-of-magnitude before deduplication, sampling, or varying usage.
Ingestion rate: Batches are small (< 20 KB typical); total MB/s depends on how many users peak at once — size the tenant (ingestion cap, storage, retention) to users × peak sessions × batch frequency, not average only.
Streams: With the 4-label schema ( docs/GRAFANA-LOGGING.md ), active streams stay ≪ 5,000; the risk is total GB/month and MB/s, not stream count.

Grafana Cloud free tier (indicative: ~50 GB/month, 5 MB/s, 14-day retention — see docs/GRAFANA-LOGGING.md): ~1k full-time users with the stated per-session budget can exceed “hobby” comfort without a sizing pass. Treat paid Grafana Cloud capacity or self-hosted Loki as a billing/ops decision once user counts and session rates are known.

4.2 Prometheus / Mimir (metrics)

Risk at scale: Cardinality — not Grafana UI.

Do not put user_id, session_id, or car_idx on every series as labels → millions of series → cost and slow queries.
Do: Low-cardinality labels (env, region, app, tier); aggregate at OTel collector or edge; put high-cardinality IDs in exemplars, logs (Loki), or recording rules.

4.3 What breaks at ~1k users

Anti-pattern	Effect
Per-tick car telemetry as Loki log lines	Volume and query cost explode; fights Loki’s model.
High-cardinality labels on metrics	Prometheus/Mimir series explosion.
Assuming free tier limits = production headroom	May need paid or self-hosted before launch at volume.

5. Metric taxonomy: car telemetry (OTel vs Loki)

Rule of thumb: Anything sampled or emitted every tick (~60 Hz) or per-car per-tick belongs in the OTel metrics → Prometheus/Mimir path when exported — not as a high-rate Loki log stream.

5.1 OTel metrics path (when implemented)

Candidate signals (names are illustrative; align with OTel semantic conventions when implementing):

Category	SDK variables (see docs/IRACING-TELEMETRY.md; availability docs/IRACING-DATA-AVAILABILITY.md)	Notes
Motion / driver	`Speed`, `RPM`, `Throttle`, `Brake`, `Clutch`, `Gear`, `SteeringWheelAngle`, `LatAccel`, `LongAccel`, `VertAccel`, `Yaw`/`Pitch`/`Roll`/`YawRate`…	Export aggregated or downsampled (e.g. 1–5 Hz) with low-cardinality labels only.
Tires	`LFtempCL/CM/CR`, …, `{c}pressure`, `{c}wear*`, `{c}rideHeight`	Per-corner; avoid per-session labels on every series.
Engine / fuel	`FuelLevel`, `FuelLevelPct`, `FuelUsePerHour`, `OilTemp`, `WaterTemp`, `ManifoldPress`, …	Gauges / rates.
Lap / position	`LapDistPct`, `Lap`, `LapCurrentLapTime`, `LapLastLapTime`, `LapBestLapTime`	Histograms or gauges depending on use case.
Per-car arrays	`CarIdxLap`, `CarIdxPosition`, `CarIdxRPM`, `CarIdxGear`, `CarIdxLapDistPct`, …	High cardinality if labeled per car per user — prefer aggregation (e.g. leader lap, player car only) or sampled subset.

Implementation guardrails: Max export rate (e.g. 1–5 Hz), bounded metric count, no session_id as a required label on every series.

5.2 Loki path (session-scoped / event-only)

These stay events or throttled snapshots in structured logs — not a mirror of full 60 Hz telemetry:

Kind	Examples already in docs/GRAFANA-LOGGING.md
Incidents	`incident_detected` — YAML delta, replay context in JSON body.
Session / results	`session_end_datapoints_session`, `session_end_datapoints_results`, `session_digest`, `session_summary_captured`.
Actions	`action_dispatched`, `action_result`, `dashboard_ui_event` (when enabled).
Health	`host_resource_sample` (~1/min).
Telemetry in logs	Single snapshot fields on session-end events (e.g. `telemetry_` at capture) — not* continuous per-tick streams.

5.3 Summary

Telemetry style	Backend
Continuous high-rate channels	OTel metrics → Prometheus/Mimir (future).
Events, boundaries, chunked session results, resource samples	Loki (current design).

6. Follow-ups (implementation)

Define metric names, units, and max export rate for the first OTel slice (e.g. player car only).
Choose Mimir vs Prometheus for long-term multi-tenant scale.
Add budget alerts on Loki ingestion GB/day before stepping up user counts.

References

docs/observability-local.md — Local Grafana/Loki/Prometheus compose, OTLP env, smoke queries.
docs/GRAFANA-LOGGING.md — Loki schema, volume table, events.
docs/IRACING-OBSERVABILITY-STRATEGY.md — iRacing SDK telemetry mapping to Prometheus metrics & Loki events.
docs/observability-scaling.md — Many users, central Loki, label rules.
docs/IRACING-TELEMETRY.md — SDK variables and categories.
docs/IRACING-DATA-AVAILABILITY.md — when/where data exists (telemetry vs YAML vs REST).
Grafana: Loki label best practices, Prometheus cardinality.

ContextStream KB links

Spec	Doc ID
iRacing Observability Strategy	`c54019c3-4e79-461a-a9b6-eb533a2c5e44`
Observability — Scaling	`99bd9e71-2b08-4eea-b2d4-f7bb22b38af0`
Grafana Loki (summary)	`58a20aaf-bdde-4318-88f7-1ec8ec44377b`
iRacing Telemetry — SDK Variable Reference	`42ab06d4-9ed3-43a1-996c-bd0250ecdf6e`
iRacing Data Availability Reference	TBD after ContextStream `project(ingest_local)` — update this row with doc ID from index
Architecture & Data Structures	`c453dd83-dfd9-4002-b8a2-2e0c8a4d032c`
Troubleshooting	`88274879-cd2d-4d86-9766-c86b88f95cfe`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data routing: OTel vs Loki vs Prometheus (and scale at ~1k users)

1. Routing decisions (approved)

2. Decision matrix

3. Architecture sketch

4. Rough sizing at ~1k users

4.1 Loki (logs)

4.2 Prometheus / Mimir (metrics)

4.3 What breaks at ~1k users

5. Metric taxonomy: car telemetry (OTel vs Loki)

5.1 OTel metrics path (when implemented)

5.2 Loki path (session-scoped / event-only)

5.3 Summary

6. Follow-ups (implementation)

References

ContextStream KB links

FilesExpand file tree

DATA-ROUTING-OBSERVABILITY.md

Latest commit

History

DATA-ROUTING-OBSERVABILITY.md

File metadata and controls

Data routing: OTel vs Loki vs Prometheus (and scale at ~1k users)

1. Routing decisions (approved)

2. Decision matrix

3. Architecture sketch

4. Rough sizing at ~1k users

4.1 Loki (logs)

4.2 Prometheus / Mimir (metrics)

4.3 What breaks at ~1k users

5. Metric taxonomy: car telemetry (OTel vs Loki)

5.1 OTel metrics path (when implemented)

5.2 Loki path (session-scoped / event-only)

5.3 Summary

6. Follow-ups (implementation)

References

ContextStream KB links