Structured logging for SimSteward (Grafana Loki / Grafana Cloud or local Docker). All logs are event-driven; there is no per-tick logging in production.

Implemented today: Plugin → PluginLogger.Structured() → plugin-structured.jsonl (NDJSON on disk), with the same entries mirrored to the dashboard over WebSocket. This repository's plugin does not yet HTTP POST log lines to Loki — SIMSTEWARD_LOKI_URL is used for routing metadata in JSON (loki_push_target), optional read paths (e.g. data-capture suite verification), and scripts: deploy.ps1 posts a single deploy_marker via send-deploy-loki-marker.ps1 when the URL is set. To see full plugin logs in Loki today, run an external shipper that tails plugin-structured.jsonl into your stack, or add in-process batch POST later (see docs/observability-scaling.md). If WebSocket log sends fail, the plugin writes to broadcast-errors.log (see docs/TROUBLESHOOTING.md §4b).

Explore, custom panels, and AI tooling use the 4-label schema and fixed event taxonomy below. Data routing: docs/DATA-ROUTING-OBSERVABILITY.md. Local stack: docs/observability-local.md.
Loki stream (when ingested): Do not filter at the push path — ship the same lines you write to plugin-structured.jsonl.
Filtering is dashboard-only. The web dashboard receives the full stream via WebSocket and applies level/event visibility filters for display only (checkboxes and hiddenLevels / hiddenEvents). Toggling "hide DEBUG" or hiding specific event types in the dashboard shows or hides entries that are already in the stream; nothing is dropped at the plugin.
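A minimal sketch of that display-only filtering, assuming the entry shape from the NDJSON schema (the real dashboard code may differ):

```javascript
// Display-only filtering: the full stream is kept; filters decide visibility.
// Sketch only: field names (level, event) follow the NDJSON schema in this doc;
// hiddenLevels / hiddenEvents mirror the dashboard checkboxes.
function visibleEntries(entries, hiddenLevels, hiddenEvents) {
  return entries.filter(
    (e) => !hiddenLevels.has(e.level) && !hiddenEvents.has(e.event)
  );
}

const stream = [
  { level: 'DEBUG', event: 'tick_stats' },
  { level: 'INFO', event: 'action_result' },
  { level: 'WARN', event: 'bridge_start_failed' },
];

// "Hide DEBUG" hides entries for display; the stream itself is untouched.
const shown = visibleEntries(stream, new Set(['DEBUG']), new Set());
console.log(shown.length); // 2 shown; stream.length is still 3
```

Unchecking the filter simply re-runs the same function with a smaller hidden set; nothing needs to be re-fetched.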
Env label: Set SIMSTEWARD_LOG_ENV=local for local dev or SIMSTEWARD_LOG_ENV=production (default); this flows into JSON as log_env / routing hints. The dashboard WebSocket stream is full; Loki reflects whatever you ingest from plugin-structured.jsonl (or future in-process POST). Volume is controlled by event-driven logging (no per-tick logs) and by the dashboard display filter.
| Limit | Value | Impact |
|---|---|---|
| Ingestion rate | 5 MB/s per user | Batches are typically < 20 KB; stay well below. |
| Active streams | 5,000 | Our 4-label schema yields < 32 streams. |
| Retention | 14 days | Two weeks of sessions queryable. |
| Max line size | 256 KB (hard) | Target < 800 bytes per line; self-impose 8 KB max. |
| Label names per series | 15 | We use 4 labels. |
| Label value length | 2,048 chars | No concern with static label values. |
Volume allowance: free tier ~50 GB/month; our budget is < 1 GB/month.
Stream count and labels stay bounded (four labels only; no session_id or driver_id as labels). Session-end results with 100–200+ drivers use chunked session_end_datapoints_results (35 drivers per line); merge chunks in Grafana. Many SimSteward users can send to one central Loki (each instance ships to the same endpoint); use an optional bounded instance_id label if you need tenancy in queries. Do not log per-driver per-tick in Loki; use metrics (OTel) for high-frequency telemetry. Full stream/volume math, label rules, and query patterns: docs/observability-scaling.md.
| Source | Logs / session | Bytes / entry | MB / session |
|---|---|---|---|
| Action commands (2 lines per action) | ~440 | ~400 B | ~0.18 |
| Incidents | ~50 | ~600 B | ~0.03 |
| Lifecycle / iRacing | ~15 | ~300 B | < 0.01 |
| WS client connect/disconnect | ~10 | ~300 B | < 0.01 |
| Errors / warnings | ~10 | ~350 B | < 0.01 |
| Total | ~525 | — | ~0.23 MB |
At 30 sessions/month: ~7 MB. Never log on a tick; DataUpdate() runs at 60 Hz.
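The per-session budget above is simple arithmetic over the table's estimates (a sketch; the counts and byte sizes are the estimates from the table, not measured values):

```javascript
// Rough volume math for one session, using the table's estimated counts/sizes.
const sources = [
  { name: 'action_commands', count: 440, bytes: 400 },
  { name: 'incidents', count: 50, bytes: 600 },
  { name: 'lifecycle', count: 15, bytes: 300 },
  { name: 'ws_clients', count: 10, bytes: 300 },
  { name: 'errors_warnings', count: 10, bytes: 350 },
];

const totalBytes = sources.reduce((sum, s) => sum + s.count * s.bytes, 0);
const mbPerSession = totalBytes / 1e6;
const mbPerMonth = mbPerSession * 30; // ~30 sessions/month

console.log(mbPerSession.toFixed(2)); // 0.22 (the table rounds to ~0.23)
console.log(mbPerMonth.toFixed(1));   // 6.5 (~7 MB/month, as above)
```

Either way, the total sits several orders of magnitude under the ~50 GB/month free-tier allowance.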
Four labels only. Do not put high-cardinality values (session_id, car_number, action, correlation_id) in labels—they stay in the JSON body.
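A sketch of the label/body split (hypothetical helper; the four-label set is the schema in this doc, everything else is illustrative):

```javascript
// Split a structured entry into the 4 bounded Loki labels and the JSON body.
// High-cardinality values (session_id, action, correlation_id, ...) stay in
// the body so stream count stays bounded.
const LABEL_KEYS = ['app', 'env', 'component', 'level'];

function toLokiEntry(entry) {
  const labels = {};
  const body = {};
  for (const [k, v] of Object.entries(entry)) {
    (LABEL_KEYS.includes(k) ? labels : body)[k] = v;
  }
  return { labels, body };
}

const { labels, body } = toLokiEntry({
  app: 'sim-steward', env: 'local', component: 'simhub-plugin', level: 'INFO',
  event: 'action_result', action: 'play', correlation_id: 'abc-123',
  session_id: 's-42', success: true, duration_ms: 12,
});
console.log(Object.keys(labels).length); // 4
console.log('session_id' in labels);     // false (body only)
```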
| app | Audience | What it covers |
|---|---|---|
| sim-steward | Product / runtime | C# plugin, dashboard, deploy |
| claude-dev-logging | Dev tooling observability | Claude Code hooks, MCP server instrumentation |
| Label | Values | Rationale |
|---|---|---|
| app | sim-steward | Product namespace. |
| env | production or local | From SIMSTEWARD_LOG_ENV. |
| component | simhub-plugin, bridge, tracker, dashboard, deploy | Subsystem. |
| level | INFO, WARN, ERROR, DEBUG | Severity. DEBUG only when SIMSTEWARD_LOG_DEBUG=1. |
| Label | Values | Rationale |
|---|---|---|
| app | claude-dev-logging | Dev tooling namespace. |
| env | local or dev | From SIMSTEWARD_LOG_ENV. |
| component | tool, mcp-contextstream, mcp-sentry, mcp-ollama, lifecycle, agent, user, other | Subsystem. |
| level | INFO, WARN, ERROR | Severity. |
The hook logger (~/.claude/hooks/loki-log.js) buckets by hook type: tool hooks for non-MCP tools use component=tool; MCP tools use component=mcp-<service>; session/compact/stop use lifecycle; subagent/task use agent; prompt/notification/permission use user. MCP service is also in the JSON body service field: {app="claude-dev-logging"} | json | service="contextstream".
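The bucketing above can be sketched as follows (hook-type and tool-name strings here are illustrative assumptions; the real loki-log.js may key off different fields):

```javascript
// Map a Claude Code hook event + tool name to the bounded component label.
// Sketch only: hook/tool naming is illustrative, not the exact hook API.
function componentFor(hookType, toolName = '') {
  const mcp = toolName.match(/^mcp__([^_]+)/); // e.g. mcp__contextstream__search
  if (/^(PreToolUse|PostToolUse|PostToolUseFailure)$/.test(hookType)) {
    return mcp ? `mcp-${mcp[1]}` : 'tool';
  }
  if (/^(SessionStart|SessionEnd|PreCompact|Stop)$/.test(hookType)) return 'lifecycle';
  if (/^Subagent/.test(hookType)) return 'agent';
  if (/^(UserPromptSubmit|Notification|Permission)/.test(hookType)) return 'user';
  return 'other';
}

console.log(componentFor('PostToolUse', 'mcp__contextstream__search')); // mcp-contextstream
console.log(componentFor('PreToolUse', 'Bash'));                        // tool
console.log(componentFor('SessionEnd'));                                // lifecycle
```

Keeping the MCP service name in the component label (bounded set of servers) while the full tool name stays in the JSON body preserves the low-cardinality label rule.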
Derived from stdin + temp-file timestamp correlation ($TMPDIR/claude-hook-timing/):
| Field | Hook types | Description |
|---|---|---|
| duration_ms | PostToolUse, PostToolUseFailure | Wall-clock tool execution time (Pre→Post diff). |
| tool_input_bytes | PreToolUse, PostToolUse, PostToolUseFailure | Buffer.byteLength(JSON.stringify(tool_input)). |
| tool_response_bytes | PostToolUse, PostToolUseFailure | Buffer.byteLength(JSON.stringify(tool_response)). |
| is_retry | PreToolUse | true when same tool+input hash seen within 10 s. |
| retry_of | PreToolUse | tool_use_id of the previous identical call. |
| error_type | PostToolUseFailure | timeout, permission_denied, not_found, connection_refused, rate_limited, unknown. |
| agent_depth | SubagentStart | Count of concurrently open agents in the session. |
| agent_duration_ms | SubagentStop | Wall-clock subagent lifetime. |
| session_duration_ms | SessionEnd | Wall-clock session lifetime. |
| compaction_count | PreCompact, SessionEnd | Number of compactions in this session. |
| user_think_time_ms | UserPromptSubmit | Time since last tool completion. |
Stale files (>5 min) are cleaned on each PreToolUse and SessionStart. Retry markers expire after 10 s.
On SessionEnd, the hook reads transcript_path (JSONL) and writes aggregated metrics to {cwd}/logs/claude-session-metrics.jsonl (tailed by Alloy). No conversation content — only: total_input_tokens, total_output_tokens, total_cache_creation_tokens, total_cache_read_tokens, assistant_turns, tool_use_count, model, session_duration_ms, compaction_count.
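A sketch of that SessionEnd aggregation (the transcript line shape here — type and usage fields — is an assumption for illustration; only counters are kept, never conversation content):

```javascript
// Aggregate a JSONL transcript into token/turn metrics without content.
// Sketch only; the real transcript format may differ.
function aggregateTranscript(jsonlText) {
  const m = {
    total_input_tokens: 0, total_output_tokens: 0,
    assistant_turns: 0, tool_use_count: 0,
  };
  for (const line of jsonlText.split('\n')) {
    if (!line.trim()) continue;
    const row = JSON.parse(line);
    if (row.type === 'assistant') {
      m.assistant_turns += 1;
      m.total_input_tokens += row.usage?.input_tokens ?? 0;
      m.total_output_tokens += row.usage?.output_tokens ?? 0;
    } else if (row.type === 'tool_use') {
      m.tool_use_count += 1;
    }
  }
  return m;
}

const sample = [
  '{"type":"assistant","usage":{"input_tokens":100,"output_tokens":40}}',
  '{"type":"tool_use"}',
  '{"type":"assistant","usage":{"input_tokens":120,"output_tokens":60}}',
].join('\n');

console.log(aggregateTranscript(sample));
// { total_input_tokens: 220, total_output_tokens: 100,
//   assistant_turns: 2, tool_use_count: 1 }
```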
Every log line has an event field. Key events:
| Event | Component | Key fields | Notes |
|---|---|---|---|
| logging_ready | simhub-plugin | — | First log after logger creation; init continues. |
| settings_saved | simhub-plugin | — | UI settings persisted. |
| file_tail_ready | simhub-plugin | path | Structured log file path ready for Loki ingestion (outside plugin). |
| plugin_started | simhub-plugin | — | SimSteward plugin starting; tracker callback set. |
| actions_registered | simhub-plugin | — | SimHub properties and actions registered. |
| bridge_starting | simhub-plugin | bind, port | WebSocket bridge starting. |
| bridge_start_failed | simhub-plugin | bind, port, error | WebSocket server failed to start (WARN). |
| plugin_ready | simhub-plugin | ws_port, env | Lifecycle readiness. |
| deploy_marker | simhub-plugin | deploy_status (ok \| failed), post_deploy_warn, detail, machine, simhub_path | Not from the in-process plugin — one line at end of deploy.ps1 via scripts/send-deploy-loki-marker.ps1 when SIMSTEWARD_LOKI_URL is set. WARN level if post_deploy_warn (post-deploy tests/*.ps1 failed after retry). Use Grafana dashboard Sim Steward — Deploy health (simsteward-deploy-health). |
| host_resource_sample | simhub-plugin | process_cpu_pct, process_working_set_mb, process_private_mb, gc_heap_mb, process_threads, disk_root, disk_total_gb, disk_free_gb, disk_used_pct, ws_clients, sample_interval_sec | ~1/min (default): SimHub process CPU (share of all logical CPUs), memory, managed heap, and usage of the drive that hosts plugin data. Tune interval with SIMSTEWARD_RESOURCE_SAMPLE_SEC (15–3600). Use Explore time series on numeric fields to spot spikes; rising process_working_set_mb / gc_heap_mb over hours suggests growth (not necessarily a leak — correlate with sessions). |
| log_streaming_subscribed | simhub-plugin | — | Dashboard log streaming attached. |
| irsdk_started | simhub-plugin | — | iRacing SDK started. |
| replay_incident_index_sdk_ready | simhub-plugin | irsdk_connected, update_interval_ms, log_env, loki_push_target | Milestone 1 (TR-001): IRSDK memory map connected; emitted on OnConnected after iracing_connected. |
| replay_incident_index_session_context | simhub-plugin | sim_mode, subsession_id, parent_session_id, session_num, track_display_name, is_replay_mode, session_yaml_fingerprint_sha256_16 (first 16 hex chars of SHA-256 of raw SessionInfoYaml), session_yaml_length, session_info_update (IRSDK SessionInfoUpdate), log_env, loki_push_target | Milestone 1 (TR-002/003): parsed WeekendInfo from session YAML on OnSessionInfo; throttled per (SubSessionID, SessionNum, SimMode). WARN when subsession_id is set and is_replay_mode is false (loaded session but not replay). |
| replay_incident_index_started | simhub-plugin | saved_replay_frame_before_seek, target_play_speed, spine fields | Milestone 2 (TR-004): replay_incident_index_build action start queued; on next OnTelemetryData tick, ReplaySearch(ToStart) issued. |
| replay_incident_index_baseline_ready | simhub-plugin | replay_frame_num_end, car_idx_session_flags (int[64] TR-005), player_car_my_incident_count_baseline (TR-006), spine fields | Milestone 2: baseline at stable ReplayFrameNum==0 before fast-forward. |
| replay_incident_index_fast_forward_started | simhub-plugin | replay_play_speed_requested, replay_play_speed_telemetry, effective_sample_hz_vs_session_time (NFR-008), sdk_update_interval_ms, spine fields | Milestone 2 (TR-008/009): ReplaySetPlaySpeed applied. |
| replay_incident_index_fast_forward_complete | simhub-plugin | index_build_time_ms, fast_forward_telemetry_samples, completion_reason (replay_finished \| paused_or_stopped), replay_play_speed, effective_sample_hz_vs_session_time, replay_frame_num_at_end, replay_frame_num_end, replay_session_time, detected_incident_samples, fast_repair_delta_events, spine fields | Milestone 2 (TR-010/011): IsReplayPlaying became false; playback restored to 1×. M3: counts reflect ReplayIncidentIndexDetector output during fast-forward (TR-012–TR-018). |
| replay_incident_index_detection | simhub-plugin | fingerprint (TR-020 v1 hex, same as JSON row), car_idx, session_time_ms, detection_source (repair_flag \| furled_flag \| player_incident_count), incident_points (int or null), replay_frame, replay_session_time, spine fields | Milestone 5 (TR-028): one line per primary detection during fast-forward; not emitted on every 60Hz tick. |
| replay_incident_index_build_error | simhub-plugin | error (seek_start_timeout, …), spine fields | Milestone 2: seek timeout or speed command failure (WARN). |
| replay_incident_index_build_cancelled | simhub-plugin | reason, spine fields | Milestone 2: replay_incident_index_build cancel or disconnect during build. |
| replay_incident_index_validation_summary | simhub-plugin | output_path, index_build_time_ms_total, detected_incident_rows, yaml_results_available, yaml_session_num_used, yaml_parse_error (when parse fails), discrepancy_count, camera_seek_attempted, camera_seek_matches, camera_seek_match_percent, spine fields | Milestone 4 (TR-023–TR-025): after fast-forward and optional per-row ReplaySearchSessionTime + cooldown, JSON index written (TR-019); YAML vs detection discrepancies; camera match rate. |
| replay_incident_index_record_started | simhub-plugin | record_file, subsession_id, spine fields | Milestone 6 (TR-038): dashboard replay_incident_index_record on; 60Hz samples go to NDJSON (not per-tick Loki). |
| replay_incident_index_record_stopped | simhub-plugin | reason (user_off \| iracing_disconnected \| plugin_end), spine fields | Record mode ended; writer closed. |
| replay_incident_index_record_window | simhub-plugin | telemetry_ticks (=60), record_file, subsession_id, spine fields | ~1/s wall time while record mode on: confirms high-frequency file writes for TR-040 volume panels (not 60Hz Loki lines). |
| plugin_stopped | simhub-plugin | — | Emitted from End(). |
| iracing_connected / iracing_disconnected | simhub-plugin | — | IRSDK connection state. |
| ws_client_connected / ws_client_disconnected | bridge | client_ip, client_count | Each connect/disconnect. |
| dashboard_opened | bridge | client_ip, client_count | When a dashboard client connects (page load or refresh). |
| ws_client_rejected | bridge | client_ip, reason | Token missing or invalid. |
| action_received | bridge | action, arg, client_ip, correlation_id | Logged before DispatchAction. In production, omitted by default; enable "Log all action traffic" in settings or SIMSTEWARD_LOG_ALL_ACTIONS=1 to keep. |
| action_dispatched | simhub-plugin | action, arg, correlation_id, subsession_id, parent_session_id, session_num, track_display_name, lap, log_env, loki_push_target, plus spine session_id / replay_frame when set; session_yaml_fingerprint_sha256_16 when session YAML is available (SHA-256 prefix of SessionInfoYaml, recomputed when SessionInfoUpdate changes) | Start of every command. subsession_id = iRacing WeekendInfo.SubSessionID when > 0, else "not in session". parent_session_id = WeekendInfo.SessionID when > 0, else "not in session". session_num = telemetry SessionNum when connected, else "not in session". lap = telemetry CarIdxLap for the focus car (CamCarIdx if valid, else PlayerCarIdx); -1 when unknown/disconnected. log_env = SIMSTEWARD_LOG_ENV or unset. loki_push_target = disabled \| grafana_cloud \| local_or_custom from SIMSTEWARD_LOKI_URL (same env the Loki sink uses — set before SimHub starts, e.g. launcher loading .env). In production, omitted by default; enable "Log all action traffic" to keep. |
| action_result | simhub-plugin | Same session/routing fields as action_dispatched where applicable, plus success, result, error, duration_ms | End of command. |
| plugin_ui_changed | simhub-plugin | element, value | Settings panel interaction (omit level/event, data API endpoint, log all action traffic). |
| dashboard_ui_event | bridge | client_ip, element_id, event_type, value, plus same subsession_id, parent_session_id, session_num, track_display_name, lap, log_env, loki_push_target as actions (and session_yaml_fingerprint_sha256_16 when YAML available) | Dashboard UI-only interaction (panel toggles, log filter checkboxes, filter chips, view buttons, results drawer, etc.). |
| replay_control | simhub-plugin | mode, speed, search_mode | Replay buttons. |
| session_snapshot_recorded | simhub-plugin | path | Writable snapshot log. |
| session_end_fingerprint | simhub-plugin | session_num, results_ready, results_positions_count, replay_frame_num, session_time | Emitted when RecordSessionSnapshot is called with a trigger containing "session_end" (e.g. session_end:2). Fingerprint of what data is available at session end. |
| checkered_detected | simhub-plugin | session_state | Emitted when replay/live crosses the line (SessionState ≥ 5); before attempting capture. |
| checkered_retry | simhub-plugin | session_state | Emitted when running the 2s-delayed retry after checkered. |
| session_capture_skipped | simhub-plugin | trigger, error, details, will_retry | When capture is attempted but ResultsPositions is empty (e.g. at checkered). |
| session_capture_incident_mismatch | simhub-plugin | results_incidents, tracker_incidents, player_car_idx | WARN when player's ResultsPositions incident count ≠ IncidentTracker count (wrong session or SDK mapping). |
| session_summary_captured | simhub-plugin | trigger, session_num, driver_count, wanted_session_num, selected_session_num, session_match_exact, results_incident_sample | When TryCaptureAndEmitSessionSummary succeeds. Use session_match_exact to see when a fallback session was used; results_incident_sample = first 3 drivers' car_idx, position, incidents for SDK verification. |
| session_end_datapoints_session | simhub-plugin | trigger, session_id, session_num, session-level fields (track, series_id, session_name, incident_limit, …), telemetry_* at capture, results_driver_count | Emitted once per successful session summary capture. Session metadata and telemetry snapshot only; no results array. Use with session_end_datapoints_results chunks to get full data. Scales to hundreds of drivers. |
| session_end_datapoints_results | simhub-plugin | session_id, session_num, chunk_index, chunk_total, results_driver_count, results (array of up to 35 driver rows: pos, car_idx, driver, abbrev, car, class, laps, incidents, reason_out, user_id, team, irating, etc.) | One log line per chunk (35 drivers per chunk). Merge chunks by session_id and sort by chunk_index for the full results table. See docs/observability-scaling.md and § LogQL reference below. |
| finalize_capture_started / complete / timeout | simhub-plugin | target_frame, duration_ms | Debug / automation. |
| incident_detected | tracker | incident_type, car_number, driver_name, unique_user_id (iRacing CustID), delta, session_time, session_num, lap (per-car CarIdxLap for the incident car), replay_frame, replay_frame_end, start_frame / end_frame (same window as replay frames; aliases for ingestion), camera_view (compact string, e.g. cam_car_idx=N;group=Name), cause, other_car_number, subsession_id, parent_session_id, track_display_name, cam_car_idx / camera_group (when available), log_env, loki_push_target | Canonical rule name: iracing_incident — emitted JSON event: incident_detected (use this string in LogQL until code renames). Each YAML delta from OnSessionInfo (per-car CurDriverIncidentCount). Global uniqueness: see § Global incident uniqueness signature below. Use "not in session" for subsession_id / parent_session_id when iRacing has no loaded session. |
| baseline_established | tracker | driver_count | When tracker baseline is ready. |
| session_reset | tracker | old_session, new_session | When SessionNum changes. |
| seek_backward_detected | tracker | from_frame, to_frame, session_time | Replay seek. |
| yaml_update | tracker | session_info_update, session_num, session_time | Debug-only. |
| session_digest | simhub-plugin | session_id, session_num, track, duration_minutes, total_incidents, results_incident_sum, incident_summary, incident_summary_truncated, results_table, results_driver_count, actions_dispatched, … | Single-row session summary. total_incidents = count of incident_detected events (plugin); results_incident_sum = sum of iRacing per-driver incident points; results_table = authoritative ResultsPositions (pos, car, driver, incidents, laps, class, reason_out per driver). |
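Merging the chunked session_end_datapoints_results rows back into one results table can be sketched as (entry shape follows the fields described above):

```javascript
// Merge session_end_datapoints_results chunks (up to 35 drivers each) into
// one results table: filter by session_id, sort by chunk_index, concatenate.
function mergeResultChunks(entries, sessionId) {
  return entries
    .filter((e) => e.event === 'session_end_datapoints_results' &&
                   e.session_id === sessionId)
    .sort((a, b) => a.chunk_index - b.chunk_index)
    .flatMap((c) => c.results);
}

const logs = [
  { event: 'session_end_datapoints_results', session_id: 's-1',
    chunk_index: 1, chunk_total: 2, results: [{ pos: 36 }, { pos: 37 }] },
  { event: 'session_end_datapoints_results', session_id: 's-1',
    chunk_index: 0, chunk_total: 2, results: [{ pos: 1 }, { pos: 2 }] },
];

console.log(mergeResultChunks(logs, 's-1').map((r) => r.pos)); // [ 1, 2, 36, 37 ]
```

Completeness can be checked by comparing the merged row count against results_driver_count (or chunk_total against the number of chunks seen).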
PR checklist and domain taxonomy: docs/RULES-ActionCoverage.md. Full-field checklist below for dashboards, PR review, and Loki queries. Labels stay the four-label schema; everything else is JSON body.
| Field | Required for completeness | Notes |
|---|---|---|
| action | yes | Command name |
| arg | yes | Argument payload (may be empty) |
| correlation_id | yes | Trace id for paired action_result |
| subsession_id, parent_session_id, session_num, track_display_name | yes | Use "not in session" when offline |
| lap | yes | Integer; -1 when unknown (same as SessionLogging.LapUnknown) |
| session_id, replay_frame | when applicable | Spine / replay context |
| log_env, loki_push_target | recommended | Observability metadata |
Volume: In production, these fields are often omitted unless the plugin setting "Log all action traffic" or SIMSTEWARD_LOG_ALL_ACTIONS=1 is enabled. The same checklist covers the additional bridge fields logged when action_received is enabled.
| Field | Required for completeness | Notes |
|---|---|---|
| success | yes | Boolean outcome |
| result / error | as applicable | Payload or error detail |
| duration_ms | yes | Handler duration |
| action, arg, correlation_id | yes | Same command identity as action_dispatched |
| Session + routing fields | yes | Same set as action_dispatched (subsession_id, parent_session_id, session_num, track_display_name, spine fields, log_env, loki_push_target), plus session_yaml_fingerprint_sha256_16 when YAML is available |
| Field | Required for completeness | Notes |
|---|---|---|
| unique_user_id | yes | iRacing CustID |
| driver_name | yes | Display name in JSONL (display_name in coding rules = same concept) |
| session_time | yes | Time of detection |
| subsession_id, parent_session_id, session_num, track_display_name | yes | "not in session" when no session |
| replay_frame | yes | Start frame; use replay_frame_end if the event spans a window |
| start_frame, end_frame | yes | Same values as replay_frame / replay_frame_end for stable downstream keys |
| lap | yes | CarIdxLap for the car incurring the incident; -1 if unknown |
| cam_car_idx / camera_group | when available | Camera / view context |
| camera_view | recommended | Single string combining camera car and group (see table above) |
| incident_type, delta, car_number, … | per implementation | Taxonomy / YAML delta fields |
Use this tuple to dedupe and join incidents across splits, drivers, and time (Loki / warehouse / replay tools). All fields are in the JSON body (not Loki labels).
- Split / event: subsession_id (iRacing WeekendInfo.SubSessionID) + parent_session_id (WeekendInfo.SessionID) + session_num (practice / qual / race phase).
- Driver: unique_user_id (CustID) + driver_name (display name in the log line).
- When / where: session_time + start_frame + end_frame + track_display_name.
- Perspective: camera_view (or cam_car_idx + camera_group).
If iRacing is not in a loaded session, subsession_id, parent_session_id, and session_num use the same "not in session" fallback as action logs.
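A sketch of building that dedupe key (the joiner and field order here are illustrative choices; keep whatever convention you pick stable downstream):

```javascript
// Build the global incident uniqueness key from JSON body fields (not labels).
// Sketch: joiner and field order are illustrative, not a mandated format.
const NOT_IN_SESSION = 'not in session';

function incidentKey(e) {
  return [
    e.subsession_id ?? NOT_IN_SESSION,
    e.parent_session_id ?? NOT_IN_SESSION,
    e.session_num ?? NOT_IN_SESSION,
    e.unique_user_id,
    e.session_time,
    e.start_frame,
    e.end_frame,
  ].join('|');
}

const a = { subsession_id: 7001, parent_session_id: 6001, session_num: 2,
            unique_user_id: 412345, session_time: 1834.2,
            start_frame: 110040, end_frame: 110220 };
console.log(incidentKey(a) === incidentKey({ ...a })); // true: duplicates collapse
```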
LogQL today: filter on event = "incident_detected". If the emitter is renamed to iracing_incident, update queries and this doc in the same change.
incident_detected and session_digest are the main LogQL entry points for incidents and session-level summaries (use Explore or any Loki panel you add).
| Setting | Local Docker | Grafana Cloud |
|---|---|---|
| SIMSTEWARD_LOKI_URL | http://localhost:3100 | https://logs-prod-us-east-0.grafana.net |
| SIMSTEWARD_LOKI_USER | (blank) | Your instance user ID |
| SIMSTEWARD_LOKI_TOKEN | (blank) | Your log-write token |
| SIMSTEWARD_LOG_ENV | local | production |
| SIMSTEWARD_LOG_DEBUG | 1 (optional) | 0 or unset |
| SIMSTEWARD_LOG_ALL_ACTIONS | 1 to keep action_received and action_dispatched in logs | unset (production omits them by default) |
| SIMSTEWARD_RESOURCE_SAMPLE_SEC | Interval in seconds for host_resource_sample (15–3600) | 60 |
Log all action traffic: In production, action_received and action_dispatched are omitted at source to reduce volume. To capture every command (e.g. for debugging or full click/event visibility), enable Log all action traffic in the plugin settings (Observability / Log filters), or set SIMSTEWARD_LOG_ALL_ACTIONS=1 before starting SimHub.
The plugin reads these once at Init(). To switch environment, edit .env and restart SimHub.
Important: SimHub does not load a .env file. The plugin only sees environment variables that are set in the process that starts SimHub. To get logs into Grafana you must either:
- Local Loki: Start SimHub with the script so env vars are set before launch. From the plugin repo root: `.\scripts\run-simhub-local-observability.ps1`. This sets `SIMSTEWARD_LOKI_URL=http://localhost:3100` and `SIMSTEWARD_LOG_ENV=local`, then starts SimHub.
- Grafana Cloud: Set `SIMSTEWARD_LOKI_URL`, `SIMSTEWARD_LOKI_USER`, and `SIMSTEWARD_LOKI_TOKEN` in your user or system environment, then start SimHub (or use a launcher that sets them).
If Explore (or any Loki-backed panel) shows no logs for {app="sim-steward"}:
- Loki URL not set — The plugin pushes to Loki only when `SIMSTEWARD_LOKI_URL` is set. If you start SimHub by double-clicking (or from the Start menu), that variable is usually unset.
  - Fix: Start SimHub via `.\scripts\run-simhub-local-observability.ps1` for local Loki, or set the Loki env vars before starting SimHub.
  - Check: Open `%LOCALAPPDATA%\SimHubWpf\PluginsData\SimSteward\plugin.log` and look for a line with `event=loki_status`. If it says "Loki logging disabled", the URL was not set when the plugin started.
- Local stack not running — For local Docker Loki, ensure the stack is up: `cd observability/local && docker compose up -d`. Grafana should be at http://localhost:3000 and Loki at http://localhost:3100.
- Wrong query — In Grafana Explore, select the Loki datasource and use LogQL: `{app="sim-steward"}` or `{app="sim-steward", env="local"}`. Use a time range that includes when the plugin was running.
After fixing, restart SimHub (using the script for local) and trigger some activity (e.g. open the dashboard, connect iRacing, or run a replay); logs should appear within a few seconds to a minute depending on flush interval.
If Explore works but a dashboard panel shows "No data", fix the panel’s datasource (use UID loki_local for local provisioning, or your Cloud Loki datasource) and time range. If Explore is also empty, follow the checklist below.
Use this checklist so button presses (Play, etc.) show up in Grafana:
- Start SimHub with env set — Use `.\scripts\run-simhub-local-observability.ps1` for local Loki, or set `SIMSTEWARD_LOKI_URL` (and optional user/token) in your environment before starting SimHub.
- Or enable Loki in the plugin — In SimSteward plugin settings, enable Enable Loki logging so the plugin sets `SIMSTEWARD_LOKI_URL=http://localhost:3100` for this run (persists for next start).
- Local stack running — For local Loki: `cd observability/local && docker compose up -d`; confirm Loki at http://localhost:3100.
- Query and time range — In Grafana Explore, select the Loki datasource, query `{app="sim-steward"}` (optionally `env="local"`), and set the time range to Last 5 minutes or Last 15 minutes.
- Wait for flush — After pressing Play or other buttons (or opening the dashboard, connecting iRacing, or an incident firing), wait 1–2 seconds; these events trigger a debounced flush so logs appear quickly. The periodic timer can still take up to 5 s for other events.
- Check plugin.log — Look for `loki_status` ("Loki logging enabled" vs "disabled") and `loki_first_push_ok` (confirms at least one batch reached Loki). If you see push failure warnings, Loki is unreachable (stack down or wrong URL).
Events that trigger prompt flush (1–2 s): Button actions (action_result, action_dispatched), incidents (incident_detected), session lifecycle (checkered_detected, checkered_retry, session_summary_captured, session_digest, session_end_datapoints_session, session_end_datapoints_results, session_capture_skipped), dashboard and iRacing (dashboard_opened, iracing_connected, iracing_disconnected), tracker (baseline_established, session_reset), and bridge/plugin readiness (plugin_ready, bridge_starting, ws_client_connected, ws_client_disconnected). All other events are sent on the next batch-size or 5 s timer.
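The two-tier flush policy can be sketched as follows (sketch only; the real sink lives in the C# plugin, and the prompt-flush set below is abbreviated from the full list above):

```javascript
// Two-tier flush: prompt-flush events get a short debounce; everything else
// waits for batch size or the periodic timer. Timings mirror the documented
// defaults (1-2 s debounce, 5 s timer); batch size is an assumed placeholder.
const PROMPT_FLUSH_EVENTS = new Set([
  'action_result', 'action_dispatched', 'incident_detected',
  'checkered_detected', 'session_digest', 'dashboard_opened',
  'iracing_connected', 'iracing_disconnected', 'plugin_ready',
]);

function flushDelayMs(event, queuedCount, batchSize = 100) {
  if (PROMPT_FLUSH_EVENTS.has(event)) return 1500; // debounced 1-2 s flush
  if (queuedCount >= batchSize) return 0;          // batch full: flush now
  return 5000;                                     // periodic 5 s timer
}

console.log(flushDelayMs('incident_detected', 3)); // 1500
console.log(flushDelayMs('tick_stats', 120));      // 0
console.log(flushDelayMs('tick_stats', 3));        // 5000
```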
Set SIMSTEWARD_LOG_DEBUG=1 for local debugging only. When enabled:
- `PluginLogger.Debug()` emits DEBUG-level entries (still sent to Loki; filter in Explore/AI).
- In-process Loki push (when enabled) uses relaxed flush rules: 2 s timer, 500-entry batch, 5,000-entry queue, no line-size enforcement.
- Additional log events are emitted:
  - `tick_stats` every 60 ticks (≈1 s): running average `data_update_ms`, `frames_dropped`.
  - `yaml_update`: each `SessionInfoUpdate` refresh.
  - `ws_message_raw`: every WebSocket message (raw JSON) for debugging dashboard commands.
  - `incident_detected` includes a `snapshot` field with the current `PluginSnapshot`.
Never enable debug in production. For AI or assistant queries, filter with | level != "DEBUG".
SessionStats accumulates per session (reset on iracing_connected):
| Metric | What it tracks |
|---|---|
| ActionsDispatched | Total actions processed this session. |
| ActionFailures | Count of actions that returned success = false. |
| PluginErrors / PluginWarns | From _sessionStats.IncrementErrors() / IncrementWarns(). |
| WsPeakClients | Peak WebSocket client count per session. |
| ActionLatenciesMs | Rolling sample for P50/P95. |
| Incidents | Incident summaries (e.g. for digest). |
session_digest is emitted at most once per session (guarded by _sessionDigestEmitted). It caps incident_summary to 20 entries (highest severity first) and sets incident_summary_truncated: true when truncated. Trigger the digest manually (CaptureSessionSummaryNow, FinalizeThenCaptureSessionSummary, or checkered flag) so downstream AI and Grafana panels see the session as complete.
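The truncation behavior can be sketched as (the severity field and ordering here are illustrative; the plugin's actual severity ranking may differ):

```javascript
// Cap incident_summary at 20 entries (highest severity first) and flag
// truncation; mirrors the documented session_digest behavior.
function buildIncidentSummary(incidents, cap = 20) {
  const sorted = [...incidents].sort((a, b) => b.severity - a.severity);
  return {
    incident_summary: sorted.slice(0, cap),
    incident_summary_truncated: sorted.length > cap,
  };
}

const many = Array.from({ length: 25 }, (_, i) => ({ id: i, severity: i % 5 }));
const digest = buildIncidentSummary(many);
console.log(digest.incident_summary.length);    // 20
console.log(digest.incident_summary_truncated); // true
```

Downstream consumers should treat incident_summary_truncated: true as a cue to fall back to the raw incident_detected stream for the full list.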
Incident semantics: total_incidents is the count of incident_detected events (from IncidentTracker CurDriverIncidentCount deltas). The results_table incidents column and results_incident_sum are iRacing’s per-driver incident points at session end (from ResultsPositions). So total_incidents (e.g. 12 events) and results_incident_sum (e.g. 24 points) can both be correct but differ. For what iRacing can supply when (live vs replay vs YAML vs REST), see docs/IRACING-DATA-AVAILABILITY.md.
There are no provisioned dashboard JSON files in the repo at the moment. Use Grafana → Explore with the queries in § LogQL reference. To add dashboards again, put files under observability/local/grafana/provisioning/dashboards/ as a single object per file: { "dashboard": { ... }, "overwrite": true }, with panels using datasource { "type": "loki", "uid": "loki_local" } locally, or a variable such as DS_LOKI on Grafana Cloud.
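A minimal provisioning-file shape following that convention (illustrative skeleton only; the panel title and query are examples, not a shipped dashboard):

```json
{
  "dashboard": {
    "title": "Sim Steward — Example",
    "panels": [
      {
        "type": "logs",
        "title": "Command audit",
        "datasource": { "type": "loki", "uid": "loki_local" },
        "targets": [
          { "expr": "{app=\"sim-steward\", component=\"simhub-plugin\"} | json | event = \"action_result\"" }
        ]
      }
    ]
  },
  "overwrite": true
}
```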
Validate events over a time range in docs/observability-testing.md (Explore / LogQL checks).
The provisioned Loki datasource (observability/local/grafana/provisioning/datasources/loki.yml) defines derived fields so Grafana extracts and shows key JSON fields when viewing log lines:
| Derived field | Extracted from | Notes |
|---|---|---|
| Event | event | Event type (e.g. action_result, incident_detected). |
| Correlation ID | correlation_id | Clickable: runs LogQL filter by this correlation (trace). |
| Session ID | session_id | Session identifier. |
| Action | action | Command/action name. |
| Success | success | Action outcome (true / false). |
| Trigger | trigger | Session summary trigger (e.g. checkered, finalize). |
| Incident type | incident_type | Incident classification. |
| Driver name | driver_name | Driver name in incident/session context. |
| Car number | car_number | Car number (when serialized as string). |
In Explore or any Loki log panel, these appear as parsed columns/links alongside the raw line.
# Persistent storage (host path must exist before Docker)
New-Item -ItemType Directory -Force "S:\sim-steward-grafana-storage"
# In project .env, for local:
# SIMSTEWARD_LOKI_URL=http://localhost:3100
# SIMSTEWARD_LOKI_USER=
# SIMSTEWARD_LOKI_TOKEN=
# SIMSTEWARD_LOG_ENV=local
# SIMSTEWARD_LOG_DEBUG=1
cd observability/local
docker compose up -d
# Grafana: http://localhost:3000 | Loki: http://localhost:3100 (no auth for direct push)
# Optional: local loki-gateway (token-protected push) — see docs/observability-local.md for Grafana + Loki + gateway setup and LOKI_PUSH_TOKEN
Dashboards: Delete in the Grafana UI or via the HTTP API using an existing editor/admin token. Do not delete the Loki datasource or change stack URLs, SIMSTEWARD_LOKI_USER, or SIMSTEWARD_LOKI_TOKEN.
Stored logs: Rely on plan retention, or use Grafana Cloud / Loki documented deletion flows for your tier. Clearing data must not require rotating credentials.
Local Docker: Wipe Loki (and optionally Grafana) bind-mount data with docs/observability-local.md § Housekeeping (scripts/obs-wipe-local-data.ps1).
With .env containing SIMSTEWARD_LOKI_URL, SIMSTEWARD_LOKI_USER, and SIMSTEWARD_LOKI_TOKEN (optional override: LOKI_QUERY_URL):
| Command | Purpose |
|---|---|
| pnpm run loki:query | One-off GET .../loki/api/v1/query_range via scripts/query-loki-once.mjs. Flags: --query (LogQL), --limit, --lookback (seconds). |
| pnpm run env:run -- `<command>` | Load .env into the child process (e.g. pnpm run env:run -- pwsh -NoProfile -File scripts/poll-loki.ps1). |
| pnpm run obs:poll | Tail-style poll, direct Loki (default). |
| pnpm run obs:poll:grafana | Same, but -ViaGrafana (Bearer → Grafana proxy → Loki). |
| pnpm run obs:poll:grafana:env | Same as obs:poll:grafana but injects .env with dotenv-cli first (secrets only in the child process). |
Path A (direct Loki): SIMSTEWARD_LOKI_* + pnpm run loki:query or pnpm run obs:poll.
Path B (Grafana Cloud, elevated glsa_* Bearer): Set GRAFANA_URL to your stack (https://<slug>.grafana.net — not logs-prod-*.grafana.net). Set GRAFANA_LOKI_DATASOURCE_UID to the Loki datasource UID in that stack (Connections → Data sources). Set GRAFANA_API_TOKEN or CURSOR_ELEVATED_GRAFANA_TOKEN (service account token with permission to query the Loki datasource via the proxy). Then:
pnpm run obs:poll:grafana:env
# or: pnpm run env:run -- pwsh -NoProfile -File scripts/poll-loki.ps1 -ViaGrafanapoll-loki.ps1 reads .env from disk; *:env pnpm scripts add dotenv -e .env so variables are also loaded for the child process without exporting them in the shell.
401/403: On Path A, the glc_* policy may lack Loki read. On Path B, ensure the Bearer token can query datasources; check datasource UID and stack URL.
| Purpose | LogQL |
|---|---|
| Command audit | {app="sim-steward", component="simhub-plugin"} | json | event = "action_result" |
| Failed commands | {app="sim-steward", component="simhub-plugin"} | json | event = "action_result" | success = "false" |
| Action volume (timeseries) | count_over_time({app="sim-steward", component="simhub-plugin"} | json | event = "action_result" [$__interval]) |
| Incident timeline | {app="sim-steward", component="tracker"} | json | event = "incident_detected" |
| Plugin lifecycle | {app="sim-steward", component="simhub-plugin"} | json | event =~ "plugin_started|plugin_ready|iracing_connected|iracing_disconnected|plugin_stopped" |
| Session digests | {app="sim-steward", component="simhub-plugin"} | json | event = "session_digest" |
| Session end metadata | {app="sim-steward", component="simhub-plugin"} | json | event = "session_end_datapoints_session" |
| Session end results (all chunks for one session) | {app="sim-steward", component="simhub-plugin"} | json | event = "session_end_datapoints_results" | session_id = "<id>" — merge in panels by sorting on chunk_index and flattening results. |
| WS peak (stat) | max_over_time({app="sim-steward"} | json | event = "session_digest" | unwrap ws_peak_clients [24h]) |
| Trace by correlation | {app="sim-steward"} | json | correlation_id = "<id>" |
| All errors | {app="sim-steward", level="ERROR"} |
| Purpose | LogQL |
|---|---|
| Replay control by speed | {app="sim-steward", component="simhub-plugin"} | json | event = "replay_control" | speed != "" — filter by speed (e.g. 16, 8, 1) to see which replay speed was used. |
| Replay control (all) | {app="sim-steward", component="simhub-plugin"} | json | event = "replay_control" — seek/play/pause and speed. |
| Incident count (timeseries) | count_over_time({app="sim-steward", component="tracker"} | json | event = "incident_detected" [$__interval]) |
| Incident count (total in range) | count_over_time({app="sim-steward", component="tracker"} | json | event = "incident_detected" [$__range]) |
session_summary_captured is emitted only when the plugin successfully captures the session results table (ResultsPositions). It does not fire when results are not yet available (e.g. before checkered, or in a short replay clip that never reaches session end). Use session_capture_skipped to see when capture was attempted but results were empty; use session_end_fingerprint (when implemented) to see what data was available at session end.
End-of-session driver results are emitted as session_end_datapoints_session (metadata once) plus session_end_datapoints_results (one log line per chunk of 35 drivers). To show a full results table for a session in Grafana log/table panels:
- Query:
{app="sim-steward", component="simhub-plugin"} | json | event = "session_end_datapoints_results" | session_id = "<session_id>". - In the panel transform: sort by
chunk_index, then use a transform that flattens theresultsarray from each chunk into a single table (e.g. "Merge" / "Flatten" or a custom transformation that concatenatesresultsin order).
Exact driver count is in session_end_datapoints_session (results_driver_count); use that event when you need session metadata without parsing result chunks.
- Grafana Sift — Pattern grouping for errors: use
{app="sim-steward", level="ERROR"}. - Grafana Assistant — Start with
session_digest; then drill down bysession_idorcorrelation_id. Filter| level != "DEBUG"for production. - Natural-language LogQL — In Explore, use field names like
action,success,duration_ms,incident_type,session_id,correlation_id. - MCP — Optional: Grafana MCP connector (e.g. github.com/grafana/mcp-grafana) so Cursor or other tools query logs; point at Grafana Cloud with a token.
After Phase 1 is in use and logs are observed in production, Phase 2 can add OpenTelemetry metrics (e.g. via Grafana.OpenTelemetry NuGet) for faster dashboards, longer retention, and metric-based alerting. Metric names and env vars are in the Grafana Loki logging plan.