Grafana Loki Structured Logging

Structured logging for SimSteward (Grafana Loki / Grafana Cloud or local Docker). All logs are event-driven; there is no per-tick logging in production.

Implemented today: Plugin → PluginLogger.Structured() → plugin-structured.jsonl (NDJSON on disk), and the same entries are mirrored to the dashboard over WebSocket. This repository’s plugin does not yet HTTP POST log lines to Loki. SIMSTEWARD_LOKI_URL is used for routing metadata in JSON (loki_push_target), optional read paths (e.g. data-capture suite verification), and scripts: deploy.ps1 posts a single deploy_marker via send-deploy-loki-marker.ps1 when the URL is set.

To see full plugin logs in Loki today, run an external shipper that tails plugin-structured.jsonl into your stack, or add in-process batch POST later (see docs/observability-scaling.md). If WebSocket log sends fail, the plugin writes to broadcast-errors.log (see docs/TROUBLESHOOTING.md §4b).

Explore, custom panels, and AI tooling use the 4-label schema and fixed event taxonomy below. Data routing: docs/DATA-ROUTING-OBSERVABILITY.md. Local stack: docs/observability-local.md.

Loki stream (when ingested): Do not filter at the push path — ship the same lines you write to plugin-structured.jsonl.
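
As a rough illustration of what an external shipper has to produce, the sketch below turns NDJSON text from plugin-structured.jsonl into a body for Loki's `POST /loki/api/v1/push` endpoint, grouping lines by the four labels and shipping each line unmodified. The entry field names `timestamp`, `component`, and `level`, the fallback values, and the function name are assumptions for illustration; check them against the real file before relying on this.

```javascript
// Sketch only: `timestamp`, `component`, and `level` inside each NDJSON entry
// are assumed field names, not confirmed by this document.
function toLokiPushBody(ndjsonText, env) {
  const streams = new Map(); // serialized label set -> { stream, values }
  for (const line of ndjsonText.split("\n")) {
    if (!line.trim()) continue;
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // e.g. a partially written last line; a real tailer would retry it
    }
    // Only the four bounded labels; everything else stays in the JSON body.
    const labels = {
      app: "sim-steward",
      env,
      component: String(entry.component ?? "simhub-plugin"),
      level: String(entry.level ?? "INFO"),
    };
    const key = JSON.stringify(labels);
    if (!streams.has(key)) streams.set(key, { stream: labels, values: [] });
    // Loki expects ["<unix epoch in nanoseconds>", "<log line>"] string pairs.
    const ms = Date.parse(entry.timestamp);
    const ns = (BigInt(Number.isNaN(ms) ? Date.now() : ms) * 1000000n).toString();
    streams.get(key).values.push([ns, line]); // the line is shipped unfiltered
  }
  return { streams: [...streams.values()] };
}
```

The returned object would be sent as JSON (`Content-Type: application/json`) to the push endpoint of whatever Loki URL the shipper is configured with.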

Filtering is dashboard-only. The web dashboard receives the full stream via WebSocket and applies level/event visibility filters for display only (checkboxes and hiddenLevels / hiddenEvents). Toggling "hide DEBUG" or hiding specific event types in the dashboard shows or hides entries that are already in the stream; nothing is dropped at the plugin.
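
A minimal sketch of the display-only filter described above. `hiddenLevels` and `hiddenEvents` are the names used in this document; the function name and the use of `Set` are illustrative assumptions, not the dashboard's actual code.

```javascript
// Display-only filter: the received array is never mutated or trimmed.
function visibleEntries(entries, hiddenLevels, hiddenEvents) {
  return entries.filter(
    (e) => !hiddenLevels.has(e.level) && !hiddenEvents.has(e.event)
  );
}

const received = [
  { level: "DEBUG", event: "yaml_update" },
  { level: "INFO", event: "action_result" },
  { level: "INFO", event: "ws_client_connected" },
];
const shown = visibleEntries(
  received,
  new Set(["DEBUG"]),
  new Set(["ws_client_connected"])
);
console.log(shown.length, received.length); // 1 3
```

Clearing both sets shows every entry again, because nothing was dropped from the stream.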

Local vs prod (same pipeline; env label only)

Env label: Set SIMSTEWARD_LOG_ENV=local for local dev or SIMSTEWARD_LOG_ENV=production (default); this flows into JSON as log_env / routing hints. The dashboard WebSocket stream is always the full, unfiltered stream; Loki reflects whatever you ingest from plugin-structured.jsonl (or future in-process POST). Volume is controlled by event-driven logging (no per-tick logs), while the dashboard display filter only controls what is shown.

Grafana Cloud free tier limits

Limit	Value	Impact
Ingestion rate	5 MB/s per user	Batches are typically < 20 KB; stay well below.
Active streams	5,000	Our 4-label schema yields < 32 streams.
Retention	14 days	Two weeks of sessions queryable.
Max line size	256 KB (hard)	Target < 800 bytes per line; self-impose 8 KB max.
Label names per series	15	We use 4 labels.
Label value length	2,048 chars	No concern with static label values.

Volume allowance: free tier ~50 GB/month; our budget is < 1 GB/month.

Scale: hundreds of drivers / many users

Stream count and labels stay bounded (four labels only; no session_id or driver_id as labels). Session-end results with 100–200+ drivers use chunked session_end_datapoints_results (35 drivers per line); merge chunks in Grafana. Many SimSteward users can send to one central Loki (each instance ships to the same endpoint); use an optional bounded instance_id label if you need tenancy in queries. Do not log per-driver per-tick in Loki; use metrics (OTel) for high-frequency telemetry. Full stream/volume math, label rules, and query patterns: docs/observability-scaling.md.
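
The chunking and merging described above can be sketched as follows. The field names come from the `session_end_datapoints_results` event in this document; the zero-based `chunk_index` and both function names are assumptions for illustration.

```javascript
const CHUNK_SIZE = 35; // drivers per log line, per this document

// Split a results array into one log entry per chunk.
function chunkResults(sessionId, sessionNum, results) {
  const chunkTotal = Math.ceil(results.length / CHUNK_SIZE);
  const lines = [];
  for (let i = 0; i < chunkTotal; i++) {
    lines.push({
      event: "session_end_datapoints_results",
      session_id: sessionId,
      session_num: sessionNum,
      chunk_index: i, // assumed zero-based
      chunk_total: chunkTotal,
      results_driver_count: results.length,
      results: results.slice(i * CHUNK_SIZE, (i + 1) * CHUNK_SIZE),
    });
  }
  return lines;
}

// Rebuild the full results table from chunks that may arrive in any order.
function mergeChunks(lines, sessionId) {
  return lines
    .filter((l) => l.event === "session_end_datapoints_results" && l.session_id === sessionId)
    .sort((a, b) => a.chunk_index - b.chunk_index)
    .flatMap((l) => l.results);
}
```

With 120 drivers this yields four lines (35, 35, 35, and 15 rows), so each line stays far below the line-size limits while the merged table is still complete.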

Volume budget (per session, ~2 h)

Source	Logs / session	Bytes / entry	MB / session
Action commands (2 lines per action)	~440	~400 B	~0.18
Incidents	~50	~600 B	~0.03
Lifecycle / iRacing	~15	~300 B	< 0.01
WS client connect/disconnect	~10	~300 B	< 0.01
Errors / warnings	~10	~350 B	< 0.01
Total	~525	—	~0.23 MB

At 30 sessions/month: ~7 MB. Never log on a tick; DataUpdate() runs at 60 Hz.
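
The budget arithmetic can be reproduced with a few lines, using the per-source estimates from the table. Treating 1 MB as 10^6 bytes and assuming a ~300-byte line for the per-tick counterexample are assumptions of this sketch.

```javascript
// Per-session estimates copied from the volume budget table.
const sources = [
  { name: "Action commands", logs: 440, bytes: 400 },
  { name: "Incidents", logs: 50, bytes: 600 },
  { name: "Lifecycle / iRacing", logs: 15, bytes: 300 },
  { name: "WS client connect/disconnect", logs: 10, bytes: 300 },
  { name: "Errors / warnings", logs: 10, bytes: 350 },
];
const SESSIONS_PER_MONTH = 30;

const totalLogs = sources.reduce((n, s) => n + s.logs, 0);
const totalBytes = sources.reduce((n, s) => n + s.logs * s.bytes, 0);
const mbPerSession = totalBytes / 1e6;
const mbPerMonth = mbPerSession * SESSIONS_PER_MONTH;

// Counterexample: one ~300-byte line per DataUpdate() tick (60 Hz) for 2 hours.
const tickLines = 60 * 2 * 60 * 60;
const tickMb = (tickLines * 300) / 1e6;

console.log(totalLogs, totalBytes); // 525 217000
console.log(mbPerSession.toFixed(2), mbPerMonth.toFixed(1)); // 0.22 6.5
console.log(tickLines, tickMb.toFixed(1)); // 432000 129.6
```

The unrounded total is slightly below the table's ~0.23 MB because the table sums row values that were already rounded up. The per-tick counterexample is several hundred times larger than the whole event-driven budget for a session.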

Label schema

Four labels only. Do not put high-cardinality values (session_id, car_number, action, correlation_id) in labels—they stay in the JSON body.

Two `app` namespaces

`app`	Audience	What it covers
`sim-steward`	Product / runtime	C# plugin, dashboard, deploy
`claude-dev-logging`	Dev tooling observability	Claude Code hooks, MCP server instrumentation

`app=sim-steward` (product)

Label	Values	Rationale
`app`	`sim-steward`	Product namespace.
`env`	`production` or `local`	From `SIMSTEWARD_LOG_ENV`.
`component`	`simhub-plugin`, `bridge`, `tracker`, `dashboard`, `deploy`	Subsystem.
`level`	`INFO`, `WARN`, `ERROR`, `DEBUG`	Severity. `DEBUG` only when `SIMSTEWARD_LOG_DEBUG=1`.
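
An upper bound on Loki streams is the product of the number of distinct values per label. The sketch below computes it for one `env` value using the table above; the function name is hypothetical, and the 30-value `session_id` label is a deliberately bad example added for contrast.

```javascript
// Upper bound on streams = product of distinct values per label.
function maxStreams(labelValues) {
  return Object.values(labelValues).reduce((n, values) => n * values.length, 1);
}

const productLabels = {
  app: ["sim-steward"],
  env: ["local"], // one instance runs with a single env value
  component: ["simhub-plugin", "bridge", "tracker", "dashboard", "deploy"],
  level: ["INFO", "WARN", "ERROR", "DEBUG"],
};
const perEnv = maxStreams(productLabels);

// What would happen if a month of session ids (say 30) became a label.
const withSessionLabel = maxStreams({
  ...productLabels,
  session_id: Array.from({ length: 30 }, (_, i) => `s${i}`),
});

console.log(perEnv, withSessionLabel); // 20 600
```

Every extra label multiplies the bound, which is why identifiers such as session_id and correlation_id stay in the JSON body.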

`app=claude-dev-logging` (dev tooling)

Label	Values	Rationale
`app`	`claude-dev-logging`	Dev tooling namespace.
`env`	`local` or `dev`	From `SIMSTEWARD_LOG_ENV`.
`component`	`tool`, `mcp-contextstream`, `mcp-sentry`, `mcp-ollama`, `lifecycle`, `agent`, `user`, `other`	Subsystem.
`level`	`INFO`, `WARN`, `ERROR`	Severity.

The hook logger (~/.claude/hooks/loki-log.js) buckets by hook type: tool hooks for non-MCP tools use component=tool; MCP tools use component=mcp-<service>; session/compact/stop use lifecycle; subagent/task use agent; prompt/notification/permission use user. The MCP service name is also written to the JSON body service field, so you can filter on it after parsing: {app="claude-dev-logging"} | json | service="contextstream".
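
The bucketing rule might look like the sketch below. It is not the code in loki-log.js: the exact hook type strings (including `Stop`, `Notification`, and the hypothetical `PermissionRequest`) and the `mcp__<service>__<tool>` tool-name convention are assumptions made for illustration.

```javascript
// Assumed hook type names; adjust to the hook types your setup actually emits.
const TOOL_HOOKS = new Set(["PreToolUse", "PostToolUse", "PostToolUseFailure"]);
const LIFECYCLE_HOOKS = new Set(["SessionStart", "SessionEnd", "PreCompact", "Stop"]);
const AGENT_HOOKS = new Set(["SubagentStart", "SubagentStop"]);
const USER_HOOKS = new Set(["UserPromptSubmit", "Notification", "PermissionRequest"]);

// Map a hook invocation to the bounded `component` label value.
function componentFor(hookType, toolName = "") {
  if (TOOL_HOOKS.has(hookType)) {
    // Assumed convention: MCP tools are named mcp__<service>__<tool>.
    if (toolName.startsWith("mcp__")) {
      const service = toolName.split("__")[1];
      if (service) return `mcp-${service}`;
    }
    return "tool";
  }
  if (LIFECYCLE_HOOKS.has(hookType)) return "lifecycle";
  if (AGENT_HOOKS.has(hookType)) return "agent";
  if (USER_HOOKS.has(hookType)) return "user";
  return "other";
}

console.log(componentFor("PostToolUse", "mcp__contextstream__search")); // mcp-contextstream
```

Because the MCP service becomes part of a label value, the set of configured MCP servers has to stay small and fixed for the stream count to remain bounded.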

Enrichment fields (JSON body, not labels)

Derived from stdin + temp-file timestamp correlation ($TMPDIR/claude-hook-timing/):

Field	Hook types	Description
`duration_ms`	PostToolUse, PostToolUseFailure	Wall-clock tool execution time (Pre→Post diff).
`tool_input_bytes`	PreToolUse, PostToolUse, PostToolUseFailure	`Buffer.byteLength(JSON.stringify(tool_input))`.
`tool_response_bytes`	PostToolUse, PostToolUseFailure	`Buffer.byteLength(JSON.stringify(tool_response))`.
`is_retry`	PreToolUse	`true` when same tool+input hash seen within 10 s.
`retry_of`	PreToolUse	`tool_use_id` of the previous identical call.
`error_type`	PostToolUseFailure	`timeout`, `permission_denied`, `not_found`, `connection_refused`, `rate_limited`, `unknown`.
`agent_depth`	SubagentStart	Count of concurrently open agents in the session.
`agent_duration_ms`	SubagentStop	Wall-clock subagent lifetime.
`session_duration_ms`	SessionEnd	Wall-clock session lifetime.
`compaction_count`	PreCompact, SessionEnd	Number of compactions in this session.
`user_think_time_ms`	UserPromptSubmit	Time since last tool completion.

Stale files (>5 min) are cleaned on each PreToolUse and SessionStart. Retry markers expire after 10 s.
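
To make the `is_retry` / `retry_of` / `tool_input_bytes` rules concrete, here is a sketch of the retry check. The real hook runs as a separate process per event and correlates through temp files; this sketch swaps that for an in-memory `Map` so it stays self-contained. The function names and the choice of SHA-256 over tool name plus input are assumptions.

```javascript
const crypto = require("node:crypto");

const RETRY_WINDOW_MS = 10_000; // retry markers expire after 10 s

function makeRetryDetector() {
  const lastSeen = new Map(); // hash of tool+input -> { toolUseId, atMs }
  return function annotate(toolName, toolInput, toolUseId, nowMs) {
    const payload = JSON.stringify(toolInput);
    const hash = crypto
      .createHash("sha256")
      .update(toolName + "\n" + payload)
      .digest("hex");
    const prev = lastSeen.get(hash);
    const isRetry = prev !== undefined && nowMs - prev.atMs <= RETRY_WINDOW_MS;
    lastSeen.set(hash, { toolUseId, atMs: nowMs });
    return {
      tool_input_bytes: Buffer.byteLength(payload),
      is_retry: isRetry,
      // retry_of is only present when the call is classified as a retry.
      ...(isRetry ? { retry_of: prev.toolUseId } : {}),
    };
  };
}

const annotate = makeRetryDetector();
console.log(annotate("Bash", { command: "ls" }, "a", 0));
console.log(annotate("Bash", { command: "ls" }, "b", 5_000));
```

The second call is flagged as a retry of `a` because the same tool and input were seen 5 s earlier; the same call 15 s later would not be.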

Session token sidecar

On SessionEnd, the hook reads transcript_path (JSONL) and writes aggregated metrics to {cwd}/logs/claude-session-metrics.jsonl (tailed by Alloy). No conversation content — only: total_input_tokens, total_output_tokens, total_cache_creation_tokens, total_cache_read_tokens, assistant_turns, tool_use_count, model, session_duration_ms, compaction_count.
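
A sketch of the aggregation step, assuming each transcript row for an assistant turn has the shape `{ type: "assistant", message: { model, usage, content } }` with `usage` keys `input_tokens`, `output_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`. That shape is an assumption about the transcript format, not something this document specifies. `session_duration_ms` and `compaction_count` come from the hook's own timing data and are left out here.

```javascript
// Aggregate token usage from transcript JSONL text without copying any content.
function aggregateTranscript(jsonlText) {
  const totals = {
    total_input_tokens: 0,
    total_output_tokens: 0,
    total_cache_creation_tokens: 0,
    total_cache_read_tokens: 0,
    assistant_turns: 0,
    tool_use_count: 0,
    model: null,
  };
  for (const line of jsonlText.split("\n")) {
    if (!line.trim()) continue;
    let row;
    try {
      row = JSON.parse(line);
    } catch {
      continue; // ignore malformed rows
    }
    if (!row || row.type !== "assistant" || !row.message) continue;
    const usage = row.message.usage ?? {};
    totals.total_input_tokens += usage.input_tokens ?? 0;
    totals.total_output_tokens += usage.output_tokens ?? 0;
    totals.total_cache_creation_tokens += usage.cache_creation_input_tokens ?? 0;
    totals.total_cache_read_tokens += usage.cache_read_input_tokens ?? 0;
    totals.assistant_turns += 1;
    const content = Array.isArray(row.message.content) ? row.message.content : [];
    // Count tool_use blocks only; their inputs and any text are never stored.
    totals.tool_use_count += content.filter((c) => c && c.type === "tool_use").length;
    totals.model = row.message.model ?? totals.model;
  }
  return totals;
}
```

The result holds counters and the model name only, which matches the "no conversation content" constraint of the sidecar file.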

Event taxonomy

Every log line has an event field. Key events:

Event	Component	Key fields	Notes
`logging_ready`	simhub-plugin	—	First log after logger creation; init continues.
`settings_saved`	simhub-plugin	—	UI settings persisted.
`file_tail_ready`	simhub-plugin	`path`	Structured log file path ready for Loki ingestion (outside plugin).
`plugin_started`	simhub-plugin	—	SimSteward plugin starting; tracker callback set.
`actions_registered`	simhub-plugin	—	SimHub properties and actions registered.
`bridge_starting`	simhub-plugin	`bind`, `port`	WebSocket bridge starting.
`bridge_start_failed`	simhub-plugin	`bind`, `port`, `error`	WebSocket server failed to start (WARN).
`plugin_ready`	simhub-plugin	`ws_port`, `env`	Lifecycle readiness.
`deploy_marker`	simhub-plugin	`deploy_status` (`ok` \| `failed`), `post_deploy_warn`, `detail`, `machine`, `simhub_path`	Not from the in-process plugin: one line at end of `deploy.ps1` via scripts/send-deploy-loki-marker.ps1 when `SIMSTEWARD_LOKI_URL` is set. WARN level if `post_deploy_warn` (post-deploy `tests/*.ps1` failed after retry). Use Grafana dashboard Sim Steward — Deploy health (`simsteward-deploy-health`).
`host_resource_sample`	simhub-plugin	`process_cpu_pct`, `process_working_set_mb`, `process_private_mb`, `gc_heap_mb`, `process_threads`, `disk_root`, `disk_total_gb`, `disk_free_gb`, `disk_used_pct`, `ws_clients`, `sample_interval_sec`	~1/min (default): SimHub process CPU (share of all logical CPUs), memory, managed heap, and usage of the drive that hosts plugin data. Tune interval with `SIMSTEWARD_RESOURCE_SAMPLE_SEC` (15–3600). Use Explore time series on numeric fields to spot spikes; rising `process_working_set_mb` / `gc_heap_mb` over hours suggests growth (not necessarily a leak—correlate with sessions).
`log_streaming_subscribed`	simhub-plugin	—	Dashboard log streaming attached.
`irsdk_started`	simhub-plugin	—	iRacing SDK started.
`replay_incident_index_sdk_ready`	simhub-plugin	`irsdk_connected`, `update_interval_ms`, `log_env`, `loki_push_target`	Milestone 1 (TR-001): IRSDK memory map connected; emitted on `OnConnected` after `iracing_connected`.
`replay_incident_index_session_context`	simhub-plugin	`sim_mode`, `subsession_id`, `parent_session_id`, `session_num`, `track_display_name`, `is_replay_mode`, `session_yaml_fingerprint_sha256_16` (first 16 hex chars of SHA-256 of raw `SessionInfoYaml`), `session_yaml_length`, `session_info_update` (IRSDK `SessionInfoUpdate`), `log_env`, `loki_push_target`	Milestone 1 (TR-002/003): parsed `WeekendInfo` from session YAML on `OnSessionInfo`; throttled per `(SubSessionID, SessionNum, SimMode)`. WARN when `subsession_id` is set and `is_replay_mode` is false (loaded session but not replay).
`replay_incident_index_started`	simhub-plugin	`saved_replay_frame_before_seek`, `target_play_speed`, spine fields	Milestone 2 (TR-004): `replay_incident_index_build` action `start` queued; on next `OnTelemetryData` tick, `ReplaySearch(ToStart)` issued.
`replay_incident_index_baseline_ready`	simhub-plugin	`replay_frame_num_end`, `car_idx_session_flags` (int[64] TR-005), `player_car_my_incident_count_baseline` (TR-006), spine fields	Milestone 2: baseline at stable `ReplayFrameNum==0` before fast-forward.
`replay_incident_index_fast_forward_started`	simhub-plugin	`replay_play_speed_requested`, `replay_play_speed_telemetry`, `effective_sample_hz_vs_session_time` (NFR-008), `sdk_update_interval_ms`, spine fields	Milestone 2 (TR-008/009): `ReplaySetPlaySpeed` applied.
`replay_incident_index_fast_forward_complete`	simhub-plugin	`index_build_time_ms`, `fast_forward_telemetry_samples`, `completion_reason` (`replay_finished` \| `paused_or_stopped`), `replay_play_speed`, `effective_sample_hz_vs_session_time`, `replay_frame_num_at_end`, `replay_frame_num_end`, `replay_session_time`, `detected_incident_samples`, `fast_repair_delta_events`, spine fields	Milestone 2 (TR-010/011): `IsReplayPlaying` became false; playback restored to 1×. M3: counts reflect `ReplayIncidentIndexDetector` output during fast-forward (TR-012–TR-018).
`replay_incident_index_detection`	simhub-plugin	`fingerprint` (TR-020 v1 hex, same as JSON row), `car_idx`, `session_time_ms`, `detection_source` (`repair_flag` \| `furled_flag` \| `player_incident_count`), `incident_points` (int or null), `replay_frame`, `replay_session_time`, spine fields	Milestone 5 (TR-028): one line per primary detection during fast-forward; not emitted on every 60Hz tick.
`replay_incident_index_build_error`	simhub-plugin	`error` (`seek_start_timeout`, …), spine fields	Milestone 2: seek timeout or speed command failure (WARN).
`replay_incident_index_build_cancelled`	simhub-plugin	`reason`, spine fields	Milestone 2: `replay_incident_index_build` `cancel` or disconnect during build.
`replay_incident_index_validation_summary`	simhub-plugin	`output_path`, `index_build_time_ms_total`, `detected_incident_rows`, `yaml_results_available`, `yaml_session_num_used`, `yaml_parse_error` (when parse fails), `discrepancy_count`, `camera_seek_attempted`, `camera_seek_matches`, `camera_seek_match_percent`, spine fields	Milestone 4 (TR-023–TR-025): after fast-forward and optional per-row `ReplaySearchSessionTime` + cooldown, JSON index written (TR-019); YAML vs detection discrepancies; camera match rate.
`replay_incident_index_record_started`	simhub-plugin	`record_file`, `subsession_id`, spine fields	Milestone 6 (TR-038): dashboard `replay_incident_index_record` `on`; 60Hz samples go to NDJSON (not per-tick Loki).
`replay_incident_index_record_stopped`	simhub-plugin	`reason` (`user_off` \| `iracing_disconnected` \| `plugin_end`), spine fields	Record mode ended; writer closed.
`replay_incident_index_record_window`	simhub-plugin	`telemetry_ticks` (=60), `record_file`, `subsession_id`, spine fields	~1/s wall time while record mode on: confirms high-frequency file writes for TR-040 volume panels (not 60Hz Loki lines).
`plugin_stopped`	simhub-plugin	—	Emitted from `End()`.
`iracing_connected` / `iracing_disconnected`	simhub-plugin	—	IRSDK connection state.
`ws_client_connected` / `ws_client_disconnected`	bridge	`client_ip`, `client_count`	Each connect/disconnect.
`dashboard_opened`	bridge	`client_ip`, `client_count`	When a dashboard client connects (page load or refresh).
`ws_client_rejected`	bridge	`client_ip`, `reason`	Token missing or invalid.
`action_received`	bridge	`action`, `arg`, `client_ip`, `correlation_id`	Logged before `DispatchAction`. In production, omitted by default; enable "Log all action traffic" in settings or `SIMSTEWARD_LOG_ALL_ACTIONS=1` to keep.
`action_dispatched`	simhub-plugin	`action`, `arg`, `correlation_id`, `subsession_id`, `parent_session_id`, `session_num`, `track_display_name`, `lap`, `log_env`, `loki_push_target`, plus spine `session_id` / `replay_frame` when set; `session_yaml_fingerprint_sha256_16` when session YAML is available (SHA-256 prefix of `SessionInfoYaml`, recomputed when `SessionInfoUpdate` changes)	Start of every command. subsession_id = iRacing `WeekendInfo.SubSessionID` when > 0, else `"not in session"`. parent_session_id = `WeekendInfo.SessionID` when > 0, else `"not in session"`. session_num = telemetry `SessionNum` when connected, else `"not in session"`. lap = telemetry `CarIdxLap` for the focus car (CamCarIdx if valid, else PlayerCarIdx); `-1` when unknown/disconnected. log_env = `SIMSTEWARD_LOG_ENV` or `unset`. loki_push_target = `disabled` \| `grafana_cloud` \| `local_or_custom` from `SIMSTEWARD_LOKI_URL` (same env the Loki sink uses — set before SimHub starts, e.g. launcher loading `.env`). In production, omitted by default; enable "Log all action traffic" to keep.
`action_result`	simhub-plugin	Same session/routing fields as `action_dispatched` where applicable, plus `success`, `result`, `error`, `duration_ms`	End of command.
`plugin_ui_changed`	simhub-plugin	`element`, `value`	Settings panel interaction (omit level/event, data API endpoint, log all action traffic).
`dashboard_ui_event`	bridge	`client_ip`, `element_id`, `event_type`, `value`, plus same `subsession_id`, `parent_session_id`, `session_num`, `track_display_name`, `lap`, `log_env`, `loki_push_target` as actions (and `session_yaml_fingerprint_sha256_16` when YAML available)	Dashboard UI-only interaction (panel toggles, log filter checkboxes, filter chips, view buttons, results drawer, etc.).
`replay_control`	simhub-plugin	`mode`, `speed`, `search_mode`	Replay buttons.
`session_snapshot_recorded`	simhub-plugin	`path`	Writable snapshot log.
`session_end_fingerprint`	simhub-plugin	`session_num`, `results_ready`, `results_positions_count`, `replay_frame_num`, `session_time`	Emitted when RecordSessionSnapshot is called with a trigger containing "session_end" (e.g. session_end:2). Fingerprint of what data is available at session end.
`checkered_detected`	simhub-plugin	`session_state`	Emitted when replay/live crosses the line (SessionState ≥ 5); before attempting capture.
`checkered_retry`	simhub-plugin	`session_state`	Emitted when running the 2s-delayed retry after checkered.
`session_capture_skipped`	simhub-plugin	`trigger`, `error`, `details`, `will_retry`	When capture is attempted but ResultsPositions is empty (e.g. at checkered).
`session_capture_incident_mismatch`	simhub-plugin	`results_incidents`, `tracker_incidents`, `player_car_idx`	WARN when player's ResultsPositions incident count ≠ IncidentTracker count (wrong session or SDK mapping).
`session_summary_captured`	simhub-plugin	`trigger`, `session_num`, `driver_count`, `wanted_session_num`, `selected_session_num`, `session_match_exact`, `results_incident_sample`	When `TryCaptureAndEmitSessionSummary` succeeds. Use `session_match_exact` to see when fallback session was used; `results_incident_sample` = first 3 drivers' car_idx, position, incidents for SDK verification.
`session_end_datapoints_session`	simhub-plugin	`trigger`, `session_id`, `session_num`, session-level fields (track, series_id, session_name, incident_limit, …), `telemetry_*` at capture, `results_driver_count`	Emitted once per successful session summary capture. Session metadata and telemetry snapshot only; no results array. Use with `session_end_datapoints_results` chunks to get full data. Scales to hundreds of drivers.
`session_end_datapoints_results`	simhub-plugin	`session_id`, `session_num`, `chunk_index`, `chunk_total`, `results_driver_count`, `results` (array of up to 35 driver rows: pos, car_idx, driver, abbrev, car, class, laps, incidents, reason_out, user_id, team, irating, etc.)	One log line per chunk (35 drivers per chunk). Merge chunks by `session_id` and sort by `chunk_index` for full results table. See docs/observability-scaling.md and § LogQL reference below.
`finalize_capture_started` / `complete` / `timeout`	simhub-plugin	`target_frame`, `duration_ms`	Debug / automation.
`incident_detected`	tracker	`incident_type`, `car_number`, `driver_name`, `unique_user_id` (iRacing CustID), `delta`, `session_time`, `session_num`, `lap` (per-car `CarIdxLap` for the incident car), `replay_frame`, `replay_frame_end`, `start_frame` / `end_frame` (same window as replay frames; aliases for ingestion), `camera_view` (compact string, e.g. `cam_car_idx=N;group=Name`), `cause`, `other_car_number`, `subsession_id`, `parent_session_id`, `track_display_name`, `cam_car_idx` / `camera_group` (when available), `log_env`, `loki_push_target`	Canonical rule name: `iracing_incident` — emitted JSON `event`: `incident_detected` (use this string in LogQL until code renames). Each YAML delta from `OnSessionInfo` (per-car `CurDriverIncidentCount`). Global uniqueness: see § Global incident uniqueness signature below. Use `"not in session"` for `subsession_id` / `parent_session_id` when iRacing has no loaded session.
`baseline_established`	tracker	`driver_count`	When tracker baseline is ready.
`session_reset`	tracker	`old_session`, `new_session`	When `SessionNum` changes.
`seek_backward_detected`	tracker	`from_frame`, `to_frame`, `session_time`	Replay seek.
`yaml_update`	tracker	`session_info_update`, `session_num`, `session_time`	Debug-only.
`session_digest`	simhub-plugin	`session_id`, `session_num`, `track`, `duration_minutes`, `total_incidents`, `results_incident_sum`, `incident_summary`, `incident_summary_truncated`, `results_table`, `results_driver_count`, `actions_dispatched`, …	Single-row session summary. total_incidents = count of `incident_detected` events (plugin); results_incident_sum = sum of iRacing per-driver incident points; results_table = authoritative ResultsPositions (pos, car, driver, incidents, laps, class, reason_out per driver).

Schema reference: `action_dispatched`, `action_result`, `iracing_incident`

PR checklist and domain taxonomy: docs/RULES-ActionCoverage.md. Full-field checklist below for dashboards, PR review, and Loki queries. Labels stay the four-label schema; everything else is JSON body.

`action_dispatched`

Field	Required for completeness	Notes
`action`	yes	Command name
`arg`	yes	Argument payload (may be empty)
`correlation_id`	yes	Trace id for paired `action_result`
`subsession_id`, `parent_session_id`, `session_num`, `track_display_name`	yes	Use `"not in session"` when offline
`lap`	yes	Integer; `-1` when unknown (same as `SessionLogging.LapUnknown`)
`session_id`, `replay_frame`	when applicable	Spine / replay context
`log_env`, `loki_push_target`	recommended	Observability metadata

Volume: In production, omitted by default unless the plugin setting “Log all action traffic” is enabled or SIMSTEWARD_LOG_ALL_ACTIONS=1 is set. The action_received row in the event taxonomy table lists the additional bridge fields logged when that event is enabled.
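
The session and routing fallbacks used by these fields can be sketched as below. The three `loki_push_target` values and the `"not in session"` string come from this document; the rule that a hostname ending in `.grafana.net` means `grafana_cloud`, and both function names, are assumptions for illustration.

```javascript
const NOT_IN_SESSION = "not in session";

// For subsession_id / parent_session_id: iRacing ids are only valid when > 0.
function idOrNotInSession(value) {
  return Number.isInteger(value) && value > 0 ? value : NOT_IN_SESSION;
}

// Classify the SIMSTEWARD_LOKI_URL value into the loki_push_target field.
function lokiPushTarget(url) {
  if (!url || !url.trim()) return "disabled";
  try {
    const host = new URL(url).hostname;
    // Assumed rule: Grafana Cloud endpoints live under grafana.net.
    return host.endsWith(".grafana.net") ? "grafana_cloud" : "local_or_custom";
  } catch {
    return "local_or_custom"; // set but not parseable as a URL
  }
}

console.log(lokiPushTarget("http://localhost:3100")); // local_or_custom
console.log(idOrNotInSession(0)); // not in session
```

Keeping the fallback as an explicit string rather than null makes "offline" lines easy to select in LogQL with a plain equality filter.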

`action_result`

Field	Required for completeness	Notes
`success`	yes	Boolean outcome
`result` / `error`	as applicable	Payload or error detail
`duration_ms`	yes	Handler duration
`action`, `arg`, `correlation_id`	yes	Same command identity as `action_dispatched`
Session + routing fields	yes	Same set as `action_dispatched` (`subsession_id`, `parent_session_id`, `session_num`, `track_display_name`, spine fields, `log_env`, `loki_push_target`), plus `session_yaml_fingerprint_sha256_16` when YAML is available

`iracing_incident` (canonical) / `incident_detected` (emitted)

Field	Required for completeness	Notes
`unique_user_id`	yes	iRacing CustID
`driver_name`	yes	Display name in JSONL (`display_name` in coding rules = same concept)
`session_time`	yes	Time of detection
`subsession_id`, `parent_session_id`, `session_num`, `track_display_name`	yes	`"not in session"` when no session
`replay_frame`	yes	Start frame; use `replay_frame_end` if the event spans a window
`start_frame`, `end_frame`	yes	Same values as `replay_frame` / `replay_frame_end` for stable downstream keys
`lap`	yes	`CarIdxLap` for the car incurring the incident; `-1` if unknown
`cam_car_idx` / `camera_group`	when available	Camera / view context
`camera_view`	recommended	Single string combining camera car and group (see table above)
`incident_type`, `delta`, `car_number`, …	per implementation	Taxonomy / YAML delta fields

Global incident uniqueness signature

Use this tuple to dedupe and join incidents across splits, drivers, and time (Loki / warehouse / replay tools). All fields are in the JSON body (not Loki labels).

- Split / event: subsession_id (iRacing WeekendInfo.SubSessionID) + parent_session_id (WeekendInfo.SessionID) + session_num (practice / qual / race phase).
- Driver: unique_user_id (CustID) + driver_name (display name in the log line).
- When / where: session_time + start_frame + end_frame + track_display_name.
- Perspective: camera_view (or cam_car_idx + camera_group).

If iRacing is not in a loaded session, subsession_id, parent_session_id, and session_num use the same "not in session" fallback as action logs.
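
One way to turn the tuple into a dedupe key downstream is sketched here. The field names are the ones listed above; serializing the tuple with `JSON.stringify` and the function names are assumptions of this sketch. The camera fallback follows the `cam_car_idx=N;group=Name` format noted for `camera_view`.

```javascript
// Build a stable key from the uniqueness tuple (all fields are JSON body fields).
function incidentSignature(e) {
  return JSON.stringify([
    e.subsession_id,
    e.parent_session_id,
    e.session_num,
    e.unique_user_id,
    e.driver_name,
    e.session_time,
    e.start_frame,
    e.end_frame,
    e.track_display_name,
    e.camera_view ?? `cam_car_idx=${e.cam_car_idx};group=${e.camera_group}`,
  ]);
}

// Keep the first occurrence of each signature, preserving input order.
function dedupeIncidents(events) {
  const seen = new Map();
  for (const e of events) {
    const key = incidentSignature(e);
    if (!seen.has(key)) seen.set(key, e);
  }
  return [...seen.values()];
}
```

Serializing an array keeps field boundaries unambiguous, so values such as a driver name containing a separator character cannot collide with a neighbouring field.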

LogQL today: filter on event = "incident_detected". If the emitter is renamed to iracing_incident, update queries and this doc in the same change.

incident_detected and session_digest are the main LogQL entry points for incidents and session-level summaries (use Explore or any Loki panel you add).

Local vs. cloud configuration

Setting	Local Docker	Grafana Cloud
`SIMSTEWARD_LOKI_URL`	`http://localhost:3100`	`https://logs-prod-us-east-0.grafana.net`
`SIMSTEWARD_LOKI_USER`	(blank)	Your instance user ID
`SIMSTEWARD_LOKI_TOKEN`	(blank)	Your log-write token
`SIMSTEWARD_LOG_ENV`	`local`	`production`
`SIMSTEWARD_LOG_DEBUG`	`1` (optional)	`0` or unset
`SIMSTEWARD_LOG_ALL_ACTIONS`	`1` to keep `action_received` and `action_dispatched` in logs	unset (production omits them by default)
`SIMSTEWARD_RESOURCE_SAMPLE_SEC`	Interval in seconds for `host_resource_sample` (15–3600)	`60`

Log all action traffic: In production, action_received and action_dispatched are omitted at source to reduce volume. To capture every command (e.g. for debugging or full click/event visibility), enable Log all action traffic in the plugin settings (Observability / Log filters), or set SIMSTEWARD_LOG_ALL_ACTIONS=1 before starting SimHub.

The plugin reads these once at Init(). To switch environment, edit .env and restart SimHub.

Important: SimHub does not load a .env file. The plugin only sees environment variables that are set in the process that starts SimHub. To get logs into Grafana you must either:

- Local Loki: Start SimHub with the script so env vars are set before launch:
  - From the plugin repo root: .\scripts\run-simhub-local-observability.ps1
  - This sets SIMSTEWARD_LOKI_URL=http://localhost:3100 and SIMSTEWARD_LOG_ENV=local, then starts SimHub.
- Grafana Cloud: Set SIMSTEWARD_LOKI_URL, SIMSTEWARD_LOKI_USER, and SIMSTEWARD_LOKI_TOKEN in your user or system environment, then start SimHub (or use a launcher that sets them).

No data in Grafana

If Explore (or any Loki-backed panel) shows no logs for {app="sim-steward"}:

- Loki URL not set: The plugin pushes to Loki only when SIMSTEWARD_LOKI_URL is set. If you start SimHub by double‑clicking (or from the Start menu), that variable is usually unset.
  - Fix: Start SimHub via .\scripts\run-simhub-local-observability.ps1 for local Loki, or set the Loki env vars before starting SimHub.
  - Check: Open %LOCALAPPDATA%\SimHubWpf\PluginsData\SimSteward\plugin.log and look for a line with event = loki_status. If it says "Loki logging disabled", the URL was not set when the plugin started.
- Local stack not running: For local Docker Loki, ensure the stack is up: cd observability/local && docker compose up -d. Grafana should be at http://localhost:3000 and Loki at http://localhost:3100.
- Wrong query: In Grafana Explore, select the Loki datasource and use LogQL: {app="sim-steward"} or {app="sim-steward", env="local"}. Use a time range that includes when the plugin was running.

After fixing, restart SimHub (using the script for local) and trigger some activity (e.g. open the dashboard, connect iRacing, or run a replay); logs should appear within a few seconds to a minute depending on flush interval.

Panels show no data

If Explore works but a dashboard panel shows "No data", fix the panel’s datasource (use UID loki_local for local provisioning, or your Cloud Loki datasource) and time range. If Explore is also empty, follow the checklist below.

Logs not in Grafana? Checklist

Use this checklist so button presses (Play, etc.) show up in Grafana:

- Start SimHub with env set: Use .\scripts\run-simhub-local-observability.ps1 for local Loki, or set SIMSTEWARD_LOKI_URL (and optional user/token) in your environment before starting SimHub.
- Or enable Loki in the plugin: In SimSteward plugin settings, turn on the Enable Loki logging option so the plugin sets SIMSTEWARD_LOKI_URL=http://localhost:3100 for this run (persists for next start).
- Local stack running: For local Loki, run cd observability/local && docker compose up -d, then confirm Loki at http://localhost:3100.
- Query and time range: In Grafana Explore, select the Loki datasource, query {app="sim-steward"} (optionally env="local"), and set time range to Last 5 minutes or Last 15 minutes.
- Wait for flush: After pressing Play or other buttons (or opening the dashboard, connecting iRacing, or an incident firing), wait 1–2 seconds; these events trigger a debounced flush so logs appear quickly. The periodic timer can still take up to 5 s for other events.
- Check plugin.log: Look for loki_status ("Loki logging enabled" vs "disabled") and loki_first_push_ok (confirms at least one batch reached Loki). If you see push failure warnings, Loki is unreachable (stack down or wrong URL).

Events that trigger prompt flush (1–2 s): Button actions (action_result, action_dispatched), incidents (incident_detected), session lifecycle (checkered_detected, checkered_retry, session_summary_captured, session_digest, session_end_datapoints_session, session_end_datapoints_results, session_capture_skipped), dashboard and iRacing (dashboard_opened, iracing_connected, iracing_disconnected), tracker (baseline_established, session_reset), and bridge/plugin readiness (plugin_ready, bridge_starting, ws_client_connected, ws_client_disconnected). All other events are sent on the next batch-size or 5 s timer.

Debug mode

Set SIMSTEWARD_LOG_DEBUG=1 for local debugging only. When enabled:

- PluginLogger.Debug() emits DEBUG-level entries (still sent to Loki; filter in Explore/AI).
- In-process Loki push (when enabled) uses relaxed flush rules: 2 s timer, 500-entry batch, 5,000-entry queue, no line-size enforcement.
- Additional log events are emitted:
  - tick_stats every 60 ticks (≈1 s): running average data_update_ms, frames_dropped.
  - yaml_update: each SessionInfoUpdate refresh.
  - ws_message_raw: every WebSocket message (raw JSON) for debugging dashboard commands.
  - incident_detected includes a snapshot field with the current PluginSnapshot.

Never enable debug in production. For AI or assistant queries, filter with | level != "DEBUG".

SessionStats and session_digest

SessionStats accumulates per session (reset on iracing_connected):

Metric	What it tracks
`ActionsDispatched`	Total actions processed this session.
`ActionFailures`	Count of actions that returned `success = false`.
`PluginErrors` / `PluginWarns`	From `_sessionStats.IncrementErrors()` / `IncrementWarns()`.
`WsPeakClients`	Peak WebSocket client count per session.
`ActionLatenciesMs`	Rolling sample for P50/P95.
`Incidents`	Incident summaries (e.g. for digest).

session_digest is emitted at most once per session (guarded by _sessionDigestEmitted). It caps incident_summary to 20 entries (highest severity first) and sets incident_summary_truncated: true when truncated. The digest is triggered by CaptureSessionSummaryNow, FinalizeThenCaptureSessionSummary, or the checkered flag; make sure one of these runs so downstream AI and Grafana panels see the session as complete.
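
The cap and the emit-once guard can be sketched as follows. The numeric `severity` field, the function names, and the `emit` callback are hypothetical; the document only states the 20-entry cap, the ordering, the truncation flag, and the once-per-session guard.

```javascript
const MAX_SUMMARY_ENTRIES = 20;

// Sort by the (hypothetical) severity field, highest first, then cap.
function capIncidentSummary(incidents) {
  const sorted = [...incidents].sort((a, b) => b.severity - a.severity);
  return {
    incident_summary: sorted.slice(0, MAX_SUMMARY_ENTRIES),
    incident_summary_truncated: sorted.length > MAX_SUMMARY_ENTRIES,
  };
}

// Emit-once guard, playing the role of _sessionDigestEmitted.
function makeDigestEmitter(emit) {
  let emitted = false;
  return {
    emitOnce(incidents) {
      if (emitted) return false; // a digest already went out for this session
      emitted = true;
      emit({ event: "session_digest", ...capIncidentSummary(incidents) });
      return true;
    },
    reset() {
      emitted = false; // call when a new session starts
    },
  };
}
```

Whichever trigger fires first produces the digest, and later triggers in the same session become no-ops until `reset()` is called.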

Incident semantics: total_incidents is the count of incident_detected events (from IncidentTracker CurDriverIncidentCount deltas). The results_table incidents column and results_incident_sum are iRacing’s per-driver incident points at session end (from ResultsPositions). So total_incidents (e.g. 12 events) and results_incident_sum (e.g. 24 points) can both be correct but differ. For what iRacing can supply when (live vs replay vs YAML vs REST), see docs/IRACING-DATA-AVAILABILITY.md.

Grafana dashboards (repo)

There are no provisioned dashboard JSON files in the repo at the moment. Use Grafana → Explore with the queries in § LogQL reference. To add dashboards again, put files under observability/local/grafana/provisioning/dashboards/ as a single object per file: { "dashboard": { ... }, "overwrite": true }, with panels using datasource { "type": "loki", "uid": "loki_local" } locally, or a variable such as DS_LOKI on Grafana Cloud.

Validate events over a time range in docs/observability-testing.md (Explore / LogQL checks).

Derived fields (Loki datasource)

The provisioned Loki datasource (observability/local/grafana/provisioning/datasources/loki.yml) defines derived fields so Grafana extracts and shows key JSON fields when viewing log lines:

Derived field	Extracted from	Notes
Event	`event`	Event type (e.g. `action_result`, `incident_detected`).
Correlation ID	`correlation_id`	Clickable: runs LogQL filter by this correlation (trace).
Session ID	`session_id`	Session identifier.
Action	`action`	Command/action name.
Success	`success`	Action outcome (`true` / `false`).
Trigger	`trigger`	Session summary trigger (e.g. `checkered`, `finalize`).
Incident type	`incident_type`	Incident classification.
Driver name	`driver_name`	Driver name in incident/session context.
Car number	`car_number`	Car number (when serialized as string).

In Explore or any Loki log panel, these appear as parsed columns/links alongside the raw line.

Local quickstart

# Persistent storage (host path must exist before Docker)
New-Item -ItemType Directory -Force "S:\sim-steward-grafana-storage"

# In project .env, for local:
# SIMSTEWARD_LOKI_URL=http://localhost:3100
# SIMSTEWARD_LOKI_USER=
# SIMSTEWARD_LOKI_TOKEN=
# SIMSTEWARD_LOG_ENV=local
# SIMSTEWARD_LOG_DEBUG=1

cd observability/local
docker compose up -d
# Grafana: http://localhost:3000  |  Loki: http://localhost:3100 (no auth for direct push)

# Optional: local loki-gateway (token-protected push); see observability-local.md

See docs/observability-local.md for Grafana + Loki + gateway setup and LOKI_PUSH_TOKEN.

Housekeeping (Grafana Cloud)

Dashboards: Delete in the Grafana UI or via the HTTP API using an existing editor/admin token. Do not delete the Loki datasource or change stack URLs, SIMSTEWARD_LOKI_USER, or SIMSTEWARD_LOKI_TOKEN.

Stored logs: Rely on plan retention, or use Grafana Cloud / Loki documented deletion flows for your tier. Clearing data must not require rotating credentials.

Local Docker: Wipe Loki (and optionally Grafana) bind-mount data with docs/observability-local.md § Housekeeping (scripts/obs-wipe-local-data.ps1).

CLI: direct Loki query (repo)

With .env containing SIMSTEWARD_LOKI_URL, SIMSTEWARD_LOKI_USER, and SIMSTEWARD_LOKI_TOKEN (optional override: LOKI_QUERY_URL):

| Command | Purpose |
| --- | --- |
| `pnpm run loki:query` | One-off `GET .../loki/api/v1/query_range` via `scripts/query-loki-once.mjs`. Flags: `--query` (LogQL), `--limit`, `--lookback` (seconds). |
| `pnpm run env:run -- <command>` | Load `.env` into the child process (e.g. `pnpm run env:run -- pwsh -NoProfile -File scripts/poll-loki.ps1`). |
| `pnpm run obs:poll` | Tail-style poll, direct Loki (default). |
| `pnpm run obs:poll:grafana` | Same, but `-ViaGrafana` (Bearer → Grafana proxy → Loki). |
| `pnpm run obs:poll:grafana:env` | Same as `obs:poll:grafana` but injects `.env` with `dotenv-cli` first (secrets only in the child process). |

Path A (direct Loki): SIMSTEWARD_LOKI_* + pnpm run loki:query or pnpm run obs:poll.

Path B (Grafana Cloud, elevated glsa_* Bearer): Set GRAFANA_URL to your stack (https://<slug>.grafana.net — not logs-prod-*.grafana.net). Set GRAFANA_LOKI_DATASOURCE_UID to the Loki datasource UID in that stack (Connections → Data sources). Set GRAFANA_API_TOKEN or CURSOR_ELEVATED_GRAFANA_TOKEN (service account token with permission to query the Loki datasource via the proxy). Then:

```shell
pnpm run obs:poll:grafana:env
# or: pnpm run env:run -- pwsh -NoProfile -File scripts/poll-loki.ps1 -ViaGrafana
```

poll-loki.ps1 reads .env from disk on its own. The pnpm scripts whose names end in :env additionally wrap the command in dotenv -e .env, so the variables are loaded into the child process without being exported in your shell.

401/403: On Path A, the glc_* policy may lack Loki read. On Path B, ensure the Bearer token can query datasources; check datasource UID and stack URL.
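The two paths differ only in base URL and Authorization header. The sketch below shows how such a request could be assembled. It is not the code in scripts/query-loki-once.mjs or poll-loki.ps1; the function names are hypothetical, and the Path B route assumes Grafana's standard datasource proxy (`/api/datasources/proxy/uid/<uid>/...`).

```python
import base64
from urllib.parse import urlencode

QUERY_PATH = "/loki/api/v1/query_range"


def query_string(logql: str, limit: int, lookback_s: int, now_ns: int) -> str:
    """Encode a query_range request; start and end are Unix epoch nanoseconds."""
    start_ns = now_ns - lookback_s * 1_000_000_000
    return urlencode({"query": logql, "limit": limit, "start": start_ns, "end": now_ns})


def direct_request(loki_url: str, user: str, token: str, qs: str):
    """Path A: query Loki itself; Basic auth only when a user is configured."""
    headers = {}
    if user:
        creds = base64.b64encode(f"{user}:{token}".encode()).decode()
        headers["Authorization"] = f"Basic {creds}"
    return f"{loki_url.rstrip('/')}{QUERY_PATH}?{qs}", headers


def grafana_request(grafana_url: str, datasource_uid: str, bearer: str, qs: str):
    """Path B: query through the Grafana datasource proxy with a Bearer token."""
    base = f"{grafana_url.rstrip('/')}/api/datasources/proxy/uid/{datasource_uid}"
    return f"{base}{QUERY_PATH}?{qs}", {"Authorization": f"Bearer {bearer}"}


qs = query_string('{app="sim-steward"}', limit=10, lookback_s=60, now_ns=1_000_000_000_000)
url_a, headers_a = direct_request("http://localhost:3100", "", "", qs)
url_b, headers_b = grafana_request("https://example.grafana.net/", "my-uid", "glsa_example", qs)
```

When a request returns 401/403, comparing the generated URL and header against what the scripts actually send can help separate a token-scope problem from a wrong stack URL or datasource UID.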

LogQL reference

| Purpose | LogQL |
| --- | --- |
| Command audit | `{app="sim-steward", component="simhub-plugin"} \| json \| event = "action_result"` |
| Failed commands | `{app="sim-steward", component="simhub-plugin"} \| json \| event = "action_result" \| success = "false"` |
| Action volume (timeseries) | `count_over_time({app="sim-steward", component="simhub-plugin"} \| json \| event = "action_result" [$__interval])` |
| Incident timeline | `{app="sim-steward", component="tracker"} \| json \| event = "incident_detected"` |
| Plugin lifecycle | `{app="sim-steward", component="simhub-plugin"} \| json \| event =~ "plugin_started\|plugin_ready\|iracing_connected\|iracing_disconnected\|plugin_stopped"` |
| Session digests | `{app="sim-steward", component="simhub-plugin"} \| json \| event = "session_digest"` |
| Session end metadata | `{app="sim-steward", component="simhub-plugin"} \| json \| event = "session_end_datapoints_session"` |
| Session end results (all chunks for one session) | `{app="sim-steward", component="simhub-plugin"} \| json \| event = "session_end_datapoints_results" \| session_id = "<id>"`. Merge in panels by sorting on `chunk_index` and flattening `results`. |
| WS peak (stat) | `max_over_time({app="sim-steward"} \| json \| event = "session_digest" \| unwrap ws_peak_clients [24h])` |
| Trace by correlation | `{app="sim-steward"} \| json \| correlation_id = "<id>"` |
| All errors | `{app="sim-steward", level="ERROR"}` |
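Because the same lines exist on disk in plugin-structured.jsonl, the field filters above can be approximated locally without Loki. The following is a minimal sketch under assumptions: `filter_entries` is a hypothetical helper, the sample action names are invented, and `success` is assumed to be stored as a JSON boolean on disk (LogQL's `| json` stage compares it as the string "false").

```python
import json
from typing import Iterable, Iterator


def filter_entries(lines: Iterable[str], **expected: object) -> Iterator[dict]:
    """Yield parsed NDJSON entries whose fields equal every expected value."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip unparseable lines, e.g. a write still in progress
        if all(entry.get(key) == value for key, value in expected.items()):
            yield entry


# Illustrative lines; the action names are made up for this example.
sample = [
    '{"event":"action_result","action":"replay_seek","success":true}',
    '{"event":"action_result","action":"capture","success":false}',
    '{"event":"session_digest"}',
]
# Rough local equivalent of the "Failed commands" query.
failed = list(filter_entries(sample, event="action_result", success=False))
print(failed)
```

To run it against the real file, pass an open file handle instead of `sample`, since a file object iterates line by line.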

Replay control and incident counts

| Purpose | LogQL |
| --- | --- |
| Replay control by speed | `{app="sim-steward", component="simhub-plugin"} \| json \| event = "replay_control" \| speed != ""`. Filter by `speed` (e.g. 16, 8, 1) to see which replay speed was used. |
| Replay control (all) | `{app="sim-steward", component="simhub-plugin"} \| json \| event = "replay_control"`. Covers seek/play/pause and speed. |
| Incident count (timeseries) | `count_over_time({app="sim-steward", component="tracker"} \| json \| event = "incident_detected" [$__interval])` |
| Incident count (total in range) | `count_over_time({app="sim-steward", component="tracker"} \| json \| event = "incident_detected" [$__range])` |
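The difference between the two incident-count queries is the window: `[$__interval]` produces one count per panel interval, while `[$__range]` produces a single count for the whole selected range. A rough Python analogy follows, using fixed non-overlapping buckets; Loki actually evaluates a range window at each step, so treat this as an approximation with made-up timestamps.

```python
from collections import Counter


def count_per_interval(timestamps_s, interval_s):
    """Count events per fixed-width bucket, keyed by the bucket's start time."""
    buckets = Counter((int(ts) // interval_s) * interval_s for ts in timestamps_s)
    return dict(sorted(buckets.items()))


# Illustrative incident timestamps in seconds.
incident_times = [0, 10, 59, 60, 125]
per_minute = count_per_interval(incident_times, 60)  # analogous to [$__interval]
total = sum(per_minute.values())                      # analogous to [$__range]
print(per_minute, total)
```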

session_summary_captured is emitted only when the plugin successfully captures the session results table (ResultsPositions). It does not fire when results are not yet available (e.g. before checkered, or in a short replay clip that never reaches session end). Use session_capture_skipped to see when capture was attempted but results were empty; use session_end_fingerprint (when implemented) to see what data was available at session end.

Chunked session results (hundreds of drivers)

End-of-session driver results are emitted as session_end_datapoints_session (metadata once) plus session_end_datapoints_results (one log line per chunk of 35 drivers). To show a full results table for a session in Grafana log/table panels:

Query: {app="sim-steward", component="simhub-plugin"} | json | event = "session_end_datapoints_results" | session_id = "<session_id>".
In the panel transform: sort by chunk_index, then use a transform that flattens the results array from each chunk into a single table (e.g. "Merge" / "Flatten" or a custom transformation that concatenates results in order).

Exact driver count is in session_end_datapoints_session (results_driver_count); use that event when you need session metadata without parsing result chunks.
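The merge described above (sort by chunk_index, then concatenate each chunk's results) can be sketched outside Grafana as follows. The field names chunk_index, results, and results_driver_count come from this section; the helper name and the row contents are illustrative assumptions.

```python
from typing import Optional


def merge_result_chunks(chunks: list, expected_count: Optional[int] = None) -> list:
    """Concatenate the results arrays of all chunks in chunk_index order."""
    ordered = sorted(chunks, key=lambda chunk: chunk["chunk_index"])
    merged = [row for chunk in ordered for row in chunk["results"]]
    # Optional sanity check against results_driver_count from the metadata event.
    if expected_count is not None and len(merged) != expected_count:
        raise ValueError(f"expected {expected_count} drivers, got {len(merged)}")
    return merged


# Illustrative chunks arriving out of order; row fields are made up.
chunks = [
    {"chunk_index": 1, "results": [{"car_number": "36"}, {"car_number": "37"}]},
    {"chunk_index": 0, "results": [{"car_number": "1"}]},
]
table = merge_result_chunks(chunks, expected_count=3)
print([row["car_number"] for row in table])
```

Checking the merged length against results_driver_count is a cheap way to notice a missing chunk, for example when the query time range cuts off part of the session end.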

AI integrations (Grafana Cloud)

Grafana Sift — Pattern grouping for errors: use {app="sim-steward", level="ERROR"}.
Grafana Assistant — Start with session_digest; then drill down by session_id or correlation_id. Filter | level != "DEBUG" for production.
Natural-language LogQL — In Explore, use field names like action, success, duration_ms, incident_type, session_id, correlation_id.
MCP — Optional: Grafana MCP connector (e.g. github.com/grafana/mcp-grafana) so Cursor or other tools query logs; point at Grafana Cloud with a token.

Phase 2 (future): OTel metrics

After Phase 1 is in use and logs are observed in production, Phase 2 can add OpenTelemetry metrics (e.g. via Grafana.OpenTelemetry NuGet) for faster dashboards, longer retention, and metric-based alerting. Metric names and env vars are in the Grafana Loki logging plan.
