⚡ Bolt: Implement telemetry caching and optimize status helpers#82

Open
heidi-dang wants to merge 1 commit into feat/bootstrap-scaffold from bolt/telemetry-caching-6381428177306477881

Conversation

@heidi-dang
Owner

💡 What: Implemented three levels of caching in heidi_engine/telemetry.py:

  1. StateCache singleton to cache state.json content using file metadata validation.
  2. TTL-based cache for get_gpu_summary (5s) to avoid expensive nvidia-smi subprocess calls.
  3. TTL-based cache for get_last_event_ts (1s) to reduce redundant file reads.
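The three layers above can be sketched as follows. This is a minimal illustration of the metadata-validated cache idea, not the actual code in heidi_engine/telemetry.py; the class shape, method names, and the 2-second default TTL are assumptions.

```python
import copy
import json
import os
import threading
import time

class StateCache:
    """Sketch of a thread-safe state.json cache validated by file metadata."""

    def __init__(self, ttl_s: float = 2.0):
        self._lock = threading.Lock()
        self._ttl_s = ttl_s
        self._cached = None      # parsed dict, or None
        self._meta = None        # (mtime, size) of the file when cached
        self._loaded_at = 0.0

    def get(self, path: str):
        with self._lock:
            now = time.monotonic()
            try:
                st = os.stat(path)
                meta = (st.st_mtime, st.st_size)
            except OSError:
                return None
            # Serve the cached copy while metadata is unchanged and TTL not expired.
            if (self._cached is not None
                    and meta == self._meta
                    and (now - self._loaded_at) < self._ttl_s):
                return copy.deepcopy(self._cached)
            with open(path, "r", encoding="utf-8") as f:
                self._cached = json.load(f)
            self._meta = meta
            self._loaded_at = now
            # deepcopy so callers cannot mutate the cached object in place.
            return copy.deepcopy(self._cached)
```

The deepcopy on every hit trades a little speed for safety, matching the PR description's note that StateCache uses deepcopy to protect the cached object.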

🎯 Why: The /status endpoint and TUI dashboard poll these functions frequently (2Hz+), leading to significant redundant IO and CPU overhead from JSON parsing and subprocess execution.

📊 Impact:

  • get_state: 0.09ms -> 0.04ms (~56% faster).
  • nvidia-smi: Significant reduction in subprocess overhead (cached for 5s).

🔬 Measurement: Verified with benchmark scripts (bench_telemetry.py, bench_gpu.py, bench_events.py) and existing tests.
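The benchmark scripts themselves are not shown in this PR description; a micro-benchmark of the cached vs. uncached read path could look roughly like this (payload and iteration counts are made up):

```python
import json
import timeit

# Hypothetical payload standing in for state.json content.
payload = json.dumps({"runs": [{"id": i, "status": "ok"} for i in range(100)]})

def read_uncached():
    # Parses the JSON on every call, as get_state did before caching.
    return json.loads(payload)

_cached = json.loads(payload)

def read_cached():
    # Returns the already-parsed object, as a cache hit does.
    return _cached

uncached_s = timeit.timeit(read_uncached, number=5_000)
cached_s = timeit.timeit(read_cached, number=5_000)
print(f"uncached: {uncached_s * 1000:.2f}ms  cached: {cached_s * 1000:.2f}ms")
```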

Other: Handled edge cases like small log files in get_last_event_ts and used deepcopy in StateCache for safety.
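The small-file edge case is the negative seek offset: seeking 500 bytes back from the end of a file shorter than 500 bytes raises OSError. A standalone sketch of the guarded tail read (the real get_last_event_ts also layers the 1s TTL cache on top; this helper name is hypothetical):

```python
import json
import os

def last_event_ts_sketch(path: str):
    """Read the "ts" field of the last JSON line, tolerating small files."""
    try:
        size = os.path.getsize(path)
        if size == 0:
            return None
        with open(path, "rb") as f:
            # Seek at most 500 bytes back from the end; min() keeps the
            # offset valid when the file is shorter than the window.
            f.seek(-min(500, size), os.SEEK_END)
            lines = f.read().decode(errors="ignore").strip().split("\n")
        return json.loads(lines[-1]).get("ts")
    except (OSError, json.JSONDecodeError):
        return None
```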


PR created automatically by Jules for task 6381428177306477881 started by @heidi-dang

Implemented a thread-safe `StateCache` for `state.json` and added TTL-based caching for `get_gpu_summary` and `get_last_event_ts`.

- `get_state` is now ~56% faster by avoiding redundant JSON parsing.
- `nvidia-smi` calls are reduced by caching (TTL 5s).
- `get_last_event_ts` is safer and more efficient.
- Moved status helpers to module level for better maintainability.
@google-labs-jules
Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@gemini-code-assist

Summary of Changes

Hello @heidi-dang, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance of telemetry and status endpoints by introducing a multi-layered caching strategy. The changes aim to reduce redundant disk I/O and CPU overhead caused by frequent polling, making the system more responsive and efficient, particularly for dashboards and TUI applications. The optimizations focus on caching file reads and subprocess call results, leading to faster data retrieval for critical status information.

Highlights

  • State.json Caching: Implemented a StateCache singleton that caches state.json content, validating it using file metadata (mtime/size) and a configurable TTL, resulting in a ~56% speed improvement for state reads.
  • GPU Summary Caching: Introduced a 5-second TTL-based cache for get_gpu_summary to significantly reduce expensive nvidia-smi subprocess calls, which are frequently polled by status endpoints.
  • Last Event Timestamp Caching: Added a 1-second TTL-based cache for get_last_event_ts to minimize redundant file reads when determining the timestamp of the last event, including robust handling for small log files.
Changelog
  • .jules/bolt.md
    • Added a new learning entry about the benefits of metadata-validated and TTL-based caching for status endpoints.
  • heidi_engine/telemetry.py
    • Imported copy and time modules for caching functionality.
    • Introduced HEIDI_STATUS_TTL_S environment variable for configuring status cache TTL.
    • Implemented a new StateCache class as a thread-safe singleton for caching state.json based on file metadata and TTL.
    • Modified get_state function to utilize the StateCache for current run status queries, improving read performance.
    • Moved get_gpu_summary, get_last_event_ts, and redact_state functions to the top-level scope.
    • Integrated a 5-second TTL cache into get_gpu_summary to optimize nvidia-smi calls.
    • Integrated a 1-second TTL cache into get_last_event_ts to reduce event log file reads, with improved handling for small files.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for the GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces several caching optimizations to heidi_engine/telemetry.py to improve the performance of status polling, which is a great enhancement. The implementation of StateCache and the TTL-based caches for GPU summary and event timestamps are well-structured. My review focuses on improving the robustness of the new caching mechanisms, particularly around error handling and ensuring failures are also cached to prevent redundant I/O. I've suggested changes to log exceptions instead of ignoring them silently and to refine the caching logic in get_last_event_ts.

Comment on lines +322 to +323

    except Exception:
        return None


medium

Swallowing all exceptions silently can make it difficult to diagnose problems with reading or parsing state.json. For instance, a malformed JSON file would cause this to return None without any indication of the root cause. It's better to log the exception to stderr for debugging, which is consistent with other error handling in this file.

Suggested change

    -    except Exception:
    -        return None
    +    except Exception as e:
    +        print(f"[WARN] StateCache failed to read or parse state file {state_file}: {e}", file=sys.stderr)
    +        return None

Comment on lines +1406 to +1407

    except Exception:
        pass


medium

Silently passing on all exceptions can hide underlying issues with nvidia-smi (e.g., it's not installed, command fails, parsing error). It's better to log the exception to stderr for easier debugging.

Suggested change

    -    except Exception:
    -        pass
    +    except Exception as e:
    +        print(f"[WARN] Failed to get GPU summary: {e}", file=sys.stderr)

Comment on lines +1422 to +1446

    if (
        _last_event_ts_cache["data"]
        and (now - _last_event_ts_cache["last_polled"]) < LAST_EVENT_TS_TTL
    ):
        return _last_event_ts_cache["data"]

    try:
        events_file = get_events_path()
        if events_file.exists():
            file_size = events_file.stat().st_size
            if file_size > 0:
                with open(events_file, "rb") as f:
                    # BOLT OPTIMIZATION: Safely seek even for small files
                    f.seek(-min(500, file_size), 2)
                    lines = f.read().decode().strip().split("\n")
                    if lines:
                        last_line = lines[-1]
                        event = json.loads(last_line)
                        ts = event.get("ts")
                        _last_event_ts_cache = {"data": ts, "last_polled": now}
                        return ts
    except Exception:
        pass

    return None


medium

This function can be improved in two ways:

  1. Cache None results: The current implementation doesn't cache None results (e.g., when the event file is empty or an error occurs). This leads to repeated file I/O on every call in these scenarios. Caching None for the TTL duration would be more efficient.
  2. Error logging: The except Exception: pass silently swallows errors, making it hard to debug issues with reading the event file.

This refactoring addresses both points by modifying the cache check to be time-based, ensuring None results are cached, adding logging for exceptions, and safely handling potentially partial JSON lines.

    if (
        _last_event_ts_cache["last_polled"] > 0
        and (now - _last_event_ts_cache["last_polled"]) < LAST_EVENT_TS_TTL
    ):
        return _last_event_ts_cache["data"]

    ts = None
    try:
        events_file = get_events_path()
        if events_file.exists():
            file_size = events_file.stat().st_size
            if file_size > 0:
                with open(events_file, "rb") as f:
                    # BOLT OPTIMIZATION: Safely seek even for small files
                    f.seek(-min(500, file_size), 2)
                    lines = f.read().decode(errors="ignore").strip().split("\n")
                    if lines and lines[-1]:
                        try:
                            event = json.loads(lines[-1])
                            ts = event.get("ts")
                        except json.JSONDecodeError:
                            # Last line might be partial, ignore.
                            pass
    except Exception as e:
        print(f"[WARN] Failed to get last event timestamp: {e}", file=sys.stderr)

    _last_event_ts_cache = {"data": ts, "last_polled": now}
    return ts
