⚡ Bolt: Implement telemetry caching and optimize status helpers #82
heidi-dang wants to merge 1 commit into `feat/bootstrap-scaffold`
Conversation
Implemented a thread-safe `StateCache` for `state.json` and added TTL-based caching for `get_gpu_summary` and `get_last_event_ts`.

- `get_state` is now ~56% faster by avoiding redundant JSON parsing.
- `nvidia-smi` calls are reduced by caching (TTL 5s).
- `get_last_event_ts` is safer and more efficient.
- Moved status helpers to module level for better maintainability.
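The `StateCache` code itself is not shown in this view; a minimal sketch of how a thread-safe, metadata-validated cache for `state.json` might look (the constructor signature, stamp format, and `get` method name here are illustrative assumptions, not the PR's actual API):

```python
import json
import threading
from copy import deepcopy
from pathlib import Path


class StateCache:
    """Sketch: cache parsed state.json, invalidated by file mtime/size."""

    def __init__(self, state_file: Path):
        self._state_file = state_file
        self._lock = threading.Lock()
        self._cached = None
        self._stamp = None  # (mtime_ns, size) of the last parsed file

    def get(self):
        with self._lock:
            try:
                st = self._state_file.stat()
                stamp = (st.st_mtime_ns, st.st_size)
                if stamp != self._stamp:
                    # File changed on disk: re-read and re-parse once.
                    self._cached = json.loads(self._state_file.read_text())
                    self._stamp = stamp
                # deepcopy so callers cannot mutate the shared cached dict.
                return deepcopy(self._cached)
            except Exception:
                return None
```

The `deepcopy` on return matches the PR's note that `deepcopy` is used "for safety": without it, a caller mutating the returned dict would corrupt the cache for everyone else.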
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode; when this mode is on, I will only act on comments where you specifically mention me. New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task.
Summary of Changes

Hello @heidi-dang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances the performance of telemetry and status endpoints by introducing a multi-layered caching strategy. The changes aim to reduce redundant disk I/O and CPU overhead caused by frequent polling, making the system more responsive and efficient, particularly for dashboards and TUI applications. The optimizations focus on caching file reads and subprocess call results, leading to faster data retrieval for critical status information.

Highlights
Changelog
Code Review
This pull request introduces several caching optimizations to heidi_engine/telemetry.py to improve the performance of status polling, which is a great enhancement. The implementation of StateCache and the TTL-based caches for GPU summary and event timestamps are well-structured. My review focuses on improving the robustness of the new caching mechanisms, particularly around error handling and ensuring failures are also cached to prevent redundant I/O. I've suggested changes to log exceptions instead of ignoring them silently and to refine the caching logic in get_last_event_ts.
```python
except Exception:
    return None
```
Swallowing all exceptions silently can make it difficult to diagnose problems with reading or parsing state.json. For instance, a malformed JSON file would cause this to return None without any indication of the root cause. It's better to log the exception to stderr for debugging, which is consistent with other error handling in this file.
```diff
-except Exception:
-    return None
+except Exception as e:
+    print(f"[WARN] StateCache failed to read or parse state file {state_file}: {e}", file=sys.stderr)
+    return None
```
```python
except Exception:
    pass
```
Silently passing on all exceptions can hide underlying issues with nvidia-smi (e.g., it's not installed, command fails, parsing error). It's better to log the exception to stderr for easier debugging.
```diff
-except Exception:
-    pass
+except Exception as e:
+    print(f"[WARN] Failed to get GPU summary: {e}", file=sys.stderr)
```
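Placed in context, the suggested handler would sit inside the TTL-cached helper. The real `get_gpu_summary` is not shown in this PR view, so the following is only a sketch of the pattern described (the `nvidia-smi` query flags, cache shape, and 5-second TTL are assumptions drawn from the PR description):

```python
import subprocess
import sys
import time

GPU_TTL = 5.0  # seconds; the PR description says the GPU summary TTL is 5s
_gpu_cache = {"data": None, "last_polled": 0.0}


def get_gpu_summary():
    """Sketch: return a GPU summary string, caching nvidia-smi output for GPU_TTL."""
    global _gpu_cache
    now = time.monotonic()
    if _gpu_cache["data"] is not None and (now - _gpu_cache["last_polled"]) < GPU_TTL:
        return _gpu_cache["data"]
    summary = None
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
             "--format=csv,noheader"],
            capture_output=True, text=True, timeout=2, check=True,
        )
        summary = out.stdout.strip()
    except Exception as e:
        # Log instead of silently passing, as the review suggests.
        print(f"[WARN] Failed to get GPU summary: {e}", file=sys.stderr)
    _gpu_cache = {"data": summary, "last_polled": now}
    return summary
```

Note that with a value-based cache check like this one, a failed `nvidia-smi` call (cached `None`) is retried on every poll; that is exactly the weakness the review raises for `get_last_event_ts` below, and the same time-based fix would apply here.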
```python
    if (
        _last_event_ts_cache["data"]
        and (now - _last_event_ts_cache["last_polled"]) < LAST_EVENT_TS_TTL
    ):
        return _last_event_ts_cache["data"]

    try:
        events_file = get_events_path()
        if events_file.exists():
            file_size = events_file.stat().st_size
            if file_size > 0:
                with open(events_file, "rb") as f:
                    # BOLT OPTIMIZATION: Safely seek even for small files
                    f.seek(-min(500, file_size), 2)
                    lines = f.read().decode().strip().split("\n")
                    if lines:
                        last_line = lines[-1]
                        event = json.loads(last_line)
                        ts = event.get("ts")
                        _last_event_ts_cache = {"data": ts, "last_polled": now}
                        return ts
    except Exception:
        pass

    return None
```
This function can be improved in two ways:

- Cache `None` results: The current implementation doesn't cache `None` results (e.g., when the event file is empty or an error occurs). This leads to repeated file I/O on every call in these scenarios. Caching `None` for the TTL duration would be more efficient.
- Error logging: The `except Exception: pass` silently swallows errors, making it hard to debug issues with reading the event file.

This refactoring addresses both points by making the cache check time-based (so `None` results are cached), adding logging for exceptions, and safely handling potentially partial JSON lines.
```python
    if (
        _last_event_ts_cache["last_polled"] > 0
        and (now - _last_event_ts_cache["last_polled"]) < LAST_EVENT_TS_TTL
    ):
        return _last_event_ts_cache["data"]

    ts = None
    try:
        events_file = get_events_path()
        if events_file.exists():
            file_size = events_file.stat().st_size
            if file_size > 0:
                with open(events_file, "rb") as f:
                    # BOLT OPTIMIZATION: Safely seek even for small files
                    f.seek(-min(500, file_size), 2)
                    lines = f.read().decode(errors="ignore").strip().split("\n")
                    if lines and lines[-1]:
                        try:
                            event = json.loads(lines[-1])
                            ts = event.get("ts")
                        except json.JSONDecodeError:
                            # Last line might be partial, ignore.
                            pass
    except Exception as e:
        print(f"[WARN] Failed to get last event timestamp: {e}", file=sys.stderr)

    _last_event_ts_cache = {"data": ts, "last_polled": now}
    return ts
```
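The decisive change in this suggestion is keying the cache check on `last_polled` rather than on the cached value, which makes `None` a cacheable ("negative") result. A standalone illustration of that pattern, where `load()` is a hypothetical stand-in for the expensive file read:

```python
import time

TTL = 1.0  # seconds
_cache = {"data": None, "last_polled": 0.0}
calls = {"n": 0}


def load():
    # Hypothetical expensive read; counts invocations for the demo.
    calls["n"] += 1
    return None  # simulate an empty events file


def get_cached():
    global _cache
    now = time.monotonic()
    # Time-based check: a cached None is served until the TTL expires.
    if _cache["last_polled"] > 0 and (now - _cache["last_polled"]) < TTL:
        return _cache["data"]
    data = load()
    _cache = {"data": data, "last_polled": now}
    return data
```

With a value-based check (`if _cache["data"] and ...`), the second call would hit `load()` again; with the time-based check it is served from cache even though the result is `None`.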
💡 What: Implemented three levels of caching in `heidi_engine/telemetry.py`:

- `StateCache` singleton to cache `state.json` content using file metadata validation.
- TTL cache for `get_gpu_summary` (5s) to avoid expensive `nvidia-smi` subprocess calls.
- TTL cache for `get_last_event_ts` (1s) to reduce redundant file reads.

🎯 Why: The `/status` endpoint and TUI dashboard poll these functions frequently (2Hz+), leading to significant redundant I/O and CPU overhead from JSON parsing and subprocess execution.

📊 Impact:

- `get_state`: 0.09ms -> 0.04ms (~56% faster).
- `nvidia-smi`: Significant reduction in subprocess overhead (cached for 5s).

🔬 Measurement: Verified with benchmark scripts (`bench_telemetry.py`, `bench_gpu.py`, `bench_events.py`) and existing tests.

Other: Handled edge cases like small log files in `get_last_event_ts` and used `deepcopy` in `StateCache` for safety.

PR created automatically by Jules for task 6381428177306477881 started by @heidi-dang