Conversation
CI Commands

The following CI workflows run automatically on every push and pull request:
The following commands can be used by maintainers to trigger additional tests that require access to secrets:
@luccabb could I get some feedback?
@theap06 we already support json to files at the file exporter, maybe add CSV shape support there?
https://facebookresearch.github.io/gcm/docs/GCM_Monitoring/exporters/file/
…research#87) Add a new 'telemetry' sink that periodically appends telemetry snapshots to a local file in JSON or CSV format for offline analysis.
- JSON: NDJSON format (one object per line)
- CSV: Header on first write, append rows
- Options: file_path (required), format (json|csv, default json)
- Works with nvml_monitor, slurm_monitor, and health checks

Closes facebookresearch#87
Made-with: Cursor
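The sink behavior the commit describes (NDJSON append; CSV header on first write, rows appended after) can be sketched roughly as below. The function name and option handling are assumptions for illustration, not the actual gcm code:

```python
# Sketch of the described sink behavior; names are assumptions, not gcm's API.
import csv
import json
import os

def append_snapshot(file_path: str, record: dict, fmt: str = "json") -> None:
    if fmt == "json":
        # NDJSON: one JSON object per line, always appended.
        with open(file_path, "a") as f:
            f.write(json.dumps(record) + "\n")
    elif fmt == "csv":
        # Header only when the file is new or empty; rows appended afterwards.
        write_header = (
            not os.path.exists(file_path) or os.path.getsize(file_path) == 0
        )
        with open(file_path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(record))
            if write_header:
                writer.writeheader()
            writer.writerow(record)
    else:
        raise ValueError(f"unsupported format: {fmt}")
```

Note this sketch shares the limitation discussed later in the thread: repeated CSV appends assume every record has the same keys.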
Force-pushed from 94e1f2c to 4f00cf7.
added support for csv
Co-authored-by: lucca bertoncini <32229669+luccabb@users.noreply.github.com>
Force-pushed from 184a18e to 060fafe.
- Add format option (json|csv) to file exporter
- CSV output uses flattened scuba message structure
- First write adds header row, subsequent writes append
- Update docs for both Monitoring and Health Checks

Made-with: Cursor
Force-pushed from 060fafe to 961b5b7.
…p06/gcm into feat/telemetry-export-clean
@luccabb Could I get some feedback?
luccabb left a comment
execution cmd alone in your test plan is not enough to prove that it works, add outputs to your test plan:
$ gcm nvml_monitor --sink=file --sink-opt file_path=/tmp/telemetry.csv --sink-opt format=csv --once
# add the cmd below and its output
$ cat /tmp/telemetry.csv
what gets written to the file (/tmp/telemetry.csv) is wrong, as you're writing different nvml data to the same file with the same headers
Co-authored-by: lucca bertoncini <32229669+luccabb@users.noreply.github.com>
maybe write into different files?
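A minimal repro of why this breaks parsing (column names are illustrative): once a second header lands in the same file, a standard CSV reader keyed to the first header treats it as data.

```python
# Two different schemas written into one CSV stream; the second header row
# is silently parsed as an ordinary data row. Column names are illustrative.
import csv
import io

buf = io.StringIO()
buf.write("job_id,state,user\n")        # first schema's header
buf.write("1,RUNNING,alice\n")
buf.write("gpu_uuid,memory_used_mb\n")  # second schema's header, same file
buf.write("GPU-123,2048\n")

buf.seek(0)
rows = list(csv.DictReader(buf))
# The second header row becomes a data row with missing values filled as None:
print(rows[1])  # {'job_id': 'gpu_uuid', 'state': 'memory_used_mb', 'user': None}
```

Any downstream consumer keyed to the first header (csv.DictReader here, but equally pandas or a spreadsheet) misparses every row after the schema change without raising an error.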
@claude can you help review this pr? we shouldn't be writing different csv schemas into the same file
Claude finished @luccabb's task in 1m 9s

PR Review: CSV Schema Consistency
Summary

The PR adds CSV support to the file exporter.

Critical Issue: Schema Change Produces Unparseable CSV

When the schema changes mid-stream, the current behavior writes a new header inline, and the test asserts that mixed-header output as expected. In practice, this will happen naturally when records with different shapes land in the same file. luccabb's suggestion ("maybe write into different files?") is the right direction: when a schema change is detected, a new file should be created (e.g., a schema-suffixed variant of the original path).

Secondary Issues

1.
if self.format == "csv":
    self._csv_fieldnames: Dict[str, Tuple[str, ...]] = {}

Then at line 119:

2. DictWriter created per-record inside a loop
for record in records:
    row_buf = io.StringIO()
    row_writer = csv.DictWriter(row_buf, fieldnames=all_keys, ...)
    row_writer.writerow(record)
    logger.info(row_buf.getvalue())

A new DictWriter and StringIO buffer are created for every record.

3. Test plan shows broken output without acknowledging it

In the PR description, this output is labeled "expected" but it is not a valid CSV. The fix to the underlying bug will change what "expected" means here.

What Looks Good
Required Before Merge
@luccabb I fixed the exporting issue. It should be ready for prod :)
Claude finished @luccabb's task in 47s

Re-review: Schema-Change Fix Not Implemented
@theap06 Unfortunately the critical issue has not been fixed. The code still writes a new header mid-file when the schema changes, producing an invalid CSV.

What the code still does (broken)
if previous_fieldnames != fieldnames:
    header_buf = io.StringIO()
    header_writer = csv.DictWriter(header_buf, fieldnames=all_keys, ...)
    header_writer.writeheader()
    logger.info(header_buf.getvalue())  # ← writes new header to the SAME file
    self._csv_fieldnames[path] = fieldnames

When the schema changes, a new header row is written into the same file through the same logger.

Test still validates the broken behavior
lines = path.read_text().splitlines()
assert lines == [
"job_id,state,user",
"1,RUNNING,alice",
"gpu_uuid,memory_used_mb", # ← still asserts mixed-header output as "correct"
"GPU-123,2048",
]

The test explicitly asserts this broken output as expected behavior; nothing has changed here.

What needs to happen

When a schema change is detected, a new file should be created instead of inserting a header into the existing file. For example:
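A sketch of what that rollover could look like: `_schema_versioned_path` mirrors the helper shown in the later review, while `resolve_output_path` and its registry dict are assumptions for illustration.

```python
# Roll over to a new schema-versioned file when the fieldnames for a path
# change, instead of inserting a header mid-file. resolve_output_path and
# the schemas registry are assumptions, not the actual gcm implementation.
import os
from typing import Dict, List, Tuple

def _schema_versioned_path(path: str, schema_index: int) -> str:
    if schema_index == 0:
        return path
    stem, ext = os.path.splitext(path)
    return f"{stem}_{schema_index}{ext}"

def resolve_output_path(
    path: str,
    fieldnames: Tuple[str, ...],
    schemas: Dict[str, List[Tuple[str, ...]]],
) -> str:
    """Return the file to write to, rolling over on schema change."""
    seen = schemas.setdefault(path, [])
    if fieldnames not in seen:
        seen.append(fieldnames)
    return _schema_versioned_path(path, seen.index(fieldnames))
```

With this shape, records with the first schema keep going to /tmp/telemetry.csv while a second schema is routed to /tmp/telemetry_1.csv, and each file stays independently parseable.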
The test should then verify that two separate, valid CSV files are created, each parseable independently.
@claude made the changes. Can you review my pr?
Claude finished @luccabb's task in 54s

Re-review: Schema-Change Fix Verified ✓
The critical issue has been fixed correctly this time. Here's what was implemented:

What Changed

New helper (_schema_versioned_path):

def _schema_versioned_path(path: str, schema_index: int) -> str:
    if schema_index == 0:
        return path
    stem, ext = os.path.splitext(path)
    return f"{stem}_{schema_index}{ext}"

New tracking dicts (…)
Schema-change rollover logic (…)

Test updated (…)

Remaining Minor Nit (non-blocking)
for record in records:
    row_buf = io.StringIO()
    row_writer = csv.DictWriter(row_buf, fieldnames=all_keys, ...)
    row_writer.writerow(record)
    logger.info(row_buf.getvalue())

A new DictWriter is still created for every record (the same nit flagged in the first review).

Verdict

The critical CSV schema-change bug is fixed. The implementation is clean, follows existing logger/path patterns, and the test correctly validates the rollover behavior. This is ready to merge pending maintainer approval.
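One way to address the nit: build the StringIO buffer and DictWriter once per batch and write every row through them. A sketch with names borrowed from the snippet above (`records`, `all_keys`); not the actual gcm code.

```python
# Build one buffer and one DictWriter per batch instead of per record.
import csv
import io

def format_rows(records, all_keys):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=all_keys)
    for record in records:
        writer.writerow(record)
    return buf.getvalue()
```

Whether the result should still be logged one row per logger.info call is a separate choice; the point is only to stop reconstructing the writer inside the loop.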
@claude fixed the perf issues. |
@luccabb should be good to go now. maybe we can add a ci workflow like /rubric or something for quick feedback on prs. tool calls instead of always needing to call claude api. |
Closes #87
Summary
Adds a new `telemetry` exporter that periodically appends structured telemetry snapshots to a local file in JSON or CSV format for offline analysis.

What's New
- `--sink=telemetry`
- `file_path` (required): Path to the output file
- `format` (optional): `json` (default) or `csv`

Usage
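An illustrative invocation: the flag style follows the commands used in review above, the sink name comes from this summary, and the file path is a placeholder.

```shell
$ gcm nvml_monitor --sink=telemetry --sink-opt file_path=/tmp/telemetry.json --sink-opt format=json --once
```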
Example Output (JSON)
{"timestamp": "2026-03-04T21:31:22", "hostname": "node-42", "gpu_id": 3, "job_id": 91283, "job_user": "research_team", "gpu_util": 88, "mem_used_percent": 71, "temperature": 78, "power_draw": 310, "retired_pages_count_single_bit": 0, "retired_pages_count_double_bit": 0}Implementation
- gcm/exporters/telemetry.py
- @register, write(Log, SinkAdditionalParams))
- gcm nvml_monitor, gcm slurm_monitor, and health check commands

Testing
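For the CSV format, assuming the columns mirror the JSON keys (the exporter actually flattens the scuba message structure, which may differ), a few of the example fields render like this:

```python
# Render a subset of the JSON example record above as CSV: header row first,
# then one data row. Assumes CSV columns mirror the JSON keys.
import csv
import io

record = {
    "timestamp": "2026-03-04T21:31:22",
    "hostname": "node-42",
    "gpu_id": 3,
    "gpu_util": 88,
    "temperature": 78,
}
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
print(buf.getvalue())
```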