Conversation
CI Commands

The following CI workflows run automatically on every push and pull request:
The following commands can be used by maintainers to trigger additional tests that require access to secrets:
@luccabb could I get some feedback?
@theap06 we already support json to files at the file exporter, maybe add CSV shape support there?
https://facebookresearch.github.io/gcm/docs/GCM_Monitoring/exporters/file/
…research#87) Add a new 'telemetry' sink that periodically appends telemetry snapshots to a local file in JSON or CSV format for offline analysis.
- JSON: NDJSON format (one object per line)
- CSV: Header on first write, append rows
- Options: file_path (required), format (json|csv, default json)
- Works with nvml_monitor, slurm_monitor, and health checks

Closes facebookresearch#87
Made-with: Cursor
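The sink behavior the commit describes (NDJSON append; CSV header on first write, rows appended after) can be sketched roughly as below. The function name and option handling are assumptions for illustration, not the actual gcm code:

```python
# Sketch of the described sink behavior; names are assumptions, not gcm's API.
import csv
import json
import os

def append_snapshot(file_path: str, record: dict, fmt: str = "json") -> None:
    if fmt == "json":
        # NDJSON: one JSON object per line, always appended.
        with open(file_path, "a") as f:
            f.write(json.dumps(record) + "\n")
    elif fmt == "csv":
        # Header only when the file is new or empty; rows appended afterwards.
        write_header = (
            not os.path.exists(file_path) or os.path.getsize(file_path) == 0
        )
        with open(file_path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(record))
            if write_header:
                writer.writeheader()
            writer.writerow(record)
    else:
        raise ValueError(f"unsupported format: {fmt}")
```

Note this sketch shares the limitation discussed later in the thread: repeated CSV appends assume every record has the same keys.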
Force-pushed from 94e1f2c to 4f00cf7.
added support for csv
Co-authored-by: lucca bertoncini <32229669+luccabb@users.noreply.github.com>
Force-pushed from 184a18e to 060fafe.
- Add format option (json|csv) to file exporter
- CSV output uses flattened scuba message structure
- First write adds header row, subsequent writes append
- Update docs for both Monitoring and Health Checks

Made-with: Cursor
Force-pushed from 060fafe to 961b5b7.
…p06/gcm into feat/telemetry-export-clean
@luccabb Could I get some feedback?
luccabb left a comment
execution cmd alone in your test plan is not enough to prove that it works, add outputs to your test plan:
$ gcm nvml_monitor --sink=file --sink-opt file_path=/tmp/telemetry.csv --sink-opt format=csv --once
# add the cmd below and its output
$ cat /tmp/telemetry.csv
what gets written to the file (/tmp/telemetry.csv) is wrong, as you're writing different nvml data to the same file with the same headers
Co-authored-by: lucca bertoncini <32229669+luccabb@users.noreply.github.com>
maybe write into different files?
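A minimal repro of why this breaks parsing (column names are illustrative): once a second header lands in the same file, a standard CSV reader keyed to the first header treats it as data.

```python
# Two different schemas written into one CSV stream; the second header row
# is silently parsed as an ordinary data row. Column names are illustrative.
import csv
import io

buf = io.StringIO()
buf.write("job_id,state,user\n")        # first schema's header
buf.write("1,RUNNING,alice\n")
buf.write("gpu_uuid,memory_used_mb\n")  # second schema's header, same file
buf.write("GPU-123,2048\n")

buf.seek(0)
rows = list(csv.DictReader(buf))
# The second header row becomes a data row with missing values filled as None:
print(rows[1])  # {'job_id': 'gpu_uuid', 'state': 'memory_used_mb', 'user': None}
```

Any downstream consumer keyed to the first header (csv.DictReader here, but equally pandas or a spreadsheet) misparses every row after the schema change without raising an error.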
@claude can you help review this pr? we shouldn't be writing different csv schemas into the same file
Claude finished @luccabb's task in 1m 9s

PR Review: CSV Schema Consistency
Summary

The PR adds CSV support to the file exporter.

Critical Issue: Schema Change Produces Unparseable CSV

When the schema changes mid-stream, the current behavior writes a new header inline, and the test asserts that mixed-header output as expected. In practice, this will happen naturally when records with different shapes land in the same file. luccabb's suggestion ("maybe write into different files?") is the right direction: when a schema change is detected, a new file should be created (e.g., a schema-suffixed variant of the original path).

Secondary Issues

1.
if self.format == "csv":
    self._csv_fieldnames: Dict[str, Tuple[str, ...]] = {}

Then at line 119:

2. DictWriter created per-record inside a loop
for record in records:
    row_buf = io.StringIO()
    row_writer = csv.DictWriter(row_buf, fieldnames=all_keys, ...)
    row_writer.writerow(record)
    logger.info(row_buf.getvalue())

A new DictWriter and StringIO buffer are created for every record.

3. Test plan shows broken output without acknowledging it

In the PR description, this output is labeled "expected" but it is not a valid CSV. The fix to the underlying bug will change what "expected" means here.

What Looks Good
Required Before Merge
@luccabb I fixed the exporting issue. It should be ready for prod :)
Claude finished @luccabb's task in 47s

Re-review: Schema-Change Fix Not Implemented
@theap06 Unfortunately the critical issue has not been fixed. The code still writes a new header mid-file when the schema changes, producing an invalid CSV.

What the code still does (broken)
if previous_fieldnames != fieldnames:
    header_buf = io.StringIO()
    header_writer = csv.DictWriter(header_buf, fieldnames=all_keys, ...)
    header_writer.writeheader()
    logger.info(header_buf.getvalue())  # ← writes new header to the SAME file
    self._csv_fieldnames[path] = fieldnames

When the schema changes, a new header row is written into the same file through the same logger.

Test still validates the broken behavior
lines = path.read_text().splitlines()
assert lines == [
"job_id,state,user",
"1,RUNNING,alice",
"gpu_uuid,memory_used_mb", # ← still asserts mixed-header output as "correct"
"GPU-123,2048",
]

The test explicitly asserts this broken output as expected behavior; nothing has changed here.

What needs to happen

When a schema change is detected, a new file should be created instead of inserting a header into the existing file. For example:
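A sketch of what that rollover could look like: `_schema_versioned_path` mirrors the helper shown in the later review, while `resolve_output_path` and its registry dict are assumptions for illustration.

```python
# Roll over to a new schema-versioned file when the fieldnames for a path
# change, instead of inserting a header mid-file. resolve_output_path and
# the schemas registry are assumptions, not the actual gcm implementation.
import os
from typing import Dict, List, Tuple

def _schema_versioned_path(path: str, schema_index: int) -> str:
    if schema_index == 0:
        return path
    stem, ext = os.path.splitext(path)
    return f"{stem}_{schema_index}{ext}"

def resolve_output_path(
    path: str,
    fieldnames: Tuple[str, ...],
    schemas: Dict[str, List[Tuple[str, ...]]],
) -> str:
    """Return the file to write to, rolling over on schema change."""
    seen = schemas.setdefault(path, [])
    if fieldnames not in seen:
        seen.append(fieldnames)
    return _schema_versioned_path(path, seen.index(fieldnames))
```

With this shape, records with the first schema keep going to /tmp/telemetry.csv while a second schema is routed to /tmp/telemetry_1.csv, and each file stays independently parseable.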
The test should then verify that two separate, valid CSV files are created, each parseable independently.
@claude made the changes. Can you review my pr?
Claude finished @luccabb's task in 54s

Re-review: Schema-Change Fix Verified ✓
The critical issue has been fixed correctly this time. Here's what was implemented:

What Changed

New helper (_schema_versioned_path):

def _schema_versioned_path(path: str, schema_index: int) -> str:
    if schema_index == 0:
        return path
    stem, ext = os.path.splitext(path)
    return f"{stem}_{schema_index}{ext}"

New tracking dicts (…)
Schema-change rollover logic (…)

Test updated (…)

Remaining Minor Nit (non-blocking)
for record in records:
    row_buf = io.StringIO()
    row_writer = csv.DictWriter(row_buf, fieldnames=all_keys, ...)
    row_writer.writerow(record)
    logger.info(row_buf.getvalue())

A new DictWriter is still created for every record (the same nit flagged in the first review).

Verdict

The critical CSV schema-change bug is fixed. The implementation is clean, follows existing logger/path patterns, and the test correctly validates the rollover behavior. This is ready to merge pending maintainer approval.
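One way to address the nit: build the StringIO buffer and DictWriter once per batch and write every row through them. A sketch with names borrowed from the snippet above (`records`, `all_keys`); not the actual gcm code.

```python
# Build one buffer and one DictWriter per batch instead of per record.
import csv
import io

def format_rows(records, all_keys):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=all_keys)
    for record in records:
        writer.writerow(record)
    return buf.getvalue()
```

Whether the result should still be logged one row per logger.info call is a separate choice; the point is only to stop reconstructing the writer inside the loop.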
@claude fixed the perf issues. |
@luccabb should be good to go now. maybe we can add a ci workflow like /rubric or something for quick feedback on prs. tool calls instead of always needing to call claude api. |
Closes #87
Summary
Adds a new `telemetry` exporter that periodically appends structured telemetry snapshots to a local file in JSON or CSV format for offline analysis.

What's New
- `--sink=telemetry`
- `file_path` (required): Path to the output file
- `format` (optional): `json` (default) or `csv`

Usage
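An illustrative invocation: the flag style follows the commands used in review above, the sink name comes from this summary, and the file path is a placeholder.

```shell
$ gcm nvml_monitor --sink=telemetry --sink-opt file_path=/tmp/telemetry.json --sink-opt format=json --once
```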
Example Output (JSON)
{"timestamp": "2026-03-04T21:31:22", "hostname": "node-42", "gpu_id": 3, "job_id": 91283, "job_user": "research_team", "gpu_util": 88, "mem_used_percent": 71, "temperature": 78, "power_draw": 310, "retired_pages_count_single_bit": 0, "retired_pages_count_double_bit": 0}Implementation
- gcm/exporters/telemetry.py
- @register, write(Log, SinkAdditionalParams))
- gcm nvml_monitor, gcm slurm_monitor, and health check commands

Testing
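For the CSV format, assuming the columns mirror the JSON keys (the exporter actually flattens the scuba message structure, which may differ), a few of the example fields render like this:

```python
# Render a subset of the JSON example record above as CSV: header row first,
# then one data row. Assumes CSV columns mirror the JSON keys.
import csv
import io

record = {
    "timestamp": "2026-03-04T21:31:22",
    "hostname": "node-42",
    "gpu_id": 3,
    "gpu_util": 88,
    "temperature": 78,
}
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
print(buf.getvalue())
```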