
Feat/telemetry export clean#89

Open
theap06 wants to merge 16 commits into facebookresearch:main from theap06:feat/telemetry-export-clean

Conversation

theap06 (Contributor) commented Mar 4, 2026

Closes #87

Summary

Adds a new telemetry exporter that periodically appends structured telemetry snapshots to a local file in JSON or CSV format for offline analysis.

What's New

  • New sink: --sink=telemetry
  • Output formats: JSON (NDJSON) or CSV
  • Options:
    • file_path (required): Path to the output file
    • format (optional): json (default) or csv

Usage

# NDJSON
gcm nvml_monitor --sink=telemetry --sink-opt file_path=/tmp/telemetry-a.ndjson --once

# CSV
gcm nvml_monitor --sink=telemetry --sink-opt file_path=/tmp/telemetry-b.csv --sink-opt format=csv --once
cat /tmp/telemetry-b.csv

CSV Append

gcm nvml_monitor --sink=telemetry --sink-opt file_path=/tmp/telemetry-b.csv --sink-opt format=csv --once
cat /tmp/telemetry-b.csv

CSV Output

Output 1:
hostname,gpu_id,gpu_util,mem_used_percent
...,0,...,...
...,1,...,...

Output 2: 
hostname,gpu_id,gpu_util,mem_used_percent
...,0,...,...
...,1,...,...
...,0,...,...
...,1,...,...

Example Output (JSON)

JSON Commands
gcm nvml_monitor --sink=telemetry --sink-opt file_path=/tmp/telemetry-a.ndjson --once
cat /tmp/telemetry-a.ndjson

gcm nvml_monitor --sink=telemetry --sink-opt file_path=/tmp/telemetry-c.ndjson --once
cat /tmp/telemetry-c.ndjson

Output 1:
{"hostname":"...","gpu_id":0,"gpu_util":...}
{"hostname":"...","gpu_id":1,"gpu_util":...}

Output 2: 
{"hostname":"...","gpu_id":0,"gpu_util":...}
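For offline analysis, the NDJSON output can be read back line by line. A minimal reader sketch (the field names follow the sample output above; `load_ndjson` and the path are hypothetical):

```python
# Minimal NDJSON reader sketch: one JSON object per line, blank lines skipped.
import json

def load_ndjson(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# e.g. rows = load_ndjson("/tmp/telemetry-a.ndjson")
```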

Implementation

  • ~60 lines of code in gcm/exporters/telemetry.py
  • Follows existing exporter conventions (@register, write(Log, SinkAdditionalParams))
  • Works with gcm nvml_monitor, gcm slurm_monitor, and health check commands
  • Auto-creates parent directories if needed
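To make the conventions above concrete, here is a hypothetical sketch of the sink's shape. The real `@register` decorator and the `Log`/`SinkAdditionalParams` types come from gcm and are stubbed here; everything beyond the options listed above is illustrative only.

```python
# Illustrative sketch only: gcm's @register decorator is stubbed, and
# records are plain dicts instead of gcm's Log/SinkAdditionalParams types.
import json
import os

def register(cls):  # stand-in for gcm's exporter registry decorator
    return cls

@register
class TelemetrySink:
    def __init__(self, file_path: str, format: str = "json"):
        if format not in ("json", "csv"):
            raise ValueError(f"unsupported format: {format}")
        self.file_path = file_path
        self.format = format
        # Auto-create parent directories if needed.
        parent = os.path.dirname(file_path)
        if parent:
            os.makedirs(parent, exist_ok=True)

    def write(self, records):
        # NDJSON path: append one JSON object per line.
        with open(self.file_path, "a") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")
```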

Testing

pytest gcm/tests/test_telemetry_exporter.py -v

============================= test session starts ==============================
collected 10 items
...
============================== 10 passed in <time> =============================


$ gcm nvml_monitor --sink=telemetry --sink-opt file_path=/tmp/telemetry.csv --sink-opt format=csv --once
$ cat /tmp/telemetry.csv

Expected structure when schema changes mid-stream:
normal.job_id,normal.state,normal.user
1,RUNNING,alice
normal.gpu_uuid,normal.memory_used_mb
GPU-123,2048

github-actions bot commented Mar 4, 2026

CI Commands

The following CI workflows run automatically on every push and pull request:

| Workflow | What it runs |
| --- | --- |
| GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds |
| Go packages CI | shelper tests, format, lint |

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

| Command | Description | Requires approval? |
| --- | --- | --- |
| /metaci tests | Runs Meta internal integration tests (pytest) | Yes — a maintainer must trigger the command and approve the deployment request |
| /metaci integration tests | Same as above (alias) | Yes |

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

theap06 (Contributor, Author) commented Mar 8, 2026

@luccabb could I get some feedback?

luccabb (Member) left a comment

@theap06 we already support json to files at the file exporter, maybe add CSV shape support there?

https://facebookresearch.github.io/gcm/docs/GCM_Monitoring/exporters/file/

theap06 added 4 commits March 7, 2026 23:27
…research#87)

Add a new 'telemetry' sink that periodically appends telemetry snapshots
to a local file in JSON or CSV format for offline analysis.

- JSON: NDJSON format (one object per line)
- CSV: Header on first write, append rows
- Options: file_path (required), format (json|csv, default json)
- Works with nvml_monitor, slurm_monitor, and health checks

Closes facebookresearch#87

Made-with: Cursor
@theap06 theap06 force-pushed the feat/telemetry-export-clean branch from 94e1f2c to 4f00cf7 Compare March 8, 2026 07:28
theap06 (Contributor, Author) commented Mar 8, 2026

@theap06 we already support json to files at the file exporter, maybe add CSV shape support there?

https://facebookresearch.github.io/gcm/docs/GCM_Monitoring/exporters/file/

added support for csv

@theap06 theap06 requested a review from luccabb March 8, 2026 07:29
theap06 and others added 2 commits March 8, 2026 16:59
Co-authored-by: lucca bertoncini <32229669+luccabb@users.noreply.github.com>
@theap06 theap06 requested a review from luccabb March 9, 2026 00:03
@theap06 theap06 force-pushed the feat/telemetry-export-clean branch from 184a18e to 060fafe Compare March 9, 2026 09:03
@theap06 theap06 requested a review from A-Kokolis as a code owner March 9, 2026 09:03
- Add format option (json|csv) to file exporter
- CSV output uses flattened scuba message structure
- First write adds header row, subsequent writes append
- Update docs for both Monitoring and Health Checks

Made-with: Cursor
@theap06 theap06 force-pushed the feat/telemetry-export-clean branch from 060fafe to 961b5b7 Compare March 9, 2026 09:43
@theap06 theap06 requested a review from luccabb March 10, 2026 09:02
theap06 (Contributor, Author) commented Mar 10, 2026

@luccabb Could I get some feedback?

luccabb (Member) left a comment

The execution cmd alone in your test plan is not enough to prove that it works; add outputs to your test plan:

$ gcm nvml_monitor --sink=file --sink-opt file_path=/tmp/telemetry.csv --sink-opt format=csv --once
# add the cmd below and its output
$ cat /tmp/telemetry.csv 

what gets written to the file (/tmp/telemetry.csv) is wrong. as you're writing different nvml data to the same file with the same headers

theap06 and others added 2 commits March 11, 2026 12:02
Co-authored-by: lucca bertoncini <32229669+luccabb@users.noreply.github.com>
@theap06 theap06 requested a review from luccabb March 11, 2026 21:32
luccabb (Member) commented Mar 11, 2026

@theap06

Expected structure when schema changes mid-stream:
normal.job_id,normal.state,normal.user
1,RUNNING,alice
normal.gpu_uuid,normal.memory_used_mb
GPU-123,2048

maybe write into different files?

luccabb (Member) commented Mar 11, 2026

@claude can you help review this pr? we shouldn't be writing different csv schemas into the same file

claude bot commented Mar 11, 2026

Claude finished @luccabb's task in 1m 9s


PR Review: CSV Schema Consistency

  • Read CLAUDE.md and repo context
  • Read gcm/exporters/file.py
  • Read gcm/tests/test_file_exporter.py
  • Analyze CSV schema issue
  • Post review feedback

Summary

The PR adds CSV support to the file exporter. The implementation is generally clean and follows existing patterns, but has a critical correctness issue: when the CSV schema changes between writes, the current code writes a new header row mid-file, producing an invalid CSV that no standard parser can read.


Critical Issue: Schema Change Produces Unparseable CSV

gcm/exporters/file.py lines 119–131 and gcm/tests/test_file_exporter.py lines 45–67

When the schema changes mid-stream, the current behavior writes a new header inline:

normal.job_id,normal.state,normal.user
1,RUNNING,alice
normal.gpu_uuid,normal.memory_used_mb   ← second header, breaks CSV structure
GPU-123,2048

The test test_file_exporter_csv_rewrites_header_on_schema_change explicitly asserts this broken output as the expected behavior — meaning the test is validating invalid behavior rather than guarding against it.

In practice, this will happen naturally with gcm nvml_monitor because multiple GPU payloads with different fields (e.g. NVML node data vs. job data) may be routed to the same file.

luccabb's suggestion ("maybe write into different files?") is the right direction. When a schema change is detected, a new file should be created (e.g., telemetry.csv, telemetry_1.csv, telemetry_2.csv). This approach preserves each segment as a valid standalone CSV.
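A quick, hypothetical demonstration of why the mixed-header file is unusable: a standard parser does not even raise an error, it silently consumes the second header as if it were a telemetry row.

```python
# Hypothetical demonstration: csv.DictReader treats the mid-file header
# as ordinary data, so the "rows" it yields are silently wrong.
import csv
import io

mixed = """normal.job_id,normal.state,normal.user
1,RUNNING,alice
normal.gpu_uuid,normal.memory_used_mb
GPU-123,2048
"""

rows = list(csv.DictReader(io.StringIO(mixed)))
# The second header line comes back as if it were telemetry data:
print(rows[1]["normal.job_id"])  # normal.gpu_uuid
```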



Secondary Issues

1. _csv_fieldnames initialized conditionally but accessed unconditionally

gcm/exporters/file.py lines 57–58:

if self.format == "csv":
    self._csv_fieldnames: Dict[str, Tuple[str, ...]] = {}

Then at line 119: self._csv_fieldnames.get(path) — this will raise AttributeError if somehow format is mutated after __init__. This is defensive, but _csv_fieldnames should just be initialized unconditionally as {} (or guarded by the if self.format == "csv" branch in write(), which it already is — so this is minor).

2. DictWriter created per-record inside a loop

gcm/exporters/file.py lines 133–142:

for record in records:
    row_buf = io.StringIO()
    row_writer = csv.DictWriter(row_buf, fieldnames=all_keys, ...)
    row_writer.writerow(record)
    logger.info(row_buf.getvalue())

A new StringIO and DictWriter is created for every single record. The DictWriter can be created once outside the loop and reused by resetting/recreating the buffer only. Minor performance issue, but worth cleaning up.
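One possible shape for that cleanup, sketched with hypothetical names (`emit_rows` and the `emit` callback stand in for the surrounding code's `logger.info`; `records` and `all_keys` follow the snippet above):

```python
# Sketch: create the DictWriter once and reuse a single StringIO buffer,
# truncating it between rows instead of reallocating per record.
import csv
import io

def emit_rows(records, all_keys, emit):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=all_keys, extrasaction="ignore")
    for record in records:
        writer.writerow(record)
        emit(buf.getvalue().rstrip("\r\n"))
        # Reset the shared buffer for the next row.
        buf.seek(0)
        buf.truncate(0)
```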

3. Test plan shows broken output without acknowledging it

In the PR description:

Expected structure when schema changes mid-stream:
normal.job_id,normal.state,normal.user
1,RUNNING,alice
normal.gpu_uuid,normal.memory_used_mb
GPU-123,2048

This output is labeled "expected" but it is not a valid CSV. The fix to the underlying bug will change what "expected" means here.


What Looks Good

  • The _flatten_for_csv helper is clean and correctly reuses asdict_recursive(to_scuba_message(...)) — consistent with the JSON path.
  • The else: raise ValueError(...) for unsupported formats was added as requested.
  • _data_identifier_to_path tracking is a clean way to support multiple output files.
  • The format parameter defaulting to "json" preserves backwards compatibility.
  • The format == 'json' branch correctly follows the existing logger pattern.
  • Tests cover the happy path well.

Required Before Merge

  1. Fix the schema-change behavior — write to a new file rather than inserting a second header row into the same file
  2. Update test_file_exporter_csv_rewrites_header_on_schema_change — it currently asserts the broken output; it should verify that two separate, valid CSV files are produced

theap06 (Contributor, Author) commented Mar 12, 2026

@luccabb I fixed the exporting issue. It should be ready for prod :)

luccabb (Member) commented Mar 12, 2026

@claude

claude bot commented Mar 12, 2026

Claude finished @luccabb's task in 47s


Re-review: Schema-Change Fix Not Implemented

  • Read CLAUDE.md
  • Read gcm/exporters/file.py
  • Read gcm/tests/test_file_exporter.py
  • Verify schema-change fix
  • Post review feedback

@theap06 Unfortunately the critical issue has not been fixed. The code still writes a new header mid-file when the schema changes, producing an invalid CSV.

What the code still does (broken)

gcm/exporters/file.py lines 119–131:

if previous_fieldnames != fieldnames:
    header_buf = io.StringIO()
    header_writer = csv.DictWriter(header_buf, fieldnames=all_keys, ...)
    header_writer.writeheader()
    logger.info(header_buf.getvalue())   # ← writes new header to the SAME file
    self._csv_fieldnames[path] = fieldnames

When the schema changes, a new header row is written into the same file via the same logger. This produces the same invalid output as before:

job_id,state,user
1,RUNNING,alice
gpu_uuid,memory_used_mb    ← second header inserted mid-file
GPU-123,2048

Test still validates the broken behavior

gcm/tests/test_file_exporter.py lines 62–67:

lines = path.read_text().splitlines()
assert lines == [
    "job_id,state,user",
    "1,RUNNING,alice",
    "gpu_uuid,memory_used_mb",   # ← still asserts mixed-header output as "correct"
    "GPU-123,2048",
]

The test explicitly asserts this broken output as expected behavior — nothing has changed here.

What needs to happen

When a schema change is detected, a new file should be created instead of inserting a header into the existing file. For example:

  • First schema → telemetry.csv
  • Second schema → telemetry_1.csv
  • Third schema → telemetry_2.csv

The test should then verify that two separate, valid CSV files are created — each parseable independently.


theap06 (Contributor, Author) commented Mar 12, 2026

@claude made the changes. Can you review my PR?

luccabb (Member) commented Mar 12, 2026

@claude

claude bot commented Mar 12, 2026

Claude finished @luccabb's task in 54s


Re-review: Schema-Change Fix Verified ✓

  • Read gcm/exporters/file.py
  • Read gcm/tests/test_file_exporter.py
  • Verify the schema-change fix
  • Post review feedback

The critical issue has been fixed correctly this time. Here's what was implemented:

What Changed

New helper (file.py:29-33):

def _schema_versioned_path(path: str, schema_index: int) -> str:
    if schema_index == 0:
        return path
    stem, ext = os.path.splitext(path)
    return f"{stem}_{schema_index}{ext}"

New tracking dicts (file.py:64-67):

  • _data_identifier_to_base_path — remembers the original path for generating versioned names
  • _csv_schema_index — tracks the current rollover counter per DataIdentifier

Schema-change rollover logic (file.py:139-158): When the schema changes, a new versioned file is created (e.g., data.csv → data_1.csv), a new logger is initialized pointing to it, and the path/logger maps are updated. The subsequent if previous_fieldnames != fieldnames check then correctly writes the header to the new file.

Test updated (test_file_exporter_csv_rolls_over_on_schema_change, lines 45-72): Verifies two separate, valid CSV files are produced — data.csv with the first schema and data_1.csv with the second schema. Each file is independently parseable.


Remaining Minor Nit (non-blocking)

DictWriter created per-record inside the loop (file.py:173-182):

for record in records:
    row_buf = io.StringIO()
    row_writer = csv.DictWriter(row_buf, fieldnames=all_keys, ...)
    row_writer.writerow(record)
    logger.info(row_buf.getvalue())

A new StringIO and DictWriter is allocated for every single record. The DictWriter could be created once outside the loop and only the StringIO reset per record. Minor performance issue, low priority.


Verdict

The critical CSV schema-change bug is fixed. The implementation is clean, follows existing logger/path patterns, and the test correctly validates the rollover behavior. This is ready to merge pending maintainer approval.

theap06 (Contributor, Author) commented Mar 12, 2026

@claude fixed the perf issues.

theap06 (Contributor, Author) commented Mar 12, 2026

@luccabb should be good to go now. Maybe we could add a CI workflow like /rubric for quick feedback on PRs: tool calls instead of always needing to call the Claude API.

luccabb (Member) left a comment

@theap06 keep the test plan updated as you make changes

@theap06 theap06 requested a review from luccabb March 12, 2026 20:02
theap06 (Contributor, Author) commented Mar 12, 2026

@luccabb done

luccabb (Member) left a comment

test plan is wrong. add the execution cmd (cat filename) and output from that cmd for the following lines.

you should likely be passing more than 1 filename, given that you'd want data to be written to different files. please test this thoroughly and ensure test plans are fully complete before sending back for review

@theap06 theap06 requested a review from luccabb March 13, 2026 06:07
theap06 (Contributor, Author) commented Mar 13, 2026

test plan is wrong. add the execution cmd (cat filename) and output from that cmd for the following lines.

you should likely be passing more than 1 filename, given that you'd want data to be written to different files. please test this thoroughly and ensure test plans are fully complete before sending back for review

Sorry about that! I just updated the test plan.

Development

Successfully merging this pull request may close these issues.

🚀[Feature Request]: Structured Telemetry Export (CSV/JSON) for Offline Analysis
