Copilot AI commented Dec 17, 2025

OpenTelemetry Tracing Model Implementation Plan

  • Create documentation structure for OpenTelemetry tracing
  • Document OpenTelemetry tracing architecture for GPU jobs
  • Provide example trace instrumentation code for GPU job lifecycle
  • Create configuration examples for trace collection in ACM/MCO
  • Document trace storage and visualization setup
  • Provide examples of linking traces with GPU metrics
  • Create troubleshooting and best practices guide
  • Add example trace queries and dashboards
Original prompt

This section details the original issue you should resolve

<issue_title>Add optional OpenTelemetry tracing model</issue_title>
<issue_description>## Opening Summary

This issue is a follow-up to issue #1336.

One main result there was that we learned how to create our own metrics and integrate them into the normal ACM/MCO observability pipeline.
This is important for our GPU and student workloads, because the default metrics are not enough for what we need.

But we also saw that metrics alone cannot always show the full lifecycle of a GPU job.
So here we continue with the idea of adding traces as an additional layer.

For GPU jobs we could record spans like:

  • queue
  • gpu_allocate
  • compute
  • cleanup

Each span can have timestamps and attributes (job ID, user/student, GPU type, node, etc.).
This gives us:

  • better debugging at the per-job level
  • a clear execution timeline
  • the possibility to link metrics and traces (exemplars)
  • more context than metrics can provide (in detail and granularity)
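
The span model described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the OpenTelemetry SDK: the `Span` class, the `record_span` and `run_gpu_job` helpers, and the attribute keys and values (`job.id`, `user`, `gpu.type`, `node`) are hypothetical names chosen for this example. In a real setup, a tracer from an OpenTelemetry SDK would create and export the spans instead.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for a tracing span (not the OpenTelemetry API)."""
    name: str
    trace_id: str
    attributes: dict
    start_ns: int = 0
    end_ns: int = 0

    @property
    def duration_ns(self) -> int:
        return self.end_ns - self.start_ns


@contextmanager
def record_span(name, trace_id, attributes, sink):
    """Time the wrapped block and append the finished span to `sink`."""
    span = Span(name=name, trace_id=trace_id, attributes=dict(attributes))
    span.start_ns = time.monotonic_ns()
    try:
        yield span
    finally:
        # Close the span even if the phase raised, so failures stay visible.
        span.end_ns = time.monotonic_ns()
        sink.append(span)


def run_gpu_job(job_attributes):
    """Emit one span per lifecycle phase, all sharing a single trace ID."""
    trace_id = uuid.uuid4().hex
    spans = []
    for phase in ("queue", "gpu_allocate", "compute", "cleanup"):
        with record_span(phase, trace_id, job_attributes, spans):
            time.sleep(0.001)  # placeholder for the real work of this phase
    return spans


# Illustrative attribute values only.
spans = run_gpu_job({"job.id": "job-42", "user": "student-a",
                     "gpu.type": "example-gpu", "node": "node-1"})
for s in spans:
    print(f"{s.name:<12} {s.duration_ns / 1e6:.2f} ms  trace={s.trace_id[:8]}")
```

Because all four spans share one trace ID and carry the same job attributes, they can be ordered into a per-job timeline and filtered by job, user, GPU type, or node.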

The goal of this issue is to check how we can create, collect, store and visualize traces in our ACM setup, and how this works together with the custom GPU metrics work from issue #1336.
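
To make the metric-to-trace link concrete, here is a minimal sketch of the exemplar idea: each metric observation optionally keeps the trace ID of the job that produced it, so an outlier sample can be resolved back to its trace. Everything named here is an assumption for illustration (the `Observation` class, the `slowest_exemplar` helper, the metric name `gpu_job_duration_seconds`, and the trace IDs); it does not use or describe the ACM/MCO or OpenTelemetry APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """One metric sample; `trace_id` plays the role of an exemplar."""
    metric: str
    value: float
    trace_id: Optional[str] = None


def slowest_exemplar(observations, metric):
    """Return the largest sample of `metric` that carries a trace ID, or None."""
    candidates = [o for o in observations
                  if o.metric == metric and o.trace_id is not None]
    if not candidates:
        return None
    return max(candidates, key=lambda o: o.value)


# Hypothetical samples: job durations, some with an attached trace ID.
observations = [
    Observation("gpu_job_duration_seconds", 12.0, "trace-a"),
    Observation("gpu_job_duration_seconds", 95.5, "trace-b"),
    Observation("gpu_job_duration_seconds", 140.0),  # no exemplar recorded
]

# Hypothetical trace store: trace ID -> ordered span names.
traces = {
    "trace-a": ["queue", "gpu_allocate", "compute", "cleanup"],
    "trace-b": ["queue", "gpu_allocate", "compute", "cleanup"],
}

worst = slowest_exemplar(observations, "gpu_job_duration_seconds")
if worst is not None:
    print(worst.value, worst.trace_id, traces[worst.trace_id])
```

Note that the 140.0 sample is skipped because it has no trace ID attached; only samples with an exemplar can be followed from a metric to a trace.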

Tasks

  • See what trace data we can already get
  • Combine traces with our own GPU metrics
  • Make a small PoC/demo if possible

This issue will track all progress for the traces topic.</issue_description>

Comments on the Issue (you are @copilot in this section)


@schwesig
Member

sorry, this was opened by accident when I wanted to add Copilot to the reviewers.
My bad.
Closing this issue.
