OpenAI/LLM integration: capture time_to_first_token_ms from streaming responses #31
Summary
The GenerationOutputMeta dataclass has a time_to_first_token_ms field, but the OpenAI integration never populates it. The integration times the full round-trip (t0 to response) but has no way to measure TTFT from a completed non-streaming response.
Current behavior
build_output_meta in integrations/openai.py constructs GenerationOutputMeta with tokens, cached tokens, reasoning tokens, tps, and stop reason — but time_to_first_token_ms is always None.
Streaming calls (stream=True) are detected but skipped entirely — no inference event is recorded.
Expected behavior
When stream=True, the integration should:
- Wrap the returned iterator
- Record the elapsed time when the first chunk arrives as time_to_first_token_ms
- Accumulate token counts from chunks
- Record the inference event when the stream is exhausted
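The steps above could be sketched as a generator that passes chunks through untouched while recording timing. This is a minimal illustration, not the SDK's implementation: GenerationOutputMeta is reduced to the two relevant fields, and the wrap_stream helper, the on_done callback, and per-chunk token counting are all assumptions for the sketch.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class GenerationOutputMeta:
    # Reduced stand-in for the SDK dataclass; only the fields this sketch needs.
    tokens_out: int = 0
    time_to_first_token_ms: Optional[float] = None


def wrap_stream(
    chunks: Iterable,
    t0: float,
    on_done: Callable[[GenerationOutputMeta], None],
) -> Iterator:
    """Yield chunks unchanged; record TTFT at the first chunk and
    emit the inference event once the stream is exhausted."""
    meta = GenerationOutputMeta()
    for chunk in chunks:
        if meta.time_to_first_token_ms is None:
            # Elapsed time from request start to first streamed chunk.
            meta.time_to_first_token_ms = (time.monotonic() - t0) * 1000
        # Placeholder: real code would read token counts from the chunk.
        meta.tokens_out += 1
        yield chunk
    # Stream exhausted: hand the accumulated metadata to the recorder.
    on_done(meta)


events = []
wrapped = wrap_stream(["Hello", ", ", "world"], time.monotonic(), events.append)
for _ in wrapped:
    pass
# events now holds one GenerationOutputMeta with a populated
# time_to_first_token_ms and the accumulated token count.
```

Because the wrapper is itself an iterator, the caller's `for chunk in response:` loop is unchanged, and metadata is only recorded if the stream is actually consumed to completion.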
Notes
- The field and serialization path already exist — this is purely a capture gap in the integration
- Non-streaming calls cannot produce a meaningful TTFT (the whole response arrives at once), so this only applies to streamed requests
- The same gap exists in the Transformers integration, but that's harder to address without a streaming hook
Workaround
Users can pass TTFT manually today:
output_meta = GenerationOutputMeta(
    tokens_out=n,
    time_to_first_token_ms=ttft_ms,
)
wildedge.track(model, duration_ms=total_ms, output_meta=output_meta)
Environment
- SDK version: 0.1.1