Summary
OVAI's OpenAI-compatible streaming endpoint (/v1/chat/completions) returns empty content deltas when streaming is enabled. The issue was reproduced with both Gemini models tested (gemini-2.5-flash and gemini-2.5-flash-lite), regardless of thinking mode configuration.
Key Finding: This is a bug in OVAI's OpenAI compatibility layer. The Ollama-compatible endpoint (/api/chat) streams correctly with the same backend, confirming the issue is isolated to the /v1/ endpoint implementation.
Environment
- OVAI Version: 0.20.0 (image: prantlf/ovai:latest)
- Model: gemini-2.5-flash-lite (no reasoning) and gemini-2.5-flash
- API: OpenAI-compatible streaming endpoint (/v1/chat/completions)
- Docker Command:
docker run -dt -p 22434:22434 --name ovai \
--add-host host.docker.internal:host-gateway \
-e OLLAMA_ORIGIN=http://host.docker.internal:11434 \
-v /path/to/google-account.json:/google-account.json \
-v /path/to/model-defaults.json:/model-defaults.json \
prantlf/ovai
Configuration
model-defaults.json:
{
"apiLocation": "us-central1",
"apiEndpoint": "us-central1-aiplatform.googleapis.com",
"scope": "https://www.googleapis.com/auth/cloud-platform",
"geminiDefaults": {
"generationConfig": {
"thinkingConfig": {
"includeThoughts": true
}
},
"safetySettings": [
{
"category": "HARM_CATEGORY_HATE_SPEECH",
"threshold": "BLOCK_ONLY_HIGH"
},
{
"category": "HARM_CATEGORY_DANGEROUS_CONTENT",
"threshold": "BLOCK_ONLY_HIGH"
},
{
"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
"threshold": "BLOCK_ONLY_HIGH"
},
{
"category": "HARM_CATEGORY_HARASSMENT",
"threshold": "BLOCK_ONLY_HIGH"
}
]
}
}
Issue Description
Expected Behavior
When streaming is enabled ("stream": true), each Server-Sent Event (SSE) chunk should contain incremental content in the delta.content field, allowing clients to progressively display the response as it's generated.
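To make the expected behavior concrete, here is a minimal client-side sketch in Python that reads OpenAI-style SSE lines and yields each non-empty delta.content piece as it arrives. The function name `iter_content_deltas` and the two sample chunks are hypothetical illustrations of what a healthy stream would carry; they are not output captured from OVAI.

```python
import json
from typing import Iterable, Iterator


def iter_content_deltas(sse_lines: Iterable[str]) -> Iterator[str]:
    """Yield each non-empty delta.content piece from OpenAI-style SSE lines."""
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data field
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                yield piece


# Hypothetical chunks: what a correctly populated stream could look like.
healthy_stream = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"Lines"},"finish_reason":null}]}',
    '',
    'data: {"choices":[{"index":0,"delta":{"content":" of logic flow"},"finish_reason":"stop"}]}',
    '',
    'data: [DONE]',
]

for piece in iter_content_deltas(healthy_stream):
    print(piece, end="", flush=True)  # progressive display
print()
```

A client written this way prints nothing at all when every chunk carries "content":"", which is the behavior reported below.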
Actual Behavior
All streaming chunks from /v1/chat/completions contain empty content fields ("content":""), despite evidence that content generation is occurring:
- Container logs show data bytes being received from Gemini API
- Final usage chunk reports significant completion tokens (e.g., 2,302 tokens)
- Stream duration matches expected generation time (~23 seconds)
- Non-streaming mode ("stream": false) returns full content correctly
- Ollama endpoint (/api/chat) streams the same content progressively
This indicates the issue is in how OVAI's OpenAI compatibility layer transforms streaming responses, not in content generation itself.
Reproduction Steps
Test Case 1: OpenAI Endpoint Streaming (FAILS)
curl localhost:22434/v1/chat/completions -d '{
"model": "gemini-2.5-flash",
"messages": [
{
"role": "system",
"content": "You are an expert on Dungeons and Dragons."
},
{
"role": "user",
"content": "What race is the best for a barbarian?"
}
],
"stream": true,
"stream_options": {
"include_usage": true
}
}'
Result: 46 chunks received, all with empty content:
data: {"model":"gemini-2.5-flash","created":1759687856,"id":"2025-10-05T18:10:56Z","object":"chat.completion.chunk","system_fingerprint":"fp_gemini","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"model":"gemini-2.5-flash","created":1759687857,"id":"2025-10-05T18:10:57Z","object":"chat.completion.chunk","system_fingerprint":"fp_gemini","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
...
data: {"model":"gemini-2.5-flash","created":1759687879,"id":"2025-10-05T18:11:19Z","object":"chat.completion.chunk","system_fingerprint":"fp_gemini","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":"stop"}]}
data: {"model":"gemini-2.5-flash","created":1759687879,"id":"2025-10-05T18:11:19Z","object":"chat.completion.chunk","system_fingerprint":"fp_gemini","choices":[],"usage":{"completion_tokens":2302,"prompt_tokens":18,"total_tokens":2320}}
data: [DONE]
Test Case 2: OpenAI Endpoint Non-Streaming (WORKS)
curl localhost:22434/v1/chat/completions -d '{
"model": "gemini-2.5-flash-lite",
"messages": [
{
"role": "system",
"content": "You are an expert on Dungeons and Dragons."
},
{
"role": "user",
"content": "What race is the best for a barbarian?"
}
],
"stream": false,
"stream_options": {
"include_usage": false
},
"reasoning_effort": "medium",
"max_completion_tokens": 8192,
"temperature": 1,
"top_p": 0.95,
"thinking_budget": null
}'
Result: Complete response with full content (2,017 tokens) returned successfully.
Container Logs Analysis
During the streaming request, the container logs show data being received from Gemini:
2025/10/05 18:10:54.822393 request POST /v1/chat/completions
2025/10/05 18:10:54.822487 > ask with 2 messages using gemini-2.5-flash
2025/10/05 18:10:56.109145 < 1160 bytes
2025/10/05 18:10:57.803591 < 1295 bytes
2025/10/05 18:10:59.702463 < 1361 bytes
...
2025/10/05 18:11:19.103053 < 1165 bytes
2025/10/05 18:11:19.353739 < 1370 bytes
2025/10/05 18:11:19.354352 respond 200: POST /v1/chat/completions
This confirms that:
- OVAI is receiving data chunks from Gemini API
- The response is being processed over ~23 seconds
- The HTTP request completes successfully (200 status)
- However, the content is not being propagated to the streaming response chunks
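The figures read off the log can be checked mechanically. The sketch below parses the log excerpt shown above and reports how many upstream chunks were logged, their total size, and the time between the first and last chunk. The helper name `summarize_log` and the two regular expressions are assumptions based only on the line format visible in this excerpt; the elided `...` line is skipped, so the totals cover only the five chunk lines quoted here.

```python
import re
from datetime import datetime

# Assumed line format, inferred from the excerpt: "<date> <time.micros> <message>"
LOG_LINE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) (.*)$")
BYTES_MSG = re.compile(r"^< (\d+) bytes$")


def summarize_log(lines):
    """Count upstream chunks, sum their sizes, and time the first-to-last chunk span."""
    first_ts = last_ts = None
    chunks = 0
    total_bytes = 0
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # e.g. the elided "..." line
        received = BYTES_MSG.match(match.group(2))
        if not received:
            continue  # request/response lines carry no byte count
        timestamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S.%f")
        chunks += 1
        total_bytes += int(received.group(1))
        if first_ts is None:
            first_ts = timestamp
        last_ts = timestamp
    seconds = (last_ts - first_ts).total_seconds() if chunks else 0.0
    return {"upstream_chunks": chunks, "upstream_bytes": total_bytes, "stream_seconds": seconds}


log_excerpt = """\
2025/10/05 18:10:54.822393 request POST /v1/chat/completions
2025/10/05 18:10:54.822487 > ask with 2 messages using gemini-2.5-flash
2025/10/05 18:10:56.109145 < 1160 bytes
2025/10/05 18:10:57.803591 < 1295 bytes
2025/10/05 18:10:59.702463 < 1361 bytes
...
2025/10/05 18:11:19.103053 < 1165 bytes
2025/10/05 18:11:19.353739 < 1370 bytes
2025/10/05 18:11:19.354352 respond 200: POST /v1/chat/completions
""".splitlines()

summary = summarize_log(log_excerpt)
print(summary)
```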
Diagnostic Evidence
Cross-Endpoint Comparison
Testing both chat endpoints (/v1/chat/completions and /api/chat) in streaming and non-streaming modes shows that the issue is isolated to OpenAI streaming:
| Endpoint | Mode | Status | Evidence |
|---|---|---|---|
| /v1/chat/completions | Streaming | ❌ FAILS | Empty delta.content in all chunks |
| /v1/chat/completions | Non-streaming | ✅ WORKS | Full content returned correctly |
| /api/chat (Ollama) | Streaming | ✅ WORKS | Progressive content chunks delivered |
| /api/chat (Ollama) | Non-streaming | ✅ WORKS | Full content returned correctly |
Detailed Test Results
OpenAI Non-Streaming (Working)
curl http://localhost:22434/v1/chat/completions -d '{
"model": "gemini-2.5-flash-lite",
"messages": [{"role": "user", "content": "Write a haiku about programming"}],
"stream": false
}'
Response (success):
{
"model": "gemini-2.5-flash-lite",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": "Lines of logic flow,\nBuilding worlds with careful thought,\nCode runs, problems solved."
},
"finish_reason": "stop"
}],
"usage": {"prompt_tokens": 8, "completion_tokens": 19, "total_tokens": 27}
}
OpenAI Streaming (Broken)
curl -N http://localhost:22434/v1/chat/completions -d '{
"model": "gemini-2.5-flash-lite",
"messages": [{"role": "user", "content": "Write a haiku about programming"}],
"stream": true
}'
Response (all chunks have empty content):
data: {"choices":[{"index":0,"delta":{"content":""},"finish_reason":null}]}
data: {"choices":[{"index":0,"delta":{"content":""},"finish_reason":null}]}
data: {"choices":[{"index":0,"delta":{"content":""},"finish_reason":"stop"}]}
data: {"choices":[],"usage":{"completion_tokens":19,"prompt_tokens":8,"total_tokens":27}}
data: [DONE]
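The mismatch in this output (19 completion tokens reported, zero characters delivered) can be detected automatically, which may be useful as a regression check. The following Python sketch runs over the captured stream above; the function name `diagnose_stream` and the result keys are hypothetical and not part of OVAI or any client library.

```python
import json


def diagnose_stream(sse_text: str) -> dict:
    """Compare delivered content length against the usage reported by the stream."""
    delta_chunks = 0
    content_chars = 0
    completion_tokens = None
    for line in sse_text.splitlines():
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta_chunks += 1
            content_chars += len(choice.get("delta", {}).get("content") or "")
        usage = chunk.get("usage")
        if usage:
            completion_tokens = usage.get("completion_tokens")
    return {
        "delta_chunks": delta_chunks,
        "content_chars": content_chars,
        "completion_tokens": completion_tokens,
        # Tokens were billed but no text reached the client.
        "empty_content_bug": bool(completion_tokens) and content_chars == 0,
    }


captured = """\
data: {"choices":[{"index":0,"delta":{"content":""},"finish_reason":null}]}
data: {"choices":[{"index":0,"delta":{"content":""},"finish_reason":null}]}
data: {"choices":[{"index":0,"delta":{"content":""},"finish_reason":"stop"}]}
data: {"choices":[],"usage":{"completion_tokens":19,"prompt_tokens":8,"total_tokens":27}}
data: [DONE]
"""

report = diagnose_stream(captured)
print(report)
```

The usage chunk is only present when the server includes it, so without usage data the check falls back to reporting `content_chars` alone and leaves `empty_content_bug` false.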
Ollama API Streaming (Working)
curl -N http://localhost:22434/api/chat -d '{
"model": "gemini-2.5-flash-lite",
"messages": [{"role": "user", "content": "Write a haiku about programming"}],
"stream": true
}'
Response (progressive content chunks):
{"message":{"role":"assistant","content":"Lines"}}
{"message":{"role":"assistant","content":" of text appear,\nLogic flows"}}
{"message":{"role":"assistant","content":" through each line,\nBringing life to dreams."}}
{"message":{"role":"assistant","content":""},"done":true,"done_reason":"stop"}
Total tokens: 21 completion tokens in ~350ms
Root Cause Analysis
The Gemini API is streaming content correctly to OVAI (confirmed by container logs and Ollama endpoint success). The bug is in OVAI's OpenAI compatibility layer, which fails to populate delta.content fields when transforming Gemini's streaming response to OpenAI's Server-Sent Events format.
Key Evidence:
- Same backend generates content successfully (Ollama endpoint proves this)
- Non-streaming OpenAI endpoint works (transformation logic exists for complete responses)
- Container logs show data flow from Gemini API
- Only OpenAI streaming format fails
Likely cause: The OpenAI endpoint's SSE chunk serialization is not extracting content from Gemini's streaming response format.
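For reference, the transformation that appears to be missing can be sketched as follows. This is an illustration in Python, not OVAI's actual code. It assumes the Gemini streaming chunk shape `candidates[0].content.parts[].text`, with thought parts flagged by `"thought": true` when includeThoughts is enabled. Excluding thought parts from delta.content is a design assumption made here, and the function name `gemini_chunk_to_openai` is hypothetical.

```python
# Gemini finishReason values mapped to OpenAI finish_reason values (partial list).
FINISH_REASONS = {"STOP": "stop", "MAX_TOKENS": "length", "SAFETY": "content_filter"}


def gemini_chunk_to_openai(chunk: dict, model: str) -> dict:
    """Build an OpenAI chat.completion.chunk from one Gemini streaming chunk."""
    candidates = chunk.get("candidates") or [{}]
    candidate = candidates[0]
    parts = candidate.get("content", {}).get("parts", [])
    # Assumption: only answer text goes into delta.content; thought parts are skipped.
    text = "".join(part.get("text", "") for part in parts if not part.get("thought"))
    return {
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{
            "index": 0,
            "delta": {"role": "assistant", "content": text},
            # Unknown or absent finishReason maps to None (stream still running).
            "finish_reason": FINISH_REASONS.get(candidate.get("finishReason")),
        }],
    }


# Hypothetical Gemini chunk containing one thought part and two answer parts.
gemini_chunk = {
    "candidates": [{
        "content": {
            "role": "model",
            "parts": [
                {"text": "Planning the haiku.", "thought": True},
                {"text": "Lines of "},
                {"text": "logic flow"},
            ],
        },
        "finishReason": "STOP",
    }]
}

openai_chunk = gemini_chunk_to_openai(gemini_chunk, "gemini-2.5-flash-lite")
print(openai_chunk["choices"][0]["delta"]["content"])
```

If the streaming path reads a different field than the non-streaming path does, or only inspects a part that happens to be empty, the result would match the empty deltas observed here. That remains a hypothesis until checked against the OVAI source.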