bug: Empty Content Deltas in OpenAI-Compatible Endpoint #5

@davidggphy

Description

Summary

OVAI's OpenAI-compatible streaming endpoint (/v1/chat/completions) returns empty content deltas when streaming is enabled. This issue affects all Gemini models regardless of thinking mode configuration.

Key Finding: This is a bug in OVAI's OpenAI compatibility layer. The Ollama-compatible endpoint (/api/chat) streams correctly with the same backend, confirming the issue is isolated to the /v1/ endpoint implementation.

Environment

  • OVAI Version: 0.20.0 (image: prantlf/ovai:latest)
  • Model: gemini-2.5-flash-lite (no reasoning) and gemini-2.5-flash
  • API: OpenAI-compatible streaming endpoint (/v1/chat/completions)
  • Docker Command:
docker run -dt -p 22434:22434 --name ovai \
  --add-host host.docker.internal:host-gateway \
  -e OLLAMA_ORIGIN=http://host.docker.internal:11434 \
  -v /path/to/google-account.json:/google-account.json \
  -v /path/to/model-defaults.json:/model-defaults.json \
  prantlf/ovai

Configuration

model-defaults.json:

{
  "apiLocation": "us-central1",
  "apiEndpoint": "us-central1-aiplatform.googleapis.com",
  "scope": "https://www.googleapis.com/auth/cloud-platform",
  "geminiDefaults": {
    "generationConfig": {
      "thinkingConfig": {
        "includeThoughts": true
      }
    },
    "safetySettings": [
      {
        "category": "HARM_CATEGORY_HATE_SPEECH",
        "threshold": "BLOCK_ONLY_HIGH"
      },
      {
        "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
        "threshold": "BLOCK_ONLY_HIGH"
      },
      {
        "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
        "threshold": "BLOCK_ONLY_HIGH"
      },
      {
        "category": "HARM_CATEGORY_HARASSMENT",
        "threshold": "BLOCK_ONLY_HIGH"
      }
    ]
  }
}

Issue Description

Expected Behavior

When streaming is enabled ("stream": true), each Server-Sent Event (SSE) chunk should contain incremental content in the delta.content field, allowing clients to progressively display the response as it's generated.
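As a reference point, extracting the incremental text from one OpenAI-style SSE data line takes only a few lines; this is a minimal sketch over the chunk format shown in the reproduction below (field names follow the OpenAI chat-completion chunk schema):

```python
import json

def delta_text(sse_line: str) -> str:
    """Extract delta.content from one OpenAI-style SSE data line."""
    payload = sse_line.removeprefix("data: ").strip()
    if not payload or payload == "[DONE]":
        return ""
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        return ""  # e.g. the trailing usage-only chunk
    return choices[0].get("delta", {}).get("content") or ""

# A well-formed chunk yields its incremental text:
line = 'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}'
print(delta_text(line))  # Hello
```

A client displaying the response progressively would call this on each line of the SSE stream and append the result.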

Actual Behavior

All streaming chunks from /v1/chat/completions contain empty content fields ("content":""), despite evidence that content generation is occurring:

  1. Container logs show data bytes being received from Gemini API
  2. Final usage chunk reports significant completion tokens (e.g., 2,302 tokens)
  3. Stream duration matches expected generation time (~23 seconds)
  4. Non-streaming mode ("stream": false) returns full content correctly
  5. Ollama endpoint (/api/chat) streams the same content progressively

This indicates the issue is in how OVAI's OpenAI compatibility layer transforms streaming responses, not in content generation itself.

Reproduction Steps

Test Case 1: OpenAI Endpoint Streaming (FAILS)

curl localhost:22434/v1/chat/completions -d '{
  "model": "gemini-2.5-flash",
  "messages": [
    {
      "role": "system",
      "content": "You are an expert on Dungeons and Dragons."
    },
    {
      "role": "user",
      "content": "What race is the best for a barbarian?"
    }
  ],
  "stream": true,
  "stream_options": {
    "include_usage": true
  }
}'

Result: 46 chunks received, all with empty content:

data: {"model":"gemini-2.5-flash","created":1759687856,"id":"2025-10-05T18:10:56Z","object":"chat.completion.chunk","system_fingerprint":"fp_gemini","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"model":"gemini-2.5-flash","created":1759687857,"id":"2025-10-05T18:10:57Z","object":"chat.completion.chunk","system_fingerprint":"fp_gemini","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

...

data: {"model":"gemini-2.5-flash","created":1759687879,"id":"2025-10-05T18:11:19Z","object":"chat.completion.chunk","system_fingerprint":"fp_gemini","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":"stop"}]}

data: {"model":"gemini-2.5-flash","created":1759687879,"id":"2025-10-05T18:11:19Z","object":"chat.completion.chunk","system_fingerprint":"fp_gemini","choices":[],"usage":{"completion_tokens":2302,"prompt_tokens":18,"total_tokens":2320}}

data: [DONE]

Test Case 2: OpenAI Endpoint Non-Streaming (WORKS)

curl localhost:22434/v1/chat/completions -d '{
  "model": "gemini-2.5-flash-lite",
  "messages": [
    {
      "role": "system",
      "content": "You are an expert on Dungeons and Dragons."
    },
    {
      "role": "user",
      "content": "What race is the best for a barbarian?"
    }
  ],
  "stream": false,
  "stream_options": {
    "include_usage": false
  },
  "reasoning_effort": "medium",
  "max_completion_tokens": 8192,
  "temperature": 1,
  "top_p": 0.95,
  "thinking_budget": null
}'

Result: Complete response with full content (2,017 tokens) returned successfully.

Container Logs Analysis

During the streaming request, the container logs show data being received from Gemini:

2025/10/05 18:10:54.822393 request POST /v1/chat/completions
2025/10/05 18:10:54.822487 > ask with 2 messages using gemini-2.5-flash
2025/10/05 18:10:56.109145 < 1160 bytes
2025/10/05 18:10:57.803591 < 1295 bytes
2025/10/05 18:10:59.702463 < 1361 bytes
...
2025/10/05 18:11:19.103053 < 1165 bytes
2025/10/05 18:11:19.353739 < 1370 bytes
2025/10/05 18:11:19.354352 respond 200: POST /v1/chat/completions

This confirms that:

  • OVAI is receiving data chunks from Gemini API
  • The response is being processed over ~23 seconds
  • The HTTP request completes successfully (200 status)
  • However, the content is not being propagated to the streaming response chunks

Diagnostic Evidence

Cross-Endpoint Comparison

Comprehensive testing across all OVAI endpoints reveals the issue is isolated to OpenAI streaming:

| Endpoint | Mode | Status | Evidence |
| --- | --- | --- | --- |
| /v1/chat/completions | Streaming | FAILS | Empty delta.content in all chunks |
| /v1/chat/completions | Non-streaming | WORKS | Full content returned correctly |
| /api/chat (Ollama) | Streaming | WORKS | Progressive content chunks delivered |
| /api/chat (Ollama) | Non-streaming | WORKS | Full content returned correctly |

Detailed Test Results

OpenAI Non-Streaming (Working)

curl http://localhost:22434/v1/chat/completions -d '{
  "model": "gemini-2.5-flash-lite",
  "messages": [{"role": "user", "content": "Write a haiku about programming"}],
  "stream": false
}'

Response (success):

{
  "model": "gemini-2.5-flash-lite",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Lines of logic flow,\nBuilding worlds with careful thought,\nCode runs, problems solved."
    },
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 8, "completion_tokens": 19, "total_tokens": 27}
}

OpenAI Streaming (Broken)

curl -N http://localhost:22434/v1/chat/completions -d '{
  "model": "gemini-2.5-flash-lite",
  "messages": [{"role": "user", "content": "Write a haiku about programming"}],
  "stream": true
}'

Response (all chunks have empty content):

data: {"choices":[{"index":0,"delta":{"content":""},"finish_reason":null}]}
data: {"choices":[{"index":0,"delta":{"content":""},"finish_reason":null}]}
data: {"choices":[{"index":0,"delta":{"content":""},"finish_reason":"stop"}]}
data: {"choices":[],"usage":{"completion_tokens":19,"prompt_tokens":8,"total_tokens":27}}
data: [DONE]
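The failure is easy to quantify from a captured stream. A small sketch that tallies empty versus non-empty deltas over lines like the ones above (the sample lines here are abbreviated copies of the captured chunks):

```python
import json

def tally_deltas(sse_lines):
    """Count streaming choices whose delta.content is empty vs. non-empty."""
    empty = nonempty = 0
    for line in sse_lines:
        payload = line.removeprefix("data: ").strip()
        if not payload or payload == "[DONE]":
            continue
        for choice in json.loads(payload).get("choices") or []:
            if choice.get("delta", {}).get("content"):
                nonempty += 1
            else:
                empty += 1
    return empty, nonempty

captured = [
    'data: {"choices":[{"index":0,"delta":{"content":""},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":""},"finish_reason":"stop"}]}',
    'data: [DONE]',
]
print(tally_deltas(captured))  # (2, 0) -- every delta is empty, as in the bug
```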

Ollama API Streaming (Working)

curl -N http://localhost:22434/api/chat -d '{
  "model": "gemini-2.5-flash-lite",
  "messages": [{"role": "user", "content": "Write a haiku about programming"}],
  "stream": true
}'

Response (progressive content chunks):

{"message":{"role":"assistant","content":"Lines"}}
{"message":{"role":"assistant","content":" of text appear,\nLogic flows"}}
{"message":{"role":"assistant","content":" through each line,\nBringing life to dreams."}}
{"message":{"role":"assistant","content":""},"done":true,"done_reason":"stop"}

Total tokens: 21 completion tokens in ~350ms
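For comparison, reassembling the full message from the Ollama NDJSON stream above is trivial, which is what makes the working endpoint a useful baseline. A sketch over the captured lines:

```python
import json

def assemble(ndjson_lines):
    """Concatenate message.content across Ollama streaming chunks."""
    return "".join(json.loads(line)["message"]["content"] for line in ndjson_lines)

chunks = [
    '{"message":{"role":"assistant","content":"Lines"}}',
    '{"message":{"role":"assistant","content":" of text appear,\\nLogic flows"}}',
    '{"message":{"role":"assistant","content":" through each line,\\nBringing life to dreams."}}',
    '{"message":{"role":"assistant","content":""},"done":true,"done_reason":"stop"}',
]
print(assemble(chunks))
```

Running the same accumulation over the /v1/chat/completions chunks yields an empty string, even though both endpoints share the same Gemini backend.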

Root Cause Analysis

The Gemini API is streaming content correctly to OVAI (confirmed by container logs and Ollama endpoint success). The bug is in OVAI's OpenAI compatibility layer, which fails to populate delta.content fields when transforming Gemini's streaming response to OpenAI's Server-Sent Events format.

Key Evidence:

  • Same backend generates content successfully (Ollama endpoint proves this)
  • Non-streaming OpenAI endpoint works (transformation logic exists for complete responses)
  • Container logs show data flow from Gemini API
  • Only OpenAI streaming format fails

Likely cause: The OpenAI endpoint's SSE chunk serialization is not extracting content from Gemini's streaming response format.
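OVAI's internals are not shown here, but the expected transformation can be sketched. Vertex AI's streamGenerateContent chunks nest text under candidates[0].content.parts[*].text; a hypothetical fix would copy that text into delta.content instead of emitting an empty string (this sketch assumes the Vertex response shape and is not OVAI's actual code; the "thought" flag handling reflects the includeThoughts setting in the configuration above):

```python
import json

def gemini_chunk_to_delta(gemini_chunk: dict) -> dict:
    """Hypothetical mapping: Gemini streaming chunk -> OpenAI delta chunk.
    Field paths follow the Vertex AI streamGenerateContent format;
    'thought' parts (emitted when includeThoughts is set) are skipped."""
    parts = (gemini_chunk.get("candidates") or [{}])[0] \
        .get("content", {}).get("parts", [])
    text = "".join(p.get("text", "") for p in parts if not p.get("thought"))
    return {"choices": [{"index": 0,
                         "delta": {"role": "assistant", "content": text},
                         "finish_reason": None}]}

sample = {"candidates": [{"content": {"role": "model",
                                      "parts": [{"text": "Half-Orcs excel"}]}}]}
print(json.dumps(gemini_chunk_to_delta(sample)))
```

If OVAI's SSE serializer reads the wrong path (or drops part text while building the chunk), every delta would come out empty exactly as observed, while the non-streaming path, which aggregates the full response, would still work.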
