bug: Empty Content Deltas in OpenAI-Compatible Endpoint

## Summary

OVAI's OpenAI-compatible endpoint (`/v1/chat/completions`) returns empty content deltas when streaming is enabled. The issue reproduces with both Gemini models tested (`gemini-2.5-flash` and `gemini-2.5-flash-lite`), regardless of thinking mode configuration.

**Key Finding**: This is a bug in OVAI's OpenAI compatibility layer. The Ollama-compatible endpoint (`/api/chat`) streams correctly with the same backend, confirming the issue is isolated to the `/v1/` endpoint implementation.

## Environment

- **OVAI Version**: 0.20.0 (image: `prantlf/ovai:latest`)
- **Models**: `gemini-2.5-flash-lite` (no reasoning) and `gemini-2.5-flash`
- **API**: OpenAI-compatible streaming endpoint (`/v1/chat/completions`)
- **Docker Command**:
```bash
docker run -dt -p 22434:22434 --name ovai \
  --add-host host.docker.internal:host-gateway \
  -e OLLAMA_ORIGIN=http://host.docker.internal:11434 \
  -v /path/to/google-account.json:/google-account.json \
  -v /path/to/model-defaults.json:/model-defaults.json \
  prantlf/ovai
```

## Configuration

**model-defaults.json**:
```json
{
  "apiLocation": "us-central1",
  "apiEndpoint": "us-central1-aiplatform.googleapis.com",
  "scope": "https://www.googleapis.com/auth/cloud-platform",
  "geminiDefaults": {
    "generationConfig": {
      "thinkingConfig": {
        "includeThoughts": true
      }
    },
    "safetySettings": [
      {
        "category": "HARM_CATEGORY_HATE_SPEECH",
        "threshold": "BLOCK_ONLY_HIGH"
      },
      {
        "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
        "threshold": "BLOCK_ONLY_HIGH"
      },
      {
        "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
        "threshold": "BLOCK_ONLY_HIGH"
      },
      {
        "category": "HARM_CATEGORY_HARASSMENT",
        "threshold": "BLOCK_ONLY_HIGH"
      }
    ]
  }
}
```

## Issue Description

### Expected Behavior

When streaming is enabled (`"stream": true`), each Server-Sent Event (SSE) chunk should contain incremental content in the `delta.content` field, allowing clients to progressively display the response as it's generated.
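
As a minimal sketch of what a client relies on here, the Python snippet below joins the `delta.content` fragments of an OpenAI-style SSE stream. The function name `accumulate_deltas` and the sample chunks are hypothetical; the sample illustrates a healthy stream and is not output captured from OVAI.

```python
import json


def accumulate_deltas(sse_lines):
    """Join the delta.content fragments of an OpenAI-style SSE stream."""
    fragments = []
    for line in sse_lines:
        line = line.strip()
        # Only "data: ..." lines carry payloads; skip blank separator lines.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        # A usage-only chunk has an empty "choices" list and adds nothing.
        for choice in chunk.get("choices", []):
            fragments.append(choice.get("delta", {}).get("content") or "")
    return "".join(fragments)


# Hypothetical healthy stream: each chunk carries a piece of the answer.
healthy = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"Half-"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"orcs"},"finish_reason":"stop"}]}',
    'data: [DONE]',
]
print(accumulate_deltas(healthy))  # prints "Half-orcs"
```

With the stream described in this report, the same function would return an empty string, because every delta carries `"content":""`.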

### Actual Behavior

All streaming chunks from `/v1/chat/completions` contain empty content fields (`"content":""`), despite evidence that content generation is occurring:

1. Container logs show data bytes being received from Gemini API
2. Final usage chunk reports significant completion tokens (e.g., 2,302 tokens)
3. Stream duration matches expected generation time (~23 seconds)
4. Non-streaming mode (`"stream": false`) returns full content correctly
5. Ollama endpoint (`/api/chat`) streams the same content progressively

This indicates the issue is in how OVAI's OpenAI compatibility layer transforms streaming responses, not in content generation itself.

## Reproduction Steps

### Test Case 1: OpenAI Endpoint Streaming (FAILS)

```bash
curl localhost:22434/v1/chat/completions -d '{
  "model": "gemini-2.5-flash",
  "messages": [
    {
      "role": "system",
      "content": "You are an expert on Dungeons and Dragons."
    },
    {
      "role": "user",
      "content": "What race is the best for a barbarian?"
    }
  ],
  "stream": true,
  "stream_options": {
    "include_usage": true
  }
}'
```

**Result**: 46 chunks received, all with empty content:
```text
data: {"model":"gemini-2.5-flash","created":1759687856,"id":"2025-10-05T18:10:56Z","object":"chat.completion.chunk","system_fingerprint":"fp_gemini","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"model":"gemini-2.5-flash","created":1759687857,"id":"2025-10-05T18:10:57Z","object":"chat.completion.chunk","system_fingerprint":"fp_gemini","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

...

data: {"model":"gemini-2.5-flash","created":1759687879,"id":"2025-10-05T18:11:19Z","object":"chat.completion.chunk","system_fingerprint":"fp_gemini","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":"stop"}]}

data: {"model":"gemini-2.5-flash","created":1759687879,"id":"2025-10-05T18:11:19Z","object":"chat.completion.chunk","system_fingerprint":"fp_gemini","choices":[],"usage":{"completion_tokens":2302,"prompt_tokens":18,"total_tokens":2320}}

data: [DONE]
```
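
The symptom can be checked mechanically on a captured stream. The sketch below is a hypothetical helper (`summarize_stream` is not part of OVAI) that counts how many choice deltas are empty and reads the reported completion tokens. The sample lines are taken from the output above, trimmed to the fields that matter.

```python
import json


def summarize_stream(sse_lines):
    """Count choice chunks, empty deltas, and reported completion tokens."""
    summary = {"chunks": 0, "empty": 0, "completion_tokens": None}
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            summary["chunks"] += 1
            # Treat a missing, null, or "" content field as empty.
            if not choice.get("delta", {}).get("content"):
                summary["empty"] += 1
        usage = chunk.get("usage")
        if usage:
            summary["completion_tokens"] = usage.get("completion_tokens")
    return summary


# Lines from the captured output above, trimmed to the relevant fields.
captured = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":"stop"}]}',
    'data: {"choices":[],"usage":{"completion_tokens":2302,"prompt_tokens":18,"total_tokens":2320}}',
    'data: [DONE]',
]
summary = summarize_stream(captured)
print(summary)  # {'chunks': 3, 'empty': 3, 'completion_tokens': 2302}

# The bug signature: tokens were generated, yet every delta is empty.
is_broken = summary["empty"] == summary["chunks"] and (summary["completion_tokens"] or 0) > 0
```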

### Test Case 2: OpenAI Endpoint Non-Streaming (WORKS)

```bash
curl localhost:22434/v1/chat/completions -d '{
  "model": "gemini-2.5-flash-lite",
  "messages": [
    {
      "role": "system",
      "content": "You are an expert on Dungeons and Dragons."
    },
    {
      "role": "user",
      "content": "What race is the best for a barbarian?"
    }
  ],
  "stream": false,
  "stream_options": {
    "include_usage": false
  },
  "reasoning_effort": "medium",
  "max_completion_tokens": 8192,
  "temperature": 1,
  "top_p": 0.95,
  "thinking_budget": null
}'
```

**Result**: Complete response with full content (2,017 tokens) returned successfully.

## Container Logs Analysis

During the streaming request, the container logs show data being received from Gemini:

```
2025/10/05 18:10:54.822393 request POST /v1/chat/completions
2025/10/05 18:10:54.822487 > ask with 2 messages using gemini-2.5-flash
2025/10/05 18:10:56.109145 < 1160 bytes
2025/10/05 18:10:57.803591 < 1295 bytes
2025/10/05 18:10:59.702463 < 1361 bytes
...
2025/10/05 18:11:19.103053 < 1165 bytes
2025/10/05 18:11:19.353739 < 1370 bytes
2025/10/05 18:11:19.354352 respond 200: POST /v1/chat/completions
```

This confirms that:
- OVAI is receiving data chunks from Gemini API
- The response is being processed over ~23 seconds
- The HTTP request completes successfully (200 status)
- However, the content is not being propagated to the streaming response chunks
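
To put numbers on the log excerpt, the following sketch sums the `< N bytes` lines and measures the time between the first and last one. It assumes the log line format shown above; `upstream_stats` is a hypothetical helper name. Because the excerpt is elided (`...`), the totals cover only the lines shown, not the whole stream.

```python
import re
from datetime import datetime

# Matches lines such as "2025/10/05 18:10:56.109145 < 1160 bytes".
LOG_LINE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) < (\d+) bytes$"
)


def upstream_stats(log_lines):
    """Return (chunk count, total bytes, seconds from first to last chunk)."""
    stamps, total = [], 0
    for line in log_lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # request/ask/respond lines and the "..." marker
        stamps.append(datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S.%f"))
        total += int(match.group(2))
    span = (stamps[-1] - stamps[0]).total_seconds() if stamps else 0.0
    return len(stamps), total, span


excerpt = """\
2025/10/05 18:10:54.822393 request POST /v1/chat/completions
2025/10/05 18:10:54.822487 > ask with 2 messages using gemini-2.5-flash
2025/10/05 18:10:56.109145 < 1160 bytes
2025/10/05 18:10:57.803591 < 1295 bytes
2025/10/05 18:10:59.702463 < 1361 bytes
...
2025/10/05 18:11:19.103053 < 1165 bytes
2025/10/05 18:11:19.353739 < 1370 bytes
2025/10/05 18:11:19.354352 respond 200: POST /v1/chat/completions
""".splitlines()

count, total, span = upstream_stats(excerpt)
print(f"{count} chunks, {total} bytes, {span:.1f} s")  # 5 chunks, 6351 bytes, 23.2 s
```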

## Diagnostic Evidence

### Cross-Endpoint Comparison

Testing both chat endpoints in streaming and non-streaming modes shows that the issue is isolated to OpenAI streaming:

| Endpoint | Mode | Status | Evidence |
|----------|------|--------|----------|
| `/v1/chat/completions` | Streaming | ❌ **FAILS** | Empty `delta.content` in all chunks |
| `/v1/chat/completions` | Non-streaming | ✅ **WORKS** | Full content returned correctly |
| `/api/chat` (Ollama) | Streaming | ✅ **WORKS** | Progressive content chunks delivered |
| `/api/chat` (Ollama) | Non-streaming | ✅ **WORKS** | Full content returned correctly |

### Detailed Test Results

#### OpenAI Non-Streaming (Working)

```bash
curl http://localhost:22434/v1/chat/completions -d '{
  "model": "gemini-2.5-flash-lite",
  "messages": [{"role": "user", "content": "Write a haiku about programming"}],
  "stream": false
}'
```

**Response** (success):
```json
{
  "model": "gemini-2.5-flash-lite",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Lines of logic flow,\nBuilding worlds with careful thought,\nCode runs, problems solved."
    },
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 8, "completion_tokens": 19, "total_tokens": 27}
}
```

#### OpenAI Streaming (Broken)

```bash
curl -N http://localhost:22434/v1/chat/completions -d '{
  "model": "gemini-2.5-flash-lite",
  "messages": [{"role": "user", "content": "Write a haiku about programming"}],
  "stream": true
}'
```

**Response** (all chunks have empty content):
```text
data: {"choices":[{"index":0,"delta":{"content":""},"finish_reason":null}]}
data: {"choices":[{"index":0,"delta":{"content":""},"finish_reason":null}]}
data: {"choices":[{"index":0,"delta":{"content":""},"finish_reason":"stop"}]}
data: {"choices":[],"usage":{"completion_tokens":19,"prompt_tokens":8,"total_tokens":27}}
data: [DONE]
```

#### Ollama API Streaming (Working)

```bash
curl -N http://localhost:22434/api/chat -d '{
  "model": "gemini-2.5-flash-lite",
  "messages": [{"role": "user", "content": "Write a haiku about programming"}],
  "stream": true
}'
```

**Response** (progressive content chunks):
```json
{"message":{"role":"assistant","content":"Lines"}}
{"message":{"role":"assistant","content":" of text appear,\nLogic flows"}}
{"message":{"role":"assistant","content":" through each line,\nBringing life to dreams."}}
{"message":{"role":"assistant","content":""},"done":true,"done_reason":"stop"}
```

**Completion tokens**: 21, generated in ~350 ms
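
For comparison with the OpenAI endpoint, the Ollama stream above can be reassembled with a few lines of Python. This is a sketch that assumes one JSON object per line, as in the response shown; `join_ollama_stream` is a hypothetical helper name.

```python
import json


def join_ollama_stream(ndjson_lines):
    """Concatenate message.content from an Ollama-style NDJSON stream."""
    fragments = []
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        fragments.append(obj.get("message", {}).get("content", ""))
        if obj.get("done"):
            break  # the final object marks the end of the stream
    return "".join(fragments)


# The four lines from the response above (raw strings keep the JSON \n escapes).
ollama_lines = [
    r'{"message":{"role":"assistant","content":"Lines"}}',
    r'{"message":{"role":"assistant","content":" of text appear,\nLogic flows"}}',
    r'{"message":{"role":"assistant","content":" through each line,\nBringing life to dreams."}}',
    r'{"message":{"role":"assistant","content":""},"done":true,"done_reason":"stop"}',
]
print(join_ollama_stream(ollama_lines))
```

The result is the complete haiku, which is what the OpenAI streaming endpoint should also yield once its deltas are joined.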

## Root Cause Analysis

The Gemini API is streaming content correctly to OVAI (confirmed by container logs and Ollama endpoint success). The bug is in OVAI's OpenAI compatibility layer, which fails to populate `delta.content` fields when transforming Gemini's streaming response to OpenAI's Server-Sent Events format.

**Key Evidence**:
- Same backend generates content successfully (Ollama endpoint proves this)
- Non-streaming OpenAI endpoint works (transformation logic exists for complete responses)
- Container logs show data flow from Gemini API
- Only OpenAI streaming format fails

**Likely cause**: The OpenAI endpoint's SSE chunk serialization is not extracting content from Gemini's streaming response format.
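
To illustrate the transformation that appears to be failing, here is a rough Python sketch of mapping one Gemini streaming chunk to an OpenAI-style SSE line. It is not OVAI's code and has not been checked against its implementation. The input shape follows Gemini's `candidates[].content.parts[].text` response schema, the sample chunk is hypothetical, and skipping parts flagged `"thought": true` is an assumption about how thinking output should be handled.

```python
import json

# Simplified mapping; other Gemini finish reasons are passed through as None here.
FINISH_REASONS = {"STOP": "stop", "MAX_TOKENS": "length"}


def gemini_chunk_to_openai_sse(chunk, model):
    """Map one Gemini streaming chunk to an OpenAI-style SSE data line."""
    candidate = (chunk.get("candidates") or [{}])[0]
    parts = candidate.get("content", {}).get("parts", [])
    # Assumption: thought parts are not user-visible content, so skip them.
    text = "".join(p.get("text", "") for p in parts if not p.get("thought"))
    body = {
        "model": model,
        "object": "chat.completion.chunk",
        "choices": [{
            "index": 0,
            # The step this report suggests is missing: copy the text into delta.content.
            "delta": {"role": "assistant", "content": text},
            "finish_reason": FINISH_REASONS.get(candidate.get("finishReason")),
        }],
    }
    return "data: " + json.dumps(body, separators=(",", ":")) + "\n\n"


# Hypothetical Gemini chunk, for illustration only.
sample = {"candidates": [{"content": {"role": "model",
                                      "parts": [{"text": "Lines of logic flow,"}]}}]}
sse_line = gemini_chunk_to_openai_sse(sample, "gemini-2.5-flash-lite")
print(sse_line, end="")
```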
