Abnormal Output from phi-3.5 Model Fine-tuned with Refocus when Deployed with vLLM

### **Problem Description**

Hi, we are seeing significant output problems when deploying a `phi-3.5` model fine-tuned with the refocus method (`Trained_Model/lr1e-06_ep2_bb1_edit0_aug0`), whether it is served with vLLM or run with pt. The model consistently fails to generate normal, coherent content: the output is often truncated, consists of malformed JSON, or contains repetitive, incomplete data structures.

In contrast, the base, untuned `phi-3.5` model functions perfectly in the exact same vLLM deployment environment, producing complete and correct responses. This suggests the issue may stem from an incompatibility or interaction between the fine-tuned model and the vLLM framework.

### **vLLM Backend Prompt and Parameters**

The prompt received by the vLLM backend is structured as follows:

```python
prompt: '<|user|>\n<|image_1|>\n{Does the bird in the picture have striped wing and white underparts?}\nThought: The areas to focus on in the image have bounding box coordinates:<|end|>\n<|assistant|>\n',
params: SamplingParams(n=1,
                       presence_penalty=0.0,
                       frequency_penalty=0.0,
                       repetition_penalty=1.0,
                       temperature=0.7,
                       top_p=0.98,
                       top_k=-1,
                       min_p=0.0,
                       seed=None,
                       stop=[],
                       stop_token_ids=[],
                       bad_words=[],
                       include_stop_str_in_output=False,
                       ignore_eos=False,
                       max_tokens=5000,
                       min_tokens=0,
                       logprobs=None,
                       prompt_logprobs=None,
                       skip_special_tokens=True,
                       spaces_between_special_tokens=True,
                       truncate_prompt_tokens=None,
                       guided_decoding=None,
                       extra_args=None),
prompt_token_ids: None,
lora_request: None,
prompt_adapter_request: None
```
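For anyone trying to reproduce this, the prompt string in the log can be rebuilt from a question with a small helper. This is only a minimal sketch: the function name `build_refocus_prompt` is hypothetical, and the template is inferred from the single logged prompt above, so it may not match the training-time template exactly.

```python
# Hypothetical helper: rebuilds the prompt seen in the vLLM log above.
# The template is inferred from one logged example, not from the training code.
THOUGHT_SUFFIX = (
    "Thought: The areas to focus on in the image have bounding box coordinates:"
)


def build_refocus_prompt(question: str, image_index: int = 1) -> str:
    """Return a phi-3.5 style chat prompt ending with the refocus 'Thought:' line."""
    return (
        "<|user|>\n"
        f"<|image_{image_index}|>\n"
        f"{{{question}}}\n"  # the logged prompt wraps the question in literal braces
        f"{THOUGHT_SUFFIX}<|end|>\n"
        "<|assistant|>\n"
    )


if __name__ == "__main__":
    # Print the repr so the newlines are visible and comparable with the log.
    print(repr(build_refocus_prompt(
        "Does the bird in the picture have striped wing and white underparts?"
    )))
```

Comparing the printed string with the `prompt:` field in the backend log is a quick way to rule out a template mismatch between requests.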

### **pt Environment**


```
GPU: A800 40G
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.43.0
accelerate==0.30.0
```
 

### **Abnormal Behavior (Fine-tuned Model)**

We have observed several types of malformed outputs:

**Example 1: Incomplete JSON Output**
The model output cuts off abruptly.

```json
{
  "conversation": [
    {
      "role": "user",
      "content": "<image>\n{What's the least popular game in the chart?}\nThought: The areas to focus on in the image have bounding box coordinates:"
    },
    {
      "role": "assistant",
      "content": " [{'content' ..."
    }
  ]
}
```

**Example 2: Repetitive, Unstructured Output**
The model outputs repeating lists of coordinates instead of a coherent sentence.

```json
{
  "conversation": [
    {
      "role": "user",
      "content": "<image>\n{Is the color of 15% segment light green?}\nThought: The areas to focus on in the image have bounding box coordinates:"
    },
    {
      "role": "assistant",
      "content": " [{'x1': 468, 'y1': 459, 'x2': 685, 'y2': 525}],\n['x1': 468, 'y1': 459, 'x2': 685, 'y2': 525}],\n['x1': 468, 'y1': 459, 'x2': 685, 'y2': 525]"
    }
  ]
}
```

**Example 3: Mixed Structured Data and Truncated Answer**
The output contains some structured data but lacks a complete, natural language response.

```json
{
  "conversation": [
    {
      "role": "user",
      "content": "<image>\n{Is the median value of all the leftmost bars greater than 50?}\nThought: The areas to focus on in the image have bounding box coordinates:"
    },
    {
      "role": "assistant",
      "content": " [[[{'type': 'histogram', 'content': 0, 'x1': 127, 'y1': 0, 'x2': 455, 'y2': 102}, {'type': 'histogram', 'content': 1, 'x1': 455, 'y1': 0, 'x2': 918, 'y2': 98}, {'type': 'histogram', 'content': 2, 'x1': 918, 'y1': 0, 'x2': 1455, 'y2': 96}]]]\nAnswer: Yes \n"
    }
  ]
}
```
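To put a number on how often each failure mode occurs, the assistant outputs can be checked automatically. The sketch below assumes the intended format is a Python-literal list of bounding boxes, optionally followed by an `Answer:` line, as Example 3 suggests. The function name `check_output` is hypothetical and not part of the refocus code.

```python
import ast
from typing import Optional, Tuple


def check_output(text: str) -> Tuple[bool, Optional[str]]:
    """Return (boxes_parse_ok, answer) for one assistant response.

    Assumption: a well-formed response is a Python-literal list of boxes,
    optionally followed by "Answer: <text>".
    """
    # Split the bounding-box part from an optional trailing "Answer:" part.
    boxes_part, sep, answer_part = text.partition("Answer:")
    answer = answer_part.strip() if sep else None
    try:
        parsed = ast.literal_eval(boxes_part.strip())
        boxes_ok = isinstance(parsed, (list, tuple))
    except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
        # Truncated or bracket-mismatched output ends up here.
        boxes_ok = False
    return boxes_ok, answer


if __name__ == "__main__":
    truncated = " [{'content' ..."
    complete = " [[[{'type': 'histogram', 'content': 0, 'x1': 127, 'y1': 0, 'x2': 455, 'y2': 102}]]]\nAnswer: Yes \n"
    for sample in (truncated, complete):
        print(check_output(sample))
```

Under this check, Examples 1 and 2 fail to parse and have no answer, while Example 3 parses and yields the answer `Yes`.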

### **Expected Behavior (Normal Output)**

We expect the fine-tuned model to generate complete, contextually relevant sentences, similar to the base model.

**Comparison: Normal output from the base `phi-3.5` model in the SAME environment:**

```json
{
  "conversation": [
    {
      "role": "user",
      "content": "What is the difference between the sum of sales growth of face masks and alcohol-based hand sanitizers and the difference between the two?<image>"
    },
    {
      "role": "assistant",
      "content": "First, we find the sum of the sales growth rates for both products: 100% (face masks) + 623% (alcohol-based hand sanitizers) = 723%. Next, we find the difference between the two individual growth rates: 623% (alcohol-based hand sanitizers) - 100% (face masks) = 523%. The sum of the growth rates is 723%, and the difference between the two is 523%."
    }
  ]
}
```

Any guidance on how to resolve this issue would be greatly appreciated. Thank you.
