A question about finetune dataset processing #234

`DataCollatorForCompletionOnlyLM` does not seem to work as expected, and I'm not sure whether this affects fine-tuning performance.


According to [Fine Tuning Your Own ChatGPT-like Model](https://ai.plainenglish.io/exploring-dolly-2-0-a-guide-to-training-your-own-chatgpt-like-model-dd9b785ff1df), the purpose of the `DataCollatorForCompletionOnlyLM` class in dataset preprocessing is:

> The class method encodes the response key new line into token IDs using the tokenizer, and searches for the start position of the response key in each example's label tensor. Once the start position of the response key is found, the label tensor is modified to mask out all tokens before the end of the response key. This is done by setting the label IDs for those tokens to -100, which is a special value that tells the PyTorch loss function to ignore them.

```python
import numpy as np
import transformers

# Assumption: matches the prompt format shown below; defined elsewhere in the training script.
RESPONSE_KEY_NL = "### Response:\n"


class DataCollatorForCompletionOnlyLM(transformers.DataCollatorForLanguageModeling):
    def torch_call(self, examples):
        batch = super().torch_call(examples)
        # The prompt ends with the response key plus a newline.  We encode this and then try to find it in the
        # sequence of tokens.  This should just be a single token.
        response_token_ids = self.tokenizer.encode(RESPONSE_KEY_NL)
        labels = batch["labels"].clone()
        for i in range(len(examples)):  # batch["labels"][i] holds the labels of the i-th prompt in the batch
            response_token_ids_start_idx = None
            # Only the first token of the response key is compared, and the loop stops at the first match.
            for idx in np.where(batch["labels"][i] == response_token_ids[0])[0]:
                response_token_ids_start_idx = idx
                break
            if response_token_ids_start_idx is not None:
                response_token_ids_end_idx = response_token_ids_start_idx + 1
                # Make pytorch loss function ignore all tokens up through the end of the response key
                labels[i, :response_token_ids_end_idx] = -100
        batch["labels"] = labels
        return batch
```
I think this is intended to replace the labels of every prompt token, from the beginning through the response key, with the ignore value, so that the loss is computed only on the response content. However, this function currently does not achieve that in llm-on-ray.
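
To make the effect of the `-100` labels concrete, here is a minimal pure-Python sketch of a mean negative log-likelihood that skips ignored positions. It only mimics, as far as I understand it, the default `ignore_index=-100` behaviour of `torch.nn.CrossEntropyLoss`; the helper name `masked_mean_nll` and the toy numbers are made up for illustration.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips


def masked_mean_nll(log_probs, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored.

    log_probs[t][v] is the log-probability of vocabulary id v at position t.
    """
    kept = [(t, y) for t, y in enumerate(labels) if y != ignore_index]
    if not kept:
        return float("nan")  # nothing to average over
    return -sum(log_probs[t][y] for t, y in kept) / len(kept)


# Toy sequence: 4 positions over a 2-token vocabulary (hypothetical values).
log_probs = [
    [math.log(0.5), math.log(0.5)],
    [math.log(0.9), math.log(0.1)],
    [math.log(0.25), math.log(0.75)],
    [math.log(0.2), math.log(0.8)],
]
labels = [-100, -100, 1, 1]  # the first two positions stand for prompt tokens

# Only positions 2 and 3 contribute to the loss.
loss = masked_mean_nll(log_probs, labels)
```

If the prompt tokens are not masked, they enter the average as well, so the model is also trained to reproduce the instruction text.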

## For example:
### When parameter group is false
**Prompt content:**
  ```
  Below is an instruction that describes a task. Write a response that appropriately completes the request.
  
  ### Instruction:
  Which is a species of fish? Tope or Rope
  
  ### Response:
  Tope
  
  ### End
  ```
**Expected preprocessing results**
```
  -100 -100 -100 -100 -100 -100 -100…………
  Tope
  
  ### End
```
**Actual preprocessing results**
```
  -100 -100 -100 -100 -100 -100 -100…………
  -100 Instruction:
  Which is a species of fish? Tope or Rope
  
  ### Response:
  Tope
  
  ### End
```
This is because only `response_token_ids[0]` (**###**) is compared in `np.where(batch["labels"][i] == response_token_ids[0])`, so "### Instruction" is matched first instead of "### Response".
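
One possible direction, sketched here on plain Python lists rather than tensors, is to match the whole response-key token sequence instead of only its first token. The helper names and the token ids are hypothetical. In a real tokenizer, `encode` may also add special tokens or split the key differently inside a longer text, so the ids would need to be checked against the actual tokenized prompts.

```python
def find_subsequence(tokens, pattern, start=0):
    """Return the index of the first occurrence of `pattern` in `tokens`
    at or after `start`, or None if there is no match."""
    n, m = len(tokens), len(pattern)
    if m == 0:
        return None
    for i in range(start, n - m + 1):
        if tokens[i:i + m] == pattern:
            return i
    return None


def mask_prompt(labels, response_key_ids, ignore_index=-100):
    """Mask everything up to and including the first full response key."""
    start = find_subsequence(labels, response_key_ids)
    if start is None:
        return list(labels)  # key not found: leave the labels unchanged
    end = start + len(response_key_ids)
    return [ignore_index] * end + list(labels[end:])


# Hypothetical ids: 21 = "###", 30 = " Instruction", 31 = " Response", 22 = ":\n"
response_key_ids = [21, 31, 22]
labels = [5, 6, 21, 30, 22, 7, 8, 21, 31, 22, 40, 41]

# "### Instruction:" (ids 21, 30, 22) no longer matches; only 40 and 41 stay unmasked.
masked = mask_prompt(labels, response_key_ids)
```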

### When parameter group is true
When `group` is true, multiple prompts are concatenated into a single new prompt. Here is an example with two prompts spliced together.

**Prompt content:**
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Which is a species of fish? Tope or Rope

### Response:
Tope

### End
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Alice's parents have three daughters: Amy, Jessy, and what’s the name of the third daughter?

### Response:
The name of the third daughter is Alice

### End
```
**Expected preprocessing results**
```
-100 -100 -100 -100 -100 -100 -100…………
Tope

### End
-100 -100 -100 -100 -100 -100 -100…………
The name of the third daughter is Alice

### End
```
**Actual preprocessing results**
```
-100 -100 -100 -100 -100 -100 -100…………
-100 Instruction:
Which is a species of fish? Tope or Rope

### Response:
Tope

### End
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Alice's parents have three daughters: Amy, Jessy, and what’s the name of the third daughter?

### Response:
The name of the third daughter is Alice

### End
```
This is because the loop breaks as soon as the first match is found, so the second prompt spliced in is never processed.
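
For the grouped case, a rough sketch of one way to get the expected result is to scan the whole sequence and keep labels only between each response key and the end key that follows it. This again works on plain lists with hypothetical token ids, and it assumes an end key such as "### End" can be located as a token sequence. It is not the current llm-on-ray implementation.

```python
def find_subsequence(tokens, pattern, start=0):
    """Index of the first occurrence of `pattern` at or after `start`, else None."""
    m = len(pattern)
    if m == 0:
        return None
    for i in range(start, len(tokens) - m + 1):
        if tokens[i:i + m] == pattern:
            return i
    return None


def mask_packed_prompts(labels, response_key_ids, end_key_ids, ignore_index=-100):
    """Keep labels only for each response, up to and including its end key."""
    out = [ignore_index] * len(labels)
    pos = 0
    while True:
        key_start = find_subsequence(labels, response_key_ids, pos)
        if key_start is None:
            break  # no further response key: the rest stays masked
        resp_start = key_start + len(response_key_ids)
        end_start = find_subsequence(labels, end_key_ids, resp_start)
        # If the end key was truncated away, keep the response to the end of the sequence.
        resp_end = len(labels) if end_start is None else end_start + len(end_key_ids)
        out[resp_start:resp_end] = labels[resp_start:resp_end]
        pos = resp_end
    return out


# Hypothetical ids: 21 = "###", 30 = " Instruction", 31 = " Response", 22 = ":\n", 32 = " End"
response_key_ids = [21, 31, 22]
end_key_ids = [21, 32]
packed = [1, 2, 21, 30, 22, 3, 21, 31, 22, 40, 21, 32,        # first prompt
          1, 2, 21, 30, 22, 4, 21, 31, 22, 41, 42, 21, 32]    # second prompt

# Both responses (40 and 41, 42) and their end keys stay unmasked.
masked = mask_packed_prompts(packed, response_key_ids, end_key_ids)
```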
