tokenize_dataset crashes with TypeError when chat=True and pad_seq_to_mult > 1 #2610

@shanecmoran

Description

Describe the bug

tokenize_dataset() in packed_sequence.py fails when chat=True and pad_seq_to_mult > 1:

  1. Tensor/list mismatch: _chat_preprocess returns torch.LongTensor/torch.BoolTensor, but pre_pad_dataset concatenates with plain lists (val + [pad_id] * ...), raising TypeError.

  2. Missing loss_mask padding: pre_pad_dataset pads input_ids and context_ids but not loss_mask. Sequences with different original lengths can round to the same padded input_ids length, so create_hist groups them together — but their loss_mask arrays differ in length, causing np.array() in fill_packing_strategy to fail with ValueError: inhomogeneous shape.
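Both failure modes can be reproduced in isolation. A minimal illustrative sketch (using NumPy arrays as a stand-in for the `torch.LongTensor`/`torch.BoolTensor` values returned by `_chat_preprocess`; none of the names below are the actual `packed_sequence.py` code):

```python
import numpy as np

# 1) Tensor/list mismatch: list-style padding `val + [pad_id] * n`
#    concatenates when `val` is a list, but attempts elementwise
#    arithmetic (and fails) when `val` is an array/tensor type.
pad_id = 0
as_list = [5, 6, 7]
assert as_list + [pad_id] * 5 == [5, 6, 7, 0, 0, 0, 0, 0]  # list: concat

as_array = np.array([5, 6, 7])  # stand-in for torch.LongTensor
try:
    as_array + [pad_id] * 5     # arithmetic with mismatched shapes, not concat
    pad_failed = False
except (TypeError, ValueError):
    pad_failed = True
print("array-style padding failed:", pad_failed)

# 2) Missing loss_mask padding: two samples whose padded input_ids end
#    up the same length but whose loss_mask lengths differ cannot be
#    stacked into one rectangular array.
masks = [[1, 1, 1], [1, 1, 1, 1]]   # ragged lengths 3 and 4
try:
    np.array(masks, dtype=np.int64)  # raises ValueError on ragged input
    stacked = True
except ValueError:
    stacked = False
print("loss_mask stacking failed:", not stacked)
```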

Steps/Code to reproduce bug

```python
from megatron.bridge.data.builders.finetuning_dataset import FinetuningDatasetBuilder
from megatron.bridge.data.datasets.packed_sequence import PackedSequenceSpecs

builder = FinetuningDatasetBuilder(
    dataset_root="path/to/jsonl",
    tokenizer=tokenizer,
    seq_length=221184,
    packed_sequence_specs=PackedSequenceSpecs(
        packed_sequence_size=221184,
        tokenizer_model_name="qwen3-14b",
        pad_seq_to_mult=8,
    ),
    dataset_kwargs={"chat": True, "use_hf_tokenizer_chat_template": True},
)
builder.prepare_packed_data()
```

Expected behavior

Tokenization and packing complete without error when using chat datasets with pad_seq_to_mult > 1.

Additional context

  • GPTSFTDataset (non-chat) is unaffected — it returns plain lists and does not include loss_mask in its output dict.
  • pad_seq_to_mult=1 is unaffected — the padding block is skipped entirely.
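One possible shape of a fix (a hypothetical sketch only, not the actual megatron code; `pad_to_mult` is an illustrative helper, again with NumPy standing in for torch): pad in a type-aware way, and pad `loss_mask` alongside `input_ids` and `context_ids`:

```python
import numpy as np

def pad_to_mult(val, mult, pad_value):
    """Right-pad `val` so that len(val) is a multiple of `mult`."""
    n = (-len(val)) % mult
    if n == 0:
        return val
    if isinstance(val, list):
        return val + [pad_value] * n
    # array/tensor path: concatenate padding instead of using `+`
    return np.concatenate([val, np.full(n, pad_value, dtype=val.dtype)])

sample = {
    "input_ids": np.array([5, 6, 7]),
    "loss_mask": np.array([0, 1, 1]),
}
# Pad every length-coupled field, not just input_ids/context_ids, so
# samples grouped by padded input_ids length stack into one array.
for key, pad in (("input_ids", 0), ("loss_mask", 0)):
    sample[key] = pad_to_mult(sample[key], 8, pad)
print(len(sample["input_ids"]), len(sample["loss_mask"]))  # both padded to 8
```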
