tokenize_dataset crashes with TypeError when chat=True and pad_seq_to_mult > 1 #2610

@shanecmoran

Description

Describe the bug

tokenize_dataset() in packed_sequence.py fails when chat=True and pad_seq_to_mult > 1:

  1. Tensor/list mismatch: _chat_preprocess returns torch.LongTensor/torch.BoolTensor, but pre_pad_dataset concatenates with plain lists (val + [pad_id] * ...), raising TypeError.

  2. Missing loss_mask padding: pre_pad_dataset pads input_ids and context_ids but not loss_mask. Sequences with different original lengths can round to the same padded input_ids length, so create_hist groups them together — but their loss_mask arrays differ in length, causing np.array() in fill_packing_strategy to fail with ValueError: inhomogeneous shape.
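Both failure modes can be reproduced in isolation. A minimal illustrative sketch (using NumPy arrays as a stand-in for the `torch.LongTensor`/`torch.BoolTensor` values returned by `_chat_preprocess`; none of the names below are the actual `packed_sequence.py` code):

```python
import numpy as np

# 1) Tensor/list mismatch: list-style padding `val + [pad_id] * n`
#    concatenates when `val` is a list, but attempts elementwise
#    arithmetic (and fails) when `val` is an array/tensor type.
pad_id = 0
as_list = [5, 6, 7]
assert as_list + [pad_id] * 5 == [5, 6, 7, 0, 0, 0, 0, 0]  # list: concat

as_array = np.array([5, 6, 7])  # stand-in for torch.LongTensor
try:
    as_array + [pad_id] * 5     # arithmetic with mismatched shapes, not concat
    pad_failed = False
except (TypeError, ValueError):
    pad_failed = True
print("array-style padding failed:", pad_failed)

# 2) Missing loss_mask padding: two samples whose padded input_ids end
#    up the same length but whose loss_mask lengths differ cannot be
#    stacked into one rectangular array.
masks = [[1, 1, 1], [1, 1, 1, 1]]   # ragged lengths 3 and 4
try:
    np.array(masks, dtype=np.int64)  # raises ValueError on ragged input
    stacked = True
except ValueError:
    stacked = False
print("loss_mask stacking failed:", not stacked)
```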

Steps/Code to reproduce bug

```python
from megatron.bridge.data.builders.finetuning_dataset import FinetuningDatasetBuilder
from megatron.bridge.data.datasets.packed_sequence import PackedSequenceSpecs

builder = FinetuningDatasetBuilder(
    dataset_root="path/to/jsonl",
    tokenizer=tokenizer,
    seq_length=221184,
    packed_sequence_specs=PackedSequenceSpecs(
        packed_sequence_size=221184,
        tokenizer_model_name="qwen3-14b",
        pad_seq_to_mult=8,
    ),
    dataset_kwargs={"chat": True, "use_hf_tokenizer_chat_template": True},
)
builder.prepare_packed_data()
```

Expected behavior

Tokenization and packing complete without error when using chat datasets with pad_seq_to_mult > 1.

Additional context

  • GPTSFTDataset (non-chat) is unaffected — it returns plain lists and does not include loss_mask in its output dict.
  • pad_seq_to_mult=1 is unaffected — the padding block is skipped entirely.
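One possible shape of a fix (a hypothetical sketch only, not the actual megatron code; `pad_to_mult` is an illustrative helper, again with NumPy standing in for torch): pad in a type-aware way, and pad `loss_mask` alongside `input_ids` and `context_ids`:

```python
import numpy as np

def pad_to_mult(val, mult, pad_value):
    """Right-pad `val` so that len(val) is a multiple of `mult`."""
    n = (-len(val)) % mult
    if n == 0:
        return val
    if isinstance(val, list):
        return val + [pad_value] * n
    # array/tensor path: concatenate padding instead of using `+`
    return np.concatenate([val, np.full(n, pad_value, dtype=val.dtype)])

sample = {
    "input_ids": np.array([5, 6, 7]),
    "loss_mask": np.array([0, 1, 1]),
}
# Pad every length-coupled field, not just input_ids/context_ids, so
# samples grouped by padded input_ids length stack into one array.
for key, pad in (("input_ids", 0), ("loss_mask", 0)):
    sample[key] = pad_to_mult(sample[key], 8, pad)
print(len(sample["input_ids"]), len(sample["loss_mask"]))  # both padded to 8
```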
