Describe the bug
`tokenize_dataset()` in `packed_sequence.py` fails when `chat=True` and `pad_seq_to_mult > 1`:
- **Tensor/list mismatch:** `_chat_preprocess` returns `torch.LongTensor`/`torch.BoolTensor`, but `pre_pad_dataset` concatenates with plain lists (`val + [pad_id] * ...`), raising a `TypeError`.
- **Missing `loss_mask` padding:** `pre_pad_dataset` pads `input_ids` and `context_ids` but not `loss_mask`. Sequences with different original lengths can round to the same padded `input_ids` length, so `create_hist` groups them together, but their `loss_mask` arrays differ in length, causing `np.array()` in `fill_packing_strategy` to fail with `ValueError: inhomogeneous shape`.
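The second failure mode can be reproduced in isolation. A minimal sketch, using a hypothetical `padded_len` helper that mirrors the round-up-to-multiple padding (not the library's actual code):

```python
import numpy as np

def padded_len(n, mult):
    # round n up to the nearest multiple of mult
    return -(-n // mult) * mult

# With pad_seq_to_mult=8, sequences of length 5 and 7 both pad input_ids
# to length 8, so a histogram keyed on padded length buckets them together...
lengths = [5, 7]
padded = [padded_len(n, 8) for n in lengths]
assert padded == [8, 8]

# ...but their unpadded loss_mask arrays still differ in length, so stacking
# them fails (ValueError on NumPy >= 1.24; an object-array path on older NumPy)
loss_masks = [[1] * n for n in lengths]
try:
    np.array(loss_masks)
except ValueError:
    pass  # "inhomogeneous shape" -- the error surfacing in fill_packing_strategy
```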
Steps/Code to reproduce bug
```python
from megatron.bridge.data.builders.finetuning_dataset import FinetuningDatasetBuilder
from megatron.bridge.data.datasets.packed_sequence import PackedSequenceSpecs

builder = FinetuningDatasetBuilder(
    dataset_root="path/to/jsonl",
    tokenizer=tokenizer,
    seq_length=221184,
    packed_sequence_specs=PackedSequenceSpecs(
        packed_sequence_size=221184,
        tokenizer_model_name="qwen3-14b",
        pad_seq_to_mult=8,
    ),
    dataset_kwargs={"chat": True, "use_hf_tokenizer_chat_template": True},
)
builder.prepare_packed_data()
```
Expected behavior
Tokenization and packing complete without error when using chat datasets with pad_seq_to_mult > 1.
Additional context
`GPTSFTDataset` (non-chat) is unaffected: it returns plain lists and does not include `loss_mask` in its output dict. `pad_seq_to_mult=1` is also unaffected: the padding block is skipped entirely.
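One possible shape for a fix, sketched as a hypothetical per-example helper (the real `pre_pad_dataset` operates on a whole dataset, and the field semantics here are assumptions): normalize tensor fields to plain lists before concatenating, and pad `loss_mask` alongside `input_ids` so every per-token field in a bucket shares one length.

```python
def pre_pad_example(example, pad_id, pad_seq_to_mult, mask_pad=0):
    """Pad one example so every per-token field shares the padded length.

    Hypothetical sketch, not the repository's actual patch.
    """
    def to_list(val):
        # _chat_preprocess returns torch tensors; .tolist() normalizes them,
        # while plain lists (the non-chat path) pass through unchanged
        return val.tolist() if hasattr(val, "tolist") else list(val)

    ids = to_list(example["input_ids"])
    # round up to the next multiple of pad_seq_to_mult
    target = -(-len(ids) // pad_seq_to_mult) * pad_seq_to_mult
    pad = target - len(ids)

    out = dict(example)
    out["input_ids"] = ids + [pad_id] * pad
    out["context_ids"] = to_list(example["context_ids"]) + [pad_id] * pad
    if "loss_mask" in example:  # chat path: keep loss_mask in sync with input_ids
        out["loss_mask"] = to_list(example["loss_mask"]) + [mask_pad] * pad
    return out
```

Padding `loss_mask` with zeros keeps the pad tokens excluded from the loss while guaranteeing that `np.array()` over a bucket sees uniform lengths.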