Training loss crashes when sequence length exceeds 20k #26

@LuLuLuyi

Description

Hi, I encountered an issue when using LLaMA-Factory to fine-tune SDAR-30B-A3B-Sci.
Whenever the training sequence length exceeds ~20k tokens (e.g., 32k), training breaks down and the reported loss collapses to 0.

Below is my training config:

```yaml
### model
model_name_or_path: SDAR-30B-A3B-Sci
train_from_scratch: false
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
freeze_vision_tower: true
freeze_multi_modal_projector: false
freeze_language_model: true
deepspeed: examples/deepspeed/ds_z3_config.json
# disable_gradient_checkpointing: true
gradient_checkpointing: true

### dataset
dataset: open_r1_math
template: qwen3
block_length: 4
cutoff_len: 32768
truncate_mode: drop
overwrite_cache: false
preprocessing_num_workers: 96
dataloader_num_workers: 4
neat_packing: true

### output
output_dir: /sft_test/sdar_30ba3b_math_r1_cot/
logging_steps: 5
save_steps: 256
save_total_limit: 10
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: wandb

### train
run_name: sdar_30ba3b_math_r1_cot
include_effective_tokens_per_second: true
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 10.0
lr_scheduler_type: constant_with_warmup
warmup_ratio: 0.03
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
```

Here is my training log — the loss collapses to 0 right after the first logging step:

```
  0%|          | 5/5190 [23:52<270:08:47, 187.57s/it]
{'loss': 2.1261, 'grad_norm': 1.4142135623730951, 'learning_rate': 2.564102564102564e-07, 'epoch': 0.01}
  0%|          | 10/5190 [28:22<105:26:29, 73.28s/it]
{'loss': 0.0, 'grad_norm': 1.4142135623730951, 'learning_rate': 5.76923076923077e-07, 'epoch': 0.02}
  0%|          | 15/5190 [33:22<91:52:30, 63.91s/it]
{'loss': 0.0, 'grad_norm': 1.4142135623730951, 'learning_rate': 8.974358974358975e-07, 'epoch': 0.03}
  0%|          | 20/5190 [37:49<79:04:08, 55.06s/it]
{'loss': 0.0, 'grad_norm': 1.4142135623730951, 'learning_rate': 1.217948717948718e-06, 'epoch': 0.04}
  0%|          | 25/5190 [42:15<77:06:53, 53.75s/it]
```

I suspect this issue is related to the model's maximum supported context length of 40k:
• When training with 20k sequences, the attention input becomes 20k clean + 20k noise = 40k tokens, which fits within the limit.
• When training with 32k sequences, it becomes 32k clean + 32k noise = 64k tokens, which exceeds the 40k limit and breaks training.
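To make the arithmetic behind my suspicion concrete, here is a minimal sketch. The helper names and the clean+noise doubling are my own assumptions about SDAR's training setup, not confirmed internals, and the 40k limit is the figure I am assuming above:

```python
# Hypothesis: each training sequence is materialized twice (a clean copy
# plus a noisy copy), so the effective attention width is 2 * cutoff_len.
def effective_length(cutoff_len: int) -> int:
    return 2 * cutoff_len

def fits_context(cutoff_len: int, max_context: int) -> bool:
    # True when the doubled sequence still fits in the context window.
    return effective_length(cutoff_len) <= max_context

MAX_CONTEXT = 40_000  # assumed limit for SDAR-30B-A3B-Sci

print(fits_context(20_000, MAX_CONTEXT))  # True  -> training is stable
print(fits_context(32_000, MAX_CONTEXT))  # False -> loss collapses to 0
```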

However, I still have a question: the model's RoPE position encoding should only use positions from 0 to cutoff_len, so in principle it shouldn't trigger out-of-range errors.

Could you clarify what exactly causes the training to crash when the sequence length exceeds around 20k, and whether there is any way to support longer training sequences such as 32k or 64k? Any guidance would be greatly appreciated. Thank you!
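For clarity, here is how I imagine the position ids would be laid out if the noisy copy reuses the clean copy's positions. This is purely a sketch of my assumption, not the actual SDAR implementation, and the function name is mine:

```python
def build_position_ids(seq_len: int) -> list[int]:
    # Assumed layout: the clean and noisy copies each get positions
    # 0..seq_len-1, so RoPE never sees an index beyond seq_len even
    # though the attention matrix is 2 * seq_len wide.
    clean = list(range(seq_len))
    noisy = list(range(seq_len))
    return clean + noisy

ids = build_position_ids(4)
print(ids)       # [0, 1, 2, 3, 0, 1, 2, 3]
print(max(ids))  # 3 -- bounded by cutoff_len, so RoPE should stay in range
```

If positions are really bounded like this, the failure at 32k must come from the doubled attention width rather than from RoPE indices going out of range.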
