
Using batch size > 1 raises "IndexError: list index out of range" #18

@yetaotjroc93

Description


Hi Team,

I tried to train a BrainLM model by running:

python train.py \
  --output_dir /data/users2/ytao/BrainLM_YeTao/output_fake \
  --train_dataset_path /data/users2/ytao/BrainLM_YeTao/datasets_fake/train_ukbiobank \
  --val_dataset_path /data/users2/ytao/BrainLM_YeTao/datasets_fake/val_ukbiobank \
  --coords_dataset_path /data/users2/ytao/BrainLM_YeTao/datasets_fake/Brain_Region_Coordinates \
  --moving_window_len 20 \
  --num_last_timepoints_masked 4 \
  --hidden_size 128 \
  --num_hidden_layers 4 \
  --num_attention_heads 4 \
  --intermediate_size 512 \
  --decoder_hidden_size 128 \
  --decoder_num_hidden_layers 2 \
  --decoder_num_attention_heads 4 \
  --decoder_intermediate_size 512 \
  --attention_probs_dropout_prob 0.1 \
  --per_device_train_batch_size 32 \
  --per_device_eval_batch_size 32 \
  --gradient_accumulation_steps 8 \
  --save_total_limit 50 \
  --dataloader_num_workers 3 \
  --dataloader_pin_memory True \
  --wandb_logging True \
  --dataloader_drop_last True

Running this produced the error below. If I set both per_device_train_batch_size and per_device_eval_batch_size to 1, the code runs successfully. Could you explain why this happens?
  File "/data/users2/ytao/BrainLM_YeTao/train.py", line 702, in <module>
    main()
  File "/data/users2/ytao/BrainLM_YeTao/train.py", line 672, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/data/users2/ytao/bin/miniconda3/envs/brainlm/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/data/users2/ytao/bin/miniconda3/envs/brainlm/lib/python3.10/site-packages/transformers/trainer.py", line 1899, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/data/users2/ytao/bin/miniconda3/envs/brainlm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
    data = self._next_data()
  File "/data/users2/ytao/bin/miniconda3/envs/brainlm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1465, in _next_data
    return self._process_data(data)
  File "/data/users2/ytao/bin/miniconda3/envs/brainlm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1491, in _process_data
    data.reraise()
  File "/data/users2/ytao/bin/miniconda3/envs/brainlm/lib/python3.10/site-packages/torch/_utils.py", line 715, in reraise
    raise exception
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/data/users2/ytao/bin/miniconda3/envs/brainlm/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 351, in _worker_loop
    data = fetcher.fetch(index) # type: ignore[possibly-undefined]
  File "/data/users2/ytao/bin/miniconda3/envs/brainlm/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 50, in fetch
    data = self.dataset.__getitems__(possibly_batched_index)
  File "/data/users2/ytao/bin/miniconda3/envs/brainlm/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2809, in __getitems__
    return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
  File "/data/users2/ytao/bin/miniconda3/envs/brainlm/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2809, in <listcomp>
    return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
  File "/data/users2/ytao/bin/miniconda3/envs/brainlm/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2809, in <dictcomp>
    return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
IndexError: list index out of range
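The last frame points at a list comprehension inside `datasets`' `__getitems__`. One plausible mechanism (an assumption, not confirmed against the BrainLM code): in recent `datasets` releases the number of examples is taken from the length of the first column of the fetched batch, and every other column is then indexed with that same range. If a transform returns some column with fewer entries than the batch size (for example one shared value instead of one value per example), indexing fails as soon as `i` reaches 1, while a batch of size 1 never gets that far. The pure-Python sketch below imitates that comprehension; `getitems_like_datasets`, `mismatched_columns` and the column names `signal` and `coords` are hypothetical and only for illustration.

```python
from typing import Any, Dict, List, Sequence


def getitems_like_datasets(batch: Dict[str, Sequence[Any]]) -> List[Dict[str, Any]]:
    """Hypothetical imitation of the comprehension in the traceback.

    Assumption: n_examples comes from the FIRST column only, and every other
    column is indexed with the same range, so a shorter column raises IndexError.
    """
    n_examples = len(batch[next(iter(batch))])
    return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]


def mismatched_columns(batch: Dict[str, Sequence[Any]]) -> Dict[str, int]:
    """Hypothetical helper: report columns whose length differs from the first column."""
    expected = len(batch[next(iter(batch))])
    return {col: len(array) for col, array in batch.items() if len(array) != expected}


# Batch of size 1: every column has exactly one entry, so the comprehension succeeds.
ok_batch = {"signal": [[0.1, 0.2]], "coords": [[1, 2, 3]]}
print(len(getitems_like_datasets(ok_batch)))  # 1

# Batch of size 4 where "coords" holds one shared entry instead of four per-example entries.
bad_batch = {"signal": [[0.1], [0.2], [0.3], [0.4]], "coords": [[1, 2, 3]]}
print(mismatched_columns(bad_batch))  # {'coords': 1}
try:
    getitems_like_datasets(bad_batch)
except IndexError as err:
    # Fails at i == 1 when indexing the one-element "coords" column.
    print("IndexError:", err)  # IndexError: list index out of range
```

If a check like `mismatched_columns` reports a non-empty result on a batch produced by the dataset's transform with more than one index, that would be consistent with this explanation; an empty result would point elsewhere.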

Thanks,
Ye
