
RuntimeError: stack expects each tensor to be equal size, but got [1, 150, 150, 150] at entry 0 and [1, 149, 150, 150] at entry 3 #9

@kaymvoit

I am currently experimenting with training. I have had two successful training runs on small numbers of frames (3000+3000 from the L4 mouse and 3000+2982 from the L6 mouse). These worked and gave me models that I could use for correction. My next attempt, training on all 4 files from both layers, however, keeps failing with the following error:

$ python train.py --datasets_folder L4_L6 --datasets_path /gpfs/soma_fs/home/voit/bbo/projects/2gramfiberscope/experiments/denoising_training_data/ --n_epochs 40 --GPU 0,1,2,3 --batch_size 4 --train_datasets_size 300 --select_img_num 11400
srun: job 31197 queued and waiting for resources
srun: job 31197 has been allocated resources
Training parameters -----> 
Namespace(GPU='0,1,2,3', b1=0.5, b2=0.999, batch_size=4, datasets_folder='L4_L6', datasets_path='/gpfs/soma_fs/home/voit/bbo/projects/2gramfiberscope/experiments/denoising_training_data/', fmap=16, gap_h=75, gap_s=75, gap_w=75, img_h=150, img_s=150, img_w=150, lr=5e-05, n_epochs=40, ngpu=4, normalize_factor=1, output_dir='./results', pth_path='pth', select_img_num=11400, train_datasets_size=300)
Image list for training -----> 
Total number ----->  4
M210601JKL_20210702_D4_00002_xscancorr_rigidXCorr_export.tif
M210602JKL_20210702_D8_00006_xscancorr_rigidXCorr_export.tif
M210602JKL_20210702_D8_00003_xscancorr_rigidXCorr_export.tif
M210601JKL_20210702_D4_00005_xscancorr_rigidXCorr_export.tif
Using 4 GPU for training -----> 
Traceback (most recent call last):
  File "train.py", line 96, in <module>
    for iteration, (input, target) in enumerate(trainloader):
  File "/gpfs/soma_fs/home/voit/anaconda3/envs/deepcad/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/gpfs/soma_fs/home/voit/anaconda3/envs/deepcad/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1179, in _next_data
    return self._process_data(data)
  File "/gpfs/soma_fs/home/voit/anaconda3/envs/deepcad/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/gpfs/soma_fs/home/voit/anaconda3/envs/deepcad/lib/python3.6/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 3.
Original Traceback (most recent call last):
  File "/gpfs/soma_fs/home/voit/anaconda3/envs/deepcad/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/gpfs/soma_fs/home/voit/anaconda3/envs/deepcad/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/gpfs/soma_fs/home/voit/anaconda3/envs/deepcad/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 83, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/gpfs/soma_fs/home/voit/anaconda3/envs/deepcad/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 83, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/gpfs/soma_fs/home/voit/anaconda3/envs/deepcad/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [1, 150, 150, 150] at entry 0 and [1, 149, 150, 150] at entry 2
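For reference, the failure can be reproduced in isolation: `default_collate` stacks all samples of a batch along a new dimension 0 via `torch.stack`, which requires every tensor in the batch to have exactly the same shape. A minimal sketch of the same mismatch (one patch is one frame short in the depth axis):

```python
import torch

# default_collate calls torch.stack(batch, 0), which refuses tensors
# whose shapes differ in any dimension.
full = torch.zeros(1, 150, 150, 150)   # a complete 150-frame patch
short = torch.zeros(1, 149, 150, 150)  # one frame short in the depth axis

try:
    torch.stack([full, short], 0)
except RuntimeError as e:
    print(e)  # stack expects each tensor to be equal size, ...
```

This suggests the dataset occasionally yields a patch whose temporal extent is 149 instead of 150 frames, which the default collate function cannot batch.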

I have checked that all files have the same spatial dimensions, in case that is important:

$ for i in *.tif; do echo $i;tiffinfo $i|head -7; done
TIFF Directory at offset 0x241f6 (147958)
  Image Width: 269 Image Length: 275
  Bits/Sample: 16
  Sample Format: unsigned integer
  Compression Scheme: None
  Photometric Interpretation: min-is-black
  Samples/Pixel: 1
M210601JKL_20210702_D4_00005_xscancorr_rigidXCorr_export.tif
TIFF Directory at offset 0x241f6 (147958)
  Image Width: 269 Image Length: 275
  Bits/Sample: 16
  Sample Format: unsigned integer
  Compression Scheme: None
  Photometric Interpretation: min-is-black
  Samples/Pixel: 1
M210602JKL_20210702_D8_00003_xscancorr_rigidXCorr_export.tif
TIFF Directory at offset 0x241f6 (147958)
  Image Width: 269 Image Length: 275
  Bits/Sample: 16
  Sample Format: unsigned integer
  Compression Scheme: None
  Photometric Interpretation: min-is-black
  Samples/Pixel: 1
M210602JKL_20210702_D8_00006_xscancorr_rigidXCorr_export.tif
TIFF Directory at offset 0x241f6 (147958)
  Image Width: 269 Image Length: 275
  Bits/Sample: 16
  Sample Format: unsigned integer
  Compression Scheme: None
  Photometric Interpretation: min-is-black
  Samples/Pixel: 1

Despite having read issue #2, I am not yet sure whether I interpret --train_datasets_size correctly. Is this the number of patches or the size of the stacks? Do I have to make sure that --select_img_num is divisible by that number? I also find that the choice of --train_datasets_size massively influences the computation time. What would be an expected sweet spot here?
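As a sanity check on the patch arithmetic (assuming, per the parameter dump above, temporal patches of img_s=150 frames sampled every gap_s=75 frames, starting from frame 0; this sampling scheme is my assumption, not confirmed from the code):

```python
def n_temporal_patches(n_frames, img_s=150, gap_s=75):
    """Whole patches of depth img_s that fit when a new patch starts
    every gap_s frames (assumed sampling scheme, not verified)."""
    if n_frames < img_s:
        return 0
    return (n_frames - img_s) // gap_s + 1


# With --select_img_num 11400: (11400 - 150) // 75 + 1
print(n_temporal_patches(11400))  # 151
```

11400 divides evenly into such windows, so a 149-frame patch would only appear if a stack actually holds fewer frames than expected, or if the sampling code lets the last window run past the end of the stack.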
