Description
I am currently experimenting with training. I have had two successful training runs on small numbers of frames (3000+3000 from the L4 mouse and 3000+2982 from the L6 mouse). This worked and gave me models that I could use for correction. My next attempt, training on all 4 files from both layers, however, keeps failing with the following error:
$ python train.py --datasets_folder L4_L6 --datasets_path /gpfs/soma_fs/home/voit/bbo/projects/2gramfiberscope/experiments/denoising_training_data/ --n_epochs 40 --GPU 0,1,2,3 --batch_size 4 --train_datasets_size 300 --select_img_num 11400
srun: job 31197 queued and waiting for resources
srun: job 31197 has been allocated resources
Training parameters ----->
Namespace(GPU='0,1,2,3', b1=0.5, b2=0.999, batch_size=4, datasets_folder='L4_L6', datasets_path='/gpfs/soma_fs/home/voit/bbo/projects/2gramfiberscope/experiments/denoising_training_data/', fmap=16, gap_h=75, gap_s=75, gap_w=75, img_h=150, img_s=150, img_w=150, lr=5e-05, n_epochs=40, ngpu=4, normalize_factor=1, output_dir='./results', pth_path='pth', select_img_num=11400, train_datasets_size=300)
Image list for training ----->
Total number -----> 4
M210601JKL_20210702_D4_00002_xscancorr_rigidXCorr_export.tif
M210602JKL_20210702_D8_00006_xscancorr_rigidXCorr_export.tif
M210602JKL_20210702_D8_00003_xscancorr_rigidXCorr_export.tif
M210601JKL_20210702_D4_00005_xscancorr_rigidXCorr_export.tif
Using 4 GPU for training ----->
Traceback (most recent call last):
File "train.py", line 96, in <module>
for iteration, (input, target) in enumerate(trainloader):
File "/gpfs/soma_fs/home/voit/anaconda3/envs/deepcad/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
data = self._next_data()
File "/gpfs/soma_fs/home/voit/anaconda3/envs/deepcad/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1179, in _next_data
return self._process_data(data)
File "/gpfs/soma_fs/home/voit/anaconda3/envs/deepcad/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
data.reraise()
File "/gpfs/soma_fs/home/voit/anaconda3/envs/deepcad/lib/python3.6/site-packages/torch/_utils.py", line 429, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 3.
Original Traceback (most recent call last):
File "/gpfs/soma_fs/home/voit/anaconda3/envs/deepcad/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
data = fetcher.fetch(index)
File "/gpfs/soma_fs/home/voit/anaconda3/envs/deepcad/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/gpfs/soma_fs/home/voit/anaconda3/envs/deepcad/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 83, in default_collate
return [default_collate(samples) for samples in transposed]
File "/gpfs/soma_fs/home/voit/anaconda3/envs/deepcad/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 83, in <listcomp>
return [default_collate(samples) for samples in transposed]
File "/gpfs/soma_fs/home/voit/anaconda3/envs/deepcad/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [1, 150, 150, 150] at entry 0 and [1, 149, 150, 150] at entry 2
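As far as I can tell, the final error comes from the default DataLoader collate step, which stacks the samples of a batch with torch.stack and therefore requires every sample to have exactly the same shape, so a single 149-frame patch is enough to abort the run. A standalone snippet that reproduces just this error, independent of the DeepCAD dataset code:
import torch

# torch.stack (used by the default collate_fn) refuses tensors of unequal
# size, which is exactly the RuntimeError reported above.
a = torch.zeros(1, 150, 150, 150)
b = torch.zeros(1, 149, 150, 150)
torch.stack([a, b], 0)  # RuntimeError: stack expects each tensor to be equal size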
I have checked that all files have the same spatial dimensions, in case that is important:
$ for i in *.tif; do echo $i;tiffinfo $i|head -7; done
M210601JKL_20210702_D4_00002_xscancorr_rigidXCorr_export.tif
TIFF Directory at offset 0x241f6 (147958)
Image Width: 269 Image Length: 275
Bits/Sample: 16
Sample Format: unsigned integer
Compression Scheme: None
Photometric Interpretation: min-is-black
Samples/Pixel: 1
M210601JKL_20210702_D4_00005_xscancorr_rigidXCorr_export.tif
TIFF Directory at offset 0x241f6 (147958)
Image Width: 269 Image Length: 275
Bits/Sample: 16
Sample Format: unsigned integer
Compression Scheme: None
Photometric Interpretation: min-is-black
Samples/Pixel: 1
M210602JKL_20210702_D8_00003_xscancorr_rigidXCorr_export.tif
TIFF Directory at offset 0x241f6 (147958)
Image Width: 269 Image Length: 275
Bits/Sample: 16
Sample Format: unsigned integer
Compression Scheme: None
Photometric Interpretation: min-is-black
Samples/Pixel: 1
M210602JKL_20210702_D8_00006_xscancorr_rigidXCorr_export.tif
TIFF Directory at offset 0x241f6 (147958)
Image Width: 269 Image Length: 275
Bits/Sample: 16
Sample Format: unsigned integer
Compression Scheme: None
Photometric Interpretation: min-is-black
Samples/Pixel: 1
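Since the mismatch is in the first (frame) dimension rather than the spatial ones, I also want to check the number of frames per file. A minimal sketch for that, assuming the tifffile package is available (it is not part of DeepCAD itself):
import glob
import tifffile

# Print the frame count of every stack in the current folder; a stack whose
# frame count does not line up with the 150-frame patch depth might be the
# source of the truncated 149-frame patch in the error above.
for path in sorted(glob.glob("*.tif")):
    with tifffile.TiffFile(path) as tif:
        print(path, len(tif.pages))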
Despite having read issue #2, I am not yet sure whether I interpret --train_datasets_size correctly. Is it the number of sub-stacks or their size? Do I have to make sure that --select_img_num is divisible by that number? I also find that the choice of --train_datasets_size massively influences the computation time. What would be an expected sweet spot here?
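For what it is worth, my current assumption (not verified against the code) is that full-depth sub-stacks of img_s=150 frames are cut with a stride of gap_s=75 frames, which is how I have been reasoning about --select_img_num and the frame counts above:
# Back-of-the-envelope sketch of my assumed patch tiling (hypothetical, not
# taken from the DeepCAD code): full-depth sub-stacks of img_s frames cut
# with a stride of gap_s frames.
img_s, gap_s = 150, 75

def full_substacks(n_frames):
    return (n_frames - img_s) // gap_s + 1

for n_frames in (2982, 3000, 11400):
    print(n_frames, full_substacks(n_frames))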