Hello author!
Thank you very much for sharing this work. I trained the model with the following command:
```shell
python Train.py \
    --ngpu 1 \
    --train_keys ./data/data_splits/screen_model/train_keys.pkl \
    --val_keys ./data/data_splits/screen_model/val_keys.pkl \
    --test_keys ./data/data_splits/screen_model/test_keys.pkl \
    --epoch 5 \
    --batch_size 1 \
    --test
```
It fails with the following error:
```
Available GPU List
id utilization.gpu(%) memory.free(MiB)
0 2 19680
Select id #0 for you.
2025-02-24 08:17:42
Number of train data: 310052
Number of val data: 34867
Number of test data: 104206
number of parameters : 565317
0%| | 0/310052 [00:00<?, ?it/s]
0%| | 1/310052 [00:01<130:24:19, 1.51s/it]
0%| | 2/310052 [00:01<61:32:23, 1.40it/s]
...
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.68 GiB total capacity; 17.76 GiB already allocated; 12.88 MiB free; 18.95 GiB allowed; 18.84 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
terminate called without an active exception
```
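As one thing I have tried: the traceback itself suggests capping the allocator's split size when reserved memory greatly exceeds allocated memory (a fragmentation symptom). This is just the hint from the error message, not a fix specific to Train.py, and 128 MiB is an arbitrary starting value rather than a tuned one:

```shell
# Suggested by the traceback: cap the CUDA caching allocator's block
# split size to reduce fragmentation before launching training.
# 128 is a guess, not a tuned value.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# ...then re-run the training command above.
```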
My GPU is a 24 GB RTX 3090 and batch_size is set to 1, yet memory usage keeps growing as training runs, and at around 3% progress it fails with the out-of-memory error above. Is there a setting that can fix this?
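For reference, since memory grows over iterations even at batch_size 1, my guess is that some tensor still attached to the autograd graph is being accumulated across steps. I have not read Train.py closely, so the snippet below is only a minimal, hypothetical reproduction of that pattern and its fix, not your actual code:

```python
import torch
import torch.nn as nn

# Hypothetical minimal model and loop (not from Train.py). Accumulating
# the raw loss tensor keeps every iteration's autograd graph alive, so
# GPU memory grows over time even with batch_size 1.
model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

total_loss = 0.0
for _ in range(3):
    x = torch.randn(1, 4)
    loss = (model(x) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    # BAD:  total_loss += loss       # graph-attached tensor, graphs pile up
    # GOOD: convert to a plain float so each step's graph can be freed
    total_loss += loss.item()
```

If Train.py logs a running loss (or stores per-batch outputs for the test set) without `.item()` or `.detach()`, that would match the slow growth I am seeing.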
Looking forward to your reply!