
CUDA out of memory #26

@dingjietao

Description

When I run training, I get `RuntimeError: CUDA out of memory`, and I don't know how to fix it.
GPUs: gpu0: 12 GB; gpu1: 12 GB.
My command is `bash experiments/scripts/train_faster_rcnn.sh 0 pascal_voc vgg16` or
`bash experiments/scripts/train_faster_rcnn.sh 0,1 pascal_voc vgg16`; both commands hit the same `RuntimeError: CUDA out of memory`.
I also modified vgg16.yml, changing `TRAIN.BATCH_SIZE` from 256 to 2.

Running log:
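For context, a rough back-of-the-envelope estimate of what the sampled batch itself costs (the 512×7×7 float32 pool5 shape per ROI is my assumption from standard VGG16, not read from this repo's code):

```python
# Rough estimate of pooled ROI-feature memory.
# Assumed per-ROI shape: 512 x 7 x 7, float32 (standard VGG16 pool5).
BYTES_PER_FLOAT32 = 4
bytes_per_roi = 512 * 7 * 7 * BYTES_PER_FLOAT32  # 100352 bytes, ~98 KiB per ROI

def pooled_feature_mib(num_rois):
    """Memory in MiB for `num_rois` pooled feature maps."""
    return num_rois * bytes_per_roi / 2**20

# Even at the original TRAIN.BATCH_SIZE of 256, the sampled batch is small:
print(f"{pooled_feature_mib(256):.1f} MiB")  # 24.5 MiB
```

If this estimate is right, cutting `TRAIN.BATCH_SIZE` from 256 to 2 barely changes the footprint, which would explain why the change did not help.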

```
+ set -e
+ export PYTHONUNBUFFERED=True
+ PYTHONUNBUFFERED=True
+ GPU_ID=0
+ DATASET=pascal_voc
+ NET=vgg16
+ array=($@)
+ len=3
+ EXTRA_ARGS=
+ EXTRA_ARGS_SLUG=
+ case ${DATASET} in
+ TRAIN_IMDB=voc_2007_trainval
+ TEST_IMDB=voc_2007_test
+ STEPSIZE='[50000]'
+ ITERS=100000
+ ANCHORS='[8,16,32]'
+ RATIOS='[0.5,1,2]'
++ date +%Y-%m-%d_%H-%M-%S
+ LOG=experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2021-09-11_15-11-21
+ exec
++ tee -a experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2021-09-11_15-11-21
tee: experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2021-09-11_15-11-21: No such file or directory
+ echo Logging output to experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2021-09-11_15-11-21
Logging output to experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2021-09-11_15-11-21
+ set +x
+ '[' '!' -f output/vgg16/voc_2007_trainval/default/vgg16_MELM_iter_100000.pth.index ']'
+ [[ ! -z '' ]]
+ CUDA_VISIBLE_DEVICES=0
+ python ./tools/trainval_net.py --weight data/imagenet_weights/vgg16.pth --imdb voc_2007_trainval --imdbval voc_2007_test --iters 100000 --cfg experiments/cfgs/vgg16.yml --net vgg16 --set ANCHOR_SCALES '[8,16,32]' ANCHOR_RATIOS '[0.5,1,2]' TRAIN.STEPSIZE '[50000]'
Called with args:
Namespace(cfg_file='experiments/cfgs/vgg16.yml', imdb_name='voc_2007_trainval', imdbval_name='voc_2007_test', max_iters=100000, net='vgg16', set_cfgs=['ANCHOR_SCALES', '[8,16,32]', 'ANCHOR_RATIOS', '[0.5,1,2]', 'TRAIN.STEPSIZE', '[50000]'], tag=None, weight='data/imagenet_weights/vgg16.pth')
/media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/tools/../lib/model/config.py:369: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  yaml_cfg = edict(yaml.load(f))
Loaded dataset voc_2007_trainval for training
Set proposal method: selective_search
Appending horizontally-flipped training examples...
voc_2007_trainval ss roidb loaded from /media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/data/cache/voc_2007_trainval_selective_search_roidb.pkl
done
Preparing training data...
done
10022 roidb entries
Output will be saved to /media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/output/vgg16_MELM/voc_2007_trainval/default
TensorFlow summaries will be saved to /media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/tensorboard/vgg16_MELM/voc_2007_trainval/default
Loaded dataset voc_2007_test for training
Set proposal method: selective_search
Preparing training data...
voc_2007_test ss roidb loaded from /media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/data/cache/voc_2007_test_selective_search_roidb.pkl
done
4952 validation roidb entries
Filtered 0 roidb entries: 10022 -> 10022
Filtered 0 roidb entries: 4952 -> 4952
Solving...
Loading initial model weights from data/imagenet_weights/vgg16.pth
Loaded.
Traceback (most recent call last):
  File "./tools/trainval_net.py", line 135, in
    max_iters=args.max_iters)
  File "/media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/tools/../lib/model/train_val.py", line 377, in train_net
    sw.train_model(max_iters)
  File "/media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/tools/../lib/model/train_val.py", line 291, in train_model
    cls_det_loss, refine_loss_1, refine_loss_2, consistency_loss, total_loss = self.net.train_step(blobs,self.optimizer,iter)
  File "/media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/tools/../lib/nets/network.py", line 634, in train_step
    self.forward(blobs['data'], blobs['image_level_labels'], blobs['im_info'], blobs['gt_boxes'], blobs['ss_boxes'], step)
  File "/media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/tools/../lib/nets/network.py", line 562, in forward
    roi_labels_1, keep_inds_1, roi_labels_2, keep_inds_2, bbox_pred, rois = self._predict_train(ss_boxes_all, step)
  File "/media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/tools/../lib/nets/network.py", line 508, in _predict_train
    roi_labels_2, keep_inds_2, bbox_pred = self._region_classification_train(pool5_roi, fc7_roi,fc7_context, fc7_frame, step)
  File "/media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/tools/../lib/nets/network.py", line 398, in _region_classification_train
    mask_1 = self._inverted_attention(bbox_feats_new, gt, keep_inds_1_new, 1, step, fg_num_1_new, bg_num_1_new)
  File "/media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/tools/../lib/nets/network.py", line 147, in _inverted_attention
    pooled_feat_before_after = torch.cat((bbox_feats_new, bbox_feats_new * mask_all), dim=0)
RuntimeError: CUDA out of memory. Tried to allocate 766.00 MiB (GPU 0; 11.91 GiB total capacity; 9.59 GiB already allocated; 99.19 MiB free; 1.49 GiB cached)
```
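To narrow it down: the failing line concatenates the features with a masked copy, so it needs one fresh contiguous buffer holding both copies at once. A quick sanity check (again assuming the standard 512×7×7 float32 pool5 shape per ROI, which is my assumption) shows the reported 766 MiB is consistent with roughly 4000 ROIs going through that `torch.cat`:

```python
# Sanity check on the allocation requested at network.py:147, where
#   torch.cat((bbox_feats_new, bbox_feats_new * mask_all), dim=0)
# must allocate a new buffer holding BOTH copies of the features.
bytes_per_roi = 512 * 7 * 7 * 4  # assumed pool5 shape, float32
num_rois = 4000                  # hypothetical proposal count

requested_mib = 2 * num_rois * bytes_per_roi / 2**20
print(f"{requested_mib:.1f} MiB")  # 765.6 MiB, close to the 766.00 MiB in the error
```

If this is right, the allocation depends on the number of proposals reaching `_inverted_attention`, not on `TRAIN.BATCH_SIZE`, which would explain why lowering the batch size did not help.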

I would appreciate it if you could help me.
