CUDA_ERROR_OUT_OF_MEMORY: out of memory / on MSN-Hard training #3

@alexcbb

Hello, thank you for reproducing the work of the paper.

I tried to train the model on the MSN-Hard dataset, but training fails right at the start with a CUDA_ERROR_OUT_OF_MEMORY error.

The training is launched on a cluster, on a node with 8 V100 GPUs (32 GB each).
I reduced the batch size to 32 and the number of sampled points to 2300, but this is still not enough to avoid the error.

Here is the command I use to launch the code (inside a .slurm file containing the necessary parameters):
torchrun --standalone --nnodes 1 --nproc_per_node 8 train.py runs/msn/osrt/config.yaml

Note that I was able to train the model on the CLEVR3D dataset successfully, following the same procedure.

To load the dataset, I downloaded the data from https://console.cloud.google.com/storage/browser/kubric-public/tfds/kubric_frames/multi_shapenet_conditional and put the 2.8.0/ folder inside the data/osrt/multi_shapenet_frames/ folder of the project.
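
One check that may be worth running, offered as an assumption rather than a confirmed diagnosis: the data above is stored in TFDS format, so the loader presumably imports TensorFlow, and TensorFlow by default reserves most of the memory on every GPU it can see. If that happens in each of the 8 processes, little memory would be left for PyTorch. The sketch below hides the GPUs from TensorFlow; the helper name hide_gpus_from_tensorflow is hypothetical and not part of the project.

```python
# Sketch under a stated assumption: TensorFlow is only used to read the
# TFDS-format data, so it does not need any GPU.
def hide_gpus_from_tensorflow() -> bool:
    """Return True if TensorFlow was found and told to ignore all GPUs."""
    try:
        import tensorflow as tf
    except ImportError:
        return False  # TensorFlow is not installed; nothing to do
    try:
        # This must run before TensorFlow initializes its devices.
        tf.config.set_visible_devices([], "GPU")
    except RuntimeError:
        return False  # devices were already initialized; too late to change
    return True


if __name__ == "__main__":
    print("TensorFlow GPUs hidden:", hide_gpus_from_tensorflow())
```

To have any effect, the call would need to happen at the very top of the training script, before any dataset code touches TensorFlow.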

Have you ever encountered this issue?

Thank you
