Hello, thank you for reproducing the work of the paper.
I tried to train the model on the MSN-Hard dataset, but training fails right at the start with a CUDA_ERROR_OUT_OF_MEMORY error.
The training is launched on a cluster, on a node with 8 V100 GPUs (32 GB each).
I also reduced the batch size to 32 and the number of sampled points to 2300, but this does not seem to be enough to avoid the error.
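In case it helps with diagnosis, here is a minimal sketch of how per-GPU memory usage on the node could be checked just before (or while) the job starts, to see whether memory is already taken before the first training step. It is not part of the repository: the helper names `parse_gpu_memory` and `query_gpu_memory` are hypothetical, and it assumes `nvidia-smi` is available on the compute node.

```python
import subprocess

# Standard nvidia-smi query: one CSV line per GPU, values in MiB, no header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' lines (MiB) into a list of dicts.

    Assumes every field is an integer, which holds for data-center GPUs
    such as the V100; some devices may report '[N/A]' instead.
    """
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append(
            {
                "index": index,
                "used_mib": used,
                "total_mib": total,
                "free_mib": total - used,
            }
        )
    return rows


def query_gpu_memory():
    """Run nvidia-smi on the current node and return the parsed rows."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling `query_gpu_memory()` from the .slurm script before `torchrun`, and again from a second shell once the job is running, would show whether the GPUs are close to full before the model processes its first batch.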
Here is the command I use to launch the code (inside a .slurm file containing the necessary parameters):
torchrun --standalone --nnodes 1 --nproc_per_node 8 train.py runs/msn/osrt/config.yaml
I should point out that I was able to train the model successfully on the CLEVR3D dataset following the same procedure.
To load the dataset, I downloaded the following: https://console.cloud.google.com/storage/browser/kubric-public/tfds/kubric_frames/multi_shapenet_conditional, and put the 2.8.0/ folder inside the data/osrt/multi_shapenet_frames/ folder of the project.
Have you ever encountered this issue?
Thank you