Skip to content

ValueError: NaN encountered in pred_rots_vf when trying to train on PINDER #26

@ntoxeg

Description

@ntoxeg

Like the title suggests, I’ve managed to get a run going but it crashes with the following traceback

Traceback (most recent call last):
  File "/home/greg/protein-frame-flow/experiments/train_se3_flows.py", line 112, in main
    exp.train()
  File "/home/greg/protein-frame-flow/experiments/train_se3_flows.py", line 87, in train
    trainer.fit(
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrup
t
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in
 launch
    return function(*args, **kwargs)
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
    results = self._run_stage()
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1023, in _run_stage
    self.fit_loop.run()
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 355, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 133, in run
    self.advance(data_fetcher)
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 219, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], kwargs)
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 181, in run
    closure()
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 142, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 128, in closure
    step_output = self._step_fn()
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 315, in _trainin
g_step
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 293, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 330, in training_step
    return self.model(*args, **kwargs)
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1636, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1454, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/greg/miniconda3/envs/fm2/lib/python3.10/site-packages/pytorch_lightning/overrides/base.py", line 90, in forward
    output = self._forward_module.training_step(*inputs, **kwargs)
  File "/home/greg/protein-frame-flow/models/flow_module.py", line 295, in training_step
    batch_losses = self.model_step(noisy_batch)
  File "/home/greg/protein-frame-flow/models/flow_module.py", line 125, in model_step
    raise ValueError('NaN encountered in pred_rots_vf')
ValueError: NaN encountered in pred_rots_vf

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions