I am attempting to test the relationship between the training dataset size and training time in the SAM repository. I adjusted the train_queries variable in sam_multi/experiments.py to 1000 and ran the following command:
python run_uae.py --run job-light-ranges-mscn-workload
However, I encountered the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 471, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/ray/worker.py", line 1538, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::NeuroCard.train() (pid=81614, ip=172.17.0.5)
  File "python/ray/_raylet.pyx", line 479, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/ray/tune/trainable.py", line 332, in train
    result = self.step()
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/ray/tune/trainable.py", line 636, in step
    result = self._train()
  File "run_uae.py", line 1264, in _train
    q_weight=self.q_weight if self.semi_train else 0
  File "run_uae.py", line 542, in run_epoch_query_only
    all_loss.backward(retain_graph=True)
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/root/anaconda3/envs/sam/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: Function 'MmBackward' returned nan values in its 0th output.
In the job-light-ranges-mscn-workload configuration within sam_multi/experiments.py, are there any additional parameters or settings that need to be adjusted to properly test the relationship between training dataset size and training time?
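In case it helps narrow this down, the kind of check I could log around the backward call is sketched below. It is a minimal, framework-free illustration only: first_non_finite and the losses list are hypothetical names and made-up values, not anything from run_uae.py. In the real training loop the values would presumably come from the scalar loss at each step.

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step
    return None


# Hypothetical per-step loss values, invented purely for illustration.
losses = [2.31, 1.87, float("inf"), float("nan")]
bad_step = first_non_finite(losses)
print("first non-finite loss at step:", bad_step)
```

Knowing the first step at which the loss stops being finite would at least show whether the problem appears immediately with 1000 training queries or only after some number of updates.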
I appreciate your time and assistance. Looking forward to your guidance on resolving this issue. Thank you!