Open
Description
Hi Yala!
Great package. Just letting you know, though, that computation on multiple GPUs is broken for two reasons:
- The `model.py` file is missing `import torch.nn as nn`. That's an easy fix.
- You have some class-attribute dependencies that are single-thread bound. See "Issue for DataParallel", pytorch/pytorch#8637.
I'm not sure exactly what they are, but here is my error message, which matches the one in the issue I linked to above:
```
Traceback (most recent call last):
  File "scripts/main.py", line 35, in <module>
    epoch_stats, model, gen = train.train_model(train_data, dev_data, model, gen, args)
  File "/auto/rcf-proj/ef/spangher/newspaper-pages/text_nn/rationale_net/learn/train.py", line 59, in train_model
    args=args)
  File "/auto/rcf-proj/ef/spangher/newspaper-pages/text_nn/rationale_net/learn/train.py", line 198, in run_epoch
    mask, z = gen(x_indx)
  File "/home/rcf-40/spangher/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rcf-40/spangher/.local/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/rcf-40/spangher/.local/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/rcf-40/spangher/.local/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/rcf-40/spangher/.local/lib/python3.7/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/rcf-40/spangher/.local/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/rcf-40/spangher/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/auto/rcf-proj/ef/spangher/newspaper-pages/text_nn/rationale_net/models/generator.py", line 55, in forward
    activ = self.cnn(x)
  File "/home/rcf-40/spangher/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/auto/rcf-proj/ef/spangher/newspaper-pages/text_nn/rationale_net/models/cnn.py", line 55, in forward
    activ = self._conv(x)
  File "/auto/rcf-proj/ef/spangher/newspaper-pages/text_nn/rationale_net/models/cnn.py", line 41, in _conv
    next_activ.append( conv(padded_activ) )
  File "/home/rcf-40/spangher/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rcf-40/spangher/.local/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 200, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)
```
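
Judging only from the traceback, the `conv` called inside `_conv` still has its weight on device 0 while the replica's input is on device 1. One common cause of this pattern is keeping submodules in a plain Python list attribute, which every `DataParallel` replica then shares. I have not confirmed that this is what `cnn.py` does, so treat the following as a minimal sketch under that assumption. The class name, constructor arguments, and padding scheme are hypothetical and not taken from the repo.

```python
import torch
import torch.nn as nn


class MultiKernelCNN(nn.Module):
    """Hypothetical stand-in for the CNN encoder; all names are illustrative."""

    def __init__(self, embed_dim=8, filter_num=4, kernel_sizes=(3, 4, 5)):
        super().__init__()
        # nn.ModuleList registers each conv as a submodule, so each
        # DataParallel replica gets its own copy on its own device.
        # A plain Python list attribute is shared by all replicas and
        # keeps pointing at the original modules on device 0.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, filter_num, k) for k in kernel_sizes]
        )

    def forward(self, x):
        # x has shape (batch, embed_dim, seq_len).
        outputs = []
        for conv in self.convs:
            # Left-pad so every kernel size yields the same output length.
            padded = nn.functional.pad(x, (conv.kernel_size[0] - 1, 0))
            outputs.append(torch.relu(conv(padded)))
        # Concatenate the feature maps along the channel dimension.
        return torch.cat(outputs, dim=1)


model = MultiKernelCNN()
# Only wrap in DataParallel when more than one GPU is actually present.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model.cuda())

device = next(model.parameters()).device
out = model(torch.randn(2, 8, 10, device=device))
print(out.shape)
```

On a single device this runs unchanged. With several GPUs, iterating over `self.convs` inside `forward` should pick up each replica's own copies rather than the originals.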
Alex