Multiple GPUs is broken #13

@alex2awesome

Hi Yala!

Great package. Just letting you know, though, that computation on multiple GPUs is broken for two reasons:

  1. The model.py file is missing this import:
    import torch.nn as nn
    That's an easy fix.

  2. Some class-attribute dependencies are single-thread bound, which breaks under DataParallel.
    See the related DataParallel issue: pytorch/pytorch#8637

I'm not sure exactly what they are, but here is my error message, which matches the one in the issue I linked to above:

Traceback (most recent call last):
  File "scripts/main.py", line 35, in <module>
    epoch_stats, model, gen = train.train_model(train_data, dev_data, model, gen, args)
  File "/auto/rcf-proj/ef/spangher/newspaper-pages/text_nn/rationale_net/learn/train.py", line 59, in train_model
    args=args)
  File "/auto/rcf-proj/ef/spangher/newspaper-pages/text_nn/rationale_net/learn/train.py", line 198, in run_epoch
    mask, z = gen(x_indx)
  File "/home/rcf-40/spangher/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rcf-40/spangher/.local/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/rcf-40/spangher/.local/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/rcf-40/spangher/.local/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/rcf-40/spangher/.local/lib/python3.7/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/rcf-40/spangher/.local/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/rcf-40/spangher/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/auto/rcf-proj/ef/spangher/newspaper-pages/text_nn/rationale_net/models/generator.py", line 55, in forward
    activ = self.cnn(x)
  File "/home/rcf-40/spangher/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/auto/rcf-proj/ef/spangher/newspaper-pages/text_nn/rationale_net/models/cnn.py", line 55, in forward
    activ = self._conv(x)
  File "/auto/rcf-proj/ef/spangher/newspaper-pages/text_nn/rationale_net/models/cnn.py", line 41, in _conv
    next_activ.append( conv(padded_activ) )
  File "/home/rcf-40/spangher/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rcf-40/spangher/.local/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 200, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)
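
For what it's worth, one pattern that is known to produce this kind of device mismatch under nn.DataParallel is a forward pass that reaches its convolutions through a plain Python list instead of an nn.ModuleList. Layers held only in a plain list are not part of the module tree that DataParallel copies, so every replica ends up calling the original layers on device 0. I have not confirmed that this is what cnn.py does, so treat the following as a minimal sketch under that assumption. The class names ListCNN and ModuleListCNN and all sizes are hypothetical, made up for illustration:

```python
import torch
import torch.nn as nn


class ListCNN(nn.Module):
    """Hypothetical example: convolutions kept in a plain Python list."""

    def __init__(self, in_channels=8, out_channels=4, kernel_sizes=(3, 4, 5)):
        super().__init__()
        # A plain list is invisible to nn.Module: these layers are not registered.
        self.convs = [nn.Conv1d(in_channels, out_channels, k) for k in kernel_sizes]

    def forward(self, x):
        # Max-pool each convolution over time, then concatenate the features.
        return torch.cat([conv(x).max(dim=2)[0] for conv in self.convs], dim=1)


class ModuleListCNN(nn.Module):
    """Same layers, but registered through nn.ModuleList."""

    def __init__(self, in_channels=8, out_channels=4, kernel_sizes=(3, 4, 5)):
        super().__init__()
        # nn.ModuleList registers every layer as a proper submodule.
        self.convs = nn.ModuleList(
            [nn.Conv1d(in_channels, out_channels, k) for k in kernel_sizes]
        )

    def forward(self, x):
        return torch.cat([conv(x).max(dim=2)[0] for conv in self.convs], dim=1)


plain = ListCNN()
registered = ModuleListCNN()

# Count the parameter tensors each module exposes (weight + bias per conv).
n_plain = len(list(plain.parameters()))
n_registered = len(list(registered.parameters()))
print(n_plain, n_registered)

# The registered version behaves the same on a single device.
x = torch.randn(2, 8, 10)  # (batch, channels, length)
out = registered(x)
print(tuple(out.shape))
```

If the assumption holds, iterating over a registered container in forward would let each replica use its own copy of the layers, and .to(device) or .cuda() would move them as well.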

Alex
