UMA Crash with -d cuda #28

@kevinkong98

Describe the bug
The UMA server/client crashes when run with CUDA (-d cuda) but works in CPU mode.

To Reproduce
Steps to reproduce the behavior:
Launch with Ext_Params "-d cuda"

Calling /home/kwy/orca-external-tools/bin/oet_client TS2_alt_S_search_EXT.extinp.tmp -d cuda ...

Model parameters found in cache. Switching to offline mode.
Server error CalculatorRuntimeException: Model parameters found in cache. Switching to offline mode.
.
Exact traceback: concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/kwy/orca-external-tools/src/oet/core/base_calc.py", line 269, in run
energy, gradient = self.calc(
^^^^^^^^^^
File "/home/kwy/orca-external-tools/src/oet/calculator/uma.py", line 322, in calc
self.set_calculator(param=param, basemodel=basemodel, device=device, cache_dir=cache_dir)
File "/home/kwy/orca-external-tools/src/oet/calculator/uma.py", line 94, in set_calculator
predictor = pretrained_mlip.get_predict_unit(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kwy/orca-external-tools/.venv/lib/python3.11/site-packages/fairchem/core/calculate/pretrained_mlip.py", line 107, in get_predict_unit
return load_predict_unit(
^^^^^^^^^^^^^^^^^^
File "/home/kwy/orca-external-tools/.venv/lib/python3.11/site-packages/fairchem/core/units/mlip_unit/__init__.py", line 71, in load_predict_unit
return MLIPPredictUnit(
^^^^^^^^^^^^^^^^
File "/home/kwy/orca-external-tools/.venv/lib/python3.11/site-packages/fairchem/core/units/mlip_unit/predict.py", line 173, in __init__
self.device = get_device_for_local_rank() if device == "cuda" else "cpu"
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kwy/orca-external-tools/.venv/lib/python3.11/site-packages/fairchem/core/common/distutils.py", line 261, in get_device_for_local_rank
return f"cuda:{torch.cuda.current_device()}"
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kwy/orca-external-tools/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 1071, in current_device
_lazy_init()
File "/home/kwy/orca-external-tools/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 398, in _lazy_init
raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/kwy/orca-external-tools/src/oet/server_client/server.py", line 187, in _run_calc_in_process
calc.run(**run_kwargs)
File "/home/kwy/orca-external-tools/src/oet/core/base_calc.py", line 277, in run
raise RuntimeError("Failed to compute energy and/or gradient") from e
RuntimeError: Failed to compute energy and/or gradient

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/kwy/anaconda3/envs/oet200/lib/python3.11/concurrent/futures/process.py", line 261, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kwy/orca-external-tools/src/oet/server_client/server.py", line 190, in _run_calc_in_process
raise CalculatorRuntimeException(buf.getvalue()) from e
oet.server_client.server.CalculatorRuntimeException: Model parameters found in cache. Switching to offline mode.

"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/kwy/orca-external-tools/src/oet/server_client/server.py", line 461, in calculate
result = server.handle_client({"arguments": arguments, "directory": directory})
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kwy/orca-external-tools/src/oet/server_client/server.py", line 349, in handle_client
output = fut.result()
^^^^^^^^^^^^
File "/home/kwy/anaconda3/envs/oet200/lib/python3.11/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/home/kwy/anaconda3/envs/oet200/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
oet.server_client.server.CalculatorRuntimeException: Model parameters found in cache. Switching to offline mode.

Error (ORCA/EXT): external program failed with exit code 256
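
Possible direction, based only on the innermost RuntimeError above: the worker runs inside a concurrent.futures process pool, and on Linux with Python 3.11 such pools fork their workers by default, which PyTorch rejects once CUDA has been initialized in the parent. Whether the server actually initializes CUDA before forking is an assumption here. The sketch below only shows how a pool can be created with the 'spawn' start method that the error message asks for. The names spawn_context, make_executor and run_calc are hypothetical placeholders, not functions from orca-external-tools, and this is not the project's actual fix.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor


def run_calc(x: float) -> float:
    # Hypothetical stand-in for the worker that would build the
    # calculator and touch CUDA inside the child process.
    return x * 2.0


def spawn_context():
    # "spawn" starts a fresh interpreter for each worker instead of
    # forking the parent, so no CUDA state is inherited by the child.
    return mp.get_context("spawn")


def make_executor(max_workers: int = 1) -> ProcessPoolExecutor:
    # mp_context is a documented ProcessPoolExecutor parameter.
    return ProcessPoolExecutor(max_workers=max_workers, mp_context=spawn_context())


if __name__ == "__main__":
    # The __main__ guard is required with "spawn", because each child
    # re-imports this module on startup.
    with make_executor() as pool:
        print(pool.submit(run_calc, 21.0).result())
```

With 'spawn', everything submitted to the pool must be picklable and importable from the child, so the worker has to be a module-level function as in the sketch.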

Input/Output Files
H2.out.txt
H2.inp.txt

Expected behavior
The calculation should run with -d cuda, as it does in CPU mode.

Labels: bug
