Description
Describe the bug
The UMA server/client does not work with CUDA, but it works in CPU mode.
To Reproduce
Steps to reproduce the behavior:
Launch with Ext_Params "-d cuda"
Calling /home/kwy/orca-external-tools/bin/oet_client TS2_alt_S_search_EXT.extinp.tmp -d cuda ...
Model parameters found in cache. Switching to offline mode.
Server error CalculatorRuntimeException: Model parameters found in cache. Switching to offline mode.
.
Exact traceback: concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/kwy/orca-external-tools/src/oet/core/base_calc.py", line 269, in run
energy, gradient = self.calc(
^^^^^^^^^^
File "/home/kwy/orca-external-tools/src/oet/calculator/uma.py", line 322, in calc
self.set_calculator(param=param, basemodel=basemodel, device=device, cache_dir=cache_dir)
File "/home/kwy/orca-external-tools/src/oet/calculator/uma.py", line 94, in set_calculator
predictor = pretrained_mlip.get_predict_unit(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kwy/orca-external-tools/.venv/lib/python3.11/site-packages/fairchem/core/calculate/pretrained_mlip.py", line 107, in get_predict_unit
return load_predict_unit(
^^^^^^^^^^^^^^^^^^
File "/home/kwy/orca-external-tools/.venv/lib/python3.11/site-packages/fairchem/core/units/mlip_unit/__init__.py", line 71, in load_predict_unit
return MLIPPredictUnit(
^^^^^^^^^^^^^^^^
File "/home/kwy/orca-external-tools/.venv/lib/python3.11/site-packages/fairchem/core/units/mlip_unit/predict.py", line 173, in __init__
self.device = get_device_for_local_rank() if device == "cuda" else "cpu"
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kwy/orca-external-tools/.venv/lib/python3.11/site-packages/fairchem/core/common/distutils.py", line 261, in get_device_for_local_rank
return f"cuda:{torch.cuda.current_device()}"
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kwy/orca-external-tools/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 1071, in current_device
_lazy_init()
File "/home/kwy/orca-external-tools/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 398, in _lazy_init
raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/kwy/orca-external-tools/src/oet/server_client/server.py", line 187, in _run_calc_in_process
calc.run(**run_kwargs)
File "/home/kwy/orca-external-tools/src/oet/core/base_calc.py", line 277, in run
raise RuntimeError("Failed to compute energy and/or gradient") from e
RuntimeError: Failed to compute energy and/or gradient
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/kwy/anaconda3/envs/oet200/lib/python3.11/concurrent/futures/process.py", line 261, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kwy/orca-external-tools/src/oet/server_client/server.py", line 190, in _run_calc_in_process
raise CalculatorRuntimeException(buf.getvalue()) from e
oet.server_client.server.CalculatorRuntimeException: Model parameters found in cache. Switching to offline mode.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/kwy/orca-external-tools/src/oet/server_client/server.py", line 461, in calculate
result = server.handle_client({"arguments": arguments, "directory": directory})
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kwy/orca-external-tools/src/oet/server_client/server.py", line 349, in handle_client
output = fut.result()
^^^^^^^^^^^^
File "/home/kwy/anaconda3/envs/oet200/lib/python3.11/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/home/kwy/anaconda3/envs/oet200/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
oet.server_client.server.CalculatorRuntimeException: Model parameters found in cache. Switching to offline mode.
Error (ORCA/EXT): external program failed with exit code 256
Input/Output Files
Please provide all relevant input and output files.
Expected behavior
Calculations launched with "-d cuda" should run through the server/client just as they do in CPU mode.
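The final RuntimeError in the traceback says that CUDA cannot be re-initialized in a forked subprocess and recommends the 'spawn' start method. Below is a minimal sketch of that idea, assuming the server's workers come from a concurrent.futures.ProcessPoolExecutor (the traceback passes through concurrent/futures/process.py, but how server.py actually builds its pool is not shown here). The helper make_spawn_pool is hypothetical and not part of orca-external-tools.

```python
import multiprocessing as mp
import os
from concurrent.futures import ProcessPoolExecutor


def make_spawn_pool(max_workers: int = 1) -> ProcessPoolExecutor:
    """Hypothetical helper: build a process pool whose workers are spawned.

    A spawned worker starts from a fresh interpreter, so it does not
    inherit the parent's CUDA state the way a forked worker does.
    """
    ctx = mp.get_context("spawn")
    return ProcessPoolExecutor(max_workers=max_workers, mp_context=ctx)


if __name__ == "__main__":
    # The guard matters: spawned workers re-import the main module.
    with make_spawn_pool() as pool:
        # os.getpid stands in for the real calculation; it only shows
        # that the task runs in a separate worker process.
        worker_pid = pool.submit(os.getpid).result()
    print(f"parent pid={os.getpid()}, worker pid={worker_pid}")
```

Spawned workers re-import the main module and receive their tasks by pickling, so the entry point needs the `__main__` guard and anything submitted to the pool must be picklable. Whether this alone would fix the server depends on details not visible in the traceback, such as whether CUDA is initialized in the parent process before the workers start.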
Version:
- OS: [e.g. Ubuntu 20.04]
- ORCA [e.g. 6.1.1]
Additional context
Add any other context about the problem here.