Description
Thanks for your great work!
However, when I run the command ./tools/dist_train.sh 4 test1 --autoscale-lr, I get the error below. The datasets have been downloaded and preprocessed completely. How can I solve this problem?
(topologic) mario@mario-Legion-R9000P-ARX8:~/TopoLogic$ ./tools/dist_train.sh 4 test1 --autoscale-lr
++ date +%y%m%d.%H%M%S
+ timestamp=250620.161354
+ WORK_DIR=work_dirs/test1
+ CONFIG=projects/configs/topologic_r50_8x1_24e_olv2_subset_B.py
+ GPUS=4
+ PORT=28510
+ python -m torch.distributed.run --nproc_per_node=4 --master_port=28510 tools/train.py projects/configs/topologic_r50_8x1_24e_olv2_subset_B.py --launcher pytorch --work-dir work_dirs/test1 --deterministic --autoscale-lr
+ tee work_dirs/test1/train.250620.161354.log
WARNING:__main__:*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 50837) of binary: /home/mario/miniconda3/envs/topologic/bin/python
/home/mario/miniconda3/envs/topologic/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning:
CHILD PROCESS FAILED WITH NO ERROR_FILE
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 50837 (local_rank 0) FAILED (exitcode -11)
Error msg: Signal 11 (SIGSEGV) received by PID 50837
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:
from torch.distributed.elastic.multiprocessing.errors import record
@record
def trainer_main(args):
    # do train
warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/home/mario/miniconda3/envs/topologic/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/mario/miniconda3/envs/topologic/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/mario/miniconda3/envs/topologic/lib/python3.8/site-packages/torch/distributed/run.py", line 702, in <module>
main()
File "/home/mario/miniconda3/envs/topologic/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
return f(*args, **kwargs)
File "/home/mario/miniconda3/envs/topologic/lib/python3.8/site-packages/torch/distributed/run.py", line 698, in main
run(args)
File "/home/mario/miniconda3/envs/topologic/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/home/mario/miniconda3/envs/topologic/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/mario/miniconda3/envs/topologic/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
tools/train.py FAILED
==================================================
Root Cause:
[0]:
time: 2025-06-20_16:14:00
rank: 0 (local_rank: 0)
exitcode: -11 (pid: 50837)
error_file: <N/A>
msg: "Signal 11 (SIGSEGV) received by PID 50837"
Other Failures:
[1]:
time: 2025-06-20_16:14:00
rank: 1 (local_rank: 1)
exitcode: -11 (pid: 50838)
error_file: <N/A>
msg: "Signal 11 (SIGSEGV) received by PID 50838"
[2]:
time: 2025-06-20_16:14:00
rank: 2 (local_rank: 2)
exitcode: -11 (pid: 50839)
error_file: <N/A>
msg: "Signal 11 (SIGSEGV) received by PID 50839"
[3]:
time: 2025-06-20_16:14:00
rank: 3 (local_rank: 3)
exitcode: -11 (pid: 50841)
error_file: <N/A>
msg: "Signal 11 (SIGSEGV) received by PID 50841"
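
The log above reports SIGSEGV on all four ranks but, as the warning notes, no trace information. One possible way to find out where each worker crashes is Python's standard `faulthandler` module, which dumps the Python stack when a fatal signal arrives. The following is only a minimal sketch: the helper name `enable_crash_traces` and the `segfault_rank*.log` file names are hypothetical, and it assumes the helper is called near the top of `tools/train.py` and that the launcher sets `LOCAL_RANK` for each worker.

```python
import faulthandler
import os


def enable_crash_traces(log_dir="work_dirs/test1"):
    """Dump the Python stack to a per-rank file on fatal signals such as SIGSEGV.

    Hypothetical helper for illustration, not part of TopoLogic or PyTorch.
    """
    # torch.distributed.run exports LOCAL_RANK to each worker; default to "0"
    # so the helper also works in a single-process run.
    rank = os.environ.get("LOCAL_RANK", "0")
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"segfault_rank{rank}.log")

    # The file object must stay open for the lifetime of the process,
    # because faulthandler writes to its file descriptor at crash time.
    fh = open(path, "w")
    faulthandler.enable(file=fh, all_threads=True)
    return path, fh


if __name__ == "__main__":
    trace_path, _trace_file = enable_crash_traces()
    print(f"fatal-signal tracebacks will be written to {trace_path}")
```

Without touching the code, setting the environment variable `PYTHONFAULTHANDLER=1` before running `./tools/dist_train.sh` should have a similar effect, with the traceback going to stderr instead of a file. The resulting stack usually shows which import or compiled extension call the workers were executing when they crashed.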