ERROR when training on subset_B #15

@LostDreammm

Thanks for your great work!
However, when I run the command ./tools/dist_train.sh 4 test1 --autoscale-lr, I get the error below. The datasets have been fully downloaded and preprocessed. How can I solve this problem?

(topologic) mario@mario-Legion-R9000P-ARX8:~/TopoLogic$ ./tools/dist_train.sh 4 test1 --autoscale-lr
++ date +%y%m%d.%H%M%S
+ timestamp=250620.161354
+ WORK_DIR=work_dirs/test1
+ CONFIG=projects/configs/topologic_r50_8x1_24e_olv2_subset_B.py
+ GPUS=4
+ PORT=28510
+ python -m torch.distributed.run --nproc_per_node=4 --master_port=28510 tools/train.py projects/configs/topologic_r50_8x1_24e_olv2_subset_B.py --launcher pytorch --work-dir work_dirs/test1 --deterministic --autoscale-lr
+ tee work_dirs/test1/train.250620.161354.log
WARNING:__main__:*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 50837) of binary: /home/mario/miniconda3/envs/topologic/bin/python
/home/mario/miniconda3/envs/topologic/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning:


           CHILD PROCESS FAILED WITH NO ERROR_FILE                

CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 50837 (local_rank 0) FAILED (exitcode -11)
Error msg: Signal 11 (SIGSEGV) received by PID 50837
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def trainer_main(args):
    # do train


warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/home/mario/miniconda3/envs/topologic/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/mario/miniconda3/envs/topologic/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/mario/miniconda3/envs/topologic/lib/python3.8/site-packages/torch/distributed/run.py", line 702, in <module>
    main()
  File "/home/mario/miniconda3/envs/topologic/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
    return f(*args, **kwargs)
  File "/home/mario/miniconda3/envs/topologic/lib/python3.8/site-packages/torch/distributed/run.py", line 698, in main
    run(args)
  File "/home/mario/miniconda3/envs/topologic/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/mario/miniconda3/envs/topologic/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/mario/miniconda3/envs/topologic/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


          tools/train.py FAILED               

==================================================
Root Cause:
[0]:
time: 2025-06-20_16:14:00
rank: 0 (local_rank: 0)
exitcode: -11 (pid: 50837)
error_file: <N/A>
msg: "Signal 11 (SIGSEGV) received by PID 50837"

Other Failures:
[1]:
time: 2025-06-20_16:14:00
rank: 1 (local_rank: 1)
exitcode: -11 (pid: 50838)
error_file: <N/A>
msg: "Signal 11 (SIGSEGV) received by PID 50838"
[2]:
time: 2025-06-20_16:14:00
rank: 2 (local_rank: 2)
exitcode: -11 (pid: 50839)
error_file: <N/A>
msg: "Signal 11 (SIGSEGV) received by PID 50839"
[3]:
time: 2025-06-20_16:14:00
rank: 3 (local_rank: 3)
exitcode: -11 (pid: 50841)
error_file: <N/A>
msg: "Signal 11 (SIGSEGV) received by PID 50841"

