Task hangs when following the ChatGLM3-6B_ds.ipynb example #135

@Silver-Glacier

Description

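I followed the provided ChatGLM3-6B_ds.ipynb example to train the model. According to the log, the task hangs indefinitely once it reaches "start submit deepspeed task deepspeed_202505131608478195830_nn_0_0_host_10000", and nvidia-smi shows no processes at all.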

I deployed with AnsibleFATE_2.1.1_LLM_2.1.0_release_offline on a single GPU machine, with guest, host, and arbiter all set to 10000.

My full log is:
[INFO][2025-05-13 16:08:50,475][11373][_wraps.preprocess][line:113]: start generating input artifacts
[INFO][2025-05-13 16:08:50,475][11373][_wraps.preprocess][line:114]: data=None model=None
[INFO][2025-05-13 16:08:50,475][11373][_wraps.preprocess][line:116]: input artifacts are ready
[INFO][2025-05-13 16:08:50,475][11373][_wraps.preprocess][line:118]: PYTHON PATH: /data/projects/fate/fate_flow/python:/data/projects/fate/fate/python:/data/projects/fate/fate_flow/python:/data/projects/fate/eggroll/python
[INFO][2025-05-13 16:08:50,475][11373][_wraps.preprocess][line:121]: start generating output artifacts
[INFO][2025-05-13 16:08:50,970][11373][_wraps.preprocess][line:123]: output_artifacts: {'output_data': ArtifactOutputApplySpec(uri='file:///data/projects/fate/fate_flow/jobs/202505131608478195830/host/10000/reader_0/0/output/output_data/data_unresolved', type_name='data_unresolved'), 'metric': ArtifactOutputApplySpec(uri='http://xxx.xxx.xxx.xxx:9380/v2/worker/metric/save/202505131608478195830_reader_0_0_host_10000', type_name='json_metric')}
[INFO][2025-05-13 16:08:50,970][11373][_wraps.run_component][line:151]: start run task
[INFO][2025-05-13 16:08:52,802][11581][_session.init][line:119]: session init finished: 202505131608478195830_reader_0_0_host_10000, details: <ErSessionMeta(id=202505131608478195830_reader_0_0_host_10000, name=, status=ACTIVE, tag=, processors=[, len=1], options=[{'eggroll.session.processors.per.node': '1', 'nodes': '1', 'cores': '1'}]) at 0x7fc18c9aa370>
[INFO][2025-05-13 16:08:52,805][11581][_rollsite_context.init][line:82]: inited RollSiteContext: {'roll_site_session_id': '202505131608478195830_reader_0_0', 'rp_ctx': <eggroll.computing.roll_pair._roll_pair_context.RollPairContext object at 0x7fc18dde0b80>, '_config': <eggroll.config.config.Config object at 0x7fc18dde04f0>, 'role': 'host', 'party_id': '10000', '_options': {}, '_registered_comm_types': {}, 'proxy_endpoint': <ErEndpoint(host=xxx.xxx.xxx.xxx, port=9370) at 0x7fc18c9aa6d0>, 'pushing_latch': <eggroll.federation._rollsite_context.CountDownLatch object at 0x7fc18c9aa610>, 'push_session_enabled': False, '_wait_push_exit_timeout': 600}
[INFO][2025-05-13 16:08:52,841][11581][_profile.profile_ends][line:279]: Total: 0.0219s, Driver: 0.0219s(100.00%), Federation: 0.0000s(0.00%), Computing: 0.0000s(0.00%)
[INFO][2025-05-13 16:08:52,841][11581][_profile.profile_ends][line:290]:
Computing:
+----------+------------------------------------------+
| function |                                          |
+----------+------------------------------------------+
| total    | n=0, sum=0.0000, mean=0.0000, max=0.0000 |
+----------+------------------------------------------+
+--------+------------------------------------------+
[INFO][2025-05-13 16:08:59,559][12020][_wraps._intput_data_artifacts][line:453]: get key[train_data] channel[producer_task='reader_0' output_artifact_key='output_data' output_artifact_type_alias=None parties=[PartySpec(role='host', party_id=['10000'])]]
[INFO][2025-05-13 16:08:59,559][12020][_wraps._intput_data_artifacts][line:479]: query data: [{'job_id': '202505131608478195830', 'role': 'host', 'party_id': '10000', 'task_name': 'reader_0', 'output_key': 'output_data'}]
[INFO][2025-05-13 16:08:59,595][12020][_wraps._intput_data_artifacts][line:491]: intput data artifacts are ready
[INFO][2025-05-13 16:08:59,595][12020][_wraps.preprocess][line:116]: input artifacts are ready
[INFO][2025-05-13 16:08:59,595][12020][_wraps.preprocess][line:118]: PYTHON PATH: /data/projects/fate/fate_flow/python:/data/projects/fate/fate/python:/data/projects/fate/fate_flow/python:/data/projects/fate/eggroll/python
[INFO][2025-05-13 16:08:59,596][12020][_wraps.preprocess][line:121]: start generating output artifacts
[INFO][2025-05-13 16:09:01,255][12020][_eggroll_deepspeed.start_submit][line:116]: command_arguments: ['component', 'execute', '--env-name', 'FATE_TASK_CONFIG', '--execution-final-meta-path', 'EGGROLL_DEEPSPEED_RESULT_DIR/task_result.yaml']
[INFO][2025-05-13 16:09:01,255][12020][_eggroll_deepspeed.start_submit][line:117]: environment_variables: {'FATE_TASK_CONFIG': '{"job_id": "202505131608478195830", "task_id": "202505131608478195830_nn_0", "party_task_id": "202505131608478195830_nn_0_0_host_10000", "task_name": "nn_0", "component": "homo_nn", "role": "host", "party_id": "10000", "stage": "train", "parameters": {"runner_class": "Seq2SeqRunner", "runner_conf": {"algo": "fedavg", "data_collator_conf": {"item_name": "get_seq2seq_data_collator", "kwargs": {"tokenizer_name_or_path": "/root/chatglm3-6b/", "trust_remote_code": true}, "module_name": "fate_llm.data.data_collator.cust_data_collator", "source": null}, "dataset_conf": {"item_name": "PromptDataset", "kwargs": {"tokenizer_name_or_path": "/root/chatglm3-6b/", "trust_remote_code": true}, "module_name": "fate_llm.dataset.prompt_dataset", "source": null}, "fed_args_conf": {"aggregate_freq": 1, "aggregate_strategy": "epoch", "aggregator": "secure_aggregate"}, "model_conf": {"item_name": "ChatGLM", "kwargs": {"peft_config": {"alpha_pattern": {}, "auto_mapping": null, "base_model_name_or_path": null, "bias": "none", "fan_in_fan_out": false, "inference_mode": false, "init_lora_weights": true, "layers_pattern": null, "layers_to_transform": null, "loftq_config": {}, "lora_alpha": 32, "lora_dropout": 0.1, "megatron_config": null, "megatron_core": "megatron.core", "modules_to_save": null, "peft_type": "LORA", "r": 8, "rank_pattern": {}, "revision": null, "target_modules": ["query_key_value"], "task_type": "CAUSAL_LM", "use_rslora": false}, "peft_type": "LoraConfig", "pretrained_path": "/root/chatglm3-6b/", "trust_remote_code": true}, "module_name": "fate_llm.model_zoo.pellm.chatglm", "source": null}, "optimizer_conf": null, "save_trainable_weights_only": true, "task_type": "causal_lm", "tokenizer_conf": null, "training_args_conf": {"dataloader_pin_memory": true, "deepspeed": {"fp16": {"enabled": true}, "gradient_accumulation_steps": 1, "optimizer": {"params": 
{"adam_w_mode": false, "lr": 0.0005, "torch_adam": true}, "type": "Adam"}, "train_micro_batch_size_per_gpu": 1, "zero_optimization": {"allgather_bucket_size": 1xxx.xxx.xxx.xxx0.0, "allgather_partitions": true, "contiguous_gradients": true, "offload_optimizer": {"device": "cpu"}, "offload_param": {"device": "cpu"}, "overlap_comm": true, "reduce_bucket_size": 1xxx.xxx.xxx.xxx0.0, "reduce_scatter": true, "stage": 2}}, "fp16": true, "learning_rate": 0.0005, "num_train_epochs": 1, "per_device_train_batch_size": 1, "remove_unused_columns": false, "use_cpu": false}}, "runner_module": "homo_seq2seq_runner"}, "input_artifacts": {"train_data": {"uri": "file:///root/train.json", "metadata": {"metadata": {"options": {"partitions": 8}, "schema": {}}, "name": null, "namespace": null, "model_overview": {}, "data_overview": null, "source": null, "model_key": null, "type_name": null, "index": null}, "type_name": "data_directory"}}, "output_artifacts": {"train_output_data": {"uri": "eggroll:///202505131608478195830_nn_0/864728262fd111f0804cfa163e74ca79", "type_name": "dataframe"}, "output_model": {"uri": "file://EGGROLL_DEEPSPEED_MODEL_DIR/202505131608478195830/host/10000/nn_0/0/output/output_model/model_directory", "type_name": "model_directory"}, "metric": {"uri": "http://xxx.xxx.xxx.xxx:9380/v2/worker/metric/save/202505131608478195830_nn_0_0_host_10000", "type_name": "json_metric"}}, "conf": {"device": {"type": "CPU", "metadata": {}}, "computing": {"type": "eggroll", "metadata": {"computing_id": "202505131608478195830_nn_0_0_host_10000", "host": "xxx.xxx.xxx.xxx", "port": 4670, "config_options": null, "config_properties_file": null, "options": {}}}, "storage": "eggroll", "federation": {"type": "rollsite", "metadata": {"federation_id": "202505131608478195830_nn_0_0", "parties": {"local": {"role": "host", "partyid": "10000"}, "parties": [{"role": "guest", "partyid": "10000"}, {"role": "host", "partyid": "10000"}, {"role": "arbiter", "partyid": "10000"}]}, "rollsite_config": 
{"host": "xxx.xxx.xxx.xxx", "port": 9370}}}, "logger": {"config": {"disable_existing_loggers": false, "filters": {"component_profile_filter": {"()": "logging.Filter", "name": "fate.arch.computing._profile"}}, "formatters": {"component": {"format": "[%(levelname)s][%(asctime)-8s][%(process)s][%(module)s.%(funcName)s][line:%(lineno)d]: %(message)s"}, "root": {"format": "[%(levelname)s][%(asctime)-8s][%(process)s][%(module)s.%(funcName)s][line:%(lineno)d]: %(message)s"}}, "handlers": {"component_error": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505131608478195830/host/10000/nn_0/component/ERROR", "filters": [], "formatter": "component", "level": "ERROR"}, "component_info": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505131608478195830/host/10000/nn_0/component/INFO", "filters": [], "formatter": "component", "level": "INFO"}, "component_profile": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505131608478195830/host/10000/nn_0/component/PROFILE", "filters": ["component_profile_filter"], "formatter": "component", "level": "DEBUG"}, "component_warning": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505131608478195830/host/10000/nn_0/component/WARNING", "filters": [], "formatter": "component", "level": "WARNING"}, "global_error": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505131608478195830/host/10000/ERROR", "filters": [], "formatter": "root", "level": "ERROR"}, "root_error": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505131608478195830/host/10000/nn_0/root/ERROR", "filters": [], "formatter": "root", "level": "ERROR"}, "root_info": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505131608478195830/host/10000/nn_0/root/INFO", "filters": [], "formatter": 
"root", "level": "INFO"}, "root_party_error": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505131608478195830/host/10000/ERROR", "filters": [], "formatter": "root", "level": "ERROR"}, "root_party_info": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505131608478195830/host/10000/INFO", "filters": [], "formatter": "root", "level": "INFO"}, "root_party_warning": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505131608478195830/host/10000/WARNING", "filters": [], "formatter": "root", "level": "WARNING"}, "root_warning": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505131608478195830/host/10000/nn_0/root/WARNING", "filters": [], "formatter": "root", "level": "WARNING"}}, "loggers": {"fate": {"handlers": ["component_info", "component_warning", "component_error", "component_profile"], "level": "INFO"}}, "root": {"handlers": ["root_info", "root_warning", "root_error", "root_party_info", "root_party_warning", "root_party_error", "global_error"], "level": "INFO"}, "version": 1}}}}', 'DEEPSPEED_LOGS_DIR_PLACEHOLDER': 'EGGROLL_DEEPSPEED_LOGS_DIR', 'DEEPSPEED_MODEL_DIR_PLACEHOLDER': 'EGGROLL_DEEPSPEED_MODEL_DIR', 'DEEPSPEED_RESULT_PLACEHOLDER': 'EGGROLL_DEEPSPEED_RESULT_DIR'}
[INFO][2025-05-13 16:09:01,256][12020][_eggroll_deepspeed.start_submit][line:118]: resource_options: {'timeout_seconds': 21600, 'resource_exhausted_strategy': 'waiting', 'cores': 1, 'nodes': 1, 'task_cores_per_node': 1}
[INFO][2025-05-13 16:09:01,256][12020][_eggroll_deepspeed.start_submit][line:119]: options: {'eggroll.container.deepspeed.script.path': '/data/projects/fate/fate_flow/python/fate_flow/manager/worker/fate_ds_executor.py'}
[INFO][2025-05-13 16:09:01,256][12020][_eggroll_deepspeed.start_submit][line:120]: start submit deepspeed task deepspeed_202505131608478195830_nn_0_0_host_10000
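
The FATE_TASK_CONFIG value in the log is a single long JSON string, which makes it hard to see what was actually submitted. Below is a minimal sketch, not part of FATE, for pulling a few fields out of such a string. The helper name summarize_task_config and the trimmed sample are hypothetical; only the key layout is taken from the log. The masked values in the pasted log (for example 1xxx.xxx.xxx.xxx0.0) are not valid JSON, so the sketch runs on a reduced sample rather than on the log text itself.

```python
import json


def summarize_task_config(raw: str) -> dict:
    """Extract a few fields from a FATE_TASK_CONFIG JSON string.

    Hypothetical helper: the key paths mirror the layout visible in the log.
    """
    cfg = json.loads(raw)
    training_args = cfg["parameters"]["runner_conf"]["training_args_conf"]
    return {
        "party_task_id": cfg["party_task_id"],
        "component": cfg["component"],
        "device_type": cfg["conf"]["device"]["type"],
        "zero_stage": training_args["deepspeed"]["zero_optimization"]["stage"],
        "fp16": training_args["fp16"],
    }


# Reduced sample with the same key layout as the logged configuration.
sample = json.dumps({
    "party_task_id": "202505131608478195830_nn_0_0_host_10000",
    "component": "homo_nn",
    "parameters": {
        "runner_conf": {
            "training_args_conf": {
                "deepspeed": {"zero_optimization": {"stage": 2}},
                "fp16": True,
            }
        }
    },
    "conf": {"device": {"type": "CPU", "metadata": {}}},
})

summary = summarize_task_config(sample)
for key, value in summary.items():
    print(f"{key}: {value}")
```

With the unmasked environment variable at hand, the same function could be called on its value directly.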
