When I run the generation script below, the model fails to load onto the GPU with a CUDA out-of-memory error:
`bash scripts/cli_sat.sh --from_pretrained ./checkpoints/MSAGPT --input-source /home/mca/MP_PSP/myenv/env_name/MSAGPT/INPUT --output-path /home/mca/MP_PSP/myenv/env_name/MSAGPT/output --max-gen-length 64`
> NCCL_DEBUG=VERSION NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 CUDA_LAUNCH_BLOCKING=0 torchrun --nproc_per_node 1 --master_port=19865 /home/mca/MP_PSP/myenv/env_name/MSAGPT/cli_sat.py --bf16 --skip-init --mode finetune --rotary-embedding-2d --seed 12345 --sampling-strategy BaseStrategy --max-gen-length 128 --min-gen-length 0 --num-beams 4 --length-penalty 1.0 --no-repeat-ngram-size 0 --multiline_stream --temperature 0.8 --top_k 0 --top_p 0.9 --from_pretrained ./checkpoints/MSAGPT --input-source /home/mca/MP_PSP/myenv/env_name/MSAGPT/INPUT --output-path /home/mca/MP_PSP/myenv/env_name/MSAGPT/output --max-gen-length 64
> [2025-01-21 11:21:52,777] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
> [2025-01-21 11:21:54,854] [WARNING] No training data specified
> [2025-01-21 11:21:54,855] [WARNING] No train_iters (recommended) or epochs specified, use default 10k iters.
> [2025-01-21 11:21:54,855] [INFO] using world size: 1 and model-parallel size: 1
> [2025-01-21 11:21:54,855] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128)
> [2025-01-21 11:21:54,855] [INFO] [RANK 0] > initializing model parallel with size 1
> [2025-01-21 11:21:54,856] [INFO] [comm.py:652:init_distributed] cdb=None
> [2025-01-21 11:21:54,856] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1004:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1125:configure] Activation Checkpointing Information
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1126:configure] ----Partition Activations False, CPU CHECKPOINTING False
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1127:configure] ----contiguous Memory Checkpointing False with 6 total layers
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1128:configure] ----Synchronization False
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1129:configure] ----Profiling time in checkpointing False
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:229:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 15063 and data parallel seed: 12345
> [2025-01-21 11:21:54,857] [INFO] [RANK 0] building MSAGPT model ...
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:229:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 15063 and data parallel seed: 12345
> [2025-01-21 11:21:55,104] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 2860508544
> [2025-01-21 11:21:56,144] [INFO] [RANK 0] CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 5.79 GiB total capacity; 5.30 GiB already allocated; 84.75 MiB free; 5.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
> [2025-01-21 11:21:56,144] [INFO] [RANK 0] global rank 0 is loading checkpoint ./checkpoints/MSAGPT/1/mp_rank_00_model_states.pt
> [2025-01-21 11:21:58,535] [INFO] [RANK 0] > successfully loaded ./checkpoints/MSAGPT/1/mp_rank_00_model_states.pt
> Traceback (most recent call last):
> File "/home/mca/MP_PSP/myenv/env_name/MSAGPT/cli_sat.py", line 44, in <module>
> model = model.to('cuda')
> ^^^^^^^^^^^^^^^^
> File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1145, in to
> return self._apply(convert)
> ^^^^^^^^^^^^^^^^^^^^
> File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
> module._apply(fn)
> File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
> module._apply(fn)
> File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
> module._apply(fn)
> [Previous line repeated 2 more times]
> File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 820, in _apply
> param_applied = fn(param)
> ^^^^^^^^^
> File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1143, in convert
> return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 5.79 GiB total capacity; 5.30 GiB already allocated; 84.75 MiB free; 5.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
> ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6700) of binary: /home/mca/anaconda3/bin/python
> Traceback (most recent call last):
> File "/home/mca/anaconda3/bin/torchrun", line 8, in <module>
> sys.exit(main())
> ^^^^^^
> File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
> return f(*args, **kwargs)
> ^^^^^^^^^^^^^^^^^^
> File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
> run(args)
> File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
> elastic_launch(
> File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
> return launch_agent(self._config, self._entrypoint, list(args))
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
> raise ChildFailedError(
> torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
> ============================================================
> /home/mca/MP_PSP/myenv/env_name/MSAGPT/cli_sat.py FAILED
> ------------------------------------------------------------
> Failures:
> <NO_OTHER_FAILURES>
> ------------------------------------------------------------
> Root Cause (first observed failure):
> [0]:
> time : 2025-01-21_11:22:01
> host : mca-lab6
> rank : 0 (local_rank: 0)
> exitcode : 1 (pid: 6700)
> error_file: <N/A>
> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
> ============================================================
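If I read the log correctly, the model has 2,860,508,544 parameters and is loaded with `--bf16` (2 bytes per weight), so the weights alone need roughly 5.3 GiB, which is nearly all of the 5.79 GiB PyTorch reports for this 6 GB card. A rough back-of-the-envelope check (illustrative only; it ignores activations, the KV cache and the CUDA context overhead):

```python
# Rough estimate of the VRAM needed just to hold the MSAGPT weights in bf16.
# Figures are taken from the log above; activations, KV cache and the CUDA
# context are not counted, so the real requirement is even higher.
n_params = 2_860_508_544   # "number of parameters on model parallel rank 0"
bytes_per_param = 2        # --bf16 stores each weight in 2 bytes
print(f"weights alone: {n_params * bytes_per_param / 1024**3:.2f} GiB")  # ~5.33 GiB
```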
Output of `nvidia-smi` shortly after the failure:
> Tue Jan 21 11:24:48 2025
> +---------------------------------------------------------------------------------------+
> | NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
> |-----------------------------------------+----------------------+----------------------+
> | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
> | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
> | | | MIG M. |
> |=========================================+======================+======================|
> | 0 NVIDIA GeForce GTX 1660 Ti Off | 00000000:01:00.0 On | N/A |
> | 0% 48C P8 12W / 120W | 219MiB / 6144MiB | 3% Default |
> | | | N/A |
> +-----------------------------------------+----------------------+----------------------+
>
> +---------------------------------------------------------------------------------------+
> | Processes: |
> | GPU GI CI PID Type Process name GPU Memory |
> | ID ID Usage |
> |=======================================================================================|
> | 0 N/A N/A 2307 G /usr/lib/xorg/Xorg 92MiB |
> | 0 N/A N/A 2435 G ...libexec/gnome-remote-desktop-daemon 1MiB |
> | 0 N/A N/A 2473 G /usr/bin/gnome-shell 70MiB |
> | 0 N/A N/A 4776 G ...seed-version=20250119-180455.285000 51MiB |
> +---------------------------------------------------------------------------------------+
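Even right after the failed run, `nvidia-smi` shows only about 219 MiB in use (Xorg/GNOME processes), so the card itself is essentially free and the OOM comes from loading the model, not from other processes. To double-check what PyTorch itself sees as free just before the `model.to('cuda')` call, a quick probe like this should work (standard PyTorch API, shown only as a sketch):

```python
import torch

# Free / total device memory in bytes, as reported by the CUDA driver.
free_b, total_b = torch.cuda.mem_get_info(0)
print(f"GPU 0: {free_b / 1024**3:.2f} GiB free of {total_b / 1024**3:.2f} GiB")
```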
This is my INPUT file (passed via `--input-source`):
7pno_D:GSGSGSGSGTNSLLNLRSRLAAKAAKEAASSNSENLYFQ---SGGTRLTNSLLNLRSRLAAKAAKEAASSNAT------STSGGTRLTNSLLNLRSRLAAKAIKEST----------