Description
I installed the environment from environment.yml and wanted to generate the evaluation results. In the S-Terminal I run bash /app/distserve/distserve/evaluation/ae-scripts/kick-the-tires/distllm-server.sh, and in the C-Terminal I run bash /app/distserve/distserve/evaluation/ae-scripts/kick-the-tires/distllm-client.sh. I tried OPT-125M, OPT-1.3B, and OPT-6.7B, but with every model the server fails with the error shown after the reproduction steps below.
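For completeness, here is roughly how I set things up and run the two scripts. The environment-creation line is just the standard conda invocation for an environment.yml (I am assuming the file sits at the repo root; the env name distserve is taken from the traceback below); the script paths are exactly as in the repository.

```bash
# Create and activate the conda environment
# (assumed: environment.yml at the repo root, env name "distserve" as seen in the traceback)
conda env create -f environment.yml
conda activate distserve

# S-Terminal: start the DistServe server
bash /app/distserve/distserve/evaluation/ae-scripts/kick-the-tires/distllm-server.sh

# C-Terminal: run the evaluation client
bash /app/distserve/distserve/evaluation/ae-scripts/kick-the-tires/distllm-client.sh
```

The S-Terminal output (this run was with OPT-6.7B):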
INFO 22:10:15 (context) 0 waiting, 7 finished but unaccepted, 0 blocks occupied by on-the-fly requests
INFO 22:10:15 (decoding) CPU blocks: 0 / 2048 (0.00%) used, (0 swapping in)
INFO 22:10:15 (decoding) GPU blocks: 270 / 3755 (7.19%) used, (0 swapping out)
INFO 22:10:15 (decoding) 7 unaccepted, 0 waiting, 3 processing
(ParaWorker pid=14304) [ERROR] CUDA error /DistServe/SwiftTransformer/src/csrc/model/gpt/gpt.cc:294 'cudaMemcpy(ith_context_req_token_index.ptr, ith_context_req_token_index_cpu, sizeof(int32_t) * (batch_size+1), cudaMemcpyHostToDevice)': (719) unspecified launch failure
(ParaWorker pid=14304) INFO 21:47:14 (worker decoding.#0) model facebook/opt-6.7b loaded
(ParaWorker pid=14304) INFO 21:47:14 runtime peak memory: 12.844 GB
(ParaWorker pid=14304) INFO 21:47:14 total GPU memory: 44.403 GB
(ParaWorker pid=14304) INFO 21:47:14 kv cache size for one token: 0.50000 MB
(ParaWorker pid=14304) INFO 21:47:14 num_gpu_blocks: 3755
(ParaWorker pid=14304) INFO 21:47:14 num_cpu_blocks: 2048
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffe65f627fffa81faeb8ef03d901000000 Worker ID: 8bb0dd0ebc21a6586b9a66fd01f48a05fd4f19680185a80d75c0ed4a Node ID: 5c51d1203d5390054ce3a6973294116a342892b6cf2bb763c8dfd950 Worker IP address: 192.168.0.44 Worker port: 46581 Worker PID: 14304 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Traceback (most recent call last):
File "/DistServe/distserve/api_server/distserve_api_server.py", line 174, in start_event_loop_wrapper
await task
File "/DistServe/distserve/llm.py", line 167, in start_event_loop
await self.engine.start_all_event_loops()
File "/DistServe/distserve/engine.py", line 253, in start_all_event_loops
await asyncio.gather(
File "/DistServe/distserve/single_stage_engine.py", line 672, in start_event_loop
await asyncio.gather(event_loop1(), event_loop2(), event_loop3())
File "/DistServe/distserve/single_stage_engine.py", line 663, in event_loop2
await self._step()
File "/DistServe/distserve/single_stage_engine.py", line 609, in _step
generated_tokens_ids = await self.batches_ret_futures[0]
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: ParaWorker
actor_id: e65f627fffa81faeb8ef03d901000000
pid: 14304
namespace: d56e3e54-6ed7-4dc2-9f7f-10a347d1960d
ip: 192.168.0.44
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
^CException ignored in atexit callback: <function _exit_function at 0x7a4ab42ea050>
Traceback (most recent call last):
File "/miniconda3/envs/distserve/lib/python3.10/multiprocessing/util.py", line 357, in _exit_function
p.join()
File "/miniconda3/envs/distserve/lib/python3.10/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)
File "/miniconda3/envs/distserve/lib/python3.10/multiprocessing/popen_fork.py", line 43, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/miniconda3/envs/distserve/lib/python3.10/multiprocessing/popen_fork.py", line 27, in poll
pid, sts = os.waitpid(self.pid, flag)
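As a sanity check, the reported kv cache size of 0.5 MB per token looks consistent with OPT-6.7B (assuming fp16: 2 × 32 layers × 4096 hidden dim × 2 bytes = 512 KB), so the model itself appears to load and be configured correctly before the cudaMemcpy failure.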
I also tried to run it on another machine, but I could not get DistServe to run there either; I am attaching the error message.
Can someone help me figure out how to run OPT models with DistServe? I was using the ShareGPT dataset.
