Hi,
I encountered an issue when running vllm_inference.py after modifying the model path to a local directory.
Here is the error message I received:
AttributeError: CachedPreTrainedTokenizerFast has no attribute build_chat_input
I suspect this may be a compatibility issue between transformers and the tokenizer being used. Could you confirm whether this is expected behavior, and is there a recommended way to resolve it?
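In case it helps, I was considering a workaround that builds the prompt with the generic apply_chat_template API instead of the GLM-specific build_chat_input. Below is a rough sketch only; the function name build_prompt_ids and the assumed history format are my own guesses, not from the repo, and I am not sure whether a generic chat template preserves the prompt format LongCite expects.

```python
# Hypothetical replacement for the prompt-building step in the chat() helper of
# vllm_inference.py, assuming the tokenizer ships a chat template (as the
# Llama-3.2 Instruct tokenizers do). Not tested against the repo.
def build_prompt_ids(tokenizer, query, history=None, role="user"):
    """Build prompt token ids via tokenizer.apply_chat_template() instead of
    the GLM-specific tokenizer.build_chat_input()."""
    messages = []
    for turn in (history or []):
        # Assumes each history entry is a dict with "role" and "content";
        # adjust if the script stores history in a different structure.
        messages.append({"role": turn["role"], "content": turn["content"]})
    messages.append({"role": role, "content": query})
    # With tokenize=True this returns a list of token ids that could then be
    # passed to vLLM (e.g. via prompt_token_ids) in place of the old inputs.
    return tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
    )
```

Would something along these lines be acceptable here, or is the GLM-style build_chat_input required for the model's citation format?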
Thanks for your help!
Best,
Yiju Guo
INFO 03-08 22:45:02 selector.py:135] Using Flash Attention backend.
INFO 03-08 22:45:05 model_runner.py:1072] Starting to load model /home/test/testdata/models/meta-llama/Llama-3.2-1B-Instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.84s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.84s/it]
INFO 03-08 22:45:10 model_runner.py:1077] Loading model weights took 2.3185 GB
INFO 03-08 22:45:15 worker.py:232] Memory profiling results: total_gpu_memory=79.33GiB initial_memory_usage=2.82GiB peak_torch_memory=3.52GiB memory_usage_post_profile=2.84GiB non_torch_memory=0.51GiB kv_cache_size=75.29GiB gpu_memory_utilization=1.00
INFO 03-08 22:45:15 gpu_executor.py:113] # GPU blocks: 154199, # CPU blocks: 8192
INFO 03-08 22:45:15 gpu_executor.py:117] Maximum concurrency for 8096 tokens per request: 304.74x
INFO 03-08 22:45:17 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 03-08 22:45:17 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 03-08 22:45:27 model_runner.py:1518] Graph capturing finished in 10 secs, took 0.47 GiB
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/test/test05/gyj/LongCite/vllm_inference.py", line 200, in <module>
[rank0]: result = model.query_longcite(context, query, tokenizer=tokenizer, max_input_length=128000, max_new_tokens=1024)
[rank0]: File "/home/test/test05/gyj/LongCite/vllm_inference.py", line 178, in query_longcite
[rank0]: output, _ = self.chat(tokenizer, prompt, history=[], max_new_tokens=max_new_tokens, temperature=temperature)
[rank0]: File "/home/test/test05/anaconda3/envs/lzy_vllm_0.5.4/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/test/test05/gyj/LongCite/vllm_inference.py", line 14, in chat
[rank0]: inputs = tokenizer.build_chat_input(query, history=history, role=role)
[rank0]: File "/home/test/test05/anaconda3/envs/lzy_vllm_0.5.4/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1104, in __getattr__
[rank0]: raise AttributeError(f"{self.__class__.__name__} has no attribute {key}")
[rank0]: AttributeError: CachedPreTrainedTokenizerFast has no attribute build_chat_input
[rank0]:[W308 22:45:31.977270329 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())