Since this is a fairly large version upgrade, I checked, just in case, whether any notable score differences appear on several models and benchmarks. I ran the script below in both the current environment (vllm==0.10.2) and a vllm==0.16.0 environment.

Script:
${SAVE_BASE_DIR:?"Need to set SAVE_BASE_DIR environment variable"}
eval_setups=(
examples/sarashina_2_2_evaluation/configs/pretrained_evals/gsm8k.jsonnet
examples/sarashina_2_2_evaluation/configs/pretrained_evals/niilcqa.jsonnet
)
models=(
examples/sarashina_2_2_evaluation/configs/pretrained_models/Llama-3.2-3B.jsonnet
examples/sarashina_2_2_evaluation/configs/pretrained_models/Qwen2.5-3B.jsonnet
examples/sarashina_2_2_evaluation/configs/pretrained_models/sarashina2-7b.jsonnet
examples/sarashina_2_2_evaluation/configs/pretrained_models/sarashina2-70b.jsonnet
examples/sarashina_2_2_evaluation/configs/pretrained_models/sarashina2.2-3b.jsonnet
)
for eval_setup in ${eval_setups[@]}; do
for language_model in ${models[@]}; do
model_name=$(basename ${language_model%.*})
eval_setup_name=$(basename ${eval_setup%.*})
echo "Evaluating ${model_name} on ${eval_setup_name}"
save_dir=${SAVE_BASE_DIR}/pretrained/${model_name}/${eval_setup_name}
if [ -f ${save_dir}/metrics.json ]; then
echo "results already exist. Skipping evaluation to avoid overwriting results."
continue
fi
flexeval_lm --eval_setup ${eval_setup} --language_model ${language_model} --save_dir ${save_dir} --eval_setup.batch_size 10000 --force true
done
done
eval_setups=(
examples/sarashina_2_2_evaluation/configs/instruction_evals/elyza_tasks_100.jsonnet
)
models=(
examples/sarashina_2_2_evaluation/configs/instruction_models/Llama-3.1-Swallow-8B-Instruct-v0.3.jsonnet
examples/sarashina_2_2_evaluation/configs/instruction_models/llm-jp-3-7.2b-instruct3.jsonnet
examples/sarashina_2_2_evaluation/configs/instruction_models/Qwen2.5-1.5B-Instruct.jsonnet
examples/sarashina_2_2_evaluation/configs/instruction_models/Qwen2.5-7B-Instruct.jsonnet
examples/sarashina_2_2_evaluation/configs/instruction_models/sarashina2.2-3b-instruct-v0.1.jsonnet
)
for eval_setup in ${eval_setups[@]}; do
for language_model in ${models[@]}; do
model_name=$(basename ${language_model%.*})
eval_setup_name=$(basename ${eval_setup%.*})
echo "Evaluating ${model_name} on ${eval_setup_name}"
save_dir=${SAVE_BASE_DIR}/instruction/${model_name}/${eval_setup_name}
if [ -f ${save_dir}/metrics.json ]; then
echo "results already exist. Skipping evaluation to avoid overwriting results."
continue
fi
flexeval_lm --eval_setup ${eval_setup} --language_model ${language_model} --save_dir ${save_dir} --eval_setup.batch_size 10000 --cleanup_after_generation true --force true
done
done

Benchmarks checked:
- niilcqa (pretrained)
- gsm8k (pretrained)
- elyza-tasks-100 (chat)
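To spot score differences between the two runs, one could diff the saved metrics.json files from each environment. A minimal sketch, assuming metrics.json is a flat JSON object mapping metric names to numbers (the function name and file layout are my assumptions, not part of the repository):

```python
import json
from pathlib import Path


def compare_metrics(dir_a: str, dir_b: str, rel_tol: float = 0.02) -> dict:
    """Return metrics whose values differ by more than rel_tol between two runs.

    Assumes each directory contains a flat metrics.json mapping names to numbers.
    """
    a = json.loads(Path(dir_a, "metrics.json").read_text())
    b = json.loads(Path(dir_b, "metrics.json").read_text())
    diffs = {}
    for key in a.keys() & b.keys():
        va, vb = a[key], b[key]
        if isinstance(va, (int, float)) and isinstance(vb, (int, float)):
            # Flag the metric if the relative difference exceeds rel_tol.
            if abs(va - vb) > rel_tol * max(abs(va), abs(vb), 1e-9):
                diffs[key] = (va, vb)
    return diffs
```

Running this over each `${SAVE_BASE_DIR}` pair would surface only the metrics that moved by more than the chosen tolerance.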
junya-takayama commented on Feb 27, 2026:
@pytest.fixture(scope="module")
def chat_lm_with_system_message() -> VLLM:
I don't know why this function was originally placed here, but since it was causing errors, I moved it to test_vllm_specific.py.
junya-takayama commented on Feb 27, 2026:
  vllm_log_probs = chat_lm.compute_log_probs(text_list)
  hf_log_probs = hf_lm.compute_log_probs(text_list)
- assert vllm_log_probs == pytest.approx(hf_log_probs, abs=1e-2)
+ assert vllm_log_probs == pytest.approx(hf_log_probs, abs=0.5)
It seems that differences of this magnitude can occur depending on the seed (or the environment), so I widened the acceptable tolerance.
The values are roughly around -33.2 (Qwen) and -47.2 (Sarashina), so allowing an error margin of about ±0.5 seems reasonable in practice.
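To illustrate the tolerance choice, here is a small self-contained check mirroring the behavior of pytest.approx(..., abs=...); the log-prob values below are made up to sit near the magnitudes mentioned above, not real model outputs:

```python
def within_abs_tolerance(xs, ys, abs_tol):
    """Element-wise absolute-difference check, like pytest.approx(ys, abs=abs_tol)."""
    return all(abs(x - y) <= abs_tol for x, y in zip(xs, ys))


# Illustrative values near the magnitudes mentioned above (not real outputs).
vllm_log_probs = [-33.2, -47.2]
hf_log_probs = [-33.45, -46.95]

print(within_abs_tolerance(vllm_log_probs, hf_log_probs, 0.5))   # True: new tolerance passes
print(within_abs_tolerance(vllm_log_probs, hf_log_probs, 1e-2))  # False: old tolerance fails
```

A drift of ~0.25 in a total log-prob around -33 to -47 is well under 1% relative error, which is why the wider absolute bound still catches genuine regressions.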
junya-takayama commented on Feb 27, 2026:
  from vllm import RequestOutput, SamplingParams
  from vllm.inputs import TokensPrompt
- from vllm.sequence import Logprob
+ from vllm.logprobs import Logprob
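The PR simply switches to the new import path. If the code ever needed to support both vLLM versions at once, a guarded import would be one option; a sketch, assuming the class moved from vllm.sequence to vllm.logprobs exactly as the diff shows:

```python
def load_logprob_class():
    """Import Logprob from the new vLLM location, falling back to the old one."""
    try:
        from vllm.logprobs import Logprob  # vLLM >= 0.16 layout
    except ImportError:
        from vllm.sequence import Logprob  # pre-0.16 layout
    return Logprob
```

The tradeoff is extra maintenance surface; pinning a single vLLM version, as this PR does, keeps the import unconditional.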
junya-takayama commented on Feb 27, 2026:
  self.llm = LLM(self.model_name, **self.model_kwargs)
  if self.model_limit_tokens == "default":
-     self.model_limit_tokens = self.llm.llm_engine.get_model_config().max_model_len
+     self.model_limit_tokens = self.llm.llm_engine.model_config.max_model_len
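As with the import change above, a version-tolerant accessor would be an alternative if both vLLM versions had to be supported; the helper name and the hasattr probe below are my sketch, not flexeval code:

```python
def get_max_model_len(llm) -> int:
    """Read the model's max context length across vLLM versions.

    Assumes newer engines expose a model_config attribute while older ones
    only provide a get_model_config() method, per the diff above.
    """
    engine = llm.llm_engine
    if hasattr(engine, "model_config"):
        return engine.model_config.max_model_len  # vLLM >= 0.16
    return engine.get_model_config().max_model_len  # older vLLM
```

Again, since the PR pins vLLM to a single version, the direct attribute access in the diff is the simpler choice.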
junya-takayama commented on Mar 1, 2026:
  ]
- max_length = self.llm.llm_engine.get_model_config().max_seq_len_to_capture
+ max_length = self.llm.llm_engine.model_config.max_model_len
Update the dependency on vLLM to v0.16.0, adapting the code to its API deprecations and removals.