Hello,
I evaluated Qwen3-8B on the ARC-C dataset (arc_challenge, test split) with two methods, baseline and latent_mas. Accuracy drops substantially with latent_mas compared to the baseline. The commands I used and the corresponding results are below:
Baseline Method
python run.py --method baseline --model_name Qwen/Qwen3-8B --task arc_challenge --max_samples -1 --use_vllm --max_new_tokens 2048 --gpu_memory_utilization 0.9
Result:
{
"method": "baseline",
"model": "Qwen3-8B",
"split": "test",
"seed": 42,
"max_samples": 1172,
"accuracy": 0.9198,
"correct": 1078,
"total_time_sec": 1971.0141,
"time_per_sample_sec": 1.6818
}
Latent MAS Method
CUDA_VISIBLE_DEVICES=0,1 python run.py --method latent_mas --model_name Qwen/Qwen3-8B --task arc_challenge --prompt sequential --max_samples -1 --max_new_tokens 2048 \
--use_vllm \
--use_second_HF_model \
--enable_prefix_caching \
--device2 cuda:1
Result:
{
"method": "latent_mas",
"model": "Qwen3-8B",
"split": "test",
"seed": 42,
"max_samples": 1172,
"accuracy": 0.7108,
"correct": 833,
"total_time_sec": 1754.6171,
"time_per_sample_sec": 1.4971
}
Issue
The baseline reaches ~92.0% accuracy (1078/1172), while latent_mas drops to ~71.1% (833/1172), a gap of about 21 percentage points on the same test split and seed. Could you please help me understand whether I made a mistake in the setup, or whether this is expected behavior? Any insights would be greatly appreciated.
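For reference, here is a quick sanity check of the reported numbers. It is a minimal sketch, not part of the repository: it recomputes both accuracies from the "correct" counts and 1172 samples, and gives a rough two-proportion z statistic for the gap. The helper names are hypothetical. The z-test assumes independent samples, which is only an approximation here because both runs use the same questions; a paired test such as McNemar's would need per-question outputs, which are not included above.

```python
import math

# Counts copied from the two result blocks in this issue.
baseline = {"correct": 1078, "total": 1172}
latent_mas = {"correct": 833, "total": 1172}


def accuracy(result):
    """Fraction of correctly answered samples."""
    return result["correct"] / result["total"]


def two_proportion_z(a, b):
    """Pooled z statistic for the difference between two proportions.

    Assumption: the two runs are treated as independent samples,
    which is only a rough approximation for runs on the same test set.
    """
    pooled = (a["correct"] + b["correct"]) / (a["total"] + b["total"])
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / a["total"] + 1 / b["total"]))
    return (accuracy(a) - accuracy(b)) / std_err


acc_base = accuracy(baseline)
acc_latent = accuracy(latent_mas)
gap_points = (acc_base - acc_latent) * 100  # gap in percentage points
z = two_proportion_z(baseline, latent_mas)

print(f"baseline accuracy:   {acc_base:.4f}")
print(f"latent_mas accuracy: {acc_latent:.4f}")
print(f"gap: {gap_points:.1f} percentage points, z = {z:.1f}")
```

The recomputed accuracies match the reported 0.9198 and 0.7108, and the z statistic is far above the usual thresholds, so the gap is very unlikely to be sampling noise.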
Thank you!