
Results for Qwen3-8B on ARC-C #27

@coding-rw

Hello,

I conducted an evaluation of Qwen3-8B on the ARC-C dataset using two different methods. The results show a significant drop in accuracy with the latent_mas method compared to the baseline. Below are the commands used and the corresponding results:

Baseline Method

python run.py --method baseline --model_name Qwen/Qwen3-8B --task arc_challenge --max_samples -1 --use_vllm --max_new_tokens 2048 --gpu_memory_utilization 0.9

Result:

{
  "method": "baseline",
  "model": "Qwen3-8B",
  "split": "test",
  "seed": 42,
  "max_samples": 1172,
  "accuracy": 0.9198,
  "correct": 1078,
  "total_time_sec": 1971.0141,
  "time_per_sample_sec": 1.6818
}

Latent MAS Method

CUDA_VISIBLE_DEVICES=0,1 python run.py --method latent_mas --model_name Qwen/Qwen3-8B --task arc_challenge --prompt sequential --max_samples -1 --max_new_tokens 2048 \
  --use_vllm \
  --use_second_HF_model \
  --enable_prefix_caching \
  --device2 cuda:1

Result:

{
  "method": "latent_mas",
  "model": "Qwen3-8B",
  "split": "test",
  "seed": 42,
  "max_samples": 1172,
  "accuracy": 0.7108,
  "correct": 833,
  "total_time_sec": 1754.6171,
  "time_per_sample_sec": 1.4971
}
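For reference, the gap between the two runs can be recomputed directly from the reported counts. This is a minimal sketch that is not part of the repository: the names `TOTAL`, `results`, and `accuracy` are hypothetical, and the numbers are copied from the two result blocks above.

```python
# Counts copied from the two result blocks above (1172 test samples each).
TOTAL = 1172
results = {"baseline": 1078, "latent_mas": 833}


def accuracy(correct: int, total: int) -> float:
    """Fraction of samples answered correctly."""
    return correct / total


acc = {name: accuracy(correct, TOTAL) for name, correct in results.items()}

# Absolute gap in percentage points, and the number of additional
# questions that latent_mas got wrong compared to the baseline.
gap_points = (acc["baseline"] - acc["latent_mas"]) * 100
extra_errors = results["baseline"] - results["latent_mas"]

for name, value in acc.items():
    print(f"{name}: {value:.4f}")
print(f"gap: {gap_points:.1f} percentage points ({extra_errors} more questions missed)")
```

The recomputed accuracies match the reported `accuracy` fields, so the gap comes from the predictions themselves rather than from how the metric was aggregated.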

Issue

The baseline achieves an accuracy of ~92.0%, while the latent_mas method drops to ~71.1%, a gap of about 20.9 percentage points (245 fewer correct answers out of 1,172). Could you please help me understand whether I made a mistake in the setup or whether this is expected behavior? Any insights would be greatly appreciated.

Thank you!
