Results for Qwen3-8B on ARC-C


Hello,  

I evaluated Qwen3-8B on the ARC-C test split using two methods, `baseline` and `latent_mas`. Accuracy drops significantly with `latent_mas` compared to the baseline. The commands and results are below:

### **Baseline Method**
```bash
python run.py --method baseline --model_name Qwen/Qwen3-8B --task arc_challenge --max_samples -1 --use_vllm --max_new_tokens 2048 --gpu_memory_utilization 0.9
```
**Result:**  
```json
{
  "method": "baseline",
  "model": "Qwen3-8B",
  "split": "test",
  "seed": 42,
  "max_samples": 1172,
  "accuracy": 0.9198,
  "correct": 1078,
  "total_time_sec": 1971.0141,
  "time_per_sample_sec": 1.6818
}
```

### **Latent MAS Method**
```bash
CUDA_VISIBLE_DEVICES=0,1 python run.py --method latent_mas --model_name Qwen/Qwen3-8B --task arc_challenge --prompt sequential --max_samples -1 --max_new_tokens 2048 \
  --use_vllm \
  --use_second_HF_model \
  --enable_prefix_caching \
  --device2 cuda:1
```
**Result:**  
```json
{
  "method": "latent_mas",
  "model": "Qwen3-8B",
  "split": "test",
  "seed": 42,
  "max_samples": 1172,
  "accuracy": 0.7108,
  "correct": 833,
  "total_time_sec": 1754.6171,
  "time_per_sample_sec": 1.4971
}
```

### **Issue**
The baseline achieves an accuracy of **~92.0%** (1078/1172), while the `latent_mas` method drops to **~71.1%** (833/1172), a gap of about 21 percentage points. Could you please help me understand whether I made a mistake in the setup, or whether this is expected behavior? Any insights would be greatly appreciated.
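For reference, here is a minimal sketch that recomputes the gap from the counts in the two result blocks above. It uses only the reported `correct` and `max_samples` values; the variable names are my own and are not part of `run.py`.

```python
# Counts copied from the two result blocks above.
total = 1172               # "max_samples" in both runs
baseline_correct = 1078    # "correct" for --method baseline
latent_mas_correct = 833   # "correct" for --method latent_mas

# Accuracy is correct / total for each run.
baseline_acc = baseline_correct / total
latent_mas_acc = latent_mas_correct / total

# Gap in percentage points and in raw number of questions.
gap_points = (baseline_acc - latent_mas_acc) * 100
fewer_correct = baseline_correct - latent_mas_correct

print(f"baseline:   {baseline_acc:.4f}")
print(f"latent_mas: {latent_mas_acc:.4f}")
print(f"gap: {gap_points:.1f} points ({fewer_correct} fewer correct answers)")
```

The recomputed accuracies match the reported `accuracy` fields, so the gap does not appear to be a reporting or rounding error in the JSON output.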

Thank you!


