AlphaD-RL: Multi-Teacher Monte Carlo Tree Search with Dynamic Weighting and Policy Distillation for Code Generation
Multi-Teacher Monte Carlo Tree Search (MT-MCTS) for code generation, where 3+ diverse teacher models (DeepSeek-Coder, CodeLlama, Qwen2.5-Coder) propose token paths that form the MCTS search tree. The policy network learns to navigate these trees using proper UCB (Q-value + exploration term) with execution rewards from unit tests as the sole signal.
Student model: Qwen/Qwen3-4B
Teacher model: Qwen/Qwen2.5-Coder-14B-Instruct, mistralai/Codestral-22B-v0.1, openai/gpt-oss-20b
There are two phases in here:
- training the model to determine, which level is enough for the token level, and generating the whole function from there
- mcts planned teacher populated tree search with boiling down to path of selected and rejected, and dpo optimization of local model, and then after iteration update the main target model.