When running the train_rlhf_llama.sh script with LLaMA2-7B, the speedup from using vLLM as the inference backend instead of Megatron is modest (roughly a 20% improvement). According to other reports, the expected improvement should be closer to 200%. How can this be improved?
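For context, a minimal standalone vLLM generation micro-benchmark along the lines below (the model path, prompt set, and engine arguments are placeholders, not values taken from train_rlhf_llama.sh) is what I would use to isolate raw generation throughput from the rest of the RLHF pipeline:

```python
# Standalone vLLM generation micro-benchmark (sketch).
# Model path, prompts, and engine arguments are placeholders,
# not values taken from train_rlhf_llama.sh.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",   # placeholder model path
    tensor_parallel_size=1,             # match the TP degree used for rollout
    gpu_memory_utilization=0.90,        # larger KV-cache budget allows bigger batches
    max_num_seqs=256,                   # max concurrent sequences per scheduling step
)

sampling = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=512)
prompts = ["Write a short story about reinforcement learning."] * 512

start = time.time()
outputs = llm.generate(prompts, sampling)
elapsed = time.time() - start

generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated_tokens / elapsed:.1f} generated tokens/s over {elapsed:.1f}s")
```

If raw generation throughput measured this way is already several times higher than the Megatron backend's, then presumably the remaining overhead comes from elsewhere in the RLHF loop (e.g. weight synchronization or scheduling) rather than from vLLM itself.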