Hi, thanks for your work!
Some parameters are already specified in the paper: the AdamW optimizer, a linear warm-up ratio of 0.03, a batch size of 128, LoRA rank 128, LoRA alpha 32, and BF16 precision. For the LLaMA3.2-1B model, the learning rate is 8e-4 and training runs for 10 epochs.
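For concreteness, here is a minimal sketch of how I would map those settings onto a Hugging Face Transformers + PEFT configuration. The `target_modules` choice and the per-device/accumulation split of the batch size of 128 are my own assumptions, not from the paper:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA settings from the paper; target_modules is my guess.
lora_config = LoraConfig(
    r=128,                       # LoRA rank
    lora_alpha=32,               # LoRA alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)

# Optimizer/schedule settings from the paper (values for LLaMA3.2-1B).
training_args = TrainingArguments(
    output_dir="sft-baseline",       # placeholder
    optim="adamw_torch",             # AdamW
    learning_rate=8e-4,
    num_train_epochs=10,
    lr_scheduler_type="linear",
    warmup_ratio=0.03,
    bf16=True,                       # BF16 precision
    per_device_train_batch_size=16,  # assumed split:
    gradient_accumulation_steps=8,   # 16 x 8 = effective batch size 128
)
```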
However, I'm unclear about how the SFT baseline is implemented. Does it still use the code from this repository, retaining only the `ref_ce_loss` term? If so, does gamma also need to be set to 10 (as stated in Appendix A, Implementation Details)? Or is the SFT baseline implemented with a dedicated framework such as LLaMA Factory or Swift? Could you please share the complete set of hyperparameters for the SFT-based baselines? My current reading of the loss is sketched below; please correct me if it is wrong.
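To make the first question concrete, this is the hypothetical loss structure I have in mind; the term names and composition are my guesses, not confirmed by the paper or the code:

```python
# Hypothetical reading -- please correct me if the actual code differs:
# full method:   loss = task_loss + gamma * ref_ce_loss
# SFT baseline:  loss = ref_ce_loss only?
# And if so, does gamma = 10 still scale it,
# i.e. loss = gamma * ref_ce_loss?
```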