Hi, thanks for the great work and for releasing the code!
There seems to be a small typo in the paper, in Appendix A (Implementation Details). The text currently reads:
For Coconut training of GPT-2, we use a learning rate of 1e-4 and train for 25 epochs without continuous tokens and 25 epochs with continuous tokens (50 epochs in total). For iCoT training of LLaMA-1b, we use a learning rate of 1e-5 and train 5 epochs for both stages (10 epochs in total). LoRA is not used during training.
Based on the context, this sentence describes the same two-stage Coconut training setup (stage without continuous tokens, then stage with them), so I believe "iCoT training of LLaMA-1b" should be "Coconut on LLaMA-1B", i.e.:
For Coconut on LLaMA-1B, we use a learning rate of 1e-5 and train 5 epochs for both stages (10 epochs in total).
Could you confirm whether this is the intended setting? Thanks!