Could you please share a training recipe for applying SDPO to math tasks?
I am currently training Qwen2.5-3B-Instruct on a Math training split using your SDPO implementation, but the val-core metric keeps degrading as training progresses. I have already tried swapping in other models and different datasets, but the training still isn't working as expected.
I would love to know if you have any empirical findings, recommended hyperparameters, or general advice for successfully adapting your method to math-heavy tasks. Thanks!