Question about sequential vs. multi-dataset training

Thank you for sharing this excellent work!
I've been experimenting with the approach and observed that, although the Transformer architecture supports multi-dataset training, performance tends to be suboptimal when I train on the datasets sequentially, one dataset after another.
Are there any recommended techniques or best practices to mitigate this issue?
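
To make the question concrete, here is a minimal sketch of what I mean by "sequential" training, next to an interleaved schedule as one possible alternative. The dataset names, batch counts, and both helper functions are hypothetical placeholders for illustration, not this repository's API or my exact setup.

```python
from itertools import zip_longest

def sequential_schedule(datasets):
    """Yield (dataset_name, batch_index), exhausting each dataset before the next."""
    for name, n_batches in datasets.items():
        for i in range(n_batches):
            yield name, i

def interleaved_schedule(datasets):
    """Yield (dataset_name, batch_index) round-robin across all datasets."""
    streams = [[(name, i) for i in range(n)] for name, n in datasets.items()]
    for group in zip_longest(*streams):
        for item in group:
            if item is not None:  # shorter datasets run out first
                yield item

# Hypothetical datasets: name -> number of batches per epoch
datasets = {"dataset_a": 3, "dataset_b": 2}

print(list(sequential_schedule(datasets)))
print(list(interleaved_schedule(datasets)))
```

With the sequential schedule the model sees every `dataset_a` batch before any `dataset_b` batch, which is the setting where I see the performance drop; the interleaved schedule mixes the two within one pass.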