Specifically, I would like to understand what the reward curves should look like once training has converged, and what the expected converged reward values are for each of the three datasets provided in the README.
In my experiments, when training on the 100style and lafan1 datasets, the final reward values after convergence are relatively low, reaching only about 70% to 80% of the defined reward scale. I am unsure whether this is expected or whether it indicates a problem with my training setup.
Could you please clarify what reward levels are typically reached after convergence on each of the three datasets?
Thanks!