Hi, I’m trying to replicate the results in Tables 7 and 8 of the paper, but I’m not sure which split was used for evaluation: validation or test? Thanks in advance!