Question about evaluation setup

According to Table 1, the test set contains 10,000 real videos and several fake subsets, each corresponding to one generative model (e.g., Sora, Gen-2).
In Table 4, each fake subset is reported with R, F1, and AP metrics.
My question is: when evaluating each fake subset (e.g., Sora), are all 10,000 real videos used as negative samples, or only a subset of them?
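For context, I ask because the choice changes how the precision-based metrics (F1 and AP) should be read, while recall on the fake subset is unaffected. Below is a minimal sketch of that effect, assuming R in Table 4 denotes recall. All numbers here (1,000 fake videos, an 80% detection rate, a 1% false-positive rate, and the 1,000-video real subset) are hypothetical and not taken from the paper, and the function names are my own.

```python
def recall_and_f1(tp, fn, fp):
    """Recall and F1 for the fake (positive) class from raw counts."""
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return recall, f1


def evaluate(n_fake, n_real, tpr=0.8, fpr=0.01):
    """Hypothetical detector with fixed per-video detection and false-alarm rates."""
    tp = round(n_fake * tpr)   # fake videos correctly flagged
    fn = n_fake - tp           # fake videos missed
    fp = round(n_real * fpr)   # real videos wrongly flagged
    return recall_and_f1(tp, fn, fp)


# Same fake subset and same detector; only the number of real negatives differs.
recall_all, f1_all = evaluate(n_fake=1000, n_real=10_000)
recall_sub, f1_sub = evaluate(n_fake=1000, n_real=1_000)

print(f"all 10,000 reals: recall={recall_all:.3f}, F1={f1_all:.3f}")
print(f"1,000-real subset: recall={recall_sub:.3f}, F1={f1_sub:.3f}")
```

In this toy setting recall is identical in both cases, but F1 is lower when all 10,000 real videos are used, because more negatives yield more false positives. So the per-subset numbers in Table 4 are only comparable to other work if the negative set is known.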
Thanks