Description
I was interested in the benchmarks for label projection, hoping to implement the "best" method (logistic regression) in a project.
Using the example pancreas dataset, I was unable to replicate the reported performance (e.g. 99% accuracy for the random split, which from experience seemed too high). Going through the code, I saw that "process_dataset" takes an already processed h5ad file, does an 80:20 split, and passes those subsets to the various methods.
Focusing on my example, which uses PCs as features: openproblems computes the PCA on all of the data before splitting, whereas I computed it only on the training set and then applied the same centering/scaling/rotation to the test set. Without doing that, the benchmarks don't reflect how a method would perform on new data.
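For illustration, a minimal sketch of the difference (using scikit-learn; the data, component count, and variable names are placeholders, not the actual openproblems pipeline):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder expression matrix and cell-type labels
X = np.random.rand(500, 2000)
y = np.random.randint(0, 5, 500)

# Leaky: PCA fitted on all cells first, then split (what the benchmark effectively does)
X_pcs_all = PCA(n_components=50).fit_transform(X)
X_tr_leaky, X_te_leaky, y_tr_l, y_te_l = train_test_split(X_pcs_all, y, test_size=0.2, random_state=0)

# Correct: split first, fit the PCA (centering/rotation) on training cells only,
# then apply the same transformation to the held-out test cells
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
pca = PCA(n_components=50).fit(X_tr)
clf = LogisticRegression(max_iter=1000).fit(pca.transform(X_tr), y_tr)
print(clf.score(pca.transform(X_te), y_te))  # honest estimate of performance on unseen cells
```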
As it currently stands, the metrics, and therefore the rankings, cannot be relied upon. This is especially a problem for methods that use PCA: in theory, the leakage could give them an apparent edge over methods that operate directly on genes.