Description
I was interested in the benchmarks for label projection, hoping to implement the "best" method (logistic regression) in a project.
Using the example pancreas dataset, I was unable to replicate the reported performance (e.g. 99% accuracy for the random split, which from experience seemed too high). Going through the code, I saw that "process_dataset" takes an already processed h5ad file, does an 80:20 split, and passes those subsets to the various methods.
Focusing on my example, which uses PCs as features: openproblems computes the PCA on all of the data before splitting, whereas I computed it only on the training set and then applied the same centering/scaling/rotation to the test set. Without doing that, the benchmarks don't reflect how a method would perform on new data.
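For illustration, a minimal sketch of the difference (using scikit-learn; the data, component count, and variable names are placeholders, not the actual openproblems pipeline):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder expression matrix and cell-type labels
X = np.random.rand(500, 2000)
y = np.random.randint(0, 5, 500)

# Leaky: PCA fitted on all cells first, then split (what the benchmark effectively does)
X_pcs_all = PCA(n_components=50).fit_transform(X)
X_tr_leaky, X_te_leaky, y_tr_l, y_te_l = train_test_split(X_pcs_all, y, test_size=0.2, random_state=0)

# Correct: split first, fit the PCA (centering/rotation) on training cells only,
# then apply the same transformation to the held-out test cells
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
pca = PCA(n_components=50).fit(X_tr)
clf = LogisticRegression(max_iter=1000).fit(pca.transform(X_tr), y_tr)
print(clf.score(pca.transform(X_te), y_te))  # honest estimate of performance on unseen cells
```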
As it currently stands, the metrics, and therefore the rankings, cannot be relied upon. This is especially a problem for methods that use PCA: in theory, the leakage could give them an apparent edge over methods that operate directly on genes.