Widespread inflated metrics for label projection due to leakage #386

@kthorner

Description

I was interested in the benchmarks for label projection, hoping to implement the "best" method (logistic regression) in a project.

Using the example pancreas dataset, I was unable to replicate the reported performance (e.g. 99% accuracy on a random split, which from experience seemed too high). Going through the code, I saw that "process_dataset" takes an already processed h5ad file, does an 80:20 split, and passes those subsets to the various methods.
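To make the issue concrete, here is a minimal sketch of the evaluation pattern I am describing (the synthetic matrix and the variable names are illustrative, not the actual openproblems code): the dimensionality reduction is fit on the full matrix, and only afterwards is the data split 80:20, so the test cells have already influenced the features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Illustrative stand-in for a processed expression matrix (cells x genes)
# and cell-type labels; not the actual pancreas data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2000))
y = rng.integers(0, 5, size=1000)

# Leaky: PCA is fit on ALL cells, including those that will end up in the test set.
X_pca = PCA(n_components=50).fit_transform(X)

# The 80:20 split happens only afterwards, so the test features were
# already shaped by information from the test cells themselves.
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.2, random_state=0
)
```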

Focusing on my example, which uses principal components as features: openproblems computes the PCA on all of the data, whereas I computed it only on the training set and then applied the same centering/scaling/rotation to the test set. Unless it is done that way, the benchmarks don't reflect how a method would perform on new data.
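A leakage-free version of the same workflow splits first and fits everything on the training cells only, for example with an sklearn Pipeline (again a sketch, continuing from the X and y defined above, not the openproblems implementation):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Split BEFORE any fitting so the test cells stay unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaler and PCA are fit on the training cells only; the fitted
# centering/scaling/rotation is then re-applied to the held-out cells.
clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=50),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```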

As it currently stands, the metrics, and therefore the rankings, cannot be relied upon. This is especially a problem for methods that use PCA: in theory the leakage could give them an apparent edge over methods that operate directly on genes.

Labels: bug