
Why use random sampling during inference instead of picking the X patches with maximum attention? #14

@Irra59

Description


Hi all,

I was reading the paper to understand the implementation, but something seems strange to me.

If I understand correctly, the goal of using sampling in the training phase is to give each patch an opportunity to have its attention score updated when it is sampled from the distribution. The authors also prove that this yields the minimum-variance estimator.

But at inference time, why not just pick the N patches with the highest attention instead of repeating the same sampling process? How can sampling be more accurate than taking the best-attention patches, given that the model has already been trained?
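To make the two strategies concrete, here is a small NumPy sketch contrasting deterministic top-N selection with sampling from the attention distribution. The attention scores are made up for illustration; this is not the paper's implementation, just the difference the question is asking about:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-patch attention scores (e.g. CLS-token attention
# over the 14x14 = 196 patches of a ViT); values are made up.
attn = rng.random(196)
probs = attn / attn.sum()  # normalise into a sampling distribution

n = 10  # number of patches to keep (ATS-10 style)

# Option A: deterministic top-N, as proposed in the question.
top_n = np.argsort(probs)[-n:]

# Option B: stochastic sampling without replacement, weighted by
# the attention distribution (the training-time behaviour).
sampled = rng.choice(len(probs), size=n, replace=False, p=probs)
```

Top-N always returns the same patches; weighted sampling can still pick a lower-scored patch, which is what gives every patch a nonzero chance of being selected.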

What confuses me even more is that the authors compare ATS-10 and ATS-50 for inference, but never say what sampling size they use during training.

TL;DR: Why sampling during inference and not taking the maximum attention values?

I also wonder about the manual selection of the patch size. Does it mean this algorithm will be inefficient for classification tasks where objects can occupy different proportions of the image? Could this work be adapted for an object detection task, similar to YOLO?
