Hello,
First of all, thank you for your excellent work—both the paper and this repository are incredibly insightful.
I noticed in the CNNDetection repository that they achieved better results using uncropped images. I’m interested in experimenting with this approach using your model. However, since CLIP:ViT models are sensitive to image dimensions, I’m considering using CLIP:ResNet models instead (especially CLIP:RN50, as mentioned in your paper). Although CLIP:ResNet models tend to yield slightly lower performance than CLIP:ViT, they offer greater flexibility in handling varying image dimensions, which could be beneficial for this experiment.
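For context, here is a rough sketch of what I have in mind. This is only my assumption of how the RN50 encoder could accept uncropped inputs (the positional-embedding interpolation, the `interpolate_pos_embedding` / `encode_uncropped` helpers, and the image path are my own, not from your code), and I’m not sure it matches the preprocessing your detector was trained with:

```python
# Sketch only: extract CLIP:RN50 image features from uncropped images of
# varying size by bicubically resizing the attention-pool positional embedding.
# The helper names below are hypothetical, not part of this repository.
import math
import torch
import torch.nn.functional as F
import clip
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("RN50", device=device)
model.eval()

# CLIP's normalization constants, but without Resize/CenterCrop so the
# image keeps its original dimensions.
to_tensor = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])


def interpolate_pos_embedding(pos: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Resize the (1 + 7*7, C) positional embedding to (1 + h*w, C)."""
    cls_pos, grid_pos = pos[:1], pos[1:]
    side = int(math.sqrt(grid_pos.shape[0]))  # 7 for RN50 at 224 px
    grid = grid_pos.reshape(side, side, -1).permute(2, 0, 1).unsqueeze(0).float()
    grid = F.interpolate(grid, size=(h, w), mode="bicubic", align_corners=False)
    grid = grid.squeeze(0).permute(1, 2, 0).reshape(h * w, -1).to(pos.dtype)
    return torch.cat([cls_pos, grid], dim=0)


@torch.no_grad()
def encode_uncropped(img: Image.Image) -> torch.Tensor:
    x = to_tensor(img.convert("RGB")).unsqueeze(0).to(device)
    # Zero-pad to a multiple of 32 so the feature map is exactly (H/32, W/32).
    _, _, H, W = x.shape
    x = F.pad(x, (0, (-W) % 32, 0, (-H) % 32))
    h, w = x.shape[-2] // 32, x.shape[-1] // 32

    # Temporarily swap in a positional embedding matching the new grid size.
    attnpool = model.visual.attnpool
    original = attnpool.positional_embedding
    attnpool.positional_embedding = torch.nn.Parameter(
        interpolate_pos_embedding(original.data, h, w))
    try:
        feats = model.encode_image(x)  # (1, 1024) image embedding
    finally:
        attnpool.positional_embedding = original
    return feats


features = encode_uncropped(Image.open("example.png"))  # hypothetical path
print(features.shape)
```

My assumption is that I could then feed these 1024-d features to your linear classifier, which is why the RN50 weights would be so helpful.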
Do you still have the pretrained weights for CLIP:RN50?