Hi! First of all, congratulations on the impressive paper and this very well-written repo. I've been playing around with this model, and I have a question about Table 4 / C.1 in the paper.
In columns 1 and 2, you report unimodal evaluation results on NLP and vision tasks, but the pretraining dataset used there is PMD.
If I understand correctly, the standard unimodal pretraining pipeline is to train the image encoder on ImageNet-1k and the text encoder on CCNews+BookCorpus.
I was wondering if you have GLUE fine-tuning and image linear-probe results for (unimodal) models pretrained on ImageNet-1k and CCNews+BookCorpus.
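Just to make sure we're talking about the same evaluation: by "image linear probe" I mean something like the minimal PyTorch sketch below. `FrozenEncoder`, the feature dimension, and the random tensors are all placeholders of mine, not the repo's actual API.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the pretrained unimodal image encoder -- I'm not
# sure how the repo exposes it, so this is just a frozen feature extractor.
class FrozenEncoder(nn.Module):
    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.backbone = nn.Identity()  # placeholder for the real vision tower
        self.feat_dim = feat_dim

    @torch.no_grad()
    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.backbone(images)

encoder = FrozenEncoder()
probe = nn.Linear(encoder.feat_dim, 1000)  # 1000 = ImageNet-1k classes
optimizer = torch.optim.SGD(probe.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# One toy training step; random tensors stand in for an ImageNet-1k batch.
# Only the linear probe is trained -- the encoder stays frozen.
features = encoder(torch.randn(8, encoder.feat_dim))
loss = criterion(probe(features), torch.randint(0, 1000, (8,)))
loss.backward()
optimizer.step()
```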
Thanks a lot!