MSc Thesis: "Audio-Visual Self-Supervised Representation Learning in-the-wild"
We provide checkpoints for two models pre-trained on a subset of VGGSound containing 50,000 videos. The first model is trained with Cross-modal Instance Discrimination (xID), while the second is based on the recently proposed VICReg method (a minimal checkpoint-loading sketch follows the table).
| Method | Checkpoint (100 epochs) |
|---|---|
| xID | download link |
| VICReg | download link |
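If you want to inspect a downloaded checkpoint before use, the sketch below may help. It assumes the files are standard PyTorch checkpoints saved as dictionaries; the file name and the `model` key are assumptions, not the repository's confirmed format.

```python
import torch

# Inspect a downloaded checkpoint (hypothetical file name and keys;
# adjust them to the actual checkpoint contents).
ckpt = torch.load("xid_vggsound_100ep.pth", map_location="cpu")
print(list(ckpt.keys()))  # e.g. ['epoch', 'model', 'optimizer']

# Assuming the weights live under a 'model' key:
# model.load_state_dict(ckpt["model"])
```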
To train a model with the xID method, run the following (assuming the DDP strategy is used):
```
python3 main-ssl.py configs/VGGSound-N1024.yaml --multiprocessing-distributed
```

For the VICReg method, run:
```
python3 main-vicreg.py configs/VGGSound-VICReg.yaml --multiprocessing-distributed
```

To avoid data parallelism, drop the --multiprocessing-distributed argument and set the --gpu argument of either script to a specific device id (e.g. 0 for the first GPU).
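For example, to run xID pre-training on the first GPU only:

```
python3 main-ssl.py configs/VGGSound-N1024.yaml --gpu 0
```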
For the linear evaluation experiment, where only a linear classifier is trained on top of frozen pre-trained features, run the following (e.g. for the UCF-101 dataset and the model pre-trained with xID):
```
python3 eval-action-recg-linear.py configs/ucf/8at16-linear.yaml configs/VGGSound-N1024.yaml --distributed
```

Note that this script does not yet support multi-node evaluation.
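Conceptually, linear evaluation freezes the pre-trained backbone and fits only a linear classifier on its features. Below is a minimal PyTorch sketch of this idea; all names are hypothetical and do not reflect the repository's actual API.

```python
import torch
import torch.nn as nn

# Linear-probing sketch (hypothetical names, not the repo's API).
# 'backbone' is a pre-trained encoder producing feat_dim-dimensional features.
def build_linear_probe(backbone: nn.Module, feat_dim: int, num_classes: int):
    for p in backbone.parameters():
        p.requires_grad = False      # freeze the pre-trained weights
    backbone.eval()                  # keep BatchNorm/dropout in eval mode
    classifier = nn.Linear(feat_dim, num_classes)
    # Only the classifier's parameters are optimized.
    optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
    return classifier, optimizer
```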
Final results on both UCF-101 and HMDB-51 datasets are shown in the following table:
| Method | Top-1 Acc. (UCF-101) | Top-5 Acc. (UCF-101) | Top-1 Acc. (HMDB-51) | Top-5 Acc. (HMDB-51) |
|---|---|---|---|---|
| xID | 51.20% | 80.91% | 28.08% | 61.29% |
| VICReg | 39.75% | 71.30% | 21.85% | 52.69% |
For the full fine-tuning experiment, where the entire pre-trained network is trained end-to-end, run the following (e.g. for the HMDB-51 dataset and the model pre-trained with VICReg):
```
python3 eval-action-recg.py configs/hmdb51/8at16-fold1.yaml configs/VGGSound-VICReg.yaml --distributed
```

Note that this script does not yet support multi-node evaluation.
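In contrast to linear probing, fine-tuning also updates the backbone, commonly with a smaller learning rate for the pre-trained layers than for the freshly initialized classifier head. A hedged sketch of this common setup (names and learning rates are hypothetical, not the repository's settings):

```python
import torch
import torch.nn as nn

# Full fine-tuning sketch (hypothetical names, not the repo's API).
def build_finetune_optimizer(backbone: nn.Module, classifier: nn.Linear):
    for p in backbone.parameters():
        p.requires_grad = True        # the backbone is trained as well
    # Common choice: smaller learning rate for pre-trained weights
    # than for the freshly initialized classifier head.
    return torch.optim.SGD(
        [
            {"params": backbone.parameters(), "lr": 0.001},
            {"params": classifier.parameters(), "lr": 0.01},
        ],
        momentum=0.9,
    )
```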
Final results on both UCF-101 and HMDB-51 datasets are shown in the following table:
| Method | Top-1 Acc. (UCF-101) | Top-5 Acc. (UCF-101) | Top-1 Acc. (HMDB-51) | Top-5 Acc. (HMDB-51) |
|---|---|---|---|---|
| xID | 73.22% | 92.78% | 42.85% | 73.69% |
| VICReg | 59.53% | 85.94% | 34.65% | 68.96% |
In this experiment, we test how well the self-supervised models generalize to data from unseen classes (i.e. classes not present in the pre-training dataset). To split the downstream classes into so-called seen and unseen concepts, use the label_similarities.ipynb notebook. Based on our results, the sets of unseen concepts for UCF-101 and HMDB-51 can be found in the datasets/rest_classes/ directory.
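The split can be viewed as thresholding a semantic similarity between each downstream class name and the pre-training labels. The sketch below illustrates this idea under the assumption that class-name text embeddings are already available; the actual procedure is the one in label_similarities.ipynb, and everything here (function name, threshold) is hypothetical.

```python
import numpy as np

# Hypothetical seen/unseen split by label similarity.
# 'downstream' and 'pretraining' map class names to text embeddings
# (e.g. from any sentence encoder); the real logic lives in
# label_similarities.ipynb.
def split_unseen(downstream: dict, pretraining: dict, threshold: float = 0.5):
    pre = np.stack(list(pretraining.values()))
    pre /= np.linalg.norm(pre, axis=1, keepdims=True)
    unseen = []
    for name, vec in downstream.items():
        v = vec / np.linalg.norm(vec)
        # A class counts as "unseen" if no pre-training label is similar enough.
        if (pre @ v).max() < threshold:
            unseen.append(name)
    return unseen
```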
To perform this experiment, run the following (e.g. for the xID model and the UCF-101 dataset, using 20% of the training data per class to tune the linear classifier):
```
python3 eval-action-recg-linear.py configs/ucf/8at16-linear.yaml configs/VGGSound-N1024.yaml --distributed --few-shot-ratio 0.2 --use-rest-classes
```
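To sweep over several few-shot ratios, a simple shell loop over the same command can be used (the ratio values below are illustrative):

```
for r in 0.1 0.2 0.5 1.0; do
    python3 eval-action-recg-linear.py configs/ucf/8at16-linear.yaml configs/VGGSound-N1024.yaml \
        --distributed --few-shot-ratio $r --use-rest-classes
done
```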
Final results are depicted in the following plots: