In your train_vgg script, the query (text) feature dimension is set to 300, but the clip_text_features in the Charades-STA dataset have a dimension of 512. Why is 300 used here? The relevant part of the script:
# text features
if [[ ${t_feat_type} == "clip" ]]; then
  t_feat_dir=${feat_root}/clip_text_features/
  t_feat_dim=300
else
  echo "Wrong arg for t_feat_type."
  exit 1
fi
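For comparison, this is roughly what I expected the branch to look like if the dimension is meant to match the 512-d CLIP text features I see in the dataset. It is only a sketch of my assumption, not a confirmed fix: the `feat_root` and `t_feat_type` values here are placeholders so the snippet runs on its own, and 512 is simply the length I observed.

```shell
#!/usr/bin/env bash
# Placeholder values for illustration only; the real script sets these elsewhere.
feat_root="features"
t_feat_type="clip"

# text features
if [[ ${t_feat_type} == "clip" ]]; then
  t_feat_dir=${feat_root}/clip_text_features/
  # Assumption: 512 matches the length of the clip_text_features I observed.
  t_feat_dim=512
else
  echo "Wrong arg for t_feat_type."
  exit 1
fi

echo "t_feat_dir=${t_feat_dir} t_feat_dim=${t_feat_dim}"
```

If 300 is intentional (for example, because the features are projected or truncated somewhere before being fed to the model), it would help to know where that happens.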