The links to the acoustic and visual features are here:
audio_embedding_6373.npy: the embedding table of 6373-dimensional acoustic features for each utterance, extracted with openSMILE

video_embedding_4096.npy: the embedding table of 4096-dimensional visual features for each utterance, extracted with 3D-CNN
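
Once downloaded, the two tables can be read with NumPy. The snippet below is a minimal sketch, not part of the original release: `load_embedding_table` is a hypothetical helper, and it assumes each file holds a 2-D array with one row per utterance and the feature dimension given in the file name. How rows map to utterance IDs is not stated here, so check that against the dataset's own index files.

```python
import numpy as np


def load_embedding_table(path, expected_dim):
    """Load a .npy embedding table and check its feature dimension.

    Assumption (not stated in the source): the table is a 2-D array
    with one row per utterance and `expected_dim` columns.
    """
    table = np.load(path)
    if table.ndim != 2 or table.shape[1] != expected_dim:
        raise ValueError(
            f"{path}: expected shape (n, {expected_dim}), got {table.shape}"
        )
    return table
```

For example, `load_embedding_table("audio_embedding_6373.npy", 6373)` and `load_embedding_table("video_embedding_4096.npy", 4096)` would return the acoustic and visual tables, assuming the files sit in the working directory.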