This repository is the official implementation of the voice activity projection (VAP) model enhanced with audio and face encoders.
The code extends Inoue's real-time VAP repository.
Please prepare the following environment beforehand:
- Ubuntu 20.04
- Conda
As an exception, we used Windows 11 for facial image sequence extraction.
conda env create -f py311_rvap.yml
pip install -r requirements.txt
pip install -r requirements_cu118 --index-url https://download.pytorch.org/whl/cu118
- Copy the files in the asset directory of Inoue's real-time VAP repository into the asset directory of this repository.
For facial image sequence extraction on Windows 11, please set up the environment as follows.
conda env create -f py311_dlib.yml
You can download the pretrained models from here and place those directories into the pretrained_models directory.
Once you specify the checkpoint directory in evaluation.py, the best model will be selected automatically.
pretrained_models
- trained_data_audio_paris_256
- trained_data_onishi_paris_256
- ...
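How the best model is selected from a checkpoint directory is not detailed here; a minimal sketch of one plausible scheme, assuming Lightning-style checkpoint filenames that embed the validation loss (the `epoch=..-val_loss=...ckpt` pattern is an assumption, not taken from this repository):

```python
import re
from pathlib import Path

def pick_best_checkpoint(ckpt_dir, names=None):
    """Pick the checkpoint with the lowest embedded validation loss.

    Assumes filenames like 'epoch=12-val_loss=0.4321.ckpt' (a hypothetical
    pattern; adjust the regex to match the actual checkpoint names).
    """
    pattern = re.compile(r"val_loss=([0-9.]+)\.ckpt$")
    files = names if names is not None else [p.name for p in Path(ckpt_dir).glob("*.ckpt")]
    candidates = []
    for name in files:
        m = pattern.search(name)
        if m:
            candidates.append((float(m.group(1)), name))
    if not candidates:
        raise FileNotFoundError(f"no parseable checkpoints in {ckpt_dir}")
    # Lowest validation loss wins.
    return min(candidates)[1]
```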
You need to download the NoXi dataset (at least the Paris subset is required).
Place the data in the following structure.
noxi_orig
- Paris_01
--- audio_expert.wav
--- audio_mix.wav
--- audio_novice.wav
--- non_verbal_expert.csv
--- non_verbal_novice.csv
--- vad_expert.txt
--- vad_novice.txt
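To catch missing files before training, a small stdlib check can verify the layout above (a sketch; the required filenames come from the listing, everything else is an illustration):

```python
from pathlib import Path

# Files expected inside every session directory (from the listing above).
REQUIRED = [
    "audio_expert.wav", "audio_mix.wav", "audio_novice.wav",
    "non_verbal_expert.csv", "non_verbal_novice.csv",
    "vad_expert.txt", "vad_novice.txt",
]

def check_noxi_layout(root="noxi_orig"):
    """Return {session_name: [missing files]} for sessions with gaps."""
    problems = {}
    for session in sorted(Path(root).iterdir()):
        if not session.is_dir():
            continue
        missing = [f for f in REQUIRED if not (session / f).exists()]
        if missing:
            problems[session.name] = missing
    return problems
```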
Place the video data from the NoXi dataset as follows (you might need to rename the files).
face_extract/src_paris
- Paris_01-video_expert.mp4
- Paris_01-video_novice.mp4
Download mmod_human_face_detector.dat and shape_predictor_5_face_landmarks.dat from dlib and place them into face_extract.
Extract cropped face image sequences as follows.
conda activate py311_dlib
cd face_extract
python preprocess_face_dlib.py
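The model operates at 25 Hz (the training command below passes --vap_frame_hz 25), so the extracted face image sequence has to be aligned to that frame rate. A sketch of the index mapping from source video frames to model frames, assuming nearest-frame resampling (the helper is hypothetical; the actual alignment is done inside preprocess_face_dlib.py):

```python
def frame_indices(duration_sec, src_fps, target_hz=25.0):
    """Map each target-rate time step to the nearest source video frame.

    A 25 fps source yields an identity mapping; other rates are resampled
    by nearest neighbour (an assumption about the preprocessing, not a
    statement about the repository's actual code).
    """
    n_target = int(duration_sec * target_hz)
    n_src = int(duration_sec * src_fps)
    # Clamp to the last available frame to guard against rounding overflow.
    return [min(round(t * src_fps / target_hz), n_src - 1) for t in range(n_target)]
```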
Copy the extracted face image sequences from the noxi directory on the Windows machine to the noxi directory on the Ubuntu machine.
Run the following command to prepare for the training.
conda activate py311_rvap
python noxi_00_modify_format.py
python noxi_01_filter_csv_by_keyword.py
python noxi_02_add_face_image.py
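The name noxi_01_filter_csv_by_keyword.py suggests that session lists are filtered by a substring such as the subset name; a minimal stdlib sketch of that idea (the keyword and row layout are assumptions, not the script's actual interface):

```python
import csv

def filter_rows_by_keyword(in_path, out_path, keyword):
    """Keep only CSV rows where any field contains `keyword`.

    E.g. keyword='Paris' would keep rows referring to the Paris subset
    (illustrative behaviour; the repository script may differ).
    """
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        kept = 0
        for row in reader:
            if any(keyword in field for field in row):
                writer.writerow(row)
                kept += 1
    return kept
```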
Start the training.
conda activate py311_rvap
python finetune.py \
--data_train_path noxi/train_paris_ext.csv --data_val_path noxi/valid_paris_ext.csv --data_test_path noxi/test_paris_ext.csv \
--data_use_cache --data_exclude_av_cache --data_preload_av --data_cache_dir tmp_cache4 \
--data_global_batch_size 256 \
--data_batch_size 4 \
--vap_encoder_type cpc \
--vap_pretrained_cpc asset/cpc/60k_epoch4-d0f474de.pt \
--vap_pretrained_vap asset/vap/vap_state_dict_eng_20hz_2500msec.pt \
--vap_freeze_encoder 1 --vap_channel_layers 1 \
--vap_cross_layers 3 --vap_context_limit -1 --vap_context_limit_cpc_sec -1 --vap_frame_hz 25 \
--vap_multimodal \
--vap_use_face_encoder --vap_pretrained_face_encoder asset/FormerDFER/DFER_encoder_weight_only.pt \
--vap_mode 1 \
--event_frame_hz 25 \
--opt_early_stopping 1 --opt_patience 5 \
--opt_save_top_k 5 --opt_max_epochs 200 \
--opt_saved_dir trained_data_mode1_paris_256/ --opt_log_dir lightning_logs_mode1_paris_256 \
--devices 0,1 --seed 0
Depending on your GPU VRAM size, you might need to decrease the data_batch_size (we used two 12 GB VRAM GPUs for this setup).
You can change the devices option if you use a single-GPU setting (e.g., --devices 0).
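The gap between --data_global_batch_size 256 and --data_batch_size 4 implies gradient accumulation. Assuming the usual convention (global batch = per-device batch × number of GPUs × accumulation steps — an assumption about how these flags interact, not verified against finetune.py), the arithmetic is:

```python
def accumulation_steps(global_batch, per_device_batch, num_devices):
    """Gradient-accumulation steps needed to reach the global batch size.

    Assumes global_batch = per_device_batch * num_devices * accumulation
    (a common convention; not taken from this repository's code).
    """
    per_step = per_device_batch * num_devices
    if global_batch % per_step:
        raise ValueError("global batch must be divisible by batch * devices")
    return global_batch // per_step

# The command above: 256 global, per-device batch 4, two GPUs.
steps = accumulation_steps(256, 4, 2)  # 32 accumulation steps
```

This is why decreasing data_batch_size for a smaller GPU does not change the effective batch size, only the number of accumulation steps.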
You can evaluate the pretrained model with the following command.
conda activate py311_rvap
python evaluation.py \
--data_train_path noxi/train_paris_ext.csv --data_val_path noxi/valid_paris_ext.csv --data_test_path noxi/test_paris_ext.csv \
--data_use_cache --data_exclude_av_cache --data_preload_av --data_cache_dir tmp_cache4 \
--vap_freeze_encoder 1 --vap_channel_layers 1 --vap_cross_layers 3 --vap_context_limit -1 --vap_context_limit_cpc_sec -1 --vap_frame_hz 25 \
--event_frame_hz 25 \
--devices 0 --seed 0 \
--vap_multimodal --vap_use_face_encoder --vap_pretrained_face_encoder asset/FormerDFER/DFER_encoder_weight_only.pt --vap_mode 1 --vap_load_pretrained 0 \
--checkpoint trained_data_mode1_paris_256
You can change the evaluation target model by specifying the target directory with the checkpoint option.
TBA