About • Examples • Installation • How To Use • Final results • Credits • Authors • License
## About

This repository contains an end-to-end pipeline for solving the Audio-Visual Speech Separation (AVSS) task with PyTorch. The implemented model is RTFS-Net.

See the task assignment here.

See the report for more information.
## Examples

| Mixed audio |
|---|
| mix.mov |

- AVSS model:

  | speaker 1 | speaker 2 |
  |---|---|
  | avss_s1.mov | avss_s2.mov |

- Audio-only model:

  | speaker 1 | speaker 2 |
  |---|---|
  | ss_s1.mov | ss_s2.mov |
## Installation

Follow these steps to install the project:

- (Optional) Create and activate a new environment using `conda`:

  ```bash
  # create env
  conda create -n AVSS python=3.11

  # activate env
  conda activate AVSS
  ```

- Install all required packages:

  ```bash
  pip install -r requirements.txt
  ```

- Download the model checkpoint, vocab and language model:

  ```bash
  python scripts/download_weights.py
  ```
## How To Use

1. To separate two-speaker mixed audio, the audio directory should have the following format:

```
NameOfTheDirectoryWithTestDataset
├── audio
    ├── mix
        ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
        ├── FirstSpeakerID2_SecondSpeakerID2.wav
        .
        .
        .
        └── FirstSpeakerIDn_SecondSpeakerIDn.wav
```
Run the following command:

```bash
python inference.py \
    datasets=inference_custom \
    datasets.test.data_dir=TEST_DATASET_PATH \
    inferencer.save_path=SAVE_PATH \
    model=no_video_rtfs \
    inferencer.from_pretrained='data/other/no_video_model.pth'
```

where `SAVE_PATH` is the path where separation predictions will be saved and `TEST_DATASET_PATH` is the directory with test data.
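If you assemble such a directory programmatically, a minimal sketch like the one below may help. It relies only on the `audio/mix` layout described above; the `build_mix_dir` helper and the example paths are hypothetical.

```python
from pathlib import Path
import shutil

def build_mix_dir(dst_root: str, mix_files: list[str]) -> None:
    """Copy mixed recordings into the audio/mix layout expected by inference.py.

    Assumes each source file is already named <FirstSpeakerID>_<SecondSpeakerID>.<ext>.
    """
    mix_dir = Path(dst_root) / "audio" / "mix"
    mix_dir.mkdir(parents=True, exist_ok=True)
    for src in mix_files:
        src_path = Path(src)
        if src_path.suffix.lower() not in {".wav", ".flac", ".mp3"}:
            raise ValueError(f"Unsupported audio format: {src_path.name}")
        shutil.copy(src_path, mix_dir / src_path.name)

# Example (hypothetical paths):
# build_mix_dir("MyTestDataset", ["mixes/FirstSpeakerID1_SecondSpeakerID1.wav"])
```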
2. To separate two-speaker mixed audio using reference mouth recordings, the audio and video directories should have the following format:

```
NameOfTheDirectoryWithTestDataset
├── audio
│   ├── mix
│   │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   │   .
│   │   .
│   │   .
│   │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
└── mouths # contains video information for all speakers
    ├── FirstOrSecondSpeakerID1.npz # npz mouth-crop
    ├── FirstOrSecondSpeakerID2.npz
    .
    .
    .
    └── FirstOrSecondSpeakerIDn.npz
```
Run the following command:

```bash
python inference.py \
    datasets=inference_custom \
    datasets.test.data_dir=TEST_DATASET_PATH \
    inferencer.save_path=SAVE_PATH
```

where `SAVE_PATH` is the path where separation predictions will be saved and `TEST_DATASET_PATH` is the directory with test data.
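Mouth crops are matched to a mix by the speaker IDs in its filename: `ID1_ID2.wav` needs `mouths/ID1.npz` and `mouths/ID2.npz`. The sketch below illustrates that lookup and a quick inspection of one crop; the `"data"` array key and the frames-by-height-by-width layout are assumptions that depend on how the crops were produced.

```python
from pathlib import Path
import numpy as np

def mouth_files_for_mix(mix_path: str, mouths_dir: str) -> list[Path]:
    """Return the two mouth-crop .npz files referenced by a mix filename."""
    first_id, second_id = Path(mix_path).stem.split("_")
    return [Path(mouths_dir) / f"{speaker_id}.npz" for speaker_id in (first_id, second_id)]

# Quick inspection of one crop (the "data" key is an assumption):
# crops = np.load("NameOfTheDirectoryWithTestDataset/mouths/FirstOrSecondSpeakerID1.npz")
# frames = crops["data"]  # e.g. (num_frames, height, width)
# print(frames.shape, frames.dtype)
```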
3. To separate two-speaker mixed audio and evaluate the results against the ground truth separation, the audio directory should have the following format:

```
NameOfTheDirectoryWithTestDataset
├── audio
    ├── mix
    │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
    │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
    │   .
    │   .
    │   .
    │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
    ├── s1 # ground truth for the speaker s1, may not be given
    │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
    │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
    │   .
    │   .
    │   .
    │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
    └── s2 # ground truth for the speaker s2, may not be given
        ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
        ├── FirstSpeakerID2_SecondSpeakerID2.wav
        .
        .
        .
        └── FirstSpeakerIDn_SecondSpeakerIDn.wav
```
Run the following command:

```bash
python inference.py \
    datasets=inference_custom \
    datasets.test.data_dir=TEST_DATASET_PATH \
    inferencer.save_path=SAVE_PATH \
    model=no_video_rtfs \
    inferencer.from_pretrained='data/other/no_video_model.pth' \
    metrics.inference.0.use_pit=True \
    metrics.inference.1.use_pit=True \
    metrics.inference.2.use_pit=True \
    metrics.inference.3.use_pit=True
```

where `SAVE_PATH` is the path where separation predictions will be saved and `TEST_DATASET_PATH` is the directory with test data.
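The `use_pit=True` flags make the metrics permutation-invariant: each metric is computed for both ways of pairing predictions with references, and the better pairing is kept. The NumPy sketch below illustrates the idea for SI-SNR; it is a simplified illustration, not the implementation used by this repository's metric classes.

```python
import itertools
import numpy as np

def si_snr(est: np.ndarray, ref: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB for 1-D signals of equal length."""
    est = est - est.mean()
    ref = ref - ref.mean()
    target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref  # projection onto the reference
    noise = est - target
    return 10 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))

def pit_si_snr(ests: list[np.ndarray], refs: list[np.ndarray]) -> float:
    """Average SI-SNR under the best permutation of speakers."""
    best = -np.inf
    for perm in itertools.permutations(range(len(refs))):
        score = np.mean([si_snr(est, refs[p]) for est, p in zip(ests, perm)])
        best = max(best, float(score))
    return best
```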
4. To separate two-speaker mixed audio using reference mouth recordings and evaluate the results against the ground truth separation, the audio and video directories should have the following format:

```
NameOfTheDirectoryWithTestDataset
├── audio
│   ├── mix
│   │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   │   .
│   │   .
│   │   .
│   │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
│   ├── s1 # ground truth for the speaker s1, may not be given
│   │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   │   .
│   │   .
│   │   .
│   │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
│   └── s2 # ground truth for the speaker s2, may not be given
│       ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│       ├── FirstSpeakerID2_SecondSpeakerID2.wav
│       .
│       .
│       .
│       └── FirstSpeakerIDn_SecondSpeakerIDn.wav
└── mouths # contains video information for all speakers
    ├── FirstOrSecondSpeakerID1.npz # npz mouth-crop
    ├── FirstOrSecondSpeakerID2.npz
    .
    .
    .
    └── FirstOrSecondSpeakerIDn.npz
```
Run the following command:

```bash
python inference.py \
    datasets=inference_custom \
    datasets.test.data_dir=TEST_DATASET_PATH \
    inferencer.save_path=SAVE_PATH
```

where `SAVE_PATH` is the path where separation predictions will be saved and `TEST_DATASET_PATH` is the directory with test data.
5. To evaluate the model using only predicted and ground truth audio separations, ensure that the directories containing them have the following format:

```
PredictDir
├── mix
│   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   .
│   .
│   .
│   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
├── s1 # prediction for the speaker s1, may not be given
│   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   .
│   .
│   .
│   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
└── s2 # prediction for the speaker s2
    ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
    ├── FirstSpeakerID2_SecondSpeakerID2.wav
    .
    .
    .
    └── FirstSpeakerIDn_SecondSpeakerIDn.wav
```

```
GroundTruthDir
├── mix
│   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   .
│   .
│   .
│   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
├── s1 # ground truth for the speaker s1, may not be given
│   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   .
│   .
│   .
│   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
└── s2 # ground truth for the speaker s2, may not be given
    ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
    ├── FirstSpeakerID2_SecondSpeakerID2.wav
    .
    .
    .
    └── FirstSpeakerIDn_SecondSpeakerIDn.wav
```
Run the following command:

```bash
python scripts/calculate_metrics.py \
    predict_dir=PREDICT_DIR_PATH \
    gt_dir=GROUND_TRUTH_DIR_PATH
```
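Predictions and references are paired by filename across the `s1`/`s2` subdirectories. The sketch below shows that pairing and an average SI-SNRi (SI-SNR of the estimate minus SI-SNR of the mixture, both measured against the reference); it reuses the `si_snr` helper from the earlier PIT sketch and assumes `soundfile` is available for reading audio, so treat it as an illustration rather than the script's actual internals.

```python
from pathlib import Path
import numpy as np
import soundfile as sf  # assumed to be available for reading wav/flac files

def average_si_snri(predict_dir: str, gt_dir: str, speaker: str = "s1") -> float:
    """Average SI-SNR improvement over the mixture for one speaker folder."""
    scores = []
    for pred_path in sorted((Path(predict_dir) / speaker).glob("*.wav")):
        est, _ = sf.read(pred_path)
        ref, _ = sf.read(Path(gt_dir) / speaker / pred_path.name)
        mix, _ = sf.read(Path(gt_dir) / "mix" / pred_path.name)
        n = min(len(est), len(ref), len(mix))  # guard against off-by-a-few length mismatches
        scores.append(si_snr(est[:n], ref[:n]) - si_snr(mix[:n], ref[:n]))  # si_snr from the sketch above
    return float(np.mean(scores))
```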
- Finally, if you want to reproduce results from here, run the following code:

  ```bash
  python inference.py \
      inferencer.save_path=SAVE_PATH
  ```

  Feel free to choose what kind of metrics you want to evaluate (see this config).
To calculate efficiency metrics (MACs, the memory required to process one batch with a single mix and video, the number of model parameters, the size of the saved model on disk, the time for one training step and one inference step, and the real-time factor), run the following code:

```bash
python scripts/calculate_efficiency.py dataloader.batch_size=1
```
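As a rough sanity check, the parameter count and real-time factor can also be estimated with plain PyTorch, as in the sketch below. The `model(mix)` call signature, the 2-second 16 kHz dummy mixture, and the absence of a video stream are illustrative assumptions; MACs and memory profiling need dedicated tooling, which the script above wraps.

```python
import time
import torch

def quick_efficiency_stats(model: torch.nn.Module, sample_rate: int = 16000, seconds: float = 2.0) -> dict:
    """Rough parameter count and real-time factor for a single forward pass."""
    n_params = sum(p.numel() for p in model.parameters())
    mix = torch.randn(1, int(sample_rate * seconds))  # dummy batch with a single mix
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        model(mix)  # assumed forward signature; adapt to the actual model interface
        elapsed = time.perf_counter() - start
    rtf = elapsed / seconds  # real-time factor = processing time / audio duration
    return {"params_M": n_params / 1e6, "inference_s": elapsed, "rtf": rtf}
```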
- To reproduce the AVSS model, train it with:

  ```bash
  python train.py
  ```

- To reproduce the audio-only SS model, train it with:

  ```bash
  python train.py \
      -cn=rtfs_pit dataloader.batch_size=10
  ```

  Training from scratch on a V100 GPU takes around 10 days for the AVSS model and around 4 days for the audio-only SS model.
## Final results

| Model | SI-SNRi | SDRi | PESQ | STOI | Params (M) | MACs (G) | Memory (GB) | Train time (s) | Infer. time (ms) | Real-time factor | Size on disk (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RTFS-Net-12 | 12.95 | 13.33 | 2.42 | 0.92 | 0.771 | 88.5 | 7.12 | 1.85 | 1.17 | 0.58 | 57.04 |
| Audio-only model (R=4) | 11.35 | 11.80 | 1.79 | 0.61 | 0.772 | 17.9 | 2.33 | 0.66 | 0.22 | 0.11 | 9.52 |
## Credits

This repository is based on a PyTorch Project Template.

The pre-trained video feature extractor was taken from the Lip-reading repository. Thanks to the authors for sharing their code.