About • Examples • Installation • How To Use • Final results • Credits • Authors • License
## About

This repository contains an end-to-end pipeline for solving the Audio-Visual Speech Separation (AVSS) task with PyTorch. The implemented model is RTFS-Net.

See the task assignment here.

See the report for more information.
## Examples

| Mixed audio |
|---|
| mix.mov |

- AVSS model:

  | speaker 1 | speaker 2 |
  |---|---|
  | avss_s1.mov | avss_s2.mov |

- Audio-only model:

  | speaker 1 | speaker 2 |
  |---|---|
  | ss_s1.mov | ss_s2.mov |
## Installation

Follow these steps to install the project:

- (Optional) Create and activate a new environment using `conda`:

  ```bash
  # create env
  conda create -n AVSS python=3.11

  # activate env
  conda activate AVSS
  ```

- Install all required packages:

  ```bash
  pip install -r requirements.txt
  ```

- Download the model checkpoint, vocab and language model:

  ```bash
  python scripts/download_weights.py
  ```
## How To Use

1. To separate two-speaker mixed audio, the audio directory should have the following format:

```
NameOfTheDirectoryWithTestDataset
├── audio
    ├── mix
        ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
        ├── FirstSpeakerID2_SecondSpeakerID2.wav
        .
        .
        .
        └── FirstSpeakerIDn_SecondSpeakerIDn.wav
```
Run the following command:

```bash
python inference.py \
    datasets=inference_custom \
    datasets.test.data_dir=TEST_DATASET_PATH \
    inferencer.save_path=SAVE_PATH \
    model=no_video_rtfs \
    inferencer.from_pretrained='data/other/no_video_model.pth'
```

where `SAVE_PATH` is the path where separation predictions will be saved and `TEST_DATASET_PATH` is the directory with test data.
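If you assemble such a directory programmatically, a minimal sketch like the one below may help. It relies only on the `audio/mix` layout described above; the `build_mix_dir` helper and the example paths are hypothetical.

```python
from pathlib import Path
import shutil

def build_mix_dir(dst_root: str, mix_files: list[str]) -> None:
    """Copy mixed recordings into the audio/mix layout expected by inference.py.

    Assumes each source file is already named <FirstSpeakerID>_<SecondSpeakerID>.<ext>.
    """
    mix_dir = Path(dst_root) / "audio" / "mix"
    mix_dir.mkdir(parents=True, exist_ok=True)
    for src in mix_files:
        src_path = Path(src)
        if src_path.suffix.lower() not in {".wav", ".flac", ".mp3"}:
            raise ValueError(f"Unsupported audio format: {src_path.name}")
        shutil.copy(src_path, mix_dir / src_path.name)

# Example (hypothetical paths):
# build_mix_dir("MyTestDataset", ["mixes/FirstSpeakerID1_SecondSpeakerID1.wav"])
```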
2. To separate two-speaker mixed audio using reference mouth recordings, the audio and video directories should have the following format:

```
NameOfTheDirectoryWithTestDataset
├── audio
│   ├── mix
│   │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   │   .
│   │   .
│   │   .
│   │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
└── mouths # contains video information for all speakers
    ├── FirstOrSecondSpeakerID1.npz # npz mouth-crop
    ├── FirstOrSecondSpeakerID2.npz
    .
    .
    .
    └── FirstOrSecondSpeakerIDn.npz
```
Run the following command:

```bash
python inference.py \
    datasets=inference_custom \
    datasets.test.data_dir=TEST_DATASET_PATH \
    inferencer.save_path=SAVE_PATH
```

where `SAVE_PATH` is the path where separation predictions will be saved and `TEST_DATASET_PATH` is the directory with test data.
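Mouth crops are matched to a mix by the speaker IDs in its filename: `ID1_ID2.wav` needs `mouths/ID1.npz` and `mouths/ID2.npz`. The sketch below illustrates that lookup and a quick inspection of one crop; the `"data"` array key and the frames-by-height-by-width layout are assumptions that depend on how the crops were produced.

```python
from pathlib import Path
import numpy as np

def mouth_files_for_mix(mix_path: str, mouths_dir: str) -> list[Path]:
    """Return the two mouth-crop .npz files referenced by a mix filename."""
    first_id, second_id = Path(mix_path).stem.split("_")
    return [Path(mouths_dir) / f"{speaker_id}.npz" for speaker_id in (first_id, second_id)]

# Quick inspection of one crop (the "data" key is an assumption):
# crops = np.load("NameOfTheDirectoryWithTestDataset/mouths/FirstOrSecondSpeakerID1.npz")
# frames = crops["data"]  # e.g. (num_frames, height, width)
# print(frames.shape, frames.dtype)
```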
3. To separate two-speaker mixed audio and evaluate the results against the ground truth separation, the audio directory should have the following format:

```
NameOfTheDirectoryWithTestDataset
├── audio
    ├── mix
    │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
    │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
    │   .
    │   .
    │   .
    │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
    ├── s1 # ground truth for the speaker s1, may not be given
    │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
    │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
    │   .
    │   .
    │   .
    │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
    └── s2 # ground truth for the speaker s2, may not be given
        ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
        ├── FirstSpeakerID2_SecondSpeakerID2.wav
        .
        .
        .
        └── FirstSpeakerIDn_SecondSpeakerIDn.wav
```
Run the following command:

```bash
python inference.py \
    datasets=inference_custom \
    datasets.test.data_dir=TEST_DATASET_PATH \
    inferencer.save_path=SAVE_PATH \
    model=no_video_rtfs \
    inferencer.from_pretrained='data/other/no_video_model.pth' \
    metrics.inference.0.use_pit=True \
    metrics.inference.1.use_pit=True \
    metrics.inference.2.use_pit=True \
    metrics.inference.3.use_pit=True
```

where `SAVE_PATH` is the path where separation predictions will be saved and `TEST_DATASET_PATH` is the directory with test data.
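The `use_pit=True` flags make the metrics permutation-invariant: each metric is computed for both ways of pairing predictions with references, and the better pairing is kept. The NumPy sketch below illustrates the idea for SI-SNR; it is a simplified illustration, not the implementation used by this repository's metric classes.

```python
import itertools
import numpy as np

def si_snr(est: np.ndarray, ref: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB for 1-D signals of equal length."""
    est = est - est.mean()
    ref = ref - ref.mean()
    target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref  # projection onto the reference
    noise = est - target
    return 10 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))

def pit_si_snr(ests: list[np.ndarray], refs: list[np.ndarray]) -> float:
    """Average SI-SNR under the best permutation of speakers."""
    best = -np.inf
    for perm in itertools.permutations(range(len(refs))):
        score = np.mean([si_snr(est, refs[p]) for est, p in zip(ests, perm)])
        best = max(best, float(score))
    return best
```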
4. To separate two-speaker mixed audio using reference mouth recordings and evaluate the results against the ground truth separation, the audio and video directories should have the following format:

```
NameOfTheDirectoryWithTestDataset
├── audio
│   ├── mix
│   │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   │   .
│   │   .
│   │   .
│   │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
│   ├── s1 # ground truth for the speaker s1, may not be given
│   │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   │   .
│   │   .
│   │   .
│   │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
│   └── s2 # ground truth for the speaker s2, may not be given
│       ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│       ├── FirstSpeakerID2_SecondSpeakerID2.wav
│       .
│       .
│       .
│       └── FirstSpeakerIDn_SecondSpeakerIDn.wav
└── mouths # contains video information for all speakers
    ├── FirstOrSecondSpeakerID1.npz # npz mouth-crop
    ├── FirstOrSecondSpeakerID2.npz
    .
    .
    .
    └── FirstOrSecondSpeakerIDn.npz
```
Run the following command:

```bash
python inference.py \
    datasets=inference_custom \
    datasets.test.data_dir=TEST_DATASET_PATH \
    inferencer.save_path=SAVE_PATH
```

where `SAVE_PATH` is the path where separation predictions will be saved and `TEST_DATASET_PATH` is the directory with test data.
5. To evaluate the model using only predicted and ground truth audio separations, ensure that the directories containing them have the following format:

```
PredictDir
├── mix
│   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   .
│   .
│   .
│   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
├── s1 # prediction for the speaker s1, may not be given
│   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   .
│   .
│   .
│   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
└── s2 # prediction for the speaker s2
    ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
    ├── FirstSpeakerID2_SecondSpeakerID2.wav
    .
    .
    .
    └── FirstSpeakerIDn_SecondSpeakerIDn.wav
```

```
GroundTruthDir
├── mix
│   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   .
│   .
│   .
│   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
├── s1 # ground truth for the speaker s1, may not be given
│   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   .
│   .
│   .
│   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
└── s2 # ground truth for the speaker s2, may not be given
    ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
    ├── FirstSpeakerID2_SecondSpeakerID2.wav
    .
    .
    .
    └── FirstSpeakerIDn_SecondSpeakerIDn.wav
```
Run the following command:

```bash
python scripts/calculate_metrics.py \
    predict_dir=PREDICT_DIR_PATH \
    gt_dir=GROUND_TRUTH_DIR_PATH
```
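Predictions and references are paired by filename across the `s1`/`s2` subdirectories. The sketch below shows that pairing and an average SI-SNRi (SI-SNR of the estimate minus SI-SNR of the mixture, both measured against the reference); it reuses the `si_snr` helper from the earlier PIT sketch and assumes `soundfile` is available for reading audio, so treat it as an illustration rather than the script's actual internals.

```python
from pathlib import Path
import numpy as np
import soundfile as sf  # assumed to be available for reading wav/flac files

def average_si_snri(predict_dir: str, gt_dir: str, speaker: str = "s1") -> float:
    """Average SI-SNR improvement over the mixture for one speaker folder."""
    scores = []
    for pred_path in sorted((Path(predict_dir) / speaker).glob("*.wav")):
        est, _ = sf.read(pred_path)
        ref, _ = sf.read(Path(gt_dir) / speaker / pred_path.name)
        mix, _ = sf.read(Path(gt_dir) / "mix" / pred_path.name)
        n = min(len(est), len(ref), len(mix))  # guard against off-by-a-few length mismatches
        scores.append(si_snr(est[:n], ref[:n]) - si_snr(mix[:n], ref[:n]))  # si_snr from the sketch above
    return float(np.mean(scores))
```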
- Finally, if you want to reproduce results from here, run the following code:

  ```bash
  python inference.py \
      inferencer.save_path=SAVE_PATH
  ```

  Feel free to choose what kind of metrics you want to evaluate (see this config).
To calculate efficiency metrics (MACs, the memory required to process one batch with a single mix and video, the number of model parameters, the size of the saved model on disk, the time for one training step and one inference step, and the real-time factor), run the following code:

```bash
python scripts/calculate_efficiency.py dataloader.batch_size=1
```
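As a rough sanity check, the parameter count and real-time factor can also be estimated with plain PyTorch, as in the sketch below. The `model(mix)` call signature, the 2-second 16 kHz dummy mixture, and the absence of a video stream are illustrative assumptions; MACs and memory profiling need dedicated tooling, which the script above wraps.

```python
import time
import torch

def quick_efficiency_stats(model: torch.nn.Module, sample_rate: int = 16000, seconds: float = 2.0) -> dict:
    """Rough parameter count and real-time factor for a single forward pass."""
    n_params = sum(p.numel() for p in model.parameters())
    mix = torch.randn(1, int(sample_rate * seconds))  # dummy batch with a single mix
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        model(mix)  # assumed forward signature; adapt to the actual model interface
        elapsed = time.perf_counter() - start
    rtf = elapsed / seconds  # real-time factor = processing time / audio duration
    return {"params_M": n_params / 1e6, "inference_s": elapsed, "rtf": rtf}
```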
- To reproduce the AVSS model, train it with:

  ```bash
  python train.py
  ```

- To reproduce the audio-only SS model, train it with:

  ```bash
  python train.py \
      -cn=rtfs_pit dataloader.batch_size=10
  ```

  Training from scratch on a V100 GPU takes around 10 days for the AVSS model and around 4 days for the audio-only SS model.
## Final results

| Model | SI-SNRi | SDRi | PESQ | STOI | Params (M) | MACs (G) | Memory (GB) | Train time (s) | Infer. time (ms) | Real-time factor | Size on disk (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RTFS-Net-12 | 12.95 | 13.33 | 2.42 | 0.92 | 0.771 | 88.5 | 7.12 | 1.85 | 1.17 | 0.58 | 57.04 |
| Audio-only model (R=4) | 11.35 | 11.80 | 1.79 | 0.61 | 0.772 | 17.9 | 2.33 | 0.66 | 0.22 | 0.11 | 9.52 |
## Credits

This repository is based on a PyTorch Project Template.

The pre-trained video feature extractor was taken from the Lip-reading repository. Thanks to the authors for sharing their code.