Audio-Visual Source Separation (AVSS)

About • Examples • Installation • How To Use • Final results • Credits • Authors • License

About

This repository contains an end-to-end PyTorch pipeline for the AVSS task. The implemented model is RTFS-Net.

See the task assignment here.

See a report for more information.

Examples

Mixed audio: mix.mov

  • AVSS model: speaker 1 (avss_s1.mov), speaker 2 (avss_s2.mov)
  • Audio-only model: speaker 1 (ss_s1.mov), speaker 2 (ss_s2.mov)

Installation

Follow these steps to install the project:

  1. (Optional) Create and activate a new environment using conda.

    # create env
    conda create -n AVSS python=3.11
    
    # activate env
    conda activate AVSS
  2. Install all required packages.

    pip install -r requirements.txt
  3. Download the model checkpoint, vocab, and language model.

    python scripts/download_weights.py
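
Optionally, you can verify that the downloaded audio-only checkpoint (the same file referenced by the inference commands below) loads correctly. This is only a sanity-check sketch; the internal structure of the checkpoint is not documented here.

# Sketch: confirm the downloaded checkpoint is readable.
import torch

ckpt = torch.load("data/other/no_video_model.pth", map_location="cpu", weights_only=False)
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])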

How To Use

Inference

1. To separate a two-speaker audio mixture, the audio directory should have the following format:
NameOfTheDirectoryWithTestDataset
├── audio
    ├── mix
        ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
        ├── FirstSpeakerID2_SecondSpeakerID2.wav
        .
        .
        .
        └── FirstSpeakerIDn_SecondSpeakerIDn.wav

Run the following command:

python inference.py \
    datasets=inference_custom \
    datasets.test.data_dir=TEST_DATASET_PATH \
    inferencer.save_path=SAVE_PATH \
    model=no_video_rtfs \
    inferencer.from_pretrained='data/other/no_video_model.pth'

where SAVE_PATH is the path where the separation predictions will be saved and TEST_DATASET_PATH is the directory with the test data.
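
The mix filenames encode the two speaker IDs separated by an underscore. Below is a minimal sketch for sanity-checking a custom test directory before running inference; the check_mix_dir helper and the strict ID1_ID2 naming check are illustrative, not part of the repository.

# Sketch: validate a custom test directory before inference (illustrative helper).
from pathlib import Path

AUDIO_EXTS = {".wav", ".flac", ".mp3"}

def check_mix_dir(root: str) -> list[tuple[str, str]]:
    """Return (first_speaker_id, second_speaker_id) pairs found in <root>/audio/mix."""
    mix_dir = Path(root) / "audio" / "mix"
    if not mix_dir.is_dir():
        raise FileNotFoundError(f"Expected {mix_dir} to exist")
    pairs = []
    for f in sorted(mix_dir.iterdir()):
        if f.suffix.lower() not in AUDIO_EXTS:
            continue
        ids = f.stem.split("_")
        if len(ids) != 2:
            raise ValueError(f"Unexpected mix filename: {f.name}")
        pairs.append((ids[0], ids[1]))
    return pairs

print(check_mix_dir("NameOfTheDirectoryWithTestDataset"))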

2. To separate a two-speaker audio mixture using reference mouth recordings, the audio and video directories should have the following format:
NameOfTheDirectoryWithTestDataset
├── audio
│   ├── mix
│   │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   │   .
│   │   .
│   │   .
│   │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
└── mouths # contains video information for all speakers
    ├── FirstOrSecondSpeakerID1.npz # npz mouth-crop
    ├── FirstOrSecondSpeakerID2.npz
    .
    .
    .
    └── FirstOrSecondSpeakerIDn.npz

Run the following command:

python inference.py \
    datasets=inference_custom \
    datasets.test.data_dir=TEST_DATASET_PATH \
    inferencer.save_path=SAVE_PATH

where SAVE_PATH is the path where the separation predictions will be saved and TEST_DATASET_PATH is the directory with the test data.
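
Each file in mouths/ is an .npz archive with the cropped mouth region for one speaker. The snippet below is a quick, repository-independent way to inspect such a file; the key names inside the archive are not specified here, so it simply lists whatever is stored.

# Sketch: inspect one mouth-crop archive from the mouths/ directory.
import numpy as np

with np.load("NameOfTheDirectoryWithTestDataset/mouths/FirstOrSecondSpeakerID1.npz") as data:
    for key in data.files:
        # e.g. an array of video frames (frames x height x width)
        print(key, data[key].shape, data[key].dtype)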

3. To separate a two-speaker audio mixture and evaluate the results against the ground-truth separation, the audio directory should have the following format:
NameOfTheDirectoryWithTestDataset
├── audio
    ├── mix
    │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
    │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
    │   .
    │   .
    │   .
    │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
    ├── s1 # ground truth for the speaker s1, may not be given
    │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
    │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
    │   .
    │   .
    │   .
    │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
    └── s2 # ground truth for the speaker s2, may not be given
        ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
        ├── FirstSpeakerID2_SecondSpeakerID2.wav
        .
        .
        .
        └── FirstSpeakerIDn_SecondSpeakerIDn.wav

Run the following command:

python inference.py \
    datasets=inference_custom \
    datasets.test.data_dir=TEST_DATASET_PATH \
    inferencer.save_path=SAVE_PATH \
    model=no_video_rtfs \
    inferencer.from_pretrained='data/other/no_video_model.pth' \
    metrics.inference.0.use_pit=True \
    metrics.inference.1.use_pit=True \
    metrics.inference.2.use_pit=True \
    metrics.inference.3.use_pit=True

where SAVE_PATH is the path where the separation predictions will be saved and TEST_DATASET_PATH is the directory with the test data.
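
The use_pit=True overrides make each metric permutation-invariant: the audio-only model cannot know which of its two outputs corresponds to s1 and which to s2, so every metric is computed for both orderings of the references and the better value is kept. A minimal sketch of this idea for SI-SNR (the function names below are illustrative, not the repository's API):

# Sketch of permutation-invariant evaluation for two speakers.
import torch

def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for two 1-D signals of equal length."""
    est = est - est.mean()
    ref = ref - ref.mean()
    proj = (torch.dot(est, ref) / (torch.dot(ref, ref) + eps)) * ref
    noise = est - proj
    return 10 * torch.log10((proj.pow(2).sum() + eps) / (noise.pow(2).sum() + eps))

def pit_si_snr(est1, est2, ref1, ref2):
    """Best mean SI-SNR over the two possible speaker orderings."""
    direct = (si_snr(est1, ref1) + si_snr(est2, ref2)) / 2
    swapped = (si_snr(est1, ref2) + si_snr(est2, ref1)) / 2
    return torch.maximum(direct, swapped)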

4. To separate a two-speaker audio mixture using reference mouth recordings and evaluate the results against the ground-truth separation, the audio and video directories should have the following format:
NameOfTheDirectoryWithTestDataset
├── audio
│   ├── mix
│   │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   │   .
│   │   .
│   │   .
│   │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
│   ├── s1 # ground truth for the speaker s1, may not be given
│   │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   │   .
│   │   .
│   │   .
│   │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
│   └── s2 # ground truth for the speaker s2, may not be given
│       ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│       ├── FirstSpeakerID2_SecondSpeakerID2.wav
│       .
│       .
│       .
│       └── FirstSpeakerIDn_SecondSpeakerIDn.wav
└── mouths # contains video information for all speakers
    ├── FirstOrSecondSpeakerID1.npz # npz mouth-crop
    ├── FirstOrSecondSpeakerID2.npz
    .
    .
    .
    └── FirstOrSecondSpeakerIDn.npz

Run the following command:

python inference.py \
    datasets=inference_custom \
    datasets.test.data_dir=TEST_DATASET_PATH \
    inferencer.save_path=SAVE_PATH

where SAVE_PATH is the path where the separation predictions will be saved and TEST_DATASET_PATH is the directory with the test data.

5. To evaluate the model using only predicted and ground-truth audio separations, ensure that the directories containing them have the following format:
PredictDir
├── mix
│   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   .
│   .
│   .
│   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
├── s1 # prediction for the speaker s1, may not be given
│   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   .
│   .
│   .
│   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
└── s2 # prediction for the speaker s2
    ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
    ├── FirstSpeakerID2_SecondSpeakerID2.wav
    .
    .
    .
    └── FirstSpeakerIDn_SecondSpeakerIDn.wav

GroundTruthDir
├── mix
│   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   .
│   .
│   .
│   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
├── s1 # ground truth for the speaker s1, may not be given
│   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   .
│   .
│   .
│   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
└── s2 # ground truth for the speaker s2, may not be given
    ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
    ├── FirstSpeakerID2_SecondSpeakerID2.wav
    .
    .
    .
    └── FirstSpeakerIDn_SecondSpeakerIDn.wav

Run the following command:

python scripts/calculate_metrics.py \
    predict_dir=PREDICT_DIR_PATH \
    gt_dir=GROUND_TRUTH_DIR_PATH

6. Finally, to reproduce the results reported in the Final results section, run the following command:
python inference.py \
    inferencer.save_path=SAVE_PATH

Feel free to choose which metrics you want to evaluate (see this config).
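
Note that SI-SNRi and SDRi in the Final results table are improvements, i.e. the metric value of the estimate minus the metric value of the original mixture. If you want to double-check individual values outside the pipeline, the underlying quality metrics can be recomputed with torchmetrics. The sketch below is an illustration, not the repository's metric code: it assumes 16 kHz audio, torchmetrics installed with its audio extras (the pesq and pystoi backends), and file paths that follow the PredictDir/GroundTruthDir layout above.

# Sketch: recompute SI-SNR, PESQ and STOI for one (estimate, reference) pair.
import torchaudio
from torchmetrics.audio import (
    PerceptualEvaluationSpeechQuality,
    ScaleInvariantSignalNoiseRatio,
    ShortTimeObjectiveIntelligibility,
)

est, sr = torchaudio.load("PREDICT_DIR_PATH/s1/FirstSpeakerID1_SecondSpeakerID1.wav")
ref, _ = torchaudio.load("GROUND_TRUTH_DIR_PATH/s1/FirstSpeakerID1_SecondSpeakerID1.wav")

print("SI-SNR:", ScaleInvariantSignalNoiseRatio()(est, ref).item())
print("PESQ:  ", PerceptualEvaluationSpeechQuality(sr, "wb")(est, ref).item())
print("STOI:  ", ShortTimeObjectiveIntelligibility(sr)(est, ref).item())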

To calculate efficiency metrics (MACs, the memory required to process one batch with a single mix and video, the number of model parameters, the size of the saved model on disk, the time for one training step and one inference step, and the real-time factor), run the following command:

python scripts/calculate_efficiency.py dataloader.batch_size=1
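
The script above relies on the project's configs. For a rough standalone estimate of the parameter count, on-disk size, inference time and real-time factor, plain PyTorch is enough; the sketch below works on any nn.Module and is an illustration rather than the project's measurement code. MACs and peak memory need extra tooling (e.g. a FLOP counter and torch.cuda.max_memory_allocated) and are omitted here.

# Sketch: rough efficiency numbers for a generic nn.Module.
# Real-time factor = processing time / duration of the processed audio.
import os, tempfile, time
import torch
from torch import nn

def efficiency_report(model: nn.Module, example_input: torch.Tensor, audio_seconds: float) -> dict:
    # Number of parameters, in millions.
    params_m = sum(p.numel() for p in model.parameters()) / 1e6

    # Size of the serialized weights on disk, in MiB.
    with tempfile.NamedTemporaryFile(suffix=".pth", delete=False) as f:
        tmp_path = f.name
    torch.save(model.state_dict(), tmp_path)
    size_mb = os.path.getsize(tmp_path) / 2**20
    os.remove(tmp_path)

    # Wall-clock time of one forward pass (CPU; wrap with torch.cuda.synchronize() on GPU).
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        model(example_input)
        infer_s = time.perf_counter() - start

    return {
        "Params(M)": params_m,
        "Size on disk(MB)": size_mb,
        "Infer. time(s)": infer_s,
        "Real-time factor": infer_s / audio_seconds,
    }

# Example with a tiny stand-in network and a 2-second, 16 kHz mono mixture.
toy = nn.Sequential(nn.Conv1d(1, 16, 9, padding=4), nn.ReLU(), nn.Conv1d(16, 2, 9, padding=4))
print(efficiency_report(toy, torch.randn(1, 1, 32000), audio_seconds=2.0))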

Training

  1. To reproduce the AVSS model, train it with
python train.py
  2. To reproduce the audio-only SS model, train it with
python train.py \
   -cn=rtfs_pit dataloader.batch_size=10

Training from scratch on a single V100 GPU takes around 10 days for the AVSS model and 4 days for the audio-only SS model.

Final results

Model                   SI-SNRi    SDRi    PESQ    STOI    Params(M)    MACs(G)    Memory(GB)    Train time(s)    Infer. time(ms)    Real-time factor    Size on disk(MB)
RTFS-Net-12              12.95    13.33    2.42    0.92      0.771       88.5         7.12            1.85             1.17                0.58               57.04
audio-only model (R=4)   11.35    11.80    1.79    0.61      0.772       17.9         2.33            0.66             0.22                0.11                9.52

Credits

This repository is based on a PyTorch Project Template.

The pre-trained video feature extractors were taken from the Lip-reading repository. Thanks to the authors for sharing their code.

Authors

Ilya Drobyshevskiy, Boris Panfilov, Ekaterina Grishina.

License

See the LICENSE file.
