Human performance in everyday noisy situations is known to depend on both the aural and visual senses, which the brain combines using multi-level, context-dependent integration strategies. The multimodal nature of speech is well established: listeners unconsciously lip-read to improve the intelligibility of speech in noisy environments. Studies in neuroscience have shown that the visual aspect of speech can have a strong impact on the ability of humans to focus their auditory attention on a particular stimulus.
Over the last few decades, major advances in machine learning applied to speech technology have been driven by challenges such as CHiME, REVERB, Blizzard, Clarity and Hurricane. However, these challenges have been based on single- and multi-channel audio-only processing and have not exploited the multimodal nature of speech. The aim of the audio-visual (AV) speech enhancement challenge is to bring together the wider computer vision, hearing and speech research communities to explore novel approaches to multimodal speech-in-noise processing.
In this repository, you will find code to support the AVSE Challenge, including the baseline and scripts for preparing the necessary data.
More details can be found on the challenge website: https://challenge.cogmhear.org
Any announcements about the challenge will be made on our mailing list (avse-challenge@mlist.is.ed.ac.uk). See here for instructions on how to subscribe.
Instructions for building the data from the previous AVSEC editions (1, 2 and 3) are here.

We are currently running the fourth edition of AVSEC (AVSEC-4). Follow the instructions below to build the AVSEC-4 dataset.
# Clone repository
git clone https://github.com/cogmhear/avse_challenge.git
cd avse_challenge
# Create & activate environment with conda, see https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html
conda create --name avse python=3.9
conda activate avse
# Install ffmpeg 2.8
conda install -c rmg ffmpeg
# Install requirements
pip install -r requirements.txt

The data preparation scripts should be run in a Unix environment and require an installed version of the ffmpeg tool (required version 2.8; see Installation for the correct installation command).
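As an optional sanity check (not part of the official challenge scripts), the short Python snippet below verifies that ffmpeg is available on the PATH and prints its version string:

# check_ffmpeg.py -- optional sanity check, not part of the challenge pipeline
import shutil
import subprocess

ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    raise SystemExit("ffmpeg not found on PATH; see the Installation commands above")

# Print the first line of `ffmpeg -version`, which reports the installed version
version_line = subprocess.run(
    [ffmpeg_path, "-version"], capture_output=True, text=True, check=True
).stdout.splitlines()[0]
print(f"{version_line} (found at {ffmpeg_path})")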
- Download necessary data:
- target videos:
Lip Reading Sentences 3 (LRS3) Dataset
https://mm.kaist.ac.kr/datasets/lip_reading/
Follow the instructions on the website to obtain credentials to download the videos.
- Noise maskers and metadata (AVSEC-4): https://data.cstr.ed.ac.uk/cogmhear/protected/avsec4_data.tar [4.1GB]
Please register for the AVSE challenge to obtain the download credentials: registration form
Noise maskers and metadata of previous editions are available here
- Room simulation data and impulse responses from the Clarity Challenge (CEC2) and Head-Related Transfer Functions from OlHeaD-HRTF Database: https://data.cstr.ed.ac.uk/cogmhear/protected/clarity_cec2_data.tar [64GB]
AVSEC-4 uses a subset of the data released by the Clarity Enhancement Challenge 2 and a subset of HRTFs from the OlHeaD-HRTF Database (Oldenburg University). Download the tar file above to obtain the HRTFs, room simulation data and impulse responses resampled to 16000 Hz.
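Once downloaded, the archives need to be unpacked into your local data location. A minimal sketch using Python's tarfile module (the archive names match the downloads above, but the destination path is a placeholder for your own data root):

# extract_archives.py -- sketch only; the destination path is a placeholder
import tarfile

DATA_ROOT = "/path/to/data_root"  # EDIT_THIS: your local data location

for archive in ("avsec4_data.tar", "clarity_cec2_data.tar"):
    print(f"Extracting {archive} ...")
    with tarfile.open(archive) as tar:
        tar.extractall(path=DATA_ROOT)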
- Set up data structure and create speech maskers (see EDIT_THIS to change local paths):
cd data_preparation/avse4
./setup_avsec4_data.sh
Change the root path defined in data_preparation/avse4/config.yaml to the location of the data.
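As an optional check before running the preparation scripts, something like the snippet below can confirm that the configured root path exists. The exact key name in config.yaml is an assumption here; inspect the EDIT_THIS markers in the file for the real entry:

# verify_data_root.py -- sketch only; assumes the data location is stored under a "root" key
from pathlib import Path

import yaml  # requires PyYAML (pip install pyyaml)

with open("data_preparation/avse4/config.yaml") as f:
    cfg = yaml.safe_load(f)

root = Path(cfg["root"])  # hypothetical key name; use the actual EDIT_THIS entry
print(f"Configured data root: {root}")
print("OK" if root.is_dir() else "WARNING: directory not found")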
- Prepare noisy data:
Data preparation scripts were adapted from the original code of the Clarity Enhancement Challenge 2, released under the MIT License.
cd data_preparation/avse4
python build_scenes.py

To build the data locally in a single run:
python render_scenes.py

Alternatively, if using multi-run:
# 20 subjobs, starting at scene 0 and rendering 400 scenes in total
python render_scenes.py 'render_starting_chunk=range(0, 400, 20)' --multirun
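The sweep above appears to use Hydra-style multirun overrides: range(0, 400, 20) expands to the starting scene indices 0, 20, ..., 380, so with a chunk size of 20 this launches 20 subjobs covering 400 scenes in total. A small illustration of that arithmetic (how render_scenes.py consumes render_starting_chunk internally is not shown here):

# chunking_sketch.py -- illustrates the subjob split implied by the multirun command above
chunk_size = 20  # assumed from the comment "20 subjobs ... 400 scenes"
starts = list(range(0, 400, chunk_size))
print(f"{len(starts)} subjobs")  # -> 20 subjobs
for start in starts:
    print(f"subjob renders scenes {start} to {start + chunk_size - 1}")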
Rendering binaural and/or monoaural signals

The scripts allow you to render binaural and monoaural signals. To choose which signals to render, set the corresponding parameters in the config file to True for the set of signals you want to render:
binaural_render: True
monoaural_render: True

The expected structure of the prepared AVSEC-4 data is as follows (approximate sizes in brackets):

└── avsec4
├── dev
│ ├── interferers
│ ├── rooms
│ │ ├─ ac [20 MB]
│ │ ├─ HOA_IRs_16k [18.8 GB]
│ │ ├─ rpf [79 MB]
│ ├── scenes [12 GB]
│ ├── targets
│ └── targets_video
├── hrir
│ ├─ HRIRs_MAT
├── maskers_music [607 MB]
├── maskers_noise [3.9 GB]
├── maskers_speech [5.3 GB]
├── metadata
└── train
│ ├── interferers
│ ├── rooms
│ │ ├─ ac [48 MB]
│ │ ├─ HOA_IRs_16k [45.2 GB]
│ │ ├─ rpf [189 MB]
│ ├── scenes [141 GB]
│ ├── targets
│   └── targets_video

AVSEC-4 baseline coming soon (late March 2025)
The credentials to download the pretrained model are the same as the ones used to download the noise maskers and the metadata.
Binaural signals
We provide a script to compute MBSTOI from binaural signals, using the MBSTOI scripts from the Clarity Challenge. The original MBSTOI MATLAB implementation is available here.
cd evaluation/avse4/
python objective_evaluation.py
Note: before running this script, please edit the paths and file name formats defined in evaluation/avse4/config.yaml (see EDIT_THIS).
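For orientation, a minimal sketch of a single-scene MBSTOI computation is shown below. The import path and function signature are assumptions (check the bundled Clarity MBSTOI scripts for the exact ones), and the file names are placeholders:

# mbstoi_sketch.py -- sketch only; import path, signature and file names are assumptions
import soundfile as sf
from mbstoi import mbstoi  # hypothetical import; use the bundled Clarity MBSTOI module

ref, sr = sf.read("scene_target_binaural.wav")    # reference: (n_samples, 2) left/right
proc, _ = sf.read("scene_enhanced_binaural.wav")  # processed output, same shape and rate

# MBSTOI compares left/right clean signals with left/right processed signals
score = mbstoi(ref[:, 0], ref[:, 1], proc[:, 0], proc[:, 1], sr)
print(f"MBSTOI: {score:.3f}")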
Monophonic signals
To compute objective metrics on monophonic signals (i.e., STOI and PESQ), please use the evaluation scripts from AVSEC-1.
cd evaluation/avse1/
python objective_evaluation.py
These scripts require the following libraries:
pip install pystoi==0.3.3
pip install pesq==0.0.4
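For reference, a minimal sketch of how these two libraries are typically called on a single pair of 16 kHz mono signals (file names are placeholders; the challenge script handles batching and the expected file name formats):

# stoi_pesq_sketch.py -- single-pair example with placeholder file names
import soundfile as sf
from pesq import pesq
from pystoi import stoi

clean, fs = sf.read("clean.wav")        # reference signal
enhanced, _ = sf.read("enhanced.wav")   # system output, same length and sample rate

print("STOI:", stoi(clean, enhanced, fs, extended=False))
print("PESQ (wide-band):", pesq(fs, clean, enhanced, "wb"))  # 'wb' mode requires fs == 16000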
Current challenge
- The 4th Audio-Visual Speech Enhancement Challenge (AVSEC-4)
  - data_preparation
  - baseline (TBA)
  - evaluation
Videos are derived from:
- LRS3 dataset
Creative Commons BY-NC-ND 4.0 license.
Interferers are derived from:
- Clarity Enhancement Challenge (CEC1)
  Creative Commons Attribution Share Alike 4.0 International.
- DNS Challenge second edition.
  Only Freesound clips were selected.
  Creative Commons 0 License.
- LRS3 dataset
  Creative Commons BY-NC-ND 4.0 license.
- MedleyDB audio
  The dataset is licensed under CC BY-NC-SA 4.0.
Impulse responses and room simulation data derived from:
- Clarity Enhancement Challenge (CEC2)
  The dataset is licensed under CC BY-SA 4.0.
Head-Related Transfer Functions derived from:
- OlHeaD-HRTF Database: The dataset is licensed under CC BY-NC-SA 4.0.
Scripts:
Data preparation scripts were adapted from the original code of the Clarity Enhancement Challenge 2. Modifications include: extracting the target audio from video, a different sampling rate (16 kHz), no random starting time for the target speaker, and no head rotations.