Skip to content

cogmhear/avse_challenge

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

83 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Audio-Visual Speech Enhancement Challenge (AVSE)

Human performance in everyday noisy situations is known to be dependent upon both aural and visual senses that are contextually combined by the brain’s multi-level integration strategies. The multimodal nature of speech is well established, with listeners known to unconsciously lip read to improve the intelligibility of speech in a real noisy environment. Studies in neuroscience have shown that the visual aspect of speech has a potentially strong impact on the ability of humans to focus their auditory attention on a particular stimulus.

Over the last few decades, there have been major advances in machine learning applied to speech technology made possible by Machine Learning related Challenges including CHiME, REVERB, Blizzard, Clarity and Hurricane. However, the aforementioned challenges are based on single and multi-channel audio-only processing and have not exploited the multimodal nature of speech. The aim of this first audio visual (AV) speech enhancement challenge is to bring together the wider computer vision, hearing and speech research communities to explore novel approaches to multimodal speech-in-noise processing.

In this repository, you will find code to support the AVSE Challenge, including the baseline and scripts for preparing the necessary data.

More details can be found on the challenge website: https://challenge.cogmhear.org

Announcements

Any announcements about the challenge will be made in our mailing list (avse-challenge@mlist.is.ed.ac.uk). See here on how to subscribe to it.

Installation

Instructions to build data from previous AVSEC{1,2,3} editions are here

We are currently running the fourth edition of AVSEC

Follow instructions below to build the AVSEC-4 dataset

# Clone repository
git clone https://github.com/cogmhear/avse_challenge.git
cd avse_challenge

# Create & activate environment with conda, see https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html
conda create --name avse python=3.9
conda activate avse

# Install ffmpeg 2.8
conda install -c rmg ffmpeg

# Install requirements
pip install -r requirements.txt

Data preparation

These scripts should be run in a unix environment and require an installed version of the ffmpeg tool (required version 2.8; see Installation for the correct installation command).

  1. Download necessary data:

Follow the instructions on the website to obtain credentials to download the videos.

Please register for the AVSE challenge to obtain the download credentials: registration form

Noise maskers and metadata of previous editions are available here

AVSEC-4 uses a subset of the data released by the Clarity Enhancement Challenge 2 and a subset of HRTFs of the OlHeaD-HRTF Database from Oldenburg University. Download the tar file above to obtain HRTFs, room simulation data and resampled (16000 Hz) impulse responses.

  1. Set up data structure and create speech maskers (see EDIT_THIS to change local paths):
cd data_preparation/avse4
./setup_avsec4_data.sh 
  1. Change root path defined in data_preparation/avse4/config.yaml to the location of the data.

  2. Prepare noisy data:

Data preparation scripts were adapted from original code by Clarity Enhancement Challenge 2 under MIT License.

cd data_preparation/avse4
python build_scenes.py

Tu build data locally single-run:

python render_scenes.py

Alternatively, if using multi-run:

#20 subjobs, starting in scene 0 and rendering 400 scenes
python render_scenes.py 'render_starting_chunk=range(0, 400, 20)' --multirun  

Rendering binaural and/or monoaural signals

Scripts allow you to render binaural and monoaural signals. To choose which signals to render set the corresponding parameters in the config file to True for the set of signals you want to render:

  binaural_render: True
  monoaural_render: True

Data structure

└── avsec4
    ├── dev
    │   ├── interferers
    │   ├── rooms 
    │   │   ├─ ac [20 MB]
    │   │   ├─ HOA_IRs_16k [18.8 GB]
    │   │   ├─ rpf [79 MB]
    │   ├── scenes [12 GB]
    │   ├── targets
    │   └── targets_video 
    ├── hrir
    │    ├─ HRIRs_MAT
    ├── maskers_music [607 MB]
    ├── maskers_noise [3.9 GB]
    ├── maskers_speech [5.3 GB]
    ├── metadata 
    └── train
    │    ├── interferers
    │    ├── rooms
    │    │    ├─ ac [48 MB]
    │    │    ├─ HOA_IRs_16k [45.2 GB]
    │    │    ├─ rpf [189 MB]
    │    ├── scenes [141 GB]
    │    ├── targets
    │    └── targets_video 

Baseline

AVSEC-4 baseline coming soon (late March 2025)

The credentials to download the pretrained model are the same as the ones used to download the noise maskers and the metadata.

Evaluation

Binaural signals

We provide a script to compute MBSTOI from binaural signals. We use MBSTOI scripts from the Clarity Challenge. The original MBSTOI Matlab implementation is available here.

cd evaluation/avse4/
python objective_evaluation.py

Note: before running this script please edit the paths and file name formats defined in evaluation/avse1/config.yaml (see EDIT_THIS).

Monophonic signals

To compute objective metrics using monophonic signals (i.e., STOI and PESQ) please use evaluation scripts from in AVSEC-1.

cd evaluation/avse1/
python objective_evaluation.py

that require the following libraries:

pip install pystoi==0.3.3
pip install pesq==0.0.4

Challenges

Current challenge

License

Videos are derived from:

Interferers are derived from:

Impulse responses and room simulation data derived from:

Head-Related Transfer Functions derived from:

Scripts:

Data preparation scripts were adapted from original code by Clarity Enhancement Challenge 2. Modifications include: extracting target audio from video and different settings for sampling rate (16kHz), no random starting time for target speaker and no head rotations.

Releases

No releases published

Packages

No packages published

Languages

  • Python 99.1%
  • Shell 0.9%