Speech enhancement in noisy and reverberant environments using deep neural networks.
- Generate datasets of noisy and reverberant mixtures using multiple databases of clean speech utterances, noise recordings and binaural room impulse responses (BRIRs).
- Train different learning-based speech enhancement systems. Currently implemented models are:
- A feed-forward neural network (FFNN)-based system
- Conv-TasNet (Y. Luo and N. Mesgarani)
- DCCRN (Y. Hu et al.)
- SGMSE+ (J. Richter et al.), SGMSE+M (J.-M. Lemercier et al.)
- MANNER (H. J. Park et al.)
- TF-GridNet (Z.-Q. Wang et al.)
- MetricGAN+ (S.-W. Fu et al.), MetricGAN-OKD (W. Shin et al.) [experimental; does not seem to work]
- Evaluate models in terms of different metrics: PESQ, STOI, ESTOI, SNR, SI-SNR.
"brever" reads "reverb" backwards.
The code was tested with Python 3.10 and 3.11.
- Clone the repo:
git clone https://github.com/philgzl/brever.git
cd brever
- Create a virtual environment (optional):
python -m venv venv
source venv/bin/activate
- Install requirements:
pip install -r requirements.txt
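As an optional sanity check, you can verify that the brever package is importable. This is only a sketch and assumes you run it from the repository root after installing the requirements:
# Optional sanity check, run from the repository root.
import brever
print(brever.__file__)  # should point to the brever/ package in the cloned repo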
External databases of clean speech utterances, noise recordings and binaural room impulse responses (BRIRs) are required to generate datasets of noisy and reverberant mixtures. The following databases are used:
- Speech databases:
- TIMIT (not open)
- LibriSpeech (open)
- WSJ0 (not open)
- VCTK (open)
- Clarity (open)
- Noise databases:
- TAU (open)
- DEMAND (open)
- ARTE (open)
- BRIR databases:
- Surrey (open)
- BRAS (open)
The path to each database in the file system is specified in config/paths.yaml.
The open databases can be downloaded using the download_databases.sh script (tested on Linux only). The script also resamples the VCTK, Clarity and TAU files from 48 kHz to 16 kHz and converts them to mono FLAC files using ffmpeg. Once the script is finished, the integrity of the files can be checked using pytest tests/test_loader.py (the tests for the non-open databases are skipped).
The WSJ0 files should be reorganized by speaker using this script.
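Once the databases are downloaded and config/paths.yaml points to them, a quick way to review the configured paths is to load the file with PyYAML. This is only a sketch: it assumes PyYAML is installed and that paths.yaml is a flat name-to-path mapping, which may not match the actual file layout.
# Sketch: print the configured database paths and whether they exist on disk.
# Assumes PyYAML is installed and config/paths.yaml is a flat name-to-path mapping.
import os
import yaml

with open('config/paths.yaml') as f:
    paths = yaml.safe_load(f)

for name, path in paths.items():
    status = 'found' if os.path.exists(str(path)) else 'MISSING'
    print(f'{name}: {path} [{status}]')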
You can initialize a dataset using scripts/init_dataset.py. The script takes the type of dataset (train, val or test) as a mandatory argument and the dataset parameters as optional arguments. It creates a new directory under data/datasets/train/, data/datasets/val/ or data/datasets/test/ containing a config.yaml file with all the dataset parameters. By default, the directory is named after a unique ID generated from the config.yaml file.
usage: init_dataset.py [-h] [--fs FS] [--seed SEED] [--padding PADDING]
[--uniform_tmr UNIFORM_TMR]
[--reflection_boundary REFLECTION_BOUNDARY]
[--speakers SPEAKERS] [--noises NOISES] [--rooms ROOMS]
[--target_snr_dist_name TARGET_SNR_DIST_NAME]
[--target_snr_dist_args TARGET_SNR_DIST_ARGS]
[--target_angle TARGET_ANGLE] [--noise_num NOISE_NUM]
[--noise_angle NOISE_ANGLE]
[--noise_ndr_dist_name NOISE_NDR_DIST_NAME]
[--noise_ndr_dist_args NOISE_NDR_DIST_ARGS]
[--diffuse DIFFUSE] [--diffuse_color DIFFUSE_COLOR]
[--diffuse_ltas_eq DIFFUSE_LTAS_EQ] [--decay DECAY]
[--decay_color DECAY_COLOR]
[--decay_rt60_dist_name DECAY_RT60_DIST_NAME]
[--decay_rt60_dist_args DECAY_RT60_DIST_ARGS]
[--decay_drr_dist_name DECAY_DRR_DIST_NAME]
[--decay_drr_dist_args DECAY_DRR_DIST_ARGS]
[--decay_delay_dist_name DECAY_DELAY_DIST_NAME]
[--decay_delay_dist_args DECAY_DELAY_DIST_ARGS]
[--rms_jitter_dist_name RMS_JITTER_DIST_NAME]
[--rms_jitter_dist_args RMS_JITTER_DIST_ARGS]
[--speech_files SPEECH_FILES]
[--noise_files NOISE_FILES] [--room_files ROOM_FILES]
[--weight_by_avg_length WEIGHT_BY_AVG_LENGTH]
[--duration DURATION] [--sources SOURCES] [-f]
[-n NAME] [--all_databases]
{train,val,test}
initialize a dataset
positional arguments:
{train,val,test} dump in train or test subdir
options:
-h, --help show this help message and exit
-f, --force overwrite config file if already exists
-n NAME, --name NAME dataset name
--all_databases use all databases
random mixture maker options:
--fs FS
--seed SEED
--padding PADDING
--uniform_tmr UNIFORM_TMR
--reflection_boundary REFLECTION_BOUNDARY
--speakers SPEAKERS
--noises NOISES
--rooms ROOMS
--target_snr_dist_name TARGET_SNR_DIST_NAME
--target_snr_dist_args TARGET_SNR_DIST_ARGS
--target_angle TARGET_ANGLE
--noise_num NOISE_NUM
--noise_angle NOISE_ANGLE
--noise_ndr_dist_name NOISE_NDR_DIST_NAME
--noise_ndr_dist_args NOISE_NDR_DIST_ARGS
--diffuse DIFFUSE
--diffuse_color DIFFUSE_COLOR
--diffuse_ltas_eq DIFFUSE_LTAS_EQ
--decay DECAY
--decay_color DECAY_COLOR
--decay_rt60_dist_name DECAY_RT60_DIST_NAME
--decay_rt60_dist_args DECAY_RT60_DIST_ARGS
--decay_drr_dist_name DECAY_DRR_DIST_NAME
--decay_drr_dist_args DECAY_DRR_DIST_ARGS
--decay_delay_dist_name DECAY_DELAY_DIST_NAME
--decay_delay_dist_args DECAY_DELAY_DIST_ARGS
--rms_jitter_dist_name RMS_JITTER_DIST_NAME
--rms_jitter_dist_args RMS_JITTER_DIST_ARGS
--speech_files SPEECH_FILES
--noise_files NOISE_FILES
--room_files ROOM_FILES
--weight_by_avg_length WEIGHT_BY_AVG_LENGTH
extra options:
--duration DURATION
--sources SOURCES
The dataset is then created using the scripts/create_dataset.py script.
usage: create_dataset.py [-h] [-f] [--no_tar] input
create a dataset
positional arguments:
input dataset directory
options:
-h, --help show this help message and exit
-f, --force overwrite if already exists
--no_tar do not save mixtures in tar archive
Example:
$ python scripts/init_dataset.py train --duration 3600 --speakers libri_.* --noises demand --rooms surrey_.*
Initialized data/datasets/train/<dataset_id>/config.yaml
$ python scripts/create_dataset.py data/datasets/train/<dataset_id>/
The following files are then created next to the config.yaml file:
- audio.tar: an archive containing the noisy and clean speech files in FLAC format
- log.log: a log file
- mixture_info.json: metadata about each mixture
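If you want to inspect a generated dataset outside of the training pipeline, the sketch below lists the contents of audio.tar and loads the first mixture. It assumes the soundfile package for reading FLAC, makes no assumption about the exact member names (it lists them first), and uses a placeholder dataset path.
# Sketch: inspect a generated dataset. soundfile is an assumption; <dataset_id> is a placeholder.
import json
import tarfile

import soundfile as sf

dataset_dir = 'data/datasets/train/<dataset_id>'

with open(f'{dataset_dir}/mixture_info.json') as f:
    info = json.load(f)
print(f'{len(info)} mixtures')

with tarfile.open(f'{dataset_dir}/audio.tar') as tar:
    members = [m for m in tar.getnames() if m.endswith('_mixture.flac')]
    x, fs = sf.read(tar.extractfile(members[0]))
    print(members[0], x.shape, fs)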
You can use your own datasets of noisy speech mixtures, but the files should be organized as follows:
- Each of your training, validation and test directories should contain a sub-directory called audio containing all the audio files.
- The audio files should be named f'{i:05d}_mixture.flac' for the noisy speech files and f'{i:05d}_foreground.flac' for the clean speech files, where i is the acoustic scene index starting from 0.
- Finally, either:
- Bundle each audio sub-directory into a tar archive named audio.tar, e.g. using tar -cvf audio.tar audio/
- Or make sure to pass --tar 0 to scripts/init_model.py or scripts/train_model.py so the training script does not look for a tar archive.
Example:
$ tree <my_custom_train_dataset>
<my_custom_train_dataset>
└── audio
├── 00000_foreground.flac
├── 00000_mixture.flac
├── 00001_foreground.flac
├── 00001_mixture.flac
├── 00002_foreground.flac
├── 00002_mixture.flac
...
$ tree <my_custom_val_dataset>
<my_custom_val_dataset>
└── audio
├── 00000_foreground.flac
├── 00000_mixture.flac
├── 00001_foreground.flac
├── 00001_mixture.flac
├── 00002_foreground.flac
├── 00002_mixture.flac
...
Or:
$ tree <my_custom_train_dataset>
<my_custom_train_dataset>
└── audio.tar
├── audio/00000_foreground.flac
├── audio/00000_mixture.flac
├── audio/00001_foreground.flac
├── audio/00001_mixture.flac
├── audio/00002_foreground.flac
├── audio/00002_mixture.flac
...
$ tree <my_custom_val_dataset>
<my_custom_val_dataset>
└── audio.tar
├── audio/00000_foreground.flac
├── audio/00000_mixture.flac
├── audio/00001_foreground.flac
├── audio/00001_mixture.flac
├── audio/00002_foreground.flac
├── audio/00002_mixture.flac
...
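If you generate the files programmatically, the following sketch writes a toy dataset in the expected layout, using short white-noise placeholders instead of real signals; soundfile is an assumption. Afterwards, either bundle the audio sub-directory into audio.tar or pass --tar 0 as described above.
# Toy sketch: write a custom dataset in the expected layout.
# Real signals would replace the white-noise placeholders; soundfile is an assumption.
import os

import numpy as np
import soundfile as sf

audio_dir = os.path.join('my_custom_train_dataset', 'audio')
os.makedirs(audio_dir, exist_ok=True)

fs = 16000
rng = np.random.default_rng(0)
for i in range(3):  # three acoustic scenes
    clean = 0.1*rng.standard_normal(fs)          # placeholder clean speech
    noisy = clean + 0.1*rng.standard_normal(fs)  # placeholder noisy mixture
    sf.write(os.path.join(audio_dir, f'{i:05d}_foreground.flac'), clean, fs)
    sf.write(os.path.join(audio_dir, f'{i:05d}_mixture.flac'), noisy, fs)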
You can also download the VoiceBank+DEMAND dataset using scripts/vbdemand_to_brever.py to quickly get started. The script resamples the files from 48 kHz to 16 kHz and converts them to FLAC. The --val_speakers option can be used to select the set of speakers used for validation (p226 and p287 by default). The training, validation and test sets are then placed in data/datasets/train/vbdemand/, data/datasets/val/vbdemand/ and data/datasets/test/vbdemand/ respectively.
You can initialize a model using scripts/init_model.py. The script takes the training parameters as optional arguments and the model architecture as a mandatory sub-command. The sub-command in turn takes the model parameters as optional arguments. The script creates a new directory under models/ containing a config.yaml file with all the model parameters. By default, the directory is named after a unique ID generated from the config.yaml file. The --train_path and --val_path arguments are mandatory.
usage: init_model.py [-h] [--segment_length SEGMENT_LENGTH]
[--overlap_length OVERLAP_LENGTH] [--sources SOURCES]
[--segment_strategy SEGMENT_STRATEGY]
[--max_segment_length MAX_SEGMENT_LENGTH] [--tar TAR]
[--dynamic_mixing DYNAMIC_MIXING]
[--dynamic_mixtures_per_epoch DYNAMIC_MIXTURES_PER_EPOCH]
[--workers WORKERS] [--epochs EPOCHS] [--device DEVICE]
[--batch_sampler BATCH_SAMPLER] [--batch_size BATCH_SIZE]
[--num_buckets NUM_BUCKETS]
[--dynamic_batch_size DYNAMIC_BATCH_SIZE] [--fs FS]
[--ema EMA] [--ema_decay EMA_DECAY]
[--ignore_checkpoint IGNORE_CHECKPOINT]
[--preload PRELOAD] [--ddp DDP] [--rank RANK]
[--use_wandb USE_WANDB] [--profile PROFILE]
[--val_metrics VAL_METRICS] [--val_period VAL_PERIOD]
[--use_amp USE_AMP] [--compile COMPILE]
[--save_on_epochs SAVE_ON_EPOCHS] [--seed SEED]
--train_path TRAIN_PATH --val_path VAL_PATH [-f]
[-n NAME]
{convtasnet,dccrn,ffnn,manner,metricganokd,metricganp,sgmsep,sgmsepm,idmse,tfgridnet}
...
initialize a model
positional arguments:
{convtasnet,dccrn,ffnn,manner,metricganokd,metricganp,sgmsep,sgmsepm,idmse,tfgridnet}
model architecture
options:
-h, --help show this help message and exit
-f, --force overwrite config file if already exists
-n NAME, --name NAME model name
dataset options:
--segment_length SEGMENT_LENGTH
--overlap_length OVERLAP_LENGTH
--sources SOURCES
--segment_strategy SEGMENT_STRATEGY
--max_segment_length MAX_SEGMENT_LENGTH
--tar TAR
--dynamic_mixing DYNAMIC_MIXING
--dynamic_mixtures_per_epoch DYNAMIC_MIXTURES_PER_EPOCH
trainer options:
--workers WORKERS
--epochs EPOCHS
--device DEVICE
--batch_sampler BATCH_SAMPLER
--batch_size BATCH_SIZE
--num_buckets NUM_BUCKETS
--dynamic_batch_size DYNAMIC_BATCH_SIZE
--fs FS
--ema EMA
--ema_decay EMA_DECAY
--ignore_checkpoint IGNORE_CHECKPOINT
--preload PRELOAD
--ddp DDP
--rank RANK
--use_wandb USE_WANDB
--profile PROFILE
--val_metrics VAL_METRICS
--val_period VAL_PERIOD
--use_amp USE_AMP
--compile COMPILE
--save_on_epochs SAVE_ON_EPOCHS
extra options:
--seed SEED
--train_path TRAIN_PATH
--val_path VAL_PATH
The model is then trained using the scripts/train_model.py script. Training options can be provided, which will override the parameters in the config.yaml file.
usage: train_model.py [-h] [-f] [--wandb_run_id WANDB_RUN_ID]
[--segment_length SEGMENT_LENGTH]
[--overlap_length OVERLAP_LENGTH] [--sources SOURCES]
[--segment_strategy SEGMENT_STRATEGY]
[--max_segment_length MAX_SEGMENT_LENGTH] [--tar TAR]
[--dynamic_mixing DYNAMIC_MIXING]
[--dynamic_mixtures_per_epoch DYNAMIC_MIXTURES_PER_EPOCH]
[--workers WORKERS] [--epochs EPOCHS] [--device DEVICE]
[--batch_sampler BATCH_SAMPLER]
[--batch_size BATCH_SIZE] [--num_buckets NUM_BUCKETS]
[--dynamic_batch_size DYNAMIC_BATCH_SIZE] [--fs FS]
[--ema EMA] [--ema_decay EMA_DECAY]
[--ignore_checkpoint IGNORE_CHECKPOINT]
[--preload PRELOAD] [--ddp DDP] [--rank RANK]
[--use_wandb USE_WANDB] [--profile PROFILE]
[--val_metrics VAL_METRICS] [--val_period VAL_PERIOD]
[--use_amp USE_AMP] [--compile COMPILE]
[--save_on_epochs SAVE_ON_EPOCHS] [--seed SEED]
[--train_path TRAIN_PATH] [--val_path VAL_PATH]
input
train a model
positional arguments:
input model directory
options:
-h, --help show this help message and exit
-f, --force train even if already trained
--wandb_run_id WANDB_RUN_ID
id of wandb run to resume
the following options supersede the config file:
--segment_length SEGMENT_LENGTH
--overlap_length OVERLAP_LENGTH
--sources SOURCES
--segment_strategy SEGMENT_STRATEGY
--max_segment_length MAX_SEGMENT_LENGTH
--tar TAR
--dynamic_mixing DYNAMIC_MIXING
--dynamic_mixtures_per_epoch DYNAMIC_MIXTURES_PER_EPOCH
--workers WORKERS
--epochs EPOCHS
--device DEVICE
--batch_sampler BATCH_SAMPLER
--batch_size BATCH_SIZE
--num_buckets NUM_BUCKETS
--dynamic_batch_size DYNAMIC_BATCH_SIZE
--fs FS
--ema EMA
--ema_decay EMA_DECAY
--ignore_checkpoint IGNORE_CHECKPOINT
--preload PRELOAD
--ddp DDP
--rank RANK
--use_wandb USE_WANDB
--profile PROFILE
--val_metrics VAL_METRICS
--val_period VAL_PERIOD
--use_amp USE_AMP
--compile COMPILE
--save_on_epochs SAVE_ON_EPOCHS
--seed SEED
--train_path TRAIN_PATH
--val_path VAL_PATH
Example:
$ python scripts/init_model.py --train_path data/datasets/train/<dataset_id>/ convtasnet
Initialized models/<model_id>/config.yaml
$ python scripts/train_model.py models/<model_id>/
The following files are then created next to the config.yaml file:
- checkpoints/: a sub-directory containing the model checkpoints
- log_train.log: a log file
- losses.npz: training and validation curves in NumPy format
- training_curve.png: a plot of the training and validation curves
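training_curve.png already plots the curves, but losses.npz can also be loaded with NumPy for custom post-processing. The array names stored in the file are not documented here, so the sketch below prints them before plotting; matplotlib and the model path are assumptions/placeholders.
# Sketch: inspect and re-plot the curves stored in losses.npz.
# The array names inside the file are unknown here; data.files lists them.
import matplotlib.pyplot as plt
import numpy as np

data = np.load('models/<model_id>/losses.npz')
print(data.files)  # names of the stored arrays

for name in data.files:
    plt.plot(data[name], label=name)
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.savefig('curves.png')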
You can evaluate a trained model using the scripts/test_model.py script.
usage: test_model.py [-h] -i INPUTS [INPUTS ...] -t TESTS [TESTS ...] [-f]
[--output_dir OUTPUT_DIR] [--cuda]
[--metrics METRICS [METRICS ...]] [--no_train_check]
[--best BEST] [--batch_size BATCH_SIZE]
[--workers WORKERS] [--ddp]
test a model
options:
-h, --help show this help message and exit
-i INPUTS [INPUTS ...], --inputs INPUTS [INPUTS ...]
model directories or checkpoints
-t TESTS [TESTS ...], --tests TESTS [TESTS ...]
test dataset paths
-f, --force test even if already tested
--output_dir OUTPUT_DIR
where to write signals
--cuda run on GPU
--metrics METRICS [METRICS ...]
metrics to evaluate with
--no_train_check test even if model is not trained
--best BEST metric to use for checkpoint selection
--batch_size BATCH_SIZE
batch size
--workers WORKERS number of workers
--ddp use DDP
Example:
python scripts/test_model.py -i models/<model_id>/ -t data/datasets/test/<dataset_id>/
This creates a scores.hdf5 file in the model directory containing the objective metrics of the enhanced output mixtures and the unprocessed input mixtures.
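The internal layout of scores.hdf5 is not described here, so the sketch below (using h5py, an assumption) simply walks the file and prints every dataset it contains:
# Sketch: walk scores.hdf5 and print every dataset it contains.
# h5py and the model path are assumptions; the layout is whatever test_model.py wrote.
import h5py

def show(name, obj):
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)

with h5py.File('models/<model_id>/scores.hdf5', 'r') as f:
    f.visititems(show)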
To visualize the scores, you can use the scripts/compare_models.py script.
Example:
python scripts/compare_models.py -i models/<model_id_1>/ models/<model_id_2>/ -t data/datasets/test/<dataset_id>/
- Inherit from brever.models.base.BreverBaseModel and implement the loss and _enhance methods. Override other methods if needed. See the docstrings in brever/models/base.py for details and the implementations of the other models in brever/models/ for examples. A minimal skeleton is sketched after this list.
- Add the model to brever.models.base.ModelRegistry using the register decorator. This exposes the model to the command line.
- Add default arguments and type hints to the __init__ method of the model. This allows for type checks and default values when initializing models from the command line.
- Create a default config file for the model in config/models/. You can do this automatically using scripts/write_default_config.py. If needed, replace the default options for the trainer and the dataset with values that make sense for your model.
- Implement tests in tests/test_models.py and tests/test_training.py.
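A purely illustrative skeleton is shown below. The exact signatures of loss and _enhance and the usage of the register decorator are defined in brever/models/base.py, so everything here is an assumption to be checked against the docstrings and the existing models.
# Illustrative skeleton only; check brever/models/base.py for the actual
# method signatures and registry usage, which are assumptions here.
import torch

from brever.models.base import BreverBaseModel, ModelRegistry


@ModelRegistry.register('mymodel')  # command-line name; decorator arguments are an assumption
class MyModel(BreverBaseModel):
    def __init__(self, hidden_size: int = 64):  # defaults and type hints drive the CLI
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(257, hidden_size),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_size, 257),
        )

    def loss(self, batch):
        # compute the training loss for a batch (signature is an assumption)
        raise NotImplementedError

    def _enhance(self, x):
        # enhance a noisy input and return the estimate (signature is an assumption)
        raise NotImplementedError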
- Fix MetricGAN+
- Add CSIG, CBAK and COVL metrics
- Dynamically check for audio/ or audio.tar in dataset directories
- Support .wav files. Eventually support arbitrary file extensions.
- Convert BRAS files from SOFA to WAV
- Add noise type selection for DEMAND and ARTE
- Make download_databases.sh cross-platform
To build the image:
docker build -t brever:latest .
To start the container:
docker run -it --rm -v ./models:/brever/models -v ./data:/brever/data brever:latest
@article{gonzalez2023assessing,
title={Assessing the Generalization Gap of Learning-Based Speech Enhancement Systems in Noisy and Reverberant Environments},
author={Philippe Gonzalez and Tommy Sonne Alstrøm and Tobias May},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
volume={31},
pages={3390--3403},
year={2023},
doi={10.1109/TASLP.2023.3318965},
}
@inproceedings{gonzalez2023batching,
title={On Batching Variable Size Inputs for Training End-to-End Speech Enhancement Systems},
author={Philippe Gonzalez and Tommy Sonne Alstrøm and Tobias May},
booktitle={IEEE International Conference on Acoustics, Speech, and Signal Processing},
pages={1--5},
year={2023},
doi={10.1109/ICASSP49357.2023.10097075},
}
@inproceedings{gonzalez2024diffusion,
title={Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler},
author={Philippe Gonzalez and Zheng-Hua Tan and Jan Østergaard and Jesper Jensen and Tommy Sonne Alstrøm and Tobias May},
booktitle={IEEE International Conference on Acoustics, Speech, and Signal Processing},
pages={1--5},
year={2024},
doi={10.1109/ICASSP48485.2024.10446610},
}
@article{gonzalez2024investigating,
title={Investigating the Design Space of Diffusion Models for Speech Enhancement},
author={Philippe Gonzalez and Zheng-Hua Tan and Jan Østergaard and Jesper Jensen and Tommy Sonne Alstrøm and Tobias May},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
volume={32},
pages={4486--4500},
year={2024},
doi={10.1109/TASLP.2024.3473319},
}
@article{gonzalez2024effect,
title={The Effect of Training Dataset Size on Discriminative and Diffusion-Based Speech Enhancement Systems},
author={Philippe Gonzalez and Zheng-Hua Tan and Jan Østergaard and Jesper Jensen and Tommy Sonne Alstrøm and Tobias May},
journal={IEEE Signal Processing Letters},
volume={31},
pages={2225--2229},
year={2024},
doi={10.1109/LSP.2024.3449221},
}