This repository contains the inference module for a Voice Activity Detection (VAD) model, implemented in C++. The VAD model is designed to identify whether an input audio segment contains human speech (voice activity) or is background noise (non-voice activity). The model architecture incorporates SincNet layers for efficient feature extraction, LSTM layers for capturing temporal dependencies, and linear layers for final classification.
The VAD model consists of the following key components (a minimal sketch of how they compose follows the list):

- **SincNet Layers (Convolutions):**
  - SincNet layers are used for effective feature extraction from the input audio signal.
  - These layers employ parametrized sinc functions to learn filters that are particularly well suited to audio processing.
- **LSTM Layers:**
  - Long Short-Term Memory (LSTM) layers capture temporal dependencies in the audio sequence.
  - LSTMs help the model understand the context and sequential patterns in the input data.
- **Linear Layers:**
  - Linear layers perform the final classification, distinguishing voice activity from non-voice activity.
  - These layers produce a binary prediction for each input segment.
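To make the data flow concrete, here is a minimal C++ sketch of how the three stages might compose. All names here (`SincNetFeatureExtractor`, `LSTMStack`, `LinearClassifier`, `VADModel`) are illustrative assumptions rather than the repository's actual API, and the stage bodies are placeholders standing in for the learned layers.

```cpp
// Illustrative sketch of the inference pipeline: audio -> SincNet features
// -> LSTM hidden states -> linear speech score. All class and method names
// are assumptions; consult the repository headers for the real API.
#include <vector>

using Tensor = std::vector<float>;  // flattened feature buffer, for illustration only

struct SincNetFeatureExtractor {
    // Parametrized sinc-filter convolutions would run here; identity placeholder.
    Tensor forward(const Tensor& audio) const { return audio; }
};

struct LSTMStack {
    // Recurrent layers over the frame sequence; identity placeholder.
    Tensor forward(const Tensor& features) const { return features; }
};

struct LinearClassifier {
    // A learned linear projection to a single speech score; a sum stands in.
    float forward(const Tensor& hidden) const {
        float score = 0.0f;
        for (float v : hidden) score += v;
        return score;
    }
};

struct VADModel {
    SincNetFeatureExtractor sincnet;
    LSTMStack lstm;
    LinearClassifier classifier;

    // True when the segment is classified as containing speech.
    bool is_speech(const Tensor& audio_segment, float threshold = 0.5f) const {
        Tensor features = sincnet.forward(audio_segment);
        Tensor hidden = lstm.forward(features);
        return classifier.forward(hidden) > threshold;
    }
};
```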
- Ensure you have a GNU compiler of version 12.3.0 or later installed on your system (you can check with `g++ --version`).
- Clone the repository to your local machine:

  ```bash
  git clone https://github.com/CodeGreatCommander/VAD
  cd VAD/inference/pretrained_models
  ```

- Follow the instructions in the subsequent sections to set up and use the VAD inference module.
The VAD inference module provides a streamlined process for performing voice activity detection on audio segments. The pretrained models are located in the `./inference/pretrained_models` directory; to explore a specific model further, see the README in the corresponding folder.
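As a hypothetical usage example, a driver built on the sketch above might look like the following. It reuses the `VADModel` and `Tensor` types defined earlier; the segment length, sample rate, and entry point are illustrative assumptions, and real code would load weights from `pretrained_models/` rather than default-constructing the model.

```cpp
// Hypothetical driver reusing the VADModel sketch above; all specifics here
// (segment size, sample rate, default threshold) are assumptions.
#include <iostream>

int main() {
    VADModel model;                // real code would load pretrained weights here
    Tensor segment(16000, 0.0f);   // e.g. one second of 16 kHz audio, silence here
    bool speech = model.is_speech(segment);
    std::cout << (speech ? "speech" : "non-speech") << '\n';
    return 0;
}
```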