This repository contains my hands-on work following the Hugging Face Audio Course.
Each notebook demonstrates a key concept or application in audio machine learning.
- Goal: Introduction to audio ML applications using Hugging Face pipelines.
- Demonstrated:
- Loading a dataset in streaming mode.
- Displaying the waveform and spectrogram of an audio example.
- Loading pre-trained audio models.
- Running simple inference for tasks like automatic speech recognition (ASR), as sketched below.
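A minimal sketch of the streaming load and ASR inference steps, assuming the MInDS-14 dataset and the openai/whisper-tiny checkpoint as illustrative choices:

```python
from datasets import load_dataset
from transformers import pipeline

# Stream the dataset so no full download is needed.
minds = load_dataset("PolyAI/minds14", name="en-US", split="train", streaming=True)
example = next(iter(minds))
audio = example["audio"]

# Run a pre-trained ASR checkpoint on the raw waveform.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
result = asr({"raw": audio["array"], "sampling_rate": audio["sampling_rate"]})
print(result["text"])
```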
- Goal: Build an audio classifier using transformer models.
- Demonstrated:
- Running pipelines with diverse datasets for keyword spotting, language identification, or zero-shot audio classification.
- Loading and preprocessing an audio dataset for music classification.
- Fine-tuning the model: https://huggingface.co/arsonor/distilhubert-finetuned-gtzan
- Integrating it in a Gradio demo (see the sketch below).
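A minimal sketch of serving the fine-tuned classifier in Gradio, using the checkpoint linked above:

```python
import gradio as gr
from transformers import pipeline

# Load the fine-tuned checkpoint through the audio-classification pipeline.
classifier = pipeline("audio-classification", model="arsonor/distilhubert-finetuned-gtzan")

def classify(audio_path):
    # Return {label: score} pairs that gr.Label can render directly.
    return {p["label"]: p["score"] for p in classifier(audio_path)}

# Upload or record a clip, get genre probabilities back.
demo = gr.Interface(fn=classify, inputs=gr.Audio(type="filepath"), outputs=gr.Label())
demo.launch()
```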
3-gtzan-music-genre-classification.ipynb (Hands-on exercise)
- Goal: Apply audio classification to music genres with the GTZAN dataset.
- Demonstrated:
- Fine-tuning an Audio Spectrogram Transformer (AST) model.
- Evaluating results with accuracy and a confusion matrix (see the metric sketch below).
- Achieving >87% accuracy: https://huggingface.co/arsonor/ast-finetuned-audioset-10-10-0.4593-finetuned-gtzan
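A sketch of the evaluation side, assuming the metric is computed through a Trainer compute_metrics callback with the evaluate library:

```python
import numpy as np
import evaluate
from sklearn.metrics import ConfusionMatrixDisplay

accuracy = evaluate.load("accuracy")

# Passed as compute_metrics= to the Hugging Face Trainer during fine-tuning.
def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=-1)
    return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)

# After training, plot a confusion matrix from held-out predictions
# (y_true and y_pred are placeholders for your own arrays):
# ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
```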
- Goal: Perform automatic speech recognition (ASR) with pre-trained models.
- Demonstrated:
- Using pipelines with diverse datasets for speech-to-text transcription.
- Evaluating transcriptions with metrics like word error rate (WER), as shown below.
- Fine-tuning a Whisper model on a specific language: https://huggingface.co/arsonor/whisper-small-dv
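WER counts substitutions, insertions, and deletions against the reference word count; a quick check with the evaluate library:

```python
import evaluate

wer_metric = evaluate.load("wer")

# One substitution ("sit" for "sat") over six reference words -> WER of about 0.167.
references = ["the cat sat on the mat"]
predictions = ["the cat sit on the mat"]
print(wer_metric.compute(references=references, predictions=predictions))
```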
5-asr-model-fine-tuning.ipynb (Hands-on exercise)
- Goal: Fine-tune the "openai/whisper-tiny" model using the American English ("en-US") subset of the "PolyAI/minds14" dataset.
- Demonstrated:
- Training a pre-trained ASR model for domain adaptation (data-preparation sketch below).
- Achieving a WER of 0.32: https://huggingface.co/arsonor/whisper-tiny-en
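A sketch of the data preparation that precedes fine-tuning (column names follow the public MInDS-14 schema; the Seq2SeqTrainer training loop itself is omitted):

```python
from datasets import load_dataset, Audio
from transformers import WhisperProcessor

# Load the en-US subset and resample to the 16 kHz that Whisper expects.
minds = load_dataset("PolyAI/minds14", name="en-US", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny", language="english", task="transcribe")

def prepare(batch):
    audio = batch["audio"]
    # Log-mel features for the encoder, token ids as decoder labels.
    batch["input_features"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["labels"] = processor.tokenizer(batch["transcription"]).input_ids
    return batch

minds = minds.map(prepare, remove_columns=minds.column_names)
```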
- Goal: Explore text-to-speech (TTS) using pre-trained models.
- Demonstrated:
- Converting text into natural-sounding speech.
- Experimenting with different voices and vocoders: SpeechT5, Bark, and Massive Multilingual Speech (MMS); a SpeechT5 sketch follows.
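A SpeechT5 sketch; the speaker embedding selects the voice, here taken from the CMU Arctic x-vectors used in the course:

```python
import torch
from datasets import load_dataset
from transformers import pipeline

synthesiser = pipeline("text-to-speech", model="microsoft/speecht5_tts")

# SpeechT5 conditions on a speaker embedding; index 7306 is one example voice.
embeddings = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings[7306]["xvector"]).unsqueeze(0)

speech = synthesiser("Hello, this is a test.", forward_params={"speaker_embeddings": speaker_embedding})
# speech["audio"] is a NumPy array sampled at speech["sampling_rate"] Hz.
```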
- Goal: Try advanced TTS with the Bark model on Hugging Face.
- Demonstrated:
- Generating expressive and controllable synthetic voices (see the Bark sketch below).
- Running interactive speech demos.
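A minimal Bark sketch through the text-to-speech pipeline (bark-small chosen here to keep it light; the notebook may use the full model):

```python
from transformers import pipeline

# Bark understands non-speech cues such as [laughs] embedded in the text.
bark = pipeline("text-to-speech", model="suno/bark-small")
output = bark("Hello! [laughs] This voice is fully synthetic.", forward_params={"do_sample": True})
# Write output["audio"] at output["sampling_rate"] Hz to a WAV file to listen.
```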
- Goal: Build a speech-to-speech translation pipeline.
- Demonstrated:
- Combining ASR + translation + TTS, as sketched below.
- Converting speech in one language to speech in another within a Gradio demo.
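A sketch of the cascade, using Whisper's built-in translate task to cover the ASR + translation stages and Bark as a stand-in TTS (the notebook's model choices may differ):

```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
tts = pipeline("text-to-speech", model="suno/bark-small")

def speech_to_speech(audio_path):
    # Whisper transcribes and translates to English in one pass.
    english_text = asr(audio_path, generate_kwargs={"task": "translate"})["text"]
    # Synthesize the translated text back to speech.
    return tts(english_text)
```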
- Goal: Create a simple voice assistant application.
- Demonstrated:
- Capturing speech input, processing it, and returning responses.
- Integrating the process in four steps (sketched after this list):
- Wake word detection
- Speech transcription
- Language model query
- Speech synthesis
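A sketch of the four stages wired together; the wake-word classifier and the "marvin" trigger follow the course setup, query_llm stands in for any text-in/text-out language model call, and Bark is an illustrative TTS choice:

```python
from transformers import pipeline

wake_word_detector = pipeline("audio-classification", model="MIT/ast-finetuned-speech-commands-v2")
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-tiny.en")
tts = pipeline("text-to-speech", model="suno/bark-small")

def assistant_turn(audio_path, query_llm):
    # 1. Wake word detection: proceed only when the trigger word is heard.
    if wake_word_detector(audio_path)[0]["label"] != "marvin":
        return None
    # 2. Speech transcription.
    text = transcriber(audio_path)["text"]
    # 3. Language model query (query_llm is any callable returning a string).
    answer = query_llm(text)
    # 4. Speech synthesis.
    return tts(answer)
```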
- Goal: Transcribe and analyze meeting audio recordings.
- Demonstrated:
- Multi-speaker transcription and speaker diarization (see the sketch below).
- Structuring transcripts with speaker labels and timestamps.
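A sketch combining pyannote diarization with Whisper timestamps (requires a Hugging Face token and accepting the pyannote model's terms; merging the two outputs into one speaker-labelled transcript is left out):

```python
from pyannote.audio import Pipeline as DiarizationPipeline
from transformers import pipeline

# "Who spoke when" from pyannote, chunk-level timing from Whisper.
diarization = DiarizationPipeline.from_pretrained("pyannote/speaker-diarization-3.1")
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small", return_timestamps=True)

def transcribe_meeting(audio_path):
    turns = diarization(audio_path)
    chunks = asr(audio_path)["chunks"]
    for segment, _, speaker in turns.itertracks(yield_label=True):
        print(f"[{segment.start:.1f}s - {segment.end:.1f}s] {speaker}")
    return chunks
```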