Hugging Face Audio Course – Notebooks

This repository contains my hands-on work following the Hugging Face Audio Course.
Each notebook demonstrates a key concept or application in audio machine learning.


Unit 2

  • Introduction to audio ML applications using Hugging Face pipelines.
  • Demonstrated:
    • Loading a dataset in streaming mode.
    • Displaying the waveform and spectrogram of an audio example.
    • Loading pre-trained audio models.
    • Running simple inference for tasks like ASR (automatic speech recognition).
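
A minimal sketch of these steps, assuming the MINDS-14 dataset and librosa for plotting (the exact dataset and checkpoint in the notebook may differ):

```python
from datasets import load_dataset, Audio
from transformers import pipeline
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Stream the dataset so nothing has to be downloaded up front
minds = load_dataset("PolyAI/minds14", name="en-US", split="train", streaming=True)
# Resample on the fly to the 16 kHz most speech checkpoints expect
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
example = next(iter(minds))
array, sr = example["audio"]["array"], example["audio"]["sampling_rate"]

# Waveform of the example
librosa.display.waveshow(array, sr=sr)
plt.show()

# Log-magnitude spectrogram
spec_db = librosa.amplitude_to_db(np.abs(librosa.stft(array)), ref=np.max)
librosa.display.specshow(spec_db, sr=sr, x_axis="time", y_axis="hz")
plt.colorbar()
plt.show()

# Simple inference with a pre-trained ASR checkpoint
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
print(asr(array)["text"])
```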

Unit 4

  • Goal: Build an audio classifier using transformer models.
  • Demonstrated:
    • Running pipelines with diverse datasets for Keyword Spotting, Language Identification, or Zero-Shot Audio Classification.
    • Loading and preprocessing an audio dataset for Music Classification.
    • Fine-tuning a DistilHuBERT model on the GTZAN genre dataset: https://huggingface.co/arsonor/distilhubert-finetuned-gtzan
    • Integrating the fine-tuned model into a Gradio demo (see the sketch below).
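
A sketch of inference and the Gradio demo with the fine-tuned checkpoint above (the demo layout is an assumption, not necessarily the notebook's):

```python
import gradio as gr
from transformers import pipeline

classifier = pipeline("audio-classification", model="arsonor/distilhubert-finetuned-gtzan")

def classify_genre(filepath):
    # Return label -> score pairs so gr.Label can render a ranked list
    return {p["label"]: p["score"] for p in classifier(filepath)}

demo = gr.Interface(
    fn=classify_genre,
    inputs=gr.Audio(type="filepath"),
    outputs=gr.Label(num_top_classes=5),
)
demo.launch()
```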


Unit 5

  • Goal: Perform automatic speech recognition (ASR) with pre-trained models.
  • Demonstrated:
    • Using pipelines with diverse datasets for speech-to-text transcription.
    • Evaluating transcriptions with metrics like WER (Word Error Rate).
    • Fine-tuning a Whisper model on a specific language: https://huggingface.co/arsonor/whisper-small-dv
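
For example, WER can be computed with the evaluate library; the toy sentences here are purely illustrative:

```python
from evaluate import load

wer_metric = load("wer")
predictions = ["the cat sat on the mat"]
references = ["the cat sat on a mat"]

# WER = (substitutions + insertions + deletions) / words in the reference
print(f"WER: {wer_metric.compute(predictions=predictions, references=references):.2%}")
```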

5-asr-model-fine-tuning.ipynb (Hands-on exercise)

  • Goal: Fine-tune the "openai/whisper-tiny" model on the American English ("en-US") subset of the "PolyAI/minds14" dataset.
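
A sketch of the data preparation for this exercise; MINDS-14 audio ships at 8 kHz, so it is resampled to the 16 kHz Whisper expects:

```python
from datasets import load_dataset, Audio
from transformers import WhisperProcessor

minds = load_dataset("PolyAI/minds14", name="en-US", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-tiny", language="english", task="transcribe"
)

def prepare(batch):
    audio = batch["audio"]
    # Log-mel input features for the encoder, token ids for the decoder labels
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["transcription"]).input_ids
    return batch

minds = minds.map(prepare, remove_columns=minds.column_names)
```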

Unit 6

  • Goal: Explore text-to-speech (TTS) using pre-trained models.
  • Demonstrated:
    • Converting text into natural-sounding speech.
    • Experimenting with different voices and vocoders: SpeechT5, Bark, and Massive Multilingual Speech (MMS).
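
For instance, SpeechT5 can be driven through the text-to-speech pipeline; it conditions on a speaker x-vector, here taken from the CMU Arctic embeddings:

```python
import torch
from datasets import load_dataset
from transformers import pipeline

synthesiser = pipeline("text-to-speech", model="microsoft/speecht5_tts")

# SpeechT5 needs a speaker embedding (x-vector) to pick a voice
embeddings = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings[7306]["xvector"]).unsqueeze(0)

speech = synthesiser(
    "Transformers can talk too.",
    forward_params={"speaker_embeddings": speaker_embedding},
)
# speech["audio"] is a NumPy array sampled at speech["sampling_rate"] Hz
```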
  • Goal: Try advanced TTS with the Bark model on Hugging Face.
  • Demonstrated:
    • Generating expressive and controllable synthetic voices.
    • Running interactive speech demos.
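
A minimal Bark sketch; Bark accepts expressive cues such as [laughs] directly in the prompt:

```python
from transformers import pipeline

bark = pipeline("text-to-speech", model="suno/bark-small")
output = bark("Hello! [laughs] Bark can sound surprisingly expressive.")
# output["audio"] and output["sampling_rate"] can be written out as a WAV file
```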

Unit 7

  • Goal: Build a speech-to-speech translation pipeline.
  • Demonstrated:
    • Combining ASR + translation + TTS.
    • Converting speech in one language to speech in another within a Gradio demo.
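
A sketch of such a pipeline, using Whisper's built-in any-to-English speech translation for the first stage and SpeechT5 for the second (the checkpoints are assumptions, not necessarily the notebook's):

```python
import torch
from datasets import load_dataset
from transformers import pipeline

# Stage 1: translate source-language speech to English text
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")

# Stage 2: synthesize the English text back to speech
tts = pipeline("text-to-speech", model="microsoft/speecht5_tts")
embeddings = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings[7306]["xvector"]).unsqueeze(0)

def speech_to_speech(audio):
    text = asr(audio, generate_kwargs={"task": "translate"})["text"]
    out = tts(text, forward_params={"speaker_embeddings": speaker_embedding})
    # (sampling_rate, array) is the format a Gradio audio output expects
    return out["sampling_rate"], out["audio"]
```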
  • Goal: Create a simple voice assistant application.
  • Demonstrated:
    • Capturing speech input, processing it, and returning responses.
    • Integrating the pipeline in four steps:
      1. Wake word detection
      2. Speech transcription
      3. Language model query
      4. Speech synthesis
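
A skeleton mapping the four steps to components; the keyword-spotting checkpoint and the LLM helper are assumptions for illustration:

```python
from transformers import pipeline

# 1. Wake word detection with a keyword-spotting classifier
#    (assumption: the Speech Commands AST checkpoint, listening for a wake word)
wake_word_classifier = pipeline(
    "audio-classification", model="MIT/ast-finetuned-speech-commands-v2"
)

# 2. Speech transcription
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-tiny.en")

# 3. Language model query (hypothetical helper around whatever LLM you use)
def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your language model here")

# 4. Speech synthesis
tts = pipeline("text-to-speech", model="suno/bark-small")

def respond(audio):
    text = transcriber(audio)["text"]
    answer = query_llm(text)
    return tts(answer)
```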
  • Goal: Transcribe and analyze meeting audio recordings.
  • Demonstrated:
    • Multi-speaker transcription and diarization.
    • Structuring transcripts with speaker labels and timestamps.
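
A sketch combining pyannote speaker diarization with timestamped Whisper transcription (the file name is a placeholder; the pyannote checkpoint requires a Hugging Face access token):

```python
from pyannote.audio import Pipeline as DiarizationPipeline
from transformers import pipeline

# Who spoke when (pass your HF token via use_auth_token if required)
diarization = DiarizationPipeline.from_pretrained("pyannote/speaker-diarization-3.1")
turns = diarization("meeting.wav")

# What was said, with timestamps in transcript["chunks"]
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
transcript = asr("meeting.wav", return_timestamps=True)

# Speaker turns, to align against the transcript chunks' timestamps
for turn, _, speaker in turns.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
```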

About

Audio Transformer-based architectures and fine-tuning (music genre classification, automatic speech recognition, speech-to-text), showcasing applications such as speech-to-speech translation, a voice assistant, and meeting transcription.