- ASR I
- Automatic speech recognition (ASR)
- Metrics for ASR
- CTC loss function
- "Listen, Attend and Spell" architecture
- Beam-search
- This lecture assumes you are familiar with the attention mechanism. If you are not confident about this
topic, we recommend reading one of the following materials:
- How Attention works in Deep Learning: understanding the attention mechanism in sequence models
    - Written in simple language and easy to understand. Not too technical; gives a surface-level overview.
- Sequence to Sequence (seq2seq) and Attention
    - Nicely illustrated, with detailed math explanations. A bit bulky.
- Seminar 1 Audio augmentations with code and examples
- Seminar 2 Practical exercises
    - Writing and testing the WER metric
- Implementing CTC decoding
- Implementing CTC beam-search
- (bonus!) Intro to PyCharm
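The first two seminar exercises can be sketched in a few lines. The snippet below is a minimal illustration, not the seminar's reference solution: `wer` computes the word-level Levenshtein distance divided by the reference length, and `ctc_greedy_decode` performs greedy CTC decoding (collapse repeats, then drop blanks); the blank id of 0 is an assumption.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming Levenshtein distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            # deletion, insertion, substitution (or match)
            d[j] = min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
            prev = cur
    return d[-1] / len(ref)


def ctc_greedy_decode(frame_ids, blank=0):
    """Greedy CTC decoding: merge repeated labels, then remove blanks."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t  # a blank between repeats allows the label to be emitted twice
    return out
```

For example, `ctc_greedy_decode([0, 1, 1, 0, 1, 2, 2, 0])` yields `[1, 1, 2]`: the blank between the runs of 1s keeps both of them, while the repeated 2s collapse into one.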
All links are provided on the last slide of the lecture
- wer are we - track global ASR progress on various datasets
- (paper) Librispeech: An ASR corpus based on public domain audio books (2015)
- (paper) Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks (2006)
- (blog post) Sequence Modeling With CTC
- (paper) Deep Speech: Scaling up end-to-end speech recognition (2014)
- (paper) Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (2015)
- (blog post) Speech Recognition — Deep Speech, CTC, Listen, Attend, and Spell
- (paper) Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015)
- (paper) Listen, Attend and Spell (2015)
- (paper) SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition (2019)
- (site) http://kaldi-asr.org/