Implement and train a neural-network speech recognition system with CTC loss.
We don't accept homework if any of the following requirements are not satisfied:
- The code should be situated in a public GitHub (or GitLab) repository
- All the necessary packages should be mentioned in `./requirements.txt` or in an installation guide section of `README.md`
- All necessary resources (such as model checkpoints, LMs, and logs) should be downloadable with a script. Mention the script (or lines of code) in the `README.md`
- You should implement all functions in `test.py` so that we can check your assignment
- Generally, your `test.py` and `train.py` scripts should run without issues after running all commands in your installation guide
- You must provide the logs for the training of your final model from the start of the training. Either use W&B reports or include your tensorboard directory in the resources of your project
- Attach a brief report that includes:
  - How to reproduce your model? (example: train 50 epochs with config `train_1.json` and 50 epochs with `train_2.json`)
  - Attach training logs to show how fast your network trained
  - How did you train your final model?
  - What have you tried?
  - What worked and what didn't work?
  - What were the major challenges?

  Also attach a summary of all bonus tasks you've implemented.
`grade = quality_score - (1.0 * days_expired) - implementation_penalty + optional_tasks_score`
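As a sanity check, the grading formula can be expressed as a small helper (the function name is my own, not part of the assignment):

```python
def final_grade(quality_score, days_expired=0,
                implementation_penalty=0.0, optional_tasks_score=0.0):
    """Combine the grading components: each late day costs 1.0 point,
    penalties subtract, bonus tasks add."""
    return (quality_score
            - 1.0 * days_expired
            - implementation_penalty
            + optional_tasks_score)
```

For example, a quality score of 8.0 submitted 2 days late with a +1.0 bonus task yields `final_grade(8.0, days_expired=2, optional_tasks_score=1.0)`, i.e. 7.0.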
We also require that you fulfill the following requirements. Not fulfilling them will result in score penalties.
- (Up to **-2.0 points** if missing) Logging. Your tensorboard/W&B logs should include:
  - Text reports with random samples, including `target: {target_text}, prediction: {prediction_text}, CER: {cer}, WER: {wer}`
  - Images of your train/valid spectrograms
  - Gradient norm
  - Learning rate
  - Loss
  - Audio records (after augmentation)
- (Up to **-2.0 points** if missing) Implement a simple beam search for evaluation
- (Up to **-1.0 point** if missing) Implement at least 4 types of audio augmentations
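The "simple beam search" requirement usually means a CTC prefix beam search over per-frame character probabilities. A dependency-free sketch (all names are my own, not part of the assignment template):

```python
import math
from collections import defaultdict

def logsumexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def ctc_beam_search(log_probs, alphabet, beam_size=10, blank=0):
    """CTC prefix beam search.

    log_probs: per-frame log-probabilities, shape [T][V].
    alphabet:  string aligned with vocabulary indices
               (index `blank` is the CTC blank symbol).
    Returns the most probable collapsed transcript.
    """
    # Each prefix keeps two scores: paths ending in blank / non-blank.
    beams = {"": (0.0, -math.inf)}
    for frame in log_probs:
        new_beams = defaultdict(lambda: (-math.inf, -math.inf))
        for prefix, (p_b, p_nb) in beams.items():
            for idx, lp in enumerate(frame):
                if idx == blank:
                    b, nb = new_beams[prefix]
                    new_beams[prefix] = (logsumexp(b, logsumexp(p_b, p_nb) + lp), nb)
                    continue
                ch = alphabet[idx]
                ext = prefix + ch
                b, nb = new_beams[ext]
                if prefix and ch == prefix[-1]:
                    # Repeated char extends the prefix only via a blank path...
                    new_beams[ext] = (b, logsumexp(nb, p_b + lp))
                    # ...otherwise it merges into the same prefix.
                    ob, onb = new_beams[prefix]
                    new_beams[prefix] = (ob, logsumexp(onb, p_nb + lp))
                else:
                    new_beams[ext] = (b, logsumexp(nb, logsumexp(p_b, p_nb) + lp))
        beams = dict(sorted(new_beams.items(),
                            key=lambda kv: logsumexp(*kv[1]),
                            reverse=True)[:beam_size])
    return max(beams.items(), key=lambda kv: logsumexp(*kv[1]))[0]
```

Tracking blank/non-blank scores separately is what lets the search correctly merge paths that collapse to the same prefix, which greedy decoding cannot do.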
| Score | Dataset | CER, % | WER, % | Description |
|---|---|---|---|---|
| 1.0 | -- | -- | -- | At least you tried |
| 2.0 | LibriSpeech: test-clean | 75 | -- | It's probably just predicting common characters at this point |
| 3.0 | LibriSpeech: test-clean | 50 | -- | Well, it's something |
| 4.0 | LibriSpeech: test-clean | 30 | -- | You can guess the target phrase if you try |
| 5.0 | LibriSpeech: test-clean | 20 | -- | It gets some words right |
| 6.0 | LibriSpeech: test-clean | -- | 40 | More than half of the words are looking fine |
| 7.0 | LibriSpeech: test-clean | -- | 30 | It's quite readable |
| 8.0 | LibriSpeech: test-clean | -- | 20 | Occasional mistakes |
| 9.0 | LibriSpeech: test-other | -- | 20 | Your network can handle noisy audio. Good job. |
| 10.0 | LibriSpeech: test-other | -- | 10 | Technically better than a human. Well done! |
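The CER and WER thresholds in the table are standard edit-distance metrics, computed over characters and words respectively. A dependency-free sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (r != h)))     # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate, in percent."""
    return 100.0 * edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    """Word error rate, in percent."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return 100.0 * edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
```

Note that WER (and CER) can exceed 100% when the hypothesis contains many insertions, which is why an untrained network can land above the 1.0-score row.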
Note: all the results will be sanity-checked on an unannounced dataset. So it's not a good idea to fine-tune on a test set. It will be considered cheating.
- (**+1.0**) BPE instead of characters. You can use SentencePiece, HuggingFace, or YouTokenToMe.
- (**+1.5**) Use an external language model for evaluation. The choice of an LM-fusion method is up to you.
- (Up to **+3.0**) Train a LAS/RNN-T model (instead of CTC, or jointly with CTC). Don't forget to log your attention matrices. You can skip beam search or implement it for an extra +1.0.
- (**+3.0**) Russian ASR. We will use the test part of the Russian Common Voice dataset for estimating your `quality_score`. Note: this option has a high score value, but Russian is generally more difficult to recognize, so expect your WERs and CERs to be higher on average compared to an English dataset.
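The simplest LM-fusion method for the +1.5 bonus is shallow fusion: during beam search, each hypothesis is rescored as a weighted sum of acoustic and LM log-probabilities plus a length bonus. A minimal sketch (the weights `alpha` and `beta` are hypothetical defaults you would tune on a validation set):

```python
def shallow_fusion_score(am_log_prob, lm_log_prob, length,
                         alpha=0.5, beta=1.0):
    """Shallow-fusion beam score:

        score = log P_am(y|x) + alpha * log P_lm(y) + beta * |y|

    The length bonus `beta * |y|` counteracts the LM's bias
    toward short hypotheses.
    """
    return am_log_prob + alpha * lm_log_prob + beta * length
```

In practice you would call this inside the beam-search pruning step, replacing the purely acoustic score used for ranking hypotheses.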
We can subtract or add up to 1.0 points for extremely bad or surprisingly clean code structure.
Recommended architectures:
Training a good NN model is a challenging task that is extremely difficult to debug. We recommend you follow these steps:
- Overfit your model on a single batch of examples
- Train your model on the LJ Speech dataset (until you achieve at least 30 WER on the LJ Speech test set)
- Fine-tune your model on the LibriSpeech dataset (until you achieve at least 30 WER on the LibriSpeech clean test set)
- Fine-tune your model on a mix of LibriSpeech and Common Voice datasets (for extra quality on the LibriSpeech test sets)
If you run out of time during one of these steps, you can always submit your current best result and avoid deadline penalties.
Links:
To save some coding time it is recommended to use the HuggingFace `datasets` library. Look how easy it is:

```python
from datasets import load_dataset

# Config/split naming may vary between `datasets` versions; the "clean"
# config exposes splits such as "train.100" and "train.360".
dataset = load_dataset("librispeech_asr", "clean", split="train.360")
```