Spanish Gigaword LM recipe #2

Open
saikiranvalluri wants to merge 241 commits into master from feature/Spanish_gigaword_LM

Conversation

@saikiranvalluri
Collaborator

No description provided.

KarelVesely84 and others added 27 commits January 17, 2019 20:42
…n-probs used (kaldi-asr#3033)

This bug would likely have resulted in determinization failure (only when not using word-position-dependent phones).
@saikiranvalluri changed the title from "Spanish Gigaword LM recipe - replicates the best offline decode WER model on fisher Spanish testset" to "Spanish Gigaword LM recipe" on Feb 24, 2019
saikiranvalluri and others added 30 commits May 23, 2019 13:56
…kaldi-asr#3311)

This avoids ping-ponging memory to the host.

The implementation now assumes device memory.  Interfaces will
allocate device memory and copy to it if the data starts on the host.

Adds a CUDA matrix copy function which clamps rows.  This is much
faster than copying one row at a time, and the kernel can handle the
clamping for free.
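
For illustration, a minimal sketch of such a row-clamping copy kernel
(hypothetical names and layout, not the actual cu-matrix code): each
thread clamps its source row index into the valid range, so rows that
fall outside the input are handled inside the copy at no extra cost.

```cuda
// Illustrative sketch of a row-clamping matrix copy (not Kaldi's actual
// kernel).  Output row i is filled from input row row_map[i], with the
// index clamped into [0, num_in_rows - 1], so context rows at utterance
// edges are handled by the same kernel for free.
__global__ void copy_rows_clamped(const float *in, int in_stride,
                                  int num_in_rows, const int *row_map,
                                  float *out, int out_stride,
                                  int num_out_rows, int num_cols) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= num_out_rows || col >= num_cols) return;
  int src = min(max(row_map[row], 0), num_in_rows - 1);  // the clamp
  out[row * out_stride + col] = in[src * in_stride + col];
}
```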
* Add CUDA accelerated MFCC computation.

Creates a new directory 'cudafeat' for CUDA feature-extraction
components as they are developed.  Also adds a directory 'cudafeatbin'
for CUDA-accelerated binaries that mirror binaries elsewhere.

This commit implements:
  feature-window-cuda.h/cu which implements a feature window on the device
    by copying it from a host feature window.
  feature-mfcc-cuda.h/cu which implements the cuda mfcc feature
    extractor.
  compute-mfcc-feats-cuda.cc which mirrors compute-mfcc-feats.cc

  There were also minor changes to other files.

* Only build CUDA binaries if CUDA is enabled
…ldi-asr#3351)

Small CUDA memory copies are inefficient because each copy can
add multiple microseconds of latency.  The code as written
would copy small matrices or vectors to and from the tasks one
after another.  To avoid this I've implemented a batched matrix
copy routine.  This takes arrays of matrix descriptions for the
input and output and batches the copies into a single kernel call.
This is used in both FormatInputs and FormatOutputs to reduce
launch-latency overhead.

The kernel for the batched copy uses a trick to avoid a memory
copy of the host parameters: the parameters are put into a struct
containing a statically sized array, and that struct is then
marshalled like normal CUDA kernel parameters.  This avoids
additional launch-latency overhead.
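
A sketch of that parameter-marshalling trick (illustrative struct and
kernel names, not the actual code): because the descriptor array has a
static size inside a struct passed by value, CUDA ships it in the
kernel's launch-parameter buffer, so no separate host-to-device copy of
the descriptors is needed.

```cuda
// Illustrative sketch of batching many small matrix copies in one launch.
// The fixed-size descriptor array lives inside a struct passed by value,
// so it is marshalled with the launch itself (parameter space is limited,
// hence the static MAX_BATCH) instead of needing its own cudaMemcpy.
#define MAX_BATCH 32

struct CopyDesc {
  const float *src; int src_stride;
  float *dst;       int dst_stride;
  int rows, cols;
};

struct CopyBatch {
  CopyDesc desc[MAX_BATCH];  // static size: travels as a normal argument
  int num_copies;
};

__global__ void batched_copy(CopyBatch batch) {
  const CopyDesc &d = batch.desc[blockIdx.z];  // one z-slice per matrix
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < d.rows && col < d.cols)
    d.dst[row * d.dst_stride + col] = d.src[row * d.src_stride + col];
}
```

Launched once with gridDim.z set to num_copies, this replaces num_copies
individual kernel or memcpy calls with a single launch.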

There is still more work to do at the beginning and end of nnet3.
In particular we may want to batch the clamped memory copies and
the large number of device-to-device copies at the end.  I haven't
fully tracked those down and may return to them in the future.
…r#3358)

- end the training when there is no more data to refill one of the streams,
- this avoids over-training on the 'last' utterance (a sketch of the idea follows),
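
In rough pseudocode terms (a hypothetical C++ sketch, not the actual
nnet training loop), the change amounts to this:

```cpp
// Illustrative sketch: a multi-stream training loop that ends as soon as
// any stream cannot be refilled with a fresh utterance, rather than
// recycling the last utterance and over-training on it.
#include <algorithm>
#include <queue>
#include <vector>

int main() {
  const int num_streams = 4, chunk = 20;  // frames consumed per step
  std::queue<int> utt_lengths;            // stand-in for the utterance reader
  int lens[] = {310, 500, 220, 410, 380};
  for (int len : lens) utt_lengths.push(len);

  std::vector<int> frames_left(num_streams, 0);
  while (true) {
    bool refilled = true;
    for (int s = 0; s < num_streams && refilled; s++) {
      if (frames_left[s] == 0) {          // stream s finished its utterance
        if (utt_lengths.empty()) {
          refilled = false;               // no more data: end training here
        } else {
          frames_left[s] = utt_lengths.front();
          utt_lengths.pop();
        }
      }
    }
    if (!refilled) break;                 // stop instead of over-training
    for (int s = 0; s < num_streams; s++) // advance every stream one chunk
      frames_left[s] -= std::min(chunk, frames_left[s]);
    // ... forward/backward over this multi-stream chunk would go here ...
  }
  return 0;
}
```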