Spanish Gigaword LM recipe #2

Open
saikiranvalluri wants to merge 241 commits into master from feature/Spanish_gigaword_LM

Conversation

@saikiranvalluri
Collaborator

No description provided.

KarelVesely84 and others added 27 commits January 17, 2019 20:42
…n-probs used (kaldi-asr#3033)

This bug would likely have resulted in determinization failure (only when not using word-position-dependent phones).
@saikiranvalluri changed the title from "Spanish Gigaword LM recipe - replicates the best offline decode WER model on fisher Spanish testset" to "Spanish Gigaword LM recipe" on Feb 24, 2019
saikiranvalluri and others added 30 commits May 23, 2019 13:56
…kaldi-asr#3311)

This avoids ping-ponging memory to the host.

The implementation now assumes device memory.  Interfaces will
allocate device memory and copy to it if the data starts on the host.

Adds a CUDA matrix copy function which clamps rows.  This is much
faster than copying one row at a time, and the kernel can handle the
clamping for free.
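
For illustration, a minimal sketch of such a row-clamping copy kernel
(hypothetical names and layout, not the actual cu-matrix code): each
thread clamps its source row index into the valid range, so rows that
fall outside the input are handled inside the copy at no extra cost.

```cuda
// Illustrative sketch of a row-clamping matrix copy (not Kaldi's actual
// kernel).  Output row i is filled from input row row_map[i], with the
// index clamped into [0, num_in_rows - 1], so context rows at utterance
// edges are handled by the same kernel for free.
__global__ void copy_rows_clamped(const float *in, int in_stride,
                                  int num_in_rows, const int *row_map,
                                  float *out, int out_stride,
                                  int num_out_rows, int num_cols) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= num_out_rows || col >= num_cols) return;
  int src = min(max(row_map[row], 0), num_in_rows - 1);  // the clamp
  out[row * out_stride + col] = in[src * in_stride + col];
}
```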
* Add CUDA accelerated MFCC computation.

Creates a new directory 'cudafeat' for CUDA feature-extraction
components as they are developed.  Also adds a directory 'cudafeatbin'
for CUDA-accelerated binaries that mirror binaries elsewhere.

This commit implements:
  feature-window-cuda.h/cu which implements a feature window on the device
    by copying it from a host feature window.
  feature-mfcc-cuda.h/cu which implements the cuda mfcc feature
    extractor.
  compute-mfcc-feats-cuda.cc which mirrors compute-mfcc-feats.cc

  There were also minor changes to other files.

* Only build CUDA binaries if CUDA is enabled
…ldi-asr#3351)

Small CUDA memory copies are inefficient because each copy can
add multiple microseconds of latency.  The code as written
would copy small matrices or vectors to and from the tasks one
after another.  To avoid this I've implemented a batched matrix
copy routine.  This takes arrays of matrix descriptions for the
input and output and batches the copies into a single kernel call.
This is used in both FormatInputs and FormatOutputs to reduce
launch-latency overhead.

The kernel for the batched copy uses a trick to avoid a memory
copy of the host parameters: the parameters are put into a struct
containing a statically sized array, and that struct is then
marshalled like normal CUDA kernel parameters.  This avoids
additional launch-latency overhead.
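
A sketch of that parameter-marshalling trick (illustrative struct and
kernel names, not the actual code): because the descriptor array has a
static size inside a struct passed by value, CUDA ships it in the
kernel's launch-parameter buffer, so no separate host-to-device copy of
the descriptors is needed.

```cuda
// Illustrative sketch of batching many small matrix copies in one launch.
// The fixed-size descriptor array lives inside a struct passed by value,
// so it is marshalled with the launch itself (parameter space is limited,
// hence the static MAX_BATCH) instead of needing its own cudaMemcpy.
#define MAX_BATCH 32

struct CopyDesc {
  const float *src; int src_stride;
  float *dst;       int dst_stride;
  int rows, cols;
};

struct CopyBatch {
  CopyDesc desc[MAX_BATCH];  // static size: travels as a normal argument
  int num_copies;
};

__global__ void batched_copy(CopyBatch batch) {
  const CopyDesc &d = batch.desc[blockIdx.z];  // one z-slice per matrix
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < d.rows && col < d.cols)
    d.dst[row * d.dst_stride + col] = d.src[row * d.src_stride + col];
}
```

Launched once with gridDim.z set to num_copies, this replaces num_copies
individual kernel or memcpy calls with a single launch.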

There is still more work to do at the beginning and end of nnet3.
In particular we may want to batch the clamped memory copies and
the large number of device-to-device copies at the end.  I haven't
fully tracked those down and may return to them in the future.
…r#3358)

- end the training when there is no more data to refill one of the streams,
- this avoids over-training on the 'last' utterance (a sketch of the idea follows),
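
In rough pseudocode terms (a hypothetical C++ sketch, not the actual
nnet training loop), the change amounts to this:

```cpp
// Illustrative sketch: a multi-stream training loop that ends as soon as
// any stream cannot be refilled with a fresh utterance, rather than
// recycling the last utterance and over-training on it.
#include <algorithm>
#include <queue>
#include <vector>

int main() {
  const int num_streams = 4, chunk = 20;  // frames consumed per step
  std::queue<int> utt_lengths;            // stand-in for the utterance reader
  int lens[] = {310, 500, 220, 410, 380};
  for (int len : lens) utt_lengths.push(len);

  std::vector<int> frames_left(num_streams, 0);
  while (true) {
    bool refilled = true;
    for (int s = 0; s < num_streams && refilled; s++) {
      if (frames_left[s] == 0) {          // stream s finished its utterance
        if (utt_lengths.empty()) {
          refilled = false;               // no more data: end training here
        } else {
          frames_left[s] = utt_lengths.front();
          utt_lengths.pop();
        }
      }
    }
    if (!refilled) break;                 // stop instead of over-training
    for (int s = 0; s < num_streams; s++) // advance every stream one chunk
      frames_left[s] -= std::min(chunk, frames_left[s]);
    // ... forward/backward over this multi-stream chunk would go here ...
  }
  return 0;
}
```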