kaldi-asr#3000) thanks: armando.muscariello@gmail.com
…s broken by commit cc2469e). (kaldi-asr#3025)
…n-probs used (kaldi-asr#3033). The bug would likely have resulted in determinization failure (only when not using word-position-dependent phones).
…kaldi-asr#3311) This avoids ping-ponging memory between device and host. The implementation now assumes device memory; interfaces will allocate device memory and copy to it if the data starts on the host. Also adds a CUDA matrix copy function that clamps row indices. This is much faster than copying one row at a time, and the kernel handles the clamping for free.
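A minimal sketch of the row-clamping copy idea described above. The kernel name, parameters, and layout here are illustrative assumptions, not Kaldi's actual API; the point is that each thread clamps its requested source row into range, so out-of-bounds context rows cost nothing extra.

```cuda
// Hypothetical sketch: copy dst_rows rows starting at row_offset from src
// into dst, clamping the source row index into [0, src_rows-1]. Matrices
// are assumed dense row-major with stride == src_cols for simplicity.
__global__ void copy_rows_clamped(const float *src, int src_rows, int src_cols,
                                  float *dst, int dst_rows, int row_offset) {
  int r = blockIdx.y * blockDim.y + threadIdx.y;  // destination row
  int c = blockIdx.x * blockDim.x + threadIdx.x;  // column
  if (r < dst_rows && c < src_cols) {
    int src_r = r + row_offset;
    // Clamp inside the kernel instead of special-casing edge rows on the host.
    src_r = min(max(src_r, 0), src_rows - 1);
    dst[r * src_cols + c] = src[src_r * src_cols + c];
  }
}
```

Because the clamp is a pair of register operations per thread, it is effectively free compared with issuing separate per-row copies from the host.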
* Add CUDA accelerated MFCC computation.
Creates a new directory 'cudafeat' for placing CUDA feature extraction
components as they are developed. Added a directory 'cudafeatbin' for
placing CUDA-accelerated binaries that mirror binaries
elsewhere.
This commit implements:
feature-window-cuda.h/cu, which implements a feature window on the device
by copying it from a host feature window.
feature-mfcc-cuda.h/cu, which implements the CUDA MFCC feature
extractor.
compute-mfcc-feats-cuda.cc, which mirrors compute-mfcc-feats.cc.
There were also minor changes to other files.
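The feature-window approach described above can be sketched roughly as follows. The struct and member names are hypothetical, not the actual `cudafeat` classes; the sketch only shows the pattern of computing the window coefficients once on the host and mirroring them into device memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical device-side feature window: the window coefficients
// (e.g. a Povey or Hamming window, as computed by the existing CPU
// feature-window code) are copied to the GPU once at construction, so
// per-frame windowing never touches host memory again.
struct CudaFeatureWindow {
  float *d_window = nullptr;
  int size = 0;

  explicit CudaFeatureWindow(const std::vector<float> &host_window)
      : size(static_cast<int>(host_window.size())) {
    cudaMalloc(&d_window, size * sizeof(float));
    cudaMemcpy(d_window, host_window.data(), size * sizeof(float),
               cudaMemcpyHostToDevice);
  }
  ~CudaFeatureWindow() { cudaFree(d_window); }
};
```

A kernel extracting MFCC frames would then multiply each frame elementwise by `d_window` on the device.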
* Only build cuda binaries if cuda is enabled
…ldi-asr#3351) Small CUDA memory copies are inefficient because each copy can add multiple microseconds of latency. The code as written would copy small matrices or vectors to and from the tasks one after another. To avoid this I've implemented a batched matrix copy routine. It takes arrays of matrix descriptions for the input and output and batches the copies into a single kernel call. This is used in both FormatInputs and FormatOutputs to reduce launch-latency overhead.

The kernel for the batched copy uses a trick to avoid a memory copy of the host parameters: the parameters are placed in a struct containing a statically sized array, and the struct is then marshalled like normal CUDA kernel parameters. This avoids additional launch-latency overhead.

There is still more work to do at the beginning and end of nnet3. In particular we may want to batch the clamped memory copies and the large number of D2D copies at the end. I haven't fully tracked those down and may return to them in the future.
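The parameter-marshalling trick above can be sketched as follows. All names, the batch cap, and the descriptor layout are assumptions for illustration, not the actual nnet3 code; the key point is that a struct passed by value to the kernel rides along in the normal kernel-argument buffer, so no separate host-to-device copy of the descriptor array is needed.

```cuda
// Assumed cap; a real implementation would size this to stay under the
// CUDA kernel-parameter limit (4 KB on most architectures).
#define MAX_BATCH 32

// Description of one matrix-to-matrix copy.
struct CopyDesc {
  const float *src;
  float *dst;
  int rows, cols, src_stride, dst_stride;
};

// Fixed-size array embedded in a struct so it can be passed BY VALUE:
// CUDA marshals it with the other kernel parameters, avoiding an extra
// cudaMemcpy of the descriptor list before launch.
struct BatchedCopyParams {
  CopyDesc desc[MAX_BATCH];
  int num_copies;
};

__global__ void batched_matrix_copy(BatchedCopyParams p) {
  const CopyDesc &d = p.desc[blockIdx.z];  // one grid z-slice per matrix
  int r = blockIdx.y * blockDim.y + threadIdx.y;
  int c = blockIdx.x * blockDim.x + threadIdx.x;
  if (r < d.rows && c < d.cols)
    d.dst[r * d.dst_stride + c] = d.src[r * d.src_stride + c];
}
```

The launch would use `gridDim.z = num_copies`, with the x/y grid sized to cover the largest matrix in the batch; threads outside a given matrix's bounds simply return. All copies then share a single kernel launch instead of paying per-copy launch latency.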
…r#3358) - End the training when there is no more data to refill one of the streams. - This avoids overtraining on the 'last' utterance.