Open
Description
Hey, this is awesome, thank you! It is way faster than the reference implementation.
I had some numerical stability issues when training a Mamba model with a chunk size of 16, batch size 8, sequence length 256, dim 32, and state dim 8. I found that adding a small epsilon term of 1e-11 in this division got rid of all the NaNs. The model seemed to train just the same, but I'm not sure what the implications of adding this epsilon are.
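To make the question concrete, here is a minimal NumPy sketch of the kind of change I mean. It is not the actual code from this repo: `safe_divide`, `num`, and `den` are made-up names, and I'm assuming float32 and a non-negative denominator. It also hints at why training might look the same: in float32, 1e-11 is far below the rounding step for values around 1, so it only changes the result when the denominator is at or very near zero.

```python
import numpy as np

def safe_divide(num, den, eps=1e-11):
    # Hypothetical helper, not part of the library: adds a small epsilon
    # to the denominator so that 0 / 0 no longer produces NaN.
    # Assumes den >= 0; a denominator equal to -eps would still break.
    return num / (den + eps)

num = np.array([0.0, 1.0, 2.0], dtype=np.float32)
den = np.array([0.0, 1.0, 4.0], dtype=np.float32)

# Plain division: the first element is 0 / 0, which is NaN.
with np.errstate(invalid="ignore", divide="ignore"):
    raw = num / den

# With epsilon: the first element becomes 0 / 1e-11 = 0, and the other
# elements are unchanged because 1e-11 is lost to float32 rounding.
safe = safe_divide(num, den)

print(raw)
print(safe)
```

One caveat with this pattern is that a nonzero numerator over a near-zero denominator still gives a very large value, so the epsilon removes the NaN but not necessarily the underlying blow-up.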
How do you obtain the last hidden state, like the one returned by the original reference function? Is it just rearrange(hprefix, 'B G D N -> B')?