The bytes decoder is trained on sequences padded up to length N.
So for a batch of B samples with L words each, we create a batch of B×L words and decode those.
This number can be very large: a batch size of 128 with up to 512 words per sample yields a decoder batch of 65,536 words.
Using pack_sequence and some engineering, we could probably pack these words into fewer byte sequences by joining short words into a single training sample. For example, the word "a" is about 2 tokens while "hello" is about 6, so we could fit three "a"s into the same training sample.
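A rough sketch of the packing idea (the token counts, max length, and the `pack_words` helper are all hypothetical, just to illustrate the bin-packing step; the real version would operate on actual byte sequences and feed the result to the decoder):

```python
# Greedy first-fit-decreasing bin packing: instead of one padded row
# per word, join short words so their token counts fill a fixed-length
# sample. Returns a list of bins, each a list of word indices whose
# lengths sum to at most max_len.
def pack_words(word_lengths, max_len):
    order = sorted(range(len(word_lengths)), key=lambda i: -word_lengths[i])
    bins, capacities = [], []
    for i in order:
        n = word_lengths[i]
        for b, cap in enumerate(capacities):
            if cap >= n:          # word fits in an existing bin
                bins[b].append(i)
                capacities[b] -= n
                break
        else:                     # no bin has room: open a new one
            bins.append([i])
            capacities.append(max_len - n)
    return bins

# Example from the issue: "hello" ~6 tokens, each "a" ~2 tokens.
lengths = [6, 2, 2, 2]
packed = pack_words(lengths, max_len=6)
# Four padded rows collapse into two packed samples:
# one holding "hello", one holding the three "a"s.
```

This doesn't by itself handle loss masking across word boundaries inside a packed sample; that is part of the "some engineering" mentioned above.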