The bytes decoder is trained on sequences padded up to length N.
So for a batch of B samples with L words each, we create a batch of B×L words and decode those.
This number can be very large: a batch size of 128 with up to 512 words per sample yields a decoder batch of 65,536 words.
Using pack_sequence and some engineering, we could probably pack these words into fewer byte sequences by joining short words into a single training sample. For example, the word "a" is about 2 tokens while "hello" is about 6, so we could fit three "a"s into the same training sample.
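A rough sketch of the packing idea (the token counts, max length, and the `pack_words` helper are all hypothetical, just to illustrate the bin-packing step; the real version would operate on actual byte sequences and feed the result to the decoder):

```python
# Greedy first-fit-decreasing bin packing: instead of one padded row
# per word, join short words so their token counts fill a fixed-length
# sample. Returns a list of bins, each a list of word indices whose
# lengths sum to at most max_len.
def pack_words(word_lengths, max_len):
    order = sorted(range(len(word_lengths)), key=lambda i: -word_lengths[i])
    bins, capacities = [], []
    for i in order:
        n = word_lengths[i]
        for b, cap in enumerate(capacities):
            if cap >= n:          # word fits in an existing bin
                bins[b].append(i)
                capacities[b] -= n
                break
        else:                     # no bin has room: open a new one
            bins.append([i])
            capacities.append(max_len - n)
    return bins

# Example from the issue: "hello" ~6 tokens, each "a" ~2 tokens.
lengths = [6, 2, 2, 2]
packed = pack_words(lengths, max_len=6)
# Four padded rows collapse into two packed samples:
# one holding "hello", one holding the three "a"s.
```

This doesn't by itself handle loss masking across word boundaries inside a packed sample; that is part of the "some engineering" mentioned above.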