
Conversation


Copilot AI commented Nov 8, 2025

The bytes decoder flattens (B, L, T) inputs to (B×L, T) for training, padding each word to max length T. With B=128, L=512, T=32, this creates 65,536 sequences of 32 tokens each (2.1M tokens total), despite most words being 2-5 tokens long.
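
For concreteness, a back-of-the-envelope check of the padded baseline (plain Python; the average word length is an assumption, not a measurement):

```python
# Padded baseline: every word becomes its own length-T decoder sequence.
B, L, T = 128, 512, 32             # batch size, words per sample, max word length

num_sequences = B * L              # 65,536 decoder sequences per training batch
padded_tokens = num_sequences * T  # 2,097,152 (~2.1M) tokens, most of them padding

avg_word_len = 4                   # assumed: typical words are 2-5 tokens long
useful_tokens = num_sequences * avg_word_len  # ~262K tokens carry actual signal

print(num_sequences, padded_tokens, useful_tokens)
```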

Changes

Core Implementation (welt/model.py)

  • _pack_sequences_for_decoding(): Greedily packs sequences until reaching max_packed_length = T × 2, tracking indices for unpacking (a sketch follows this list)
  • _unpack_logits(): Reconstructs original (B, L, T, vocab_size) shape from packed decoder outputs
  • Modified parallel_causal_decode(): Routes through packing pipeline transparently
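
Below is a minimal sketch of the greedy packing idea referenced in the first bullet, assuming the per-word token sequences arrive as a list of 1-D tensors. The function name mirrors the PR, but the body is illustrative rather than the exact code in welt/model.py:

```python
import torch

def pack_sequences_for_decoding(sequences, max_packed_length):
    """Greedily concatenate variable-length word sequences into packs no longer
    than max_packed_length (T * 2 in the PR), recording (pack_id, start, length)
    for each word so its logits can later be routed back to its original slot."""
    packs, index = [], []
    current, current_len = [], 0
    for seq in sequences:
        n = seq.numel()
        if current and current_len + n > max_packed_length:
            packs.append(torch.cat(current))
            current, current_len = [], 0
        index.append((len(packs), current_len, n))
        current.append(seq)
        current_len += n
    if current:
        packs.append(torch.cat(current))
    return packs, index
```

The recorded (pack_id, start, length) triples are what the unpacking step consumes; a counterpart sketch appears after the Edge Cases list below.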

Edge Cases

  • Handles empty sequences (length 0)
  • Preserves per-word latent vector prepending
  • Zero-pads unpacked positions (ignored by loss via attention mask); see the sketch after this list
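
A hedged sketch of the unpack-and-mask behavior from the last bullet (shapes and helper names are illustrative, not the exact welt/model.py code): packed logits are scattered back into a zero-initialized (B, L, T, vocab_size) tensor, and the loss averages only over positions where the attention mask is set, so the zero padding never contributes.

```python
import torch
import torch.nn.functional as F

def unpack_logits(packed_logits, index, B, L, T, vocab_size):
    """Scatter per-word slices of the packed decoder outputs back into
    (B, L, T, vocab_size). Positions that were never decoded stay zero."""
    out = packed_logits[0].new_zeros(B * L, T, vocab_size)
    for word_id, (pack_id, start, length) in enumerate(index):
        out[word_id, :length] = packed_logits[pack_id][start:start + length]
    return out.view(B, L, T, vocab_size)

def masked_cross_entropy(logits, targets, attention_mask):
    """Average cross-entropy over real token positions only (mask == 1)."""
    per_token = F.cross_entropy(
        logits.flatten(0, 2), targets.flatten(), reduction="none")
    mask = attention_mask.flatten().float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```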

Performance

Typical English text (simulated):

  • 82.9% fewer tokens processed (2.1M → 359K)
  • 91.0% fewer decoder passes (65,536 → 5,880)

Short words (maximum benefit):

  • 90.8% token reduction
  • 95.3% pass reduction

Long words (minimal benefit):

  • 35.2% token reduction
  • 61.5% pass reduction

Testing

  • Unit tests verify packing/unpacking correctness (a round-trip sketch follows this list)
  • Correctness tests confirm packed outputs match unpacked baseline within floating point precision
  • Demo script (examples/demo_packing_efficiency.py) illustrates efficiency gains across word length distributions
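
As an illustration of the first bullet, here is a pytest-style round-trip check that reuses the pack_sequences_for_decoding sketch from the Changes section above; the repository's actual tests may be structured differently:

```python
import torch

def test_pack_unpack_roundtrip():
    """Packing then slicing back via the recorded index must reproduce
    every original word exactly, including the empty one."""
    torch.manual_seed(0)
    T = 8
    lengths = [1, 3, 8, 0, 5, 2]                      # includes an empty word
    words = [torch.randint(0, 256, (n,)) for n in lengths]

    packs, index = pack_sequences_for_decoding(words, max_packed_length=2 * T)

    for word, (pack_id, start, n) in zip(words, index):
        assert torch.equal(packs[pack_id][start:start + n], word)
    assert all(p.numel() <= 2 * T for p in packs)
```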

No changes required to training code—packing is automatic and preserves all model behavior.

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • esm.ubuntu.com
    • Triggering command: /usr/lib/apt/methods/https (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

This section describes the original issue you should resolve

<issue_title>Train bytes decoder with on-the-fly packing</issue_title>
<issue_description>The bytes decoder is trained with the sequences up to length N being padded.
So for a batch with B samples and L words we create a batch of (BxL) words and decode those.
This number can be very large - 128 batch size of maximum 512 words for example creates a 65536 words batch size for the decoder.

Using pack_sequence, and some engineering, we can probably pack these words into fewer byte sequences, taking short words and joining them into a single training sample (for example, if we have the word "a" it is like 2 tokens, but the word "hello" is like 6, so we could fit 3 "a"s in the same training sample)</issue_description>

<agent_instructions>This is an interesting proposition. Figure out how to implement in a clean way, and test your solution</agent_instructions>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 6 commits November 8, 2025 05:26
Copilot AI changed the title [WIP] Train bytes decoder with on-the-fly packing Implement on-the-fly sequence packing for bytes decoder training Nov 8, 2025
Copilot AI requested a review from AmitMY November 8, 2025 05:36