
Conversation


Copilot AI commented Nov 8, 2025

The bytes decoder flattens (B, L, T) inputs to (B×L, T) for training, padding each word to max length T. With B=128, L=512, T=32, this creates 65,536 sequences of 32 tokens each (2.1M tokens total), despite most words being 2-5 tokens long.
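
For concreteness, a back-of-the-envelope check of the padded baseline (plain Python; the average word length is an assumption, not a measurement):

```python
# Padded baseline: every word becomes its own length-T decoder sequence.
B, L, T = 128, 512, 32             # batch size, words per sample, max word length

num_sequences = B * L              # 65,536 decoder sequences per training batch
padded_tokens = num_sequences * T  # 2,097,152 (~2.1M) tokens, most of them padding

avg_word_len = 4                   # assumed: typical words are 2-5 tokens long
useful_tokens = num_sequences * avg_word_len  # ~262K tokens carry actual signal

print(num_sequences, padded_tokens, useful_tokens)
```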

Changes

Core Implementation (welt/model.py)

  • _pack_sequences_for_decoding(): Greedily packs sequences until reaching max_packed_length = T × 2, tracking indices for unpacking (a sketch follows this list)
  • _unpack_logits(): Reconstructs original (B, L, T, vocab_size) shape from packed decoder outputs
  • Modified parallel_causal_decode(): Routes through packing pipeline transparently
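
Below is a minimal sketch of the greedy packing idea referenced in the first bullet, assuming the per-word token sequences arrive as a list of 1-D tensors. The function name mirrors the PR, but the body is illustrative rather than the exact code in welt/model.py:

```python
import torch

def pack_sequences_for_decoding(sequences, max_packed_length):
    """Greedily concatenate variable-length word sequences into packs no longer
    than max_packed_length (T * 2 in the PR), recording (pack_id, start, length)
    for each word so its logits can later be routed back to its original slot."""
    packs, index = [], []
    current, current_len = [], 0
    for seq in sequences:
        n = seq.numel()
        if current and current_len + n > max_packed_length:
            packs.append(torch.cat(current))
            current, current_len = [], 0
        index.append((len(packs), current_len, n))
        current.append(seq)
        current_len += n
    if current:
        packs.append(torch.cat(current))
    return packs, index
```

The recorded (pack_id, start, length) triples are what the unpacking step consumes; a counterpart sketch appears after the Edge Cases list below.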

Edge Cases

  • Handles empty sequences (length 0)
  • Preserves per-word latent vector prepending
  • Zero-pads unpacked positions (ignored by loss via attention mask); see the sketch after this list
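
A hedged sketch of the unpack-and-mask behavior from the last bullet (shapes and helper names are illustrative, not the exact welt/model.py code): packed logits are scattered back into a zero-initialized (B, L, T, vocab_size) tensor, and the loss averages only over positions where the attention mask is set, so the zero padding never contributes.

```python
import torch
import torch.nn.functional as F

def unpack_logits(packed_logits, index, B, L, T, vocab_size):
    """Scatter per-word slices of the packed decoder outputs back into
    (B, L, T, vocab_size). Positions that were never decoded stay zero."""
    out = packed_logits[0].new_zeros(B * L, T, vocab_size)
    for word_id, (pack_id, start, length) in enumerate(index):
        out[word_id, :length] = packed_logits[pack_id][start:start + length]
    return out.view(B, L, T, vocab_size)

def masked_cross_entropy(logits, targets, attention_mask):
    """Average cross-entropy over real token positions only (mask == 1)."""
    per_token = F.cross_entropy(
        logits.flatten(0, 2), targets.flatten(), reduction="none")
    mask = attention_mask.flatten().float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```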

Performance

Typical English text (simulated):

  • 82.9% fewer tokens processed (2.1M → 359K)
  • 91.0% fewer decoder passes (65,536 → 5,880)

Short words (maximum benefit):

  • 90.8% token reduction
  • 95.3% pass reduction

Long words (minimal benefit):

  • 35.2% token reduction
  • 61.5% pass reduction

Testing

  • Unit tests verify packing/unpacking correctness (a round-trip sketch follows this list)
  • Correctness tests confirm packed outputs match unpacked baseline within floating point precision
  • Demo script (examples/demo_packing_efficiency.py) illustrates efficiency gains across word length distributions
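
As an illustration of the first bullet, here is a pytest-style round-trip check that reuses the pack_sequences_for_decoding sketch from the Changes section above; the repository's actual tests may be structured differently:

```python
import torch

def test_pack_unpack_roundtrip():
    """Packing then slicing back via the recorded index must reproduce
    every original word exactly, including the empty one."""
    torch.manual_seed(0)
    T = 8
    lengths = [1, 3, 8, 0, 5, 2]                      # includes an empty word
    words = [torch.randint(0, 256, (n,)) for n in lengths]

    packs, index = pack_sequences_for_decoding(words, max_packed_length=2 * T)

    for word, (pack_id, start, n) in zip(words, index):
        assert torch.equal(packs[pack_id][start:start + n], word)
    assert all(p.numel() <= 2 * T for p in packs)
```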

No changes required to training code—packing is automatic and preserves all model behavior.

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • esm.ubuntu.com
    • Triggering command: /usr/lib/apt/methods/https (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

This section describes the original issue you should resolve

<issue_title>Train bytes decoder with on-the-fly packing</issue_title>
<issue_description>The bytes decoder is trained with the sequences up to length N being padded.
So for a batch with B samples and L words we create a batch of (BxL) words and decode those.
This number can be very large - 128 batch size of maximum 512 words for example creates a 65536 words batch size for the decoder.

Using pack_sequence, and some engineering, we can probably pack these words into fewer byte sequences, taking short words and joining them into a single training sample (for example, if we have the word "a" it is like 2 tokens, but the word "hello" is like 6, so we could fit 3 "a"s in the same training sample)</issue_description>

<agent_instructions>This is an interesting proposition. Figure out how to implement in a clean way, and test your solution</agent_instructions>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 6 commits November 8, 2025 05:26
Copilot AI changed the title [WIP] Train bytes decoder with on-the-fly packing Implement on-the-fly sequence packing for bytes decoder training Nov 8, 2025
Copilot AI requested a review from AmitMY November 8, 2025 05:36