For example, a long input string consisting entirely of letters takes a very long time to encode, because it ends up as one giant mergeBytePairs call:
$input = implode('', array_map(fn() => 'abcdefghijklmnopqrstuvwxyz'[mt_rand(0, 25)], range(1, 100_000)));
(new \Yethee\Tiktoken\EncoderProvider())->getForModel('gpt-4o-mini')->encode($input); // takes > 5 minutes
I'm curious what other BPE implementations do here. Perhaps there is a way to handle long inputs to mergeBytePairs with a different algorithm that scales better?
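For reference, one approach other BPE implementations use to make the merge loop scale better is a min-heap of candidate pairs over a doubly linked list of parts, which runs in roughly O(n log n) instead of rescanning the whole piece after every merge. The sketch below is not tiktoken-php's actual code; the function name, the `$ranks` map (pair => merge priority), and the pre-split `$parts` array are stand-ins for the encoder's real merge table and byte splitting:

```php
<?php
// Sketch: heap + linked-list BPE merging. Stale heap entries are
// skipped lazily by re-checking the pair string when popped.
function mergeBytePairsFast(array $parts, array $ranks): array
{
    $n = count($parts);
    if ($n < 2) {
        return $parts;
    }
    $prev = range(-1, $n - 2);
    $next = range(1, $n); // $next[$n - 1] === $n acts as a sentinel
    $alive = array_fill(0, $n, true);

    $heap = new SplPriorityQueue(); // max-heap, so priorities are negated
    $push = function (int $i) use (&$parts, &$next, &$heap, $ranks, $n) {
        $j = $next[$i];
        if ($j >= $n) {
            return;
        }
        $pair = $parts[$i] . $parts[$j];
        if (isset($ranks[$pair])) {
            // Lowest rank first; ties broken by leftmost position.
            $heap->insert([$i, $pair], [-$ranks[$pair], -$i]);
        }
    };
    for ($i = 0; $i < $n - 1; $i++) {
        $push($i);
    }

    while (!$heap->isEmpty()) {
        [$i, $pair] = $heap->extract();
        $j = $next[$i];
        if (!$alive[$i] || $j >= $n || $parts[$i] . $parts[$j] !== $pair) {
            continue; // stale entry: one side was already merged away
        }
        $parts[$i] .= $parts[$j]; // merge the right part into the left
        $alive[$j] = false;
        $next[$i] = $next[$j];
        if ($next[$j] < $n) {
            $prev[$next[$j]] = $i;
        }
        // Only the two pairs touching the merged node can change.
        if ($prev[$i] >= 0) {
            $push($prev[$i]);
        }
        $push($i);
    }

    $out = [];
    for ($i = 0; $i < $n; $i = $next[$i]) {
        $out[] = $parts[$i];
    }
    return $out;
}
```

With toy ranks `['ab' => 0, 'cd' => 1, 'abcd' => 2]`, `mergeBytePairsFast(['a', 'b', 'c', 'd'], $ranks)` collapses to `['abcd']`, and because merges are applied in rank order (not left to right), `['a', 'b', 'c']` with `['bc' => 0, 'ab' => 1]` yields `['a', 'bc']`.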
One easy (but perhaps undesirable) solution is to limit the length of the input to mergeBytePairs: if $piece is longer than n characters, break it into smaller chunks and merge each one separately. This would be fast but might not find the optimal encoding (it could report more tokens than strictly required); for large enough chunk sizes, though, it should be close to optimal. Maybe this could be an optional argument to the encode method, e.g. ->encode($text, exact: false)?
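A rough sketch of that chunking workaround, written as a standalone helper rather than as a change inside the library. The function name, the default chunk size, and the `$encode` callable (standing in for the encoder's `encode()` method) are all assumptions for illustration; splitting on character boundaries via `mb_str_split` avoids cutting a multibyte character in half, but the chunk seams are still where the token count can drift from optimal:

```php
<?php
// Sketch: encode a long string in fixed-size character chunks and
// concatenate the token arrays. Each chunk is merged independently,
// so the result may contain slightly more tokens than an exact
// single-pass encode.
function encodeChunked(string $text, callable $encode, int $chunkSize = 10_000): array
{
    $tokens = [];
    foreach (mb_str_split($text, $chunkSize) as $chunk) {
        $tokens = array_merge($tokens, $encode($chunk));
    }
    return $tokens;
}
```

With the real library this would be called as something like `encodeChunked($input, fn (string $s) => $encoder->encode($s))`, where `$encoder` comes from `EncoderProvider::getForModel()` as in the reproduction above.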