Performance degrades to quadratic for some input strings #25

@mikethea1

Description

For example, a long input string that is all letters takes a very long time to run because it ends up as one giant mergeBytePairs call:

$input = implode('', array_map(fn () => 'abcdefghijklmnopqrstuvwxyz'[mt_rand(0, 25)], range(1, 100_000)));
(new \Yethee\Tiktoken\EncoderProvider())->getForModel('gpt-4o-mini')->encode($input); // takes > 5 minutes

I'm curious what other BPE implementations do here. Perhaps there is a way to handle long inputs to mergeBytePairs with a different algorithm that scales better?
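For comparison, some tokenizer implementations (e.g. Hugging Face's tokenizers crate) avoid the repeated full scan by keeping the parts in a doubly linked list and the candidate merges in a min-heap, skipping stale heap entries lazily, which brings the merge down to roughly O(n log n). A minimal PHP sketch of that idea follows; the function name and the `"$left $right"` rank-key format are illustrative assumptions, not this library's internals:

```php
<?php
declare(strict_types=1);

/**
 * Sketch: byte-pair merging with a doubly linked list over the parts and a
 * min-heap of candidate merges. $ranks maps "left right" pair keys to merge
 * priority (lower = merged earlier). Illustrative only.
 */
function mergePairsHeap(array $parts, array $ranks): array
{
    $n = count($parts);
    if ($n < 2) {
        return $parts;
    }
    $prev = [];
    $next = [];
    $alive = array_fill(0, $n, true);
    for ($i = 0; $i < $n; $i++) {
        $prev[$i] = $i - 1;
        $next[$i] = $i + 1; // $n acts as the end-of-list sentinel
    }

    $heap = new SplMinHeap(); // entries: [rank, leftIndex, leftPart, rightPart]
    $push = function (int $i) use (&$parts, &$next, $ranks, $heap, $n): void {
        $j = $next[$i];
        if ($j >= $n) {
            return;
        }
        $key = $parts[$i] . ' ' . $parts[$j];
        if (isset($ranks[$key])) {
            $heap->insert([$ranks[$key], $i, $parts[$i], $parts[$j]]);
        }
    };
    for ($i = 0; $i < $n - 1; $i++) {
        $push($i);
    }

    while (!$heap->isEmpty()) {
        [, $i, $a, $b] = $heap->extract();
        $j = $next[$i];
        // Skip entries invalidated by an earlier merge (lazy deletion).
        if (!$alive[$i] || $j >= $n || !$alive[$j] || $parts[$i] !== $a || $parts[$j] !== $b) {
            continue;
        }
        $parts[$i] = $a . $b;  // merge the right part into the left
        $alive[$j] = false;
        $next[$i] = $next[$j];
        if ($next[$j] < $n) {
            $prev[$next[$j]] = $i;
        }
        if ($prev[$i] >= 0) {
            $push($prev[$i]);  // new pair with the left neighbour
        }
        $push($i);             // new pair with the right neighbour
    }

    $out = [];
    for ($i = 0; $i < $n; $i = $next[$i]) {
        $out[] = $parts[$i];
    }
    return $out;
}
```

Equal-rank ties break toward the leftmost pair because the heap entries compare by rank first, then index, which matches the usual BPE convention.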

One easy (but perhaps undesirable) solution is to cap the length of inputs to mergeBytePairs: if $piece is longer than n bytes, break it into smaller chunks and merge each one separately. This would be fast, but might not find the optimal encoding (it could report more tokens than strictly required). For large enough chunk sizes it should be close to optimal, though. Maybe this could be an optional argument to the encode method: ->encode($text, exact: false)?
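The chunking idea could be sketched as a wrapper like the one below. `encodeChunked` and its `$chunkSize` default are hypothetical names, not part of the library's API; `$encode` stands in for any callable mapping a string to a token array (e.g. a closure around Encoder::encode). Note that a chunk boundary can split a pair a single pass would have merged, so the output may contain a few extra tokens:

```php
<?php
declare(strict_types=1);

// Hypothetical wrapper: encode long text in fixed-size chunks so no single
// mergeBytePairs call sees a huge piece.
function encodeChunked(callable $encode, string $text, int $chunkSize = 10000): array
{
    if ($text === '') {
        return [];
    }
    $tokens = [];
    foreach (str_split($text, $chunkSize) as $chunk) {
        $tokens = array_merge($tokens, $encode($chunk));
    }
    return $tokens;
}

// Toy stand-in encoder (one token per byte) just to exercise the wrapper:
$toyEncode = fn (string $s): array => array_map('ord', str_split($s));
$tokens = encodeChunked($toyEncode, str_repeat('ab', 8), 4);
```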
