When tokenizing large input strings, `mergeBytePairs` seems to be the bottleneck even when it isn't [degrading to quadratic](https://github.com//issues/25).
On my workload, a small change that caches the results of `mergeBytePairs` yielded a ~33% speedup, since larger files tend to contain many repeated pieces.
The improvement isn't free, since every call now pays the overhead of a cache lookup.
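To make the idea concrete, here is a minimal sketch of the caching approach. The merge function below is a toy stand-in (its name, signature, and merge rule are assumptions for illustration, not the real `mergeBytePairs`); the point is only the memoization wrapper, which lets repeated pieces skip the merge entirely.

```python
from functools import lru_cache

# Toy stand-in for the merge step: NOT the real mergeBytePairs.
# It just packs adjacent bytes into 16-bit ids to simulate work.
def merge_byte_pairs(piece: bytes) -> tuple[int, ...]:
    ids = []
    for i in range(0, len(piece) - 1, 2):
        ids.append(piece[i] << 8 | piece[i + 1])
    if len(piece) % 2:
        ids.append(piece[-1])  # trailing odd byte kept as-is
    return tuple(ids)

# Cached wrapper: repeated pieces (common in large files) hit the
# cache instead of re-running the merge. Keyed on the piece bytes.
cached_merge = lru_cache(maxsize=None)(merge_byte_pairs)

pieces = [b"the", b"cat", b"the", b"the"]
tokens = [cached_merge(p) for p in pieces]
print(cached_merge.cache_info())  # 2 hits, 2 misses for this input
```

The unbounded `maxsize=None` is the simplest choice for a sketch; a real implementation would likely want a bounded or LRU-evicting cache so memory stays proportional to the number of distinct pieces actually worth keeping.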
Have you considered this optimization before? Would you entertain a PR to add it?