When tokenizing large input strings, `mergeBytePairs` seems to be the bottleneck even when it isn't [degrading to quadratic](https://github.com//issues/25).
On my workload, a small change that caches the results of `mergeBytePairs` yielded a ~33% speedup, since larger files tend to contain many repeated pieces.
The improvement isn't free, since every call now pays the overhead of a cache lookup.
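To make the idea concrete, here is a minimal sketch of the caching approach. The merge function below is a toy stand-in (its name, signature, and merge rule are assumptions for illustration, not the real `mergeBytePairs`); the point is only the memoization wrapper, which lets repeated pieces skip the merge entirely.

```python
from functools import lru_cache

# Toy stand-in for the merge step: NOT the real mergeBytePairs.
# It just packs adjacent bytes into 16-bit ids to simulate work.
def merge_byte_pairs(piece: bytes) -> tuple[int, ...]:
    ids = []
    for i in range(0, len(piece) - 1, 2):
        ids.append(piece[i] << 8 | piece[i + 1])
    if len(piece) % 2:
        ids.append(piece[-1])  # trailing odd byte kept as-is
    return tuple(ids)

# Cached wrapper: repeated pieces (common in large files) hit the
# cache instead of re-running the merge. Keyed on the piece bytes.
cached_merge = lru_cache(maxsize=None)(merge_byte_pairs)

pieces = [b"the", b"cat", b"the", b"the"]
tokens = [cached_merge(p) for p in pieces]
print(cached_merge.cache_info())  # 2 hits, 2 misses for this input
```

The unbounded `maxsize=None` is the simplest choice for a sketch; a real implementation would likely want a bounded or LRU-evicting cache so memory stays proportional to the number of distinct pieces actually worth keeping.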
Have you considered this optimization before? Would you entertain a PR to add it?