
feat: memory-efficient token counting functionality #13

Merged
bluescreen10 merged 2 commits into tiktoken-go:main from amalucelli:count
Feb 17, 2025

Conversation

@amalucelli
Contributor

amalucelli commented Feb 10, 2025

This PR adds a new Count() method to the Codec interface. It counts tokens without allocating the slices of token IDs and strings that len(Encode(...)) would build. This is particularly useful when you only need the token count of a text, such as when checking whether it fits within a model's context window.
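
For illustration only, a minimal sketch of the intended usage under assumed signatures; the real Codec interface in tiktoken-go returns more values from Encode, so treat the names and types below as placeholders:

```go
// Illustrative only: hypothetical signatures, not the library's actual API.
package example

import "fmt"

// Codec is a simplified stand-in for the interface discussed in this PR;
// the real Encode also returns token strings, and the signatures may differ.
type Codec interface {
	Encode(input string) ([]uint, error) // allocates a slice of token IDs
	Count(input string) (int, error)     // returns only the number of tokens
}

// fitsInBudget reports whether text stays within a token budget without
// materializing the token IDs, which is the use case described above.
func fitsInBudget(c Codec, text string, budget int) (bool, error) {
	n, err := c.Count(text)
	if err != nil {
		return false, fmt.Errorf("counting tokens: %w", err)
	}
	return n <= budget, nil
}
```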

@bluescreen10
Contributor

Hi, thanks for this pull request. I'll look into this.

amalucelli and others added 2 commits February 13, 2025 11:50
This uses iterators for the common logic, so Count and Encode can do their different things
@bluescreen10
Contributor

Hey @amalucelli, can you take a look? I've modified the code slightly to move the logic shared between encode and count into a method.
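
For context, a rough sketch of that shape (assuming Go 1.23 range-over-func iterators; the tokenize name mirrors the snippet quoted later in this thread, but the real tokenization logic and types may differ):

```go
// Sketch only: Encode and Count sharing a single tokenize iterator.
package example

import (
	"iter"
	"strings"
)

type codec struct{}

// tokenize holds the common logic and yields (id, token) pairs one at a time.
// The whitespace split is a placeholder for the real BPE segmentation.
func (c *codec) tokenize(input string) iter.Seq2[uint, string] {
	return func(yield func(uint, string) bool) {
		for i, tok := range strings.Fields(input) {
			if !yield(uint(i), tok) {
				return
			}
		}
	}
}

// Encode collects the yielded pairs into slices.
func (c *codec) Encode(input string) ([]uint, []string) {
	var ids []uint
	var toks []string
	for id, tok := range c.tokenize(input) {
		ids = append(ids, id)
		toks = append(toks, tok)
	}
	return ids, toks
}

// Count ranges over the same iterator but only increments a counter.
func (c *codec) Count(input string) int {
	count := 0
	for range c.tokenize(input) {
		count++
	}
	return count
}
```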

Contributor Author

amalucelli left a comment

LGTM, thank you!

bluescreen10 merged commit 8ac6fdf into tiktoken-go:main Feb 17, 2025
1 check passed
@amalucelli
Contributor Author

@bluescreen10 I was running this change in production and noticed that Count() still allocates a significant amount of memory and CPU at times:

[Profiling screenshots from 2025-02-21 showing the memory and CPU usage]

After reviewing the changes one more time, I think it's likely due to this, since the returned values are discarded but still allocated:

	for _, _ = range c.tokenize(input) {
		count++
	}

In my initial suggestion, there was a different and isolated method for counting:

cd27b29#diff-8b594c387517838ab44fd9d9f7a5f8c1e3efa32486a1ba4b923858e4e6538955R22-R45
cd27b29#diff-8b594c387517838ab44fd9d9f7a5f8c1e3efa32486a1ba4b923858e4e6538955R163-R166

Would you reconsider an approach that isolates Count to avoid this memory allocation?
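
Not from the PR, just to illustrate the idea: an isolated counting path could tally token boundaries without ever building the per-token values that the shared iterator allocates and discards (the whitespace segmentation below is a placeholder, not the library's BPE logic):

```go
// Hypothetical sketch of an allocation-free counting path.
package example

import "unicode"

type codec struct{}

// countTokens walks the input once and only counts token boundaries;
// no token ID slice and no token string copies are created.
// Splitting on whitespace stands in for the real BPE segmentation.
func (c *codec) countTokens(input string) int {
	count := 0
	inToken := false
	for _, r := range input {
		if unicode.IsSpace(r) {
			inToken = false
			continue
		}
		if !inToken {
			count++
			inToken = true
		}
	}
	return count
}
```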

@bluescreen10
Contributor

Thanks, I will look into this. What I didn't like about the old approach was the code duplication between count and the normal tokenizing process. I'll see if we can avoid some of that penalty.

I'm curious about your use case, can you tell me more?

@amalucelli
Contributor Author

amalucelli commented Feb 21, 2025

Thanks for that! I can't get into details, but we have token budget constraints for the content we send to the LLM and rely on this library for that, so in our case we only ever need the total token count for a string.

@bluescreen10
Contributor

Lol, I didn't mean to get you into trouble. I was thinking that maybe it's an embedded use case in which memory/CPU is critical.

@amalucelli
Contributor Author

That's all good. The thing is that we see frequent memory spikes, and they often trace back to this. If you want, I can review it again and propose something, but I'm not that familiar with this code base, so I'm not sure how much optimization you want here.

