
feat: memory-efficient token counting functionality #13

Merged
bluescreen10 merged 2 commits into tiktoken-go:main from amalucelli:count
Feb 17, 2025

Conversation

@amalucelli
Contributor

amalucelli commented Feb 10, 2025

This PR adds a new Count() method to the Codec interface. It counts tokens without allocating the slices of token IDs and strings that len(Encode(...)) would build. This is particularly useful when you only need the token count of a text, such as when checking whether it fits within a model's context window.
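
For illustration only, a minimal sketch of the intended usage under assumed signatures; the real Codec interface in tiktoken-go returns more values from Encode, so treat the names and types below as placeholders:

```go
// Illustrative only: hypothetical signatures, not the library's actual API.
package example

import "fmt"

// Codec is a simplified stand-in for the interface discussed in this PR;
// the real Encode also returns token strings, and the signatures may differ.
type Codec interface {
	Encode(input string) ([]uint, error) // allocates a slice of token IDs
	Count(input string) (int, error)     // returns only the number of tokens
}

// fitsInBudget reports whether text stays within a token budget without
// materializing the token IDs, which is the use case described above.
func fitsInBudget(c Codec, text string, budget int) (bool, error) {
	n, err := c.Count(text)
	if err != nil {
		return false, fmt.Errorf("counting tokens: %w", err)
	}
	return n <= budget, nil
}
```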

@bluescreen10
Contributor

Hi, thanks for this pull request. I'll look into this.

amalucelli and others added 2 commits February 13, 2025 11:50
This uses iterators for the common logic, so Count and Encode can do their different things
@bluescreen10
Contributor

Hey @amalucelli, can you take a look? I've modified the code slightly to move the logic shared between encode and count into a method.
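
For context, a rough sketch of that shape (assuming Go 1.23 range-over-func iterators; the tokenize name mirrors the snippet quoted later in this thread, but the real tokenization logic and types may differ):

```go
// Sketch only: Encode and Count sharing a single tokenize iterator.
package example

import (
	"iter"
	"strings"
)

type codec struct{}

// tokenize holds the common logic and yields (id, token) pairs one at a time.
// The whitespace split is a placeholder for the real BPE segmentation.
func (c *codec) tokenize(input string) iter.Seq2[uint, string] {
	return func(yield func(uint, string) bool) {
		for i, tok := range strings.Fields(input) {
			if !yield(uint(i), tok) {
				return
			}
		}
	}
}

// Encode collects the yielded pairs into slices.
func (c *codec) Encode(input string) ([]uint, []string) {
	var ids []uint
	var toks []string
	for id, tok := range c.tokenize(input) {
		ids = append(ids, id)
		toks = append(toks, tok)
	}
	return ids, toks
}

// Count ranges over the same iterator but only increments a counter.
func (c *codec) Count(input string) int {
	count := 0
	for range c.tokenize(input) {
		count++
	}
	return count
}
```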

Contributor Author

amalucelli left a comment

LGTM, thank you!

bluescreen10 merged commit 8ac6fdf into tiktoken-go:main Feb 17, 2025
1 check passed
@amalucelli
Contributor Author

@bluescreen10 I was running this change in production and noticed that Count() still allocates a significant amount of memory and CPU at times:

[Profiling screenshots from 2025-02-21 showing the memory and CPU usage]

After reviewing the changes one more time, I think it's likely due to this, since the returned values are discarded but still allocated:

	for _, _ = range c.tokenize(input) {
		count++
	}

In my initial suggestion, there was a different and isolated method for counting:

cd27b29#diff-8b594c387517838ab44fd9d9f7a5f8c1e3efa32486a1ba4b923858e4e6538955R22-R45
cd27b29#diff-8b594c387517838ab44fd9d9f7a5f8c1e3efa32486a1ba4b923858e4e6538955R163-R166

Would you reconsider an approach that isolates Count to avoid this memory allocation?
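
Not from the PR, just to illustrate the idea: an isolated counting path could tally token boundaries without ever building the per-token values that the shared iterator allocates and discards (the whitespace segmentation below is a placeholder, not the library's BPE logic):

```go
// Hypothetical sketch of an allocation-free counting path.
package example

import "unicode"

type codec struct{}

// countTokens walks the input once and only counts token boundaries;
// no token ID slice and no token string copies are created.
// Splitting on whitespace stands in for the real BPE segmentation.
func (c *codec) countTokens(input string) int {
	count := 0
	inToken := false
	for _, r := range input {
		if unicode.IsSpace(r) {
			inToken = false
			continue
		}
		if !inToken {
			count++
			inToken = true
		}
	}
	return count
}
```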

@bluescreen10
Contributor

Thanks, I will look into this. What I didn't like about the old approach was the code duplication between count and the normal tokenizing process. I'll see if we can avoid some of that penalty.

I'm curious about your use case, can you tell me more?

@amalucelli
Contributor Author

amalucelli commented Feb 21, 2025

Thanks for that! I can't get into details, but we have token budget constraints for the content we send to the LLM and rely on this library for that, so in our case we only ever need the total token count for a string.

@bluescreen10
Contributor

Lol, I didn't mean to get you into trouble. I was thinking that maybe it's an embedded use case in which memory/CPU is critical.

@amalucelli
Contributor Author

That's all good. The thing is that we see frequent memory spikes, and they often trace back to this. If you want, I can review it again and propose something, but I'm not that familiar with this code base, so I'm not sure how much optimization you want here.

