[Question] Possibility of per-encoding imports so only selected vocabularies are embedded in the binary? #22


Hi, 

Thanks for the library. I’m experimenting with it to estimate token counts, and I have a question about how encodings are linked into the final Go binary. I’m trying to understand whether it’s possible to embed only the vocab for the encodings I actually use, or whether all supported encodings get pulled into the binary by default.

Concretely:

- I built a small test program that uses only a single encoding (o200k_base).
- The resulting binary is around 14 MB.
- I built a second program that uses one more encoding (cl100k_base) in addition to o200k_base.
- The resulting binary is still about the same size.
- From looking at the repo, the o200k_base vocab itself is on the order of 4 MB.

This makes me suspect that either:
- multiple vocabularies are being linked in even when I only use one encoding, or
- there is some shared structure that forces all encodings to be kept by the linker.

My questions:
1. Is the current design such that importing and using a single encoding (e.g. just o200k_base) will still embed all of the other vocabularies in the final binary, or are unused encodings actually stripped by the Go linker? 
2. Is there a recommended way today to build an application so that only one (or a small set) of encodings is embedded in the binary?

Related to this, a common pattern in Go for this kind of problem is the “driver style” layout (similar to database/sql):
- A small core package that defines types and a registration API (e.g. RegisterEncoding / GetEncoding).
- One subpackage per encoding (cl100k_base, o200k_base, etc.), each with its own embedded vocab and an init() that registers itself with the core.
- Optionally, an “all” package that simply imports all encoding subpackages for users who really want everything.

This allows:
- Users who only need one encoding to import only that single subpackage, and the linker can drop the rest.
- Users who need multiple encodings to import whichever ones they want.
- Users who want “everything” to import a single “all” package.
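
To make the proposal concrete, here is a minimal single-file sketch of the registration pattern described above. Everything in it is hypothetical: the `Encoding` type, the `RegisterEncoding` / `GetEncoding` / `Registered` functions, and the placeholder vocab are illustrations, not this library's actual API. In a real layout the `init()` below would live in its own subpackage (one per encoding) next to a `//go:embed` directive for that encoding's vocab file.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Encoding is a hypothetical stand-in for the library's real encoding type.
type Encoding struct {
	Name  string
	Vocab []byte // in a real subpackage this would be filled from a //go:embed file
}

var (
	mu       sync.RWMutex
	registry = map[string]func() *Encoding{}
)

// RegisterEncoding stores a constructor under a name. Encoding subpackages
// would call it from init(), the way SQL drivers call sql.Register.
func RegisterEncoding(name string, load func() *Encoding) {
	mu.Lock()
	defer mu.Unlock()
	if _, dup := registry[name]; dup {
		panic("encoding registered twice: " + name)
	}
	registry[name] = load
}

// GetEncoding returns a registered encoding, or an error when the
// corresponding subpackage was never imported.
func GetEncoding(name string) (*Encoding, error) {
	mu.RLock()
	load, ok := registry[name]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("encoding %q is not registered (missing import?)", name)
	}
	return load(), nil
}

// Registered lists the names of all registered encodings in sorted order.
func Registered() []string {
	mu.RLock()
	defer mu.RUnlock()
	names := make([]string, 0, len(registry))
	for n := range registry {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// Assumption: in the proposed layout this init would sit in a separate
// package (for example a hypothetical ".../encodings/o200k_base").
func init() {
	RegisterEncoding("o200k_base", func() *Encoding {
		return &Encoding{Name: "o200k_base", Vocab: []byte("placeholder vocab")}
	})
}

func main() {
	enc, err := GetEncoding("o200k_base")
	if err != nil {
		panic(err)
	}
	fmt.Println(enc.Name, len(enc.Vocab))

	// cl100k_base was never registered, so this returns an error.
	_, err = GetEncoding("cl100k_base")
	fmt.Println(err)
	fmt.Println(Registered())
}
```

With the encodings split into separate packages, an application would select them with blank imports (for example `import _ ".../o200k_base"`, path hypothetical). A package that is never imported is not compiled into the binary at all, so its embedded vocab cannot end up there, regardless of how aggressive the linker's dead-code elimination is.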

Would you be open to supporting a layout like this (or something similar) so that binary size can scale more linearly with the number of encodings actually used? I would be happy to work on it with some guidance from you.

I looked at https://github.com/pkoukk/tiktoken-go. It pulls things from the internet in some cases and needs a dir variable etc., which means that shipping a static binary is not directly possible there.

_Even just clarifying how things work currently (whether only the used encodings are linked or all of them) would already be very helpful for understanding and profiling binary size_

Thanks!


