[Question] Possibility of per-encoding imports so only selected vocabularies are embedded in the binary? #22

@ppipada

Hi,

Thanks for the library. I'm experimenting with it to estimate token counts, and I have a question about how encodings are linked into the final Go binary. I'm trying to understand whether it's possible to embed only the vocab for the encodings I actually use, or whether all supported encodings get pulled into the binary by default.

Concretely:

  • I build a small test program that only uses a single encoding (o200k_base).
  • The resulting binary is around 14 MB.
  • I build a second program that uses one more encoding (cl100k_base) in addition to o200k_base.
  • The resulting binary is still the same size (around 14 MB).
  • From looking at the repo, the o200k_base vocab itself is on the order of ~4 MB.

This makes me suspect that either:

  • multiple vocabularies are being linked in even when I only use one encoding, or
  • there is some shared structure that forces all encodings to be kept by the linker.

My questions:

  1. Is the current design such that importing and using a single encoding (e.g. just o200k_base) will still embed all of the other vocabularies in the final binary, or are unused encodings actually stripped by the Go linker?
  2. Is there a recommended way today to build an application so that only one (or a small set) of encodings is embedded in the binary?

Related to this, a common pattern in Go for this kind of problem is the “driver style” layout (similar to database/sql):

  • A small core package that defines types and a registration API (e.g. RegisterEncoding / GetEncoding).
  • One subpackage per encoding (cl100k_base, o200k_base, etc.), each with its own embedded vocab and an init() that registers itself with the core.
  • Optionally, an “all” package that simply imports all encoding subpackages for users who really want everything.

This allows:

  • Users who only need one encoding to import only that single subpackage, and the linker can drop the rest.
  • Users who need multiple encodings to import whichever ones they want.
  • Users who want “everything” to import a single “all” package.
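To make the idea concrete, here is a minimal single-file sketch of the registration pattern. Everything in it is hypothetical and not taken from this library: the `Encoding` type, the `RegisterEncoding` / `GetEncoding` signatures, and the placeholder vocab bytes are assumptions for illustration only. In the real layout the `init()` below would live in its own subpackage (one per encoding) next to its embedded vocab, and users would activate it with a blank import.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Encoding is a hypothetical stand-in for the type the core package would expose.
type Encoding struct {
	Name  string
	Vocab []byte // in a real subpackage this would come from embedded data
}

var (
	mu       sync.RWMutex
	registry = map[string]func() *Encoding{}
)

// RegisterEncoding records a loader under a name (hypothetical API).
// It panics on duplicates, mirroring how database/sql treats drivers.
func RegisterEncoding(name string, load func() *Encoding) {
	mu.Lock()
	defer mu.Unlock()
	if _, dup := registry[name]; dup {
		panic("encoding registered twice: " + name)
	}
	registry[name] = load
}

// GetEncoding returns a registered encoding, or an error if no
// subpackage has registered that name.
func GetEncoding(name string) (*Encoding, error) {
	mu.RLock()
	load, ok := registry[name]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("encoding %q not registered (missing import?)", name)
	}
	return load(), nil
}

// RegisteredEncodings lists the registered names in sorted order.
func RegisteredEncodings() []string {
	mu.RLock()
	defer mu.RUnlock()
	names := make([]string, 0, len(registry))
	for name := range registry {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// In the proposed layout this init would sit in a separate o200k_base
// subpackage; it is inlined here only to keep the sketch in one file.
func init() {
	RegisterEncoding("o200k_base", func() *Encoding {
		return &Encoding{Name: "o200k_base", Vocab: []byte("placeholder vocab")}
	})
}

func main() {
	enc, err := GetEncoding("o200k_base")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("loaded:", enc.Name)

	// cl100k_base was never registered, so this reports an error.
	if _, err := GetEncoding("cl100k_base"); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println("registered:", RegisteredEncodings())
}
```

With this shape, a package that is never imported contributes nothing to the build, so its vocab data cannot end up in the binary. The core package itself stays small because it holds only the registry and the shared types.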

Would you be open to supporting a layout like this (or something similar) so that binary size scales more linearly with the number of encodings actually used? I would be happy to work on it with some guidance from you.

I also looked at https://github.com/pkoukk/tiktoken-go. It pulls data from the internet in some cases and needs a directory variable to be set, etc., which means that shipping a self-contained static binary is not directly possible there.

Even just clarifying how things work currently (whether only the used encodings are linked, or all of them) would already be very helpful for understanding and profiling binary size.

Thanks!
