
bug: runtime download of tiktoken cl100k_base encoding breaks air-gapped deployments #5337

@jgarciao

Description

System Info

I've found this problem when running 0.6.0.1+rhai0 in a disconnected OpenShift cluster running the Red Hat Llama Stack Distribution included in OpenShift AI 3.4 EA2 (nightly build).

The disconnected cluster had an https_proxy configured so it could call external LLMs, and in the proxy logs I could see https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken being downloaded.
If the cluster were 100% disconnected, with the models deployed locally, vector stores would not work because of this.

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

Description

Since PR #4870 (merged Feb 9, 2026), the first vector store file attachment triggers a runtime
HTTP download of the tiktoken cl100k_base encoding file (~1.7 MB) from
https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken.

This breaks air-gapped and network-restricted deployments where outbound internet access is
unavailable or limited to approved endpoints. The server fails on the first
vector_stores.files.create() call because it cannot reach Azure Blob Storage.

Root Cause

PR #4870 replaced the local Llama 3 tokenizer with tiktoken.get_encoding("cl100k_base") for
document chunking. Unlike the previous tokenizer — which loaded BPE ranks from a local .model
file via tiktoken.load_tiktoken_bpe() — the new code path requires a network fetch on first use:

# src/llama_stack/providers/utils/memory/vector_store.py
from functools import cache

import tiktoken

@cache
def _get_encoding(name: str) -> tiktoken.Encoding:
    # first call downloads the BPE file from openaipublic.blob.core.windows.net
    return tiktoken.get_encoding(name)

def make_overlapped_chunks(..., chunk_tokenizer_encoding: str = "cl100k_base"):
    encoding = _get_encoding(chunk_tokenizer_encoding)
    tokens = encoding.encode(text)
    ...

The previous code used Tokenizer.get_instance() from
llama_stack.models.llama.llama3.tokenizer, which required no network access at all.
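For operators assessing workarounds, it helps to know where tiktoken looks before downloading: it honors the TIKTOKEN_CACHE_DIR environment variable and names cached files after the SHA-1 hex digest of the source URL. A stdlib-only sketch (the URL is the one observed in the proxy logs; the cache-key scheme is an assumption based on tiktoken's loader, not llama-stack code):

```python
# Sketch: compute the on-disk path tiktoken would use for the cl100k_base file.
# Assumption: tiktoken caches downloaded blobs under TIKTOKEN_CACHE_DIR, keyed
# by the SHA-1 hex digest of the blob URL.
import hashlib
import os

BLOB_URL = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"

def expected_cache_path(cache_dir: str) -> str:
    cache_key = hashlib.sha1(BLOB_URL.encode()).hexdigest()
    return os.path.join(cache_dir, cache_key)

cache_dir = os.environ.get("TIKTOKEN_CACHE_DIR", "/.cache/tiktoken")
print(expected_cache_path(cache_dir))
```

If that file is already present, tiktoken.get_encoding("cl100k_base") can succeed with no network access, which is what makes a pre-caching workaround viable at all.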

Impact

  • Air-gapped and network-restricted deployments (common in enterprise OpenShift / Kubernetes
    environments) cannot use vector store file operations without granting egress to
    openaipublic.blob.core.windows.net.
  • The failure only manifests at runtime on the first file attachment, making it difficult to
    catch during deployment validation.
  • This is the same class of issue as BerriAI/litellm#23218.

Suggested Fix

Option A — Pre-cache at image build time (immediate, no code change)

Add two instructions to containers/Containerfile,
after the cleanup step and before the entrypoint setup:

# Pre-cache tiktoken cl100k_base encoding to avoid runtime download
# from openaipublic.blob.core.windows.net (used by vector_store chunking)
ENV TIKTOKEN_CACHE_DIR="/.cache/tiktoken"
RUN python3 -c "import tiktoken; tiktoken.get_encoding('cl100k_base')"

Two follow-ups for this option:

  1. Add it to the Building Custom Distributions documentation as a recommended step for custom
    Containerfiles, so downstream image builders are aware of the runtime network dependency.
  2. Keep a comment in the Containerfile explaining why the pre-caching is needed.

Option B — Bundle the encoding in the package

Ship the cl100k_base.tiktoken file alongside the llama-stack source and point
TIKTOKEN_CACHE_DIR at it, removing the network dependency from both image builds and runtime.
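A hedged sketch of how the bundling could work. Everything here is hypothetical (the helper name, the bundled file location, and the assumption that tiktoken keys its cache by the SHA-1 of the blob URL); it is not existing llama-stack API:

```python
# Hypothetical helper: copy a bundled cl100k_base.tiktoken file into a cache
# directory under the name tiktoken expects, so get_encoding() never needs
# the network. Names and paths are illustrative only.
import hashlib
import os
import shutil

BLOB_URL = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"

def install_bundled_encoding(bundled_path: str, cache_dir: str) -> str:
    """Place the bundled encoding file where tiktoken's cache lookup finds it."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_key = hashlib.sha1(BLOB_URL.encode()).hexdigest()
    target = os.path.join(cache_dir, cache_key)
    shutil.copyfile(bundled_path, target)
    # Point tiktoken at this directory unless the operator already set one.
    os.environ.setdefault("TIKTOKEN_CACHE_DIR", cache_dir)
    return target
```

Called once at startup, before the first tiktoken.get_encoding() call, this would keep both image builds and runtime fully offline.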

Option C — Revert to a local tokenizer

Revert to a tokenizer that loads from local files (e.g. the Llama 3 tokenizer used before #4870),
or make the tokenizer fully configurable so deployments can choose a local-only option.
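A sketch of what "fully configurable" could look like. Everything here is hypothetical, including the registry names and the trivial whitespace fallback; a real change would register the existing Llama 3 tokenizer and a tiktoken-backed entry:

```python
# Hypothetical tokenizer registry: deployments pick a chunk tokenizer by name,
# and air-gapped deployments can select a local-only implementation.
from typing import Callable, Dict, List

# Each factory returns an encode function: str -> list of tokens.
_TOKENIZER_FACTORIES: Dict[str, Callable[[], Callable[[str], List[str]]]] = {}

def register_tokenizer(name: str, factory: Callable[[], Callable[[str], List[str]]]) -> None:
    _TOKENIZER_FACTORIES[name] = factory

def get_chunk_tokenizer(name: str) -> Callable[[str], List[str]]:
    if name not in _TOKENIZER_FACTORIES:
        raise ValueError(f"unknown chunk tokenizer: {name!r}")
    return _TOKENIZER_FACTORIES[name]()

# Local-only fallback (illustrative): whitespace splitting, no downloads.
register_tokenizer("whitespace", lambda: lambda text: text.split())
```

An entry for "cl100k_base" would wrap tiktoken behind the same interface, so only deployments that explicitly choose it take on the network dependency.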


Expected behavior

Vector stores in llama-stack should work in disconnected (air-gapped) environments without any runtime network access.
