Problem:
Foundation Models in AnyLanguageModel have a hard context window limit (e.g., 4096 tokens). Currently, when this limit is exceeded, the session fails with an error:
```
exceededContextWindowSize(FoundationModels.LanguageModelSession.GenerationError.Context(debugDescription: "Content contains 4098 tokens, which exceeds the maximum allowed context size of 4096.", underlyingErrors: [Provided 4,098 tokens, but the maximum allowed is 4,096.]))
```
Other models typically have higher limits but will reach a limit at some point as well.
While developers could handle this themselves, this library is positioned as a universal interface, and hosted backends like OpenAI, Claude, etc. often make this kind of adjustment automatically. It would therefore be ideal for the library to handle it for local models as well.
This is not just a problem for long conversations — tool calls can also contribute large amounts of data. For example:
• Tool outputs containing structured data, summaries, or long document content
• Multiple tool calls within the same session, each adding its prompt and result
• Reference guides, lookup tables, or document embeddings stored in session history
All of these count toward the token limit, making it easy to exceed 4096 tokens even if the visible chat is short.
Proposed Enhancement:
1. Token Tracking per Session
• Automatically count tokens for every piece of context in a session: system prompts, user messages, assistant responses, tool calls, and tool outputs.
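As an illustration, the ledger for this could look something like the sketch below. All of the names here (`TokenCounter`, `ContextEntry`, `SessionTokenLedger`) are hypothetical, not existing AnyLanguageModel API; a real implementation would hook into the session transcript and each backend's tokenizer.

```swift
/// Hypothetical hook: each backend supplies its own tokenizer-based count.
protocol TokenCounter {
    func countTokens(in text: String) -> Int
}

/// Hypothetical entry type: anything that occupies context window space.
enum ContextEntry {
    case systemPrompt(String)
    case userMessage(String)
    case assistantResponse(String)
    case toolCall(name: String, input: String)
    case toolOutput(name: String, output: String)

    var text: String {
        switch self {
        case .systemPrompt(let s), .userMessage(let s), .assistantResponse(let s):
            return s
        case .toolCall(_, let input):
            return input
        case .toolOutput(_, let output):
            return output
        }
    }
}

/// Running total for a session, updated as entries are appended.
struct SessionTokenLedger {
    private(set) var totalTokens = 0
    private let counter: TokenCounter

    init(counter: TokenCounter) {
        self.counter = counter
    }

    mutating func record(_ entry: ContextEntry) {
        totalTokens += counter.countTokens(in: entry.text)
    }
}
```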
2. Automatic Trimming / Summarization
• When adding new content would exceed the model’s context window:
• Drop or summarize the oldest messages and tool outputs until the new input fits.
• Optionally allow developers to mark certain tool outputs or messages as “persistent” so they are never discarded.
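Building on the hypothetical `ContextEntry` above, the trimming pass could look roughly like this (again just a sketch, not proposed API):

```swift
/// Hypothetical trimming pass: evict the oldest non-persistent entries until
/// the incoming input fits inside the model's context window.
struct ContextWindow {
    var entries: [(entry: ContextEntry, tokens: Int, persistent: Bool)] = []
    let maxTokens: Int  // e.g. 4096 for Foundation Models

    var usedTokens: Int { entries.reduce(0) { $0 + $1.tokens } }

    /// Returns false when the input can never fit, e.g. because every
    /// remaining entry is marked persistent.
    mutating func makeRoom(forIncoming incomingTokens: Int) -> Bool {
        while usedTokens + incomingTokens > maxTokens {
            // Evict the oldest entry the developer has not pinned.
            guard let index = entries.firstIndex(where: { !$0.persistent }) else {
                return false
            }
            entries.remove(at: index)
        }
        return true
    }
}
```

A summarization-based variant would replace each evicted entry with a condensed version instead of removing it outright.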
3. Tool-Aware Handling
• Treat tool calls specially:
• Tool inputs and outputs may be large but can be summarized or compressed when stored in session history.
• For example, storing only essential fields or summaries instead of full JSON.
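A rough sketch of the "essential fields only" idea (the helper function and the weather example are made up for illustration):

```swift
import Foundation

/// Hypothetical compaction step: keep only the fields the conversation still
/// needs instead of storing the tool's full JSON payload in history.
func compactToolOutput(_ json: [String: Any], keeping essentialKeys: Set<String>) -> String {
    let trimmed = json.filter { essentialKeys.contains($0.key) }
    guard let data = try? JSONSerialization.data(withJSONObject: trimmed, options: [.sortedKeys]),
          let text = String(data: data, encoding: .utf8) else {
        return "{}"
    }
    return text
}

// Example: a weather tool returns a large payload, but once the call has
// completed the session only needs two fields.
let fullResult: [String: Any] = [
    "temperature": 21.5,
    "summary": "Partly cloudy",
    "hourlyForecast": Array(repeating: ["t": 20.0], count: 48)  // large, transient
]
let stored = compactToolOutput(fullResult, keeping: ["temperature", "summary"])
// stored == #"{"summary":"Partly cloudy","temperature":21.5}"#
```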
4. Configurable Strategy
• Developers could define custom trimming rules:
• Drop entire messages
• Summarize tool outputs
• Retain only the last N tool calls
• Compress historical context into a short summary
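One possible configuration surface, purely illustrative:

```swift
/// Hypothetical knobs for the trimming behavior described above.
enum ContextTrimmingStrategy {
    /// Drop whole messages, oldest first.
    case dropOldest
    /// Replace evicted tool outputs with a model-generated summary.
    case summarizeToolOutputs
    /// Keep only the most recent N tool calls verbatim.
    case keepLastToolCalls(Int)
    /// Collapse everything older than the current turn into a short summary.
    case compressHistory(maxSummaryTokens: Int)
}

struct SessionContextOptions {
    var strategy: ContextTrimmingStrategy = .dropOldest
    /// Entries marked persistent are never trimmed, regardless of strategy.
    var persistentEntryIDs: Set<String> = []
}
```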
5. Unified Across Models
• Apply the same strategy to all model types (Foundation Models, MLX, etc.), respecting each model's maximum token limit.
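A small capability hook would be enough to keep this uniform, something along these lines (names are illustrative; the 4,096 figure comes from the error above, while MLX limits vary per loaded model):

```swift
/// Hypothetical capability: each backend reports its own window size so the
/// same trimming logic can run against any model type.
protocol ContextWindowReporting {
    /// Maximum number of tokens the model's context window holds.
    var maxContextTokens: Int { get }
}

struct FoundationModelsBackend: ContextWindowReporting {
    let maxContextTokens = 4_096
}

struct MLXBackend: ContextWindowReporting {
    let maxContextTokens: Int  // varies with the loaded model
}
```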
Benefits:
• Prevents `exceededContextWindowSize` errors without manual intervention.
• Makes long-running sessions with multiple tool calls robust.
• Enables developers to safely use heavy reference data, guides, or tool outputs without exceeding the model limit.
• Provides a clear framework for context management that can scale as models with larger windows are added.