sub(#249) Phase 1: Token estimation utility and context budget calculation #271

@mmogr

Description

Parent: #249 — Token-based context budgeting

Goal

Replace the character-based context budgeting (MAX_CONTEXT_CHARS = 180_000) with token-aware budgeting that uses the model's actual context window size.

Background

Currently, agentLoop.ts uses pruneForBudget() with a fixed character limit:

```typescript
const MAX_CONTEXT_CHARS = 180_000; // ~45K tokens at 4 chars/token
```

This is problematic because:

  • Different models have different context windows (4K, 8K, 32K, 128K)
  • 4 chars/token is a rough heuristic; actual ratio varies by language and content
  • No way to reserve space for the model's response
  • The context window size is available from the model metadata (GGUF) but not used

Implementation

1. Create src/utils/tokenEstimator.ts

```typescript
/**
 * Approximate token count using multiple heuristics.
 * More accurate than simple char/4 for mixed content.
 */
export function estimateTokens(text: string): number {
  // Heuristic: count word boundaries + special tokens.
  // filter(Boolean) drops the empty strings that split() yields for
  // leading/trailing whitespace (and for empty input).
  const words = text.split(/\s+/).filter(Boolean).length;
  const specialTokens = (text.match(/[{}[\](),;:'"<>]/g) || []).length;

  // Rough GPT-style tokenization: ~1.3 tokens per word + special chars
  return Math.ceil(words * 1.3 + specialTokens * 0.5);
}

/**
 * Estimate tokens for a message array using per-message overhead.
 */
export function estimateMessagesTokens(messages: any[]): number {
  const PER_MESSAGE_OVERHEAD = 4; // role, name, etc.
  return messages.reduce((sum, msg) => {
    // Non-string content (tool results, arrays) is stringified;
    // missing content counts as empty rather than crashing.
    const content =
      typeof msg.content === 'string' ? msg.content : JSON.stringify(msg.content ?? '');
    return sum + estimateTokens(content) + PER_MESSAGE_OVERHEAD;
  }, 0);
}
```
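To make the arithmetic concrete, here is a small self-contained sketch (the estimator is restated inline so the snippet runs on its own; the message contents are made up for illustration):

```typescript
function estimateTokens(text: string): number {
  const words = text.split(/\s+/).filter(Boolean).length;
  const specialTokens = (text.match(/[{}[\](),;:'"<>]/g) || []).length;
  return Math.ceil(words * 1.3 + specialTokens * 0.5);
}

function estimateMessagesTokens(messages: { role: string; content: string }[]): number {
  const PER_MESSAGE_OVERHEAD = 4;
  return messages.reduce(
    (sum, msg) => sum + estimateTokens(msg.content) + PER_MESSAGE_OVERHEAD,
    0,
  );
}

const messages = [
  { role: 'system', content: 'You are a helpful assistant.' },          // 5 words -> 7 tokens, +4 overhead
  { role: 'user', content: 'Summarize the file src/index.ts for me.' }, // 6 words -> 8 tokens, +4 overhead
];

console.log(estimateMessagesTokens(messages)); // 23
```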

2. Get model context window size

Pull the actual context size from the server status API (already available from GGUF metadata):

```typescript
interface ContextBudget {
  totalTokens: number;         // Model's context window (e.g., 8192)
  reserveForResponse: number;  // Tokens reserved for model output (e.g., 2048)
  reserveForTools: number;     // Space for tool definitions (estimated)
  availableForHistory: number; // totalTokens - reserves
}

export function createContextBudget(modelContextSize: number): ContextBudget {
  const reserveForResponse = Math.min(modelContextSize * 0.25, 4096);
  const reserveForTools = 500; // Estimated based on tool count
  return {
    totalTokens: modelContextSize,
    reserveForResponse,
    reserveForTools,
    availableForHistory: modelContextSize - reserveForResponse - reserveForTools,
  };
}
```
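Worked numbers for two representative context sizes, showing both sides of the 4096-token cap (the function is restated inline so the sketch runs standalone):

```typescript
function createContextBudget(modelContextSize: number) {
  const reserveForResponse = Math.min(modelContextSize * 0.25, 4096);
  const reserveForTools = 500;
  return {
    totalTokens: modelContextSize,
    reserveForResponse,
    reserveForTools,
    availableForHistory: modelContextSize - reserveForResponse - reserveForTools,
  };
}

// 8K model: 25% of 8192 = 2048, under the 4096 cap
const small = createContextBudget(8192);
// -> reserveForResponse: 2048, availableForHistory: 8192 - 2048 - 500 = 5644

// 128K model: 25% would be 32768, so the 4096 cap kicks in
const large = createContextBudget(131072);
// -> reserveForResponse: 4096, availableForHistory: 131072 - 4096 - 500 = 126476
```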

3. Update pruneForBudget() in agentLoop.ts

Replace character-based pruning with token-aware pruning:

```typescript
export function pruneForBudget(
  messages: any[],
  budget: ContextBudget,
): any[] {
  const totalTokens = estimateMessagesTokens(messages);

  if (totalTokens <= budget.availableForHistory) return messages;

  // Drop old tool result messages first (keep system + last N turns)
  // ... same logic but using token estimates instead of char counts
}
```
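The elided body might look like the following sketch. It assumes messages[0] is the system prompt and simply drops the oldest non-system message until the history fits; the real implementation would likely be smarter (e.g., keeping tool-call/tool-result pairs together). The estimator is restated inline so the sketch runs standalone:

```typescript
interface ContextBudget {
  totalTokens: number;
  reserveForResponse: number;
  reserveForTools: number;
  availableForHistory: number;
}

type Msg = { role: string; content: string };

function estimateTokens(text: string): number {
  const words = text.split(/\s+/).filter(Boolean).length;
  const specialTokens = (text.match(/[{}[\](),;:'"<>]/g) || []).length;
  return Math.ceil(words * 1.3 + specialTokens * 0.5);
}

function estimateMessagesTokens(messages: Msg[]): number {
  return messages.reduce((sum, m) => sum + estimateTokens(m.content) + 4, 0);
}

function pruneForBudget(messages: Msg[], budget: ContextBudget): Msg[] {
  const pruned = [...messages];
  // Drop the oldest non-system message (index 1) until the history fits.
  while (
    pruned.length > 1 &&
    estimateMessagesTokens(pruned) > budget.availableForHistory
  ) {
    pruned.splice(1, 1);
  }
  return pruned;
}
```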

Files to Create/Modify

| File | Action |
| --- | --- |
| src/utils/tokenEstimator.ts | Create estimateTokens(), estimateMessagesTokens() |
| src/hooks/useGglibRuntime/agentLoop.ts | Replace MAX_CONTEXT_CHARS with token-based ContextBudget |
| src/hooks/useGglibRuntime/agentLoop.ts | Update pruneForBudget() to use token estimates |
| src/hooks/useGglibRuntime/runAgenticLoop.ts | Pass model context size to budget calculation |

Acceptance Criteria

  • estimateTokens() function with reasonable accuracy (within ~20% of actual for English text)
  • ContextBudget struct that uses model's actual context window size
  • Response space reserved to prevent context window overflow
  • pruneForBudget() uses token estimates instead of character counts
  • Fallback to MAX_CONTEXT_CHARS when model context size is unknown
  • Unit tests for token estimation with various content types (English, code, JSON, unicode)
