Conversation

@nickna (Owner) commented Nov 17, 2025


This commit implements Phase 3 of the tokenization modernization epic (#857),
providing significantly improved token counting accuracy for non-OpenAI models
while maintaining backward compatibility.

## What Changed

### New Components

1. **TokenCounterFactory** (`Services/TokenCounterFactory.cs`)
   - Factory pattern for selecting appropriate token counter per model
   - Caches counter instances for performance
   - Automatically selects based on TokenizerType enum
   - Extensible design for future tokenizer additions (see the sketch after this list)

2. **FallbackTokenCounter** (`Services/FallbackTokenCounter.cs`)
   - Universal token counter with model-family-specific ratios
   - Replaces crude fixed 4:1 character estimation
   - Model-specific ratios:
     * Claude: 3.8 chars/token
     * Gemini: 3.5 chars/token
     * LLaMA: 4.0 chars/token
     * OpenAI: Uses TiktokenCounter instead
   - Conservative rounding to prevent undercharging

3. **LlamaTokenCounter** (`Services/LlamaTokenCounter.cs`)
   - Placeholder for future Microsoft.ML.Tokenizers integration
   - Currently falls back to character estimation
   - Prepared for Phase 2 implementation with tokenizer model files
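
A minimal sketch of how the factory and fallback counter described above might fit together, assuming a simplified `ITokenCounter` interface and `TokenizerType` enum; the names mirror this PR, but the signatures and enum values are illustrative, not the actual implementation:

```csharp
using System;
using System.Collections.Concurrent;

public enum TokenizerType { OpenAI, Claude, Gemini, Llama, Unknown }   // illustrative values

public interface ITokenCounter
{
    int CountTokens(string model, string text);
}

// Stand-in for the Phase 1 TiktokenSharp-based counter (exact counts for OpenAI models).
public sealed class TiktokenCounter : ITokenCounter
{
    public int CountTokens(string model, string text) =>
        throw new NotImplementedException("Backed by TiktokenSharp in the real code.");
}

// Ratio-based estimator that replaces the old fixed 4:1 character heuristic.
public sealed class FallbackTokenCounter : ITokenCounter
{
    private const double Buffer = 1.10; // +10% conservative rounding to avoid undercharging

    private static double CharsPerToken(string model) =>
        model.StartsWith("claude", StringComparison.OrdinalIgnoreCase) ? 3.8 :
        model.StartsWith("gemini", StringComparison.OrdinalIgnoreCase) ? 3.5 :
        4.0; // LLaMA and anything unrecognized

    public int CountTokens(string model, string text) =>
        (int)Math.Ceiling(text.Length / CharsPerToken(model) * Buffer);
}

// Selects a counter per TokenizerType and caches the instances.
public sealed class TokenCounterFactory
{
    private readonly TiktokenCounter _openAi;
    private readonly FallbackTokenCounter _fallback;
    private readonly ConcurrentDictionary<TokenizerType, ITokenCounter> _cache = new();

    public TokenCounterFactory(TiktokenCounter openAi, FallbackTokenCounter fallback)
    {
        _openAi = openAi;
        _fallback = fallback;
    }

    public ITokenCounter GetCounter(TokenizerType type) =>
        _cache.GetOrAdd(type, t => t == TokenizerType.OpenAI ? (ITokenCounter)_openAi : _fallback);
}
```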

### Package Updates

- Added Microsoft.ML.Tokenizers v1.0.2 for future LLaMA/native support
- Kept TiktokenSharp v1.1.8 for OpenAI model compatibility

### Service Registration Updates

Updated `ServiceCollectionExtensions.cs` to:
- Register TokenCounterFactory
- Register all token counter implementations
- Maintain backward compatibility with ITokenCounter interface
- Default to FallbackTokenCounter (better than previous fixed ratio)
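
Registration in `ServiceCollectionExtensions.cs` could then look roughly like this, reusing the illustrative types from the sketch above (the actual extension method name and lifetimes may differ):

```csharp
using Microsoft.Extensions.DependencyInjection;

public static class ServiceCollectionExtensions
{
    public static IServiceCollection AddTokenCounting(this IServiceCollection services)
    {
        // Concrete counters and the factory (LlamaTokenCounter etc. would be registered here too).
        services.AddSingleton<TiktokenCounter>();
        services.AddSingleton<FallbackTokenCounter>();
        services.AddSingleton<TokenCounterFactory>();

        // Existing consumers of ITokenCounter keep working; the default is now the
        // improved fallback counter rather than the old fixed 4:1 estimate.
        services.AddSingleton<ITokenCounter>(sp => sp.GetRequiredService<FallbackTokenCounter>());

        return services;
    }
}
```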

### Documentation

Created comprehensive documentation at `docs/claude/multi-provider-tokenization.md`:
- Architecture overview and diagrams
- Component descriptions
- Usage examples
- Migration guide
- Troubleshooting guide
- Future enhancement roadmap

## Impact

### Improved Accuracy

**Before**:
- OpenAI models: Accurate (TiktokenSharp)
- All other models: Fixed 4:1 ratio (inaccurate)

**After**:
- OpenAI models: Accurate (TiktokenSharp) ✅
- Claude models: 3.8:1 ratio (better estimate) ✅
- Gemini models: 3.5:1 ratio (better estimate) ✅
- LLaMA models: 4.0:1 ratio (reasonable estimate) ✅
- Conservative buffering: +10% to prevent undercharging ✅
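
As a worked example under these assumptions, a 1,000-character Claude prompt is estimated at ceil(1000 / 3.8) = 264 tokens, or roughly 290 after the +10% buffer, versus a flat 250 under the old fixed 4:1 rule.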

### Business Value

- More accurate billing for non-OpenAI models
- Reduced revenue loss from underestimation
- Better context window management
- Foundation for native tokenizer implementations

### Performance

- Counter instance caching reduces overhead
- No breaking changes to existing code
- Backward compatible ITokenCounter interface

## Implementation Progress

As per issue #857:
- ✅ Phase 0: Infrastructure (DB schema, enum, interfaces)
- ✅ Phase 1: OpenAI tokenizers (TiktokenSharp)
- ✅ Phase 3: Improved fallback system (this commit)
- ⏳ Phase 2: Native LLaMA tokenizers (prepared, not yet complete)
- 🔲 Phase 4: Native Claude tokenizers (planned)
- 🔲 Phase 5: Native Gemini tokenizers (planned)

## Testing Notes

- All existing tests should pass (backward compatible)
- New unit tests should be added for FallbackTokenCounter
- Integration tests should verify factory pattern works correctly
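
A sketch of the kind of unit test intended for FallbackTokenCounter, assuming xUnit and the illustrative `CountTokens(model, text)` shape from the earlier sketch (not the actual test suite):

```csharp
using System;
using Xunit;

public class FallbackTokenCounterTests
{
    [Theory]
    [InlineData("claude-3-5-sonnet", 3.8)]
    [InlineData("gemini-1.5-pro",    3.5)]
    [InlineData("llama-3-8b",        4.0)]
    public void CountTokens_AppliesModelFamilyRatioAndBuffer(string model, double charsPerToken)
    {
        var counter = new FallbackTokenCounter();
        var text = new string('a', 1000);

        // Expected: ceil(length / ratio * 1.10) with the conservative +10% buffer.
        var expected = (int)Math.Ceiling(1000 / charsPerToken * 1.10);

        Assert.Equal(expected, counter.CountTokens(model, text));
    }
}
```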

## Related

- Addresses core problem described in issue #857
- Builds on existing TokenizerType infrastructure
- Extends IModelCapabilityService usage
- Compatible with UsageEstimationService conservative buffering

Co-authored-by: Claude (Anthropic AI Assistant)

This commit completes Phase 2 of the tokenization modernization epic,
implementing automatic downloading of LLaMA tokenizer model files from
HuggingFace with intelligent caching and graceful fallback.

## What Changed

### New Components

1. **TokenizerModelLoader** (`Services/TokenizerModelLoader.cs`)
   - Automatic downloading from HuggingFace Hub
   - Local caching to avoid repeated downloads (2-4 MB files)
   - Retry logic with exponential backoff (2s, 4s, 8s)
   - Graceful fallback if downloads fail
   - Atomic file operations for safety (see the sketch after this list)

2. **TokenizationOptions** (`Options/TokenizationOptions.cs`)
   - Configuration for auto-download behavior
   - Customizable cache directory
   - Timeout and retry settings
   - Fallback behavior control
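
A minimal sketch of the loader and options shapes described above, assuming a Polly v7-style retry policy and the configuration keys shown later in this message; the names and signatures are illustrative, not the actual ConduitLLM code:

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;

// Property names mirror the "ConduitLLM:Tokenization" section shown below.
public sealed class TokenizationOptions
{
    public bool AutoDownloadTokenizers { get; set; } = true;
    public string? CacheDirectory { get; set; }
    public int DownloadTimeoutMs { get; set; } = 30_000;
    public int RetryAttempts { get; set; } = 3;
    public bool FallbackOnDownloadFailure { get; set; } = true;
}

public sealed class TokenizerModelLoader
{
    private readonly HttpClient _http;
    private readonly TokenizationOptions _options;

    public TokenizerModelLoader(HttpClient http, TokenizationOptions options)
    {
        _http = http;
        _options = options;
        _http.Timeout = TimeSpan.FromMilliseconds(options.DownloadTimeoutMs);
    }

    // Returns the local path of a cached tokenizer file, downloading it at most once.
    // Returns null when the caller should fall back to character-based estimation.
    public async Task<string?> GetOrDownloadAsync(string fileName, Uri downloadUrl)
    {
        var cacheDir = _options.CacheDirectory ?? Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
            "Conduit", "tokenizers"); // per-platform default shown here is an assumption
        Directory.CreateDirectory(cacheDir);

        var finalPath = Path.Combine(cacheDir, fileName);
        if (File.Exists(finalPath))
            return finalPath;                       // cache hit, no network access
        if (!_options.AutoDownloadTokenizers)
            return null;                            // auto-download disabled

        // Exponential backoff: 2s, 4s, 8s for the default three attempts.
        var retry = Policy
            .Handle<HttpRequestException>()
            .Or<TaskCanceledException>()
            .WaitAndRetryAsync(_options.RetryAttempts,
                               attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));
        try
        {
            var bytes = await retry.ExecuteAsync(() => _http.GetByteArrayAsync(downloadUrl));

            // Atomic write: download to a temp file, then move into place.
            var tempPath = finalPath + ".tmp";
            await File.WriteAllBytesAsync(tempPath, bytes);
            File.Move(tempPath, finalPath, overwrite: true);
            return finalPath;
        }
        catch (Exception) when (_options.FallbackOnDownloadFailure)
        {
            // The real loader logs this; callers fall back to estimation.
            return null;
        }
    }
}
```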

### Enhanced Components

3. **LlamaTokenCounter** (Updated)
   - Now uses TokenizerModelLoader for actual tokenization
   - Supports LLaMA 2 (~500 KB) and LLaMA 3/3.1 (~2.18 MB)
   - Falls back gracefully if download fails
   - Caches loaded tokenizers in memory for performance

4. **ServiceCollectionExtensions** (Updated)
   - Registers TokenizationOptions
   - Registers TokenizerModelLoader as singleton
   - Properly wires up dependency injection
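
The in-memory caching in the updated LlamaTokenCounter might look roughly like this, reusing the loader sketch above. The `LlamaTokenizer.Create(Stream)` call is the assumed Microsoft.ML.Tokenizers entry point for the SentencePiece-style LLaMA 2 file (LLaMA 3's tiktoken-format file would need a different factory), and the file/URL parameters are stand-ins for mapping logic not shown here:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;
using Microsoft.ML.Tokenizers;

public sealed class LlamaTokenCounter
{
    private readonly TokenizerModelLoader _loader;
    private readonly ConcurrentDictionary<string, Tokenizer> _tokenizers = new();

    public LlamaTokenCounter(TokenizerModelLoader loader) => _loader = loader;

    public async Task<int> CountTokensAsync(string text, string tokenizerFile, Uri tokenizerUrl)
    {
        var path = await _loader.GetOrDownloadAsync(tokenizerFile, tokenizerUrl);
        if (path is null)
        {
            // Download disabled or failed: character-based fallback (4:1 ratio, +10% buffer).
            return (int)Math.Ceiling(text.Length / 4.0 * 1.10);
        }

        // Load each tokenizer file once and keep it in memory.
        var tokenizer = _tokenizers.GetOrAdd(path, p =>
        {
            using var stream = File.OpenRead(p);
            return LlamaTokenizer.Create(stream); // assumed API; see Microsoft.ML.Tokenizers docs
        });

        return tokenizer.CountTokens(text);
    }
}
```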

### Documentation Updates

- Updated `docs/claude/multi-provider-tokenization.md` with:
  - Auto-download feature description
  - Configuration examples
  - Environment variable overrides
  - File sizes and cache locations

## Features

### Automatic Download
- Downloads tokenizer files from HuggingFace on first use
- Files cached in `~/.local/share/Conduit/tokenizers` (Linux) or `%APPDATA%/Conduit/tokenizers` (Windows)
- Configurable cache location (useful for Docker volumes)

### Intelligent Caching
- Downloads only once per tokenizer version
- Checks local cache before downloading
- Atomic file operations prevent corruption
- Minimal disk usage (~5 MB for all LLaMA versions)

### Production Safety
- Retry logic with exponential backoff
- Graceful fallback to character-based estimation
- Configurable timeout (default 30 seconds)
- No crashes if HuggingFace is unavailable
- All failures logged with helpful messages

### Configuration

Default behavior (appsettings.json):
```json
{
  "ConduitLLM": {
    "Tokenization": {
      "AutoDownloadTokenizers": true,
      "CacheDirectory": null,
      "DownloadTimeoutMs": 30000,
      "RetryAttempts": 3,
      "FallbackOnDownloadFailure": true
    }
  }
}
```

Environment variable overrides:
```bash
CONDUITLLM__TOKENIZATION__AUTODOWNLOADTOKENIZERS=false
CONDUITLLM__TOKENIZATION__CACHEDIRECTORY=/app/tokenizers
```

## Impact

### User Experience
- **Zero manual setup** - Tokenizers download automatically
- **Works offline** after first download
- **Fast** - 2-4 second download, then instant from cache
- **Reliable** - Graceful fallback if download fails

### Production Considerations
- **Small bandwidth cost** - 2-4 MB once per tokenizer
- **Configurable** - Can disable for air-gapped environments
- **Safe** - Won't crash production if HuggingFace is down
- **Efficient** - Singleton loader, cached tokenizers

### Accuracy Improvement
- LLaMA models now use **native tokenization** when possible
- Falls back to improved character-based estimation (4:1 ratio)
- Better than previous fixed-ratio approach for all models

## Implementation Details

### File Sizes (from HuggingFace)
- LLaMA 2: ~500 KB (SentencePiece format)
- LLaMA 3: ~2.18 MB (Tiktoken/BPE format)
- LLaMA 3.1: ~2.18 MB (Tiktoken/BPE format)
- Total: ~5 MB for all versions

### Download URLs
- LLaMA 2: meta-llama/Llama-2-7b-hf
- LLaMA 3: meta-llama/Meta-Llama-3-8B
- LLaMA 3.1: meta-llama/Meta-Llama-3.1-8B

### Error Handling
1. Network failures → Retry with backoff
2. Timeout → Use fallback estimation
3. Invalid file → Use fallback estimation
4. HuggingFace down → Use fallback estimation

All errors are logged for monitoring.

## Testing Notes

### Manual Testing
```bash
# First request downloads tokenizer
curl -X POST http://localhost:5000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3-8b", "messages": [{"role": "user", "content": "test"}]}'

# Check cache directory
ls ~/.local/share/Conduit/tokenizers/
# Should see: llama3.model (2.18 MB)

# Second request uses cached file (instant)
# Check logs for "Using cached tokenizer"
```

### Configuration Testing
```bash
# Disable auto-download
export CONDUITLLM__TOKENIZATION__AUTODOWNLOADTOKENIZERS=false
# Should use fallback estimation and log warning

# Custom cache directory
export CONDUITLLM__TOKENIZATION__CACHEDIRECTORY=/tmp/tokenizers
# Should download to /tmp/tokenizers
```

## Migration Notes

- **No breaking changes** - Auto-download enabled by default
- **Backward compatible** - Falls back gracefully if disabled
- **Optional** - Can disable via configuration
- **Safe** - Won't affect existing OpenAI tokenization

## Related

- Completes Phase 2 of issue #857
- Builds on TokenCounterFactory from previous commit
- Prepares for Phase 4/5 (Claude, Gemini native tokenizers)
- Uses Polly for retry logic (already in dependencies)

Co-authored-by: Claude (Anthropic AI Assistant)