Problem
As voice and audio models become more prevalent, the system needs to support audio-specific token types and billing. The current system tracks only text tokens, so it misses audio input and output tokens, which are priced differently from text.
Current State
- No audio_tokens field in Usage model
- No extraction for audio-specific usage data
- No cost fields for audio token rates
- Audio transcription/speech tracked as regular tokens (if at all)
Future Audio Models to Support
OpenAI Audio
- Whisper (transcription): Charges per minute of audio
- TTS (text-to-speech): Charges per character generated
- GPT-4o Audio (future): Native audio in/out with specific token rates
Other Providers
- ElevenLabs: Per character or per minute pricing
- Anthropic Claude Audio (future): Expected audio token support
- Google Gemini Audio: Already supports audio with different rates
Technical Requirements
1. Update Usage Model
// ConduitLLM.Core/Models/Usage.cs
/// <summary>
/// Number of audio input tokens (for models processing audio).
/// </summary>
[JsonPropertyName("audio_input_tokens")]
[JsonIgnore(Condition = JsonIgnoreCondition.WhenWritingNull)]
public int? AudioInputTokens { get; set; }
/// <summary>
/// Number of audio output tokens (for models generating audio).
/// </summary>
[JsonPropertyName("audio_output_tokens")]
[JsonIgnore(Condition = JsonIgnoreCondition.WhenWritingNull)]
public int? AudioOutputTokens { get; set; }
/// <summary>
/// Duration of audio processed/generated in seconds.
/// </summary>
[JsonPropertyName("audio_duration_seconds")]
[JsonIgnore(Condition = JsonIgnoreCondition.WhenWritingNull)]
public double? AudioDurationSeconds { get; set; }
/// <summary>
/// Number of characters for TTS generation.
/// </summary>
[JsonPropertyName("tts_characters")]
[JsonIgnore(Condition = JsonIgnoreCondition.WhenWritingNull)]
public int? TtsCharacters { get; set; }
2. Update UsageExtractor
// Handle OpenAI Whisper/TTS format
if (usageElement.TryGetProperty("audio_seconds", out var audioSeconds))
    usage.AudioDurationSeconds = audioSeconds.GetDouble();
// Handle audio token formats
if (usageElement.TryGetProperty("audio_input_tokens", out var audioInput))
    usage.AudioInputTokens = audioInput.GetInt32();
if (usageElement.TryGetProperty("audio_output_tokens", out var audioOutput))
    usage.AudioOutputTokens = audioOutput.GetInt32();
3. Update ModelCost Entity
/// <summary>
/// Cost per million audio input tokens.
/// </summary>
[Column(TypeName = "decimal(18, 10)")]
public decimal? AudioInputCostPerMillionTokens { get; set; }
/// <summary>
/// Cost per million audio output tokens.
/// </summary>
[Column(TypeName = "decimal(18, 10)")]
public decimal? AudioOutputCostPerMillionTokens { get; set; }
/// <summary>
/// Cost per minute of audio (Whisper-style pricing).
/// </summary>
[Column(TypeName = "decimal(18, 10)")]
public decimal? AudioCostPerMinute { get; set; }
/// <summary>
/// Cost per 1000 characters (TTS-style pricing).
/// </summary>
[Column(TypeName = "decimal(18, 10)")]
public decimal? TtsCostPerThousandCharacters { get; set; }
4. Update Cost Calculation
Handle different audio pricing models:
- Per token (GPT-4o audio)
- Per minute (Whisper)
- Per character (TTS)
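A minimal sketch of how the three pricing models above could be combined, assuming the nullable fields proposed for Usage and ModelCost in this issue. The `AudioUsage`, `AudioModelCost`, and `AudioCostCalculator` names are hypothetical stand-ins, not existing ConduitLLM types, and summing every component that has both a usage value and a rate is one possible design, not a confirmed requirement.

```csharp
using System;

// Hypothetical, trimmed-down stand-ins for the Usage and ModelCost types proposed in this issue.
public sealed record AudioUsage(
    int? AudioInputTokens = null,
    int? AudioOutputTokens = null,
    double? AudioDurationSeconds = null,
    int? TtsCharacters = null);

public sealed record AudioModelCost(
    decimal? AudioInputCostPerMillionTokens = null,
    decimal? AudioOutputCostPerMillionTokens = null,
    decimal? AudioCostPerMinute = null,
    decimal? TtsCostPerThousandCharacters = null);

public static class AudioCostCalculator
{
    // Adds up every audio component that has both a usage value and a configured rate.
    // Components with a missing value or a missing rate contribute nothing.
    public static decimal Calculate(AudioUsage usage, AudioModelCost cost)
    {
        decimal total = 0m;

        // Per token: rates are expressed per million tokens.
        if (usage.AudioInputTokens is int inputTokens && cost.AudioInputCostPerMillionTokens is decimal inputRate)
            total += inputTokens * inputRate / 1_000_000m;
        if (usage.AudioOutputTokens is int outputTokens && cost.AudioOutputCostPerMillionTokens is decimal outputRate)
            total += outputTokens * outputRate / 1_000_000m;

        // Per minute: usage is recorded in seconds, so convert to minutes first.
        if (usage.AudioDurationSeconds is double seconds && cost.AudioCostPerMinute is decimal perMinute)
            total += (decimal)seconds / 60m * perMinute;

        // Per character: rates are expressed per 1000 characters.
        if (usage.TtsCharacters is int characters && cost.TtsCostPerThousandCharacters is decimal perThousand)
            total += characters * perThousand / 1000m;

        return total;
    }
}
```

Using `decimal` for the arithmetic matches the `decimal(18, 10)` columns on ModelCost and avoids binary floating-point rounding in billed amounts. If a single response should be billed under exactly one pricing model, the same method could switch on the PricingModel enum instead of summing.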
5. Update PricingModel Enum
public enum PricingModel
{
    Standard = 1,
    // ... existing ...
    AudioPerMinute = 10,
    AudioPerCharacter = 11,
    AudioTokenBased = 12
}
Example Pricing
OpenAI Whisper
- $0.006 per minute of audio
OpenAI TTS
- TTS: $15 per 1M characters
- TTS HD: $30 per 1M characters
Future GPT-4o Audio (speculative)
- Audio input: Different rate than text input
- Audio output: Different rate than text output
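As a rough illustration, the listed prices could map onto the proposed ModelCost fields as follows. The `ExampleAudioPricing` class is hypothetical and only shows the unit conversion: TTS prices are quoted per 1M characters here, while the proposed `TtsCostPerThousandCharacters` field stores a per-1000-character rate.

```csharp
// Hypothetical constants derived from the example prices in this issue.
public static class ExampleAudioPricing
{
    // Whisper: $0.006 per minute of audio (maps to AudioCostPerMinute).
    public const decimal WhisperCostPerMinute = 0.006m;

    // TTS: $15 per 1M characters = $0.015 per 1000 characters.
    public const decimal TtsCostPerThousandCharacters = 15m / 1000m;

    // TTS HD: $30 per 1M characters = $0.030 per 1000 characters.
    public const decimal TtsHdCostPerThousandCharacters = 30m / 1000m;
}
```

Under these rates, a 90-second Whisper transcription would cost 1.5 × $0.006 = $0.009, and a 2,500-character TTS request would cost 2.5 × $0.015 = $0.0375.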
Impact
- Future Revenue: Will miss audio billing when these models are added
- Affected Models: Whisper, TTS, future multimodal models with audio
- Severity: Low (future need, not current)
Testing Requirements
- Unit tests for audio token extraction
- Cost calculation tests for different audio pricing models
- Integration tests with mock audio API responses
- Refund calculation tests including audio tokens
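The extraction tests could be driven by small mock payloads along these lines. This is a sketch only: `AudioUsageExtractionSketch` is a hypothetical helper, not the real UsageExtractor, and the JSON field names are the ones assumed in section 2 rather than a confirmed provider format. Unlike the snippet in section 2, it checks the JSON value kind before reading, so a malformed field yields null instead of an exception.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper mirroring the extraction logic proposed in section 2.
public static class AudioUsageExtractionSketch
{
    public static (int? AudioInputTokens, int? AudioOutputTokens, double? AudioDurationSeconds) Extract(string json)
    {
        using var doc = JsonDocument.Parse(json);
        var root = doc.RootElement;

        // No usage object means there is nothing to extract.
        if (root.ValueKind != JsonValueKind.Object
            || !root.TryGetProperty("usage", out var usage)
            || usage.ValueKind != JsonValueKind.Object)
        {
            return (null, null, null);
        }

        return (ReadInt(usage, "audio_input_tokens"),
                ReadInt(usage, "audio_output_tokens"),
                ReadDouble(usage, "audio_seconds"));
    }

    // Returns the value only when the property exists and is a 32-bit integer.
    private static int? ReadInt(JsonElement parent, string name) =>
        parent.TryGetProperty(name, out var value)
        && value.ValueKind == JsonValueKind.Number
        && value.TryGetInt32(out var number)
            ? number
            : (int?)null;

    // Returns the value only when the property exists and is numeric.
    private static double? ReadDouble(JsonElement parent, string name) =>
        parent.TryGetProperty(name, out var value)
        && value.ValueKind == JsonValueKind.Number
        && value.TryGetDouble(out var number)
            ? number
            : (double?)null;
}
```

A test can then assert on both the populated case and the fallback case, for example a response whose usage object carries only text token counts, or one where `audio_seconds` arrives as a string.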
Priority
Low - This is future-proofing for when audio models are added to the system. It is not causing current revenue loss.