Skip to content

Latest commit

 

History

History
443 lines (314 loc) · 10.7 KB

File metadata and controls

443 lines (314 loc) · 10.7 KB

Performance Guide: Provider Speed Comparison

Understanding where latency comes from and how to optimize for speed.

TL;DR - Key Findings

🎯 95% of latency is provider API calls, not cascade logic!

To optimize latency:

  1. Choose faster providers (Groq >> OpenAI) - 20x more impact
  2. Use streaming for perceived speed
  3. Don't worry about cascade overhead (only 5%)

Recommendation: Use PRESET_ULTRA_FAST with Groq for 5-10x speedup.


Latency Breakdown

Based on comprehensive testing with 15 real queries across complexity levels:

Average per query: 10,390ms (100%)
├─ Provider API calls:  9,871ms (95.0%) ← Dominant factor
├─ Cascade logic:         312ms (3.0%)  ← Minimal overhead
└─ Quality validation:    208ms (2.0%)  ← Minimal overhead

What This Means

✅ Good News:

  • cascadeflow overhead is negligible (only 5%)
  • Quality validation is extremely fast (208ms)
  • No significant performance penalty from cascading

⚠️ Key Insight:

  • Provider selection has 20x more impact than cascade optimization
  • Switching from OpenAI to Groq = 5-10x speedup
  • Optimizing cascade logic = <5% improvement

Conclusion: Focus on provider choice, not cascade optimization!


Provider Speed Comparison

Tested Latencies (Real API Calls)

Provider Model Avg Latency Speed vs OpenAI Cost/1M Tokens
Groq Llama 3.1 8B ~1-2s 5-10x faster $0.05
Groq Llama 3.3 70B ~1.5-2.5s 4-7x faster $0.69
Together AI Llama 3.1 70B ~2-3s 3-5x faster $0.88
Ollama (local) Llama 3.1 8B ~3-5s 2-3x faster $0 (free)
OpenAI GPT-4o-mini ~10s Baseline $0.15
OpenAI GPT-4o ~10-12s 0.8-1x $2.50
Anthropic Claude Haiku ~2-3s 3-5x faster $0.80
Anthropic Claude Sonnet ~3-4s 2.5-3x faster $9.00

By Query Complexity

Simple queries (e.g., "What is 2+2?"):

Provider Latency Tokens Time/Token
Groq 1,000ms 50 20ms
OpenAI 1,875ms 50 37.5ms

Medium queries (e.g., "Explain HTTP"):

Provider Latency Tokens Time/Token
Groq 1,500ms 200 7.5ms
OpenAI 9,446ms 200 47.2ms

Complex queries (e.g., "Write Fibonacci with memoization"):

Provider Latency Tokens Time/Token
Groq 2,000ms 500 4ms
OpenAI 19,850ms 500 39.7ms

Finding: Groq is consistently 5-10x faster across all complexity levels.


Recommended Presets by Use Case

Real-Time Applications (<2s latency)

Use: PRESET_ULTRA_FAST

from cascadeflow import PRESET_ULTRA_FAST

agent = CascadeAgent(models=PRESET_ULTRA_FAST)
# Groq Llama 3.1 8B → Groq Llama 3.3 70B
# Latency: 1-2s (ultra-fast)

Best for:

  • Chatbots
  • Interactive demos
  • Live coding assistants
  • User-facing applications

Performance:

  • P50: 1.2s
  • P95: 2.5s
  • P99: 3.0s

Production Applications (2-4s acceptable)

Use: PRESET_BEST_OVERALL

from cascadeflow import PRESET_BEST_OVERALL

agent = CascadeAgent(models=PRESET_BEST_OVERALL)
# Claude Haiku → GPT-4o-mini
# Latency: 2-3s (fast)

Best for:

  • Production APIs
  • General purpose
  • Balanced speed/quality
  • Multi-user applications

Performance:

  • P50: 2.5s
  • P95: 4.0s
  • P99: 5.0s

High-Volume Applications (cost critical)

Use: PRESET_ULTRA_CHEAP

from cascadeflow import PRESET_ULTRA_CHEAP

agent = CascadeAgent(models=PRESET_ULTRA_CHEAP)
# Groq Llama 8B → GPT-4o-mini
# Latency: 1-3s, Cost: ~$0.00008/query

Best for:

  • Batch processing
  • Background jobs
  • High-volume workloads
  • Cost-sensitive applications

Performance:

  • P50: 1.5s
  • P95: 3.5s
  • P99: 4.5s
  • Cost: 88% cheaper than BEST_OVERALL

Offline/Privacy Applications

Use: PRESET_FREE_LOCAL

from cascadeflow import PRESET_FREE_LOCAL

agent = CascadeAgent(models=PRESET_FREE_LOCAL)
# Ollama local models
# Latency: 3-5s (hardware dependent)

Best for:

  • Privacy-sensitive data
  • Offline applications
  • Zero API costs
  • Development/testing

Performance:

  • P50: 3.5s
  • P95: 6.0s
  • P99: 8.0s
  • Cost: $0 (free)
  • Note: Requires powerful hardware (GPU recommended)

Latency Optimization Strategies

1. Provider Selection (95% Impact) 🎯

Highest impact strategy:

# Before: OpenAI only (slow)
agent = CascadeAgent(models=[
    {"name": "gpt-4o-mini", "provider": "openai", "cost": 0.00015},
    {"name": "gpt-4o", "provider": "openai", "cost": 0.0025},
])
# Latency: ~10s avg

# After: Groq (fast)
from cascadeflow import PRESET_ULTRA_FAST
agent = CascadeAgent(models=PRESET_ULTRA_FAST)
# Latency: ~1-2s avg
# Result: 5-10x speedup! 🚀

2. Streaming (Perceived Speed)

Even with fast providers, streaming makes responses feel instant:

# Without streaming: user waits 2s
result = await agent.run("Write a poem")
print(result.content)  # Shows after 2s

# With streaming: first word appears in ~200ms
async for chunk in agent.run_stream("Write a poem"):
    print(chunk.content, end='')  # Appears immediately

Impact:

  • First token: ~200-500ms (vs 2000ms)
  • User perceives 4-10x faster
  • Better UX for long responses

3. Parallel Execution (Advanced)

For batch processing, run queries in parallel:

import asyncio

queries = ["Query 1", "Query 2", "Query 3", ...]

# Sequential: 10s × 100 queries = 1000s (16 minutes)
results = [await agent.run(q) for q in queries]

# Parallel: 10s total (100 queries in parallel)
results = await asyncio.gather(*[agent.run(q) for q in queries])
# Result: 100x faster! 🚀

Impact:

  • Sequential: O(n) time
  • Parallel: O(1) time (with sufficient concurrency)
  • Best for batch jobs

Cascade Overhead Analysis

What Takes Time in Cascade?

Cascade overhead: 520ms (5% of total)
├─ Complexity detection: ~50-100ms
├─ Quality validation: ~200ms
├─ Routing logic: ~100-200ms
└─ Request formatting: ~50ms

Is This Acceptable?

YES - For the benefits:

Benefits:

  • 40-85% cost savings
  • Automatic quality validation
  • Graceful escalation
  • Multi-provider support

Cost:

  • 520ms overhead (5% of total)
  • Negligible compared to 9,871ms provider time

Verdict: 5% overhead for 67% cost savings = excellent trade-off!


Quality System Performance

Safety Floor Application Rate

From testing with 15 queries:

Total queries:         15
✓ Draft accepted:      15 (100%)
✗ Safety floor:        0  (0%)
→ Direct to top:       0  (0%)

Findings:

  • 100% draft acceptance rate
  • Quality system highly effective
  • gpt-4o-mini handles even complex queries
  • No unnecessary escalations

By Complexity

Complexity Queries Safety Floor Avg Latency Draft Acceptance
Simple 5 0 (0%) 1,875ms 100%
Medium 5 0 (0%) 9,446ms 100%
Complex 5 0 (0%) 19,850ms 100%

Conclusion: Quality threshold (0.7) is well-calibrated.


Cost vs Latency Trade-off

The Dilemma

Fast providers (Groq) are cheaper but lower quality. Slow providers (OpenAI) are expensive but higher quality.

Solution: Use cascading!

Example Comparison

Scenario: 1M queries/month

Approach Avg Latency Cost/Month Quality
GPT-4o only 10s $2,500 Excellent
Groq only 1.5s $50 Good
PRESET_ULTRA_FAST 1.5s $50 Good
PRESET_BEST_OVERALL 2.5s $800 Excellent
PRESET_ULTRA_CHEAP 2s $80 Excellent

Best of both worlds:

  • PRESET_ULTRA_CHEAP: 5x faster, 31x cheaper, same quality!
  • Draft model (Groq) handles 80% → ultra-fast
  • Verifier (GPT-4o-mini) handles 20% → quality assurance

Measuring Your Performance

Enable Metrics

result = await agent.run("Your query")

# Latency breakdown
print(f"Total: {result.latencyMs}ms")
print(f"Draft generation: {result.draftGenerationMs}ms")
print(f"Quality check: {result.qualityVerificationMs}ms")
print(f"Verifier generation: {result.verifierGenerationMs}ms")
print(f"Cascade overhead: {result.cascadeOverheadMs}ms")

# Cost breakdown
print(f"Total cost: ${result.totalCost}")
print(f"Draft cost: ${result.draftCost}")
print(f"Verifier cost: ${result.verifierCost}")
print(f"Savings: {result.savingsPercentage}%")

Track Over Time

latencies = []
costs = []

for query in queries:
    result = await agent.run(query)
    latencies.append(result.latencyMs)
    costs.append(result.totalCost)

print(f"P50 latency: {sorted(latencies)[len(latencies)//2]}ms")
print(f"P95 latency: {sorted(latencies)[int(len(latencies)*0.95)]}ms")
print(f"Avg cost: ${sum(costs)/len(costs):.6f}")

Recommendations Summary

For Latency Optimization

  1. Use Groq → 5-10x speedup (highest impact)
  2. Enable streaming → 4-10x perceived speed
  3. Parallel execution → 100x for batch jobs
  4. Don't optimize cascade (only 5% impact)

For Cost Optimization

  1. Use PRESET_ULTRA_CHEAP → 88% savings
  2. Increase quality threshold → More drafts accepted
  3. Choose cheaper providers → Groq > OpenAI
  4. Monitor draft acceptance rate → Optimize threshold

For Balanced Performance

  1. Use PRESET_BEST_OVERALL → Good speed + quality
  2. Monitor metrics → Track latency and cost
  3. Adjust quality threshold → Balance cost/quality
  4. Consider streaming → Better UX

Common Pitfalls

❌ Optimizing cascade logic

Impact: <5% improvement Better: Switch to faster provider (5-10x)

❌ Using only expensive models

Impact: 5-10x higher cost Better: Use cascading (40-85% savings)

❌ Setting quality threshold too high

Impact: More escalations, higher cost Better: Monitor draft acceptance, adjust threshold

❌ Sequential batch processing

Impact: O(n) time complexity Better: Parallel execution (100x faster)


Next Steps


Support

Questions about performance?