Benchmarks & Performance

Performance data for Network-AI v5.0 deployments. Your swarm is only as fast as the backend it calls — this page helps you choose the right setup.

BlackboardValidator Throughput

Layer 1 validation (rule-based, zero LLM calls) measured on Node.js 20, Apple M2, single-thread:

| Input size | Ops/sec | Latency |
| --- | --- | --- |
| Small entry (~100 chars) | ~1,000,000 | < 1 µs |
| Medium entry (~1 KB) | ~500,000 | ~2 µs |
| Large entry (~10 KB) | ~159,000 | ~6 µs |

Layer 2 (QualityGateAgent) adds LLM latency and is async — intended for high-value writes, not every write.
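These numbers are easy to sanity-check with a tiny single-thread harness. The sketch below assumes a synchronous validate(entry) call on BlackboardValidator; the exact method name and entry shape may differ in your installed version.

```ts
import { BlackboardValidator } from 'network-ai';

// Assumed API: a synchronous, rule-based validate(entry); adjust to the real surface.
const validator = new BlackboardValidator();
const entry = { agent: 'reviewer', content: 'x'.repeat(1_000) };   // ~1 KB payload

const iterations = 100_000;
const start = process.hrtime.bigint();
for (let i = 0; i < iterations; i++) {
  validator.validate(entry);
}
const elapsedNs = Number(process.hrtime.bigint() - start);

console.log(`ops/sec : ${Math.round(iterations / (elapsedNs / 1e9)).toLocaleString()}`);
console.log(`latency : ${(elapsedNs / iterations / 1_000).toFixed(2)} µs/op`);
```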


Cloud Provider Performance

Not all cloud APIs perform the same. Model size, inference infrastructure, and tier all affect how fast each agent gets a response — and that directly multiplies across every agent in your swarm.

| Provider / Model | Avg response (5-agent swarm) | RPM limit (free/tier-1) | Notes |
| --- | --- | --- | --- |
| OpenAI gpt-5.2 | 6–10s per call | 3–6 RPM | Flagship model, high latency, strict RPM |
| OpenAI gpt-4o-mini | 2–4s per call | 500 RPM | Fast, cheap, good for reviewer agents |
| OpenAI gpt-4o | 4–7s per call | 60–500 RPM | Balanced quality/speed |
| Anthropic Claude 3.5 Haiku | 2–3s per call | 50 RPM | Fastest Claude, great for parallel agents |
| Anthropic Claude 3.7 Sonnet | 4–8s per call | 50 RPM | Stronger reasoning, higher latency |
| Google Gemini 2.0 Flash | 1–3s per call | 15 RPM (free) | Very fast inference, low RPM on free tier |
| Groq (Llama 3.3 70B) | 0.5–2s per call | 30 RPM | Fastest cloud inference available |
| Together AI / Fireworks | 1–3s per call | Varies by plan | Good for parallel workloads |

Key insight: a 5-agent swarm using gpt-4o-mini at 500 RPM can fire all 5 agents truly in parallel and finish in ~4s total. The same swarm on gpt-5.2 at 6 RPM must go sequential: requests have to be spaced roughly 10s apart, so five calls mean ~50s of pacing plus per-call latency, about 60s in all. The model tier matters more than the orchestration framework.
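In code, the difference is just the dispatch pattern (callAgent is a hypothetical stand-in for whatever actually invokes the model; timings in the comments are illustrative):

```ts
// Hypothetical stand-in for a real model call.
const callAgent = async (name: string): Promise<string> => `findings from ${name}`;
const agents = ['security', 'performance', 'style', 'tests', 'docs'];

// High RPM (e.g. 500): all 5 calls in flight at once, wall-clock ≈ one call (~4s).
const parallel = await Promise.all(agents.map((name) => callAgent(name)));

// Low RPM (e.g. 6): one call per ~10s window, so ~50s of pacing plus latency ≈ 60s.
const sequential: string[] = [];
for (const name of agents) {
  sequential.push(await callAgent(name));   // a pacing delay sits between real calls
}
```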

Choosing a Model for Swarm Agents

  • Speed over depth (many agents, real-time) → gpt-4o-mini, claude-3.5-haiku, gemini-2.0-flash, groq/llama-3.3-70b
  • Depth over speed (few agents, high-stakes) → gpt-4o, claude-3.7-sonnet
  • Free / no-cost testing → Groq free tier, Gemini free tier, or Ollama locally
  • Production with budget → multiple keys across providers, route agents to different models (see the routing sketch below)
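A minimal sketch of per-role routing; the map shape and model identifiers are illustrative, not a Network-AI API:

```ts
// Illustrative only: cheap, fast models for the many parallel roles,
// stronger models for the few high-stakes ones, resolved per agent at dispatch time.
const MODEL_BY_ROLE: Record<string, { provider: string; model: string }> = {
  reviewer:    { provider: 'openai',    model: 'gpt-4o-mini' },
  coordinator: { provider: 'anthropic', model: 'claude-3-7-sonnet-latest' },
  fixer:       { provider: 'openai',    model: 'gpt-4o' },
};
```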

Rate Limit Patterns

When you run a 5-agent swarm sharing one API key and hit the RPM ceiling, the API silently returns empty responses — not a 429 error, just blank content. Network-AI's swarm demos handle this automatically with sequential dispatch and adaptive header-based pacing (reads x-ratelimit-reset-requests to wait exactly as long as needed).
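A rough sketch of the pacing idea; paceByHeaders is illustrative rather than the demo's actual code, and it assumes OpenAI-style rate-limit headers on the response:

```ts
// Sleep only when the request budget is exhausted, for as long as the provider says to.
async function paceByHeaders(headers: Headers): Promise<void> {
  const remaining = Number(headers.get('x-ratelimit-remaining-requests') ?? '1');
  if (remaining > 0) return;                                        // budget left, no wait

  const reset = headers.get('x-ratelimit-reset-requests') ?? '1s';  // e.g. "820ms" or "1.2s"
  const ms = reset.endsWith('ms') ? parseFloat(reset) : parseFloat(reset) * 1000;
  // Compound values such as "6m0s" would need fuller parsing than this sketch does.

  await new Promise((resolve) => setTimeout(resolve, ms));
}
```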

| You have | What to expect |
| --- | --- |
| One cloud API key | Sequential dispatch, 40–70s per 5-agent swarm — handled automatically |
| Multiple cloud keys | Near-parallel, 10–15s — one key per adapter instance |
| Local GPU (Ollama, vLLM) | True parallel, 5–20s depending on hardware |
| Home GPU + cloud mix | Local agents never block — cloud agents rate-paced independently |

Multiple Keys = True Parallel

```ts
import OpenAI from 'openai';
import { CustomAdapter, AdapterRegistry } from 'network-ai';

const registry = new AdapterRegistry();

// REVIEWERS and extractContent are defined elsewhere in the demo.
for (const reviewer of REVIEWERS) {
  const adapter = new CustomAdapter();
  // One client per reviewer, each with its own key and therefore its own rate-limit bucket.
  const client = new OpenAI({ apiKey: process.env[`OPENAI_KEY_${reviewer.id.toUpperCase()}`] });

  adapter.registerHandler(reviewer.id, async (payload) => {
    const resp = await client.chat.completions.create({ /* ... */ });
    return { findings: extractContent(resp) };
  });

  registry.register(reviewer.id, adapter);
}

// All 5 dispatch in parallel via Promise.all — ~8–12s instead of ~60s
```
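Because each reviewer has its own adapter and its own client, every key draws on an independent rate-limit bucket; that isolation is what makes the parallel dispatch safe.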

Local GPU = Zero Rate Limits

```ts
import OpenAI from 'openai';

// adapter and extractContent come from the swarm setup shown above.
const localClient = new OpenAI({
  apiKey : 'not-needed',
  baseURL: 'http://localhost:11434/v1',   // Ollama, vLLM, llama.cpp
});

adapter.registerHandler('reviewer', async (payload) => {
  const resp = await localClient.chat.completions.create({
    model   : 'llama3.2',
    messages: [/* ... */],
  });
  return { findings: extractContent(resp) };
});
```
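Assuming Ollama as the local server, pull the model once (ollama pull llama3.2) before the first run; after that, throughput is bounded only by your hardware, not by any RPM quota.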

Cloud GPU Instances (Self-Hosted)

Running your own model on AWS / GCP / Azure sits between managed APIs and local hardware:

| Setup | Speed vs managed API | RPM limit |
| --- | --- | --- |
| A100 (80GB) + vLLM, Llama 3.3 70B | Faster — 0.5–2s/call | None |
| H100 + vLLM, Mixtral 8x7B | Faster — 0.3–1s/call | None |
| T4 / V100 + Ollama, Llama 3.2 8B | Comparable | None |

Cost: $1–5/hr for GPU VMs. For high-volume production swarms or teams that want no external API dependency, it is the fastest architecture available. The connection is identical to local Ollama — just point baseURL at your VM's IP.
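For example, a minimal sketch; the address is a placeholder, and the port assumes vLLM's OpenAI-compatible server default (Ollama uses 11434 instead):

```ts
import OpenAI from 'openai';

// Placeholder address: substitute your VM's IP or hostname.
const remoteGpuClient = new OpenAI({
  apiKey : 'not-needed',
  baseURL: 'http://203.0.113.10:8000/v1',   // vLLM default port; use 11434 for Ollama
});
```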


max_completion_tokens — The Silent Truncation Trap

This is one of the most common failure modes in agentic output tasks. When a model hits the max_completion_tokens ceiling, it stops mid-output and returns whatever it has: no error, no warning. The API call succeeds, but with finish_reason: "length" instead of "stop".

This is especially dangerous for code-rewrite agents where the output is a full file.

```
# Real numbers (gpt-5-mini, order-service.ts rewrite):
  Blockers section:  ~120 tokens
  Fixed code:        ~2,800 tokens  (213 lines with // FIX: comments)
  Total needed:      ~3,000 tokens  ← hits the cap exactly → empty output
  Fix: set to 16,000 → full rewrite delivered in one shot
```
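A minimal guard, sketched with illustrative model and cap values; the finish_reason check is the part that matters:

```ts
import OpenAI from 'openai';

const client = new OpenAI();   // reads OPENAI_API_KEY from the environment
const messages = [{ role: 'user' as const, content: 'Rewrite order-service.ts with the fixes applied.' }];

const resp = await client.chat.completions.create({
  model: 'gpt-5-mini',
  messages,
  max_completion_tokens: 16_000,   // generous cap for a full-file rewrite
});

// "length" means the model hit the cap: the output is truncated, not complete.
if (resp.choices[0].finish_reason === 'length') {
  throw new Error('Output truncated: raise max_completion_tokens and retry');
}
```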

Rule of Thumb by Task

| Task | Recommended cap (tokens) |
| --- | --- |
| Short classification / sentiment | 200–500 |
| Code review findings (one reviewer) | 400–800 |
| Blocker summary (coordinator) | 500–1,000 |
| Full file rewrite (≤300 lines) | 12,000–16,000 |
| Full file rewrite (≤1,000 lines) | 32,000–64,000 |
| Document / design revision | 16,000–32,000 |

All GPT-5 variants support 128,000 max output tokens — the ceiling is never the model, it is always the cap you set.

Lessons from Building the Code-Review Swarm

| Issue | Root cause | Fix |
| --- | --- | --- |
| Fixed code output was empty | max_completion_tokens: 3000 too low | Raise to 16,000+ for any code-output agent |
| finish_reason: "length" went unnoticed | Model hits the cap and returns a partial response with no error | Always check choices[0].finish_reason and alert on "length" |
| Flagship model slow + expensive for reviewers | High latency + $14/1M output tokens | Use gpt-5-mini ($2/1M, same RPM) for reviewer/fixer agents |
| Coordinator + fixer as two calls | Second call hits the rate-limit window, +60s | Merge into one structured two-section call |