
[](https://github.com/jovanSAPFIONEER/Network-AI/actions/workflows/ci.yml)
[](https://github.com/jovanSAPFIONEER/Network-AI/actions/workflows/codeql.yml)
[](https://github.com/jovanSAPFIONEER/Network-AI/releases)
[](https://www.npmjs.com/package/network-ai)
[](https://clawhub.ai/skills/network-ai)
[](https://nodejs.org)

See [references/adapter-system.md](references/adapter-system.md) for the full adapter architecture guide.

## API Architecture & Performance

**Your swarm is only as fast as the backend it calls into.**

Network-AI is backend-agnostic — each agent in a swarm can call its own backend: one cloud API, a different cloud API, or a local GPU model. That choice has a direct and significant impact on speed, parallelism, and reliability.

### Why It Matters

When you run a 5-agent swarm, Network-AI can dispatch all 5 calls simultaneously. Whether those calls actually execute in parallel depends entirely on what's behind each agent:

| Backend | Parallelism | Typical 5-agent swarm | Notes |
|---|---|---|---|
| **Single cloud API key** (OpenAI, Anthropic, etc.) | Rate-limited | 40–70s sequential | RPM limits force sequential dispatch + retry waits |
| **Multiple API keys / providers** | True parallel | 8–15s | Each agent hits a different key or provider |
| **Local GPU** (Ollama, llama.cpp, vLLM) | True parallel | 5–20s depending on hardware | No RPM limit — all 5 agents fire simultaneously |
| **Mixed** (some cloud, some local) | Partial | Varies | Local agents never block; cloud agents rate-paced |
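
The parallel vs. sequential difference above comes down to dispatch strategy. A minimal sketch (illustrative helper names, not Network-AI's actual API):

```typescript
// Illustrative sketch, not Network-AI's actual API: the same agent calls
// dispatched in parallel vs. sequentially with an RPM pacing gap.
async function dispatchParallel<T>(tasks: Array<() => Promise<T>>): Promise<T[]> {
  return Promise.all(tasks.map((t) => t())); // all agents fire at once
}

async function dispatchSequential<T>(
  tasks: Array<() => Promise<T>>,
  gapMs = 1000, // pacing gap so a shared key stays under its RPM limit
): Promise<T[]> {
  const results: T[] = [];
  for (const t of tasks) {
    results.push(await t()); // one agent at a time
    await new Promise((resolve) => setTimeout(resolve, gapMs));
  }
  return results;
}
```

With 5 calls at ~7s each, the parallel path finishes in roughly one call's latency while the sequential path pays the sum of all five plus the pacing gaps.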

### The Single-Key Rate Limit Problem

Cloud APIs enforce **Requests Per Minute (RPM)** limits per API key. When you run 5 agents sharing one key and hit the ceiling, the API silently returns empty responses — not a 429 error, just blank content. Network-AI's swarm demos handle this automatically with **sequential dispatch** (one agent at a time) and **adaptive header-based pacing** that reads the `x-ratelimit-reset-requests` header to wait exactly as long as needed before the next call.

```
Single key (gpt-5.2, 6 RPM limit):
  Agent 1 ──call──▶ response (7s)
  wait 1s
  Agent 2 ──call──▶ response (7s)
  wait 1s
  ... (sequential)
  Total: ~60s for 5 agents + coordinator
```
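
That header-based pacing can be sketched as a small duration parser plus a wait. This is a sketch assuming `x-ratelimit-reset-requests` carries OpenAI-style duration strings such as `850ms`, `1.2s`, or `6m0s`; `parseResetMs` is an illustrative name, not part of Network-AI:

```typescript
// Sketch of adaptive pacing. Assumes the x-ratelimit-reset-requests header
// carries a duration string such as "850ms", "1.2s", or "6m0s".
function parseResetMs(header: string): number {
  let totalMs = 0;
  const re = /(\d+(?:\.\d+)?)(ms|s|m|h)/g;
  let match: RegExpExecArray | null;
  while ((match = re.exec(header)) !== null) {
    const n = parseFloat(match[1]);
    totalMs +=
      match[2] === 'ms' ? n :
      match[2] === 's'  ? n * 1000 :
      match[2] === 'm'  ? n * 60_000 : n * 3_600_000;
  }
  return Math.ceil(totalMs);
}

// Before the next agent call:
//   await new Promise((r) => setTimeout(r, parseResetMs(headerValue)));
```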

### Multiple Keys or Providers = True Parallel

Register each reviewer agent against a different API key or provider, and dispatch fires all 5 simultaneously:

```typescript
import OpenAI from 'openai';
import { CustomAdapter, AdapterRegistry } from 'network-ai';

// Each agent points to a different OpenAI key
const registry = new AdapterRegistry();

for (const reviewer of REVIEWERS) {
  const adapter = new CustomAdapter();
  const client = new OpenAI({ apiKey: process.env[`OPENAI_KEY_${reviewer.id.toUpperCase()}`] });

  adapter.registerHandler(reviewer.id, async (payload) => {
    const resp = await client.chat.completions.create({ ... });
    return { findings: extractContent(resp) };
  });

  registry.register(reviewer.id, adapter);
}

// Now all 5 dispatch in parallel via Promise.all
// Total: ~8-12s instead of ~60s
```

### Local GPU = Zero Rate Limits

Run Ollama or any OpenAI-compatible local server and drop it in as a backend. With no RPM ceiling, every agent fires as soon as it's dispatched, without waiting for the previous one to finish — true parallelism for free:

```typescript
import OpenAI from 'openai';

// Point any agent at a local Ollama or vLLM server
const localClient = new OpenAI({
  apiKey : 'not-needed',
  baseURL: 'http://localhost:11434/v1',
});

adapter.registerHandler('sec_review', async (payload) => {
  const resp = await localClient.chat.completions.create({
    model   : 'llama3.2', // or mistral, deepseek-r1, codellama, etc.
    messages: [...],
  });
  return { findings: extractContent(resp) };
});
```

### Mixing Cloud and Local

The adapter system makes it trivial to give some agents a cloud backend and others a local one:

```typescript
// Fast local model for lightweight reviewers
registry.register('test_review', localAdapter);
registry.register('arch_review', localAdapter);

// Cloud model for high-stakes reviewers
registry.register('sec_review', cloudAdapter); // GPT-4o / Claude
```

Network-AI's orchestrator, blackboard, and trust model stay identical regardless of what's behind each adapter. The only thing that changes is speed.

### Summary

| You have | What to expect |
|---|---|
| One cloud API key | Sequential dispatch, 40–70s per 5-agent swarm — fully handled automatically |
| Multiple cloud keys | Near-parallel, 10–15s — use one key per adapter instance |
| Local GPU (Ollama, vLLM) | True parallel, 5–20s depending on hardware |
| Home GPU + cloud mix | Local agents never block — cloud agents rate-paced independently |

The framework doesn't get in the way of any of these setups. Connect whatever backend you have and the orchestration layer handles the rest.

### Cloud Provider Performance

Not all cloud APIs perform the same. Model size, inference infrastructure, and account tier all affect how fast each agent gets a response — and that latency multiplies across every agent in your swarm.

| Provider / Model | Avg response (5-agent swarm) | RPM limit (free/tier-1) | Notes |
|---|---|---|---|
| **OpenAI gpt-5.2** | 6–10s per call | 3–6 RPM | Flagship model, high latency, strict RPM |
| **OpenAI gpt-4o-mini** | 2–4s per call | 500 RPM | Fast, cheap, good for reviewer agents |
| **OpenAI gpt-4o** | 4–7s per call | 60–500 RPM | Balanced quality/speed |
| **Anthropic Claude 3.5 Haiku** | 2–3s per call | 50 RPM | Fastest Claude, great for parallel agents |
| **Anthropic Claude 3.7 Sonnet** | 4–8s per call | 50 RPM | Stronger reasoning, higher latency |
| **Google Gemini 2.0 Flash** | 1–3s per call | 15 RPM (free) | Very fast inference, low RPM on free tier |
| **Groq (Llama 3.3 70B)** | 0.5–2s per call | 30 RPM | Among the fastest cloud inference available |
| **Together AI / Fireworks** | 1–3s per call | Varies by plan | Good for parallel workloads, competitive RPM |

**Key insight:** A 5-agent swarm using `gpt-4o-mini` at 500 RPM can fire all 5 agents truly in parallel and finish in ~4s total. The same swarm on `gpt-5.2` at 6 RPM must go sequential and takes ~60s. **The model tier matters more than the orchestration framework.**

#### Choosing a Model for Swarm Agents

- **Speed over depth** (many agents, real-time feedback) → `gpt-4o-mini`, `gpt-5-mini`, `claude-3.5-haiku`, `gemini-2.0-flash`, `groq/llama-3.3-70b`
- **Depth over speed** (fewer agents, high-stakes output) → `gpt-4o`, `claude-3.7-sonnet`, `gpt-5.2`
- **Free / no-cost testing** → Groq free tier, Gemini free tier, or Ollama locally
- **Production swarms with budget** → Multiple keys across providers, route different agents to different models

All of these plug into Network-AI through the `CustomAdapter` by swapping the client's `baseURL` and `model` string — no other code changes needed.
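
As a sketch of that swap, a single lookup table can hold the per-provider settings. The helper and table names are hypothetical; the base URLs are each provider's OpenAI-compatible endpoint:

```typescript
// Hypothetical backend table: switching providers is only a baseURL + model change.
type Backend = { baseURL: string; model: string };

const BACKENDS: Record<string, Backend> = {
  'openai-mini': { baseURL: 'https://api.openai.com/v1',      model: 'gpt-4o-mini' },
  'groq':        { baseURL: 'https://api.groq.com/openai/v1', model: 'llama-3.3-70b-versatile' },
  'ollama':      { baseURL: 'http://localhost:11434/v1',      model: 'llama3.2' },
};

function backendFor(name: string): Backend {
  const backend = BACKENDS[name];
  if (!backend) throw new Error(`Unknown backend: ${name}`);
  // Feed baseURL into `new OpenAI({ baseURL, apiKey })`; the handler body is unchanged.
  return backend;
}
```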

### `max_completion_tokens` — The Silent Truncation Trap

One of the most common failure modes in agentic output tasks is **silent truncation**. When a model hits the `max_completion_tokens` ceiling it stops mid-output and returns whatever it has — no error, no warning. The API call succeeds with a 200 and `finish_reason: "length"` instead of `"stop"`.

**This is especially dangerous for code-rewrite agents** where the output is a full file. A fixed `max_completion_tokens: 3000` cap will silently drop everything after line ~150 of a 200-line fix.

```
# What you set vs what you need

max_completion_tokens: 3000 → enough for a short blog post
                            → NOT enough for a 200-line code rewrite

# Real numbers (gpt-5-mini, order-service.ts rewrite):
  Blockers section:  ~120 tokens
  Fixed code:      ~2,800 tokens (213 lines with // FIX: comments)
  Total needed:    ~3,000 tokens ← hits the cap exactly, empty output
  Fix: set to 16,000 → full rewrite delivered in one shot
```

**Lessons learned from building the code-review swarm:**

| Issue | Root cause | Fix |
|---|---|---|
| Fixed code output was empty | `max_completion_tokens: 3000` too low for a full rewrite | Raise to `16000`+ for any code-output agent |
| `finish_reason: "length"` silently discards output | Model hits cap, returns partial response with no error | Always check `choices[0].finish_reason` and alert on `"length"` |
| `gpt-5.2` slow + expensive for reviewer agents | Flagship model = high latency + $14/1M output tokens | Use `gpt-5-mini` ($2/1M, 128k output, same RPM) for reviewer/fixer agents |
| Coordinator + fixer as two separate calls | Second call hits rate limit window, adds 60s wait | Merge into one combined call with a structured two-section response format |
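
The `finish_reason` check recommended here can be a small guard. A sketch assuming OpenAI's chat completions response shape; `extractContentOrThrow` is an illustrative name, not part of Network-AI:

```typescript
// Guard against silent truncation: a 200 response with finish_reason "length"
// means the model hit max_completion_tokens and the payload is partial.
type ChatResponse = {
  choices: Array<{ finish_reason: string; message: { content: string | null } }>;
};

function extractContentOrThrow(resp: ChatResponse): string {
  const choice = resp.choices[0];
  if (choice.finish_reason === 'length') {
    throw new Error('Truncated output: raise max_completion_tokens and retry');
  }
  return choice.message.content ?? '';
}
```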

**Rule of thumb for `max_completion_tokens` by task:**

| Task | Recommended cap |
|---|---|
| Short classification / sentiment | 200–500 |
| Code review findings (one reviewer) | 400–800 |
| Blocker summary (coordinator) | 500–1,000 |
| Full file rewrite (≤300 lines) | 12,000–16,000 |
| Full file rewrite (≤1,000 lines) | 32,000–64,000 |
| Document / design revision | 16,000–32,000 |

All GPT-5 variants (`gpt-5`, `gpt-5-mini`, `gpt-5-nano`, `gpt-5.2`) support **128,000 max output tokens** — the ceiling is never the model, it's always the cap you set.

#### Cloud GPU Instances (Self-Hosted on AWS / GCP / Azure)

Running your own model on a cloud GPU VM (e.g. AWS `p4d` with A100s, GCP `a2-highgpu`, Azure `NC` series) sits between managed APIs and local hardware:

| Setup | Parallelism | Speed vs managed API | RPM limit |
|---|---|---|---|
| A100 (80GB) + vLLM, Llama 3.3 70B | True parallel | **Faster** — 0.5–2s per call | None |
| H100 + vLLM, Mixtral 8x7B | True parallel | **Faster** — 0.3–1s per call | None |
| T4 / V100 + Ollama, Llama 3.2 8B | True parallel | Comparable | None |

Since you own the endpoint, there are no rate limits — all 5 agents fire at the same moment. At A100 inference speeds, a 5-agent swarm can complete in **3–8 seconds** on a 70B model, comparable to Groq and faster than any managed flagship model.

The tradeoff is cost (GPU VMs run $1–$5/hr) and setup (vLLM install, model download). For high-volume production swarms, or teams that want no external API dependency, it's the fastest architecture available. The connection is identical to local Ollama — just point `baseURL` at your VM's IP.
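
As a rough sketch of the server side, vLLM's OpenAI-compatible endpoint can be brought up with two commands. The model name and flag values here are assumptions to adapt; a gated model like Llama 3.3 also needs Hugging Face access:

```shell
# Sketch: serve Llama 3.3 70B behind vLLM's OpenAI-compatible API on the VM
pip install vllm
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 4 \
  --port 8000
# Then point the swarm's client at http://<vm-ip>:8000/v1
```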

## Permission System

The AuthGuardian evaluates requests using: