Releases: janbalangue/async-bulkhead-llm

v3.0.0

16 Apr 00:00

async-bulkhead-llm v3.0.0

Breaking Changes

  • LLMStats shape changed. bulkhead.stats() no longer returns base Stats fields at the top level.
    Base bulkhead stats now live under stats().bulkhead, and LLM-layer counters now live under
    stats().llm.

  • Code that previously accessed:

    • stats().inFlight
    • stats().pending
    • stats().maxConcurrent
    • stats().maxQueue
    • stats().closed

    must now read:

    • stats().bulkhead.inFlight
    • stats().bulkhead.pending
    • stats().bulkhead.maxConcurrent
    • stats().bulkhead.maxQueue
    • stats().bulkhead.closed

Added

  • stats().llm block with LLM-layer request counters:
    • admitted
    • released
    • rejected
    • rejectedByReason

Changed

  • The run() callback signal type now derives from AcquireOptions["signal"]
    instead of referring to the global AbortSignal type directly.
  • Test utilities now avoid direct dependency on ambient AbortController globals.
  • Bumped async-bulkhead-ts to ^0.4.1.

Migration Guide

From v2 → v3, update stats access only.

Before:

const s = bulkhead.stats();
s.inFlight;
s.pending;

After:

const s = bulkhead.stats();
s.bulkhead.inFlight;
s.bulkhead.pending;

LLM-layer counters are now separate:

const s = bulkhead.stats();
s.llm.admitted;
s.llm.rejected;
s.llm.rejectedByReason.budget_limit;
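If you need code that works against both v2 and v3 during a rollout, a version-agnostic read can be sketched as follows. The types here are minimal stand-ins for illustration, not the library's actual `LLMStats` type:

```typescript
// Hypothetical minimal shapes: flat = v2 layout, nested = v3 layout.
type FlatStats = { inFlight: number; pending: number };
type NestedStats = { bulkhead: { inFlight: number; pending: number } };

// Reads inFlight from either the v2 (flat) or v3 (nested) stats shape.
function readInFlight(s: FlatStats | NestedStats): number {
  return 'bulkhead' in s ? s.bulkhead.inFlight : s.inFlight;
}
```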

Notes

  • No change to admission semantics, token budget semantics, deduplication behavior,
    or graceful shutdown behavior.
  • This release separates underlying bulkhead telemetry from LLM-layer request telemetry.

v2.0.0

03 Mar 21:44

async-bulkhead-llm v2.0.0

Fail-fast admission control for LLM workloads — now with token refunds, multimodal support, and model-aware routing.

This release significantly improves budget utilization and flexibility while preserving the simple v1 API.

🚀 Highlights

💸 Token Refund (Major Improvement)

v1 reserved input + max_tokens and held the full reservation until release.

v2 introduces post-completion refunds:

  • Report actual usage via getUsage() (or token.release(usage))
  • Unused output tokens are immediately returned to the budget
  • Improves throughput under tight token ceilings
  • No breaking changes: behavior matches v1 if usage isn’t provided

await bulkhead.run(
  request,
  async () => callLLM(request),
  {
    getUsage: (res) => ({
      input:  res.usage.input_tokens,
      output: res.usage.output_tokens,
    }),
  },
);

🖼 Multimodal Content Support

content may now be:

  • string
  • ContentBlock[]

Built-in estimators:

  • Count text blocks
  • Ignore non-text blocks
  • Provide lower-bound estimates for multimodal inputs

Custom estimators remain fully supported.
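A custom estimator in this spirit might look like the sketch below. The `ContentBlock` shape and the 4-characters-per-token ratio are illustrative assumptions, not the library's exact types or shipped values:

```typescript
// Illustrative block shape; the real ContentBlock type is defined by the library.
type ContentBlock = { type: string; text?: string };

// Lower-bound token estimate: count characters in text blocks only,
// ignore non-text blocks, and assume ~4 characters per token.
function estimateTokens(content: string | ContentBlock[]): number {
  const chars =
    typeof content === 'string'
      ? content.length
      : content
          .filter((b) => b.type === 'text' && typeof b.text === 'string')
          .reduce((sum, b) => sum + (b.text as string).length, 0);
  return Math.ceil(chars / 4);
}
```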

🧠 Per-Request Model Awareness

You can now route different models through a single bulkhead:

const request = { model: 'claude-haiku-4-5', messages, max_tokens: 512 };

await bulkhead.run(
  request,
  async () => callLLM(request),
);

Estimator behavior:

  • Uses request.model when present
  • Falls back to the bulkhead default model

🔁 In-Flight Deduplication Improvements

  • Default dedup key now includes:
    • messages
    • max_tokens
    • model
  • Prevents cross-model conflation
  • Custom keyFn supported
  • Return "" from keyFn to opt a request out of deduplication
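A custom keyFn in this style might look like the following sketch. The request shape (including the `noDedup` flag) is an illustrative assumption, not the library's type:

```typescript
// Illustrative request shape for this sketch only.
type Request = {
  model?: string;
  messages: unknown[];
  max_tokens: number;
  noDedup?: boolean;
};

// Builds a dedup key from model, max_tokens, and messages, mirroring the
// default key described above; returns "" to opt out of deduplication.
function keyFn(req: Request): string {
  if (req.noDedup) return '';
  return JSON.stringify([req.model ?? 'default', req.max_tokens, req.messages]);
}
```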

📊 New Stats Fields

stats() now includes:

  • tokenBudget.totalRefunded
  • deduplication.active
  • deduplication.hits

These fields improve visibility into load shedding and deduplication savings.
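As an example, a monitoring metric built from these fields could be sketched like this. The stats type below is a minimal stand-in, and `totalRuns` is a hypothetical counter the caller would track itself:

```typescript
// Minimal stand-in for the v2 stats fields listed above.
type V2Stats = {
  tokenBudget: { totalRefunded: number };
  deduplication: { active: number; hits: number };
};

// Fraction of run() calls satisfied by an in-flight duplicate.
function dedupHitRate(s: V2Stats, totalRuns: number): number {
  return totalRuns === 0 ? 0 : s.deduplication.hits / totalRuns;
}
```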

🧩 Profiles

Two built-in presets:

  • interactive (default): fail-fast, no waiting
  • batch: bounded queue + timeout

Custom profile objects supported.

⚠️ Breaking Changes

None for typical usage.

If you relied on:

  • Exact token reservation behavior (no refunds)
  • Previous dedup key semantics

review the updated behavior before upgrading; most users require no changes.

🛠 Migration from v1

Most callers need zero changes.

To benefit from refunds:

  • Provide getUsage() in run(), or
  • Pass usage to token.release()

See the README for full details.

🧱 What This Library Is (and Isn’t)

This library enforces:

  • Concurrency ceilings
  • Token budgets
  • Fail-fast load shedding
  • Backpressure at LLM boundaries

It does not:

  • Retry
  • Wrap provider SDKs
  • Perform distributed rate limiting
  • Perform cost accounting

📦 Compatibility

  • Node.js 20+
  • ESM + CJS builds
  • Zero runtime dependencies beyond async-bulkhead-ts

❤️ Why v2 Matters

Token refunds dramatically improve effective capacity under real-world workloads where:

  • max_tokens is over-provisioned
  • Outputs are shorter than caps
  • Budgets are tight
  • Multiple models share a boundary

v2 allows you to keep strict ceilings without sacrificing utilization.

v1.0.3

26 Feb 23:53

async-bulkhead-llm v1.0.3

Overview

  • Metadata-only maintenance release.
  • This version aligns the package.json license field with the repository’s Apache 2.0 license.

Changed

  • Updated license field in package.json to Apache-2.0
  • No runtime changes
  • No API changes
  • No type changes

Compatibility

Fully compatible with:

  • 1.0.0
  • 1.0.1
  • 1.0.2

Safe upgrade. No migration required.

v1.0.2

26 Feb 23:29

async-bulkhead-llm v1.0.2

Overview

Metadata-only maintenance release.

This version corrects the GitHub URLs in package.json to ensure the homepage, repository, and bugs fields point to the canonical repository.

Changed

  • Fixed GitHub URLs in package metadata
  • No runtime changes
  • No API changes
  • No type changes

Compatibility

Fully compatible with:

  • 1.0.0
  • 1.0.1

Safe upgrade. No migration required.

v1.0.1

26 Feb 23:09

async-bulkhead-llm v1.0.1

Overview

Maintenance release.

This version updates the underlying concurrency primitive dependency and includes packaging/CI hardening to ensure published artifacts always contain the correct ESM, CJS, and type outputs.

No API changes. No behavior changes. No migration required.

Changed

  • Bumped async-bulkhead-ts to ^0.3.0
  • Hardened packaging workflow:
    • Ensures dist/ is built before pack/publish
    • Added deterministic tarball verification in CI

Stability

  • No changes to:
    • Admission semantics
    • Token budget logic
    • Deduplication behavior
    • Rejection reasons
    • Public types
    • Runtime stats surface

Fully compatible with 1.0.0.

Upgrade

npm install async-bulkhead-llm@1.0.1

No code changes required.

v1.0.0

25 Feb 02:54

async-bulkhead-llm v1.0.0

Initial stable release.

async-bulkhead-llm provides fail-fast admission control for LLM workloads, built on async-bulkhead-ts. It is designed for services that need to enforce cost ceilings, concurrency limits, and backpressure at the boundary of their LLM calls.

🚀 Highlights

Hard Concurrency Limits

  • Strict maxConcurrent enforcement
  • Optional bounded queue via maxQueue
  • Fail-fast by default (maxQueue: 0)

Token-Aware Admission

  • Enforce a ceiling on total in-flight tokens
  • Reservations are calculated from input + maxOutput
  • Admission fails fast when the token budget is exceeded
  • Independent of concurrency headroom or queue configuration
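The admission rule above can be pictured as a simple check; this is a sketch of the described behavior, not the library's internals:

```typescript
// Admission rule sketch: a request reserves input + maxOutput tokens and
// is admitted only if the reservation fits the remaining token budget.
function admit(
  inFlightTokens: number,
  budget: number,
  inputTokens: number,
  maxOutput: number,
): boolean {
  const reservation = inputTokens + maxOutput;
  return inFlightTokens + reservation <= budget;
}
```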

Model-Aware Estimation

  • Built-in per-model character-to-token ratios
  • Longest-prefix matching for known model families
  • Exact override support
  • Fallback to flat 4.0 ratio for unknown models
  • Optional onUnknownModel hook
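A longest-prefix lookup in this spirit could be sketched as follows. The ratio table entries here are illustrative placeholders, not the library's shipped per-model values:

```typescript
// Illustrative chars-per-token ratios; real values ship with the library.
const RATIOS: Record<string, number> = {
  'claude-haiku': 3.5,
  'claude': 3.8,
};

// Picks the ratio whose key is the longest prefix of the model name,
// falling back to a flat 4.0 for unknown models.
function ratioFor(model: string): number {
  let best = '';
  for (const prefix of Object.keys(RATIOS)) {
    if (model.startsWith(prefix) && prefix.length > best.length) best = prefix;
  }
  return best ? RATIOS[best] : 4.0;
}
```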

In-Flight Deduplication

  • Identical message payloads share a single LLM call
  • Reduces duplicate work under burst conditions
  • Dedup stats available via bulkhead.stats()

Clean API Surface

  • bulkhead.run(request, fn) — primary API (auto acquire + release)
  • bulkhead.acquire(request) — manual lifecycle control
  • LLMBulkheadRejectedError with structured reason
  • Runtime stats() with optional tokenBudget and deduplication blocks
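Handling a rejection might look like the sketch below. The error class here is a minimal stand-in for the library's LLMBulkheadRejectedError, used only to illustrate branching on a structured reason rather than parsing error messages:

```typescript
// Minimal stand-in for the library's LLMBulkheadRejectedError.
class RejectedError extends Error {
  constructor(public reason: string) {
    super(`rejected: ${reason}`);
  }
}

// Callers can branch on the structured reason field directly.
function describeRejection(err: unknown): string {
  if (err instanceof RejectedError) return err.reason;
  return 'unknown';
}
```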

Profiles

  • 'interactive' — fail-fast, no queue
  • 'batch' — bounded queue, 30s timeout
  • Escape hatch via plain preset object

📦 Runtime Characteristics

  • Node.js 20+
  • ESM + CommonJS builds
  • Full TypeScript typings
  • Zero dependencies beyond async-bulkhead-ts
  • No retries
  • No provider SDK coupling
  • No distributed coordination

⚠️ Design Constraints (Intentional)

  • Token estimation is approximate — suitable for load-shedding, not billing
  • Deduplication key in v1 is JSON.stringify(messages)
  • Multimodal (non-string content) is not supported by built-in estimators
  • Refund mechanism (adjusting reservations based on actual usage) is planned for v2

🎯 Intended Use

This library is for enforcing backpressure at the service boundary of LLM calls:

  • Prevent burst cost explosions
  • Enforce cost ceilings
  • Avoid cascading saturation
  • Shed excess load early

It does not replace:

  • Retry libraries
  • Cost accounting systems
  • Distributed rate limiting
  • Provider SDKs

🔒 Security

See SECURITY.md for the vulnerability disclosure process and defined threat surface.

🧪 Stability

This is the first stable release. The API surface is intentionally small and opinionated. Breaking changes will follow semantic versioning.