Releases: janbalangue/async-bulkhead-llm
v3.0.0
[3.0.0] —
Breaking Changes
- `LLMStats` shape changed. `bulkhead.stats()` no longer returns base stats fields at the top level. Base bulkhead stats now live under `stats().bulkhead`, and LLM-layer counters now live under `stats().llm`.
- Code that previously accessed:
  `stats().inFlight`, `stats().pending`, `stats().maxConcurrent`, `stats().maxQueue`, `stats().closed`
  must now read:
  `stats().bulkhead.inFlight`, `stats().bulkhead.pending`, `stats().bulkhead.maxConcurrent`, `stats().bulkhead.maxQueue`, `stats().bulkhead.closed`
Added
- New `stats().llm` block with LLM-layer request counters: `admitted`, `released`, `rejected`, `rejectedByReason`
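For reference, the nested shape implied by these notes can be sketched as a TypeScript type. This is an inference from the changelog, not the library's published declaration:

```ts
// Sketch of the v3 stats shape implied by these release notes.
// Field names beyond those listed above are assumptions.
interface BulkheadStats {
  inFlight: number;
  pending: number;
  maxConcurrent: number;
  maxQueue: number;
  closed: boolean;
}

interface LLMStats {
  bulkhead: BulkheadStats;
  llm: {
    admitted: number;
    released: number;
    rejected: number;
    rejectedByReason: Record<string, number>;
  };
}

// Example value conforming to the sketch:
const example: LLMStats = {
  bulkhead: { inFlight: 2, pending: 0, maxConcurrent: 4, maxQueue: 0, closed: false },
  llm: { admitted: 10, released: 8, rejected: 1, rejectedByReason: { budget_limit: 1 } },
};
```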
Changed
- The `run()` callback signal type now derives from `AcquireOptions["signal"]` instead of referring to the global `AbortSignal` type directly.
- Test utilities now avoid direct dependency on ambient `AbortController` globals.
- Bumped `async-bulkhead-ts` to `^0.4.1`.
Migration Guide
From v2 → v3, update stats access only.
Before:

```ts
const s = bulkhead.stats();
s.inFlight;
s.pending;
```

After:

```ts
const s = bulkhead.stats();
s.bulkhead.inFlight;
s.bulkhead.pending;
```

LLM-layer counters are now separate:

```ts
const s = bulkhead.stats();
s.llm.admitted;
s.llm.rejected;
s.llm.rejectedByReason.budget_limit;
```

Notes
- No change to admission semantics, token budget semantics, deduplication behavior, or graceful shutdown behavior.
- This release separates underlying bulkhead telemetry from LLM-layer request telemetry.
v2.0.0
async-bulkhead-llm v2.0.0
Fail-fast admission control for LLM workloads — now with token refunds, multimodal support, and model-aware routing.
This release significantly improves budget utilization and flexibility while preserving the simple v1 API.
🚀 Highlights
💸 Token Refund (Major Improvement)
v1 reserved input + max_tokens and held the full reservation until release.
v2 introduces post-completion refunds:
- Report actual usage via `getUsage()` (or `token.release(usage)`)
- Unused output tokens are immediately returned to the budget
- Improves throughput under tight token ceilings
- No breaking changes — behavior matches v1 if usage isn't provided
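The refund accounting described above can be illustrated with a standalone budget model. This is a simplified sketch: `TokenBudget`, `reserve`, and `settle` are invented names for illustration, not the library's API or internals:

```ts
// Simplified token-budget model illustrating post-completion refunds:
// reserve input + max output up front, then refund the unused output
// portion as soon as actual usage is known.
class TokenBudget {
  private used = 0;
  constructor(private readonly capacity: number) {}

  /** Reserve tokens up front; returns false for a fail-fast rejection. */
  reserve(tokens: number): boolean {
    if (this.used + tokens > this.capacity) return false;
    this.used += tokens;
    return true;
  }

  /** Settle a reservation against actual usage, refunding the unused part. */
  settle(reserved: number, actual: number): void {
    this.used -= reserved - Math.min(actual, reserved);
  }

  /** Release the remaining (actual) hold when the call completes. */
  release(tokens: number): void {
    this.used -= tokens;
  }

  get available(): number {
    return this.capacity - this.used;
  }
}

const budget = new TokenBudget(1000);
budget.reserve(600);     // e.g. input 100 + max_tokens 500
budget.settle(600, 220); // actual usage 220 → 380 tokens refunded immediately
```

Under v1 semantics the full 600 would stay held until release; with refunds, 380 tokens return to the budget the moment usage is reported.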
```ts
await bulkhead.run(
  request,
  async () => callLLM(request),
  {
    getUsage: (res) => ({
      input: res.usage.input_tokens,
      output: res.usage.output_tokens,
    }),
  },
);
```

🖼 Multimodal Content Support
content may now be:
- `string`
- `ContentBlock[]`
Built-in estimators:
- Count text blocks
- Ignore non-text blocks
- Provide lower-bound estimates for multimodal inputs
Custom estimators remain fully supported.
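A lower-bound estimator along the lines described (count text blocks, ignore the rest) might look like this. The `ContentBlock` shape and the 4.0 chars-per-token ratio here are assumptions for illustration, not the library's built-ins:

```ts
// Illustrative lower-bound token estimator for multimodal content.
// Non-text blocks contribute nothing, so the result is a lower bound.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: string; [k: string]: unknown };

function estimateTokens(content: string | ContentBlock[], charsPerToken = 4.0): number {
  const text =
    typeof content === "string"
      ? content
      : content
          .filter((b): b is { type: "text"; text: string } => b.type === "text")
          .map((b) => b.text)
          .join("");
  return Math.ceil(text.length / charsPerToken);
}
```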
🧠 Per-Request Model Awareness
You can now route different models through a single bulkhead:
```ts
const request = { model: 'claude-haiku-4-5', messages, max_tokens: 512 };
await bulkhead.run(request, async () => callLLM(request));
```

Estimator behavior:
- Uses `request.model` when present
- Falls back to the bulkhead default model
🔁 In-Flight Deduplication Improvements
- Default dedup key now includes: `messages`, `max_tokens`, `model`
- Prevents cross-model conflation
- Custom `keyFn` supported
- Return `""` to opt a request out
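A key function with the semantics above (messages plus `max_tokens` plus `model`, so identical prompts to different models are kept apart) could be sketched as follows. The request shape is an assumption for illustration:

```ts
// Illustrative default dedup key including model and max_tokens so that
// identical prompts aimed at different models are not conflated.
interface LLMRequest {
  model?: string;
  messages: unknown[];
  max_tokens: number;
}

function defaultDedupKey(req: LLMRequest): string {
  return JSON.stringify({
    model: req.model ?? "",
    max_tokens: req.max_tokens,
    messages: req.messages,
  });
}

const a = { model: "claude-haiku-4-5", messages: [{ role: "user", content: "hi" }], max_tokens: 512 };
const b = { model: "claude-sonnet-4-5", messages: [{ role: "user", content: "hi" }], max_tokens: 512 };
// Same prompt, different model → different keys.
```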
📊 New Stats Fields
`stats()` now includes:
- `tokenBudget.totalRefunded`
- `deduplication.active`
- `deduplication.hits`
Improved visibility into load shedding and savings.
🧩 Profiles
Two built-in presets:
- `interactive` (default): fail-fast, no waiting
- `batch`: bounded queue + timeout
Custom profile objects supported.
⚠️ Breaking Changes
None for typical usage.
If you relied on:
- Exact token reservation behavior (no refunds)
- Previous dedup key semantics
Review the updated behavior; most users require no changes.
🛠 Migration from v1
Most callers need zero changes.
To benefit from refunds:
- Provide `getUsage()` in `run()`, or
- Pass usage to `token.release()`
See the README for full details.
🧱 What This Library Is (and Isn’t)
This library enforces:
- Concurrency ceilings
- Token budgets
- Fail-fast load shedding
- Backpressure at LLM boundaries
It does not:
- Retry
- Wrap provider SDKs
- Perform distributed rate limiting
- Perform cost accounting
📦 Compatibility
- Node.js 20+
- ESM + CJS builds
- Zero runtime dependencies beyond `async-bulkhead-ts`
❤️ Why v2 Matters
Token refunds dramatically improve effective capacity under real-world workloads where:
- `max_tokens` is over-provisioned
- Outputs are shorter than caps
- Budgets are tight
- Multiple models share a boundary
v2 allows you to keep strict ceilings without sacrificing utilization.
v1.0.3
async-bulkhead-llm v1.0.3
Overview
- Metadata-only maintenance release.
- This version aligns the `package.json` license field with the repository's Apache 2.0 license.
Changed
- Updated license field in `package.json` to Apache-2.0
- No runtime changes
- No API changes
- No type changes
Compatibility
Fully compatible with:
- 1.0.0
- 1.0.1
- 1.0.2
Safe upgrade. No migration required.
v1.0.2
async-bulkhead-llm v1.0.2
Overview
Metadata-only maintenance release.
This version corrects the GitHub URLs in package.json to ensure the homepage, repository, and bugs fields point to the canonical repository.
Changed
- Fixed GitHub URLs in package metadata
- No runtime changes
- No API changes
- No type changes
Compatibility
Fully compatible with:
- 1.0.0
- 1.0.1
Safe upgrade. No migration required.
v1.0.1
async-bulkhead-llm v1.0.1
Overview
Maintenance release.
This version updates the underlying concurrency primitive dependency and includes packaging/CI hardening to ensure published artifacts always contain the correct ESM, CJS, and type outputs.
No API changes. No behavior changes. No migration required.
Changed
- Bumped `async-bulkhead-ts` to `^0.3.0`
- Hardened packaging workflow:
  - Ensures `dist/` is built before pack/publish
  - Added deterministic tarball verification in CI
Stability
- No changes to:
- Admission semantics
- Token budget logic
- Deduplication behavior
- Rejection reasons
- Public types
- Runtime stats surface
Fully compatible with 1.0.0.
Upgrade
```shell
npm install async-bulkhead-llm@1.0.1
```

No code changes required.
v1.0.0
async-bulkhead-llm v1.0.0
Initial stable release.
async-bulkhead-llm provides fail-fast admission control for LLM workloads, built on async-bulkhead-ts. It is designed for services that need to enforce cost ceilings, concurrency limits, and backpressure at the boundary of their LLM calls.
🚀 Highlights
Hard Concurrency Limits
- Strict `maxConcurrent` enforcement
- Optional bounded queue via `maxQueue`
- Fail-fast by default (`maxQueue: 0`)
Token-Aware Admission
- Enforce a ceiling on total in-flight tokens
- Reservations are calculated from `input + maxOutput`
- Admission fails fast when the token budget is exceeded
- Independent of concurrency headroom or queue configuration
Model-Aware Estimation
- Built-in per-model character-to-token ratios
- Longest-prefix matching for known model families
- Exact override support
- Fallback to flat 4.0 ratio for unknown models
- Optional `onUnknownModel` hook
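Longest-prefix matching with a flat fallback can be sketched as follows. The ratio table entries here are invented examples, not the library's built-in values:

```ts
// Illustrative longest-prefix chars-per-token lookup with a flat 4.0
// fallback for unknown models. Table contents are made-up examples.
const ratios: Record<string, number> = {
  "claude": 3.8,
  "claude-haiku": 3.6,
  "gpt-4": 4.2,
};

function ratioFor(model: string, fallback = 4.0): number {
  let best: string | undefined;
  for (const prefix of Object.keys(ratios)) {
    // Among all matching prefixes, keep the longest (most specific) one.
    if (model.startsWith(prefix) && (best === undefined || prefix.length > best.length)) {
      best = prefix;
    }
  }
  return best !== undefined ? ratios[best] : fallback;
}
```

An exact model name can simply be added to the table as its own entry, which longest-prefix matching then prefers over any family prefix.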
In-Flight Deduplication
- Identical message payloads share a single LLM call
- Reduces duplicate work under burst conditions
- Dedup stats available via `bulkhead.stats()`
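In-flight deduplication is essentially promise coalescing: concurrent calls with the same key share one underlying promise, and the entry is dropped once the call settles. A minimal sketch, not the library's implementation:

```ts
// Minimal in-flight deduplication via promise coalescing. While a call
// for a given key is pending, later callers receive the same promise;
// the map entry is removed when the promise settles.
const inFlight = new Map<string, Promise<unknown>>();

function dedupe<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>;
  const p = fn().finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}
```

Because the entry is removed on settlement, only truly concurrent requests are coalesced; this is burst suppression, not a response cache.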
Clean API Surface
- `bulkhead.run(request, fn)` — primary API (auto acquire + release)
- `bulkhead.acquire(request)` — manual lifecycle control
- `LLMBulkheadRejectedError` with structured reason
- Runtime `stats()` with optional `tokenBudget` and `deduplication` blocks
Profiles
- 'interactive' — fail-fast, no queue
- 'batch' — bounded queue, 30s timeout
- Escape hatch via plain preset object
📦 Runtime Characteristics
- Node.js 20+
- ESM + CommonJS builds
- Full TypeScript typings
- Zero dependencies beyond `async-bulkhead-ts`
- No retries
- No provider SDK coupling
- No distributed coordination
⚠️ Design Constraints (Intentional)
- Token estimation is approximate — suitable for load-shedding, not billing
- Deduplication key in v1 is `JSON.stringify(messages)`
- Multimodal (non-string content) is not supported by built-in estimators
- Refund mechanism (adjusting reservations based on actual usage) is planned for v2
🎯 Intended Use
This library is for enforcing backpressure at the service boundary of LLM calls:
- Prevent burst cost explosions
- Enforce cost ceilings
- Avoid cascading saturation
- Shed excess load early
It does not replace:
- Retry libraries
- Cost accounting systems
- Distributed rate limiting
- Provider SDKs
🔒 Security
See SECURITY.md for the vulnerability disclosure process and defined threat surface.
🧪 Stability
This is the first stable release. The API surface is intentionally small and opinionated. Breaking changes will follow semantic versioning.