AIT-221: Document how token streaming interacts with rate limits #3092
Open
rainbowFi wants to merge 9 commits into AIT-129-AIT-Docs-release-branch from ait-221-rate-limits
+71
−0
Changes from all commits
9 commits
68ec491 AIT-221: Document how token streaming interacts with rate limits (rainbowFi)
fe8a4a7 Fix article title in nav (rainbowFi)
6dbe905 Update text to clarify language and remove duplicate link (rainbowFi)
b2b0d66 Update based on Paddy's comments (rainbowFi)
10f0870 Update naming following review (rainbowFi)
a29374d WIP - update with transport param (rainbowFi)
135894c Complete updates to include transport param documentation (rainbowFi)
5764de1 fixup: clarify paragraph based on review and add note on server-side … (rainbowFi)
04a4500 Add heading link tags (rainbowFi)
67 changes: 67 additions & 0 deletions
src/pages/docs/ai-transport/features/token-streaming/token-rate-limits.mdx
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,67 @@ | ||
| --- | ||
| title: Token streaming limits | ||
| meta_description: "Learn how token streaming interacts with Ably message limits and how to ensure your application delivers consistent performance." | ||
| --- | ||
|
|
||
| LLM token streaming introduces bursty traffic patterns to your application, with some models outputting 150+ distinct events (i.e. tokens or response deltas) per second. Output rates can vary unpredictably over the lifetime of a response stream, and you have limited control over third-party model behaviour. Without planning, token streams risk triggering [rate limits](/docs/platform/pricing/limits). | ||
|
|
||
| Ably scales as your traffic grows, and rate limits exist to protect service quality against accidental spikes or deliberate abuse. They also help protect your consumption costs if abuse does occur. If you are on the correct package for your use case, hitting a limit should be an infrequent occurrence. The approach to staying within limits when using AI Transport depends on which [token streaming pattern](/docs/ai-transport/features/token-streaming) you use. | ||
|
|
||
| ## Message-per-response <a id="per-response"/> | ||
|
|
||
| The [message-per-response](/docs/ai-transport/features/token-streaming/message-per-response) pattern includes automatic rate limit protection. AI Transport prevents a single response stream from reaching the message rate limit by rolling up multiple appends into a single published message: | ||
|
|
||
| 1. Your agent streams tokens to the channel at the model's output rate | ||
| 2. Ably publishes the first token immediately, then automatically rolls up subsequent tokens on receipt | ||
| 3. Clients receive the same number of tokens per second, delivered in fewer messages | ||
|
|
||
| By default, a single response stream will be delivered at 25 messages per second or the model output rate, whichever is lower. This means you can publish two simultaneous response streams on the same channel or connection with any [Ably package](/docs/platform/pricing#packages), because each stream is limited to 50% of the [connection inbound message rate](/docs/platform/pricing/limits#connection). You will be charged for the number of published messages, not for the number of streamed tokens. | ||
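|
|
||
| As a rough, illustrative sketch of that arithmetic (the 50 messages per second figure is an assumption derived from the 50% statement above, not a package-specific guarantee; check the [limits](/docs/platform/pricing/limits#connection) for your own package): | ||
|
|
||
| <Code> | ||
| ```javascript | ||
| // Illustrative arithmetic only, not an API call. | ||
| const connectionInboundLimit = 50; // assumed inbound messages/s per connection | ||
| const maxRatePerStream = 25;       // default per-stream delivery rate described above | ||
| // Number of simultaneous response streams that fit within the connection limit. | ||
| const simultaneousStreams = Math.floor(connectionInboundLimit / maxRatePerStream); // 2 | ||
| ``` | ||
| </Code> | ||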
|
|
||
| ### Configuring rollup behaviour <a id="rollup"/> | ||
|
|
||
| Ably joins all appends for a single response that are received during the rollup window into one published message. You can specify the rollup window for a particular connection by setting the `appendRollupWindow` transport parameter. This allows you to control how much of the connection message rate a single response stream can consume, and to manage your consumption costs. | ||
|
|
||
|
|
||
| | appendRollupWindow | Maximum message rate for a single response | | ||
| |---|---| | ||
| | 0ms | Model output rate | | ||
| | 20ms | 50 messages/s | | ||
| | 40ms *(default)* | 25 messages/s | | ||
| | 100ms | 10 messages/s | | ||
| | 500ms *(max)* | 2 messages/s | | ||
|
|
||
| The following example code demonstrates establishing a connection to Ably with `appendRollupWindow` set to 100ms: | ||
|
|
||
| <Code> | ||
| ```javascript | ||
| const ably = new Ably.Realtime( | ||
| { | ||
| key: 'your-api-key', | ||
| transportParams: { appendRollupWindow: 100 } | ||
| } | ||
| ); | ||
| ``` | ||
| </Code> | ||
|
|
||
| <Aside data-type="important"> | ||
| If you configure `appendRollupWindow` to allow a single response to use more than your [connection inbound message rate](/docs/platform/pricing/limits#connection), you will see [limit enforcement](/docs/platform/pricing/limits#hitting) behaviour if you stream tokens faster than the allowed message rate. | ||
| </Aside> | ||
|
|
||
| ## Message-per-token <a id="per-token"/> | ||
|
|
||
| The [message-per-token](/docs/ai-transport/features/token-streaming/message-per-token) pattern requires you to manage rate limits directly. Each token is published as a separate message, so high-speed model output can consume message allowances quickly. | ||
|
|
||
| To stay within limits: | ||
|
|
||
| - Calculate your headroom by comparing your model's peak output rate against your package's [connection inbound message rate](/docs/platform/pricing/limits#connection) | ||
| - Account for concurrency by multiplying peak rates by the maximum number of simultaneous streams your application supports | ||
| - If required, batch tokens in your agent before publishing to the SDK, reducing message count while maintaining delivery speed (a sketch follows this list) | ||
| - Enable [server-side batching](/docs/messages/batch#server-side) to reduce the number of messages delivered to your subscribers | ||
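|
|
||
| The following sketch shows one way to batch tokens in your agent before publishing. It assumes a promise-based ably-js client, that `modelStream` is an async iterable of token strings from your model provider, and that the channel name, event name and batch window are illustrative choices rather than part of the AI Transport API: | ||
|
|
||
| <Code> | ||
| ```javascript | ||
| import Ably from 'ably'; | ||
| const ably = new Ably.Realtime({ key: 'your-api-key' }); | ||
| const channel = ably.channels.get('response-tokens'); // illustrative channel name | ||
| const BATCH_WINDOW_MS = 100; // at most ~10 published messages per second | ||
| async function streamWithBatching(modelStream) { | ||
|   let batch = []; | ||
|   let lastPublish = 0; | ||
|   for await (const token of modelStream) { | ||
|     batch.push(token); | ||
|     // Publish the first token immediately, then at most once per batch window. | ||
|     if (Date.now() - lastPublish >= BATCH_WINDOW_MS) { | ||
|       await channel.publish('tokens', { tokens: batch }); | ||
|       batch = []; | ||
|       lastPublish = Date.now(); | ||
|     } | ||
|   } | ||
|   // Flush any tokens left over when the model finishes the response. | ||
|   if (batch.length > 0) { | ||
|     await channel.publish('tokens', { tokens: batch }); | ||
|   } | ||
| } | ||
| ``` | ||
| </Code> | ||
|
|
||
| Batching this way adds up to one batch window of latency to each token after the first, so choose a window that balances message consumption against delivery smoothness. | ||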
|
|
||
| If your application requires higher message rates than your current package allows, [contact Ably](/contact) to discuss options. | ||
|
|
||
| ## Next steps <a id="next-steps"/> | ||
|
|
||
| - Review [Ably platform limits](/docs/platform/pricing/limits) to understand rate limit thresholds for your package | ||
| - Learn about the [message-per-response](/docs/ai-transport/features/token-streaming/message-per-response) pattern for automatic rate limit protection | ||
| - Learn about the [message-per-token](/docs/ai-transport/features/token-streaming/message-per-token) pattern for fine-grained control | ||
This paragraph talks about limits scaling as your traffic grows, and the importance of being on the correct package for your use case. That applies to whole-account / quota / extensive limits. But the rest of this document is about the connection client-to-server message rate limit and the channel message rate limit, which are both local, intensive limits: they don't scale as traffic grows and don't change with your quota. So this paragraph seems like it might cause confusion in this context
I was aiming for this paragraph to provide general scene setting about rate limits. Channel and connection rate limits can be changed if you are an enterprise customer, so the principle of it being unusual to hit rate limits if you are on the correct package is true for all limit types. The existing rate limit documentation is pretty exhaustive, so I deliberately avoided repeating anything here, but is there particular information you would suggest adding to remove the ambiguity? Or perhaps the solution is to include discussion of channel / app / account limits in the sections below?