A near real-time load metric collection component, designed for intelligent inference scheduling in large-scale inference services.
English | 中文
Status: in early and rapid development.
Load metrics are very important for an LLM inference scheduler.
Typically, the following four load metrics matter most (at the per-engine level):
- Total number of requests
- Token usage (KVCache usage)
- Number of requests in Prefill
- Prompt length in Prefill
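
To make these concrete, here is a minimal sketch in Go, assuming a hypothetical `EngineMetrics` type (not this project's actual API), of how the four per-engine counters could be kept as atomic values so they can be updated from concurrent request paths:

```go
package loadmetrics

import "sync/atomic"

// EngineMetrics holds near real-time load metrics for a single inference engine.
// This is an illustrative sketch, not the component's real data structure.
type EngineMetrics struct {
	TotalRequests    atomic.Int64 // total in-flight requests on this engine
	TokenUsage       atomic.Int64 // token (KVCache) usage
	PrefillRequests  atomic.Int64 // requests currently in the prefill phase
	PrefillPromptLen atomic.Int64 // sum of prompt lengths currently in prefill
}
```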
Timeliness is critical in large-scale services. Stale metrics lead to races: multiple requests may be scheduled to the same inference engine before its load metrics are updated.
Polling metrics from engines introduces a fixed periodic delay. Especially in large-scale scenarios, as QPS (throughput) increases, these races also increase significantly.
Cooperating with an Inference Gateway (e.g. AIGW), we can achieve near real-time load metric collection with the following steps (see the sketch after the list):
- Request proxied to Inference Engine:
  a. prefill & total request number: +1
  b. prefill prompt length: +prompt-length
- First token responded:
  a. prefill request number: -1
  b. prefill prompt length: -prompt-length
- Request done:
  a. total request number: -1
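
The sketch below shows how a gateway could apply these adjustments, assuming the `EngineMetrics` type above; the hook names (`OnRequestProxied`, `OnFirstToken`, `OnRequestDone`) are hypothetical and only mark where each step would run:

```go
// OnRequestProxied runs when the gateway proxies a request to an engine:
// the request enters prefill and counts toward the engine's total load.
func OnRequestProxied(m *EngineMetrics, promptLen int64) {
	m.PrefillRequests.Add(1)
	m.TotalRequests.Add(1)
	m.PrefillPromptLen.Add(promptLen)
}

// OnFirstToken runs when the first token is responded:
// prefill is finished, so its contribution is removed.
func OnFirstToken(m *EngineMetrics, promptLen int64) {
	m.PrefillRequests.Add(-1)
	m.PrefillPromptLen.Add(-promptLen)
}

// OnRequestDone runs when the request completes:
// the request no longer counts toward total load.
func OnRequestDone(m *EngineMetrics) {
	m.TotalRequests.Add(-1)
}
```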
Furthermore, we can introduce a CAS (compare-and-swap) API to reduce races even further, when it is required in the future.
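
As one possible illustration of the idea (not a committed design), a scheduler could reserve capacity on an engine with a compare-and-swap, so a concurrent scheduler observing the same stale value fails the swap and retries instead of double-booking the engine; `TryReserve` and `maxRequests` are hypothetical names:

```go
// TryReserve attempts to atomically claim one request slot on the engine.
// It returns false if the engine is already at capacity.
func TryReserve(m *EngineMetrics, maxRequests int64) bool {
	for {
		cur := m.TotalRequests.Load()
		if cur >= maxRequests {
			return false // engine already at capacity
		}
		if m.TotalRequests.CompareAndSwap(cur, cur+1) {
			return true // slot claimed without racing another scheduler
		}
		// another scheduler changed the counter concurrently; retry with the new value
	}
}
```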
This project is licensed under Apache 2.0.
