**Talk Abstract:** This talk introduces the EAGLE series, a family of lossless acceleration algorithms for large language models. EAGLE performs autoregression at a structured feature level rather than the token level and incorporates sampling results to eliminate uncertainty; these innovations make its draft model both lightweight and highly accurate, accelerating inference by 2.1x–3.8x while provably preserving the output distribution. EAGLE-2 adds dynamic draft trees: it uses the draft model's confidence estimates to approximate draft-token acceptance rates and adjusts the tree structure on the fly to maximize acceptance length, yielding an additional 20%–40% speedup over EAGLE-1 for a total acceleration of 2.5x–5.0x, again without altering the output distribution. We will also introduce our latest algorithm, EAGLE-3. The EAGLE series has been widely adopted in industry and integrated into open-source frameworks, including vLLM, SGLang, TensorRT-LLM, MLC-LLM, AWS NeuronX Distributed Core, Intel LLM Library for PyTorch, and Intel Extension for Transformers.
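The "lossless" guarantee comes from the standard speculative-sampling accept/reject rule, which EAGLE builds on: a draft token is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized residual max(0, p − q). The sketch below illustrates that generic rule only (it is not EAGLE's feature-level drafting); the distributions and the `speculative_accept` helper are illustrative assumptions, not code from the EAGLE project.

```python
import numpy as np

def speculative_accept(p, q, draft_token, rng):
    """Generic speculative-sampling acceptance rule (not EAGLE-specific).

    p: target-model next-token probabilities (1-D array summing to 1)
    q: draft-model next-token probabilities (same shape)
    draft_token: token index proposed by the draft model

    Accept the draft token with probability min(1, p/q); otherwise
    resample from the renormalized residual max(0, p - q). The emitted
    token is then distributed exactly according to p.
    """
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token, True  # draft token accepted
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()    # renormalize the leftover mass
    return int(rng.choice(len(p), p=residual)), False

# Illustrative distributions: the target puts more mass on token 0
# than the draft does, so a drafted token 0 is always accepted.
rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])  # target model
q = np.array([0.2, 0.5, 0.3])  # draft model
token, accepted = speculative_accept(p, q, draft_token=0, rng=rng)
```

EAGLE-2's tree adjustment then amounts to spending the draft budget on branches whose estimated acceptance probability under this rule is highest.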