Description
When using benchmark_e2e.py to run against a phi3/phi4 model, we encountered a sharp perf regression at sequence length >= 4097.
Devices: Mac with an Apple M2 Pro chip, and Windows with an NVIDIA 5080 GPU
Repro steps:
- Download phi3/phi4 model
- Run "python benchmark/python/benchmark_e2e.py -i model_path -l 4096 -g 738 --use_prompt_set"
- Check the last token generation time (0.028 ms on my Mac machine)
- Run "python benchmark/python/benchmark_e2e.py -i model_path -l 3836 -g 739 --use_prompt_set"
- Check the last token generation time, which should be much larger than in the first run (11.06 ms on my Mac machine)
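The per-token timings above can be collected with a small harness around the generation loop. The sketch below is illustrative, not taken from benchmark_e2e.py: the `time_tokens` helper is ours, and the commented-out `onnxruntime_genai` wiring assumes a local model path, so only the timing helper itself is runnable here.

```python
import time

def time_tokens(generator, max_new_tokens):
    """Time each generate_next_token() call and return per-token latencies.

    `generator` is any object exposing is_done() and generate_next_token(),
    e.g. an onnxruntime_genai og.Generator.
    """
    latencies = []
    for _ in range(max_new_tokens):
        if generator.is_done():
            break
        start = time.perf_counter()
        generator.generate_next_token()
        latencies.append(time.perf_counter() - start)
    return latencies

# Hypothetical wiring with onnxruntime_genai (requires a downloaded model):
#   import onnxruntime_genai as og
#   model = og.Model("model_path")
#   params = og.GeneratorParams(model)
#   generator = og.Generator(model, params)
#   generator.append_tokens(prompt_token_ids)
#   latencies = time_tokens(generator, 738)
```

With such a harness, the regression shows up as a single outlier latency on the token that pushes the total sequence length to 4097.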
It looks like generating token 4097 triggers a profiling step. After checking the code, we believe it is caused by this change: Recompute KV cache for Phi3 when switching from short to long factor #1161. For phi3/phi4 (the model type is still phi3 in the Edge on-device model), when the prompt + generation sequence length reaches 4097, GenAI's GenerateNextToken performs special handling to recompute the position IDs and KV cache in order to switch from the short factor to the long factor, which introduces a time-consuming profiling step. Per our experiments, this is hit only once, at sequence length 4097, no matter how much longer the sequence is.
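The one-time nature of the switch follows from how Phi-3's LongRoPE scaling is defined: the short factor applies up to the model's original context length, the long factor beyond it. The sketch below is a simplified model of that selection logic, not the GenAI implementation; the 4096 constant reflects the phi3 config's original_max_position_embeddings, and the function names are ours.

```python
# phi3 config value (original_max_position_embeddings); assumption for this sketch
ORIGINAL_MAX_POSITION_EMBEDDINGS = 4096

def rope_factor(seq_len, orig_max=ORIGINAL_MAX_POSITION_EMBEDDINGS):
    """Return which LongRoPE scaling factor applies at this sequence length."""
    return "short" if seq_len <= orig_max else "long"

def switch_points(max_len):
    """Sequence lengths at which the factor changes during token-by-token generation."""
    return [n for n in range(2, max_len + 1)
            if rope_factor(n) != rope_factor(n - 1)]
```

Because `rope_factor` changes value only when crossing 4096 -> 4097, `switch_points` contains a single entry regardless of `max_len`, matching the observation that the recompute (and its profiling cost) is paid exactly once.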
Please note that:
- The sequence length mentioned here comes from the GetSequenceLength function in GenAI, not the concept in ORT.
- When you set -l or --prompt_lengths on the benchmark/python/benchmark_e2e.py command line, the real sequence (token) length is not exactly the same as the specified prompt length; it is usually slightly smaller.
- When you pass --use_prompt_set, the prompt is read from the predefined strings in benchmark/python/prompts.json; otherwise GenAI's GenerateNextToken is used to generate the input prompts. Similarly, you cannot assume the prompt length you set on the command line is exactly the number of tokens it actually produces.
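The gap between the requested and actual prompt length comes from tokenizing a fixed prompt string and then trimming it, so the token count can only be at or below the requested value. A minimal sketch of that pattern, assuming a tokenizer object with an `encode()` method (the helper name is ours, and the real benchmark_e2e.py logic may differ in detail):

```python
def prepare_prompt_tokens(tokenizer, prompt_text, requested_length):
    """Truncate the tokenized prompt to at most `requested_length` tokens.

    The prompt text is tokenized first and then trimmed, so the actual
    token count is <= requested_length, and smaller whenever the prompt
    text does not tokenize to enough tokens.
    """
    token_ids = tokenizer.encode(prompt_text)
    return token_ids[:requested_length]
```

This is why `-l 4096` does not guarantee that the 4097 boundary is crossed at a specific generation step: the effective prompt may be a few tokens shorter than 4096.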
I also tried disabling the special handling from #1161 and rebuilding GenAI, then running benchmark_e2e.py with --print_model_output. I can observe chaotic output toward the end once the total sequence length (prompt + generation) exceeds 4097, as shown below (-l 4096 -g 1000 --use_prompt_set -mo). That being said, the fix in #1161 is still necessary for phi3/phi4 models for now, and we cannot simply remove it.
...
...
...
The Enchanted Crystal Palace loomed majestically as Elara approached, its.,, and. and, and and. A and, and. He and the, and,her. He, she.., with, and., and. and. and. and, for a,, and. as a, and, and. with. the and. She. For, on. on the.
. She. They. of. She. She. She. As' to, and, and. A, to.... The, and. However. As. She,,. to, to,. to is. To to.,2. and her.. The, and. She's, and. and. and and. She, the, and. In. the. Al.. She. under, and,, and. and. upon. to, and. and. to, to, and, to the.. and. For but, to.. to. As for. the. she. and. to, to, and, and, and. she and, and. as. She (,, for a. and. and, and, and, and. The and, and, and and. and in and and., and, and, to. and, and, to. and. to and. and and, with the, and,3. and,,, and, and, was, and, and,