
Questions about prefilling #2

@lsm2842035890

Description

Prefilling. Let L_prompt denote the length of the prompt, which includes the question, image, and system prompts. Let L_answer denote the length of the response. With vision tokens, the prompt can become unignorably long. The use of a full attention mask in diffusion decoding results in a quadratic complexity O((L_prompt + L_answer)^2) per decoding step. To alleviate this cost, we re-implement the prefilling strategy from autoregressive models, which saves the key-value pairs of the prompt tokens after the first generation step and reuses them in the following steps, reducing the complexity to O(L_answer^2). But, due to the use of a full attention mask in DMLLM, the prefilling technique is not strictly lossless.
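
To make sure I am reading this description correctly, here is a minimal toy sketch of what I understand prefilling to do in a single attention layer. This is my own code, not from this repo; all names (`W_q`, `prompt_h`, `answer_h`) and sizes are made up for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, prompt_len, answer_len = 64, 300, 32  # arbitrary toy sizes

W_q = torch.randn(d_model, d_model) / d_model ** 0.5
W_k = torch.randn(d_model, d_model) / d_model ** 0.5
W_v = torch.randn(d_model, d_model) / d_model ** 0.5

prompt_h = torch.randn(prompt_len, d_model)   # hidden states of prompt tokens
answer_h = torch.randn(answer_len, d_model)   # hidden states of (partially masked) answer tokens

# --- Prefill: done once, after the first generation step ---
prompt_k = prompt_h @ W_k                     # cached prompt keys
prompt_v = prompt_h @ W_v                     # cached prompt values

# --- One subsequent diffusion decoding step ---
# Only the answer positions are (re)projected; queries exist only for answer tokens.
q = answer_h @ W_q
k = torch.cat([prompt_k, answer_h @ W_k], dim=0)
v = torch.cat([prompt_v, answer_h @ W_v], dim=0)

attn = F.softmax(q @ k.T / d_model ** 0.5, dim=-1)  # full (bidirectional) mask over prompt + answer
out = attn @ v                                       # shape: (answer_len, d_model)
print(out.shape)
```

In this sketch only the answer positions get queries, and the prompt projections are computed once and reused at every later step.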

I'd like to check whether my understanding is correct. In an autoregressive model, the causal mask means a token cannot see tokens that have not been generated yet; the attention over the prompt is computed in full once, and in the subsequent decode steps the prompt's KV cache is reused, with attention computed together with the Q/K/V of the newly arrived tokens.

In your paper, since diffusion generation is bidirectional, prefilling likewise computes the full attention over the prompt exactly once and reuses it across all later generation steps; each step only needs to compute the Q/K/V for the answer tokens that are not masked, and attention is then evaluated at every masked position. (In other words, each step only computes Q/K/V over the answer tokens, which is why the complexity is O(L_answer^2).) And you say it is not strictly lossless because, when the prompt's Q/K/V were computed, the future (answer) tokens were unknown, so the cached prompt K/V reused during generation do not incorporate those future tokens, which makes the reuse inexact. Is that the reason it is not strictly lossless?
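
To illustrate why I suspect it is not lossless, here is a second toy sketch, again my own assumption-laden code (the helper `full_attn` and the names `W1`, `W2` are made up). With a bidirectional mask, the prompt's hidden states after the first layer already depend on the answer tokens, so prompt keys/values cached at the first step for deeper layers no longer match a full recomputation once any answer token changes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, Lp, La = 16, 8, 4  # arbitrary toy sizes

def full_attn(h, W):
    """One full-attention layer with a single shared projection (toy, untrained)."""
    q = k = v = h @ W
    return F.softmax(q @ k.T / d ** 0.5, dim=-1) @ v

W1, W2 = torch.randn(d, d) / d ** 0.5, torch.randn(d, d) / d ** 0.5
prompt = torch.randn(Lp, d)
answer_step1 = torch.randn(La, d)      # answer states at the first step (mostly [MASK])
answer_step2 = answer_step1.clone()
answer_step2[0] = torch.randn(d)       # one answer position changed at a later step

# Layer-1 prompt states if everything is recomputed at the later step (exact):
exact_prompt_l1 = full_attn(torch.cat([prompt, answer_step2]), W1)[:Lp]

# Layer-1 prompt states cached at the first step and reused (prefilling):
cached_prompt_l1 = full_attn(torch.cat([prompt, answer_step1]), W1)[:Lp]

# The layer-2 keys/values derived from the two versions differ:
print((exact_prompt_l1 @ W2 - cached_prompt_l1 @ W2).norm())  # nonzero
```

If this is roughly what you mean, then my understanding is that the approximation error comes from layers above the first, whose cached prompt K/V were computed against stale answer states.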
