During training, the encoder only sees up to 30% of the tokens, but during inference it needs to handle anywhere from 0% to 100% of the tokens. Why doesn't this train/inference inconsistency cause problems? (In LLMs, out-of-distribution (OOD) sequence lengths often lead to inference failures.)
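
To make the mismatch concrete, here is a minimal sketch of the kind of setup I have in mind (a plain token-masking scheme in PyTorch; `ToyEncoder`, the token count, and the way the visible ratio is sampled are all illustrative assumptions, not the actual implementation). The encoder just runs attention over whichever tokens are visible, so the only train/inference difference is the distribution of the visible-token count:

```python
import torch
import torch.nn as nn

# Hypothetical toy encoder: attention over only the visible tokens.
class ToyEncoder(nn.Module):
    def __init__(self, dim=64, num_tokens=256):
        super().__init__()
        self.pos_emb = nn.Parameter(torch.zeros(num_tokens, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens, visible_idx):
        # Gather the visible tokens plus their position embeddings;
        # the attention layers are agnostic to how many tokens they receive.
        x = tokens[:, visible_idx] + self.pos_emb[visible_idx]
        return self.encoder(x)

tokens = torch.randn(2, 256, 64)
enc = ToyEncoder()

# Training: visible fraction sampled in (0, 0.3], so the encoder never sees more than 30%.
train_ratio = torch.empty(1).uniform_(0.01, 0.3).item()
train_idx = torch.randperm(256)[: int(256 * train_ratio)]
print(enc(tokens, train_idx).shape)

# Inference: visible fraction can be anywhere in [0, 1], e.g. 80% here.
infer_idx = torch.randperm(256)[: int(256 * 0.8)]
print(enc(tokens, infer_idx).shape)
```

In this sketch the architecture accepts any number of visible tokens, so the question is purely about whether the learned weights generalize from the <=30% regime seen in training to the larger visible fractions seen at inference.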