In the implementation code, the transform part uses a self-attention structure, whose output sequence has the same length as its input sequence. As a result, the number of history steps must equal the number of prediction steps. So if I want to use 12 history steps to predict only one step, something may be wrong.
However, since I have not read the original paper, I am not sure whether this detail matches the method described there.
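To illustrate the shape coupling I mean, here is a minimal PyTorch sketch (not the repo's actual code; all names and dimensions are my own assumptions). Self-attention maps a sequence of length T to a sequence of the same length T, so without an extra head, 12 history steps produce 12 output steps:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, just for illustration.
batch, hist_steps, d_model = 4, 12, 64
x = torch.randn(batch, hist_steps, d_model)  # 12 history steps

# Self-attention with Q = K = V = x: output length equals input length.
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
out, _ = attn(x, x, x)
print(out.shape)  # torch.Size([4, 12, 64]) -- still 12 steps, not 1

# One common workaround (my suggestion, not necessarily what the paper does):
# keep only the last time step's representation and project it, so the
# 12 history steps yield a single predicted value per sample.
head = nn.Linear(d_model, 1)
one_step_pred = head(out[:, -1, :])
print(one_step_pred.shape)  # torch.Size([4, 1])
```

Whether the original method intends such a projection head, or genuinely requires equal history and prediction lengths, is exactly the question I am raising.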