1. Did you guys use sequence packing in your training?
2. If so, did you use special attention masks to "isolate each sample"? In other words, do tokens from one sample attend to tokens from another sample? (A sketch of the kind of mask I mean is below.)
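
For clarity, here is a minimal sketch (my own illustrative code, not your implementation) of the block-diagonal causal mask I have in mind when I say "isolate each sample" in a packed sequence:

```python
import torch

def block_diagonal_causal_mask(sample_lengths: list[int]) -> torch.Tensor:
    """Boolean attention mask for one packed sequence: True where
    attention is allowed. Each token attends causally within its own
    sample only, never across a sample boundary."""
    total = sum(sample_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for length in sample_lengths:
        end = start + length
        # Causal (lower-triangular) attention inside this sample's block;
        # everything outside the block stays False (masked out).
        mask[start:end, start:end] = torch.tril(
            torch.ones(length, length, dtype=torch.bool)
        )
        start = end
    return mask

# e.g. three samples of lengths 3, 2, and 4 packed into one 9-token sequence
print(block_diagonal_causal_mask([3, 2, 4]).int())
```

Versus the alternative of a single causal mask over the whole packed sequence, where later samples can attend to earlier ones.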