 In 228, why do you use only first token for classification?