There is a sentence in your paper:
We train our models using stochastic gradient descent (SGD) with 0.9 Nesterov momentum and 10^-4 weight decay.
However, at line 77 of train_imagenet.py, torch.optim.SGD() is called without nesterov=True, and that argument defaults to False. So is Nesterov momentum actually used when training the ImageNet models?
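
To show why the flag matters, here is a minimal pure-Python sketch of the update rule as I understand it from the torch.optim.SGD documentation (with dampening = 0). The names sgd_step and run, the toy objective, and the learning rate of 0.1 are my own assumptions for illustration and are not taken from train_imagenet.py; only the 0.9 momentum and 10^-4 weight decay come from the paper.

```python
def sgd_step(param, grad, buf, lr=0.1, momentum=0.9,
             weight_decay=1e-4, nesterov=False):
    """One scalar SGD step (hypothetical helper, dampening assumed to be 0)."""
    # L2 weight decay is folded into the gradient.
    g = grad + weight_decay * param
    # The momentum buffer starts as the first gradient, then accumulates.
    buf = g if buf is None else momentum * buf + g
    # Nesterov adds a look-ahead term; plain momentum uses the buffer directly.
    step = g + momentum * buf if nesterov else buf
    return param - lr * step, buf


def run(nesterov, steps=3):
    """Minimise the toy objective f(w) = 0.5 * w**2, whose gradient is w."""
    w, buf = 1.0, None
    for _ in range(steps):
        w, buf = sgd_step(w, grad=w, buf=buf, nesterov=nesterov)
    return w


w_plain = run(nesterov=False)
w_nesterov = run(nesterov=True)
print("plain momentum:", w_plain)
print("Nesterov momentum:", w_nesterov)
```

The two runs diverge from the very first step, so if nesterov=True is not passed, the training would presumably be using plain momentum rather than the Nesterov variant described in the paper.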