Take a look at what I did here for delaycycles to get a nop template that doesn't simply unroll into a long run of nops. It basically sets up a loop in asm that spins the CPU for the number of cycles you want:
https://github.com/FastLED/FastLED/blob/master/fastled_delay.h
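
To give a rough idea of the shape without opening the header, here is a minimal sketch of the technique. It is not the code from fastled_delay.h: the names `delay_cycles`, `one_cycle`, `kCyclesPerLoop` and `g_cycles_spent` are made up for illustration, the 3-cycles-per-iteration cost is an assumed number, and a counter stands in for the real inline asm so it runs on a desktop compiler.

```cpp
#include <cstdint>

// Assumed cost model (hypothetical): one pass through the spin loop
// accounts for 3 cycles. A real target needs its own measured value.
constexpr int kCyclesPerLoop = 3;

// Stand-in for a single-cycle nop. On real hardware this would be inline
// asm; here it bumps a counter so the sketch is portable and checkable.
static volatile std::uint32_t g_cycles_spent = 0;
inline void one_cycle() { g_cycles_spent = g_cycles_spent + 1; }

// Burn CYCLES cycles. The loop count and the leftover are computed at
// compile time, so the emitted code is one small loop plus at most
// kCyclesPerLoop - 1 trailing nops, instead of CYCLES nops in a row.
template <int CYCLES>
inline void delay_cycles() {
    static_assert(CYCLES >= 0, "cycle count must be non-negative");
    constexpr int loops = CYCLES / kCyclesPerLoop;
    constexpr int rest  = CYCLES % kCyclesPerLoop;

    // Spin loop: each iteration stands for kCyclesPerLoop cycles.
    for (int i = 0; i < loops; ++i) {
        for (int k = 0; k < kCyclesPerLoop; ++k) one_cycle();
    }
    // Leftover cycles that do not fill a whole loop iteration.
    for (int k = 0; k < rest; ++k) one_cycle();
}
```

You'd call it as `delay_cycles<50>();`. Since the count is a template parameter, the split into loop iterations and leftover nops costs nothing at runtime, and code size stays about the same whether you ask for 10 cycles or 1000.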