I'm looking for a low-overhead method for my program to stall a few cycles on an Intel CPU, without causing memory accesses or side effects that could alter the CPU components' data (e.g. no usleep()).
What would be the best-fit instruction that has a consistent execution cycle-time and predictable behavior, so that I could use it once or numerous times, depending on how many cycles I'd like my program to stall (e.g. 5, 10, or 1000)? I can't trust nop as I've read it does not guarantee 1 cycle execution time and could be optimized away (0 cycles) throughout the pipeline's execution.
_mm_pause will stall (the front-end?) for 5 or 100 cycles depending on CPU model (before vs. after Skylake on Intel), or for a BIOS-configurable amount on Zen. _mm_lfence() will block the front-end until the back-end drains. Spinning on rdtsc can be viable if you want to wait for more than like 40 core clock cycles (How to calculate time for an asm delay loop on x86 linux?).
pause would always take ≈100 cycles?
pause probably doesn't make things any slower (except maybe by delaying independent work that will also stall and could have been running in parallel, e.g. another load from a separate address). Or not if it's not in the shadow of a stall that would happen anyway. I can't think of a mechanism that would make it slower by more than 100 core cycles.
pause different than rep; nop? I recently came across this combination of instructions in a code base.
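
For the "spinning on rdtsc" suggestion, here is a minimal sketch of what such a delay loop might look like in C with GCC or Clang on x86-64. The helper name spin_tsc_ticks is made up for illustration and is not from the thread. Keep in mind that on modern Intel CPUs the TSC counts reference ticks at a fixed rate, not core clock cycles, so the count only approximates core cycles when the core happens to run near the TSC frequency.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC/Clang, x86 only) */

/* Hypothetical helper: spin until at least `ticks` TSC reference ticks
 * have elapsed, touching no memory inside the loop.
 * Returns the number of ticks actually observed, which is always
 * >= `ticks` and usually somewhat more, since each rdtsc costs
 * a few tens of cycles by itself. */
static inline uint64_t spin_tsc_ticks(uint64_t ticks)
{
    _mm_lfence();                        /* let earlier instructions finish before timing starts */
    const uint64_t start = __rdtsc();    /* read the time-stamp counter */
    uint64_t elapsed;
    do {
        elapsed = __rdtsc() - start;     /* unsigned subtraction, safe across wraparound */
    } while (elapsed < ticks);
    return elapsed;
}
```

Calling spin_tsc_ticks(1000) would stall for roughly 1000 reference ticks. For very short delays (a handful of cycles) this is too coarse, because a single rdtsc already takes longer than that.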