
Notice added Recommended answer in Intel

occurred Mar 22, 2022 at 14:58

link details about when it's fixed (Cannon Lake), and the identical lzcnt/tzcnt false dep. And update the Why section.


edited Nov 27, 2019 at 9:56

Peter Cordes


appears to have a false dependency on the destination register dest. Even though the instruction only writes to it, the instruction will wait until dest is ready before executing. This false dependency is (now) documented by Intel as erratum HSD146 (Haswell) and SKL029 (Skylake).

Skylake fixed this for lzcnt and tzcnt.
Cannon Lake (and Ice Lake) fixed this for popcnt.
bsf/bsr have a true output dependency: the output is left unmodified for input=0. (But there is no way to take advantage of that with intrinsics: only AMD documents it, and compilers don't expose it.)

(Yes, these instructions all run on the same execution unit.)

We can speculate: it runs on the same execution unit as bsf / bsr, which do have an output dependency. (How is POPCNT implemented in hardware?) For those instructions, Intel documents the integer result for input=0 as "undefined" (with ZF=1), but Intel hardware actually gives a stronger guarantee to avoid breaking old software: output unmodified. AMD documents this behaviour.

Presumably it was somehow inconvenient to make some uops for this execution unit dependent on the output but others not.


We can only speculate, but it's likely that Intel has the same handling for a lot of two-operand instructions. Common instructions like add, sub take two operands both of which are inputs. So Intel probably shoved popcnt into the same category to keep the processor design simple.


added 408 characters in body


edited Jan 18, 2017 at 17:37

Cody Gray ♦


popcnt src, dest

13 GB/s has a chain: popcnt-add-popcnt-popcnt → next iteration
15 GB/s has a chain: popcnt-add-popcnt-add → next iteration
20 GB/s has a chain: popcnt-popcnt → next iteration
26 GB/s has a chain: popcnt-popcnt → next iteration

It seems that neither GCC nor Visual Studio is aware that popcnt has such a false dependency. Nevertheless, these false dependencies aren't uncommon. It's just a matter of whether the compiler is aware of it.

(Update: As of version 4.9.2, GCC is aware of this false dependency and generates code to compensate for it when optimizations are enabled. Major compilers from other vendors, including Clang, MSVC, and even Intel's own ICC, are not yet aware of this microarchitectural erratum and will not emit code that compensates for it.)


Mod Moved Comments To Chat

occurred Dec 14, 2016 at 22:07

Bounty Awarded with 50 reputation awarded by Natan Streppel

occurred Aug 6, 2014 at 18:22

deleted 44 characters in body


edited Aug 4, 2014 at 9:26

Peter Mortensen


.L4:
 movq (%rbx,%rax,8), %r8
 movq 8(%rbx,%rax,8), %r9
 movq 16(%rbx,%rax,8), %r10
 movq 24(%rbx,%rax,8), %r11
 addq $4, %rax
 popcnt %r8, %r8
 add %r8, %rdx
 popcnt %r9, %r9
 add %r9, %rcx
 popcnt %r10, %r10
 add %r10, %rdi
 popcnt %r11, %r11
 add %r11, %rsi
 cmpq $131072, %rax
 jne .L4
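
For orientation, the loop above has roughly the following shape in C. This is a sketch rather than the benchmark's actual source, which is not shown here: the function name `popcount_sum4` is hypothetical, and `__builtin_popcountll` is the GCC/Clang builtin, which is assumed to compile to `popcnt` when the target supports it (for example with `-mpopcnt`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum the set bits of n 64-bit words, four words per
 * iteration, each into its own accumulator (like the four add targets in
 * the assembly loop). */
uint64_t popcount_sum4(const uint64_t *buf, size_t n)
{
    assert(n % 4 == 0); /* the unrolled loop consumes four words at a time */

    uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        c0 += (uint64_t)__builtin_popcountll(buf[i + 0]);
        c1 += (uint64_t)__builtin_popcountll(buf[i + 1]);
        c2 += (uint64_t)__builtin_popcountll(buf[i + 2]);
        c3 += (uint64_t)__builtin_popcountll(buf[i + 3]);
    }
    return c0 + c1 + c2 + c3;
}
```

Calling it with `n = 131072` corresponds to the loop bound in the assembly. Which register the compiler picks as each `popcnt` destination is not something the C source controls, which is why the variants here are written directly in assembly.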

.L9:
 movq (%rbx,%rdx,8), %r9
 movq 8(%rbx,%rdx,8), %r10
 movq 16(%rbx,%rdx,8), %r11
 movq 24(%rbx,%rdx,8), %rbp
 addq $4, %rdx
 # This time reuse "rax" for all the popcnts.
 popcnt %r9, %rax
 add %rax, %rcx
 popcnt %r10, %rax
 add %rax, %rsi
 popcnt %r11, %rax
 add %rax, %r8
 popcnt %rbp, %rax
 add %rax, %rdi
 cmpq $131072, %rdx
 jne .L9

.L14:
 movq (%rbx,%rdx,8), %r9
 movq 8(%rbx,%rdx,8), %r10
 movq 16(%rbx,%rdx,8), %r11
 movq 24(%rbx,%rdx,8), %rbp
 addq $4, %rdx
 # Reuse "rax" for all the popcnts.
 xor %rax, %rax # Break the cross-iteration dependency by zeroing "rax".
 popcnt %r9, %rax
 add %rax, %rcx
 popcnt %r10, %rax
 add %rax, %rsi
 popcnt %r11, %rax
 add %rax, %r8
 popcnt %rbp, %rax
 add %rax, %rdi
 cmpq $131072, %rdx
 jne .L14
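
The same dependency-breaking idea can be sketched from C with GNU-style inline assembly. This is an illustration under assumptions, not the code measured above: the helper name `popcnt_nodep` is hypothetical, it zeroes the destination before every `popcnt` rather than once per iteration, and on anything other than x86-64 with a GNU-compatible compiler it simply falls back to `__builtin_popcountll`.

```c
#include <stdint.h>

/* Hypothetical helper: popcount with the destination register zeroed first,
 * so the popcnt does not have to wait for whatever last wrote that register. */
static inline uint64_t popcnt_nodep(uint64_t x)
{
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t r;
    /* "=&r" (early clobber) keeps r in a different register from x, because
     * the xor overwrites r before popcnt reads x. %k0 names the 32-bit half
     * of r; writing it also clears the upper 32 bits. */
    __asm__("xorl %k0, %k0\n\t"
            "popcntq %1, %0"
            : "=&r"(r)
            : "r"(x)
            : "cc");
    return r;
#else
    /* Assumption: other targets do not need the workaround. */
    return (uint64_t)__builtin_popcountll(x);
#endif
}
```

Zeroing before each `popcnt` costs an extra instruction per count; the loop above instead pays for one `xor` per iteration and accepts a short dependency chain within the iteration.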


