Notice added Recommended answer in Intel
link details about when it's fixed (Cannon Lake), and the identical lzcnt/tzcnt false dep. And update the Why section.
Peter Cordes

appears to have a false dependency on the destination register dest. Even though the instruction only writes to it, the instruction will wait until dest is ready before executing.

We can only speculate, but it's likely that Intel has the same handling for a lot of two-operand instructions. Common instructions like add, sub take two operands both of which are inputs. So Intel probably shoved popcnt into the same category to keep the processor design simple.

appears to have a false dependency on the destination register dest. Even though the instruction only writes to it, the instruction will wait until dest is ready before executing. This false dependency is (now) documented by Intel as erratum HSD146 (Haswell) and SKL029 (Skylake).

Skylake fixed this for lzcnt and tzcnt.
Cannon Lake (and Ice Lake) fixed this for popcnt.
bsf/bsr have a true output dependency: the output is left unmodified for input=0. (But there is no way to take advantage of that with intrinsics: only AMD documents it, and compilers don't expose it.)

(Yes, these instructions all run on the same execution unit.)

We can speculate: popcnt runs on the same execution unit as bsf / bsr, which do have an output dependency (see "How is POPCNT implemented in hardware?"). For those instructions, Intel documents the integer result for input=0 as "undefined" (with ZF=1), but Intel hardware actually gives a stronger guarantee to avoid breaking old software: output unmodified. AMD documents this behaviour.

Presumably it was somehow inconvenient to make some uops for this execution unit dependent on the output but others not.
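
As a minimal sketch of the workaround this implies, the helper below zeroes the destination register before each popcnt, so the instruction never has to wait on a stale value in that register. The name popcnt_dep_broken is hypothetical and not from the original answer. The inline-asm path assumes GCC or Clang on an x86-64 target built with popcnt enabled (for example with -mpopcnt); anything else falls back to the __builtin_popcountll builtin. On CPUs where the false dependency has been fixed, the extra xor is simply unnecessary.

```cpp
#include <cstdint>

// Hypothetical helper (an illustration, not code from the original answer).
inline std::uint64_t popcnt_dep_broken(std::uint64_t x) {
#if defined(__x86_64__) && defined(__POPCNT__)
    std::uint64_t result;
    // Zero the destination first: xor-zeroing the 32-bit half clears the whole
    // 64-bit register and carries no dependency on its old value.
    // "=&r" (early clobber) keeps the output in a different register from the
    // input, because the xor writes the output before popcnt reads the input.
    __asm__("xor %k0, %k0\n\t"
            "popcnt %1, %0"
            : "=&r"(result)
            : "r"(x)
            : "cc");
    return result;
#else
    // Portable fallback (assumes a GCC/Clang-style builtin is available).
    return static_cast<std::uint64_t>(__builtin_popcountll(x));
#endif
}
```

Calling popcnt_dep_broken(0xFF) returns 8 on either path; only the generated instructions differ.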

added 408 characters in body
Cody Gray
popcnt src, dest
  • 13 GB/s has a chain: popcnt-add-popcnt-popcnt --> next iteration
  • 15 GB/s has a chain: popcnt-add-popcnt-add --> next iteration
  • 20 GB/s has a chain: popcnt-popcnt --> next iteration
  • 26 GB/s has a chain: popcnt-popcnt --> next iteration

It seems that neither GCC, nor Visual Studio are aware that popcnt has such a false dependency. Nevertheless, these false dependencies aren't uncommon. It's just a matter of whether the compiler is aware of it.

popcnt src, dest
  • 13 GB/s has a chain: popcnt-add-popcnt-popcnt → next iteration
  • 15 GB/s has a chain: popcnt-add-popcnt-add → next iteration
  • 20 GB/s has a chain: popcnt-popcnt → next iteration
  • 26 GB/s has a chain: popcnt-popcnt → next iteration
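
To make the link between chain length and speed concrete, here is a rough model that sums per-instruction latencies along each loop-carried chain listed above. Everything in it is an assumption for illustration: the function name chain_cycles is hypothetical, and the default latencies (3 cycles for popcnt, 1 cycle for add) are commonly published figures for Intel cores of that era, not numbers taken from the answer.

```cpp
#include <cstddef>
#include <string>

// Hypothetical model (not from the original answer): sums assumed latencies
// along a dependency chain written as "popcnt-add-popcnt".
int chain_cycles(const std::string& chain,
                 int popcnt_latency = 3,   // assumed latency in cycles
                 int add_latency = 1) {    // assumed latency in cycles
    int cycles = 0;
    std::size_t start = 0;
    while (start <= chain.size()) {
        std::size_t end = chain.find('-', start);
        if (end == std::string::npos) end = chain.size();
        const std::string op = chain.substr(start, end - start);
        if (op == "popcnt") cycles += popcnt_latency;
        else if (op == "add") cycles += add_latency;
        // Unknown tokens are ignored in this sketch.
        start = end + 1;
    }
    return cycles;
}
```

Under these assumed latencies the three distinct chains come to 10, 8, and 6 cycles per iteration, which predicts speeds in the ratio 1 : 1.25 : 1.67. That roughly tracks the measured 13 : 15 : 20 GB/s. The model does not explain the gap between the 20 GB/s and 26 GB/s cases, which share the same chain.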

It seems that neither GCC nor Visual Studio is aware that popcnt has such a false dependency. Nevertheless, these false dependencies aren't uncommon. It's just a matter of whether the compiler is aware of it.

(Update: As of version 4.9.2, GCC is aware of this false dependency and generates code to compensate for it when optimizations are enabled. Major compilers from other vendors, including Clang, MSVC, and even Intel's own ICC, are not yet aware of this microarchitectural erratum and will not emit code that compensates for it.)

Mod Moved Comments To Chat
Bounty Awarded with 50 reputation awarded by Natan Streppel
deleted 44 characters in body
Peter Mortensen
.L4:
 movq (%rbx,%rax,8), %r8
 movq 8(%rbx,%rax,8), %r9
 movq 16(%rbx,%rax,8), %r10
 movq 24(%rbx,%rax,8), %r11
 addq $4, %rax
 popcnt %r8, %r8
 add %r8, %rdx
 popcnt %r9, %r9
 add %r9, %rcx
 popcnt %r10, %r10
 add %r10, %rdi
 popcnt %r11, %r11
 add %r11, %rsi
 cmpq $131072, %rax
 jne .L4
.L9:
 movq (%rbx,%rdx,8), %r9
 movq 8(%rbx,%rdx,8), %r10
 movq 16(%rbx,%rdx,8), %r11
 movq 24(%rbx,%rdx,8), %rbp
 addq $4, %rdx
 # This time reuse "rax" for all the popcnts.
 popcnt %r9, %rax
 add %rax, %rcx
 popcnt %r10, %rax
 add %rax, %rsi
 popcnt %r11, %rax
 add %rax, %r8
 popcnt %rbp, %rax
 add %rax, %rdi
 cmpq $131072, %rdx
 jne .L9
.L14:
 movq (%rbx,%rdx,8), %r9
 movq 8(%rbx,%rdx,8), %r10
 movq 16(%rbx,%rdx,8), %r11
 movq 24(%rbx,%rdx,8), %rbp
 addq $4, %rdx
 # Reuse "rax" for all the popcnts.
 xor %rax, %rax # Break the cross-iteration dependency by zeroing "rax".
 popcnt %r9, %rax
 add %rax, %rcx
 popcnt %r10, %rax
 add %rax, %rsi
 popcnt %r11, %rax
 add %rax, %r8
 popcnt %rbp, %rax
 add %rax, %rdi
 cmpq $131072, %rdx
 jne .L14
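
For orientation, the loops above have roughly the following shape in C++: four 64-bit words are loaded per iteration, each is popcounted, and each count is added to its own accumulator. This is a sketch under assumptions, not the code that produced the listings. The name count_bits_unrolled is hypothetical, and __builtin_popcountll is a GCC/Clang builtin. Which registers the compiler picks for the popcnt destinations, and so whether the false dependency shows up, is up to the compiler.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the original answer): counts the set bits in
// buf[0..n) with four independent accumulators, mirroring the four
// popcnt/add pairs in each assembly loop.
std::uint64_t count_bits_unrolled(const std::uint64_t* buf, std::size_t n) {
    std::uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    std::size_t i = 0;
    // Main loop: four words per iteration, one accumulator per word.
    for (; i + 4 <= n; i += 4) {
        c0 += static_cast<std::uint64_t>(__builtin_popcountll(buf[i]));
        c1 += static_cast<std::uint64_t>(__builtin_popcountll(buf[i + 1]));
        c2 += static_cast<std::uint64_t>(__builtin_popcountll(buf[i + 2]));
        c3 += static_cast<std::uint64_t>(__builtin_popcountll(buf[i + 3]));
    }
    // Tail: handle any leftover words when n is not a multiple of four.
    for (; i < n; ++i) {
        c0 += static_cast<std::uint64_t>(__builtin_popcountll(buf[i]));
    }
    return c0 + c1 + c2 + c3;
}
```

The separate accumulators keep the adds independent of each other, so any remaining serialization in the compiled loop comes from how the popcnt destination registers are assigned.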

We can only speculate, but it's likely that Intel has the same handling for a lot of two-operand instructions. Common instructions like add, sub take two operands both of which are inputs. So Intel probably shoved popcnt into the same category to keep the processor design simple.

added 805 characters in body
Mysticial
added 39 characters in body
Mysticial
deleted 133 characters in body
Mysticial
Mysticial