The instruction popcnt src, dest appears to have a false dependency on the destination register dest. Even though the instruction only writes to it, the instruction will wait until dest is ready before executing. Intel now documents this false dependency as erratum HSD146 (Haswell) and SKL029 (Skylake).
Skylake fixed this for lzcnt and tzcnt.
Cannon Lake (and Ice Lake) fixed this for popcnt.
bsf/bsr have a true output dependency: the output is left unmodified for input=0. (There is no way to take advantage of that with intrinsics, though; only AMD documents it and compilers don't expose it.)
(Yes, these instructions all run on the same execution unit.)
Why does the dependency exist? We can speculate: popcnt runs on the same execution unit as bsf/bsr, which do have an output dependency (see "How is POPCNT implemented in hardware?"). For those instructions, Intel documents the integer result for input=0 as "undefined" (with ZF=1), but Intel hardware actually gives a stronger guarantee to avoid breaking old software: the output is left unmodified. AMD documents this behaviour.
Presumably it was somehow inconvenient to make some uops for this execution unit dependent on the output but others not.
With popcnt src, dest, each observed speed corresponds to a different dependency chain:
- 13 GB/s has a chain: popcnt-add-popcnt-popcnt --> next iteration
- 15 GB/s has a chain: popcnt-add-popcnt-add --> next iteration
- 20 GB/s has a chain: popcnt-popcnt --> next iteration
- 26 GB/s has a chain: popcnt-popcnt --> next iteration
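As a rough sanity check on the chains above, the following sketch adds up the latency of each loop-carried chain. It assumes a popcnt latency of 3 cycles and an add latency of 1 cycle (typical figures for these Intel cores, but not stated in this answer), and 32 bytes processed per iteration, matching the four 8-byte loads in the loops shown in this answer. The names chain_latency and bytes_per_cycle are made up for illustration.

```cpp
#include <cstddef>
#include <string>

// Assumed instruction latencies in cycles (not taken from the answer itself).
constexpr int kPopcntLatency = 3;
constexpr int kAddLatency = 1;
// Four 8-byte loads per loop iteration.
constexpr double kBytesPerIteration = 32.0;

// Sum the latency of a dependency chain written as "popcnt-add-popcnt".
int chain_latency(const std::string& chain) {
    int cycles = 0;
    std::size_t start = 0;
    while (start <= chain.size()) {
        std::size_t end = chain.find('-', start);
        if (end == std::string::npos) end = chain.size();
        const std::string op = chain.substr(start, end - start);
        if (op == "popcnt") cycles += kPopcntLatency;
        else if (op == "add") cycles += kAddLatency;
        start = end + 1;
    }
    return cycles;
}

// Upper bound on throughput if the chain is the only bottleneck.
// Only meaningful for a non-empty chain.
double bytes_per_cycle(const std::string& chain) {
    return kBytesPerIteration / chain_latency(chain);
}
```

Under these assumptions the three distinct chains cost 10, 8, and 6 cycles per iteration, which roughly tracks the ordering of the 13, 15, and 20 GB/s results. The model does not explain the gap between 20 and 26 GB/s, since both have the same chain.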
It seems that neither GCC nor Visual Studio is aware that popcnt has such a false dependency. Nevertheless, these false dependencies aren't uncommon. It's just a matter of whether the compiler is aware of them.
(Update: As of version 4.9.2, GCC is aware of this false dependency and generates code to compensate for it when optimizations are enabled. Major compilers from other vendors, including Clang, MSVC, and even Intel's own ICC, are not yet aware of this microarchitectural erratum and will not emit code that compensates for it.)
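For a compiler that does not compensate, one possible manual workaround is to zero the destination register right before each popcnt, since a xor-zeroing idiom does not depend on the old register value. The sketch below assumes GCC or Clang extended inline assembly on x86-64 and a CPU that supports popcnt; popcnt_no_false_dep is a hypothetical helper name, and the fallback branch exists only so the code compiles elsewhere.

```cpp
#include <cstdint>

// Hypothetical helper: popcount with the destination register zeroed first,
// so popcnt does not wait on whatever previously wrote that register.
inline std::uint64_t popcnt_no_false_dep(std::uint64_t x) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    std::uint64_t result;
    // "=&r" (early clobber) keeps result in a different register from x,
    // because the xor overwrites result before popcnt reads x.
    __asm__("xor %0, %0\n\t"
            "popcnt %1, %0"
            : "=&r"(result)
            : "r"(x)
            : "cc");
    return result;
#else
    // Portable fallback without the dependency trick:
    // clear the lowest set bit until nothing is left.
    std::uint64_t count = 0;
    while (x != 0) {
        x &= x - 1;
        ++count;
    }
    return count;
#endif
}
```

The extra xor costs one instruction per popcnt, so this is only worth considering on the affected microarchitectures.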
.L4:
    movq (%rbx,%rax,8), %r8
    movq 8(%rbx,%rax,8), %r9
    movq 16(%rbx,%rax,8), %r10
    movq 24(%rbx,%rax,8), %r11
    addq $4, %rax
    # Each popcnt writes to a different destination register.
    popcnt %r8, %r8
    add %r8, %rdx
    popcnt %r9, %r9
    add %r9, %rcx
    popcnt %r10, %r10
    add %r10, %rdi
    popcnt %r11, %r11
    add %r11, %rsi
    cmpq $131072, %rax
    jne .L4

.L9:
    movq (%rbx,%rdx,8), %r9
    movq 8(%rbx,%rdx,8), %r10
    movq 16(%rbx,%rdx,8), %r11
    movq 24(%rbx,%rdx,8), %rbp
    addq $4, %rdx
    # This time reuse "rax" for all the popcnts.
    popcnt %r9, %rax
    add %rax, %rcx
    popcnt %r10, %rax
    add %rax, %rsi
    popcnt %r11, %rax
    add %rax, %r8
    popcnt %rbp, %rax
    add %rax, %rdi
    cmpq $131072, %rdx
    jne .L9

.L14:
    movq (%rbx,%rdx,8), %r9
    movq 8(%rbx,%rdx,8), %r10
    movq 16(%rbx,%rdx,8), %r11
    movq 24(%rbx,%rdx,8), %rbp
    addq $4, %rdx
    # Reuse "rax" for all the popcnts.
    xor %rax, %rax # Break the cross-iteration dependency by zeroing "rax".
    popcnt %r9, %rax
    add %rax, %rcx
    popcnt %r10, %rax
    add %rax, %rsi
    popcnt %r11, %rax
    add %rax, %r8
    popcnt %rbp, %rax
    add %rax, %rdi
    cmpq $131072, %rdx
    jne .L14
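For orientation, the loops above have roughly the following shape at the source level: four 8-byte words are loaded per iteration and each bit count goes into its own accumulator, over 131072 words (a 1 MiB buffer). This is only a sketch; popcount_buffer and count_bits are illustrative names, and count_bits is a portable stand-in for whatever popcnt intrinsic a real benchmark would use.

```cpp
#include <cstddef>
#include <cstdint>

// Portable bit count so the sketch compiles anywhere; a real benchmark
// would use a popcnt intrinsic here instead.
inline std::uint64_t count_bits(std::uint64_t x) {
    std::uint64_t count = 0;
    while (x != 0) {
        x &= x - 1;  // clear the lowest set bit
        ++count;
    }
    return count;
}

// Total number of set bits in data[0..n), using four independent
// accumulators like the four add targets in the assembly loops.
std::uint64_t popcount_buffer(const std::uint64_t* data, std::size_t n) {
    std::uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        c0 += count_bits(data[i + 0]);
        c1 += count_bits(data[i + 1]);
        c2 += count_bits(data[i + 2]);
        c3 += count_bits(data[i + 3]);
    }
    // Handle any leftover words when n is not a multiple of 4.
    for (; i < n; ++i) {
        c0 += count_bits(data[i]);
    }
    return c0 + c1 + c2 + c3;
}
```

The accumulators are independent at the source level, so any serialization between the popcounts comes from register allocation, not from the algorithm itself.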
We can only speculate, but it's likely that Intel has the same handling for a lot of two-operand instructions. Common instructions like add and sub take two operands, both of which are inputs. So Intel probably shoved popcnt into the same category to keep the processor design simple.