Linked Questions
1029
votes
66
answers
675k
views
Count the number of set bits in a 32-bit integer
8 bits representing the number 7 look like this:
00000111
Three bits are set.
What are the algorithms to determine the number of set bits in a 32-bit integer?
959
votes
11
answers
188k
views
Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?
I wrote these two solutions for Project Euler Q14, in assembly and in C++. They implement identical brute force approach for testing the Collatz conjecture. The assembly solution was assembled with:
...
177
votes
36
answers
238k
views
What is the fastest/most efficient way to find the position of the highest set bit (msb) in an integer in C?
If I have some integer n, and I want to know the position of the most significant bit (that is, if the least significant bit is on the right, I want to know the position of the farthest left bit that ...
146
votes
23
answers
117k
views
Position of least significant bit that is set
I am looking for an efficient way to determine the position of the least significant bit that is set in an integer, e.g. for 0x0FF0 it would be 4.
A trivial implementation is this:
unsigned ...
350
votes
4
answers
51k
views
Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs
I've been racking my brain for a week trying to complete this assignment and I'm hoping someone here can lead me toward the right path. Let me start with the instructor's instructions:
Your ...
160
votes
6
answers
117k
views
What is the purpose of XORing a register with itself? [duplicate]
xor eax, eax will always set eax to zero, right? So, why does MSVC++ sometimes put it in my executable's code? Is it more efficient that mov eax, 0?
012B1002 in al,dx
012B1003 push ...
37
votes
5
answers
10k
views
What is the efficient way to count set bits at a position or lower?
Given std::bitset<64> bits with any number of bits set and a bit position X (0-63)
What is the most efficient way to count bits at position X or lower or return 0 if the bit at X is not set
...
34
votes
4
answers
4k
views
Is there a faster algorithm for max(ctz(x), ctz(y))?
For min(ctz(x), ctz(y)), we can use ctz(x | y) to gain better performance. But what about max(ctz(x), ctz(y))?
ctz represents "count trailing zeros".
C++ version (Compiler Explorer)
#include ...
42
votes
4
answers
9k
views
why is c++ std::max_element so slow?
I need to find the max element in the vector so I'm using std::max_element, but I've found that it's a very slow function, so I wrote my own version and manage to get x3 better performance, here is ...
67
votes
1
answer
8k
views
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)
I'm a newbie at instruction optimization.
I did a simple analysis on a simple function dotp which is used to get the dot product of two float arrays.
The C code is as follows:
float dotp( ...
42
votes
2
answers
5k
views
Why does breaking the "output dependency" of LZCNT matter?
While benchmarking something I measured a much lower throughput than I had calculated, which I narrowed down to the LZCNT instruction (it also happens with TZCNT), as demonstrated in the following ...
24
votes
2
answers
5k
views
AVX 256-bit code performing slightly worse than equivalent 128-bit SSSE3 code
I am trying to write very efficient Hamming-distance code. Inspired by Wojciech Muła's extremely clever SSE3 popcount implementation, I coded an AVX2 equivalent solution, this time using 256 bit ...
23
votes
2
answers
10k
views
How is POPCNT implemented in hardware?
According to http://www.agner.org/optimize/instruction_tables.pdf, the POPCNT instruction (which returns the number of set bits in a 32-bit or 64-bit register) has a throughput of 1 instruction per ...
13
votes
6
answers
3k
views
How do I sum the four 2-bit bitfields in a single 8-bit byte?
I have four 2-bit bitfields stored in a single byte. Each bitfield can thus represent 0, 1, 2, or 3. For example, here are the 4 possible values where the first 3 bitfields are zero:
00 00 00 00 = 0 ...
12
votes
5
answers
5k
views
Why is uint_least16_t faster than uint_fast16_t for multiplication in x86_64?
The C standard is quite unclear about the uint_fast*_t family of types. On a gcc-4.4.4 linux x86_64 system, the types uint_fast16_t and uint_fast32_t are both 8 bytes in size. However, multiplication ...