Linked Questions

31 questions linked to/from Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with _mm_popcnt_u64 on Intel CPUs

1029 votes

66 answers

675k views

Count the number of set bits in a 32-bit integer

8 bits representing the number 7 look like this: 00000111 Three bits are set. What are the algorithms to determine the number of set bits in a 32-bit integer?

Community wiki

15 revs, 9 users 40%
Matt Howells
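
As a rough illustration of the question above (it is not taken from any of the linked answers), one well-known approach clears the lowest set bit repeatedly and counts the iterations. The function name `count_set_bits` is hypothetical and chosen only for this sketch; hardware `POPCNT` or a compiler builtin would normally be faster where available.

```cpp
#include <cassert>   // used by the assertions in the usage note
#include <cstdint>

// Counts set bits by clearing the lowest set bit until none remain.
// The loop body runs once per set bit, so sparse values finish quickly.
// `count_set_bits` is a hypothetical name used for this sketch only.
inline unsigned count_set_bits(std::uint32_t v) {
    unsigned count = 0;
    while (v != 0) {
        v &= v - 1;  // v - 1 flips the lowest set bit and the zeros below it
        ++count;
    }
    return count;
}
```

For the example in the excerpt, `count_set_bits(7u)` returns 3, since 00000111 has three bits set.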

959 votes

11 answers

188k views

Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?

I wrote these two solutions for Project Euler Q14, in assembly and in C++. They implement identical brute force approach for testing the Collatz conjecture. The assembly solution was assembled with: ...

rosghub

9,324

asked Nov 1, 2016 at 6:12

177 votes

36 answers

238k views

What is the fastest/most efficient way to find the position of the highest set bit (msb) in an integer in C?

If I have some integer n, and I want to know the position of the most significant bit (that is, if the least significant bit is on the right, I want to know the position of the farthest left bit that ...

Zxaos

8,199

asked Mar 22, 2009 at 23:37

146 votes

23 answers

117k views

Position of least significant bit that is set

I am looking for an efficient way to determine the position of the least significant bit that is set in an integer, e.g. for 0x0FF0 it would be 4. A trivial implementation is this: unsigned ...

peterchen

41.4k

asked Apr 16, 2009 at 16:54

350 votes

4 answers

51k views

Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs

I've been racking my brain for a week trying to complete this assignment and I'm hoping someone here can lead me toward the right path. Let me start with the instructor's instructions: Your ...

Cowmoogun

2,577

asked May 21, 2016 at 9:29

160 votes

6 answers

117k views

What is the purpose of XORing a register with itself? [duplicate]

xor eax, eax will always set eax to zero, right? So, why does MSVC++ sometimes put it in my executable's code? Is it more efficient than mov eax, 0? 012B1002 in al,dx 012B1003 push ...

devoured elysium

106k

asked Sep 8, 2009 at 21:54

37 votes

5 answers

10k views

What is the efficient way to count set bits at a position or lower?

Given std::bitset<64> bits with any number of bits set and a bit position X (0-63) What is the most efficient way to count bits at position X or lower or return 0 if the bit at X is not set ...

Glenn Teitelbaum

10.4k

asked Dec 22, 2015 at 2:09

34 votes

4 answers

4k views

Is there a faster algorithm for max(ctz(x), ctz(y))?

For min(ctz(x), ctz(y)), we can use ctz(x | y) to gain better performance. But what about max(ctz(x), ctz(y))? ctz represents "count trailing zeros". C++ version (Compiler Explorer) #include ...

QuarticCat

1,566

asked Jun 1, 2023 at 11:05
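
The identity mentioned in the excerpt above, min(ctz(x), ctz(y)) = ctz(x | y), can be checked with a small sketch. The helpers `ctz32` and `min_ctz` are hypothetical names, and the convention that `ctz32(0)` returns 32 is an assumption made here for illustration; a real build would more likely use a compiler builtin or `std::countr_zero`.

```cpp
#include <cassert>   // used by the assertions in the usage note
#include <cstdint>

// Portable count-trailing-zeros. Returning 32 for x == 0 is an assumption
// made for this sketch; `ctz32` is a hypothetical helper name.
inline unsigned ctz32(std::uint32_t x) {
    if (x == 0) return 32;
    unsigned n = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        ++n;
    }
    return n;
}

// The lowest set bit of (x | y) is the lower of the two lowest set bits,
// so a single ctz on the OR yields the minimum of the two counts.
inline unsigned min_ctz(std::uint32_t x, std::uint32_t y) {
    return ctz32(x | y);
}
```

For instance, with x = 8 (ctz 3) and y = 48 (ctz 4), `min_ctz(8u, 48u)` returns 3. The max case asked about in the question has no equally direct single-operation form, which is what the question explores.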

42 votes

4 answers

9k views

why is c++ std::max_element so slow?

I need to find the max element in the vector so I'm using std::max_element, but I've found that it's a very slow function, so I wrote my own version and managed to get 3x better performance, here is ...

MoonBun

4,412

asked Sep 2, 2014 at 11:16

67 votes

1 answer

8k views

Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)

I'm a newbie at instruction optimization. I did a simple analysis on a simple function dotp which is used to get the dot product of two float arrays. The C code is as follows: float dotp( ...

Forward

asked Jul 15, 2017 at 1:14

42 votes

2 answers

5k views

Why does breaking the "output dependency" of LZCNT matter?

While benchmarking something I measured a much lower throughput than I had calculated, which I narrowed down to the LZCNT instruction (it also happens with TZCNT), as demonstrated in the following ...

user555045

65.8k

asked Jan 27, 2014 at 19:45

24 votes

2 answers

5k views

AVX 256-bit code performing slightly worse than equivalent 128-bit SSSE3 code

I am trying to write very efficient Hamming-distance code. Inspired by Wojciech Muła's extremely clever SSE3 popcount implementation, I coded an AVX2 equivalent solution, this time using 256 bit ...

BlueStrat

2,324

asked Jul 17, 2015 at 0:51

23 votes

2 answers

10k views

How is POPCNT implemented in hardware?

According to http://www.agner.org/optimize/instruction_tables.pdf, the POPCNT instruction (which returns the number of set bits in a 32-bit or 64-bit register) has a throughput of 1 instruction per ...

Siqi Lin

1,277

asked Mar 2, 2015 at 4:23

13 votes

6 answers

3k views

How do I sum the four 2-bit bitfields in a single 8-bit byte?

I have four 2-bit bitfields stored in a single byte. Each bitfield can thus represent 0, 1, 2, or 3. For example, here are the 4 possible values where the first 3 bitfields are zero: 00 00 00 00 = 0 ...

Nathan Kurz

1,729

asked Jul 26, 2013 at 11:29

12 votes

5 answers

5k views

Why is uint_least16_t faster than uint_fast16_t for multiplication in x86_64?

The C standard is quite unclear about the uint_fast*_t family of types. On a gcc-4.4.4 linux x86_64 system, the types uint_fast16_t and uint_fast32_t are both 8 bytes in size. However, multiplication ...

Luís Fernando Schultz Xavier

asked Nov 7, 2010 at 2:48
