Faster division and modulo operations - the power of two

There are some - admittedly rare - cases when the division and modulo operations are responsible for a significant share of the total runtime.

One such case is high-performance circular array indexing, where the pointers use the modulo operation (ptr = curr % ArraySize) to calculate the real position within the array.

The good news is that if the divider is known at compile time, the compiler will optimise the division/modulo for you. The bad news is that with a dynamically sized array, the compiler cannot help at all.
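
As a minimal illustration of the difference (in C#, matching the rest of the blog; the class and method names are made up for the example), the first method below divides by a compile-time constant and the second by a value that is only known at runtime:

// Sketch only: contrasting a compile-time constant divider with a runtime one.
static class DivisorExample
{
    private const int FixedSize = 1024;   // known at compile time

    // With a constant, power-of-two divider the compiler/JIT can replace
    // the modulo with cheap bit operations on its own.
    public static int IndexFixed(int curr) => curr % FixedSize;

    // arraySize is only known at runtime, so a real division instruction
    // has to be emitted here.
    public static int IndexDynamic(int curr, int arraySize) => curr % arraySize;
}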

To make matters worse, both the DIV and MOD operations are pretty slow on every CPU, or at least noticeably slower than other basic arithmetic instructions.

Speeding it all up

If the pointer calculation in the previous example is a concern, there is a way to speed it up: use powers of two! If the divider is a power of two, both the division and modulo operations become extremely simple:

The number X
in decimal: 143
in binary: 1000 1111

The divisor D
in decimal: 4
in binary: 0000 0100

The result (with integers):
X / D = 0010 0011 = 35
X % D = 0000 0011 = 3

However, because D is a power of two, these can be rewritten as:

X / D = X >> 2 = 0010 0011
 (dropping the last 2 bits; if the divider were 8, the last 3 bits, and so on)
X % D = X & 0011 = 0000 0011
 (keeping only the bits to the right of the divider's single set bit)

As both the shift (>>) and AND (&) operations are extremely fast, the division/modulo operations can be sped up considerably when the divider is a power of two and the code is rewritten accordingly (current CPUs will not do this on the fly for a runtime divider; you have to rewrite the code).
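
As a quick sanity check, here is the example above expressed in C# (a small sketch using the exact numbers from the example):

using System;

int x = 143;                      // 1000 1111
int d = 4;                        // 0000 0100, a power of two

Console.WriteLine(x / d);         // 35
Console.WriteLine(x >> 2);        // 35 - same as dividing by 4

Console.WriteLine(x % d);         // 3
Console.WriteLine(x & (d - 1));   // 3 - same as modulo 4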

Rewritten division

If the input can be anything, we need to determine whether the divider is a power of two:

if ((divider & (divider - 1)) == 0) { …

If a number is a power of two, it has exactly one bit set to 1. If we subtract 1 from this number, all the bits to the right of that original bit become 1 (and the bit itself becomes 0):

Number: 1000
Number-1: 0111

If we AND these two numbers and the result is zero, it was a power of two! (Note that zero also passes this test, so make sure the divider is positive.)

Be warned that even this check adds work, so do it only once, outside the loop, otherwise the performance gain will be lost!
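
Wrapped up as a small helper (a sketch; the method name IsPowerOfTwo is my own), including the guard against zero:

using System;

// Sketch: power-of-two test, guarding against zero and negative values,
// which would otherwise slip through the bit trick.
static bool IsPowerOfTwo(int divider)
{
    return divider > 0 && (divider & (divider - 1)) == 0;
}

Console.WriteLine(IsPowerOfTwo(8));    // True
Console.WriteLine(IsPowerOfTwo(12));   // False
Console.WriteLine(IsPowerOfTwo(0));    // False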

DIV

For division, we need to find the shift count:

int shift = 0;
int divtmp = divider;
// Halve the divider until it reaches 1; the number of halvings is the
// position of its single set bit, which is the shift count we need.
while (divtmp > 1) {
    divtmp = divtmp >> 1;
    shift++;
}

Be warned that finding the shift count also takes time, so do it only once, outside the loop, otherwise all the performance gain will be lost!


If the divider is 8, the "shift" variable will be 3 (shifting right by one bit is the same as dividing by 2).
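
On newer .NET runtimes the same value can be obtained without the loop; as far as I remember, System.Numerics.BitOperations (available since .NET Core 3.0) has a TrailingZeroCount helper, though do verify availability on your target framework:

using System;
using System.Numerics;

int divider = 8;   // example power of two

// For a power of two, the shift count equals the number of trailing zero
// bits: 8 = 0b1000 has three of them, so shift == 3.
int shift = BitOperations.TrailingZeroCount((uint)divider);
Console.WriteLine(shift);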

From here on, all we need to do is substitute the division with a shift:

// int x = i / divider;
int x = i >> shift;
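
Put into a loop, the check and the shift count are computed once up front and only the cheap operation stays on the hot path (a sketch; the sample data is made up):

using System;

int divider = 8;
bool isPow2 = divider > 0 && (divider & (divider - 1)) == 0;

// Compute the shift count once, outside the loop.
int shift = 0;
for (int divtmp = divider; divtmp > 1; divtmp >>= 1)
    shift++;

int[] data = { 17, 42, 143, 1000 };
foreach (int i in data)
{
    int x = isPow2 ? i >> shift : i / divider;   // fast path uses the shift
    Console.WriteLine(x);
}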

MOD

The modulo operation is even easier. After checking that the divider is a power of two, just create a simple AND mask:

int mask = divider - 1;

Then replace the MOD operation with an AND:

// int x = i % divider;
int x = i & mask;
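
Putting it together for the circular array example from the beginning (a sketch that assumes the buffer size is forced to be a power of two; the RingBuffer name and its members are made up for illustration):

using System;

// Circular buffer with a power-of-two size, so the index wrap-around
// is a single AND instead of a modulo.
sealed class RingBuffer
{
    private readonly int[] items;
    private readonly int mask;   // size - 1, computed once in the constructor
    private int curr;

    public RingBuffer(int size)
    {
        if (size <= 0 || (size & (size - 1)) != 0)
            throw new ArgumentException("size must be a positive power of two", nameof(size));
        items = new int[size];
        mask = size - 1;
    }

    public void Add(int value)
    {
        // Equivalent to items[curr % items.Length], but much cheaper.
        items[curr & mask] = value;
        curr++;
    }
}

Because the constructor rejects anything that is not a power of two, the hot path never has to branch on whether the mask trick is valid.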

Results

The results are actually quite impressive. I've tested on both an i7 and an Atom processor, and the power-of-two version was between 10 and 20 times faster than the plain division or modulo (Ref is the empty loop, to give a baseline for standard speed; Pow2DivMod ran when the divider was a power of two and DivMod ran in all other cases).
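
For anyone who wants to reproduce the measurement, below is a rough Stopwatch-based sketch of the kind of loop I mean (not the exact benchmark behind the numbers above; the divider is read at runtime so the JIT cannot constant-fold the modulo):

using System;
using System.Diagnostics;

int divider = args.Length > 0 ? int.Parse(args[0]) : 1024;   // power of two
int mask = divider - 1;
const int iterations = 100_000_000;
long sink = 0;   // keeps the loops from being optimised away

var sw = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
    sink += i % divider;
sw.Stop();
Console.WriteLine($"modulo: {sw.ElapsedMilliseconds} ms");

sw.Restart();
for (int i = 0; i < iterations; i++)
    sink += i & mask;
sw.Stop();
Console.WriteLine($"mask:   {sw.ElapsedMilliseconds} ms (checksum {sink})");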


Comments

  1. I'd like to know why it has been so difficult to find an algebraic solution to the modular operation. Wouldn't this speed things up?
