Code Review

## Performance (Without any profiling data, I'm guessing blindly here)

Emily L.

You have a linked data structure (a tree). Traversing it has poor cache locality, and this is where your performance will most likely suffer: in the worst case, every node visited is a cache miss. But please do measure this first; if you see that most of your time is spent getting the next node, or near a line of code that does, you have cache issues.

One thing you can do to try to improve your cache performance is to encode your tree into a linear sequence (preferably a std::vector), and then measure whether it actually is faster.

As for the encoding, we'll refer to a level of the tree as k, with the root at k = 0. This means that level k has at most 8^k nodes. If we refer to each node in a level by its grid coordinate (x, y, z), then the node n(k, x, y, z) (at level k and position (x, y, z)) has its child nodes at:

  • n(k+1, 2*x, 2*y, 2*z)
  • ... (all combinations of +1 to the x,y,z) ...
  • n(k+1, 2*x + 1, 2*y + 1, 2*z + 1)
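The child-coordinate rule above can be sketched as a small helper. This is illustrative only; the `Coord` struct and the function name are my own, not from the original code:

```cpp
#include <array>
#include <cstdint>

struct Coord { uint32_t k, x, y, z; };

// Enumerate the eight children of n(k, x, y, z): the low three bits of
// the child index i supply the +0/+1 offsets for x, y and z.
std::array<Coord, 8> children(Coord n) {
    std::array<Coord, 8> c;
    for (uint32_t i = 0; i < 8; ++i) {
        c[i] = { n.k + 1,
                 2 * n.x + (i & 1),
                 2 * n.y + ((i >> 1) & 1),
                 2 * n.z + ((i >> 2) & 1) };
    }
    return c;
}
```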

We can order all the nodes into a linear random access sequence v (std::vector) like this:

```cpp
level_base  = (ipow(8, k) - 1) / 7;  // nodes in levels 0..k-1 (geometric series)
level_width = ipow(2, k);
grid_index  = x + level_width * y + level_width * level_width * z;
n(k, x, y, z) = v[level_base + grid_index];
```

Note that the width of the grid at level k is 2^k (for example: 1, 2, 4, 8). Here ipow is an optimized integer power function; if ipow proves to be a bottleneck, you can use a LUT for the powers of 8 and 2.
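Putting the indexing together, here is a minimal sketch. It uses the fact that the number of nodes in levels 0..k-1 of a full octree is the geometric series 1 + 8 + ... + 8^(k-1) = (8^k - 1)/7; the function names are illustrative:

```cpp
#include <cstdint>

// Number of nodes in levels 0..k-1: 1 + 8 + ... + 8^(k-1) = (8^k - 1) / 7.
// Cheap enough to compute on the fly, or precompute into a LUT.
constexpr uint64_t level_base(int k) {
    uint64_t p = 1, sum = 0;
    for (int i = 0; i < k; ++i) { sum += p; p *= 8; }
    return sum;
}

// Flat index of n(k, x, y, z) in the linear sequence v.
uint64_t node_index(int k, uint64_t x, uint64_t y, uint64_t z) {
    uint64_t width = uint64_t{1} << k;  // grid width at level k is 2^k
    return level_base(k) + x + width * y + width * width * z;
}
```

For example, the root is index 0, its eight children occupy indices 1..8, and level 2 starts at index 9.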

Benefits

  • The benefit of doing this is that breadth-first traversal/search becomes a linear access pattern, which your CPU's prefetcher will love you for. This is very fast.
  • If you know beforehand which node you want (k, x, y, z), you can get it directly.
  • You can get all the neighbours easily.
  • Given your (r, g, b) triplet, you may be able to directly compute the index of the destination node, write the value, and be done. No iterating down to the right node.
  • Less memory fragmentation.
  • For dense trees this actually takes less memory as you don't have the heap bookkeeping overhead etc.
  • You do not need the child[] array, making your data structure smaller, so more nodes fit in each cache line.
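The direct-computation point can be sketched like this: with 8-bit channels, the top k bits of each channel serve as the grid coordinates at level k. The channel-to-axis mapping here is an assumption about how the colour octree is built, not taken from the original code:

```cpp
#include <cstdint>

// Map an 8-bit (r, g, b) triplet straight to its grid index at level k
// (1 <= k <= 8) by taking the top k bits of each channel as (x, y, z).
// Add the base index of level k to get the absolute position in v.
uint64_t leaf_grid_index(int k, uint8_t r, uint8_t g, uint8_t b) {
    uint64_t x = r >> (8 - k);
    uint64_t y = g >> (8 - k);
    uint64_t z = b >> (8 - k);
    uint64_t width = uint64_t{1} << k;  // grid width at level k is 2^k
    return x + width * y + width * width * z;
}
```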

Drawbacks

  • The std::vector is dense; if your tree is very sparse, you may end up allocating far more nodes than you need. On the other hand, if you have the memory, that may be acceptable.
  • The indexing is more complex, which can lead to bugs or more complicated code if proper caution isn't taken.
  • If you need to distinguish between empty and non-empty nodes (as opposed to treating value == 0 as empty), you need to introduce a has_data flag. Even so, the nodes are still smaller than with a child array.
  • It is not as resilient against off-by-one bugs. In a linked-tree structure, if you somehow write outside a node's memory you'll hit the canary values on the heap or the heap bookkeeping, and in debug mode your runtime should let you know there's a problem. With this approach you will silently corrupt nearby nodes instead. Not necessarily likely, but something to be aware of.
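A minimal node layout along the lines of the has_data note, with illustrative field names:

```cpp
#include <cstdint>

// Flat-tree node without a child[] array: just the payload and an
// occupancy flag that distinguishes "empty" from value == 0.
struct Node {
    uint32_t value    = 0;      // e.g. a pixel count for this colour bucket
    bool     has_data = false;
};
// Typically 8 bytes after padding, versus 64 bytes for eight 64-bit
// child pointers alone in a linked octree node.
```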
