Job Title: Sarcastic Architect
Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs,
Calling a Spade a Spade, Keeping Tongue in Cheek
As we’re developing our own allocator for (Re)Actors (which will be published later <wink />), we ran into a need to test it, and more importantly – to benchmark it against currently existing ones. Apparently, while there are quite a few allocator test programs out there, most of these tests are not representative of real-world loads, and therefore can only provide very cursory ideas about allocator performance in real programs.
As one example, the rather popular t-test1.c (see, for example, [t-test1.c]) is routinely used as a unit test, so it is bound to test stuff such as calloc() and realloc(); on the other hand, these functions are very rarely used in modern C++ programs, so benchmarking them effectively skews the test away from likely real-world conditions (at least for C++). Also, uniform random distributions (both of the allocated sizes and of the relative frequency of creating/destroying items), while being the easiest to simulate, are not representative of real-world conditions (in fact, uniform random access over a large chunk of memory is known to be the best way to effectively disable caching, but – fortunately for performance – this rarely happens in the real world). And last but not least, testing only allocations/deallocations, without any access to the allocated memory, is as unrealistic as it gets; see, for example, the discussion in [Lakos17] of how the better locality provided by an allocator affects overall performance.
Goals and Requirements
As a result, we decided to develop our own test, which tries to address all the issues listed above. Our goal is to develop a test for allocators which on the one hand would behave as close to real-world C++ programs as possible, but on the other hand will allow measuring performance differences where they exist (without degrading into “all allocators are equal because our test spends too much time in the non-allocating code”). Bringing this a bit further down to earth resulted in the following set of requirements:
- we want our test to use only new/delete (or malloc/free). Everything else is used so rarely in C++ that benchmarking it doesn’t go well with the “close to real-world C++ program” goal.
- Or the same thing from another angle: if you really need lots of realloc() calls for your program, then to find the optimum allocator for your app, you will need a different benchmark – and quite likely, a different allocator.
- we want our test to use somewhat-realistic statistical distributions – both for the distribution of allocated sizes, and for the distribution of “lifetimes of allocated items”.
- We DO realize that it is bound to be a very rough approximation, but for the vast majority of programs out there, any honest attempt to become more realistic will be better than using cache-killing uniform distributions.
- we want our test to access all the allocated memory: we should write each allocated block at least once, and read it back at least once too.
- We do acknowledge that there can be MUCH more than one write/read, but we feel that at least once is an absolute minimum for anything aiming to be at least somewhat realistic.
Basic Idea
When looking from a 30’000-feet height, the basic idea of our test is very similar to that of [t-test1.c]:
- we have a bunch of ‘slots’, each ‘slot’ representing a potential allocated item.
- on each iteration, we’re randomly selecting one slot:
- if it is empty – we’re allocating an item and placing it into the slot
- if it is occupied – we’re deallocating the item and making the slot empty
- if there is more than one thread, each of the threads is working independently along the lines above.
That’s pretty much it.
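For illustration only, here is a minimal sketch of such a slot-based loop. This is a simplified assumption of ours, not the actual code from [Github.alloc-test]: in particular, it picks the slot and the size uniformly, while the real test uses the distributions described in the next section, and it doesn’t touch the allocated memory yet.

```
#include <cstdlib>
#include <vector>
#include <random>

// Minimal sketch of the slot-based loop described above (illustrative only).
// For brevity, the slot and the size are picked uniformly here; the real
// test deliberately uses non-uniform pickers -- see the next section.
int main() {
    const std::size_t num_slots  = 1 << 20;
    const std::size_t iterations = 10'000'000;
    const std::size_t max_size   = 1024;

    struct Slot { void* ptr = nullptr; std::size_t sz = 0; };
    std::vector<Slot> slots(num_slots);

    std::mt19937_64 rng(12345);
    std::uniform_int_distribution<std::size_t> pick_slot(0, num_slots - 1);
    std::uniform_int_distribution<std::size_t> pick_size(1, max_size);

    for (std::size_t i = 0; i < iterations; ++i) {
        Slot& s = slots[pick_slot(rng)];
        if (s.ptr == nullptr) {             // empty slot: allocate an item
            s.sz  = pick_size(rng);
            s.ptr = std::malloc(s.sz);
        } else {                            // occupied slot: free the item
            std::free(s.ptr);
            s.ptr = nullptr;
        }
    }
    for (auto& s : slots) std::free(s.ptr); // clean up whatever is still allocated
}
```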
Making Distributions More Realistic
In this model, there are two separate distributions: (a) distribution of sizes of the allocated items, and (b) distribution of the random selection within our slots (which translates into a distribution of the relative lifetimes of the allocated items).
For distribution of allocated sizes, we felt that a simple
p ~ 1/size
(where ‘~’ means ‘proportional to’, and size <= max_size) is good enough to represent a more-or-less realistic scenario. In other words, the probability of allocating 30 bytes is twice as high as the probability of allocating 60 bytes, which in turn is twice as high as the probability of allocating 120 bytes, and so on. In practice, to keep the calculations fast enough, we had to approximate this with a piecewise linear function (i.e. a function whose graph is composed of straight-line sections), but we hope we didn’t deviate too much from our 1/size goal.
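Purely as an illustration (this is not necessarily how alloc-test implements its piecewise-linear approximation), one straightforward way to draw sizes with p ~ 1/size is inverse-transform sampling: the CDF of 1/x on [min_size, max_size] is logarithmic, so exponentiating a uniform random number gives the desired shape.

```
#include <cmath>
#include <cstddef>
#include <random>

// Draws an allocation size with probability density proportional to 1/size
// on [min_size, max_size], via inverse-transform sampling.
// (Illustrative sketch; the actual test approximates the distribution with
// a piecewise linear function for speed.)
std::size_t pick_size_one_over_x(std::mt19937_64& rng,
                                 double min_size = 8.0,
                                 double max_size = 65536.0) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    double u = uni(rng);
    // CDF of 1/x on [a,b] is ln(x/a)/ln(b/a); inverting it gives x = a*(b/a)^u
    return static_cast<std::size_t>(min_size * std::pow(max_size / min_size, u));
}
```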
For the distribution of slot-being-selected-for-manipulation, we experimented quite a bit to ensure that, on the one hand, the probabilities of accessing less-frequent slots decrease fast (in particular, significantly faster than 1/x, which is too slow to represent the real-world distributions we’ve seen in our careers), but OTOH that they don’t drop to virtually zero too fast. In the end, we had to resort to a so-called Pareto distribution (a power-law distribution used to describe social, scientific, geophysical, actuarial, and many other types of observable phenomena), in its classic version with the “20% of people drink 80% of beer” rule. In addition to being a very good approximation for quite a few different non-computer-related real-world distributions, the Pareto distribution did provide what we considered a reasonably-close-to-real-world approximation of access frequencies; as a side bonus, we observed that with certain parameters (number of slots and max_size) it did not deviate too far[1] from accesses-to-L1 being ~= accesses-to-L2 being ~= accesses-to-L3 being ~= accesses-to-main-RAM, which we considered a good sign.[2]
As with allocated sizes, we did have to go for a piecewise linear approximation to keep the test reasonably fast, but we do hope it is not too bad; a sketch of one possible way to sample such a distribution is shown right after the footnotes below.
[1] within 3x
[2] those hardware developers wouldn’t spend time and effort on all those caches unless they were actually used, would they?
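Here is one possible way (again, just an illustration of ours, not the actual alloc-test code) to pick a slot index via the inverse CDF of a Pareto distribution; note that truncating the long tail to num_slots slightly distorts the distribution, and the real test uses a piecewise-linear approximation instead.

```
#include <cmath>
#include <cstddef>
#include <random>

// Picks a slot index with a (truncated) Pareto distribution, so that a small
// fraction of "hot" slots receives most of the accesses ("80/20" rule).
// alpha = log(5)/log(4) ~= 1.16 is the classic shape giving the 80/20 split.
// (Illustrative sketch only.)
std::size_t pick_slot_pareto(std::mt19937_64& rng, std::size_t num_slots,
                             double alpha = 1.16) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    double u = uni(rng);
    // Inverse CDF of Pareto(x_m = 1, alpha): x = (1 - u)^(-1/alpha)
    double x = std::pow(1.0 - u, -1.0 / alpha);
    std::size_t idx = static_cast<std::size_t>(x) - 1;  // shift so indices start at 0
    return idx < num_slots ? idx : num_slots - 1;       // truncate the long tail
}
```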
Accessing Allocated Memory
As noted above, we DID want to access the allocated memory; in our test, we do it by writing the whole block right after the allocation (using memset()) – and reading it back right before deallocation. This is the bare minimum of accesses we could imagine for the real world (hey, if we wanted to allocate, we probably wanted to write something there?).
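In terms of the loop sketched earlier, this simply means adding a memset() right after each allocation and one read pass right before each free(). A minimal sketch of what such touch helpers could look like (the names are our own illustrative inventions, not taken from the actual test):

```
#include <cstring>
#include <cstdint>
#include <cstddef>

// Write the whole block once, right after allocation...
inline void touch_on_alloc(void* p, std::size_t sz) {
    std::memset(p, 0xAB, sz);
}

// ...and read it back once, right before deallocation. XOR-ing into a
// volatile sink keeps the compiler from optimizing the reads away.
volatile std::uint8_t g_sink;
inline void touch_before_free(const void* p, std::size_t sz) {
    std::uint8_t acc = 0;
    const std::uint8_t* bytes = static_cast<const std::uint8_t*>(p);
    for (std::size_t i = 0; i < sz; ++i) acc ^= bytes[i];
    g_sink = acc;
}
```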
Compensation Mechanism
All the stuff we did (especially with regard to distributions) did take its toll CPU-wise (while we took lots of effort to avoid indirections as much as possible, the in-register calculations are still significant <sad-face />). Also, we did want to subtract the ideal-case cost of memset() and of reading the memory back, to see the differences between allocators more clearly.
To do it, we always run two tests: (a) the allocator-under-test, and (b) a dummy “void” allocator (which does absolutely nothing), and then we subtract the time spent in (b) from the time spent in (a) to get the performance of the allocator-under-test without the costs of the test itself.
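Such a “void” allocator can be as trivial as handing out one and the same pre-allocated buffer and ignoring deallocations, so the memset()/read costs stay in the baseline while the allocation cost itself is essentially zero. The sketch below is our guess at a minimal version (names and the buffer size are assumptions, not the actual code from [Github.alloc-test]):

```
#include <cstddef>

// A dummy "void" allocator used as baseline (b): it does no real allocation
// work at all -- it always hands out the same fixed buffer and ignores
// free() calls -- so the time it takes is essentially the cost of the test
// harness plus the memory accesses themselves.
// (Illustrative sketch only; requested sizes must not exceed kDummySize.)
constexpr std::size_t kDummySize = 1 << 20;
alignas(64) static unsigned char g_dummy_block[kDummySize];

inline void* void_alloc(std::size_t sz) { (void)sz; return g_dummy_block; }
inline void  void_free(void* p)         { (void)p; }
```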
Testing
Test. We run our test app, designed along the lines discussed above and available at [Github.alloc-test].
System. We run all our tests on a more or less typical 2-socket server box with two Xeon E5645 CPUs, each having 6 cores with Hyperthreading (~=”2-way SMT”), 12M of L3 cache, and capable of running at 2.4GHz (2.67GHz in turbo mode). The box has 32G of RAM. OS: Debian 9 “Stretch” (as of this writing, the current Debian “stable”). Very briefly: it is a pretty typical (even if a bit outdated) real-world “workhorse” server box.
Allocators Under Test. Our first set of tests was run over 4 popular Linux allocators:
- the built-in glibc allocator (common wisdom says it is a heavily modified ptmalloc2)
- [hoard]. Obtained from https://github.com/emeryberger/Hoard and compiled.
- [tcmalloc]. Obtained via apt-get install google-perftools
- [jemalloc]. Obtained via apt-get install libjemalloc-dev, apt-get install libjemalloc1
For all the allocators, we DID NOT play games with LD_PRELOAD; instead, for all allocators except the built-in one, we:
- used the -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free flags when compiling our test app (as recommended in [perftools.README]; we had problems without these, at least with tcmalloc).
- linked our test app with -lhoard or -ltcmalloc or -ljemalloc
Test Settings. To run our tests, we had to choose certain parameters, most importantly – the amount of RAM we’re going to allocate. In line with trying to keep our tests as close to real-world as possible, we decided to use about 1G of allocated RAM for our testing; in fact, due to us using powers of two wherever possible, we ended up with 1.3G of allocated RAM. It is important to note that when dealing with multiple threads, we decided to split the same total amount of RAM over all the threads (this is important, as we wanted to exclude differences caused by different sizes of the RAM-being-worked-with relative to CPU caches).
Another parameter (which varied from test to test) was the number of threads. As the box has hardware support for 24 threads (2 sockets * 6 cores/socket * 2 threads/core = 24 threads), we ran our tests using 1-23 threads (keeping the last hardware thread for the OS’s own needs, so interrupts and background activities don’t affect our testing too much).
What to Measure. For our first set of tests, we tested only two parameters:
- Time spent running the test. We then used this data to derive a “CPU clock cycles per malloc()/free() pair” metric.
- Memory usage of the program (measured as the maximum of RSS=”Resident Set Size”; without swap, it is a reasonably good metric of the amount of RAM the program allocates from the OS). We used this data to calculate “memory overhead” (as the ratio of “amount of RAM allocated from the OS” to “amount of RAM requested by the app level via malloc()”). A sketch of how both numbers can be measured on Linux follows right after this list.
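For reference, on Linux both numbers can be obtained with perfectly standard calls; below is a minimal sketch. The clock frequency and the number of pairs are placeholder values, and in the real measurement the time of the “void”-allocator run is subtracted first, as described above.

```
#include <chrono>
#include <cstdio>
#include <sys/resource.h>

int main() {
    auto t0 = std::chrono::steady_clock::now();
    // ... run the benchmark loop here ...
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    const double cpu_hz = 2.4e9;  // nominal clock of the test box (assumption)
    const double pairs  = 5e6;    // number of malloc()/free() pairs executed (example value)
    std::printf("CPU cycles per malloc()/free() pair: %.1f\n",
                seconds * cpu_hz / pairs);

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);  // on Linux, ru_maxrss is reported in kilobytes
    std::printf("max RSS: %ld KB\n", ru.ru_maxrss);
    return 0;
}
```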
Test Results
Now, with the test conditions more or less described (if something is missing – give us a shout, and we’ll be happy to elaborate further), we can get to the juiciest part of this post – the test results.
Here is the data we’ve got:
This is more or less consistent with previous testing and our expectations; however:
- with our attempts to simulate real-world behavior, the performance differences between modern state-of-the-art mallocs are not as drastic as is sometimes claimed. In particular, the largest observed performance difference between tcmalloc and ptmalloc2 was tcmalloc outperforming ptmalloc2 by a factor of 1.7x, though it happened only with a single thread; with larger numbers of threads it was ptmalloc2 outperforming tcmalloc (by up to 1.2x)
- it is interesting to note that those mallocs which performed better with fewer threads (tcmalloc and hoard) started to perform gradually worse at about 12 threads, and by about 18-20 threads got worse than ptmalloc2 and jemalloc.
- Whether this is related to SMT (on this box, 12 threads is the point where SMT comes into play) or to anything else is a subject for further research. In particular, it is interesting to find out whether this effect persists if we run our test threads in different processes.
The second parameter we measured was “memory overhead”. This is very important for overall performance, as the less RAM overhead we incur, the more efficient our caching is. As an extremely rough approximation of this effect, we can think of an allocator-with-2x-overhead as effectively wasting half of each level of caching, so instead of a 12M L3 cache, with such a 2x-overhead allocator we’ll run “as if” we have only 6M.[3]
As we can see, from this perspective jemalloc comes out as a clear winner, with ptmalloc2 being the second-best, and tcmalloc trailing quite far behind for any number of threads > 1.
[3] in practice, it all depends on how exactly the RAM is wasted, and often it won’t be that bad; but to get an idea of why memory overheads are important, this approximation is OK
Conclusions
Now, it is time to come up with our inherently subjective and inherently preliminary conclusions from this data:
- without specifying an exact app, the modern mallocs we tested are not too far from each other, and more importantly, each of them can outperform another under certain conditions. In other words: if you DO want to gain performance by changing mallocs, make sure to test them with your own app.
- IF we do NOT know the exact app (like when choosing the default malloc for an OS distribution), then our suggestions would go as follows:
- given its weird memory overheads, we’d stay away from hoard
- tcmalloc and ptmalloc2 are pretty similar (at this point, it seems that tcmalloc has an edge for desktops, and ptmalloc2 – for servers, but it is still too early to make any definite conclusions)
- but in our humble opinion, the overall winner so far (by a nose) is jemalloc. Its low memory overhead is expected to improve the way caches are handled, and for an “average app out there” we’d expect it to perform at least no worse than the alternatives. YMMV, batteries not included, and see the “make sure to test them with your own app” part above.
Acknowledgement
Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.