Job Title: Sarcastic Architect
Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs,
Calling a Spade a Spade, Keeping Tongue in Cheek
As we’re developing our own allocator for (Re)Actors (which will be published later <wink />), we ran into a need to test it, and more importantly – to benchmark it against currently existing ones. Apparently, while there are quite a few allocator test programs out there, most of these tests are not representative of real-world loads, and therefore can only provide very cursory ideas about allocator performance in real programs.
As one example, the rather popular t-test1.c (see, for example, [t-test1.c]) is routinely used as a unit test, so it is bound to test stuff such as calloc() and realloc(); on the other hand, these functions are very rarely used in modern C++ programs, so benchmarking them effectively skews the test away from likely real-world conditions (at least for C++). Also, uniform random distributions (both of the allocated sizes and of the relative frequency of creating/destroying items), while being the easiest to simulate, are not representative of real-world conditions (in fact, uniform random access over a large chunk of memory is known to be the best way to effectively disable caching, but – fortunately for performance – this rarely happens in the real world). And last but not least, testing only allocations/deallocations, without any access to the allocated memory, is as unrealistic as it gets; see, for example, the discussion in [Lakos17] of how the better locality provided by an allocator affects overall performance.
Goals and Requirements
As a result, we decided to develop our own test, which tries to address all the issues listed above. Our goal is to develop a test for allocators which on the one hand would behave as close to real-world C++ programs as possible, but on the other hand will allow measuring performance differences where they exist (without degrading into “all allocators are equal because our test spends too much time in the non-allocating code”). Bringing this a bit further down to earth resulted in the following set of requirements:
- we want our test to use only new/delete (or malloc/free). Everything else is used so rarely in C++ that benchmarking it doesn’t go well with the “close to real-world C++ program” goal.
- Or the same thing from another angle: if you really need lots of realloc() calls for your program, then to find the optimum allocator for your app, you will need a different benchmark – and quite likely, a different allocator.
- we want our test to use somewhat-realistic statistical distributions – both for the distribution of allocated sizes, and for the distribution of “lifetimes of allocated items”.
- We DO realize that it is bound to be a very rough approximation, but for the vast majority of programs out there, any honest attempt to become more realistic will be better than using cache-killing uniform distributions.
- we want our test to access all the allocated memory: we should write each allocated block at least once, and read it back at least once too.
- We do acknowledge that there can be MUCH more than one write/read, but we feel that at least once is an absolute minimum for anything aiming to be at least somewhat realistic.
Basic Idea
When looking from a 30’000-feet height, the basic idea of our test is very similar to that of [t-test1.c]:
- we have a bunch of ‘slots’, each ‘slot’ representing a potential allocated item.
- on each iteration, we’re randomly selecting one slot:
- if it is empty – we’re allocating an item and placing it into the slot
- if it is occupied – we’re deallocating the item and making the slot empty
- if there is more than one thread, each of the threads is working independently along the lines above.
That’s pretty much it.
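For illustration only, here is a minimal sketch of such a slot-based loop. This is a simplified assumption of ours, not the actual code from [Github.alloc-test]: in particular, it picks the slot and the size uniformly, while the real test uses the distributions described in the next section, and it doesn’t touch the allocated memory yet.

```
#include <cstdlib>
#include <vector>
#include <random>

// Minimal sketch of the slot-based loop described above (illustrative only).
// For brevity, the slot and the size are picked uniformly here; the real
// test deliberately uses non-uniform pickers -- see the next section.
int main() {
    const std::size_t num_slots  = 1 << 20;
    const std::size_t iterations = 10'000'000;
    const std::size_t max_size   = 1024;

    struct Slot { void* ptr = nullptr; std::size_t sz = 0; };
    std::vector<Slot> slots(num_slots);

    std::mt19937_64 rng(12345);
    std::uniform_int_distribution<std::size_t> pick_slot(0, num_slots - 1);
    std::uniform_int_distribution<std::size_t> pick_size(1, max_size);

    for (std::size_t i = 0; i < iterations; ++i) {
        Slot& s = slots[pick_slot(rng)];
        if (s.ptr == nullptr) {             // empty slot: allocate an item
            s.sz  = pick_size(rng);
            s.ptr = std::malloc(s.sz);
        } else {                            // occupied slot: free the item
            std::free(s.ptr);
            s.ptr = nullptr;
        }
    }
    for (auto& s : slots) std::free(s.ptr); // clean up whatever is still allocated
}
```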
Making Distributions More Realistic
In this model, there are two separate distributions: (a) distribution of sizes of the allocated items, and (b) distribution of the random selection within our slots (which translates into a distribution of the relative lifetimes of the allocated items).
For distribution of allocated sizes, we felt that a simple
p ~ 1/size
(where ‘~’ means ‘proportional to’, and size <= max_size) is good enough to represent a more-or-less realistic scenario. In other words, the probability of allocating 30 bytes is twice as high as the probability of allocating 60 bytes, which in turn is twice as high as the probability of allocating 120 bytes, and so on. In practice, to keep the calculations fast enough, we had to approximate this with a piecewise linear function (i.e. a function whose graph is composed of straight-line sections), but we hope we didn’t deviate too much from our 1/size goal.
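Purely as an illustration (this is not necessarily how alloc-test implements its piecewise-linear approximation), one straightforward way to draw sizes with p ~ 1/size is inverse-transform sampling: the CDF of 1/x on [min_size, max_size] is logarithmic, so exponentiating a uniform random number gives the desired shape.

```
#include <cmath>
#include <cstddef>
#include <random>

// Draws an allocation size with probability density proportional to 1/size
// on [min_size, max_size], via inverse-transform sampling.
// (Illustrative sketch; the actual test approximates the distribution with
// a piecewise linear function for speed.)
std::size_t pick_size_one_over_x(std::mt19937_64& rng,
                                 double min_size = 8.0,
                                 double max_size = 65536.0) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    double u = uni(rng);
    // CDF of 1/x on [a,b] is ln(x/a)/ln(b/a); inverting it gives x = a*(b/a)^u
    return static_cast<std::size_t>(min_size * std::pow(max_size / min_size, u));
}
```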
For the distribution of slot-being-selected-for-manipulation, we experimented quite a bit to ensure that, on the one hand, the probabilities of accessing less-frequent slots decrease fast (in particular, significantly faster than 1/x, which is too slow to represent the real-world distributions we’ve seen in our careers), but OTOH that they don’t drop to virtually zero too fast. In the end, we had to resort to a so-called Pareto distribution (a power-law distribution used to describe social, scientific, geophysical, actuarial, and many other types of observable phenomena), in its classic version with the “20% of people drink 80% of beer” rule. In addition to being a very good approximation for quite a few different non-computer-related real-world distributions, the Pareto distribution did provide what we considered a reasonably-close-to-real-world approximation of access frequencies; as a side bonus, we observed that with certain parameters (number of slots and max_size) it did not deviate too far[1] from accesses-to-L1 being ~= accesses-to-L2 being ~= accesses-to-L3 being ~= accesses-to-main-RAM, which we considered a good sign.[2]
As with allocated sizes, we did have to go for a piecewise linear approximation to keep the test reasonably fast, but we do hope it is not too bad; a sketch of one possible way to sample such a distribution is shown right after the footnotes below.
[1] within 3x
[2] those hardware developers wouldn’t spend time and effort on all those caches unless they were actually used, would they?
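Here is one possible way (again, just an illustration of ours, not the actual alloc-test code) to pick a slot index via the inverse CDF of a Pareto distribution; note that truncating the long tail to num_slots slightly distorts the distribution, and the real test uses a piecewise-linear approximation instead.

```
#include <cmath>
#include <cstddef>
#include <random>

// Picks a slot index with a (truncated) Pareto distribution, so that a small
// fraction of "hot" slots receives most of the accesses ("80/20" rule).
// alpha = log(5)/log(4) ~= 1.16 is the classic shape giving the 80/20 split.
// (Illustrative sketch only.)
std::size_t pick_slot_pareto(std::mt19937_64& rng, std::size_t num_slots,
                             double alpha = 1.16) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    double u = uni(rng);
    // Inverse CDF of Pareto(x_m = 1, alpha): x = (1 - u)^(-1/alpha)
    double x = std::pow(1.0 - u, -1.0 / alpha);
    std::size_t idx = static_cast<std::size_t>(x) - 1;  // shift so indices start at 0
    return idx < num_slots ? idx : num_slots - 1;       // truncate the long tail
}
```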
Accessing Allocated Memory
As noted above, we DID want to access the allocated memory; in our test, we do it by writing the whole block right after the allocation (using memset()) – and reading it back right before deallocation. This is the bare minimum of accesses we could imagine for the real world (hey, if we wanted to allocate, we probably wanted to write something there?).
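In terms of the loop sketched earlier, this simply means adding a memset() right after each allocation and one read pass right before each free(). A minimal sketch of what such touch helpers could look like (the names are our own illustrative inventions, not taken from the actual test):

```
#include <cstring>
#include <cstdint>
#include <cstddef>

// Write the whole block once, right after allocation...
inline void touch_on_alloc(void* p, std::size_t sz) {
    std::memset(p, 0xAB, sz);
}

// ...and read it back once, right before deallocation. XOR-ing into a
// volatile sink keeps the compiler from optimizing the reads away.
volatile std::uint8_t g_sink;
inline void touch_before_free(const void* p, std::size_t sz) {
    std::uint8_t acc = 0;
    const std::uint8_t* bytes = static_cast<const std::uint8_t*>(p);
    for (std::size_t i = 0; i < sz; ++i) acc ^= bytes[i];
    g_sink = acc;
}
```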
Compensation Mechanism
All the stuff we did (especially with regard to distributions) did take its toll CPU-wise (while we took lots of effort to avoid indirections as much as possible, the in-register calculations are still significant <sad-face />). Also, we did want to subtract the ideal-case cost of memset() and of reading the memory back, to see the differences between allocators more clearly.
To do it, we always run two tests: (a) the allocator-under-test, and (b) a dummy “void” allocator (which does absolutely nothing), and then we subtract the time spent in (b) from the time spent in (a) to get the performance of the allocator-under-test without the costs of the test itself.
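Such a “void” allocator can be as trivial as handing out one and the same pre-allocated buffer and ignoring deallocations, so the memset()/read costs stay in the baseline while the allocation cost itself is essentially zero. The sketch below is our guess at a minimal version (names and the buffer size are assumptions, not the actual code from [Github.alloc-test]):

```
#include <cstddef>

// A dummy "void" allocator used as baseline (b): it does no real allocation
// work at all -- it always hands out the same fixed buffer and ignores
// free() calls -- so the time it takes is essentially the cost of the test
// harness plus the memory accesses themselves.
// (Illustrative sketch only; requested sizes must not exceed kDummySize.)
constexpr std::size_t kDummySize = 1 << 20;
alignas(64) static unsigned char g_dummy_block[kDummySize];

inline void* void_alloc(std::size_t sz) { (void)sz; return g_dummy_block; }
inline void  void_free(void* p)         { (void)p; }
```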
Testing
Test. We run our test app, designed along the lines discussed above and available at [Github.alloc-test].
System. We run all our tests on a more or less typical 2-socket server box with two Xeon E5645 CPUs, each having 6 cores with Hyperthreading (~=”2-way SMT”), 12M of L3 cache, and capable of running at 2.4GHz (2.67GHz in turbo mode). The box has 32G of RAM. OS: Debian 9 “Stretch” (as of this writing, the current Debian “stable”). Very briefly: it is a pretty typical (even if a bit outdated) real-world “workhorse” server box.
Allocators Under Test. Our first set of tests was run over 4 popular Linux allocators:
- the built-in glibc allocator (common wisdom says it is a heavily modified ptmalloc2)
- [hoard]. Obtained from https://github.com/emeryberger/Hoard and compiled.
- [tcmalloc]. Obtained via apt-get install google-perftools
- [jemalloc]. Obtained via apt-get install libjemalloc-dev, apt-get install libjemalloc1
For all the allocators, we DID NOT play games with LD_PRELOAD; instead, for all allocators except the built-in one, we:
- used the -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free flags when compiling our test app (as recommended in [perftools.README]; we had problems without these, at least with tcmalloc).
- linked our test app with -lhoard or -ltcmalloc or -ljemalloc
Test Settings. To run our tests, we had to choose certain parameters, most importantly – the amount of RAM we’re going to allocate. In line with trying to keep our tests as close to real-world as possible, we decided to use about 1G of allocated RAM for our testing; in fact, due to us using powers of two wherever possible, we ended up with 1.3G of allocated RAM. It is important to note that when dealing with multiple threads, we decided to split the same total amount of RAM over all the threads (this is important, as we wanted to exclude differences caused by different sizes of the RAM-being-worked-with relative to CPU caches).
Another parameter (which varied from test to test) was the number of threads. As the box has hardware support for 24 threads (2 sockets * 6 cores/socket * 2 threads/core = 24 threads), we ran our tests using 1-23 threads (keeping the last hardware thread for the OS’s own needs, so interrupts and background activities don’t affect our testing too much).
What to Measure. For our first set of tests, we tested only two parameters:
- Time spent running the test. We then used this data to derive a “CPU clock cycles per malloc()/free() pair” metric.
- Memory usage of the program (measured as the maximum of RSS=”Resident Set Size”; without swap, it is a reasonably good metric of the amount of RAM the program allocates from the OS). We used this data to calculate “memory overhead” (as the ratio of “amount of RAM allocated from the OS” to “amount of RAM requested by the app level via malloc()”). A sketch of how both numbers can be measured on Linux follows right after this list.
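For reference, on Linux both numbers can be obtained with perfectly standard calls; below is a minimal sketch. The clock frequency and the number of pairs are placeholder values, and in the real measurement the time of the “void”-allocator run is subtracted first, as described above.

```
#include <chrono>
#include <cstdio>
#include <sys/resource.h>

int main() {
    auto t0 = std::chrono::steady_clock::now();
    // ... run the benchmark loop here ...
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    const double cpu_hz = 2.4e9;  // nominal clock of the test box (assumption)
    const double pairs  = 5e6;    // number of malloc()/free() pairs executed (example value)
    std::printf("CPU cycles per malloc()/free() pair: %.1f\n",
                seconds * cpu_hz / pairs);

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);  // on Linux, ru_maxrss is reported in kilobytes
    std::printf("max RSS: %ld KB\n", ru.ru_maxrss);
    return 0;
}
```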
Test Results
Now, with the test conditions more or less described (if something is missing – give us a shout, and we’ll be happy to elaborate further), we can get to the juiciest part of this post – the test results.
Here is the data we’ve got:
This is more or less consistent with previous testing and our expectations; however:
- with our attempts to simulate real-world behavior, the performance differences between modern state-of-the-art mallocs are not as drastic as is sometimes claimed. In particular, the largest observed performance difference between tcmalloc and ptmalloc2 was tcmalloc outperforming ptmalloc2 by a factor of 1.7x, though it happened only with a single thread; with larger numbers of threads it was ptmalloc2 outperforming tcmalloc (by up to 1.2x)
- it is interesting to note that those mallocs which performed better with fewer threads (tcmalloc and hoard) started to perform gradually worse at about 12 threads, and by about 18-20 threads got worse than ptmalloc2 and jemalloc.
- Whether this is related to SMT (on this box, 12 threads is the point where SMT comes into play) or to anything else is a subject for further research. In particular, it is interesting to find out whether this effect persists if we run our test threads in different processes.
The second parameter we measured was “memory overhead”. This is very important for overall performance, as the less RAM overhead we incur, the more efficient our caching is. As an extremely rough approximation of this effect, we can think of an allocator-with-2x-overhead as effectively wasting half of each level of caching, so instead of a 12M L3 cache, with such a 2x-overhead allocator we’ll run “as if” we have only 6M.[3]
As we can see, from this perspective jemalloc comes out as a clear winner, with ptmalloc2 being the second-best, and tcmalloc trailing quite far behind for any number of threads > 1.
[3] in practice, it all depends on how exactly the RAM is wasted, and often it won’t be that bad; but to get an idea of why memory overheads are important, this approximation is OK
Conclusions
Now, it is time to come up with our inherently subjective and inherently preliminary conclusions from this data:
- without specifying an exact app, the modern mallocs we tested are not too far from each other, and more importantly, each of them can outperform another under certain conditions. In other words: if you DO want to gain performance by changing mallocs, make sure to test them with your own app.
- IF we do NOT know the exact app (like when choosing the default malloc for an OS distribution), then our suggestions would go as follows:
- given its weird memory overheads, we’d stay away from hoard
- tcmalloc and ptmalloc2 are pretty similar (at this point, it seems that tcmalloc has an edge for desktops, and ptmalloc2 – for servers, but it is still too early to make any definite conclusions)
- but in our humble opinion, the overall winner so far (by a nose) is jemalloc. Its low memory overhead is expected to improve the way caches are handled, and for an “average app out there” we’d expect it to perform at least no worse than the alternatives. YMMV, batteries not included, and see the “make sure to test them with your own app” part above.
Acknowledgement
Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.