A recent question inspired me to benchmark the different methods of copying data, and this is what I came up with:
#include <iostream>
#include <iomanip>
#include <chrono>
#include <string>
#include <vector>
#include <algorithm>
#include <cstring>

class TimedTest
{
public:
    TimedTest(const std::string& name)
        : name{ name },
          time{ 0 },
          total{ 0 },
          testCount{ 0 }
    {
    }

    void run(int* dst, int* src, std::size_t count)
    {
        std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
        test(dst, src, count);
        std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
        time = (double)std::chrono::duration_cast<std::chrono::milliseconds> (end - begin).count();
        total += time;
        testCount += 1.0;
    }

    virtual void test(int* dst, int* src, std::size_t count) = 0;

    double average()
    {
        return total / testCount;
    }

public:
    std::string name;
    double time;
    double total;
    double testCount;
};

class TimedTestMemCpy : public TimedTest
{
    using TimedTest::TimedTest;

    virtual void test(int* dst, int* src, std::size_t count) override
    {
        memcpy(dst, src, sizeof(int) * count);
    }
};

class TimedTestStdCopy : public TimedTest
{
    using TimedTest::TimedTest;

    virtual void test(int* dst, int* src, std::size_t count) override
    {
        std::copy(src, src + count, dst);
    }
};

class TimedTestSimpleLoop : public TimedTest
{
    using TimedTest::TimedTest;

    virtual void test(int* dst, int* src, std::size_t count) override
    {
        for (size_t i = 0; i < count; i++)
            dst[i] = src[i];
    }
};

class TimedTestPointerCopy : public TimedTest
{
    using TimedTest::TimedTest;

    virtual void test(int* dst, int* src, std::size_t count) override
    {
        int* end = dst + count;
        while (dst != end)
            *dst++ = *src++;
    }
};

class TimedTestOMPCopy : public TimedTest
{
    using TimedTest::TimedTest;

    virtual void test(int* dst, int* src, std::size_t count) override
    {
        #pragma omp parallel for
        for (int i = 0; i < (int)count; i++)
            dst[i] = src[i];
    }
};

int main()
{
    constexpr std::size_t length = 200'000'000;
    int* src = new int[length];
    for (int i = 0; i < length; i++)
        src[i] = i;
    int* dst = new int[length];

    std::vector<TimedTest*> tests;
    tests.push_back(new TimedTestMemCpy("memcpy"));
    tests.push_back(new TimedTestStdCopy("std::copy"));
    tests.push_back(new TimedTestSimpleLoop("simpleLoop"));
    tests.push_back(new TimedTestPointerCopy("pointerCopy"));
    tests.push_back(new TimedTestOMPCopy("OMPCopy"));

    std::cout << std::setw(5) << "Test#";
    for (auto test : tests)
        std::cout << std::setw(12) << test->name << std::setw(9) << "Avg";
    std::cout << "\n";

    for (int i = 0; i < 100; i++)
    {
        std::cout << std::setw(5) << i;
        for (auto test : tests)
        {
            test->run(dst, src, length);
            std::cout << std::setw(12) << test->time << std::setw(9) << test->average();
        }
        std::cout << "\n";
    }

    for (auto test : tests)
        delete test;
    delete[] src;
    delete[] dst;
}
I would appreciate any comments on the results or suggestions on improving the benchmarking / general code.
4 Answers
Avoid new and delete

There is often no need to use new and delete manually in C++; containers do their own memory management, and in almost all other cases std::unique_ptr can be used. In your program, src and dst can be made std::vectors:
std::vector<int> src(length);
std::vector<int> dst(length);
For the vector of test cases, you can use std::unique_ptr like so:

std::vector<std::unique_ptr<TimedTest>> tests;
tests.push_back(std::make_unique<TimedTestMemCpy>("memcpy"));
...
Use composition instead of inheritance

The problem with your approach of using inheritance is that you have to make a new class for every test case, and the only thing they all add is a single function. Instead of inheriting, you could make TimedTest a non-virtual class, and have it store the function as a member variable of type std::function:
class TimedTest
{
    // A type alias to avoid repeating the full type (requires <functional>)
    using Function = std::function<void(int*, int*, std::size_t)>;

public:
    TimedTest(const std::string& name, Function test)
        : name{name}, test{test}
    {
    }

    ...

private:
    std::string name;
    Function test;
};
Note that your run() function doesn't have to be changed at all! And since the class is no longer virtual, you don't need to store pointers to it in the vector tests, and can instead write:

static void test_memcpy(int* dst, int* src, std::size_t count) {
    memcpy(dst, src, sizeof(*dst) * count);
}
...
std::vector<TimedTest> tests = {
    {"memcpy", test_memcpy},
    ...
};
Avoid using classes at all

Even better, as user673679 and JDługosz mentioned, there is no need to have a class TimedTest at all; you can just write a function that takes another function as an argument and runs it in a loop. The only difference from your code would then be that you interleave running each test once, accumulate the results, and calculate all the averages at the end.
Use auto and using to avoid writing long type names

Instead of writing out full type names, like in this line of code:

std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();

I recommend you use using to create a type alias for the clock type you want, and auto to avoid having to specify types at all where appropriate. The above then becomes:
using clock = std::chrono::steady_clock;
auto start = clock::now();
...
auto stop = clock::now();
avoid manual memory management

int* src = new int[length];
for (int i = 0; i < length; i++)
    src[i] = i;
int* dst = new int[length];

We could use two std::vector<int>s for this and avoid the manual memory management. Pointers to the underlying storage can be obtained using the .data() member function.
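For example (a sketch, assuming the test functions keep their int* parameters):

std::vector<int> src(length);
std::vector<int> dst(length);

// the raw pointers the existing test functions expect:
test->run(dst.data(), src.data(), length);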
std::vector<TimedTest*> tests;

Similarly, we should use std::vector<std::unique_ptr<TimedTest>> here so we don't have to manually delete them.
add a virtual destructor
We're deleting objects through a base-class pointer (TimedTest*), so we must have a virtual destructor to ensure proper cleanup.
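That is, something like:

class TimedTest
{
public:
    // Without this, delete through a TimedTest* is undefined behavior.
    virtual ~TimedTest() = default;
    // ...
};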
use std::iota
for (int i = 0; i < length; i++)
    src[i] = i;

We could use std::iota from the <numeric> header for this.
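For example, assuming src has been changed to a std::vector<int>:

#include <numeric>

std::iota(src.begin(), src.end(), 0); // src = 0, 1, 2, ...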
preserve timing accuracy
(double)std::chrono::duration_cast<std::chrono::milliseconds> (end - begin).count();

std::chrono::milliseconds is an integer-based type, so this cast loses a lot of accuracy.

We should accumulate time in a std::chrono::steady_clock::duration (which is likely nanoseconds) member, and then do the conversion to milliseconds (or whatever) once, after all the test runs are complete.

If we want to output a floating-point result at the end, we can use something like std::chrono::duration<double, std::milli>(total).count() on the final time.
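A self-contained sketch of that pattern (the Timing struct and its member names are illustrative):

#include <chrono>
#include <cstddef>

struct Timing
{
    std::chrono::steady_clock::duration total{}; // accumulates at native (likely ns) resolution
    double runs = 0.0;

    template <class F>
    void time(F&& f)
    {
        auto begin = std::chrono::steady_clock::now();
        f();
        auto end = std::chrono::steady_clock::now();
        total += end - begin; // no precision lost here
        runs += 1.0;
    }

    double averageMs() const
    {
        // One conversion to floating-point milliseconds at the end, instead of per run
        return std::chrono::duration<double, std::milli>(total).count() / runs;
    }
};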
simplify

class TimedTest;

This class (and its children) isn't really necessary. The timing members don't need to exist before the test is run, and they are not used again after output. They should be local variables in the test-running loop.

Using a virtual function for every test call may also add overhead to the functions being tested. We could instead write a template function to run the tests a set number of times and return timings. Perhaps something like:
template<class F>
std::chrono::steady_clock::duration run_test(std::size_t num_runs, F test_func)
{
    auto begin = std::chrono::steady_clock::now();

    for (auto i = std::size_t{ 0 }; i != num_runs; ++i)
        test_func();

    auto end = std::chrono::steady_clock::now();
    return (end - begin);
}
That could be called with a lambda like so: run_test(100, [&] () { test_memcpy(src.data(), dst.data(), length); }), which should be easy for the compiler to optimize.
misc. other things:

- In C++ we should technically use std::memcpy, not just memcpy (because that's what the <cstring> header declares, and the unqualified name might not exist).
- It might affect the timings to share source and destination memory between the different tests. Each one should perhaps do its own allocation and cleanup (outside the timed section of code).
- If we don't want to roll our own, we could use Google Benchmark for making these kinds of measurements, which has an online platform here (a minimal sketch follows this list).
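For reference, a minimal sketch of what one of these tests might look like with Google Benchmark (the copy size is reduced here; the framework chooses iteration counts and reports statistics itself):

#include <cstring>
#include <vector>
#include <benchmark/benchmark.h>

static void BM_memcpy(benchmark::State& state)
{
    std::vector<int> src(state.range(0), 1);
    std::vector<int> dst(state.range(0));
    for (auto _ : state)
    {
        std::memcpy(dst.data(), src.data(), sizeof(int) * src.size());
        benchmark::DoNotOptimize(dst.data()); // keep the copy from being optimized away
    }
}
BENCHMARK(BM_memcpy)->Arg(1'000'000);

BENCHMARK_MAIN();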
- Peter Cordes (Oct 21, 2021): int* dst = new int[length]; is different from std::vector<int> dst(length) in that it won't have touched the memory pages to zero them. So if you wanted to time the page faults on first access to the new memory, you need to use new, or some allocator tricks to get std::vector to keep its hands off the memory instead of constructing each element. Especially if you do a new alloc for each test! See "Why is iterating though `std::vector` faster than iterating though `std::array`?" on SO for a microbenchmark that fell into that trap.
- Peter Cordes (Oct 21, 2021): "It might affect the timings to share source and destination memory between the different tests." - Yes, in a good way, if you want to time the case where memory is pre-faulted, the TLB is hot, and (depending on copy size) data is hot in some level of cache. Copying to untouched memory could amplify the benefit of OpenMP doing page faults in parallel on multiple cores, especially on a desktop CPU where a single core can nearly saturate DRAM bandwidth. (Unlike on a big Xeon, where single-core bandwidth is actually lower, but aggregate is higher with 6 or 8 memory-controller channels.)
- user673679 (Oct 21, 2021): @PeterCordes Good points. I guess std::unique_ptr<int[]> would be the thing to use for dst then.
- Peter Cordes (Oct 21, 2021): Well, if you want to test fresh-allocation write performance, i.e. page faults more so than memory bandwidth, then yeah, maybe.
I think your approach is overcomplicated. You don't need to create and free objects or maintain a collection of them; you just have various implementations that share the same signature. So give each implementation a different function name, and write:
test ("memcpy", &TestMemCpy);
test ("simple loop", &TestSimpleLoop);
// etc.
This makes the other issues, like using a naked new and having to explicitly code a delete loop, just go away.
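A sketch of what such a test() helper might look like (the helper itself isn't spelled out in the answer; the buffer size is reduced, and globals are used as suggested next):

#include <chrono>
#include <cstddef>
#include <cstring>
#include <iostream>

constexpr std::size_t length = 10'000'000; // reduced for the sketch
int src[length]; // global arrays instead of heap allocations
int dst[length];

void TestMemCpy(int* d, int* s, std::size_t n) { std::memcpy(d, s, sizeof(int) * n); }

// Hypothetical helper: times one run of the given implementation and prints it.
void test(const char* name, void (*fn)(int*, int*, std::size_t))
{
    auto begin = std::chrono::steady_clock::now();
    fn(dst, src, length);
    auto end = std::chrono::steady_clock::now();
    std::cout << name << ": "
              << std::chrono::duration<double, std::milli>(end - begin).count()
              << " ms\n";
}

int main()
{
    test("memcpy", &TestMemCpy);
}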
Likewise, your src and dst memory does not need to be allocated on the heap. It could just be global variables declared as arrays.
I'm worried that the benchmark won't be timing things with enough precision to notice anything useful. At the very least, doing a test only once leaves an unknown amount of error or jitter in the timing. Normal benchmarking frameworks will run the thing many times. You're running through all the tests 100 times, rather than running each test 100 times before moving on to the next; the latter is better because of code-cache issues.
about memory copying in particular
As the conversation in the comments points out, benchmarking memory copying in particular has its own issues unrelated to the benchmarking framework.
Look at your functions under Compiler Explorer, with optimizations turned on.
You can comment out the push_back lines and un-comment them one at a time, to clearly see the generated code for each option.
The compiler generates pretty much the same code for all of those! Memory copying is something it watches out for, and once the compiler understands what you meant, it throws away what you wrote and puts in optimal code for that result. "Optimal" is based on various flags you can use to control it, including target architecture.
With compilers like this, you should focus on clearly expressing your intent rather than imagining that the assembly language is a rough line-by-line translation of what you wrote.
This means using memcpy or std::copy. Since these are standard, the compiler can have built-in, advanced understanding of what they do. In real code you work with typed objects, not raw bytes, so always use std::copy. Note that std::copy and related functions can also be given an execution policy argument, so the copy can be parallelized, as with your last example.
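A minimal sketch of that (C++17 <execution>; note that with libstdc++ the parallel policies need TBB to actually run in parallel):

#include <algorithm>
#include <execution>
#include <vector>

int main()
{
    std::vector<int> src(1'000'000, 42);
    std::vector<int> dst(src.size());

    // Parallel counterpart of plain std::copy, similar in spirit to the OpenMP loop
    std::copy(std::execution::par, src.begin(), src.end(), dst.begin());
}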
- Peter Cordes (Oct 21, 2021): "The latter is better because of code cache issues." - That's assuming you want to test the case where the copy loop itself is already hot in I-cache, and branch predictors are trained for that copy size. I think in most real code, it's more normal to do one large copy between a bunch of other work. So depending on what you want to measure, it might or might not be more realistic to alternate between tests. OTOH this is pretty synthetic anyway, and hopefully (with -O2 or -O3) most of them just compile to a call to the same libc memcpy function anyway.
- JDługosz (Oct 21, 2021): Right; the real fastest code will copy (on x86-64 chips) 128 bits at a time. It may take pains to align the reads even if that mis-aligns the writes. Future architectures may benefit from copying 256 or 512 bits at a time, and recompiling with a newer compiler will take care of that without touching my source code.
- Peter Cordes (Oct 21, 2021): You're talking about DRAM bandwidth? Current CPUs benefit from 256-bit vectors when data is hot in L2 cache, or maybe even L3 cache. Maybe even a small benefit on DRAM bandwidth from the same number of in-flight loads/stores looking twice as far ahead for triggering HW prefetch from the next physical page. Note that if the compiler doesn't inline memcpy, then you don't even need to recompile: glibc dispatches to __memmove_avx_erms when resolving memcpy at dynamic link time on CPUs that have those features, so you already benefit from CPU features for large or variable-size copies.
- JDługosz (Oct 21, 2021): @PeterCordes I've observed gcc generate 128-bit copying when the -march specified a processor that had AVX-512. The answer I found was that it wasn't any faster, but adding another word size to deal with the alignment and total-copy-size logic increased code size and added cycles. I'm sure CPU generations keep on getting better.
- Peter Cordes (Oct 21, 2021): Ok yes, if you're talking about inlining memcpy for small copies, then the decision is up to gcc rather than glibc, and yes, it is fixed at compile time. Yes, sometimes GCC's inline expansion of memcpy / memset (and sometimes other string functions) works differently from auto-vectorization's -mprefer-vector-width=256 default on most recent CPUs. I think that sounds familiar.
When running your code (after changing from millisecond to native resolution on the timing) I get the following:
clang++ -O3
Test# memcpy Avg std::copy Avg simpleLoop Avg pointerCopy Avg OMPCopy Avg
0 626.327 626 40.8343 40 59.0428 59 58.9443 58 58.962 58
1 41.0973 333.5 39.1295 39.5 59.3185 59 59.0245 58.5 59.0766 59
...
99 41.1162 41.17 39.8022 39.46 57.1298 57.08 57.4694 57.16 57.1774 57.26
Notice anything odd? The first test run is 15 times slower than the next... this is because that test is the first to touch the newly allocated memory. This has to do with how the OS (Win10 in this case) allocates memory: when you allocate, it just gives you a range in virtual memory. The first time you touch that memory, the OS generates a page fault and performs the physical memory allocation for it. This takes a hot minute, and you see it in the test.
There are two ways you can solve this:
Specific solution
We know this is caused by page allocation. In the loop where you prime the source vector, also write something to the destination vector to make sure it's paged in.
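A sketch of that priming loop:

for (std::size_t i = 0; i < length; i++)
{
    src[i] = (int)i;
    dst[i] = 0; // touch dst too, so its pages are faulted in before timing starts
}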
Generic solution
There are other causes that may result in the first, last, or a random run being notably faster or slower than the rest. In statistics these would be called outliers. A good benchmarking suite will remove the n slowest and k fastest runs from the stats for each function under test. This removes the effects of hot/cold cache lines at start, page faults, etc.
Danger Will Robinson!
All the inputs to your program are deterministic and the output is never used. A C++ compiler (and a C one, for that matter) is allowed, under the as-if rule, to completely elide everything in this program (including the memory allocations!) except printing the output table, if it can determine that the output wouldn't change.

In this case you could argue that the times would be vastly different if the compiler did that. But under that argument the compiler would be prevented from doing ANY optimizations as soon as you did anything time-related, and that doesn't hold. In fact, I've seen clang do exactly this: I had a benchmark where it removed my entire code under test (memory allocations and all!) and replaced it with returning a constant.

In your case, I'd make both the destination and source pointers point to volatile int. This forces the compiler to perform the copies, but I'm not sure if this also forbids some forms of optimization (like SIMD vectorization).
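A sketch of the volatile variant for the hand-written loop (note that memcpy and std::copy won't accept volatile int* directly, so this mainly applies to the loop versions):

volatile int* vsrc = src;
volatile int* vdst = dst;
for (std::size_t i = 0; i < length; i++)
    vdst[i] = vsrc[i]; // every load and store must actually happen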
TL;DR: It's hard to create a good benchmark; use a library.
Performance
I'm surprised to see that pointerCopy and simpleLoop are slower than memcpy; typically compilers will detect simple copy loops and replace them with calls to memcpy. But then I remembered that little detail about pointer aliasing... and the C restrict keyword that tells the compiler: "Look, these pointers don't alias, ok?". It's technically nonstandard, but all C++ compilers that matter support some form of restrict for C++. I used __restrict__ for the pointer arguments in the test functions.
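A sketch of what that change might look like for one of the tests (the answer doesn't show its exact code; GCC and clang accept __restrict__ on parameters, including on virtual overrides, since it doesn't change the function type):

class TimedTestSimpleLoop : public TimedTest
{
    using TimedTest::TimedTest;

    // __restrict__ promises the compiler that the two buffers don't overlap
    virtual void test(int* __restrict__ dst, int* __restrict__ src, std::size_t count) override
    {
        for (size_t i = 0; i < count; i++)
            dst[i] = src[i];
    }
};

With that change, this is what I got: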
clang++ -O3
Test# memcpy Avg std::copy Avg simpleLoop Avg pointerCopy Avg OMPCopy Avg
0 41.0359 41 39.2272 39 39.1399 39 56.4088 56 39.4728 39
...
99 40.8882 41.38 40.5645 39.7 39.9101 39.59 56.9585 57.48 39.4448 39.6
Notice how everything except memcpy and std::copy got faster? This is possible because the compiler can now generate much more efficient code when it knows that the source and destination memory don't overlap (alias). I'm not completely sure, but I believe the library functions do not have overloads with restrict (in fact, I don't believe that's even possible), so they internally determine with some math whether the source and destination alias; if they don't, they run a fast non-aliasing copy, and if they do, they run another, slower code path.
I think this additional math and branching is the slight edge that simpleLoop has over std::copy (I'm not sure why memcpy is a full 1.5 ms slower than std::copy). I believe pointerCopy is slower because of the way you increment; I've seen cases before where the post-increment is slower than the pre-increment. You could probably make it as fast as simpleLoop, but I don't think it's worth the effort.
At the end of the day, this reinforces the old mantra: write clear code, give the compiler as much information as possible, and let it do its job.
For fun I also ran with
g++ -O3
Test# memcpy Avg std::copy Avg simpleLoop Avg pointerCopy Avg OMPCopy Avg
0 41.2107 41 39.4616 39 39.2258 39 39.1013 39 39.4802 39
...
99 40.8445 41.01 39.2478 39.43 39.1988 39.32 39.1874 39.27 39.1505 39.25
which interestingly generates better code for pointerCopy. I also ran with:
clang++ -O3 -march=native
Test# memcpy Avg std::copy Avg simpleLoop Avg pointerCopy Avg OMPCopy Avg
0 41.0537 41 39.3157 39 39.2856 39 56.9359 56 39.5525 39
...
99 40.8602 41 39.1795 39.34 39.1104 39.31 56.5426 56.8 39.738 39.29
where -march=native made slight improvements.
- jdt (Oct 23, 2021): Thanks for the insightful observations! Regarding the pointer copy, I have noticed that with gcc -O3 testB is much faster than testA (TIO). With MSVC and the Intel compiler it does not look like it makes a difference.