A recent question inspired me to benchmark the different methods of copying data, and this is what I came up with:

#include <algorithm>  // std::copy
#include <chrono>
#include <cstring>
#include <iomanip>
#include <iostream>
#include <string>
#include <vector>

class TimedTest
{
public:
    TimedTest(const std::string& name)
    : name{ name },
      time{ 0 },
      total{ 0 },
      testCount{ 0 }
    {
    }

    void run(int* dst, int* src, std::size_t count)
    {
        std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
        test(dst, src, count);
        std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
        time = (double)std::chrono::duration_cast<std::chrono::milliseconds> (end - begin).count();
        total += time;
        testCount += 1.0;
    }

    virtual void test(int* dst, int* src, std::size_t count) = 0;

    double average()
    {
        return total / testCount;
    }

public:
    std::string name;
    double time;
    double total;
    double testCount;
};

class TimedTestMemCpy : public TimedTest
{
    using TimedTest::TimedTest;
    virtual void test(int* dst, int* src, std::size_t count) override
    {
        memcpy(dst, src, sizeof(int) * count);
    }
};

class TimedTestStdCopy : public TimedTest
{
    using TimedTest::TimedTest;
    virtual void test(int* dst, int* src, std::size_t count) override
    {
        std::copy(src, src + count, dst);
    }
};

class TimedTestSimpleLoop : public TimedTest
{
    using TimedTest::TimedTest;
    virtual void test(int* dst, int* src, std::size_t count) override
    {
        for (size_t i = 0; i < count; i++)
            dst[i] = src[i];
    }
};

class TimedTestPointerCopy : public TimedTest
{
    using TimedTest::TimedTest;
    virtual void test(int* dst, int* src, std::size_t count) override
    {
        int* end = dst + count;
        while (dst != end)
            *dst++ = *src++;
    }
};

class TimedTestOMPCopy : public TimedTest
{
    using TimedTest::TimedTest;
    virtual void test(int* dst, int* src, std::size_t count) override
    {
        // Requires OpenMP to be enabled (e.g. -fopenmp)
#pragma omp parallel for
        for (int i = 0; i < (int)count; i++)
            dst[i] = src[i];
    }
};

int main()
{
    constexpr std::size_t length = 200'000'000;
    int* src = new int[length];
    for (int i = 0; i < length; i++)
        src[i] = i;
    int* dst = new int[length];

    std::vector<TimedTest*> tests;
    tests.push_back(new TimedTestMemCpy("memcpy"));
    tests.push_back(new TimedTestStdCopy("std::copy"));
    tests.push_back(new TimedTestSimpleLoop("simpleLoop"));
    tests.push_back(new TimedTestPointerCopy("pointerCopy"));
    tests.push_back(new TimedTestOMPCopy("OMPCopy"));

    std::cout << std::setw(5) << "Test#";
    for (auto test : tests)
        std::cout << std::setw(12) << test->name << std::setw(9) << "Avg";
    std::cout << "\n";

    for (int i = 0; i < 100; i++)
    {
        std::cout << std::setw(5) << i;
        for (auto test : tests)
        {
            test->run(dst, src, length);
            std::cout << std::setw(12) << test->time << std::setw(9) << test->average();
        }
        std::cout << "\n";
    }

    for (auto test : tests)
        delete test;
    delete[] src;
    delete[] dst;
}

I would appreciate any comments on the results or suggestions on improving the benchmarking / general code.

asked Oct 19, 2021 at 15:42

4 Answers

Avoid new and delete

There is often no need to use new and delete manually in C++: containers do their own memory management, and in almost all other cases std::unique_ptr can be used. In your program, src and dst can be made std::vectors:

std::vector<int> src(length);
std::vector<int> dst(length);

For the vector of test cases, you can use std::unique_ptr like so:

std::vector<std::unique_ptr<TimedTest>> tests;
tests.push_back(std::make_unique<TimedTestMemCpy>("memcpy"));
...

Use composition instead of inheritance

The problem with your approach of using inheritance is that you have to make a new class for every test case. And the only thing they all add is a single function. Instead of inheriting, you could make TimedTest a non-virtual class, and have it store the function as a member variable of type std::function:

class TimedTest
{
    // A type alias to avoid repeating the full type
    using Function = std::function<void(int*, int*, std::size_t)>;

public:
    TimedTest(const std::string& name, Function test)
    : name{ name }, test{ test }
    {
    }
...
private:
    std::string name;
    Function test;
};

Note that your run() function doesn't have to be changed at all! And since the class is now no longer virtual, you don't need to store pointers to it in the vector tests, and can instead write:

static void test_memcpy(int* dst, int* src, std::size_t count) {
    memcpy(dst, src, sizeof(*dst) * count);
}
...
std::vector<TimedTest> tests = {
 {"memcpy", test_memcpy},
 ...
};

Avoid using classes at all

Even better, as user673679 and JDługosz mentioned, there is no need to have a class TimedTest at all: you can just write a function that takes another function as an argument and runs it in a loop. You would still interleave the tests, running each once per round, but you'd only accumulate the results while running and calculate all the averages at the end (see the sketch below).
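A minimal sketch of that shape (the NamedTest struct and the smaller buffer size here are illustrative, not from the answers above):

#include <chrono>
#include <cstddef>
#include <cstring>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct NamedTest
{
    std::string name;
    std::function<void(int*, int*, std::size_t)> fn;
    std::chrono::steady_clock::duration total{};  // accumulated over all rounds
};

int main()
{
    constexpr std::size_t length = 1'000'000;  // smaller than the original, for illustration
    std::vector<int> src(length, 1), dst(length);

    std::vector<NamedTest> tests = {
        { "memcpy", [](int* d, int* s, std::size_t n) { std::memcpy(d, s, n * sizeof *d); } },
        { "simpleLoop", [](int* d, int* s, std::size_t n) { for (std::size_t i = 0; i < n; i++) d[i] = s[i]; } },
    };

    constexpr int rounds = 100;
    for (int round = 0; round < rounds; round++)
        for (auto& t : tests)  // still interleaved: one run per test per round
        {
            auto begin = std::chrono::steady_clock::now();
            t.fn(dst.data(), src.data(), length);
            t.total += std::chrono::steady_clock::now() - begin;
        }

    for (const auto& t : tests)  // averages computed once, at the end
        std::cout << t.name << ": "
                  << std::chrono::duration<double, std::milli>(t.total).count() / rounds
                  << " ms average\n";
}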

Use auto and using to avoid writing long type names

Instead of writing out full type names like in this line of code:

std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();

I recommend you use using to create a type alias for the clock type you want to use, and auto to avoid having to specify types at all where appropriate. The above then becomes:

using clock = std::chrono::steady_clock;
auto start = clock::now();
...
auto stop = clock::now();
answered Oct 19, 2021 at 17:59

avoid manual memory management

int* src = new int[length];
for (int i = 0; i < length; i++)
    src[i] = i;
int* dst = new int[length];

We could use two std::vector<int>s for this and avoid the manual memory management. Pointers to the underlying storage can be obtained using the .data() member function.
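For example (a sketch using the original variable names):

std::vector<int> src(length);
std::vector<int> dst(length);
// ...
test->run(dst.data(), src.data(), length);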

std::vector<TimedTest*> tests;

Similarly we should use std::vector<std::unique_ptr<TimedTest>> here so we don't have to manually delete them.


add a virtual destructor

We're deleting classes through a base-class pointer (TimedTest*), so we must have a virtual destructor to ensure proper cleanup.
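That is a one-line addition to the base class:

class TimedTest
{
public:
    // Ensures `delete` through a TimedTest* also runs the derived destructor.
    virtual ~TimedTest() = default;
    // ...
};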


use std::iota

for (int i = 0; i < length; i++)
    src[i] = i;

We could use std::iota in the <numeric> header for this.
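With the vectors suggested above, that becomes:

#include <numeric>
// ...
std::iota(src.begin(), src.end(), 0);  // src = 0, 1, 2, ...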


preserve timing accuracy

(double)std::chrono::duration_cast<std::chrono::milliseconds> (end - begin).count();

std::chrono::milliseconds is an integer type. So doing this loses a lot of accuracy.

We should accumulate time in a std::chrono::steady_clock::duration (which is likely nanoseconds) member, and then do the conversion to milliseconds (or whatever) once after all the test runs are complete.

If we want to output a floating point result at the end, we can use something like std::chrono::duration<double, std::milli>(total).count() on the final duration (std::chrono::milliseconds itself is not a template, so std::chrono::milliseconds<double> doesn't exist).
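A sketch of what run() and the timing members could look like after that change (adapted from the original class, other details elided):

class TimedTest
{
public:
    void run(int* dst, int* src, std::size_t count)
    {
        auto begin = std::chrono::steady_clock::now();
        test(dst, src, count);
        total += std::chrono::steady_clock::now() - begin;  // no rounding here
        testCount += 1;
    }

    double average_ms() const
    {
        // Convert to floating-point milliseconds only when reporting.
        return std::chrono::duration<double, std::milli>(total).count() / testCount;
    }

    // ...
    std::chrono::steady_clock::duration total{};  // native tick resolution
    int testCount = 0;
};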


simplify

class TimedTest;

This class (and its children) isn't really necessary. The timing members don't need to exist before the test is run, and are not used again after output. They should be local variables in the test-running loop.

Using a virtual function for every test call may also add overhead to the functions being tested. We could instead write a template function to run the tests a set number of times and return timings. Perhaps something like:

template<class F>
std::chrono::steady_clock::duration run_test(std::size_t num_runs, F test_func)
{
    auto begin = std::chrono::steady_clock::now();

    for (auto i = std::size_t{ 0 }; i != num_runs; ++i)
        test_func();

    auto end = std::chrono::steady_clock::now();

    return (end - begin);
}

That could be called with a lambda like so: run_test(100, [&] () { test_memcpy(dst.data(), src.data(), length); }), which should be easy for the compiler to optimize.


misc. other things:

  • In C++ we should technically use std::memcpy, not just memcpy (because that's where the <cstring> header declares it and the other might not exist).

  • It might affect the timings to share source and destination memory between the different tests. Each one should perhaps do its own allocation and cleanup (outside the timed section of code).

  • If we don't want to roll our own, we could use Google Benchmark for making these kinds of measurements, which can also be used online via Quick Bench.

answered Oct 19, 2021 at 17:59
  • int* dst = new int[length]; is different from std::vector<int> dst(length) in that it won't have touched the memory pages to zero them. So if you wanted to time the page faults on first access to the new memory, you need to use new, or some allocator tricks to get std::vector to keep its hands off the memory instead of constructing each element. Especially if you do a new alloc for each test! See Why is iterating though `std::vector` faster than iterating though `std::array`? on SO for a microbenchmark that fell into that trap. (Commented Oct 21, 2021 at 0:28)
  • It might affect the timings to share source and destination memory between the different tests. - Yes, in a good way, if you want to time the case where memory is pre-faulted, TLB is hot, and (depending on copy size) data is hot in some level of cache. Copying to untouched memory could amplify the benefit of OpenMP doing page faults in parallel on multiple cores, especially on a desktop CPU where a single core can nearly saturate DRAM bandwidth. (Unlike on a big Xeon where single-core BW is actually lower, but aggregate is higher with 6 or 8 memory controller channels.) (Commented Oct 21, 2021 at 0:33)
  • @PeterCordes Good points. I guess std::unique_ptr<int[]> would be the thing to use for dst then. (Commented Oct 21, 2021 at 7:29)
  • Well, if you want to test fresh-allocation write performance, i.e. page faults more so than memory bandwidth, then yeah maybe. (Commented Oct 21, 2021 at 7:34)

I think your approach is overcomplicated. You don't need to create and free objects or maintain a collection of them. You just have several implementations that share the same signature. So give each implementation a different function name, and write:

test ("memcpy", &TestMemCpy);
test ("simple loop", &TestSimpleLoop);
 // etc.

This makes the other issues, like using a naked new and having to explicitly code a delete loop, just go away.

Likewise, your src and dst memory does not need to be allocated on the heap. They could just be global variables declared as arrays.
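Putting those two ideas together, a minimal sketch (the test helper and the smaller array size are illustrative):

#include <chrono>
#include <cstddef>
#include <cstring>
#include <iostream>

constexpr std::size_t length = 10'000'000;  // the original used 200'000'000
static int src[length];
static int dst[length];

static void TestMemCpy() { std::memcpy(dst, src, sizeof src); }
static void TestSimpleLoop() { for (std::size_t i = 0; i < length; i++) dst[i] = src[i]; }

static void test(const char* name, void (*fn)())
{
    auto begin = std::chrono::steady_clock::now();
    fn();
    auto end = std::chrono::steady_clock::now();
    std::cout << name << ": "
              << std::chrono::duration<double, std::milli>(end - begin).count() << " ms\n";
}

int main()
{
    test("memcpy", &TestMemCpy);
    test("simple loop", &TestSimpleLoop);
}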

I'm worried that the benchmark won't be timing things precisely enough to notice anything useful. At the very least, a single run carries jitter and error that you won't know about. Normal benchmarking frameworks run the thing many times. You're running through all the tests 100 times, rather than running each test 100 times before moving on to the next. The latter is better because of code cache issues.

about memory copying in particular

As the conversation in the comments points out, benchmarking memory copying in particular has its own issues unrelated to the benchmarking framework.

Look at your functions under Compiler Explorer, with optimizations turned on. You can comment out the push_back lines and un-comment them one at a time, to clearly see the generated code for each option.

The compiler generates pretty much the same code for all of those! Memory copying is something it watches out for, and once the compiler understands what you meant, it throws away what you wrote and puts in optimal code for that result. "Optimal" is based on various flags you can use to control it, including target architecture.

With compilers like this, you should focus on clearly expressing your intent rather than imagining that the assembly language is a rough line-by-line translation of what you wrote.

This means using memcpy or std::copy. Since these are standard, the compiler can have built-in advanced understanding of what they do. In real code you work with typed objects, not raw bytes, so always use std::copy. Note that std::copy and related functions can also be given an Execution Policy argument, so they can be parallelized as with your last example.
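For example (standard since C++17; with libstdc++, the parallel policies also require linking against TBB):

#include <algorithm>
#include <execution>

// Parallel, vectorization-friendly counterpart of the OpenMP loop.
std::copy(std::execution::par_unseq, src, src + count, dst);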

answered Oct 19, 2021 at 17:59
  • The latter is better because of code cache issues. - That's assuming you want to test the case where the copy loop itself is already hot in I-cache, and branch-predictors are trained for that copy size. I think in most real code, it's more normal to do one large copy between a bunch of other work. So depending what you want to measure, it might or might not be more realistic to alternate between tests. OTOH this is pretty synthetic anyway, and hopefully (with -O2 or -O3) most of them just compile to a call to the same libc memcpy function anyway. (Commented Oct 21, 2021 at 0:36)
  • Right; the real fastest code will copy (on x86-64 chips) 128 bits at a time. It may take pains to align the reads even if that mis-aligns the writes. Future architectures may benefit from copying 256 or 512 bits at a time, and recompiling with a newer compiler will take care of that without touching my source code. (Commented Oct 21, 2021 at 14:20)
  • You're talking about DRAM bandwidth? Current CPUs benefit from 256-bit vectors when data is hot in L2 cache, or maybe even L3 cache. Maybe even a small benefit on DRAM bandwidth from the same number of in-flight loads/stores looking twice as far ahead for triggering HW prefetch from the next physical page. Note that if the compiler doesn't inline memcpy, then you don't even need to recompile. glibc dispatches to __memmove_avx_erms when resolving memcpy at dynamic link time on CPUs that have those features, so you already benefit from CPU features for large or variable-size copies. (Commented Oct 21, 2021 at 20:42)
  • @PeterCordes I've observed gcc generate 128-bit copying when the -march specified a processor that had AVX-512. The answer I found was that it wasn't any faster but adding another word size to deal with the alignment and total copy size logic increased code size and added cycles. I'm sure CPU generations keep on getting better. (Commented Oct 21, 2021 at 22:21)
  • Ok yes, if you're talking about inlining memcpy for small copies, then the decision is up to gcc rather than glibc, and yes, it's fixed at compile time. Sometimes GCC's inline expansion of memcpy / memset (and sometimes other string functions) works differently from auto-vectorization's -mprefer-vector-width=256 default on most recent CPUs. I think that sounds familiar. (Commented Oct 21, 2021 at 22:24)

When running your code (after changing from millisecond to native resolution on the timing) I get the following:

clang++ -O3
Test# memcpy Avg std::copy Avg simpleLoop Avg pointerCopy Avg OMPCopy Avg
 0 626.327 626 40.8343 40 59.0428 59 58.9443 58 58.962 58
 1 41.0973 333.5 39.1295 39.5 59.3185 59 59.0245 58.5 59.0766 59
...
 99 41.1162 41.17 39.8022 39.46 57.1298 57.08 57.4694 57.16 57.1774 57.26

Notice anything odd? The first test run is 15 times slower than the next one... this is because that test is the first to touch the newly allocated memory. This has to do with how the OS (Win10 in this case) allocates memory: it just gives you a range of virtual memory when you allocate. The first time you touch that memory, the OS generates a page fault and performs the physical allocation for it. That takes a hot minute, and you see it in the test.

There are two ways you can solve this:

Specific solution

We know this is caused by page allocation. In the loop where you prime the source vector, also write something to the destination vector to make sure it's paged in.
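For example, in the initialisation loop (a sketch based on the original code):

for (int i = 0; i < (int)length; i++)
{
    src[i] = i;
    dst[i] = 0;  // touch every destination page so it is faulted in before the timed runs
}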

Generic solution

There are other causes that may result in the first, last or a random run being notably faster or slower than the rest. In statistics these would be called outliers. A good benchmarking suite will remove the n slowest and k fastest runs from the stats for each function under test. This removes the effects of hot/cold caches at the start, page faults, etc.

Danger Will Robinson!

All the inputs to your program are deterministic, and the copied data is never actually used. Under the as-if rule, a C++ compiler (and a C compiler, for that matter) is allowed to completely elide everything in this program (including the memory allocations!) except printing the output table, if it can determine that the elision doesn't change the observable output.

In this case you could argue that the times would be vastly different if the compiler did that; but by that argument the compiler would be prevented from doing ANY optimizations at all as soon as you did anything time-related, and that doesn't hold. In fact, I've seen clang do exactly this: I had a benchmark where it removed my entire code under test (memory allocations and all!) and replaced it with returning a constant instead of running my code.

In your case, I'd make both the destination and source pointers point to volatile int. This forces the compiler to perform the copies, but I'm not sure whether it also forbids some forms of optimization (such as SIMD vectorization).
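For the loop-based tests that would look something like this (a sketch; note that memcpy and std::copy can't take volatile pointers directly):

void test_simple_loop(volatile int* dst, const volatile int* src, std::size_t count)
{
    // Every load and store is now an observable side effect the compiler must keep.
    for (std::size_t i = 0; i < count; i++)
        dst[i] = src[i];
}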

TL;DR: it's hard to create a good benchmark; use a library.

Performance

I'm surprised to see that pointerCopy and simpleLoop are slower than memcpy; typically, compilers will detect simple copy loops and replace them with calls to memcpy. But then I remembered that little detail about pointer aliasing... and the C restrict keyword that tells the compiler: "Look, these pointers don't alias, ok?". It's technically nonstandard, but all C++ compilers that matter support some form of restrict for C++. I used __restrict__ for the pointer arguments in the test functions.
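The change amounts to qualifying the pointer parameters, roughly like this (a sketch; __restrict__ is the GCC/Clang spelling, MSVC spells it __restrict):

void copy_simple(int* __restrict__ dst, const int* __restrict__ src, std::size_t count)
{
    // The compiler may now assume dst and src never overlap.
    for (std::size_t i = 0; i < count; i++)
        dst[i] = src[i];
}

With that change, this is what I got: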

clang++ -O3
Test# memcpy Avg std::copy Avg simpleLoop Avg pointerCopy Avg OMPCopy Avg
 0 41.0359 41 39.2272 39 39.1399 39 56.4088 56 39.4728 39
...
 99 40.8882 41.38 40.5645 39.7 39.9101 39.59 56.9585 57.48 39.4448 39.6

Notice how simpleLoop and OMPCopy now match std::copy? This is possible because the compiler can generate much more efficient code when it knows that the source and destination memory don't overlap (alias). I'm not completely sure, but I believe the library functions do not have restrict-qualified overloads (in fact, I don't believe that's even possible), so they internally determine, with some pointer math, whether the source and destination alias; if they don't, they run a fast non-aliasing copy, and if they do, they take another, slower code path.

I think this additional math and branching is the slight edge that simpleLoop now has over std::copy (I'm not sure why memcpy is a full 1.5 ms slower than std::copy). I believe pointerCopy is still slower because of the way you increment: I've seen post-increment come out slower than pre-increment before. You could probably make it as fast as simpleLoop, but I don't think it's worth the effort.

At the end of the day, this reinforces the old mantra: write clear code, give the compiler as much information as possible, and let it do its job.

For fun I also ran with

g++ -O3
Test# memcpy Avg std::copy Avg simpleLoop Avg pointerCopy Avg OMPCopy Avg
 0 41.2107 41 39.4616 39 39.2258 39 39.1013 39 39.4802 39
...
 99 40.8445 41.01 39.2478 39.43 39.1988 39.32 39.1874 39.27 39.1505 39.25

which interestingly generates better code for pointerCopy. I also ran with:

clang++ -O3 -march=native
Test# memcpy Avg std::copy Avg simpleLoop Avg pointerCopy Avg OMPCopy Avg
 0 41.0537 41 39.3157 39 39.2856 39 56.9359 56 39.5525 39
...
 99 40.8602 41 39.1795 39.34 39.1104 39.31 56.5426 56.8 39.738 39.29

where -march=native made slight improvements.

answered Oct 23, 2021 at 12:17
  • Thanks for the insightful observations! Regarding the pointer copy, I have noticed that with gcc -O3 testB is much faster than testA (TIO). With MSVC and the Intel compiler it does not look like it makes a difference. (Commented Oct 23, 2021 at 18:06)
