How is std::vector faster than a plain array?

Question 1

I stumbled across this while benchmarking a circular buffer. Can anyone explain how a std::vector manages to outperform a plain array in this instance?

#include <iostream>
#include <vector>
struct uint_pair {
 unsigned int a, b;
 uint_pair (unsigned int x = 0, unsigned int y = 0) : a(x), b(y) {}
};
struct container {
 unsigned int pos;
#ifdef USE_VECTOR
 std::vector<uint_pair> data;
 container() : pos(0) { data.resize(16); }
#else
 uint_pair data[16];
 container() : pos(0) {}
#endif
 void add(uint_pair val) {
 data[++pos % 16] = val;
 }
};
int main() {
 container c;
 for (unsigned int i = 0; i < 1000000000; i++) c.add(uint_pair{i, i});
 std::cout << c.data[0].a << " " << c.data[0].b << std::endl;
}

These are the results I'm getting using GCC (similar with Clang):

g++ -o bench -std=c++0x -Os main.cpp -D'USE_VECTOR'
real 0m8.757s
user 0m8.750s
sys 0m0.002s
g++ -o bench -std=c++0x -Os main.cpp
real 0m9.215s
user 0m9.209s
sys 0m0.002s

Question 2

Could be just the way allocations line up with other data in the cache. P.S. you want resize not reserve.

Question 3

@MarkRansom Thanks, updated the code. Results still hold though.

Question 4

GCC 4.8 brings an even larger difference. I see 0.6s for vector and 1.8s for array. Optimization level matters, -O3 gets 0.9s for the vector.

Question 5

There was a half-right answer that was just deleted but that had a good observation. The array elements are in your container struct, while the vector elements are somewhere else and accessed through a pointer. That has some effect on optimization as well. You can narrow the gap by changing your array to a uint_pair* and allocate it in the constructor. In my test that doubles its performance, but it's still a little slower than a vector.

Question 6

Tried it with another compiler, the disassembly showed the problem. Issue is that the array is stored in the same class of memory as the loop variable, the compiler is conservative about aliasing assumptions and treated the loop variable volatile. Not an issue with vector, it is stored on the heap instead of the stack. Using uint_pair* data instead of uint_pair data[16] eliminated the difference.

Question 7

Here's how you can eliminate the difference. Instead of your add, use a function like this:

void set(unsigned int x, unsigned int y) {
 ++pos;
 data[pos % 16].a = x;
 data[pos % 16].b = y;
}

called like this:

for (unsigned int i = 0; i < 1000000000; i++) c.set(i, i);

This does exactly the same thing as yours, but it avoids semantically creating a temporary object. It looks like when you use a vector the compiler is better able to optimize out the temporary.

$ g++-4.8 -o bench -std=c++11 -Os main.cpp -DUSE_VECTOR
$ time ./bench 
999999999 999999999
real 0m0.635s
user 0m0.630s
sys 0m0.002s
$ g++-4.8 -o bench -std=c++11 -Os main.cpp
$ time ./bench 
999999999 999999999
real 0m0.644s
user 0m0.639s
sys 0m0.002s

On my machine both the set and add methods yield identical performance with vectors. Only the array shows a difference. To further lend credence to optimization, if you compile with -O0 then the array method is slightly faster (but both over 10x slower than with -Os).

This doesn't explain why the compiler treats these two differently. A vector is, after all, backed by an array. Also, an std::array behaves identically to your C-style array.

Question 8

Interestingly, performance wise, std::array is more like using the C-style array than using std::vector.

Question 9

@5gon12eder correct, it's just a STL-like wrapper around a C-style array. I tried that too, and in this case it behaves just like the C-style array.

Question 10

On my machine, I observe somewhat different results. The loop for the std::vector has always 5 instructions. The array needs 7 with the OP's code but only 4 with yours, so it's even faster (also backed by timing results) than std::vector. std::array always produces identical assembly code than the C-style array. [GCC 4.9.1 20140903 (prerelease) on x86_64 GNU/Linux]

Question 11

One problem is with the placement of the "pos" member in your struct.

For the c-array, remember that it is stored contiguously in memory adjacent to your "pos" member. When data is pushed into the c-array, extra instructions have to be issued to offset into the structure past the "pos" member. However, writing to the vector makes no such restriction since its memory is located elsewhere.

To squeeze out more performance, make sure your hottest data is at the front of a cache line.

Edit:

To get the c-array to perform just as fast as the vector, the c-array must be allocated on 8 byte boundaries on a 64 bit machine. So something like:

uint_pair* data;
unsigned int pos;
container() : pos(0) {
 std::size_t bufSize = sizeof(uint_pair) * 17;
 void* p = new char[bufSize];
 p = std::align(8, sizeof(uint_pair), p, bufSize);
 data = reinterpret_cast<uint_pair*>(p);
}

With a slightly modified add function:

void add(unsigned int x, unsigned int y) {
 auto& ref = data[pos++ % 16];
 ref.a = x;
 ref.b = y;
}

The c-array now times:

real 0m0.735s
user 0m0.730s
sys 0m0.002s

And the std::vector:

real 0m0.743s
user 0m0.736s
sys 0m0.004s

The standard library implementers are pulling out all the stops for you :)

Question 12

Your claim is that the problem is the memory alignment, but you don't show that. You used a similar add function that I did, and I show that that change alone erases the performance difference. So the alignment changes have no effect at all (in other words, the compiler already took care of that).

Question 13

You are right, there is a difference between an array that's built into the structure and one accessed through a pointer. But that does not explain the entire performance difference (see my comment on the original question). I'd also like to see some proof for your cache claim. The data totals less than 20 ints. All approaches should be in cache.

Question 14

We must be getting different results. Using your "set" or my altered "add," the performance difference between the heap-allocated c-array and the std::vector is not equal -- there is still a ~0.04s slowdown on the c-array. This difference can be completely eliminated using a properly aligned heap allocation, which is something the compiler will not do for you, by the way. Hence, both the modified "add" as well as the aligned heap allocation are necessary.

Question 15

It seems C++11 compiler generates better code for vector because of operator=(rvalue reference). Firstly, In C++03 compiler plain array twice faster than vector. Secondly, tehre are no difference if you use void set(unsigned int x, unsigned int y) suggested by Adam.

Assembler code for vector

.L49:
leal (%rdi,%rax), %esi
andl 15,ドル %esi
leaq (%rdx,%rsi,8), %rsi
movl %eax, (%rsi)
movl %eax, 4(%rsi)
incq %rax
cmpq 1000000000,ドル %rax
jne .L49

for plain array

.L3:
movl 12(%rsp), %edx
incl %edx
movl %edx, 12(%rsp)
andl 15,ドル %edx
leaq 12(%rsp,%rdx,8), %rdx
movl %eax, 4(%rdx)
movl %eax, 8(%rdx)
incl %eax
cmpl 1000000000,ドル %eax
jne .L3

Question 16

I'm not convinced that a move kicks in. First of all uint_pair declares a constructor, so it does not have a default move constructor. Second: the parameter of the operator= in the add function is an lvalue. Third: even defining a move ctor, the two unsigned members would still have to be copied.

Adam Adam 17.4k8 gold badges55 silver badges103 bronze badges · Accepted Answer · 2014-10-04 04:46:29Z

Here's how you can eliminate the difference. Instead of your add, use a function like this:

void set(unsigned int x, unsigned int y) {
 ++pos;
 data[pos % 16].a = x;
 data[pos % 16].b = y;
}

called like this:

for (unsigned int i = 0; i < 1000000000; i++) c.set(i, i);

This does exactly the same thing as yours, but it avoids semantically creating a temporary object. It looks like when you use a vector the compiler is better able to optimize out the temporary.

$ g++-4.8 -o bench -std=c++11 -Os main.cpp -DUSE_VECTOR
$ time ./bench 
999999999 999999999
real 0m0.635s
user 0m0.630s
sys 0m0.002s
$ g++-4.8 -o bench -std=c++11 -Os main.cpp
$ time ./bench 
999999999 999999999
real 0m0.644s
user 0m0.639s
sys 0m0.002s

On my machine both the set and add methods yield identical performance with vectors. Only the array shows a difference. To further lend credence to optimization, if you compile with -O0 then the array method is slightly faster (but both over 10x slower than with -Os).

This doesn't explain why the compiler treats these two differently. A vector is, after all, backed by an array. Also, an std::array behaves identically to your C-style array.

Interestingly, performance wise, std::array is more like using the C-style array than using std::vector.
@5gon12eder correct, it's just a STL-like wrapper around a C-style array. I tried that too, and in this case it behaves just like the C-style array.
On my machine, I observe somewhat different results. The loop for the std::vector has always 5 instructions. The array needs 7 with the OP's code but only 4 with yours, so it's even faster (also backed by timing results) than std::vector. std::array always produces identical assembly code than the C-style array. [GCC 4.9.1 20140903 (prerelease) on x86_64 GNU/Linux]

CollectivesTM on Stack Overflow

How is std::vector faster than a plain array?

3 Answers 3

3 Comments

3 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

CollectivesTM on Stack Overflow

3 Answers 3

3 Comments

3 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related