I work a lot with byte buffers and have to extract different parts of them. In this example it's 4 bytes, but the width ranges from a single bit to 128 bits. Speed is the most important metric here. See the code for a MWE. I'd like to know if there is a better way.
#include <stddef.h> /* for size_t */
#include <stdint.h>

static uint32_t get_data(const uint8_t *buf, size_t off)
{
    return ((uint32_t)(buf[off + 0]) << 24) +
           ((uint32_t)(buf[off + 1]) << 16) +
           ((uint32_t)(buf[off + 2]) << 8) +
           ((uint32_t)(buf[off + 3]));
}

int main(void)
{
    uint8_t buf[128];

    /* get some example data */
    for (uint8_t i = 0; i < 128; ++i)
        buf[i] = i;

    /* we want the data from offset 10 as a uint32_t */
    uint32_t res = get_data(buf, 10);
    (void)res; /* silence unused-variable warnings in this MWE */
}
- Do you mean bit or byte at the end? – Deduplicator, Jan 11, 2018 at 19:55
- I used both bit and byte in my question intentionally, although I usually work with bytes. But in rare cases I also need to know the value of individual bits, hence the range 1 to 128. – flowit, Jan 11, 2018 at 20:01
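Since widths down to a single bit come up (per the comment above), here is a minimal sketch of a bit getter in the same style as get_data(); the MSB-first bit numbering is my assumption, not something stated in the question:

#include <stddef.h>
#include <stdint.h>

/* Sketch: read the bit at bit offset bit_off, counting bits MSB-first
 * within each byte (bit 0 = top bit of buf[0]). Adjust the shift if
 * your format numbers bits LSB-first. */
static unsigned get_bit(const uint8_t *buf, size_t bit_off)
{
    return (buf[bit_off / 8] >> (7 - bit_off % 8)) & 1u;
}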
2 Answers
Since you want low-level operations, I'd suggest memmove:
#include <time.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t get_data(const uint8_t *buf, size_t off)
{
    return ((uint32_t)(buf[off + 0]) << 24) +
           ((uint32_t)(buf[off + 1]) << 16) +
           ((uint32_t)(buf[off + 2]) << 8) +
           ((uint32_t)(buf[off + 3]));
}

int main(void)
{
    uint8_t buf[128];

    /* get some example data */
    for (uint8_t i = 0; i < 128; ++i)
        buf[i] = i;

    /* time 10,000 memmove copies */
    clock_t t = clock();
    uint32_t res;
    for (int i = 0; i < 10000; i++)
        memmove(&res, buf + 10, sizeof(uint32_t));
    t = clock() - t;
    printf("Time %f\n", (double)t / CLOCKS_PER_SEC);

    /* time 10,000 calls of the original shift version */
    t = clock();
    for (int i = 0; i < 10000; i++)
        res = get_data(buf, 10);
    t = clock() - t;
    printf("Time %f\n", (double)t / CLOCKS_PER_SEC);
}
Because a single copy doesn't show any measurable difference, I timed 10,000 iterations; my results were:

Time 0.000049
Time 0.000090

memmove is almost twice as fast here.
EDIT 1: As mentioned in the comments, memcpy is a viable alternative to memmove here, since source and destination don't overlap.
EDIT 2: The speed difference in this example cannot be observed with the -O flag, as the compiler then collapses the loop to a single copy.
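A hedged way to keep the comparison meaningful under optimization (my suggestion, not part of the measurements above) is to write each result to a volatile sink and vary the offset, so the optimizer can neither delete the loop nor hoist the copy out of it. A minimal sketch, not a rigorous benchmark harness:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

volatile uint32_t sink; /* volatile: stores to it must actually happen */

int main(void)
{
    uint8_t buf[128];
    for (int i = 0; i < 128; ++i)
        buf[i] = (uint8_t)i;

    clock_t t = clock();
    for (int i = 0; i < 10000; i++) {
        uint32_t res;
        /* varying the offset keeps the copy from being hoisted out */
        memcpy(&res, buf + (i & 63), sizeof res);
        sink = res;
    }
    t = clock() - t;
    printf("Time %f\n", (double)t / CLOCKS_PER_SEC);
}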
- Might want to compare with memcpy, as the buffers don't overlap. – Deduplicator, Jan 11, 2018 at 19:56
- Tried this with clang, from -O0 to -O3. I see absolutely no difference between manual bit shifting and memmove/memcpy. Anyway, as it is not slower, memcpy is much shorter to write and more readable. – flowit, Jan 11, 2018 at 20:20
- I think it's because -O3 understands the silly for loop and executes it only one time, so as I wrote, with one copy you can't tell the difference. With 1 billion iterations in the for loop it takes 4 seconds without -O and 0.00001 sec with, so I guess that's it. – Konstantinoscs, Jan 11, 2018 at 20:24
- There is a difference between the OP's code and memmove/memcpy: the result with the latter depends on the system's byte order, whereas the OP's result doesn't. – Cris Luengo, Jan 11, 2018 at 21:05
- memmove to a uint32_t simply gets optimized away to mov eax, DWORD PTR [rdi+rsi] on GCC 7.2 (-O3). – D. Jurcau, Jan 11, 2018 at 21:15
I'd like to know if there is a better way.

uint32_t res = get_data(buf, 10); and get_data() itself are a good first step, as they are 1) functionally correct and 2) highly portable.

Any "better" solution should use this as the baseline against which to compare/profile.
The next performance step involves some assumptions. If the uint32_t is of the expected endianness, then a simple memcpy() will work in lieu of get_data():

    memcpy(&res, buf + 10, sizeof res);
Although this may look like a function call, a worthy compiler "understands" memcpy()
and can replace this with efficient in-line emitted code. Let your good compiler do its job - or get a better compiler.
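As a minimal sketch, the memcpy approach wrapped as a drop-in helper; note, per the comment thread on the other answer, that its result is in host byte order, unlike get_data():

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: native-order load. Unlike get_data(), the result depends on
 * the host's endianness, so only use this when the buffer holds values
 * already in native byte order. */
static uint32_t get_data_native(const uint8_t *buf, size_t off)
{
    uint32_t v;
    memcpy(&v, buf + off, sizeof v);
    return v;
}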
If code "knows" res and
refdo not overflow
memcpy()is faster, or as fast as memmove()
. IAC, a good compiler replaces either of these with in-line code for such small sizeof ref
copies. mox nix
Soapbox: Overall, the core issue with this kind of micro-optimization is that it is unlikely to be a good investment of coding expense/effort. Spend time writing good code without employing tricks. Real efficiency improvement comes from higher-level choices than this, which can vary from implementation to implementation. You may code something faster on a select platform, only to find it slower on the next, as the big-O is the same.