I work a lot with byte buffers and have to extract different parts of them. In this example it's 4 bytes, but the width ranges from a single bit to 128 bits. Speed is the most important metric here. See the code for a MWE. I'd like to know if there is a better way.
#include <stddef.h> /* for size_t */
#include <stdint.h>

static uint32_t get_data(const uint8_t *buf, size_t off)
{
    return ((uint32_t)(buf[off + 0]) << 24) +
           ((uint32_t)(buf[off + 1]) << 16) +
           ((uint32_t)(buf[off + 2]) << 8) +
           ((uint32_t)(buf[off + 3]));
}

int main(void)
{
    uint8_t buf[128];

    /* get some example data */
    for (uint8_t i = 0; i < 128; ++i)
        buf[i] = i;

    /* we want the data from offset 10 as a uint32_t */
    uint32_t res = get_data(buf, 10);
    (void)res; /* silence unused-variable warnings in this MWE */
}
- Do you mean bit or byte at the end? – Deduplicator, Jan 11, 2018 at 19:55
- I used both bit and byte in my question intentionally, although I usually work with bytes. But in rare cases I also need to know the value of individual bits, hence the range 1 to 128. – flowit, Jan 11, 2018 at 20:01
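Since widths down to a single bit come up (per the comment above), here is a minimal sketch of a bit getter in the same style as get_data(); the MSB-first bit numbering is my assumption, not something stated in the question:

#include <stddef.h>
#include <stdint.h>

/* Sketch: read the bit at bit offset bit_off, counting bits MSB-first
 * within each byte (bit 0 = top bit of buf[0]). Adjust the shift if
 * your format numbers bits LSB-first. */
static unsigned get_bit(const uint8_t *buf, size_t bit_off)
{
    return (buf[bit_off / 8] >> (7 - bit_off % 8)) & 1u;
}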
2 Answers
Since you want low-level operations, I'd suggest memmove:
#include <time.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t get_data(const uint8_t *buf, size_t off)
{
    return ((uint32_t)(buf[off + 0]) << 24) +
           ((uint32_t)(buf[off + 1]) << 16) +
           ((uint32_t)(buf[off + 2]) << 8) +
           ((uint32_t)(buf[off + 3]));
}

int main(void)
{
    uint8_t buf[128];

    /* get some example data */
    for (uint8_t i = 0; i < 128; ++i)
        buf[i] = i;

    /* time 10,000 memmove copies */
    clock_t t = clock();
    uint32_t res;
    for (int i = 0; i < 10000; i++)
        memmove(&res, buf + 10, sizeof(uint32_t));
    t = clock() - t;
    printf("Time %f\n", (double)t / CLOCKS_PER_SEC);

    /* time 10,000 calls of the original shift version */
    t = clock();
    for (int i = 0; i < 10000; i++)
        res = get_data(buf, 10);
    t = clock() - t;
    printf("Time %f\n", (double)t / CLOCKS_PER_SEC);
}
Because a single copy doesn't show any measurable difference, I timed 10,000 iterations; my results were:

Time 0.000049
Time 0.000090

memmove is almost twice as fast here.
EDIT 1: As mentioned in the comments, memcpy is a viable alternative to memmove here, since source and destination don't overlap.
EDIT 2: The speed difference in this example cannot be observed with the -O flag, as the compiler then collapses the loop to a single copy.
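A hedged way to keep the comparison meaningful under optimization (my suggestion, not part of the measurements above) is to write each result to a volatile sink and vary the offset, so the optimizer can neither delete the loop nor hoist the copy out of it. A minimal sketch, not a rigorous benchmark harness:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

volatile uint32_t sink; /* volatile: stores to it must actually happen */

int main(void)
{
    uint8_t buf[128];
    for (int i = 0; i < 128; ++i)
        buf[i] = (uint8_t)i;

    clock_t t = clock();
    for (int i = 0; i < 10000; i++) {
        uint32_t res;
        /* varying the offset keeps the copy from being hoisted out */
        memcpy(&res, buf + (i & 63), sizeof res);
        sink = res;
    }
    t = clock() - t;
    printf("Time %f\n", (double)t / CLOCKS_PER_SEC);
}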
- Might want to compare with memcpy, as the buffers don't overlap. – Deduplicator, Jan 11, 2018 at 19:56
- Tried this with clang, from -O0 to -O3. I see absolutely no difference between manual bit shifting and memmove/memcpy. Anyway, as it is not slower, memcpy is much shorter to write and more readable. – flowit, Jan 11, 2018 at 20:20
- I think it's because -O3 understands the silly for loop and executes it only one time, so as I wrote, with one copy you can't tell the difference. With 1 billion iterations in the for loop it takes 4 seconds without -O and 0.00001 sec with, so I guess that's it. – Konstantinoscs, Jan 11, 2018 at 20:24
- There is a difference between the OP's code and memmove/memcpy: the result with the latter depends on the system's byte order, whereas the OP's result doesn't. – Cris Luengo, Jan 11, 2018 at 21:05
- memmove to a uint32_t simply gets optimized away to mov eax, DWORD PTR [rdi+rsi] on GCC 7.2 (-O3). – D. Jurcau, Jan 11, 2018 at 21:15
I'd like to know if there is a better way.

uint32_t res = get_data(buf, 10); and get_data() itself are a good first step, as they are 1) functionally correct and 2) highly portable.

Any "better" solution should use this as the baseline against which to compare/profile.
The next performance step involves some assumptions. If the uint32_t is of the expected endianness, then a simple memcpy() will work in lieu of get_data():

    memcpy(&res, buf + 10, sizeof res);
Although this may look like a function call, a worthy compiler "understands" memcpy()
and can replace this with efficient in-line emitted code. Let your good compiler do its job - or get a better compiler.
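As a minimal sketch, the memcpy approach wrapped as a drop-in helper; note, per the comment thread on the other answer, that its result is in host byte order, unlike get_data():

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: native-order load. Unlike get_data(), the result depends on
 * the host's endianness, so only use this when the buffer holds values
 * already in native byte order. */
static uint32_t get_data_native(const uint8_t *buf, size_t off)
{
    uint32_t v;
    memcpy(&v, buf + off, sizeof v);
    return v;
}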
If code "knows" res and
refdo not overflow
memcpy()is faster, or as fast as memmove()
. IAC, a good compiler replaces either of these with in-line code for such small sizeof ref
copies. mox nix
Soapbox: Overall, the core issue with this kind of micro-optimization is that it is unlikely to be a good investment of coding expense/effort. Spend time writing good code without employing tricks. Real efficiency improvement comes from higher-level choices than this, which can vary from implementation to implementation. You may code something faster on a select platform, only to find it slower on the next, as the big-O is the same.