While learning C, I decided for fun to put some code into Compiler Explorer and compare the generated assembly with the C source. Here's a minimal example:
unsigned char count[256][256];

int main()
{
    for (int i = 0; i < 256; ++i)
        for (int j = 0; j < 256; ++j)
            count[i][j] = 0;
}
-Os turned this into rep stosd, which makes sense: ecx = 16384 dwords of 4 bytes each covers the whole 65536-byte array.
mov edx, OFFSET FLAT:count
xor eax, eax
mov ecx, 16384
mov rdi, rdx
rep stosd
But -O2 turned it into a memset call?
mov edx, 65536
xor esi, esi
mov edi, OFFSET FLAT:count
call memset
Isn't calling memset slower than just inlining the instructions? Or does it depend on how memset is optimized for each architecture?
This is glibc's string/memset.c, which I assume is what will be called.
/* Copyright (C) 1991-2024 Free Software Foundation, Inc.
This file is part of the GNU C Library.
The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
The GNU C Library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with the GNU C Library; if not, see
<https://www.gnu.org/licenses/>. */
#include <string.h>
#include <memcopy.h>

#ifndef MEMSET
# define MEMSET memset
#endif

void *
inhibit_loop_to_libcall
MEMSET (void *dstpp, int c, size_t len)
{
  long int dstp = (long int) dstpp;

  if (len >= 8)
    {
      size_t xlen;
      op_t cccc;

      cccc = (unsigned char) c;
      cccc |= cccc << 8;
      cccc |= cccc << 16;
      if (OPSIZ > 4)
        /* Do the shift in two steps to avoid warning if long has 32 bits.  */
        cccc |= (cccc << 16) << 16;

      /* There are at least some bytes to set.
         No need to test for LEN == 0 in this alignment loop.  */
      while (dstp % OPSIZ != 0)
        {
          ((byte *) dstp)[0] = c;
          dstp += 1;
          len -= 1;
        }

      /* Write 8 `op_t' per iteration until less than 8 `op_t' remain.  */
      xlen = len / (OPSIZ * 8);
      while (xlen > 0)
        {
          ((op_t *) dstp)[0] = cccc;
          ((op_t *) dstp)[1] = cccc;
          ((op_t *) dstp)[2] = cccc;
          ((op_t *) dstp)[3] = cccc;
          ((op_t *) dstp)[4] = cccc;
          ((op_t *) dstp)[5] = cccc;
          ((op_t *) dstp)[6] = cccc;
          ((op_t *) dstp)[7] = cccc;
          dstp += 8 * OPSIZ;
          xlen -= 1;
        }
      len %= OPSIZ * 8;

      /* Write 1 `op_t' per iteration until less than OPSIZ bytes remain.  */
      xlen = len / OPSIZ;
      while (xlen > 0)
        {
          ((op_t *) dstp)[0] = cccc;
          dstp += OPSIZ;
          xlen -= 1;
        }
      len %= OPSIZ;
    }

  /* Write the last few bytes.  */
  while (len > 0)
    {
      ((byte *) dstp)[0] = c;
      dstp += 1;
      len -= 1;
    }

  return dstpp;
}
libc_hidden_builtin_def (MEMSET)
1 Answer
Theoretically, memset can be very efficient: it is often a highly optimized, hand-written routine that uses all the available processor features. Depending on the architecture, the difference can be negligible or substantial, and memset can turn out either slower or faster than the compiler-generated inline code.
Test code:
#include <stdio.h>
#include <stddef.h>
#include <time.h>

unsigned char count[256][256];

void __attribute__((noinline)) foo(unsigned char (*count)[256]);

#if defined(NOO2)
#define FORMAT "-Os = %f\n"
#else
#define FORMAT "-O2 = %f\n"
#endif

int main()
{
    clock_t begin = clock();

    for (unsigned x = 0; x < 3000000; x++)
        foo(count);

    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf(FORMAT, time_spent);
}

#if !defined(NOO2)
#pragma GCC optimize ("O2")
#endif
void __attribute__((noinline)) foo(unsigned char (*count)[256])
{
    for (int i = 0; i < 256; ++i)
        for (int j = 0; j < 256; ++j)
            count[i][j] = 0;
}
64-bit x86 - https://godbolt.org/z/q78Yj9W8j
Execution times are: -O2 = 3.123006 & -Os = 3.120289
32-bit - https://godbolt.org/z/cG94Wrzn1
Execution times are: -O2 = 3.630126 & -Os = 2.331193
Here memset is much slower!
EDIT
This is only to show that the results may vary from machine to machine and with the build configuration.
2 Comments
I see call memset once I change the executor to match the asm output window, since you used a separate "executor" window but didn't sync the compiler options. (I usually use the "execute the code" dropdown: godbolt.org/z/bYM89dr9c.) And the 32-bit link has 3.09 s for call memset, or 3.6 seconds after clicking refresh on the executor.
memset implementations are highly optimized and may beat stosd. They may use SIMD instructions to set many more bytes at a time than stosd. rep stosd has fast-strings microcode (since Pentium Pro) that uses wider stores. In the best case (large, sufficiently-aligned buffer) it's usually nearly as fast as a SIMD loop (see Enhanced REP MOVSB for memcpy). The main downside is startup overhead and bigger misalignment penalties. (On some CPUs the microcode isn't as well tuned if they didn't get around to updating it after architectural changes, like perhaps even after widening the store pathways.) Also note that string/memset.c is only the fallback for architectures that don't have a hand-written asm memset. Instead look at codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/… / codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/… for CPUs that support AVX-512 masked stores. (Dispatch is done at dynamic-link time.) Related: Why does glibc's strlen need to be so complicated to run quickly? - again someone looking at the fallback portable C.