
For a project to learn C, I decided for fun to put the code into Compiler Explorer and compare the assembly output with the C code. Here's a minimal example:

unsigned char count[256][256];

int main()
{
    for (int i = 0; i < 256; ++i)
        for (int j = 0; j < 256; ++j)
            count[i][j] = 0;
}

-Os turned this into rep stosd, which makes sense: ecx = 16384 dword stores × 4 bytes covers the whole 65536-byte array.

 mov edx, OFFSET FLAT:count
 xor eax, eax
 mov ecx, 16384
 mov rdi, rdx
 rep stosd

But -O2 turned it into a memset call?

 mov edx, 65536
 xor esi, esi
 mov edi, OFFSET FLAT:count
 call memset

Isn't calling memset slower than just inlining the assembly instructions? Or does it depend on the memset implementation, which is optimized differently for different architectures?

This is glibc's string/memset.c, which I assume is what will be called.

/* Copyright (C) 1991-2024 Free Software Foundation, Inc.
 This file is part of the GNU C Library.
 The GNU C Library is free software; you can redistribute it and/or
 modify it under the terms of the GNU Lesser General Public
 License as published by the Free Software Foundation; either
 version 2.1 of the License, or (at your option) any later version.
 The GNU C Library is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 Lesser General Public License for more details.
 You should have received a copy of the GNU Lesser General Public
 License along with the GNU C Library; if not, see
 <https://www.gnu.org/licenses/>. */
#include <string.h>
#include <memcopy.h>
#ifndef MEMSET
# define MEMSET memset
#endif
void *
inhibit_loop_to_libcall
MEMSET (void *dstpp, int c, size_t len)
{
  long int dstp = (long int) dstpp;

  if (len >= 8)
    {
      size_t xlen;
      op_t cccc;

      cccc = (unsigned char) c;
      cccc |= cccc << 8;
      cccc |= cccc << 16;
      if (OPSIZ > 4)
        /* Do the shift in two steps to avoid warning if long has 32 bits.  */
        cccc |= (cccc << 16) << 16;

      /* There are at least some bytes to set.
         No need to test for LEN == 0 in this alignment loop.  */
      while (dstp % OPSIZ != 0)
        {
          ((byte *) dstp)[0] = c;
          dstp += 1;
          len -= 1;
        }

      /* Write 8 `op_t' per iteration until less than 8 `op_t' remain.  */
      xlen = len / (OPSIZ * 8);
      while (xlen > 0)
        {
          ((op_t *) dstp)[0] = cccc;
          ((op_t *) dstp)[1] = cccc;
          ((op_t *) dstp)[2] = cccc;
          ((op_t *) dstp)[3] = cccc;
          ((op_t *) dstp)[4] = cccc;
          ((op_t *) dstp)[5] = cccc;
          ((op_t *) dstp)[6] = cccc;
          ((op_t *) dstp)[7] = cccc;
          dstp += 8 * OPSIZ;
          xlen -= 1;
        }
      len %= OPSIZ * 8;

      /* Write 1 `op_t' per iteration until less than OPSIZ bytes remain.  */
      xlen = len / OPSIZ;
      while (xlen > 0)
        {
          ((op_t *) dstp)[0] = cccc;
          dstp += OPSIZ;
          xlen -= 1;
        }
      len %= OPSIZ;
    }

  /* Write the last few bytes.  */
  while (len > 0)
    {
      ((byte *) dstp)[0] = c;
      dstp += 1;
      len -= 1;
    }

  return dstpp;
}
libc_hidden_builtin_def (MEMSET)
asked Aug 10, 2024 at 22:37
  • Some memset implementations are highly optimized and may beat stosd. They may use SIMD instructions to set many more bytes at a time than stosd. Commented Aug 10, 2024 at 22:42
  • It might be useful to benchmark and update the question with the result. To me it's impossible to answer, as it depends on your platform, and even then only whoever wrote the gcc optimization would have a chance of answering it. I can tell you that memset is highly optimized. Commented Aug 10, 2024 at 22:42
  • @EricPostpischil Any decent memset operation will beat stosd by storing 16- or 32-byte vectors at a time. I've seen one where at boot time the OS decided between different implementations for memset, memcpy, memmove and a few others depending on the processor, stored jump instructions to the implementations, and the actual memset would just jmp to the best method. Commented Aug 10, 2024 at 23:10
  • @gnasher729 and Eric: rep stosd has fast-strings microcode (since Pentium Pro) that uses wider stores. In the best case (large sufficiently-aligned buffer) it's usually nearly as fast as a SIMD loop. (Enhanced REP MOVSB for memcpy). The main downside is startup overhead and bigger misalignment penalties. (On some CPUs the microcode isn't as well tuned if they didn't get around to updating it after architectural changes, like perhaps even after widening the store pathways.) Commented Aug 11, 2024 at 4:02
  • @qwr - you're looking at the fallback string/memset.c for architectures that don't have a hand-written asm memset. Instead look at codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/… / codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/… for CPUs that support AVX-512 masked stores. (Dispatch is done at dynamic-link time.) Related: Why does glibc's strlen need to be so complicated to run quickly? - again someone looking at the fallback portable C. Commented Aug 11, 2024 at 4:08

1 Answer


Theoretically, memset can be very efficient: it is often a highly optimized, hand-written function that uses all the available processor features.

Depending on the architecture, the difference can be negligible or substantial. Sometimes memset can be slower, sometimes faster, than the compiler-generated code.

Test code:

#include <stdio.h>
#include <stddef.h>
#include <time.h>

unsigned char count[256][256];

void __attribute__((noinline)) foo(unsigned char (*count)[256]);

#if defined(NOO2)
#define FORMAT "-Os = %f\n"
#else
#define FORMAT "-O2 = %f\n"
#endif

int main()
{
    clock_t begin = clock();
    for (unsigned x = 0; x < 3000000; x++)
        foo(count);
    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf(FORMAT, time_spent);
}

#if !defined(NOO2)
#pragma GCC optimize ("O2")
#endif
void __attribute__((noinline)) foo(unsigned char (*count)[256])
{
    for (int i = 0; i < 256; ++i)
        for (int j = 0; j < 256; ++j)
            count[i][j] = 0;
}

64-bit x86 - https://godbolt.org/z/q78Yj9W8j

Execution times are: -O2 = 3.123006 & -Os = 3.120289

32-bit - https://godbolt.org/z/cG94Wrzn1

Execution times are: -O2 = 3.630126 & -Os = 2.331193

memset is much slower!

EDIT

This is to show that the results may vary from machine to machine and with configuration.

answered Aug 11, 2024 at 7:42

2 Comments

Godbolt runs your code on AWS instances that might be Cascade Lake or Ice Lake Xeon, or Zen 3. Of various clock speeds. When I open the 64-bit link, I see both at about 2.18 seconds. Or 1.99s for call memset once I change the executor to match the asm output window, since you used a separate "executor" window but didn't sync the compiler options. (I usually use the "execute the code" dropdown: godbolt.org/z/bYM89dr9c). And the 32-bit link has 3.09s for call memset, or 3.6 seconds after clicking refresh on the executor.
TL:DR: your benchmark results are probably from different machines as well as different source, and might even not be the right asm since your "executor" window build options didn't match the source.
