I engineered myself into a situation where I wanted to check whether a region of memory consists of identical bytes. I wanted to at least try to avoid the obvious solution of an O(n) loop, but also was not fond of figuring out how SIMD works. So, I came up with this:
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

bool check_homogeneous(const void *region, size_t region_len) {
    const unsigned char *bytes = (const unsigned char *) region;
    // Special case for odd initial length
    if (region_len & 1 && bytes[0] != bytes[region_len - 1]) {
        return false;
    }
    // Continuously divide region in half and compare with upper half
    size_t part = region_len >> 1;
    while (part) {
        if (memcmp(&bytes[0], &bytes[part], part) != 0) return false;
        part >>= 1;
    }
    return true;
}
It's a bit strange, though it seems to work fine and perform somewhat better than a naive loop. Posting this with hope that someone can steer me towards a less spooky implementation.
5 Answers
Review
That while() with its accessory statements is begging to be expressed as a simple for() instead.
Since region_len is not const, there's really no need for part. Use the (local) parameter variable itself for the loop.
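A minimal sketch of both points applied to the body of the original function (same semantics, just restated):

    for (region_len >>= 1; region_len; region_len >>= 1) {
        if (memcmp(bytes, &bytes[region_len], region_len) != 0) {
            return false;
        }
    }
    return true;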
Boolean functions, especially, should be named meaningfully. if (checkRaining()) is a little vague. Need an umbrella to get to the train? if (isRaining()) gives a better clue about the meaning of its returned value.
Be consistent! There are two if() conditionals in this function. The first one uses braces around a single return false; statement. The second if() has a "no brace body" in which the return false; appears on the same line as the if(). Either is fine (imo), but both together is not fine. Read lots of "expert" code and learn a style that works for you when writing your own code. Adapt to and imitate (don't emulate) any existing style when altering existing code. Consistency suggests better quality, and can make bugs harder to write and easier to spot.
The comment explains "divide by 2", which is implemented as part >>= 1;. Modern compilers are vastly more 'clever' than most of us coders using them. The compiler will recognise the intent of part /= 2; and do the right (and fastest) thing. It's better to aim for clear code than to second-guess the intelligence of the compiler. Leave or adjust the comment to explain why you, the author, chose to divide by two if you feel it worthwhile. (Comments are really, really cheap! and will execute in 0 nanoseconds!)
From the example of Dennis Ritchie's strcmp(), I would advise you as a neophyte to postpone writing functions accepting a void * pointer. Your function's purpose is based on byte-by-byte comparison (as opposed to a function tuned for, perhaps, compound datatypes.) I'd feel safer if the function signature, itself, indicated unsigned char * for the first parameter. This may be debatable, but void * tells the compiler to be LESS stringent with its type checking. Meditate on why Dennis Ritchie wrote the <string.h> functions to take the datatypes of their parameters as they do (instead of making all of them take void * parameters.)
Understanding and alternative solution
Suggest that you locate and study a Standard Library implementation of memcpy(). There you'll find a bit of 'start up' address compensation, then a loop that can safely treat the rest of the block of bytes as 32 or 64 bit 'chunks' (a single instruction deals with 4 or 8 bytes in one step.) Study and learn and adapt to suit your needs.
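For flavour, a minimal sketch of that chunking idea (not real library code, and the function name is invented): scan 8-byte words against a pattern built from the first byte, then mop up the stragglers one byte at a time.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

bool is_same_byte_chunked(const unsigned char *arr, size_t size) {
    if (size < 2)
        return true;
    // Replicate the first byte into all 8 bytes of a word.
    const uint64_t pattern = arr[0] * UINT64_C(0x0101010101010101);
    size_t i = 1;
    for (; i + 8 <= size; i += 8) {
        uint64_t chunk;
        memcpy(&chunk, arr + i, sizeof chunk); // unaligned-safe load
        if (chunk != pattern)
            return false;
    }
    for (; i < size; ++i) { // leftover tail bytes
        if (arr[i] != arr[0])
            return false;
    }
    return true;
}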
This is only a learning exercise, of course.
In the meantime:
#include <stdbool.h> // for 'bool'
#include <stdint.h>  // for 'uint8_t'
#include <string.h>  // for 'memcmp'

bool isSameByte( const uint8_t *arr, const size_t size )
{
    // EDIT: Added missing test as alternative to required fix detected by @chux
    if( size < 2 )
        return true; // 0 or 1 byte is always equal to itself

    return memcmp( arr + 0, arr + 1, size - 1 ) == 0;
}
would make a single pass (optimised by the compiler to use the very best code that's available in the Standard Library) across the entire array. If there's a single byte that differs from its neighbours, the scan will end there. (Notice there's no fiddle-faddle with specially handling the first byte of the array. No if().)
(Put a print statement inside your while() loop, then think about whether or not you've improved on O(n).)
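For instance, a quick instrumentation sketch (drop this into the original function, with <stdio.h>) that tallies what the loop actually reads:

    size_t total = 0;
    size_t part = region_len >> 1;
    while (part) {
        printf("memcmp of %zu bytes against offset %zu\n", part, part);
        total += 2 * part; // each pass reads part bytes from each half
        part >>= 1;
    }
    printf("bytes read: %zu of %zu\n", total, region_len);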
Don't stop experimenting, but lean toward using Standard Library functions where possible.
DIY is REALLY unlikely to improve on performance or trustworthiness.
Spend a rainy Sunday skimming the names of all the functions in the Standard Library. I still find nuggets that I'd known then forgotten about.
And don't try to out clever yourself with intricate brilliance.
Take it from one who's been there: you'll look at this again in 3-6 months and exclaim, "WTF?!".
KISS!
If you're worrying about repeated access to the same byte value, don't. In your 'divide and conquer' scheme, a very large array might suffer from cache misses because it is drawing bytes from two regions. This 'neighbour checking' can race through lines of contiguous bytes, then ask for another serving to scan. Dunno. Feels like it's running the pipeline at maximum speed...
Rows of roses
Homogeneous milk is milk that has been processed so what would be called cream no longer floats to the top. It is, in a way, a heterogeneous mixture that still contains the same substances as before but in a different structure.
One could define a homogeneous array of a mixture of only lowercase ASCII letters. It would be homogeneous by one interpretation, but likely not all the same character, therefore heterogeneous.
Vocabulary matters. Say what you mean (and mean what you say!)
Boo!
Sorry... Just some thrill-seeking spookiness on a howling-wind night! :-)
XY?
Delightfully referenced and linked from the Wikipedia page for "Regret" is the psychology concept of "System One and System Two Thinking". Possibly relevant to this question as the function's purpose and use has the aroma of "checking that all elements of an array of flag bits/bytes have the same value" where the value or indices of any black sheep byte(s) is/are not important. Perhaps some sort of watch dog sniffer run at certain intervals???
(This is "System Two" as some time has passed since this answer was last updated...)
Should this speculation about the function's purpose come close to the mark, the OP is advised to re-think this mechanism.
Sufficient might be a single "Set/Reset" flag and a single register that "remembers" and compares the results of a current operation with only one previous operation. When two results differ, set the flag and leave it set until the caller of this current function wants to verify all is well or work is needed. It's only when that work is done that the S/R flag is reset for the next detection.
This would reduce memory requirements and amortise the effort to scan an array (unknown size) across all the operations that are occurring.
Irrespective of this speculation's appropriateness, it must be noted that developing software is fertile ground for germination and growth of 'regret'...
- The other standard-library function that could do the job is strspn, at least if your first byte isn't 0, the C string terminator. But it normally only checks one byte at a time after making a lookup table, since it takes a whole set of accept-characters. It can use SSE4.2 pcmpistri for small-enough accept-sets, but that's still not as fast as a simple byte compare. Oh, and it works on C strings so would need a terminator or mismatching byte following the array. There's no memspn even in glibc. (Peter Cordes, Jun 9 at 18:00)
- Anyway, your overlapping memcmp trick could probably go somewhat less than half as fast as optimal SIMD machine code on arrays that are already hot in L1d cache, on typical modern x86. For arrays farther away from the CPU, probably still limited by cache/memory bandwidth getting the data to L1d. (Peter Cordes, Jun 9 at 18:04)
- In usage like this (C++ templates, C generic functions, or examples/pseudocode), T is the same type in both appearances, which is very bad for T size. Just use size_t size because you don't want the size-type to be parameterized. If you wanted to use a different wildcard type, use T* arr, S size. (Peter Cordes, Jun 9 at 22:01)
- One way to work around it is to write C that always touches every element, with no early-out. e.g. mismatches |= (arr[i] != arr[0]); or allmatch &= (arr[i] == arr[0]). That's bad for large arrays with a mismatch near the start, but might be acceptable in some cases. See the "how to auto-vectorize an array comparison function" question for an example of auto-vectorization of counting matches. (Peter Cordes, Jun 9 at 22:10)
- Genius! I somehow got it into my head that the regions passed to memcmp could not overlap, even though a quick look at the docs would have indicated otherwise. (Xavier Pedraza, Jun 9 at 22:26)
Your Algorithm is Buggy
It only checks for an odd block size on the first iteration. If any partition has an odd size, the algorithm fails to check the final byte. So the test driver
const char test_data[] = "aabaab"; // Adds terminating null byte.
printf( "\"%s\" %s homogeneous.\n",
        test_data,
        check_homogeneous(test_data, sizeof(test_data)-1U) ? "is" : "is not" );
prints "aabaab" is homogeneous.
Other Suggestions
I recommend using const wherever you can. This mainly lets the compiler check for bugs, but sometimes can help the optimizer out too. You already declare pointers to const unsigned char, but these could be const pointers as well.
C has functions, not methods (unlike Java or Python).
Your algorithm appears to work properly when called with check_homogeneous(NULL, 0). Great! It’s good practice in general to check for logic errors such as null pointers, especially on systems where dereferencing a null pointer will not crash at runtime. If you will be subtracting pointers, which gives you an index of type ptrdiff_t, you additionally need to check that the input size is not greater than PTRDIFF_MAX.
Or if there are conditions the caller must check, at the VERY least put a big warning in a comment right above the function about this behavior!
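A sketch of what such guards might look like at the top of the function (assert is from <assert.h>, PTRDIFF_MAX from <stdint.h>):

    assert(region != NULL || region_len == 0); // null region only valid when empty
    assert(region_len <= PTRDIFF_MAX);         // only needed if pointers will be subtracted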
I agree with Peter Cordes’ suggestion to follow the is_homogeneous naming convention. To me, check_homogeneous could mean performing some action depending on homogeneity.
Is Simpler Better?
Your divide-and-conquer algorithm will, in practice, end up performing a linear scan over 2N bytes of input (it compares N/2 bytes to N/2 bytes on the first iteration, N/4 bytes to N/4 bytes on the second iteration, and so on). That is twice as many memory reads as a simple linear scan.
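To make that concrete, a tally for a hypothetical N = 1024:

/* pass 1:  memcmp(512 bytes vs. 512 bytes)  -> 1024 bytes read
 * pass 2:  memcmp(256 vs. 256)              ->  512 bytes read
 * ...
 * pass 10: memcmp(1 vs. 1)                  ->    2 bytes read
 * total: 2046 bytes read, vs. ~1024 for one linear scan
 */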
Let’s compare to a version with the same basic structure that does a short-circuiting linear scan:
bool is_homogeneous(const void *const region, const size_t region_len) {
    if (region_len == 0) {
        return true; // Nullary reduction.
    }
    assert(region != NULL); // Logic error!
    // If this line is reached, region points to a block of more than zero bytes.
    const unsigned char* const byte_region = region;
    const unsigned char compare_to = *byte_region;

    for (size_t i = 1U; i < region_len; ++i) {
        if (byte_region[i] != compare_to) {
            return false;
        }
    }
    return true;
}
GCC 15.1 with -std=c23 -march=x86-64-v4 -O3 is capable of partly automatically vectorizing this loop, but Clang 20.1, ICX 2024.0 and MSVC 19.43 are not. See on Godbolt. GCC generates this block:
.L24:
        add     rdx, 64
        vpaddq  zmm0, zmm0, zmm2
        cmp     rdx, r9
        je      .L44
.L26:
        vmovdqa64 zmm1, ZMMWORD PTR [rdx]
        vpcmpb  k0, zmm1, zmm3, 4
        kortestq k0, k0
        je      .L24
        vmovq   rax, xmm0
        vzeroupper
        jmp     .L27
If we take out the short-circuiting to make this a pure reduction operation, the LLVM compilers are now able to auto-vectorize it, but GCC 15.1 gets confused and can no longer auto-vectorize.
bool is_homogeneous(const void *const region, const size_t region_len) {
    if (region_len == 0) {
        return true; // Nullary reduction.
    }
    assert(region != NULL); // Logic error!
    // If this line is reached, region points to a block of more than zero bytes.
    const unsigned char* const byte_region = region;
    const unsigned char compare_to = *byte_region;
    bool all_equal_so_far = true;

    for (size_t i = 1U; i < region_len; ++i) {
        all_equal_so_far &= (byte_region[i] == compare_to);
    }
    return all_equal_so_far;
}
What if we whack the compiler in the head with a big clue-by-four?
/* Hey, compiler! Maybe think about a SIMD reduction loop? I enabled
 * both AVX-512bw and AVX-512vl instructions, if you want to use them.
 */
#pragma omp simd reduction(&:all_equal_so_far)
The version of MSVC on Godbolt doesn’t understand this at all, even though the documentation claims to support it, and it makes no difference in any other compiler.
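For clarity, a sketch of where that pragma sits in the reduction version above (GCC and Clang need -fopenmp or -fopenmp-simd to honor it):

    bool all_equal_so_far = true;
    #pragma omp simd reduction(&:all_equal_so_far)
    for (size_t i = 1U; i < region_len; ++i) {
        all_equal_so_far &= (byte_region[i] == compare_to);
    }
    return all_equal_so_far;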
To get every compiler to vectorize the loop, we have to refactor to get a bit closer to the native binary representation:
bool is_homogeneous(const void *const region, const size_t region_len) {
    if (region_len == 0) {
        return true; // Nullary reduction.
    }
    assert(region != NULL); // Logic error!
    // If this line is reached, region points to a block of more than zero bytes.
    const unsigned char* const byte_region = region;
    const unsigned char compare_to = *byte_region;
    unsigned char accumulator = 0;

    for (size_t i = 1U; i < region_len; ++i) {
        accumulator |= (byte_region[i] != compare_to ? 0xFFU : 0x00U);
    }
    return !accumulator;
}
This does generate vectorized code for Clang, GCC, ICX and MSVC, but only Clang produces good code for it. And this is at the cost of always checking every byte in the region. You can evaluate for yourself on Godbolt.
If we want to do better, we’re going to need extreme measures.
A Vectorized Loop
Peter Cordes goes over a lot of this, but there are some things I respectfully disagree with (overflow on a signed type such as ptrdiff_t is formally undefined behavior, originally to support weird old CPUs that would hardware-fault, but the current generation of C compilers takes that as permission to introduce security faults without warning) and some where I’ll suggest a different approach.
What we ideally want mainstream 2025 hardware to do is to load the bytes of the block in aligned chunks, perform branchless SIMD comparisons on each byte, reduce them with a bitwise operation, and short-circuit if we encounter an inhomogeneous byte in the current chunk. For this project, I’ll be compiling with -mavx512bw -mavx512vl -mprefer-vector-width=256 and using naturally-aligned 32-byte chunks of memory, but if you need portability, you can set these constants inside #elif blocks.
#define V_BYTES 32UL
#define V_ALIGN 32UL
One optimization Peter does that I don’t here is to load a larger number of bytes at once, in between the unpredictable branches. You could do the same by increasing V_BYTES to 64, 128 or 256 bytes (although then you do need to change the program logic a bit to make sure not to skip any range of memory).
After reading the first byte, we can process the remainder of the block in one step if we can load the remaining bytes into a native vector. Otherwise, we want to partition the block into:
- the first byte, which we compare all others to
- all bytes between the first byte and the first aligned address within the block, which we will call left
- a middle section of any number of aligned chunks
- all bytes from the last aligned address in the block, which we will call right, to the end of the block
This algorithm requires us to find the first and last vector-aligned address in the block. This brings us to the annoyance that there’s a function to align pointers in the C++ standard library, but not in standard C, and different compilers need different code for it. ICX needs a built-in function that Clang also supports; GCC does not have the built-in function but does have one to tell the compiler that a pointer is aligned. MSVC doesn’t generate any different code for aligned and unaligned loads, but modern CPUs will still run faster if loads and stores are aligned at runtime. Finally, only GCC short-circuits preprocessor conditionals, so I can’t write code like #if defined(__has_builtin) && __has_builtin(__builtin_align_up).
All of which is to say, here’s my boilerplate for the helper macros. It’s not pretty, but it compiles without warnings on every compiler I tested.
#ifndef __has_builtin
// MSVC, etc.
// WARNING: The arguments may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_up(ptr, align) \
((void*)(((uintptr_t)(void*)(ptr) + (align) - 1U) & (uintptr_t)-(intptr_t)(align)))
#elif __has_builtin(__builtin_align_up)
// Clang, ICX, ICPX. ICX specifically needs the builtin, not bit manipulation.
// align MUST be a power of 2.
# define p_align_up(ptr, align) \
__builtin_align_up((ptr), (align))
#elif __has_builtin(__builtin_assume_aligned)
// GCC
// WARNING: The arguments may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_up(ptr, align) \
__builtin_assume_aligned(\
(void*)(((uintptr_t)(void*)(ptr) + (align) - 1U) & (uintptr_t)-(intptr_t)(align)),\
(align))
#else
// Other compilers, including MSVC
// WARNING: align may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_up(ptr, align) \
((void*)(((uintptr_t)(void*)(ptr) + (align) - 1U) & (uintptr_t)-(intptr_t)(align)))
#endif
#ifndef __has_builtin
// MSVC, etc.
// WARNING: The arguments may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_down(ptr, align) \
((void*)((uintptr_t)(void*)(ptr) & (uintptr_t)-(intptr_t)(align)))
#elif __has_builtin(__builtin_align_down)
// Clang, ICX, ICPX. ICX specifically needs the builtin, not bit manipulation.
// align MUST be a power of 2.
# define p_align_down(ptr, align) \
__builtin_align_down((ptr), (align))
#elif __has_builtin(__builtin_assume_aligned)
// GCC
// WARNING: align may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_down(ptr, align) \
__builtin_assume_aligned(\
(void*)((uintptr_t)(void*)(ptr) & -(uintptr_t)(align)),\
(align))
#else
// Other compilers
// WARNING: The arguments may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_down(ptr, align) \
((void*)((uintptr_t)(void*)(ptr) & (uintptr_t)-(intptr_t)(align)))
#endif
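A quick usage sketch of those helpers (the buffer and sizes are illustrative, not part of the module):

    unsigned char buf[100];
    // First 32-byte boundary at or after buf + 1:
    const unsigned char *first_aligned = p_align_up(buf + 1, 32UL);
    // Last 32-byte boundary at or before buf + 100:
    const unsigned char *last_aligned  = p_align_down(buf + 100, 32UL);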
Next: If we copy any chunk with fewer than 32 bytes to an aligned buffer that we pad with copies of the target byte, the code to test each chunk for homogeneity becomes identical. So we can factor out this repeated code into a helper function, instead of copy-pasting it four times.
static inline bool is_hg_chunk(const unsigned char v[V_BYTES], const unsigned char c) {
    uint8_t accumulator = 0x00U;
    for (unsigned i = 0; i < V_BYTES; ++i) {
        accumulator |= (v[i] != c) ? 0xFFU : 0x00U;
    }
    return !accumulator;
}
When passed a pointer the compiler can deduce is aligned, Clang with -march=x86-64-v3 generates the assembly
        vmovdqa ymm0, ymmword ptr [rsp + 32]
        vpxor   ymm0, ymm0, ymmword ptr [rsp]
        vptest  ymm0, ymm0
        sete    al
GCC also generates aligned-load instructions. Its assembly is
        vpcmpeqb ymm0, ymm0, YMMWORD PTR [rsp]
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqb ymm0, ymm0, ymm1
        vptest  ymm0, ymm0
        sete    al
And ICX produces
        vmovd   xmm0, r14d
        vpbroadcastb ymm0, xmm0
        vpxor   ymm0, ymm0, ymmword ptr [rsp]
        vptest  ymm0, ymm0
        sete    al
MSVC is noticeably worse than these three. It performs the reduction with vpextract and a series of shifts, and also doesn’t generate aligned-load instructions (although those only matter on older CPUs). But even it can vectorize the comparison:
        vmovdqu ymm1, YMMWORD PTR tv237[rbp]
        vpcmpeqb ymm0, ymm1, YMMWORD PTR buffer1ドル[rbp]
        vpandn  ymm4, ymm0, YMMWORD PTR __ymm@ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
Putting it All Together
I'll go ahead and post the full module that calculates the prefix, suffix and each aligned chunk in between, using the above is_hg_chunk function:
#include <assert.h>
#include <stdalign.h> // Needed by MSVC 17.13
#include <stdbool.h> // Needed by MSVC 17.13
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#define V_BYTES 32UL
#define V_ALIGN 32UL
#ifndef __has_builtin
// MSVC, etc.
// WARNING: The arguments may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_up(ptr, align) \
((void*)(((uintptr_t)(void*)(ptr) + (align) - 1U) & (uintptr_t)-(intptr_t)(align)))
#elif __has_builtin(__builtin_align_up)
// Clang, ICX, ICPX. ICX specifically needs the builtin, not bit manipulation.
// align MUST be a power of 2.
# define p_align_up(ptr, align) \
__builtin_align_up((ptr), (align))
#elif __has_builtin(__builtin_assume_aligned)
// GCC
// WARNING: The arguments may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_up(ptr, align) \
__builtin_assume_aligned(\
(void*)(((uintptr_t)(void*)(ptr) + (align) - 1U) & (uintptr_t)-(intptr_t)(align)),\
(align))
#else
// Other compilers, including MSVC
// WARNING: align may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_up(ptr, align) \
((void*)(((uintptr_t)(void*)(ptr) + (align) - 1U) & (uintptr_t)-(intptr_t)(align)))
#endif
#ifndef __has_builtin
// MSVC, etc.
// WARNING: The arguments may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_down(ptr, align) \
((void*)((uintptr_t)(void*)(ptr) & (uintptr_t)-(intptr_t)(align)))
#elif __has_builtin(__builtin_align_down)
// Clang, ICX, ICPX. ICX specifically needs the builtin, not bit manipulation.
// align MUST be a power of 2.
# define p_align_down(ptr, align) \
__builtin_align_down((ptr), (align))
#elif __has_builtin(__builtin_assume_aligned)
// GCC
// WARNING: align may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_down(ptr, align) \
__builtin_assume_aligned(\
(void*)((uintptr_t)(void*)(ptr) & -(uintptr_t)(align)),\
(align))
#else
// Other compilers
// WARNING: The arguments may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_down(ptr, align) \
((void*)((uintptr_t)(void*)(ptr) & (uintptr_t)-(intptr_t)(align)))
#endif
typedef struct byte_vector {
    alignas(V_ALIGN) unsigned char bytes[V_BYTES];
} byte_vector;

static inline bool is_hg_chunk(const unsigned char v[V_BYTES], const unsigned char c) {
    uint8_t accumulator = 0x00U;
    for (unsigned i = 0; i < V_BYTES; ++i) {
        accumulator |= (v[i] != c) ? 0xFFU : 0x00U;
    }
    return !accumulator;
}
bool is_homogeneous(const void *const region, const size_t region_len) {
    if (region_len == 0) { // is_homogeneous(NULL, 0) is not a bug.
        return true; // Nullary reduction.
    }
    assert(region != NULL); // Logic error!
    // If this line is reached, region points to a block of more than zero bytes.
    const unsigned char* const byte_region = region;
    const uint8_t compare_byte = *byte_region;

    if (region_len <= V_BYTES + 1U) {
        alignas(V_ALIGN) byte_vector buffer;
        memset(buffer.bytes, compare_byte, V_BYTES);
        memcpy(buffer.bytes, byte_region+1, region_len-1U);
        return is_hg_chunk(buffer.bytes, compare_byte);
    }

    // The first aligned address in the memory region
    const unsigned char* const left = p_align_up(byte_region+1, V_ALIGN);
    // The last aligned address in the memory region
    const unsigned char* const right = p_align_down(byte_region + region_len, V_ALIGN);
    // The end of the memory region
    const unsigned char* const end = byte_region + region_len;
    const size_t prefix_len = (size_t)(left - byte_region - 1);
    const size_t suffix_len = (size_t)(end - right);

    // Always, always, always check for buffer overruns!
    assert(prefix_len < V_BYTES);
    assert(suffix_len < V_BYTES);

    // Check the prefix as one chunk.
    if (prefix_len > 0U) {
        byte_vector buffer;
        memset(buffer.bytes, compare_byte, V_BYTES);
        memcpy(buffer.bytes, byte_region+1, prefix_len);
        if (!is_hg_chunk(buffer.bytes, compare_byte)) {
            return false;
        }
    }

    // Check the aligned portion of the region.
    /* Most compilers can deduce from the facts that left is aligned and p
     * increments by the alignment that p is aligned.
     */
    for (const unsigned char* p = left; p < right; p += V_BYTES) {
        if (!is_hg_chunk(p, compare_byte)) {
            return false;
        }
    }

    // Check the suffix as one chunk.
    if (suffix_len > 0U) {
        byte_vector buffer;
        memset(buffer.bytes, compare_byte, V_BYTES);
        memcpy(buffer.bytes, right, suffix_len);
        if (!is_hg_chunk(buffer.bytes, compare_byte)) {
            return false;
        }
    }
    return true;
}
And a simple test driver:
#include <stdio.h>
#include <stdlib.h>

#define TEST_SIZE 1024U

int main(void) {
    static char test_data[TEST_SIZE + 1U];
    memset(test_data, 'a', TEST_SIZE);
    test_data[TEST_SIZE] = '0円';
    // test_data[TEST_SIZE - 1U] = 'b';
    printf( "\n\"%.8s\"... %s homogeneous.\n",
            test_data,
            is_homogeneous(test_data, TEST_SIZE) ? "is" : "is not" );
    return EXIT_SUCCESS;
}
The critical for loop that will run the most times for a large region compiles on Clang 20.1 to
.LBB0_11:
        vpxor   ymm1, ymm0, ymmword ptr [r13]
        vptest  ymm1, ymm1
        jne     .LBB0_14
        add     r13, 32
        cmp     r13, rax
        jb      .LBB0_11
You can test for yourself on Godbolt.
Update
Although this isn’t especially germane to a review of your code, I went ahead and tweaked my version based on feedback. That gets Clang 20.1 with -mavx2 to generate the critical loop
.LBB0_10:
        vpcmpeqb ymm2, ymm0, ymmword ptr [r14 + 96]
        vpxor   ymm2, ymm2, ymm1
        vpcmpeqb ymm3, ymm0, ymmword ptr [r14 + 64]
        vpxor   ymm3, ymm3, ymm1
        vpor    ymm2, ymm2, ymm3
        vpcmpeqb ymm3, ymm0, ymmword ptr [r14 + 32]
        vpxor   ymm3, ymm3, ymm1
        vpcmpeqb ymm4, ymm0, ymmword ptr [r14]
        vpxor   ymm4, ymm4, ymm1
        vpor    ymm3, ymm3, ymm4
        vpor    ymm2, ymm2, ymm3
        vpsllw  ymm2, ymm2, 7
        vpmovmskb ecx, ymm2
        test    ecx, ecx
        jne     .LBB0_18
        sub     r14, -128
        cmp     r14, rax
        jbe     .LBB0_10
With -mavx512bw -mavx512vl -mprefer-vector-width=256, Clang gives us
.LBB0_10:
        vpcmpneqb k0, ymm0, ymmword ptr [r14 + 96]
        vpcmpneqb k1, ymm0, ymmword ptr [r14 + 64]
        vpcmpneqb k2, ymm0, ymmword ptr [r14 + 32]
        kord    k0, k0, k1
        vpcmpneqb k1, ymm0, ymmword ptr [r14]
        kord    k1, k2, k1
        kord    k0, k0, k1
        kortestd k0, k0
        jne     .LBB0_18
        sub     r14, -128
        cmp     r14, rax
        jbe     .LBB0_10
ICX 2024 is similar but partially unrolls; GCC 15.1 now does a series of vmovdqa/vpcmpb followed by unnecessary shuffling, instead of the two vpcmpeqb instructions it was doing before; and MSVC is the laggard.
The benefit here should be more loads per cycle, which will all be aligned, and fewer unpredictable branches. On the other hand, there might be up to three unnecessary loads.
To my surprise, refactoring to use 32-bit SIMD lanes makes the generated code significantly worse.
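For the curious, the 32-bit-lane refactor was along these lines (a hypothetical reconstruction, not the exact code tried): broadcast the byte into a 32-bit word, then XOR four bytes per lane.

static inline bool is_hg_chunk32(const uint32_t v[V_BYTES / 4U], const uint32_t c4) {
    uint32_t accumulator = 0U;
    for (unsigned i = 0; i < V_BYTES / 4U; ++i) {
        accumulator |= v[i] ^ c4; // nonzero iff any byte in the lane differs
    }
    return accumulator == 0U;
}
// where c4 = compare_byte * 0x01010101U replicates the byte into all four lanes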
- @user555045 Right, the penalty comes from a load crossing a cache-line boundary, or worse, a page boundary. I see I just said "aligned" for both aligned instructions and the runtime alignment of the pointer. I see how that was misleading. It matters much more when doing non-temporal loads/stores, which really must be aligned. Thanks, will fix. (Davislor, Jun 11 at 6:28)
- Good idea to use xor to get 0 for all-matching instead of -1 from vpcmpeqb. You can OR together multiple results before vptest/branch to amortize that overhead, like I did in my answer. That has the further advantage of enabling vpternlogd with AVX-512 to combine 3 vectors with one instruction starting from a memory source operand, not just from compare results. Or is that just Clang's invention, not your source? (Peter Cordes, Jun 11 at 6:45)
- @PeterCordes The motivation for making V_BYTES and V_ALIGN different constants was so I could do that (perform more loads at once) by increasing V_BYTES, but then I didn’t actually write the algorithm to work correctly if they’re different. (Davislor, Jun 11 at 6:51)
- @PeterCordes The efficient way is probably to do one more loop over any aligned chunks after the final value of p, before the suffix check in there now. (Davislor, Jun 11 at 6:53)
- Re: overflow of ptrdiff_t - My answer doesn't overflow ptrdiff_t for small sizes, just signed subtraction producing a negative. Unsigned wrapping of size_t to a huge size would be a problem for i < size-128 + 1, but for example 0 < -119 is false so the loop doesn't run at all for size = 8 when using a signed type. My answer doesn't work at all for sizes greater than PTRDIFF_MAX, but that's impossible on real-world 64-bit implementations. It is possible in 32-bit code (or ILP32 ABIs) under 64-bit kernels, though, so maybe I should have mentioned it. (Peter Cordes, Jun 11 at 6:54)
Fe2O3's answer shows a great idea, of using libc memcmp(arr, arr+1, size-1) to make a single pass over your data. (I commented under it with a bunch of things which I'm turning into an answer.)
Libc memcmp uses SIMD via hand-written asm on most platforms; it's an important enough function that it's worth hand-tuning. On GNU/Linux for example, it even picks a version at runtime based on CPU features, so it can use AVX2 or AVX-512 if available, even if your program is compiled to still work on older x86-64 CPUs.
That has the downside of loading every byte twice, with half the loads being misaligned. That's definitely a bottleneck if your data is already hot in L1d cache. On modern x86-64 with manually-vectorized code doing 2x _mm256_cmpeq_epi8 (vpcmpeqb) + 1x _mm256_and_si256 (vpand), or a larger reduction of say 8 compares and 4x + 2x vpand + vptest (AND and set FLAGS), most CPUs should be able to keep their vector load ports fully busy (2 or 3 vectors per clock with cache-line splits costing double, more at page splits.) Or will bottleneck on front-end bandwidth. Intel CPUs might bottleneck on 3/clock SIMD ALU operations; AMD Zen family has 4 SIMD ALU ports and at least the AND instructions should be able to run on any port.
With a vector size of 32 bytes, that means one of every 4 loads will be split across 2 cache lines (if the start of the array is aligned by 32); that could be a bottleneck even with data coming from L2 cache. But not much of a bottleneck. And if the data is coming from farther away, the extra work for the core itself can happen while waiting for HW prefetch and demand loads to bring in the data. For non-small arrays, the memcmp idea should perform about the same as the best you can do with manually-vectorized code, at least on x86 where unaligned vector loads are cheap. It will use more execution resources than optimal code, which matters if sharing a physical core with another thread (hyperthreading / SMT), and a bit for thermals / power usage.
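As a sketch of that 2x compare + AND pattern (illustrative AVX2 intrinsics; the helper name and the fixed 64-byte chunk are my framing, with key broadcast once via _mm256_set1_epi8(arr[0])):

#include <immintrin.h>

// Check one 64-byte chunk: two compares, one AND, one movemask.
static bool chunk64_matches(const unsigned char *p, __m256i key) {
    __m256i a = _mm256_loadu_si256((const __m256i *)p);
    __m256i b = _mm256_loadu_si256((const __m256i *)(p + 32));
    __m256i m = _mm256_and_si256(_mm256_cmpeq_epi8(a, key),  // vpcmpeqb
                                 _mm256_cmpeq_epi8(b, key)); // vpcmpeqb + vpand
    return _mm256_movemask_epi8(m) == -1; // all 32 lanes matched in both vectors
}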
Auto-vectorization
Compilers generally don't auto-vectorize loops with an early-out condition (because the trip-count can't be computed before the first iteration). That makes it basically impossible to get good asm from a compiler for your own memcmp-style loop. ICC Classic (not LLVM-based ICX) could do that sometimes, and if I recall correctly the latest GCC or Clang might now be able to do it sometimes, too.
Another challenge for the way you wrote your loop is that the code has to work for every case that doesn't have undefined behaviour. For example, passing a 2-byte array with different values, and a size of 1 billion. Even if those 2 bytes are at the end of a page followed by an unmapped page, so a 16-byte load would segfault. See Is it safe to read past the end of a buffer within the same page? (Yes in asm, but hard to do in C without undefined behaviour. It's what libc str* functions like strlen do.)
You can write int foo(char arr[static 1024]) to tell the compiler that the whole array is really there, but GCC and Clang don't take advantage of that hint, last I checked.
One way to work around it is to write C that always touches every element, with no early-out, e.g. mismatches |= (arr[i] != arr[0]); or allmatch &= (arr[i] == arr[0]). That's bad for large arrays with a mismatch near the start, but might be acceptable in some cases. See https://stackoverflow.com/questions/42146125/how-to-auto-vectorization-array-comparison-function and https://stackoverflow.com/questions/2741859/how-fast-can-you-make-linear-search/31509388#31509388 for examples.
We can even apply that technique in chunks, to make an inner loop that will vectorize and fully unroll, inside an outer loop that checks 128 bytes at a time.
By thinking about the x86-64 asm I wanted, I wrote this code to lead a C compiler in that direction:
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
//#include <unistd.h>  // ssize_t to make the size loop condition not wrap for small size
typedef ptrdiff_t SIZE_TYPE;  // ssize_t is POSIX, not ISO C.

int is_homogeneous(unsigned char *arr, SIZE_TYPE size)
{
    unsigned char key = arr[0];
    for (SIZE_TYPE pos = 0 ; pos < (size-128 + 1) ; pos += 128) {
        unsigned char allmatch = 0xff;                          // vpbroadcastb or equivalent
        for (int i=0 ; i<128 ; i++){
            unsigned char match = arr[pos+i] == key ? 0xff : 0; // pcmpeqb
            allmatch &= match;                                  // pand
        }
        //if ((allmatch & 0x80) != 0x80) return false; // pmovmskb + cmp reg, 0xffff / jne
        if (allmatch != 0xff) return false; // ptest + a jcc that checks for all-ones, ideally. But compilers don't see that.
    }
    // TODO: tail handling of the last size % 128 bytes
    return true;
}
Godbolt with Clang and GCC -O3. (-O2 gives the same asm; recent GCC does auto-vectorize at -O2 for cases like this when no cleanup is needed.)
GCC does the expected vpcmpeqb / vpand, but makes a total mess of branching on the final 32-byte (256-bit) vector being all-ones or not. It unpacks to 128-bit vectors of 16-bit elements for multiplies (vpmullw), with a huge number of shuffle instructions.
Clang optimizes this inner loop into non-terrible asm, although it makes the braindead choice to optimize the allmatch != 0xff (cmp reg, -1) into tmp == 0 (test reg,reg), which is a tiny saving that comes at the cost of having to invert every compare result so it can OR them together. So there's a vpxor (with -1) + vpor where there would otherwise just be a vpand.
# clang (trunk) -O3 -march=x86-64-v3   (AVX2 + BMI2)
is_homogeneous:
        mov     eax, 1                  # return value
        cmp     rsi, 128
        jl      .LBB0_5
        movzx   ecx, byte ptr [rdi]
        add     rsi, -127
        vmovd   xmm0, ecx
        vpbroadcastb ymm0, xmm0         # YMM0 = _mm256_set1_epi8(arr[0])
        xor     ecx, ecx
        vpcmpeqd ymm1, ymm1, ymm1       # YMM1 = all-one bits = set1(-1)
.LBB0_3:
        vpcmpeqb ymm2, ymm0, ymmword ptr [rdi + rcx + 96]
        vpxor   ymm2, ymm2, ymm1
        vpcmpeqb ymm3, ymm0, ymmword ptr [rdi + rcx + 64]
        vpxor   ymm3, ymm3, ymm1
        vpor    ymm2, ymm2, ymm3
        vpcmpeqb ymm3, ymm0, ymmword ptr [rdi + rcx + 32]
        vpxor   ymm3, ymm3, ymm1
        vpcmpeqb ymm4, ymm0, ymmword ptr [rdi + rcx]
        vpxor   ymm4, ymm4, ymm1
        vpor    ymm3, ymm3, ymm4
        vpor    ymm2, ymm2, ymm3
        vpsllw  ymm2, ymm2, 7           # it wants the low bit apparently, instead of the high bit.
        vpmovmskb edx, ymm2             # edx = high bit of each byte of ymm2
        test    edx, edx                # set FLAGS according to EDX
        jne     .LBB0_4                 # jump (exit loop, return false) if non-zero
        sub     rcx, -128
        cmp     rcx, rsi
        jl      .LBB0_3                 # size-based loop condition
.LBB0_5:
        vzeroupper
        ret                             # return true; EAX set ahead of the loop
.LBB0_4:
        xor     eax, eax                # return false
        vzeroupper
        ret
Handling the last size % 128 bytes efficiently could take as much or more code. If the total array size is at least 128 bytes, you could do another same-size chunk that ends at the last byte of the array. It's fine if it overlaps with some bytes you already checked.
If sizes are always or normally multiples of 32 or something, that can simplify things, by not handling some cases, or not handling them efficiently.
- @Davislor: Mine compiles like crap even with Clang, using about 1.5x as many vector-ALU instructions as it should (vpxor/vpor instead of vpand, following the vpcmpeqb). Mine also only does loads that are aligned relative to the start of the array, so that's not a difference. Your vpxor/vptest/jcc loop is 6 uops (fused domain), so can run at best 1 load per clock on Alder Lake and later big-cores. (And Zen 2 or later). Or actually less: all 6 uops need an ALU port, and Intel only has 6 ALU ports in Lion Cove (Core Ultra 2 and Lunar Lake), fewer in earlier CPUs including Redwood (5). (Peter Cordes, Jun 11 at 7:22)
- @Davislor: AMD Zen-family has separate schedulers and ports for vector vs. scalar uops, so the back-end can actually keep up with the 3 vector-ALU and 3 scalar-ALU uops per cycle the front-end can feed it. If your compiler did a better job, the vpxor/vptest/jcc (1 + 2 + 1 = 4 uops) would be vpcmpeqb/vpmovmskb/cmp+jcc (1 + 1 + 1 = 3 uops), making the whole loop 5. (Peter Cordes, Jun 11 at 7:27)
- @Davislor: vpxor is great if unrolling and combining multiple results with vpor like Clang seems to want to even when I write the source with &=, and allows AVX-512 vpternlogd, but it defeats taking advantage of the AND part of vptest using two different source vectors to make sure they're both all-ones. (Or wait, maybe you can't figure that out from the FLAGS conditions of one ptest. There's a _mm_test_mix_ones_zeros and a _mm_test_all_zeros, but I forget if their semantic meaning only works when one vector is all-ones.) (Peter Cordes, Jun 11 at 7:30)
- @Davislor: Anyway, in its current state with Clang, my code uses 16 fused-domain uops to check 4 vectors. Oh wait, no, 20 fused-domain uops because LLVM used indexed addressing-modes for the memory-source operands, so those unlaminate on Intel. /facepalm. 13 of those uops need a vector ALU port (port 0, 1, or 5 on Intel CPUs). uica.uops.info predicts Skylake and Rocket Lake will run yours at 1 iter (1 vector) per 2 clocks (although 1.5 might be more realistic; IDK why it thinks both branches need port 6), mine at 1 iter (4 vectors) per 6 clocks (SKL) or 4.4 clocks (ICL/RKL). (Peter Cordes, Jun 11 at 7:36)
- However, three of the four compilers generate much better code with -mavx512bw -mavx512vl -mprefer-vector-width=256 enabled. MSVC at least tries to vectorize. (Davislor, Jun 11 at 17:58)
For those who like the memcmp(region, (char *)region + 1, region_len - 1) approach:
Consider:
- Edge cases: What if region_len == 0?
bool check_homogeneous_alt1(const void *region, size_t region_len) {
    return region_len == 0 ||
           memcmp(region, (char *)region + 1, region_len - 1) == 0;
}
- Alignment: As region may be nicely aligned, use like-wise aligned pointers for the first memcmp() (and likely most time-consuming) call. memcmp() can handle unaligned compares, yet common implementations work best with aligned ones.
// Unchecked code. Will review later.
/*
 * Return true when memory is homogeneous.
 *
 * First compare data 1 "chunk" away.
 * This "chunk" is `sizeof(max_align_t)` bytes away.
 *
 * Then compare 1 byte away for the "leftovers".
 */
bool check_homogeneous_alt2(const void *region, size_t region_len) {
    const unsigned char *uregion = region;
    size_t leftover = region_len;
    // Do we have at least 2 chunks?
    if (region_len >= sizeof(max_align_t)*2) {
        leftover = region_len % sizeof(max_align_t);
        if (memcmp(uregion, uregion + sizeof(max_align_t),
                   region_len - leftover - sizeof(max_align_t))) {
            return false;
        }
        // - 1 is to check that the first region is like the remainder.
        uregion += region_len - leftover - 1;
        leftover++;
    }
    // With fewer than 2 chunks, leftover is the whole region_len.
    if (leftover > 1) {
        if (memcmp(uregion, uregion + 1, leftover - 1)) {
            return false;
        }
    }
    return true;
}
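And a tiny sanity-check sketch for both (the 37-byte size is arbitrary, chosen to exercise the leftover path):

#include <assert.h>
#include <string.h>

int main(void) {
    unsigned char a[37];
    memset(a, 0x5A, sizeof a);
    assert(check_homogeneous_alt1(a, sizeof a));
    assert(check_homogeneous_alt2(a, sizeof a));
    a[sizeof a - 1] ^= 1u; // flip a bit in the last byte
    assert(!check_homogeneous_alt1(a, sizeof a));
    assert(!check_homogeneous_alt2(a, sizeof a));
    return 0;
}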
- You've beaten me to the punch. Your comment to "guard against 0-1" was on my to-do list to perform with the morning's first coffee... Good that you spotted that! I was going to use size < 2 to shortcut, since a single byte is always the same as itself, too. Cheers, and thanks for your aligning version in C. The 'low level' versions in other answers here are really impressive, but ironically being 'low', it's well over my head. :-) (user272752, Jun 11 at 21:25)
- You might as well just check region_len <= 1 since there's no need to run a length=0 memcmp. (Peter Cordes, Jul 8 at 14:46)
- @PeterCordes, With size_t len, either region_len == 0 or region_len <= 1 will work fine here. I expect such small len values to be rare out there in the wild and so prefer the region_len == 0 test as simpler to encode and execute for the lion's share of cases. YMMV. (chux, Jul 8 at 17:03)
- @chux: Fair enough, saves a byte on x86 (cmp reg,imm8/jcc vs. test/jcc), but on AArch64 could allow cbz or cbnz, saving a whole instruction. And on RISC-V allows beq with the zero-register instead of needing to materialize a 1 or 2 constant. But wait, we need to sub reg, 1 to get region_len - 1 for the call, so could branch on flags from that (below-or-equal). But if you need FLAGS, you can't use LEA to copy-and-sub, so yeah potentially some savings. (Peter Cordes, Jul 8 at 17:38)
- @PeterCordes "sub reg, 1 to get region_len - 1 for the call, so could branch on flags from that (below-or-equal)" -- would not that also consider sizes in the upper-half size_t range as negative? (chux, Jul 9 at 0:33)
Lots of answers addressing correctness and performance; I'll add a look at usability and style.
Using an active verb such as "check" to start the name suggests a procedure rather than a function (in C, that means returning void rather than a value). Prefer to name predicate functions such as this by beginning with is_... or has_... etc. (note that names beginning with is followed directly by a letter are reserved for future C standards, so don't omit the underscore!).
"Homogeneous" is a term that's overloaded in some contexts to refer to types of memory, so I might use "uniform" instead, giving is_uniform()
. Or perhaps has_one_value()
.
Consider choosing names for the formal parameters that are familiar from the C Standard Library specification. In this case, following the pattern of memchr() would give us s and n for the names.
We don't need a cast here:
bool check_homogeneous(const void *region, size_t region_len) {
    const unsigned char *bytes = (const unsigned char *) region;
A pointer to void converts without a cast to any other object pointer type, and a pointer to const void converts to any pointer to a const object, so we can simply write unsigned char const *const bytes = region without the clutter.
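Putting those suggestions together, a sketch of how the function might read (the body is shown as a simple scan just for illustration):

#include <stdbool.h>
#include <stddef.h>

bool is_uniform(const void *s, size_t n) {
    unsigned char const *const bytes = s; // no cast: const void * converts implicitly
    for (size_t i = 1; i < n; ++i) {
        if (bytes[i] != bytes[0]) {
            return false;
        }
    }
    return true;
}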
- The "comments" section under my answer has become "lengthy" (to say the least!) Please forgive me this piggy-back here where it may be noticed... The OP's post contains text that may be the best function name of all: is_identical_byte_val()... Noted recently, "[descriptive] naming is hard". Cheers! :-) (user272752, Jun 13 at 23:41)
- "[descriptive] naming is hard"... Can't help but think of Elon's kid named 'X'... He'll sign his passport and/or credit card application and people will think he's illiterate... :-) Send me a message and I'll delete this frivolous comment myself... (Can you imagine if X grows up and moves to U in Micronesia? Maybe becomes a "one trick pony" like me, working only in C... This stuff almost writes itself...) :-) (user272752, Jun 14 at 8:24)
- [...] memcmp, you are doing worse than O(n) work since you are scanning the same memory range multiple times, which, depending on the size of the range and of the caches, may have more or less noticeable effects.
- [...] AAbAAbA...?
- [...] memcmp(p, p+step, n-step). To choose a good value for step, consider something that's comfortable for the target hardware (perhaps cache-line length), and you'll need special-case code for shorter inputs.