I engineered myself into a situation where I wanted to check whether a region of memory consists of identical bytes. I wanted to at least try to avoid the obvious solution of an O(n) loop, but also was not fond of figuring out how SIMD works. So, I came up with this:
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

bool check_homogeneous(const void *region, size_t region_len) {
    const unsigned char *bytes = (const unsigned char *) region;
    // Special case for odd initial length
    if (region_len & 1 && bytes[0] != bytes[region_len - 1]) {
        return false;
    }
    // Continuously divide region in half and compare with upper half
    size_t part = region_len >> 1;
    while (part) {
        if (memcmp(&bytes[0], &bytes[part], part) != 0) return false;
        part >>= 1;
    }
    return true;
}
It's a bit strange, though it seems to work fine and perform somewhat better than a naive loop. Posting this with hope that someone can steer me towards a less spooky implementation.
5 Answers
Review
That while() with its accessory statements is begging to be expressed as a simple for() instead.
Since region_len is not const, there's really no need for part. Use the (local) parameter variable itself for the loop.
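A minimal sketch of both points applied to the body of the original function (same semantics, just restated):

    for (region_len >>= 1; region_len; region_len >>= 1) {
        if (memcmp(bytes, &bytes[region_len], region_len) != 0) {
            return false;
        }
    }
    return true;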
Boolean functions, especially, should be named meaningfully. if (checkRaining()) is a little vague. Need an umbrella to get to the train? if (isRaining()) gives a better clue about the meaning of its returned value.
Be consistent! There are two if() conditionals in this function. The first one uses braces around a single return false; statement. The second if() has a "no brace body" in which the return false; appears on the same line as the if(). Either is fine (imo), but both together is not fine. Read lots of "expert" code and learn a style that works for you when writing your own code. Adapt to and imitate (don't emulate) any existing style when altering existing code. Consistency suggests better quality, and can make bugs harder to write and easier to spot.
The comment explains "divide by 2", which is implemented as part >>= 1;. Modern compilers are vastly more 'clever' than most of us coders using them. The compiler will recognise the intent of part /= 2; and do the right (and fastest) thing. It's better to aim for clear code than to second-guess the intelligence of the compiler. Leave or adjust the comment to explain why you, the author, chose to divide by two if you feel it worthwhile. (Comments are really, really cheap! and will execute in 0 nanoseconds!)
From the example of Dennis Ritchie's strcmp(), I would advise you as a neophyte to postpone writing functions accepting a void * pointer. Your function's purpose is based on byte-by-byte comparison (as opposed to a function tuned for, perhaps, compound datatypes.) I'd feel safer if the function signature, itself, indicated unsigned char * for the first parameter. This may be debatable, but void * tells the compiler to be LESS stringent with its type checking. Meditate on why Dennis Ritchie wrote the <string.h> functions to take the datatypes of their parameters as they do (instead of making all of them take void * parameters.)
Understanding and alternative solution
Suggest that you locate and study a Standard Library implementation of memcpy(). There you'll find a bit of 'start up' address compensation, then a loop that can safely treat the rest of the block of bytes as 32 or 64 bit 'chunks' (a single instruction deals with 4 or 8 bytes in one step.) Study and learn and adapt to suit your needs.
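For flavour, a minimal sketch of that chunking idea (not real library code, and the function name is invented): scan 8-byte words against a pattern built from the first byte, then mop up the stragglers one byte at a time.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

bool is_same_byte_chunked(const unsigned char *arr, size_t size) {
    if (size < 2)
        return true;
    // Replicate the first byte into all 8 bytes of a word.
    const uint64_t pattern = arr[0] * UINT64_C(0x0101010101010101);
    size_t i = 1;
    for (; i + 8 <= size; i += 8) {
        uint64_t chunk;
        memcpy(&chunk, arr + i, sizeof chunk); // unaligned-safe load
        if (chunk != pattern)
            return false;
    }
    for (; i < size; ++i) { // leftover tail bytes
        if (arr[i] != arr[0])
            return false;
    }
    return true;
}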
This is only a learning exercise, of course.
In the meantime:
#include <stdbool.h> // for 'bool'
#include <stdint.h>  // for 'uint8_t'
#include <string.h>  // for 'memcmp'

bool isSameByte( const uint8_t *arr, const size_t size )
{
    // EDIT: Added missing test as alternative to required fix detected by @chux
    if( size < 2 )
        return true; // 0 or 1 byte is always equal to itself

    return memcmp( arr + 0, arr + 1, size - 1 ) == 0;
}
would make a single pass (optimised by the compiler to use the very best code that's available in the Standard Library) across the entire array. If there's a single byte that differs from its neighbours, the scan will end there. (Notice there's no fiddle-faddle with specially handling the first byte of the array. No if().)
(Put a print statement inside your while() loop, then think about whether or not you've improved on O(n).)
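For instance, a quick instrumentation sketch (drop this into the original function, with <stdio.h>) that tallies what the loop actually reads:

    size_t total = 0;
    size_t part = region_len >> 1;
    while (part) {
        printf("memcmp of %zu bytes against offset %zu\n", part, part);
        total += 2 * part; // each pass reads part bytes from each half
        part >>= 1;
    }
    printf("bytes read: %zu of %zu\n", total, region_len);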
Don't stop experimenting, but lean toward using Standard Library functions where possible.
DIY is REALLY unlikely to improve on performance or trustworthiness.
Spend a rainy Sunday skimming the names of all the functions in the Standard Library. I still find nuggets that I'd known then forgotten about.
And don't try to out clever yourself with intricate brilliance.
Take it from one who's been there: you'll look at this again in 3-6 months and exclaim, "WTF?!".
KISS!
If you're worrying about repeated access to the same byte value, don't. In your 'divide and conquer' scheme, a very large array might suffer from cache misses because it is drawing bytes from two regions. This 'neighbour checking' can race through lines of contiguous bytes, then ask for another serving to scan. Dunno. Feels like it's running the pipeline at maximum speed...
Rows of roses
Homogeneous milk is milk that has been processed so what would be called cream no longer floats to the top. It is, in a way, a heterogeneous mixture that still contains the same substances as before but in a different structure.
One could define a homogeneous array of a mixture of only lowercase ASCII letters. It would be homogeneous by one interpretation, but likely not all the same character, therefore heterogeneous.
Vocabulary matters. Say what you mean (and mean what you say!)
Boo!
Sorry... Just some thrill-seeking spookiness on a howling-wind night! :-)
XY?
Delightfully referenced and linked from the Wikipedia page for "Regret" is the psychology concept of "System One and System Two Thinking". Possibly relevant to this question as the function's purpose and use has the aroma of "checking that all elements of an array of flag bits/bytes have the same value" where the value or indices of any black sheep byte(s) is/are not important. Perhaps some sort of watch dog sniffer run at certain intervals???
(This is "System Two" as some time has passed since this answer was last updated...)
Should this speculation about the function's purpose come close to the mark, the OP is advised to re-think this mechanism.
Sufficient might be a single "Set/Reset" flag and a single register that "remembers" and compares the results of a current operation with only one previous operation. When two results differ, set the flag and leave it set until the caller of this current function wants to verify all is well or work is needed. It's only when that work is done that the S/R flag is reset for the next detection.
This would reduce memory requirements and amortise the effort to scan an array (unknown size) across all the operations that are occurring.
Irrespective of this speculation's appropriateness, it must be noted that developing software is fertile ground for germination and growth of 'regret'...
- The other standard-library function that could do the job is strspn, at least if your first byte isn't 0, the C string terminator. But it normally only checks one byte at a time after making a lookup table, since it takes a whole set of accept-characters. It can use SSE4.2 pcmpistri for small-enough accept-sets, but that's still not as fast as a simple byte compare. Oh, and it works on C strings so would need a terminator or mismatching byte following the array. There's no memspn even in glibc. (Peter Cordes, Jun 9 at 18:00)
- Anyway, your overlapping memcmp trick could probably go somewhat less than half as fast as optimal SIMD machine code on arrays that are already hot in L1d cache, on typical modern x86. For arrays farther away from the CPU, probably still limited by cache/memory bandwidth getting the data to L1d. (Peter Cordes, Jun 9 at 18:04)
- In usage like this (C++ templates, C generic functions, or examples/pseudocode), T is the same type in both appearances, which is very bad for T size. Just use size_t size because you don't want the size-type to be parameterized. If you wanted to use a different wildcard type, use T* arr, S size. (Peter Cordes, Jun 9 at 22:01)
- One way to work around it is to write C that always touches every element, with no early-out. e.g. mismatches |= (arr[i] != arr[0]); or allmatch &= (arr[i] == arr[0]). That's bad for large arrays with a mismatch near the start, but might be acceptable in some cases. See the "how to auto-vectorize an array comparison function" question for an example of auto-vectorization of counting matches. (Peter Cordes, Jun 9 at 22:10)
- Genius! I somehow got it into my head that the regions passed to memcmp could not overlap, even though a quick look at the docs would have indicated otherwise. (Xavier Pedraza, Jun 9 at 22:26)
Your Algorithm is Buggy
It only checks for an odd block size on the first iteration. If any partition has an odd size, the algorithm fails to check the final byte. So the test driver
const char test_data[] = "aabaab"; // Adds terminating null byte.
printf( "\"%s\" %s homogeneous.\n",
        test_data,
        check_homogeneous(test_data, sizeof(test_data)-1U) ? "is" : "is not" );
prints "aabaab" is homogeneous.
Other Suggestions
I recommend using const wherever you can. This mainly lets the compiler check for bugs, but sometimes can help the optimizer out too. You already declare pointers to const unsigned char, but these could be const pointers as well.
C has functions, not methods (unlike Java or Python).
Your algorithm appears to work properly when called with check_homogeneous(NULL, 0). Great! It’s good practice in general to check for logic errors such as null pointers, especially on systems where dereferencing a null pointer will not crash at runtime. If you will be subtracting pointers, which gives you an index of type ptrdiff_t, you additionally need to check that the input size is not greater than PTRDIFF_MAX.
Or if there are conditions the caller must check, at the VERY least put a big warning in a comment right above the function about this behavior!
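A sketch of what such guards might look like at the top of the function (assert is from <assert.h>, PTRDIFF_MAX from <stdint.h>):

    assert(region != NULL || region_len == 0); // null region only valid when empty
    assert(region_len <= PTRDIFF_MAX);         // only needed if pointers will be subtracted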
I agree with Peter Cordes’ suggestion to follow the is_homogeneous naming convention. To me, check_homogeneous could mean performing some action depending on homogeneity.
Is Simpler Better?
Your divide-and-conquer algorithm will, in practice, end up performing a linear scan over 2N bytes of input (it compares N/2 bytes to N/2 bytes on the first iteration, N/4 bytes to N/4 bytes on the second iteration, and so on). That is twice as many memory reads as a simple linear scan.
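To make that concrete, a tally for a hypothetical N = 1024:

/* pass 1:  memcmp(512 bytes vs. 512 bytes)  -> 1024 bytes read
 * pass 2:  memcmp(256 vs. 256)              ->  512 bytes read
 * ...
 * pass 10: memcmp(1 vs. 1)                  ->    2 bytes read
 * total: 2046 bytes read, vs. ~1024 for one linear scan
 */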
Let’s compare to a version with the same basic structure that does a short-circuiting linear scan:
bool is_homogeneous(const void *const region, const size_t region_len) {
    if (region_len == 0) {
        return true; // Nullary reduction.
    }
    assert(region != NULL); // Logic error!
    // If this line is reached, region points to a block of more than zero bytes.
    const unsigned char* const byte_region = region;
    const unsigned char compare_to = *byte_region;

    for (size_t i = 1U; i < region_len; ++i) {
        if (byte_region[i] != compare_to) {
            return false;
        }
    }
    return true;
}
GCC 15.1 with -std=c23 -march=x86-64-v4 -O3 is capable of partly automatically vectorizing this loop, but Clang 20.1, ICX 2024.0 and MSVC 19.43 are not. See on Godbolt. GCC generates this block:
.L24:
        add     rdx, 64
        vpaddq  zmm0, zmm0, zmm2
        cmp     rdx, r9
        je      .L44
.L26:
        vmovdqa64 zmm1, ZMMWORD PTR [rdx]
        vpcmpb  k0, zmm1, zmm3, 4
        kortestq k0, k0
        je      .L24
        vmovq   rax, xmm0
        vzeroupper
        jmp     .L27
If we take out the short-circuiting to make this a pure reduction operation, the LLVM compilers are now able to auto-vectorize it, but GCC 15.1 gets confused and can no longer auto-vectorize.
bool is_homogeneous(const void *const region, const size_t region_len) {
    if (region_len == 0) {
        return true; // Nullary reduction.
    }
    assert(region != NULL); // Logic error!
    // If this line is reached, region points to a block of more than zero bytes.
    const unsigned char* const byte_region = region;
    const unsigned char compare_to = *byte_region;
    bool all_equal_so_far = true;

    for (size_t i = 1U; i < region_len; ++i) {
        all_equal_so_far &= (byte_region[i] == compare_to);
    }
    return all_equal_so_far;
}
What if we whack the compiler in the head with a big clue-by-four?
/* Hey, compiler! Maybe think about a SIMD reduction loop? I enabled
 * both AVX-512bw and AVX-512vl instructions, if you want to use them.
 */
#pragma omp simd reduction(&:all_equal_so_far)
The version of MSVC on Godbolt doesn’t understand this at all, even though the documentation claims to support it, and it makes no difference in any other compiler.
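For clarity, a sketch of where that pragma sits in the reduction version above (GCC and Clang need -fopenmp or -fopenmp-simd to honor it):

    bool all_equal_so_far = true;
    #pragma omp simd reduction(&:all_equal_so_far)
    for (size_t i = 1U; i < region_len; ++i) {
        all_equal_so_far &= (byte_region[i] == compare_to);
    }
    return all_equal_so_far;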
To get every compiler to vectorize the loop, we have to refactor to get a bit closer to the native binary representation:
bool is_homogeneous(const void *const region, const size_t region_len) {
    if (region_len == 0) {
        return true; // Nullary reduction.
    }
    assert(region != NULL); // Logic error!
    // If this line is reached, region points to a block of more than zero bytes.
    const unsigned char* const byte_region = region;
    const unsigned char compare_to = *byte_region;
    unsigned char accumulator = 0;

    for (size_t i = 1U; i < region_len; ++i) {
        accumulator |= (byte_region[i] != compare_to ? 0xFFU : 0x00U);
    }
    return !accumulator;
}
This does generate vectorized code for Clang, GCC, ICX and MSVC, but only Clang produces good code for it. And this is at the cost of always checking every byte in the region. You can evaluate for yourself on Godbolt.
If we want to do better, we’re going to need extreme measures.
A Vectorized Loop
Peter Cordes goes over a lot of this, but there are some things I respectfully disagree with (overflow on a signed type such as ptrdiff_t is formally undefined behavior, originally to support weird old CPUs that would hardware-fault, but the current generation of C compilers takes that as permission to introduce security faults without warning) and some where I’ll suggest a different approach.
What we ideally want mainstream 2025 hardware to do is to load the bytes of the block in aligned chunks, perform branchless SIMD comparisons on each byte, reduce them with a bitwise operation, and short-circuit if we encounter an inhomogeneous byte in the current chunk. For this project, I’ll be compiling with -mavx512bw -mavx512vl -mprefer-vector-width=256 and using naturally-aligned 32-byte chunks of memory, but if you need portability, you can set these constants inside #elif blocks.
#define V_BYTES 32UL
#define V_ALIGN 32UL
One optimization Peter does that I don’t here is to load a larger number of bytes at once, in between the unpredictable branches. You could do the same by increasing V_BYTES to 64, 128 or 256 bytes (although then you do need to change the program logic a bit to make sure not to skip any range of memory).
After reading the first byte, we can process the remainder of the block in one step if we can load the remaining bytes into a native vector. Otherwise, we want to partition the block into:
- the first byte, which we compare all others to
- all bytes between the first byte and the first aligned address within the block, which we will call left
- a middle section of any number of aligned chunks
- all bytes from the last aligned address in the block, which we will call right, to the end of the block
This algorithm requires us to find the first and last vector-aligned address in the block. This brings us to the annoyance that there’s a function to align pointers in the C++ standard library, but not in standard C, and different compilers need different code for it. ICX needs a built-in function that Clang also supports; GCC does not have the built-in function but does have one to tell the compiler that a pointer is aligned. MSVC doesn’t generate any different code for aligned and unaligned loads, but modern CPUs will still run faster if loads and stores are aligned at runtime. Finally, only GCC short-circuits preprocessor conditionals, so I can’t write code like #if defined(__has_builtin) && __has_builtin(__builtin_align_up).
All of which is to say, here’s my boilerplate for the helper macros. It’s not pretty, but it compiles without warnings on every compiler I tested.
#ifndef __has_builtin
// MSVC, etc.
// WARNING: The arguments may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_up(ptr, align) \
((void*)(((uintptr_t)(void*)(ptr) + (align) - 1U) & (uintptr_t)-(intptr_t)(align)))
#elif __has_builtin(__builtin_align_up)
// Clang, ICX, ICPX. ICX specifically needs the builtin, not bit manipulation.
// align MUST be a power of 2.
# define p_align_up(ptr, align) \
__builtin_align_up((ptr), (align))
#elif __has_builtin(__builtin_assume_aligned)
// GCC
// WARNING: The arguments may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_up(ptr, align) \
__builtin_assume_aligned(\
(void*)(((uintptr_t)(void*)(ptr) + (align) - 1U) & (uintptr_t)-(intptr_t)(align)),\
(align))
#else
// Other compilers, including MSVC
// WARNING: align may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_up(ptr, align) \
((void*)(((uintptr_t)(void*)(ptr) + (align) - 1U) & (uintptr_t)-(intptr_t)(align)))
#endif
#ifndef __has_builtin
// MSVC, etc.
// WARNING: The arguments may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_down(ptr, align) \
((void*)((uintptr_t)(void*)(ptr) & (uintptr_t)-(intptr_t)(align)))
#elif __has_builtin(__builtin_align_down)
// Clang, ICX, ICPX. ICX specifically needs the builtin, not bit manipulation.
// align MUST be a power of 2.
# define p_align_down(ptr, align) \
__builtin_align_down((ptr), (align))
#elif __has_builtin(__builtin_assume_aligned)
// GCC
// WARNING: align may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_down(ptr, align) \
__builtin_assume_aligned(\
(void*)((uintptr_t)(void*)(ptr) & -(uintptr_t)(align)),\
(align))
#else
// Other compilers
// WARNING: The arguments may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_down(ptr, align) \
((void*)((uintptr_t)(void*)(ptr) & (uintptr_t)-(intptr_t)(align)))
#endif
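A quick usage sketch of those helpers (the buffer and sizes are illustrative, not part of the module):

    unsigned char buf[100];
    // First 32-byte boundary at or after buf + 1:
    const unsigned char *first_aligned = p_align_up(buf + 1, 32UL);
    // Last 32-byte boundary at or before buf + 100:
    const unsigned char *last_aligned  = p_align_down(buf + 100, 32UL);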
Next: If we copy any chunk with fewer than 32 bytes to an aligned buffer that we pad with copies of the target byte, the code to test each chunk for homogeneity becomes identical. So we can factor out this repeated code into a helper function, instead of copy-pasting it four times.
static inline bool is_hg_chunk(const unsigned char v[V_BYTES], const unsigned char c) {
    uint8_t accumulator = 0x00U;
    for (unsigned i = 0; i < V_BYTES; ++i) {
        accumulator |= (v[i] != c) ? 0xFFU : 0x00U;
    }
    return !accumulator;
}
When passed a pointer the compiler can deduce is aligned, Clang with -march=x86-64-v3 generates the assembly
        vmovdqa ymm0, ymmword ptr [rsp + 32]
        vpxor   ymm0, ymm0, ymmword ptr [rsp]
        vptest  ymm0, ymm0
        sete    al
GCC also generates aligned-load instructions. Its assembly is
        vpcmpeqb ymm0, ymm0, YMMWORD PTR [rsp]
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqb ymm0, ymm0, ymm1
        vptest  ymm0, ymm0
        sete    al
And ICX produces
        vmovd   xmm0, r14d
        vpbroadcastb ymm0, xmm0
        vpxor   ymm0, ymm0, ymmword ptr [rsp]
        vptest  ymm0, ymm0
        sete    al
MSVC is noticeably worse than these three. It performs the reduction with vpextract and a series of shifts, and also doesn’t generate aligned-load instructions (although those only matter on older CPUs). But even it can vectorize the comparison:
        vmovdqu ymm1, YMMWORD PTR tv237[rbp]
        vpcmpeqb ymm0, ymm1, YMMWORD PTR buffer1ドル[rbp]
        vpandn  ymm4, ymm0, YMMWORD PTR __ymm@ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
Putting it All Together
I'll go ahead and post the full module that calculates the prefix, suffix and each aligned chunk in between, using the above is_hg_chunk function:
#include <assert.h>
#include <stdalign.h> // Needed by MSVC 17.13
#include <stdbool.h> // Needed by MSVC 17.13
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#define V_BYTES 32UL
#define V_ALIGN 32UL
#ifndef __has_builtin
// MSVC, etc.
// WARNING: The arguments may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_up(ptr, align) \
((void*)(((uintptr_t)(void*)(ptr) + (align) - 1U) & (uintptr_t)-(intptr_t)(align)))
#elif __has_builtin(__builtin_align_up)
// Clang, ICX, ICPX. ICX specifically needs the builtin, not bit manipulation.
// align MUST be a power of 2.
# define p_align_up(ptr, align) \
__builtin_align_up((ptr), (align))
#elif __has_builtin(__builtin_assume_aligned)
// GCC
// WARNING: The arguments may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_up(ptr, align) \
__builtin_assume_aligned(\
(void*)(((uintptr_t)(void*)(ptr) + (align) - 1U) & (uintptr_t)-(intptr_t)(align)),\
(align))
#else
// Other compilers, including MSVC
// WARNING: align may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_up(ptr, align) \
((void*)(((uintptr_t)(void*)(ptr) + (align) - 1U) & (uintptr_t)-(intptr_t)(align)))
#endif
#ifndef __has_builtin
// MSVC, etc.
// WARNING: The arguments may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_down(ptr, align) \
((void*)((uintptr_t)(void*)(ptr) & (uintptr_t)-(intptr_t)(align)))
#elif __has_builtin(__builtin_align_down)
// Clang, ICX, ICPX. ICX specifically needs the builtin, not bit manipulation.
// align MUST be a power of 2.
# define p_align_down(ptr, align) \
__builtin_align_down((ptr), (align))
#elif __has_builtin(__builtin_assume_aligned)
// GCC
// WARNING: align may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_down(ptr, align) \
__builtin_assume_aligned(\
(void*)((uintptr_t)(void*)(ptr) & -(uintptr_t)(align)),\
(align))
#else
// Other compilers
// WARNING: The arguments may be evaluated multiple times.
// align MUST be a power of 2.
# define p_align_down(ptr, align) \
((void*)((uintptr_t)(void*)(ptr) & (uintptr_t)-(intptr_t)(align)))
#endif
typedef struct byte_vector {
    alignas(V_ALIGN) unsigned char bytes[V_BYTES];
} byte_vector;

static inline bool is_hg_chunk(const unsigned char v[V_BYTES], const unsigned char c) {
    uint8_t accumulator = 0x00U;
    for (unsigned i = 0; i < V_BYTES; ++i) {
        accumulator |= (v[i] != c) ? 0xFFU : 0x00U;
    }
    return !accumulator;
}
bool is_homogeneous(const void *const region, const size_t region_len) {
    if (region_len == 0) { // is_homogeneous(NULL, 0) is not a bug.
        return true; // Nullary reduction.
    }
    assert(region != NULL); // Logic error!
    // If this line is reached, region points to a block of more than zero bytes.
    const unsigned char* const byte_region = region;
    const uint8_t compare_byte = *byte_region;

    if (region_len <= V_BYTES + 1U) {
        alignas(V_ALIGN) byte_vector buffer;
        memset(buffer.bytes, compare_byte, V_BYTES);
        memcpy(buffer.bytes, byte_region+1, region_len-1U);
        return is_hg_chunk(buffer.bytes, compare_byte);
    }

    // The first aligned address in the memory region
    const unsigned char* const left = p_align_up(byte_region+1, V_ALIGN);
    // The last aligned address in the memory region
    const unsigned char* const right = p_align_down(byte_region + region_len, V_ALIGN);
    // The end of the memory region
    const unsigned char* const end = byte_region + region_len;
    const size_t prefix_len = (size_t)(left - byte_region - 1);
    const size_t suffix_len = (size_t)(end - right);

    // Always, always, always check for buffer overruns!
    assert(prefix_len < V_BYTES);
    assert(suffix_len < V_BYTES);

    // Check the prefix as one chunk.
    if (prefix_len > 0U) {
        byte_vector buffer;
        memset(buffer.bytes, compare_byte, V_BYTES);
        memcpy(buffer.bytes, byte_region+1, prefix_len);
        if (!is_hg_chunk(buffer.bytes, compare_byte)) {
            return false;
        }
    }

    // Check the aligned portion of the region.
    /* Most compilers can deduce from the facts that left is aligned and p
     * increments by the alignment that p is aligned.
     */
    for (const unsigned char* p = left; p < right; p += V_BYTES) {
        if (!is_hg_chunk(p, compare_byte)) {
            return false;
        }
    }

    // Check the suffix as one chunk.
    if (suffix_len > 0U) {
        byte_vector buffer;
        memset(buffer.bytes, compare_byte, V_BYTES);
        memcpy(buffer.bytes, right, suffix_len);
        if (!is_hg_chunk(buffer.bytes, compare_byte)) {
            return false;
        }
    }
    return true;
}
And a simple test driver:
#include <stdio.h>
#include <stdlib.h>

#define TEST_SIZE 1024U

int main(void) {
    static char test_data[TEST_SIZE + 1U];
    memset(test_data, 'a', TEST_SIZE);
    test_data[TEST_SIZE] = '0円';
    // test_data[TEST_SIZE - 1U] = 'b';
    printf( "\n\"%.8s\"... %s homogeneous.\n",
            test_data,
            is_homogeneous(test_data, TEST_SIZE) ? "is" : "is not" );
    return EXIT_SUCCESS;
}
The critical for loop that will run the most times for a large region compiles on Clang 20.1 to
.LBB0_11:
        vpxor   ymm1, ymm0, ymmword ptr [r13]
        vptest  ymm1, ymm1
        jne     .LBB0_14
        add     r13, 32
        cmp     r13, rax
        jb      .LBB0_11
You can test for yourself on Godbolt.
Update
Although this isn’t especially germane to a review of your code, I went ahead and tweaked my version based on feedback. That gets Clang 20.1 with -mavx2 to generate the critical loop
.LBB0_10:
        vpcmpeqb ymm2, ymm0, ymmword ptr [r14 + 96]
        vpxor   ymm2, ymm2, ymm1
        vpcmpeqb ymm3, ymm0, ymmword ptr [r14 + 64]
        vpxor   ymm3, ymm3, ymm1
        vpor    ymm2, ymm2, ymm3
        vpcmpeqb ymm3, ymm0, ymmword ptr [r14 + 32]
        vpxor   ymm3, ymm3, ymm1
        vpcmpeqb ymm4, ymm0, ymmword ptr [r14]
        vpxor   ymm4, ymm4, ymm1
        vpor    ymm3, ymm3, ymm4
        vpor    ymm2, ymm2, ymm3
        vpsllw  ymm2, ymm2, 7
        vpmovmskb ecx, ymm2
        test    ecx, ecx
        jne     .LBB0_18
        sub     r14, -128
        cmp     r14, rax
        jbe     .LBB0_10
With -mavx512bw -mavx512vl -mprefer-vector-width=256, Clang gives us
.LBB0_10:
        vpcmpneqb k0, ymm0, ymmword ptr [r14 + 96]
        vpcmpneqb k1, ymm0, ymmword ptr [r14 + 64]
        vpcmpneqb k2, ymm0, ymmword ptr [r14 + 32]
        kord    k0, k0, k1
        vpcmpneqb k1, ymm0, ymmword ptr [r14]
        kord    k1, k2, k1
        kord    k0, k0, k1
        kortestd k0, k0
        jne     .LBB0_18
        sub     r14, -128
        cmp     r14, rax
        jbe     .LBB0_10
ICX 2024 is similar but partially unrolls; GCC 15.1 now does a series of vmovdqa/vpcmpb followed by unnecessary shuffling, instead of the two vpcmpeqb instructions it was doing before; and MSVC is the laggard.
The benefit here should be more loads per cycle, which will all be aligned, and fewer unpredictable branches. On the other hand, there might be up to three unnecessary loads.
To my surprise, refactoring to use 32-bit SIMD lanes makes the generated code significantly worse.
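For the curious, the 32-bit-lane refactor was along these lines (a hypothetical reconstruction, not the exact code tried): broadcast the byte into a 32-bit word, then XOR four bytes per lane.

static inline bool is_hg_chunk32(const uint32_t v[V_BYTES / 4U], const uint32_t c4) {
    uint32_t accumulator = 0U;
    for (unsigned i = 0; i < V_BYTES / 4U; ++i) {
        accumulator |= v[i] ^ c4; // nonzero iff any byte in the lane differs
    }
    return accumulator == 0U;
}
// where c4 = compare_byte * 0x01010101U replicates the byte into all four lanes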
- @user555045 Right, the penalty comes from a load crossing a cache-line boundary, or worse, a page boundary. I see I just said "aligned" for both aligned instructions and the runtime alignment of the pointer. I see how that was misleading. It matters much more when doing non-temporal loads/stores, which really must be aligned. Thanks, will fix. (Davislor, Jun 11 at 6:28)
- Good idea to use xor to get 0 for all-matching instead of -1 from vpcmpeqb. You can OR together multiple results before vptest/branch to amortize that overhead, like I did in my answer. That has the further advantage of enabling vpternlogd with AVX-512 to combine 3 vectors with one instruction starting from a memory source operand, not just from compare results. Or is that just Clang's invention, not your source? (Peter Cordes, Jun 11 at 6:45)
- @PeterCordes The motivation for making V_BYTES and V_ALIGN different constants was so I could do that (perform more loads at once) by increasing V_BYTES, but then I didn’t actually write the algorithm to work correctly if they’re different. (Davislor, Jun 11 at 6:51)
- @PeterCordes The efficient way is probably to do one more loop over any aligned chunks after the final value of p, before the suffix check in there now. (Davislor, Jun 11 at 6:53)
- Re: overflow of ptrdiff_t - My answer doesn't overflow ptrdiff_t for small sizes, just signed subtraction producing a negative. Unsigned wrapping of size_t to a huge size would be a problem for i < size-128 + 1, but for example 0 < -119 is false so the loop doesn't run at all for size = 8 when using a signed type. My answer doesn't work at all for sizes greater than PTRDIFF_MAX, but that's impossible on real-world 64-bit implementations. It is possible in 32-bit code (or ILP32 ABIs) under 64-bit kernels, though, so maybe I should have mentioned it. (Peter Cordes, Jun 11 at 6:54)
Fe2O3's answer shows a great idea, of using libc memcmp(arr, arr+1, size-1) to make a single pass over your data. (I commented under it with a bunch of things which I'm turning into an answer.)
Libc memcmp uses SIMD via hand-written asm on most platforms; it's an important enough function that it's worth hand-tuning. On GNU/Linux for example, it even picks a version at runtime based on CPU features, so it can use AVX2 or AVX-512 if available, even if your program is compiled to still work on older x86-64 CPUs.
That has the downside of loading every byte twice, with half the loads being misaligned. That's definitely a bottleneck if your data is already hot in L1d cache. On modern x86-64 with manually-vectorized code doing 2x _mm256_cmpeq_epi8 (vpcmpeqb) + 1x _mm256_and_si256 (vpand), or a larger reduction of say 8 compares and 4x + 2x vpand + vptest (AND and set FLAGS), most CPUs should be able to keep their vector load ports fully busy (2 or 3 vectors per clock with cache-line splits costing double, more at page splits.) Or will bottleneck on front-end bandwidth. Intel CPUs might bottleneck on 3/clock SIMD ALU operations; AMD Zen family has 4 SIMD ALU ports and at least the AND instructions should be able to run on any port.
With a vector size of 32 bytes, that means one of every 4 loads will be split across 2 cache lines (if the start of the array is aligned by 32); that could be a bottleneck even with data coming from L2 cache. But not much of a bottleneck. And if the data is coming from farther away, the extra work for the core itself can happen while waiting for HW prefetch and demand loads to bring in the data. For non-small arrays, the memcmp idea should perform about the same as the best you can do with manually-vectorized code, at least on x86 where unaligned vector loads are cheap. It will use more execution resources than optimal code, which matters if sharing a physical core with another thread (hyperthreading / SMT), and a bit for thermals / power usage.
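As a sketch of that 2x compare + AND pattern (illustrative AVX2 intrinsics; the helper name and the fixed 64-byte chunk are my framing, with key broadcast once via _mm256_set1_epi8(arr[0])):

#include <immintrin.h>

// Check one 64-byte chunk: two compares, one AND, one movemask.
static bool chunk64_matches(const unsigned char *p, __m256i key) {
    __m256i a = _mm256_loadu_si256((const __m256i *)p);
    __m256i b = _mm256_loadu_si256((const __m256i *)(p + 32));
    __m256i m = _mm256_and_si256(_mm256_cmpeq_epi8(a, key),  // vpcmpeqb
                                 _mm256_cmpeq_epi8(b, key)); // vpcmpeqb + vpand
    return _mm256_movemask_epi8(m) == -1; // all 32 lanes matched in both vectors
}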
Auto-vectorization
Compilers generally don't auto-vectorize loops with an early-out condition (because the trip-count can't be computed before the first iteration). That makes it basically impossible to get good asm from a compiler for your own memcmp-style loop. ICC Classic (not LLVM-based ICX) could do that sometimes, and if I recall correctly the latest GCC or Clang might now be able to do it sometimes, too.
Another challenge for the way you wrote your loop is that the code has to work for every case that doesn't have undefined behaviour. For example, passing a 2-byte array with different values, and a size of 1 billion. Even if those 2 bytes are at the end of a page followed by an unmapped page, so a 16-byte load would segfault. See Is it safe to read past the end of a buffer within the same page? (Yes in asm, but hard to do in C without undefined behaviour. It's what libc str* functions like strlen do.)
You can write int foo(char arr[static 1024]) to tell the compiler that the whole array is really there, but GCC and Clang don't take advantage of that hint, last I checked.
One way to work around it is to write C that always touches every element, with no early-out, e.g. mismatches |= (arr[i] != arr[0]); or allmatch &= (arr[i] == arr[0]). That's bad for large arrays with a mismatch near the start, but might be acceptable in some cases. See https://stackoverflow.com/questions/42146125/how-to-auto-vectorization-array-comparison-function and https://stackoverflow.com/questions/2741859/how-fast-can-you-make-linear-search/31509388#31509388 for examples.
We can even apply that technique in chunks, to make an inner loop that will vectorize and fully unroll, inside an outer loop that checks 128 bytes at a time.
By thinking about the x86-64 asm I wanted, I wrote this code to lead a C compiler in that direction:
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
//#include <unistd.h>  // ssize_t to make the size loop condition not wrap for small size
typedef ptrdiff_t SIZE_TYPE;  // ssize_t is POSIX, not ISO C.

int is_homogeneous(unsigned char *arr, SIZE_TYPE size)
{
    unsigned char key = arr[0];
    for (SIZE_TYPE pos = 0 ; pos < (size-128 + 1) ; pos += 128) {
        unsigned char allmatch = 0xff;                          // vpbroadcastb or equivalent
        for (int i=0 ; i<128 ; i++){
            unsigned char match = arr[pos+i] == key ? 0xff : 0; // pcmpeqb
            allmatch &= match;                                  // pand
        }
        //if ((allmatch & 0x80) != 0x80) return false; // pmovmskb + cmp reg, 0xffff / jne
        if (allmatch != 0xff) return false; // ptest + a jcc that checks for all-ones, ideally. But compilers don't see that.
    }
    // TODO: tail handling of the last size % 128 bytes
    return true;
}
Godbolt with Clang and GCC -O3. (-O2 gives the same asm; recent GCC does auto-vectorize at -O2 for cases like this when no cleanup is needed.)
GCC does the expected vpcmpeqb / vpand, but makes a total mess of branching on the final 32-byte (256-bit) vector being all-ones or not. It unpacks to 128-bit vectors of 16-bit elements for multiplies (vpmullw), with a huge number of shuffle instructions.
Clang optimizes this inner loop into non-terrible asm, although it makes the braindead choice to optimize the allmatch != 0xff (cmp reg, -1) into tmp == 0 (test reg,reg), which is a tiny saving that comes at the cost of having to invert every compare result so it can OR them together. So there's a vpxor (with -1) + vpor where there would otherwise just be a vpand.
# clang (trunk) -O3 -march=x86-64-v3   (AVX2 + BMI2)
is_homogeneous:
        mov     eax, 1                  # return value
        cmp     rsi, 128
        jl      .LBB0_5
        movzx   ecx, byte ptr [rdi]
        add     rsi, -127
        vmovd   xmm0, ecx
        vpbroadcastb ymm0, xmm0         # YMM0 = _mm256_set1_epi8(arr[0])
        xor     ecx, ecx
        vpcmpeqd ymm1, ymm1, ymm1       # YMM1 = all-one bits = set1(-1)
.LBB0_3:
        vpcmpeqb ymm2, ymm0, ymmword ptr [rdi + rcx + 96]
        vpxor   ymm2, ymm2, ymm1
        vpcmpeqb ymm3, ymm0, ymmword ptr [rdi + rcx + 64]
        vpxor   ymm3, ymm3, ymm1
        vpor    ymm2, ymm2, ymm3
        vpcmpeqb ymm3, ymm0, ymmword ptr [rdi + rcx + 32]
        vpxor   ymm3, ymm3, ymm1
        vpcmpeqb ymm4, ymm0, ymmword ptr [rdi + rcx]
        vpxor   ymm4, ymm4, ymm1
        vpor    ymm3, ymm3, ymm4
        vpor    ymm2, ymm2, ymm3
        vpsllw  ymm2, ymm2, 7           # it wants the low bit apparently, instead of the high bit.
        vpmovmskb edx, ymm2             # edx = high bit of each byte of ymm2
        test    edx, edx                # set FLAGS according to EDX
        jne     .LBB0_4                 # jump (exit loop, return false) if non-zero
        sub     rcx, -128
        cmp     rcx, rsi
        jl      .LBB0_3                 # size-based loop condition
.LBB0_5:
        vzeroupper
        ret                             # return true; EAX set ahead of the loop
.LBB0_4:
        xor     eax, eax                # return false
        vzeroupper
        ret
Handling the last size % 128 bytes efficiently could take as much or more code. If the total array size is at least 128 bytes, you could do another same-size chunk that ends at the last byte of the array. It's fine if it overlaps with some bytes you already checked.
If sizes are always or normally multiples of 32 or something, that can simplify things, by not handling some cases, or not handling them efficiently.
- @Davislor: Mine compiles like crap even with Clang, using about 1.5x as many vector-ALU instructions as it should (vpxor/vpor instead of vpand, following the vpcmpeqb). Mine also only does loads that are aligned relative to the start of the array, so that's not a difference. Your vpxor/vptest/jcc loop is 6 uops (fused domain), so can run at best 1 load per clock on Alder Lake and later big-cores. (And Zen 2 or later). Or actually less: all 6 uops need an ALU port, and Intel only has 6 ALU ports in Lion Cove (Core Ultra 2 and Lunar Lake), fewer in earlier CPUs including Redwood (5). (Peter Cordes, Jun 11 at 7:22)
- @Davislor: AMD Zen-family has separate schedulers and ports for vector vs. scalar uops, so the back-end can actually keep up with the 3 vector-ALU and 3 scalar-ALU uops per cycle the front-end can feed it. If your compiler did a better job, the vpxor/vptest/jcc (1 + 2 + 1 = 4 uops) would be vpcmpeqb/vpmovmskb/cmp+jcc (1 + 1 + 1 = 3 uops), making the whole loop 5. (Peter Cordes, Jun 11 at 7:27)
- @Davislor: vpxor is great if unrolling and combining multiple results with vpor like Clang seems to want to even when I write the source with &=, and allows AVX-512 vpternlogd, but it defeats taking advantage of the AND part of vptest using two different source vectors to make sure they're both all-ones. (Or wait, maybe you can't figure that out from the FLAGS conditions of one ptest. There's a _mm_test_mix_ones_zeros and a _mm_test_all_zeros, but I forget if their semantic meaning only works when one vector is all-ones.) (Peter Cordes, Jun 11 at 7:30)
- @Davislor: Anyway, in its current state with Clang, my code uses 16 fused-domain uops to check 4 vectors. Oh wait, no, 20 fused-domain uops because LLVM used indexed addressing-modes for the memory-source operands, so those unlaminate on Intel. /facepalm. 13 of those uops need a vector ALU port (port 0, 1, or 5 on Intel CPUs). uica.uops.info predicts Skylake and Rocket Lake will run yours at 1 iter (1 vector) per 2 clocks (although 1.5 might be more realistic; IDK why it thinks both branches need port 6), mine at 1 iter (4 vectors) per 6 clocks (SKL) or 4.4 clocks (ICL/RKL). (Peter Cordes, Jun 11 at 7:36)
- However, three of the four compilers generate much better code with -mavx512bw -mavx512vl -mprefer-vector-width=256 enabled. MSVC at least tries to vectorize. (Davislor, Jun 11 at 17:58)
For those who like the memcmp(region, (char *)region + 1, region_len - 1) approach:
Consider:
- Edge cases: What if region_len == 0?
bool check_homogeneous_alt1(const void *region, size_t region_len) {
    return region_len == 0 ||
           memcmp(region, (char *)region + 1, region_len - 1) == 0;
}
- Alignment: As region may be nicely aligned, use like-wise aligned pointers for the first memcmp() (and likely most time-consuming) call. memcmp() can handle unaligned compares, yet common implementations work best with aligned ones.
// Unchecked code. Will review later.
/*
 * Return true when memory is homogeneous.
 *
 * First compare data 1 "chunk" away.
 * This "chunk" is `sizeof(max_align_t)` bytes away.
 *
 * Then compare 1 byte away for the "leftovers".
 */
bool check_homogeneous_alt2(const void *region, size_t region_len) {
    const unsigned char *uregion = region;
    size_t leftover = region_len;
    // Do we have at least 2 chunks?
    if (region_len >= sizeof(max_align_t)*2) {
        leftover = region_len % sizeof(max_align_t);
        if (memcmp(uregion, uregion + sizeof(max_align_t),
                   region_len - leftover - sizeof(max_align_t))) {
            return false;
        }
        // - 1 is to check that the first region is like the remainder.
        uregion += region_len - leftover - 1;
        leftover++;
    }
    // With fewer than 2 chunks, leftover is the whole region_len.
    if (leftover > 1) {
        if (memcmp(uregion, uregion + 1, leftover - 1)) {
            return false;
        }
    }
    return true;
}
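And a tiny sanity-check sketch for both (the 37-byte size is arbitrary, chosen to exercise the leftover path):

#include <assert.h>
#include <string.h>

int main(void) {
    unsigned char a[37];
    memset(a, 0x5A, sizeof a);
    assert(check_homogeneous_alt1(a, sizeof a));
    assert(check_homogeneous_alt2(a, sizeof a));
    a[sizeof a - 1] ^= 1u; // flip a bit in the last byte
    assert(!check_homogeneous_alt1(a, sizeof a));
    assert(!check_homogeneous_alt2(a, sizeof a));
    return 0;
}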
- You've beaten me to the punch. Your comment to "guard against 0-1" was on my to-do list to perform with the morning's first coffee... Good that you spotted that! I was going to use size < 2 to shortcut, since a single byte is always the same as itself, too. Cheers, and thanks for your aligning version in C. The 'low level' versions in other answers here are really impressive, but ironically being 'low', it's well over my head. :-) (user272752, Jun 11 at 21:25)
- You might as well just check region_len <= 1 since there's no need to run a length=0 memcmp. (Peter Cordes, Jul 8 at 14:46)
- @PeterCordes, With size_t len, either region_len == 0 or region_len <= 1 will work fine here. I expect such small len values to be rare out there in the wild and so prefer the region_len == 0 test as simpler to encode and execute for the lion's share of cases. YMMV. (chux, Jul 8 at 17:03)
- @chux: Fair enough, saves a byte on x86 (cmp reg,imm8/jcc vs. test/jcc), but on AArch64 could allow cbz or cbnz, saving a whole instruction. And on RISC-V allows beq with the zero-register instead of needing to materialize a 1 or 2 constant. But wait, we need to sub reg, 1 to get region_len - 1 for the call, so could branch on flags from that (below-or-equal). But if you need FLAGS, you can't use LEA to copy-and-sub, so yeah potentially some savings. (Peter Cordes, Jul 8 at 17:38)
- @PeterCordes "sub reg, 1 to get region_len - 1 for the call, so could branch on flags from that (below-or-equal)" -- would not that also consider sizes in the upper-half size_t range as negative? (chux, Jul 9 at 0:33)
Lots of answers addressing correctness and performance; I'll add a look at usability and style.
Using an active verb such as "check" to start the name suggests a procedure rather than a function (in C, that means returning void rather than a value). Prefer to name predicate functions such as this by beginning with is_... or has_... etc. (note that names beginning with is followed directly by a letter are reserved for future C standards, so don't omit the underscore!).
"Homogeneous" is a term that's overloaded in some contexts to refer to types of memory, so I might use "uniform" instead, giving is_uniform()
. Or perhaps has_one_value()
.
Consider choosing names for the formal parameters that are familiar from the C Standard Library specification. In this case, following the pattern of memchr() would give us s and n for the names.
We don't need a cast here:
bool check_homogeneous(const void *region, size_t region_len) {
    const unsigned char *bytes = (const unsigned char *) region;
A pointer to void converts without a cast to any other object pointer type, and a pointer to const void converts to any pointer to a const object, so we can simply write unsigned char const *const bytes = region without the clutter.
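Putting those suggestions together, a sketch of how the function might read (the body is shown as a simple scan just for illustration):

#include <stdbool.h>
#include <stddef.h>

bool is_uniform(const void *s, size_t n) {
    unsigned char const *const bytes = s; // no cast: const void * converts implicitly
    for (size_t i = 1; i < n; ++i) {
        if (bytes[i] != bytes[0]) {
            return false;
        }
    }
    return true;
}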
- The "comments" section under my answer has become "lengthy" (to say the least!) Please forgive me this piggy-back here where it may be noticed... The OP's post contains text that may be the best function name of all: is_identical_byte_val()... Noted recently, "[descriptive] naming is hard". Cheers! :-) (user272752, Jun 13 at 23:41)
- "[descriptive] naming is hard"... Can't help but think of Elon's kid named 'X'... He'll sign his passport and/or credit card application and people will think he's illiterate... :-) Send me a message and I'll delete this frivolous comment myself... (Can you imagine if X grows up and moves to U in Micronesia? Maybe becomes a "one trick pony" like me, working only in C... This stuff almost writes itself...) :-) (user272752, Jun 14 at 8:24)
- [...] memcmp, you are doing worse than O(n) work since you are scanning the same memory range multiple times, which, depending on the size of the range and of the caches, may have more or less noticeable effects.
- [...] AAbAAbA...?
- [...] memcmp(p, p+step, n-step). To choose a good value for step, consider something that's comfortable for the target hardware (perhaps cache-line length), and you'll need special-case code for shorter inputs.