How does the following C program look to compare to strings and return a case-insensitive sort? The following is what I have so far:
int main(void)
{
char* strings[4] = {"Onus", "deacon", "Alex", "zebra"};
for (int i=0; i<4; i++) printf("%d. %s\n", i+1, strings[i]);
qsort(strings, 4, 8, scmp);
}
int scmp(const void *p1, const void *p2)
{
// being a string (pointer-to-char)
const char* s1 = *(char**) p1;
const char* s2 = *(char**) p2
size_t len_s1 = strlen(s1);
size_t len_s2 = strlen(s2);
// lower-case first string buffer
char s1_lower[len_s1+1]; // cannot initialize with {[len_s1]=0} in VLA ?
s1_lower[len_s1] = 0;
for (int i=0; s1[i] != 0; i++)
s1_lower[i] = tolower(s1[i]);
// lower-case second string buffer
char s2_lower[len_s2+1];
s2_lower[len_s2] = 0;
for (int i=0; s2[i] != 0; i++)
s2_lower[i] = tolower(s2[i]);
// built-in string compare (strcmp) will return the sort value
return strcmp(s1_lower, s2_lower);
/* printf("%s -- %s\n", s1_lower, s2_lower); */
}
Another version extracting out the str_lower
function would be:
void lower_string(char buffer[], const char* str, size_t len)
{
char str_lower[len+1];
str_lower[len] = 0;
for (int i=0; str[i] != 0; i++)
str_lower[i] = tolower(str[i]);
}
int scmp(const void *p1, const void *p2)
{
const char* s1 = *(char**) p1;
const char* s2 = *(char**) p2;
char s1_lower[strlen(s1)+1];
lower_string(s1_lower, s1, strlen(s1));
char s2_lower[strlen(s2)+1];
lower_string(s2_lower, s2, strlen(s2));
return strcmp(s1_lower, s2_lower);
}
2 Answers 2
Firstly, thank you for providing a test program. We could make the test even better, if we make it self-checking:
char const *expected[] = {"Alex", "deacon", "Onus", "zebra"};
for (int i = 0; i < 4; ++i) {
if (strcmp(expected[i], strings[i])) {
return EXIT_FAILURE;
}
}
We might consider including some non-alphabetic characters, too ("should "A-team"
sort before or after "abacus"
?). And we'll want some tests of equal and almost-equal strings; not just ones that differ in the first character.
There were a few errors and warnings to fix up (I compiled with gcc -std=c17 -Wall -Wextra -Wwrite-strings -Wno-parentheses -Wpedantic -Warray-bounds -Wstrict-prototypes -Wconversion
to catch all of these):
We missed a few necessary includes for the comparison function:
#include <ctype.h> #include <stdint.h> #include <string.h>
And for the test program:
#include <stdio.h> #include <stdlib.h>
We're assigning
char*
pointers using string literals. Whilst C allows this for historical reasons, it's dangerous, and should be avoided (because assignment through such pointers is UB, and the compiler can't spot that for you). It's easy to fix that:const char* strings[4] = {"Onus", "deacon", "Alex", "zebra"};
And one that GCC can't spot:
qsort(strings, 4, 8, scmp); // ^^^
We can't assume that a const char*
is a particular size on the target platform. Whilst this value may be correct where you are, it's not true everywhere. Luckily, C provides the sizeof
operator to help:
qsort(strings, 4, sizeof strings[0], scmp);
Looking in detail at the comparison function, it does a lot of extra work to make a copy of each input string (we have to traverse each string up to three times - once to find its length, once to copy and convert case, and once to compare).
Instead of preparing the ground for strcmp()
, we could (and should) write a case-insensitive version of strcmp()
, so we traverse each string once, and downcase only those characters we actually need for the comparison, as we reach them:
#include <ctype.h>
/* N.B. name not beginning with "str". */
int ci_strcmp(const char *s1, const char *s2)
{
while (*s1) {
if (!*s2) {
/* s1 is longer */
return 1;
}
int c1 = tolower(*s1);
int c2 = tolower(*s2);
if (c1 < c2) {
return -1;
}
if (c2 < c1) {
return 1;
}
++s1;
++s2;
}
if (*s2) {
/* s2 is longer */
return -1;
}
/* compare equal */
return 0;
}
/* Adapter for qsort */
int scmp(const void *p1, const void *p2)
{
const char *const *sp1 = p1;
const char *const *sp2 = p2;
return ci_strcmp(*sp1, *sp2);
}
We can make the body of the comparer more compact (but functionally identical) with a little cleverness:
int ci_strcmp(const char *s1, const char *s2)
{
for (;;) {
int c1 = tolower(*s1++);
int c2 = tolower(*s2++);
int result = (c1 > c2) - (c1 < c2);
if (result || !c1) return result;
}
}
As mentioned in another answer, we need to use unsigned char
with tolower()
:
int ci_strcmp(const unsigned char *s1, const unsigned char *s2)
{
for (;;) {
int c1 = tolower(*s1++);
int c2 = tolower(*s2++);
int result = (c1 > c2) - (c1 < c2);
if (result || !c1) return result;
}
}
/* Adapter for qsort */
int scmp(const void *p1, const void *p2)
{
const unsigned char *const *sp1 = p1;
const unsigned char *const *sp2 = p2;
return ci_strcmp(*sp1, *sp2);
}
-
1\$\begingroup\$ An annoying attribute about standard string functions is that their functional interface is
char *
, but the internal workings are as ifunsigned char *
. Unfortunately callingmy_strfoo1(unsigned char *)
with achar *
can give a warning, hence often amy_strfoo2(char *)
obliges a pointer conversion within. As much string processing is ASCII only - this issue is not common and left unaddressed. \$\endgroup\$chux– chux2021年03月05日 17:57:26 +00:00Commented Mar 5, 2021 at 17:57
How does the following C program look to compare to strings and return a case-insensitive sort?
In addition to @Toby Speight fine answer:
Order consideration
Folding values to lower case gives a different sort than folding to uppercase.
Consider "a", "A", "_"
. What order is expected? Is '_'
before or after letters?
Function description should specify its intent.
Non-ASCII values
Values outside the 0-127 are tricky as they may be negative. tolower()
is not defied well for most negative values.
const char* s1 = *(char**) p1;
...
s1_lower[i] = tolower(s1[i]); // Potential UB
Instead handle as unsigned char *
.
const unsigned char* s1 = *(const unsigned char**) p1;
unsigned char s1_lower[len_s1+1];
...
s1_lower[i] = tolower(s1[i]);
Non-ASCII values (more)
Various non-ASCII letters can form a non-1-to-1 mapping between lower and upper case.
How should "E", "e", "è", "é", "ê", "ë", "É"
sort?
Code could round trip
s1_lower[i] = tolower(toupper(s1[i]));
Going forward, this non-ASCII issue concern tends to be a Unicode one.
Candidate caseless compare
int scmpi(const void *p1, const void *p2) {
const unsigned char* s1 = *(unsigned char**) p1;
const unsigned char* s2 = *(unsigned char**) p2;
// Keep loop simple
while ((tolower(*s1) == tolower(*s2)) && *s2) {
s1++;
s2++;
}
int ch1 = tolower(*s1);
int ch2 = tolower(*s2);
return (ch1 > ch2) - (ch1 < ch2);
}
Speedier??
I found my_strcmp_caseless()
about twice as fast, back-in-the-day; useful in a string heavy application.
-
\$\begingroup\$ @David542 You might like Additional pitfalls to watch out for when doing case insensitive compares. \$\endgroup\$chux– chux2021年03月05日 18:10:55 +00:00Commented Mar 5, 2021 at 18:10
-
\$\begingroup\$ thanks for this great answer. What would be an example of how a string could have an
unsigned char
in it? \$\endgroup\$David542– David5422021年03月06日 03:02:15 +00:00Commented Mar 6, 2021 at 3:02 -
1\$\begingroup\$ @David542
unsigned char example[] = { 255, 0 };
. In C: "A string is a contiguous sequence of characters terminated by and including the first null character." \$\endgroup\$chux– chux2021年03月06日 03:33:59 +00:00Commented Mar 6, 2021 at 3:33