Implementing basename, stpcpy, asprintf, vasprintf, strchrnul, and strcasecmp

Question 1

C23 standardized strdup()/strndup(), but left out stpcpy(). stpcpy() differs from strcpy() in that it returns a pointer to the terminating null byte of the copied string, and not a pointer to the starting address of the destination string.

It allows for this:
```
char bigString[1000];
bigString[0] = '\0';
stpcpy(stpcpy(stpcpy(stpcpy(bigString, "John, "), "Paul, "), "George, "), "Joel ");
```
instead of:
```
char bigString[1000]; 
bigString[0] = '\0'; // or strcpy()
strcat(strcat(strcat(strcat(bigString, "John, "), "Paul, "), "George, "), "Joel ");
```
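If you didn't know, strcat() is an instance of Shlemiel the painter’s algorithm: every call rescans the destination from its start to find the end, so a chain of appends takes quadratic time.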
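To make the difference concrete, here is a minimal sketch of appending in a loop while keeping a cursor at the end of the buffer. demo_stpcpy() and demo_concat() are hypothetical names used only for illustration; demo_stpcpy() merely stands in for stpcpy() on systems that lack it, and the caller is assumed to supply a large enough buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for stpcpy(): copy src (including its null byte)
 * and return a pointer to the null byte written into dest. */
char *demo_stpcpy(char *dest, const char *src)
{
    const size_t len = strlen(src);
    memcpy(dest, src, len + 1);
    return dest + len;
}

/* Concatenate n strings into dest, which must be large enough.
 * Each piece is written at the current end, so the destination is never
 * rescanned. Returns the length of the resulting string. */
size_t demo_concat(char *dest, const char *const parts[], size_t n)
{
    char *end = dest;
    *end = '\0';
    for (size_t i = 0; i < n; ++i) {
        end = demo_stpcpy(end, parts[i]);
    }
    return (size_t) (end - dest);
}
```

With strcat() in the loop body instead, every iteration would first walk over everything written so far.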
There are two different versions of basename(): POSIX and GNU. The POSIX version may modify the path argument, and so may segfault when called with a string literal such as "/usr/". The GNU version, which I have tried to follow, does not.
About asprintf() and vasprintf(), I'd quote directly from the man page:

The functions asprintf() and vasprintf() are analogs of sprintf(3) and vsprintf(3), except that they allocate a string large enough to hold the output including the terminating null byte, and return a pointer to it via the first argument. This pointer should be passed to free(3) to release the allocated storage when it is no longer needed.

Unfortunately, ISO C23 left these two goodies out as well.
The strchrnul() function differs from strchr() in that it returns a pointer to the matched character, or a pointer to the null byte at the end of s (i.e., s + strlen(s)) if the character is not found. One place where I've found this useful is for removing the trailing newline from an input string.

With strchr(), one would have done:
```
char *p = strchr(s, '\n');
if (p) {
 *p = '\0';
}
```
With strchrnul(), you can skip the branch and dereference the result directly:
```
*strchrnul(s, '\n') = '\0';
```
The strcasecmp() function performs a byte-by-byte comparison of two strings, ignoring the case of the characters.

Code:

util.h:

#ifndef UTIL_H
#define UTIL_H 1
#include <stdarg.h>
int util_vasprintf(char **restrict strp,
 const char fmt[restrict static 1], 
 va_list ap);
[[gnu::format(printf, 2, 3)]] int util_asprintf(char **restrict strp,
 const char fmt[restrict static 1], 
 ...); 
[[gnu::returns_nonnull]] char *util_stpcpy(char dest[restrict static 1],
 const char src[restrict static 1]);
[[gnu::returns_nonnull]] const char *util_basename(const char path[static 1]);
[[gnu::pure, gnu::returns_nonnull]] char *util_strchrnul(const char s[static 1], 
 int c);
[[gnu::pure]] int util_strcasecmp(const char s[restrict static 1], 
 const char t[restrict static 1]);
#endif /* UTIL_H */

util.c:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdarg.h>
#include "util.h"
int util_vasprintf(char **restrict strp, const char fmt[restrict static 1],
 va_list ap)
{
 va_list ap_copy;
 va_copy(ap_copy, ap);
 const int nwritten = vsnprintf(nullptr, 0, fmt, ap_copy);
 va_end(ap_copy);
 if (nwritten < 0) {
 goto fatal;
 }
 *strp = malloc((size_t) nwritten + 1);
 if (*strp == nullptr) {
 goto fatal;
 }
 const int status = vsprintf(*strp, fmt, ap);
 if (status < 0) {
 free(*strp);
 goto fatal;
 }
 return status;
 fatal:
 /* The BSD implementation sets *strp to nullptr on failure. Linux's leaves
 * the contents undefined. Neither C nor POSIX has standardized this
 * function as of yet. */
 *strp = nullptr;
 return -1;
}
int util_asprintf(char **restrict strp, const char fmt[static 1], ...)
{
 va_list argp;
 va_start(argp, fmt);
 *strp = nullptr;
 int nwritten = util_vasprintf(strp, fmt, argp);
 va_end(argp);
 return nwritten;
}
char *util_stpcpy(char dest[restrict static 1],
 const char src[restrict static 1])
{
 const size_t len = strlen(src);
 return (char *) memcpy(dest, src, len + 1) + len;
}
const char *util_basename(const char path[static 1])
{
 const char *const cp = strrchr(path, '/');
 return cp ? cp + 1 : path;
}
char *util_strchrnul(const char s[static 1], int c) 
{
 while (*s) {
 if (*s == c) {
 break;
 }
 }
 return (char *) s;
}
int util_strcasecmp(const char s[restrict static 1], 
 const char t[restrict static 1])
{
 int p, q;
 do {
 p = *s++;
 if (islower((unsigned char) p)) {
 p = toupper((unsigned char) p);
 }
 
 q = *t++;
 if (islower((unsigned char) q)) {
 q = toupper((unsigned char) q);
 }
 } while (p == q && q != '\0');
 return p - q;
}
#ifdef TEST_MAIN
#include <assert.h>
int main(void)
{
 assert(strcmp(util_basename("/usr/lib"), "lib") == 0);
 assert(strcmp(util_basename("/usr/"), "") == 0);
 assert(strcmp(util_basename("usr"), "usr") == 0);
 assert(strcmp(util_basename("/"), "") == 0);
 assert(strcmp(util_basename("."), ".") == 0);
 assert(strcmp(util_basename(".."), "..") == 0);
 assert(util_strcasecmp("aPplE", "APPLE") == 0);
 assert(util_strcasecmp("apple", "apple") == 0);
 assert(util_strcasecmp("HELLO", "HELLO") == 0);
 assert(util_strcasecmp("", "") == 0);
 assert(util_strcasecmp("HrLLO", "HELLO"));
}
#endif

I was unable to think of meaningful tests for the other functions.

Review Request:

Are these implementations correct? Do you see any bugs or inefficiencies in the code?

Are there any other useful attributes I can attach to these functions?

Question 2

If you know of any other non-standard useful string function, do mention it.

Question 3

Another note about strchrnul(): a readlines() function can avoid N branches if it replaces strchr() with strchrnul(). But there's strcspn() too, which can be used instead.
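For reference, the strcspn() variant can be sketched like this. demo_chomp() is a made-up name; strcspn() itself is standard C and returns the length of the initial segment that contains none of the listed characters, which is the index of the first newline or, failing that, of the terminating null byte.

```c
#include <string.h>

/* Remove everything from the first '\n' onwards, without branching.
 * If there is no newline, this overwrites the null byte with itself. */
void demo_chomp(char *s)
{
    s[strcspn(s, "\n")] = '\0';
}
```

Unlike strchrnul(), this needs no non-standard function, at the cost of passing a set of characters rather than a single one.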

Question 4

@Fe2O3 while (*s && *s != c) { ++s; } return (char *)s; ==> I really shouldn't try stuffing too much in one line. :)

Question 5

@Fe2O3 The first version was a for loop: for (;; ++s) { if (*s == '\0' || *s == c) { return (char *)s; } } :)

Question 6

@Fe2O3 Oh yes it has, I'd add it in the while loop.

Question 7

bump the pointer

char *util_strchrnul(const char s[static 1], int c) 
{
 while (*s) {
 if (*s == c) {
 break;
 }
 }
 return (char *) s;
}

Pretty sure you're going to want to ++ increment s at some point. If it's initially pointing at non-NUL and != c, then it never moves and never exits.

It's a trivial mistake, no biggie. The larger mistake was neglecting code coverage measurements. A line of source that never executed is likely a buggy line. We systematically exercise the codebase to discover such issues.
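A corrected loop might look like the sketch below. demo_strchrnul() is a placeholder name, and converting c to char before comparing is my assumption, made to mirror how strchr() is specified; it is not something the original code does.

```c
/* Return a pointer to the first occurrence of c in s, or to the
 * terminating null byte if c does not occur. */
char *demo_strchrnul(const char *s, int c)
{
    const char ch = (char) c;   /* compare as char, like strchr() */

    while (*s != '\0' && *s != ch) {
        ++s;                    /* the increment the original lacked */
    }
    return (char *) s;
}
```

Inputs worth covering in a test: a string where c is present, one where it is absent, the empty string, and c == '\0'. The second case is the one that would have hung with the posted version.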

locale

util_strcasecmp() should document that it only handles US-ASCII rather than UTF-8 code points. It sticks to just the "C" locale, avoiding wide character mapping.

 assert(util_strcasecmp("äpfel", "Äpfel") == 0);
 assert(util_strcasecmp("fußgängerübergänge",
 "Fußgängerübergänge") == 0);

I see no reason for islower() tests and the associated pipeline stalls due to branch misprediction. We could just unconditionally assign the case smashed value, as non-alpha characters go through the identity mapping, coming back as themselves.

The man page makes an interesting observation about "missing" mappings.

The details of what constitutes an uppercase or lowercase letter depend on the current locale. For example, the default "C" locale does not know about umlauts, so no conversion is done for them.

In some non-English locales, there are lowercase letters with no corresponding uppercase equivalent; the German sharp ß is one example.

Ordinarily I would look at {upper, lower} and say "meh!" {six, half-dozen}, doesn't matter which way we case smash. But that last item seems to suggest it would be more convenient to smash to lower. The effect will be the same. It's a little surprising that after seeing islower() and then assigning p = toupper(p) we cannot safely assert isupper(p), yet that's what the man page explains.

Question 8

"Pretty sure you're going to want to ++ increment s at some point." ==> Yes, silly me. I changed the implementation at the last moment, I had a tricky loop before this (I liked it, at least it worked). Thanks. :)

Question 9

But there is an uppercase ß these days... U+1E9E ẞ LATIN CAPITAL LETTER SHARP S

Question 10

@Shawn. Ha! Okay. Looks like a single-byte ISO Latin locale could behave differently from multi-byte Unicode. (I'm just reading the man page, perhaps it needs an update.)

Question 11

C locales were a huge mistake. Here's a funny rant about it github.com/mpv-player/mpv/commit/…

Question 12

Potential idea

C2x is planned to support

QChar *strchr(QChar *s, int c);

This QChar allows one to call strchr() with a const char * or char * and return the same type.

I have not dug into C2x enough to understand the mechanism well (it might be _Generic), but perhaps your code could do the same with const char *util_basename(const char path[static 1])?
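One way this could be done with _Generic is sketched below. This is only an illustration under my own assumptions, not necessarily how C2x library implementations do it; the demo_ names are all hypothetical.

```c
#include <string.h>

/* Worker: GNU-style basename that never modifies its argument. */
const char *demo_basename_const(const char *path)
{
    const char *const slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}

/* Wrapper for mutable strings. The cast is safe because the result
 * points into the caller's own non-const buffer. */
char *demo_basename_mut(char *path)
{
    return (char *) demo_basename_const(path);
}

/* Select the wrapper by argument type, so constness is preserved:
 * const char * in, const char * out; char * in, char * out. */
#define demo_basename(path)                      \
    _Generic((path),                             \
        const char *: demo_basename_const,       \
        char *: demo_basename_mut)(path)
```

A char array argument decays to char * in the controlling expression on current GCC and Clang, so demo_basename(buf) with char buf[] picks the mutable variant.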

Question 13

godbolt.org/z/1nj8vWcPd is how it can be implemented. Lundin presented this example on a StackOverflow question I asked.

Question 14

Bug: questionable compare.

Code does

 if (islower((unsigned char) p)) {
 p = toupper((unsigned char) p);
 }

yet documentation has

In the POSIX locale, strcasecmp() and strncasecmp() shall behave as if the strings had been converted to lowercase and then a byte comparison performed.

Revealing the difference requires testing a character such as '_', which lies between the uppercase and lowercase ASCII letters.

Suggest instead:

int util_strcasecmp(const char s[restrict static 1], 
 const char t[restrict static 1]) {
 const unsigned char *us = (const unsigned char *) s;
 const unsigned char *ut = (const unsigned char *) t;
 unsigned p, q;
 do {
 p = tolower(*us++);
 q = tolower(*ut++);
 } while (p == q && q != '\0');
 return (p > q) - (p < q);
}
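The '_' point can be checked with a small sketch that folds in each direction. Both function names are made up for the comparison, and the stated results assume ASCII and the default "C" locale.

```c
#include <ctype.h>

/* Compare after folding both strings to uppercase. */
int demo_casecmp_upper(const char *s, const char *t)
{
    const unsigned char *us = (const unsigned char *) s;
    const unsigned char *ut = (const unsigned char *) t;
    int p, q;
    do {
        p = toupper(*us++);
        q = toupper(*ut++);
    } while (p == q && q != '\0');
    return (p > q) - (p < q);
}

/* Compare after folding both strings to lowercase, as POSIX describes. */
int demo_casecmp_lower(const char *s, const char *t)
{
    const unsigned char *us = (const unsigned char *) s;
    const unsigned char *ut = (const unsigned char *) t;
    int p, q;
    do {
        p = tolower(*us++);
        q = tolower(*ut++);
    } while (p == q && q != '\0');
    return (p > q) - (p < q);
}
```

In ASCII, '_' is 95, 'A' is 65 and 'a' is 97, so comparing "_" with "a" gives a positive result when folding to upper and a negative one when folding to lower. Equality results are the same either way; only the ordering differs.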

Additional notes:

Although not an issue in C2x, const unsigned char *us = (const unsigned char *) s; also correctly handles the case where char is signed and not 2's complement. (unsigned char) p does not.
(p > q) - (p < q) does not overflow like p - q might when char is the same size as int. This common idiom is well handled by good compilers.

For a potential 2x speed strcasecmp(), see strcicmp_ch() which omits the q != '\0' test in the loop.

Question 15

In C23, all signed integral types are two’s-complement. C23 code does not need to worry about one’s-complement or sign-and-magnitude representations.

Question 16

The linked answer has a section named "Do all letters map one lower to one upper? (pedantic)". Is that not to be entertained here?

Question 17

@Davislor Yes, "correctly handles when char is signed and not 2's complement." is not applicable with C23, as the post is tagged. Still, using const unsigned char *us = (const unsigned char *) s; allows wider application, with no downside.

Question 18

@Harith, Beyond simple US-ASCII case mapping, a case-insensitive comparison today commonly involves UTF-8 strings, and that is a much larger problem. Locale-specific 8-bit issues are fading from relevance, and that is where "Do all letters map one lower to one upper? (pedantic)" most applies.

Question 19

Whilst util_stpcpy() isn't wrong, it's possibly less efficient than an implementation tuned for the target architecture, which can fold finding the end into an implementation of strcpy(). I couldn't make GCC 14 fuse these operations at any level of optimisation, and always ended up with two function calls.
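For comparison, a single-pass version that finds the end while copying could be sketched as follows. demo_stpcpy_loop() is a hypothetical name, and I make no claim that a byte-at-a-time loop beats the strlen() plus memcpy() pair; that would need measuring on the target.

```c
/* Copy src to dest one byte at a time, stopping after the null byte
 * has been copied. Returns a pointer to that null byte in dest. */
char *demo_stpcpy_loop(char *restrict dest, const char *restrict src)
{
    while ((*dest = *src) != '\0') {
        ++dest;
        ++src;
    }
    return dest;
}
```

The source is traversed once here, whereas the posted version traverses it twice through two library calls that are each likely to be vectorised.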

Does it really make sense to test islower() before calling toupper() in strcasecmp()? I'm not aware of any locale where there's a non-lower character that isn't returned unchanged by toupper(). I strongly suspect that unconditional toupper() (or tolower(), to be more like the POSIX strcasecmp() when used in the POSIX locale) would be faster too.

I'm surprised you couldn't think of any tests of asprintf(). Quite a few spring to mind, including some of the unspecified (invalid) inputs. But at a minimum, we should test a simple success path such as

{
 char *formatted = nullptr;
 int len = util_asprintf(&formatted, "%d", 10);
 assert(len==2);
 assert(formatted);
 assert(!strcmp(formatted, "10"));
 free(formatted);
}
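A few more success-path cases are easy to add: an empty result, several conversions in one call, and output longer than any small fixed buffer. So that the sketch stands on its own, it uses demo_asprintf(), a hypothetical condensed variant of the same two-pass vsnprintf() technique, rather than the posted functions.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Measure with vsnprintf(NULL, 0, ...), allocate, then format for real.
 * On failure, returns -1 and sets *strp to NULL. */
int demo_asprintf(char **strp, const char *fmt, ...)
{
    va_list ap, ap_copy;
    va_start(ap, fmt);
    va_copy(ap_copy, ap);
    const int len = vsnprintf(NULL, 0, fmt, ap_copy);
    va_end(ap_copy);

    int status = -1;
    *strp = NULL;
    if (len >= 0) {
        *strp = malloc((size_t) len + 1);
    }
    if (*strp != NULL) {
        status = vsnprintf(*strp, (size_t) len + 1, fmt, ap);
        if (status < 0) {
            free(*strp);
            *strp = NULL;
        }
    }
    va_end(ap);
    return status < 0 ? -1 : status;
}
```

Useful checks include demo_asprintf(&s, "%s", "") returning 0 with a non-null empty string, and a wide conversion such as "%0300d" returning 300. Each successful call needs a matching free().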

J_H · Accepted Answer · 2024-05-17 17:02:38Z

