C23 standardized strdup()/strndup(), but left out stpcpy(). stpcpy() differs from strcpy() in that it returns a pointer to the terminating null byte of the copied string rather than a pointer to the start of the destination string. It allows for this:
char bigString[1000];
bigString[0] = '\0';
stpcpy(stpcpy(stpcpy(stpcpy(bigString, "John, "), "Paul, "), "George, "), "Joel ");
instead of:
char bigString[1000];
bigString[0] = '\0'; // or strcpy()
strcat(strcat(strcat(strcat(bigString, "John, "), "Paul, "), "George, "), "Joel ");
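The chain works because each call hands the next one the current end of the string, so nothing is rescanned. Here is a minimal sketch of the same idea with the end pointer made explicit. The names demo_stpcpy() and demo_join() are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for stpcpy(): copies src (including its null byte)
 * and returns a pointer to the null byte it wrote into dest. */
static char *demo_stpcpy(char *restrict dest, const char *restrict src)
{
    const size_t len = strlen(src);
    return (char *) memcpy(dest, src, len + 1) + len;
}

/* Joins four fixed names into buf and returns the end of the result.
 * buf must have room for at least 26 bytes (25 characters + null byte). */
static char *demo_join(char *buf)
{
    char *end = buf;
    end = demo_stpcpy(end, "John, ");
    end = demo_stpcpy(end, "Paul, ");
    end = demo_stpcpy(end, "George, ");
    end = demo_stpcpy(end, "Joel ");
    return end;
}
```

Each append costs only the length of the piece being added, whereas strcat() would first walk over everything already written.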
If you didn't know, strcat() uses the Shlemiel the painter's algorithm: every call rescans the destination from the start to find its end.
There are two different versions of basename() (POSIX and GNU). The POSIX version may modify the path argument, and may segfault when called with a static string such as "/usr/". The GNU version, which I have tried to follow, does not.
About asprintf() and vasprintf(), I'd quote directly from the man page:
The functions asprintf() and vasprintf() are analogs of sprintf(3) and vsprintf(3), except that they allocate a string large enough to hold the output including the terminating null byte, and return a pointer to it via the first argument. This pointer should be passed to free(3) to release the allocated storage when it is no longer needed.
Unfortunately, ISO C23 left these two goodies out as well.
strchrnul() differs from strchr() in that it returns a pointer to the matched character, or a pointer to the null byte at the end of s (i.e., s + strlen(s)) if the character is not found. One place where I've found this useful is removing the trailing newline from an input string.
With strchr(), one would have done:
char *p = strchr(s, '\n');
if (p) {
    *p = '\0';
}
With strchrnul(), you can skip the branch and dereference the result directly:
*strchrnul(s, '\n') = '\0';
The strcasecmp() function performs a byte-by-byte comparison of two strings, ignoring the case of the characters.
Code:
util.h:
#ifndef UTIL_H
#define UTIL_H 1
#include <stdarg.h>
int util_vasprintf(char **restrict strp,
                   const char fmt[restrict static 1],
                   va_list ap);
[[gnu::format(printf, 2, 3)]] int util_asprintf(char **restrict strp,
                                                const char fmt[restrict static 1],
                                                ...);
[[gnu::returns_nonnull]] char *util_stpcpy(char dest[restrict static 1],
                                           const char src[restrict static 1]);
[[gnu::returns_nonnull]] const char *util_basename(const char path[static 1]);
[[gnu::pure, gnu::returns_nonnull]] char *util_strchrnul(const char s[static 1],
                                                         int c);
[[gnu::pure]] int util_strcasecmp(const char s[restrict static 1],
                                  const char t[restrict static 1]);
#endif /* UTIL_H */
util.c:
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdarg.h>
#include "util.h"
int util_vasprintf(char **restrict strp, const char fmt[restrict static 1],
                   va_list ap)
{
    va_list ap_copy;
    va_copy(ap_copy, ap);
    const int nwritten = vsnprintf(nullptr, 0, fmt, ap_copy);
    va_end(ap_copy);
    if (nwritten < 0) {
        goto fatal;
    }
    *strp = malloc((size_t) nwritten + 1);
    if (*strp == nullptr) {
        goto fatal;
    }
    const int status = vsprintf(*strp, fmt, ap);
    if (status < 0) {
        free(*strp);
        goto fatal;
    }
    return status;
fatal:
    /* The BSD implementation sets *strp to nullptr on failure. Linux's leaves
     * the contents undefined. Neither C nor POSIX has standardized this
     * function as of yet. */
    *strp = nullptr;
    return -1;
}
int util_asprintf(char **restrict strp, const char fmt[static 1], ...)
{
    va_list argp;
    va_start(argp, fmt);
    *strp = nullptr;
    int nwritten = util_vasprintf(strp, fmt, argp);
    va_end(argp);
    return nwritten;
}
char *util_stpcpy(char dest[restrict static 1],
                  const char src[restrict static 1])
{
    const size_t len = strlen(src);
    return (char *) memcpy(dest, src, len + 1) + len;
}
const char *util_basename(const char path[static 1])
{
    /* const-qualified: C23's strrchr() returns const char * for a const argument. */
    const char *const cp = strrchr(path, '/');
    return cp ? cp + 1 : path;
}
char *util_strchrnul(const char s[static 1], int c)
{
    while (*s) {
        if (*s == c) {
            break;
        }
    }
    return (char *) s;
}
int util_strcasecmp(const char s[restrict static 1],
                    const char t[restrict static 1])
{
    int p, q;
    do {
        p = *s++;
        if (islower((unsigned char) p)) {
            p = toupper((unsigned char) p);
        }
        q = *t++;
        if (islower((unsigned char) q)) {
            q = toupper((unsigned char) q);
        }
    } while (p == q && q != '\0');
    return p - q;
}
#ifdef TEST_MAIN
#include <assert.h>
int main(void)
{
    assert(strcmp(util_basename("/usr/lib"), "lib") == 0);
    assert(strcmp(util_basename("/usr/"), "") == 0);
    assert(strcmp(util_basename("usr"), "usr") == 0);
    assert(strcmp(util_basename("/"), "") == 0);
    assert(strcmp(util_basename("."), ".") == 0);
    assert(strcmp(util_basename(".."), "..") == 0);
    assert(util_strcasecmp("aPplE", "APPLE") == 0);
    assert(util_strcasecmp("apple", "apple") == 0);
    assert(util_strcasecmp("HELLO", "HELLO") == 0);
    assert(util_strcasecmp("", "") == 0);
    assert(util_strcasecmp("HrLLO", "HELLO"));
}
#endif
I was unable to think of meaningful tests for the other functions.
Review Request:
Are these implementations correct? Do you see any bugs or inefficiencies in the code?
Are there any other useful attributes I can attach to these functions?
4 Answers
bump the pointer
char *util_strchrnul(const char s[static 1], int c)
{
    while (*s) {
        if (*s == c) {
            break;
        }
    }
    return (char *) s;
}
Pretty sure you're going to want to ++ increment s at some point. If it's initially pointing at non-NUL and != c, then it never moves and never exits.
It's a trivial mistake, no biggie. The larger mistake was neglecting code coverage measurements. A line of source that never executed is likely a buggy line. We systematically exercise the codebase to discover such issues.
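A handful of checks that reach every path through the function would have caught the missing increment. The sketch below assumes a corrected loop under the hypothetical name demo_strchrnul(). It is one possible fix and test set, not the author's final code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical corrected loop: advance s until the match or the null byte. */
static char *demo_strchrnul(const char *s, int c)
{
    while (*s != '\0' && *s != (char) c) {
        ++s;
    }
    return (char *) s;
}

/* Exercises every path: match at the start, match further in, no match,
 * empty string, and searching for the null byte itself. */
static void demo_strchrnul_checks(void)
{
    const char *s = "hello\n";
    assert(demo_strchrnul(s, 'h') == s);
    assert(demo_strchrnul(s, '\n') == s + 5);
    assert(demo_strchrnul(s, 'z') == s + strlen(s));
    assert(*demo_strchrnul("", 'a') == '\0');
    assert(demo_strchrnul(s, '\0') == s + strlen(s));
}
```

The second and third checks are the ones that hang with the original loop, since both start on a character that is neither the null byte nor c.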
locale
util_strcasecmp() should document that it only handles US-ASCII rather than UTF-8 code points. It sticks to just the "C" locale, avoiding wide character mapping.
assert(util_strcasecmp("äpfel", "Äpfel") == 0);
assert(util_strcasecmp("fußgängerübergänge",
"Fußgängerübergänge") == 0);
I see no reason for the islower() tests and the associated pipeline stalls due to branch misprediction. We could just unconditionally assign the case-smashed value, as non-alpha characters go through the identity mapping, coming back as themselves.
The man page makes an interesting observation about "missing" mappings.
The details of what constitutes an uppercase or lowercase letter depend on the current locale. For example, the default "C" locale does not know about umlauts, so no conversion is done for them.
In some non-English locales, there are lowercase letters with no corresponding uppercase equivalent; the German sharp ß is one example.
Ordinarily I would look at {upper, lower} and say "meh!" {six, half-dozen}, doesn't matter which way we case smash. But that last item seems to suggest it would be more convenient to smash to lower. The effect will be the same.
It's a little surprising that after seeing islower() and then assigning p = toupper(p) we cannot safely assert isupper(p), yet that's what the man page explains.
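To illustrate both points, here is a minimal sketch of a byte-wise comparison that folds case unconditionally. It assumes the default "C" locale and UTF-8 encoded input. demo_casecmp() is a hypothetical name, and the non-ASCII bytes are spelled as escapes so the example does not depend on the source file's encoding.

```c
#include <ctype.h>

/* Hypothetical byte-wise comparison that lowercases unconditionally.
 * Non-letters pass through tolower() unchanged, so no islower() test
 * is needed. */
static int demo_casecmp(const char *s, const char *t)
{
    const unsigned char *us = (const unsigned char *) s;
    const unsigned char *ut = (const unsigned char *) t;
    int p, q;

    do {
        p = tolower(*us++);
        q = tolower(*ut++);
    } while (p == q && p != '\0');
    return (p > q) - (p < q);
}
```

Under those assumptions, demo_casecmp("a-1", "A-1") is 0, while "\xc3\xa4pfel" (äpfel) and "\xc3\x84pfel" (Äpfel) compare unequal, because the "C" locale leaves bytes above 0x7F untouched.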
- "Pretty sure you're going to want to ++ increment s at some point." ==> Yes, silly me. I changed the implementation at the last moment, I had a tricky loop before this (I liked it, at least it worked). Thanks. :) – Madagascar, May 17, 2024 at 17:05
- But there is an uppercase ß these days... U+1E9E ẞ LATIN CAPITAL LETTER SHARP S – Shawn, May 18, 2024 at 0:46
- @Shawn. Ha! Okay. Looks like a single-byte ISO Latin locale could behave differently from multi-byte Unicode. (I'm just reading the man page, perhaps it needs an update.) – J_H, May 18, 2024 at 0:52
- C locales were a huge mistake. Here's a funny rant about it: github.com/mpv-player/mpv/commit/… – qwr, May 21, 2024 at 2:57
Potential idea
C2x is planned to support
QChar *strchr(QChar *s, int c);
This QChar allows one to call strchr() with a const char * or char * and return the same type.
I have not dug into C2x enough to well understand the mechanism used (it might be _Generic), but perhaps your code could do the same with const char *util_basename(const char path[static 1])?
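One way this could look, assuming _Generic is indeed the mechanism chosen, is sketched below. The names demo_basename, demo_basename_const and demo_basename_mut are hypothetical and do not appear in the question's code.

```c
#include <string.h>

/* Hypothetical helpers: one per qualifier, sharing the same logic. */
static const char *demo_basename_const(const char *path)
{
    const char *const cp = strrchr(path, '/');
    return cp ? cp + 1 : path;
}

static char *demo_basename_mut(char *path)
{
    char *const cp = strrchr(path, '/');
    return cp ? cp + 1 : path;
}

/* Dispatches on the argument's type so the result keeps its qualifier. */
#define demo_basename(path)                  \
    _Generic((path),                         \
        const char *: demo_basename_const,   \
        char *: demo_basename_mut)(path)
```

A const char * argument yields a const char * result, and a char * argument yields a char * that may be written through. Note that a string literal has type char[] in C, so it selects the mutable variant even though writing to it is undefined.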
- godbolt.org/z/1nj8vWcPd is how it can be implemented. Lundin presented this example on a StackOverflow question I asked. – Madagascar, May 20, 2024 at 11:13
Bug: questionable compare.
Code does
if (islower((unsigned char) p)) {
    p = toupper((unsigned char) p);
}
yet documentation has
In the POSIX locale, strcasecmp() and strncasecmp() shall behave as if the strings had been converted to lowercase and then a byte comparison performed.
Revealing the important difference involves testing a '_', which lies between the uppercase and lowercase ASCII letters.
Suggest instead:
int util_strcasecmp(const char s[restrict static 1],
                    const char t[restrict static 1]) {
    const unsigned char *us = (const unsigned char *) s;
    const unsigned char *ut = (const unsigned char *) t;
    unsigned p, q;
    do {
        p = tolower(*us++);
        q = tolower(*ut++);
    } while (p == q && q != '\0');
    return (p > q) - (p < q);
}
Additional notes:
- Even though a non-C2x issue, const unsigned char *us = (const unsigned char *) s; also correctly handles when char is signed and not 2's complement. (unsigned char) p does not.
- (p > q) - (p < q) does not overflow like p - q might when char is the same size as int. This common idiom is well handled by good compilers.
For a potential 2x speed strcasecmp(), see strcicmp_ch(), which omits the q != '\0' test in the loop.
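To make the '_' point concrete, here is a small sketch that folds in either direction and shows the sign of the result flipping. It assumes ASCII, where 'A' is 65, '_' is 95 and 'a' is 97. demo_cmp() and its fold parameter are hypothetical and used only for this illustration.

```c
#include <ctype.h>

/* Hypothetical comparator: fold is either tolower or toupper, so the only
 * difference between two calls is the folding direction. */
static int demo_cmp(const char *s, const char *t, int (*fold)(int))
{
    const unsigned char *us = (const unsigned char *) s;
    const unsigned char *ut = (const unsigned char *) t;
    int p, q;

    do {
        p = fold(*us++);
        q = fold(*ut++);
    } while (p == q && p != '\0');
    return (p > q) - (p < q);
}
```

With tolower, "_" sorts before "a" (95 < 97). With toupper, it sorts after (95 > 65). Equality results are unaffected, which is why tests that only check == 0 do not notice the difference.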
- In C23, all signed integral types are two's-complement. C23 code does not need to worry about one's-complement or sign-and-magnitude representations. – Davislor, May 21, 2024 at 5:03
- The linked answer has a section named "Do all letters map one lower to one upper? (pedantic)". Is that not to be entertained here? – Madagascar, May 21, 2024 at 8:09
- @Davislor Yes, "correctly handles when char is signed and not 2's complement." is not applicable with C23, as the post is tagged. Still, using const unsigned char *us = (const unsigned char *) s; allows wider application, with no downside. – chux, May 22, 2024 at 22:44
- @Harith, Beyond simple US-ASCII case mapping, case insensitivity today commonly involves UTF-8 strings, and that is a much larger case-insensitive compare issue. Locale 8-bit issues are fading from relevance and that is where "Do all letters map one lower to one upper? (pedantic)" most applies. – chux, May 22, 2024 at 22:49
Whilst util_stpcpy() isn't wrong, it's possibly less efficient than an implementation tuned for the target architecture, which can fold finding the end into an implementation of strcpy(). I couldn't make GCC 14 fuse these operations at any level of optimisation, and always ended up with two function calls.
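For illustration only, a single-pass version that finds the end while copying might look like the sketch below. The name demo_stpcpy_onepass() is hypothetical. A plain byte loop like this is not necessarily faster than strlen() plus memcpy() from a tuned C library, so it shows the shape of the fusion rather than a recommended replacement.

```c
/* Hypothetical single-pass variant: copy byte by byte and stop once the
 * null byte has been copied, so src is traversed only once. */
static char *demo_stpcpy_onepass(char *restrict dest, const char *restrict src)
{
    while ((*dest = *src++) != '\0') {
        ++dest;
    }
    return dest; /* points at the null byte just written */
}
```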
Does it really make sense to test islower() before calling toupper() in strcasecmp()? I'm not aware of any locale where there's a non-lower character that isn't returned unchanged by toupper(). I strongly suspect that unconditional toupper()¹ would be faster too.
¹ Or tolower(), to be more like the POSIX strcasecmp() when used in the POSIX locale.
I'm surprised you couldn't think of any tests of asprintf(). Quite a few spring to mind, including some of the unspecified (invalid) inputs. But at a minimum, we should test a simple success path such as
{
    char *formatted = nullptr;
    int len = util_asprintf(&formatted, "%d", 10);
    assert(len == 2);
    assert(formatted);
    assert(!strcmp(formatted, "10"));
    free(formatted);
}
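A few further success-path checks in the same spirit are sketched below. To keep the example self-contained it uses demo_asprintf(), a hypothetical stand-in that is assumed to follow the same contract as util_asprintf() (length on success, -1 and a null pointer on failure). The specific cases are suggestions, not an exhaustive list.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in: measure with vsnprintf(), allocate, then format. */
static int demo_asprintf(char **strp, const char *fmt, ...)
{
    va_list ap, ap_copy;
    va_start(ap, fmt);
    va_copy(ap_copy, ap);
    const int len = vsnprintf(NULL, 0, fmt, ap_copy);
    va_end(ap_copy);

    *strp = len < 0 ? NULL : malloc((size_t) len + 1);
    const int status = *strp ? vsnprintf(*strp, (size_t) len + 1, fmt, ap) : -1;
    va_end(ap);
    if (status < 0) {
        free(*strp);
        *strp = NULL;
    }
    return status;
}

static void demo_asprintf_checks(void)
{
    char *s = NULL;

    /* Empty output still yields a valid, freeable, empty string. */
    assert(demo_asprintf(&s, "%s", "") == 0);
    assert(s != NULL && s[0] == '\0');
    free(s);

    /* Mixed conversions, including a negative number. */
    assert(demo_asprintf(&s, "%s=%d", "x", -42) == 5);
    assert(strcmp(s, "x=-42") == 0);
    free(s);

    /* Output longer than any small fixed-size buffer. */
    assert(demo_asprintf(&s, "%0500d", 7) == 500);
    assert(strlen(s) == 500 && s[0] == '0' && s[499] == '7');
    free(s);
}
```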
- strchrnul(): A readlines() function can avoid N branches with strchrnul() if it replaces strchr() with it. But there's strcspn() too, which can be used instead.
- while (*s && *s != c) { ++s; } return (char *)s; ==> I really shouldn't try stuffing too much in one line. :)
- for (;; ++s) { if (*s == '\0' || *s == c) { return (char *)s; } } :)
- while loop.
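The strcspn() alternative mentioned in the comments can be sketched as follows. It needs nothing beyond standard C, and demo_chomp() is a hypothetical name used only here.

```c
#include <string.h>

/* Truncates s at its first newline, if any. strcspn() returns the length of
 * the leading segment that contains no '\n', which equals strlen(s) when the
 * string has no newline, so the assignment is always in bounds. */
static void demo_chomp(char *s)
{
    s[strcspn(s, "\n")] = '\0';
}
```

Like the strchrnul() one-liner, this has no explicit branch in the caller. When no newline is present it simply rewrites the existing null byte.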