3
\$\begingroup\$

I made a string separator below. It's like strtok, but it separates based on entire strings instead of single char delimiters. Any comments/answers on how I could improve the performance or readability would be great. I'm also leaking some memory according to valgrind, but that is only a minor concern for my use case.

It takes two char arrays, a buffer(string?) and a separator. It returns an array of strings containing all of the separated strings. For example, if the buffer is foobar and the separator is foo, it'll return {"bar", NULL}. But if I'm passed foobarfoobarfoobar and the separator is bar, it'll return {"foo", "foo", "foo", NULL}.

Edit: Comments added to clarify intended functionality.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
char **sepstr(const char *buf, const char *sep){
 int i = 0, j = 0, k = 0, l = 0, stringCount = 0;
 /* If separator > buffer, return NULL */
 while (buf[i]) i++;
 while (sep[j]) j++;
 if (j > i) return NULL;
 char **strings = malloc(0);
 /* While we're not at the end of the buffer */
 while (*(buf+k)){
 /* If the characters match, check to see if it is the separator */
 if (*(buf+k) == *(sep)){
 for (l = 0; l < j; l++){
 /* If it isn't the separator, break */
 if (*(buf+k+l) != *(sep+l)) break;
 /* If it is the separator and the separator isn't the
 beginning, add buf to buf+k bytes as a string */
 if (l == j-1 && k != 0){
 strings = realloc(strings, (stringCount+1)*sizeof (char *));
 strings[stringCount] = malloc(k+1);
 memcpy(strings[stringCount++], buf, k);
 buf += (j + k), k = -1;
 /* If it is the separator, but it's the beginning
 of the string, skip it */
 } else if (l == j-1 && k == 0) {
 buf += j, k = -1;
 }
 }
 }
 k++;
 }
 /* Add a string for the left over bytes if sep isn't the end */
 if (i != k && *(buf)){
 while (buf[l]) l++;
 strings = realloc(strings, (stringCount+1) * sizeof (char *));
 strings[stringCount] = malloc(l+1);
 memcpy(strings[stringCount++], buf, l);
 }
 /* Append NULL to array of strings */
 strings = realloc(strings, (stringCount+1) * sizeof (char *));
 strings[stringCount] = malloc(sizeof NULL);
 strings[stringCount] = NULL;
 return strings;
}
int main(){
 const char buffer[] = "foobarfoobarfoobar";
 const char separator[] = "bar";
 char **strings = sepstr(buffer, separator);
 while(*strings){
 printf("%s\n", *(strings));
 free(*(strings++));
 }
 free(strings);
 return 0;
}
200_success
146k22 gold badges190 silver badges479 bronze badges
asked Aug 23, 2017 at 3:22
\$\endgroup\$
5
  • \$\begingroup\$ Possible suggestion: Replace memcpy with for (l = 0; l < k; l++) strings[stringCount][l] = *(buf+l);to remove dependency on string.h and remove string.h. \$\endgroup\$ Commented Aug 23, 2017 at 5:12
  • 1
    \$\begingroup\$ We expect your code to be working correctly as intended. Please fix any known memory leaks before proceeding with a review. \$\endgroup\$ Commented Aug 23, 2017 at 6:31
  • \$\begingroup\$ @200_success Arguably, the code still works as intended. The issue here isn't really the memory leaks as it has no ill-effects on the actual outcome of the program. A short-lived program might suffer no apparent ill effects from a memory leak. In that case, the question could be considered on-topic, and the sloppiness would be fair material to be addressed in an answer. \$\endgroup\$ Commented Aug 23, 2017 at 12:02
  • 1
    \$\begingroup\$ +1 for using Valgrind; -1 for ignoring its findings. \$\endgroup\$ Commented Aug 23, 2017 at 14:42
  • \$\begingroup\$ @TobySpeight Didn't ignore its findings. Check the edits on the original post. I had originally wanted help with fixing the memory leaks, but because of that 200_success marked it as off-topic. In order to make it "on-topic" he edited the original post. \$\endgroup\$ Commented Aug 23, 2017 at 16:12

5 Answers 5

4
\$\begingroup\$

I'm also leaking some memory according to valgrind, but that is only a minor concern for my use case.

Hmm, I don't really agree. Leaking memory is harmless only if you know the cause (*), and simply choose to not spend time in fixing it. In your case, the cause (for one part, see below) is the print loop at the end of the main:

while(*strings){
 printf("%s\n", *(strings));
 free(*(strings++)); // incrementing strings here
}
free(strings); // string no longer points to the initially allocated bloc: UB

Whether it matters is your problem, but you should not reproduce this code. BTW, it caused a crash on my VS2017...

Now for the function itself:

  • char** strings = malloc(0); Nothing really bad here because it will be reallocated later, but as it is more of a trick, it really deserves a comment explaining the reason for that line

  • Your naming convention for local variables (int i = 0, j = 0, k = 0, l = 0, stringCount = 0;) does not help to understand their meaning. After analysis I would prefer:

    i -> buflen
    j -> seplen
    k -> buf_ix // or whatever you want saying it is intended to be an index of buf
    l -> sep_ix
    
  • the strings in strings are not null-terminated. This is really bad because there is no way to find the used length: you really should add that terminating null - anyway, you have reserved memory for it...

  • An additional memory leak when setting last element of strings to NULL:

    /* Append NULL to array of strings */
    strings = realloc(strings, (stringCount+1) * sizeof (char *));
    strings[stringCount] = malloc(sizeof NULL); // hmm... a dynamic allocation...
    strings[stringCount] = NULL; // immediately and definitively lost !
    

    this code should become:

    /* Append NULL to array of strings */
    strings = realloc(strings, (stringCount+1) * sizeof (char *));
    strings[stringCount] = NULL;
    
  • you should test malloc and realloc return values. Allocation error are uncommon nowadays, but if this code was to run under stressed condition very bad things would occur, because you would be using a null pointer which is Undefined Behaviour (other details on @TobySpeight's answer).


(*) Even if you had identified that the code in main caused a memory leak, it could have masked the other memory leak inside the function the deserved to be fixed. That's the reason why it is bad to leave a memory leak in a program, and even in a test only one!

answered Aug 23, 2017 at 14:30
\$\endgroup\$
2
  • \$\begingroup\$ Note that that p = realloc(p, n) is risky, as I explain in my answer. \$\endgroup\$ Commented Aug 24, 2017 at 8:14
  • \$\begingroup\$ @TobySpeight: I've add it to my answer. Thanks for noticing! \$\endgroup\$ Commented Aug 24, 2017 at 8:25
6
\$\begingroup\$

This is a worthy goal - strtok() has a terrible interface, and sorely needs a stateless, thread-safe, non-destructive replacement.

It's inconvenient for the caller to be given ownership of allocated objects. I'll propose an alternative interface you might like to consider, that's based on existing practice in the standard library:

struct view { const char *s, size_t len };
size_t sepstr(struct view *result, size_t n, const char *str, const char *delim)
  • result points to an array of size n supplied by the caller. A null pointer may be passed if n is zero.
  • str and delim are the same as your buf and sep but renamed to be consistent with strtok() documentation
  • The return value is the number of (possibly empty) tokens found (i.e. one more than the number of delimiters). This may be greater than n, in which case, tokens after the nth are not stored in result.

This interface means that a caller can use their own memory management (which need not be malloc() and free()) but can still determine the necessary buffer size, in a similar manner to snprintf().

One disadvantage here is that because we're returning a view into an existing string, we need a length instead of using ordinary null-terminated strings. The caller may need to copy them into working memory to get null termination. On the upside, it doesn't need to allocate memory for all the results at once if it only operates on one at a time.

If you combine this with the suggestion to use standard strstr() and similar functions, you should end up with a really good replacement for strtok().

answered Aug 23, 2017 at 14:30
\$\endgroup\$
4
\$\begingroup\$

I think you should keep it simple. There's no need for allocating memory in a function that tokenizes. Let the user decide if it needs to strdup(3) the input string before tokenizing, which will result in simpler code.


There's strsep(3), which originated at 4.4BSD, and is provided at least by:

  • glibc
  • musl libc
  • bionic libc
  • FreeBSD
  • NetBSD
  • OpenBSD

and documented here: https://www.man7.org/linux/man-pages/man3/strsep.3.html

That's a quite useful function for replacing strtok(3) in some cases. It's simple, as it has no internal state (strtok(3) has internal state); it doesn't merge adjacent delimiters (strtok(3) merges them); and it also doesn't allocate any memory (unlike your approach).


I developed my own variant, which I called stpsep(), for replacing some other uses of strtok(3), which strsep(3) isn't good for:

inline char *
stpsep(char *s, const char *delim)
{
 strsep(&s, delim);
 return s;
}

Or a portable implementation (using only C11 functions):

inline char *
stpsep(char *s, const char *delim)
{
 s += strcspn(s, delim);
 if (*s == '0円')
 return NULL;
 strcpy(s++, "");
 return s;
}

This function is probably the simplest separating function that one can write:

  • If it finds a delimiter:
    • Overwrite it with a '0円'
    • Return a pointer to the next character after the delimiter.
  • If no delimiter is found:
    • Return NULL.

So, if you want to separate "foo,,bar,baz":

char s[] = "foo,,bar,baz";
p = stpsep(s, ","); // p is ",bar,baz"; s is now "foo"

This is useful for example for clearing the '\n' after fgets:

if (fgets(buf, sizeof(buf), stdin) == NULL)
 err(1, "fgets");
if (stpsep(buf, "\n") == NULL)
 errx(1, "line too long");

Edit:

  • (2024年07月04日) As @chux-ReinstateMonica suggested, I've removed the unnecessary restrict qualifier from the parameters.

  • (2024年07月04日) As @chux-ReinstateMonica suggested, I've replaced s = s + ... by s += ....

answered Jul 3, 2024 at 9:48
\$\endgroup\$
6
  • 1
    \$\begingroup\$ restrict looks unneeded in stpsep(char *s, const char *restrict delim). It is not needed for size_t strcspn(const char *s1, const char *s2);. Why the added restriction? \$\endgroup\$ Commented Jul 4, 2024 at 17:08
  • 1
    \$\begingroup\$ Candidate simplification: inline char *stpsep(char *s, const char *delim) { s += strcspn(s, delim); if (s == '0円') return NULL; *s = 0; return s+1; }. \$\endgroup\$ Commented Jul 4, 2024 at 17:13
  • \$\begingroup\$ @chux-ReinstateMonica re: restrict: I didn't think about it too much; I saw strsep(3) is restrict in glibc, and used it too in my API. It's true that it's strictly not necessary in the portable implementation. I'll remove it. \$\endgroup\$ Commented Jul 4, 2024 at 19:28
  • \$\begingroup\$ @chux-ReinstateMonica re: simplification: strcpy(..., "") vs = '0円': strcpy(3) provides some safety. When compiling with _FORTIFY_SOURCE, strcpy(3) implicitly protects the write of the '0円' via compiler magic. That's a reason why I use strcpy(s, "") for terminating strings. I got that habit last year, after discussing how to make safer some string APIs that I developed for shadow utils. See <github.com/shadow-maint/shadow/pull/1035#discussion_r1663394760> and <github.com/shadow-maint/shadow/pull/850#discussion_r1413169325> for more details. \$\endgroup\$ Commented Jul 4, 2024 at 19:35
  • \$\begingroup\$ @chux-ReinstateMonica re: simplification: +=: Thanks! \$\endgroup\$ Commented Jul 4, 2024 at 19:39
2
\$\begingroup\$

Test your allocations

This code seems to assume that memory allocation always succeeds. That's not true in general - you should always test the return value from malloc() and similar.

This is a particularly dangerous anti-pattern:

strings = realloc(strings, (stringCount+1) * sizeof (char *));

If the allocation fails and returns NULL, we can't even free() the old memory, because we have overwritten the only pointer to it.

answered Aug 23, 2017 at 14:40
\$\endgroup\$
1
\$\begingroup\$

C has a function called strstr which makes it all easier to read. It returns a pointer to the first character of the separator it finds or NULL if nothing is found. With pointer arithmetic you can then find out about the length of the string to copy.

Here is a version using strstr. It returns a null-terminated array with maximum size 20. If you need a growable array then implement that yourself.

char **strsep(char *line, const char *sep)
{
 char **ret = malloc(20 * sizeof(char *));
 char *startp = line, *endp;
 size_t l = strlen(sep);
 int cnt = 0;
 while (endp = strstr(startp, sep))
 {
 ptrdiff_t diff = endp - startp;
 ret[cnt] = malloc(diff + 1);
 strncpy(ret[cnt], startp, diff);
 ret[cnt][diff] = '0円';
 cnt++;
 startp = endp + l;
 }
 if (*startp) // in case last is not a separator
 {
 ret[cnt] = malloc(strlen(startp) + 1);
 strcpy(ret[cnt], startp);
 cnt++;
 }
 ret[cnt] = NULL;
 return ret;
}
int main()
{
 char **x = strsep("foobarfoobarfoo", "bar");
 while (*x)
 {
 printf("%s\n", *x++);
 }
 return 0;
}
Toby Speight
87.8k14 gold badges104 silver badges325 bronze badges
answered Aug 23, 2017 at 14:17
\$\endgroup\$
4
  • \$\begingroup\$ You probably want strcspn(3) instead of strstr(3). \$\endgroup\$ Commented Jul 2, 2024 at 22:06
  • 1
    \$\begingroup\$ BTW, strsep(3) is an existing glibc function (with different semantics from your proposal), so you probably want to choose a different name. \$\endgroup\$ Commented Jul 2, 2024 at 22:07
  • \$\begingroup\$ thats true. I believe own functions that start with str are not allowed in C standard even \$\endgroup\$ Commented Jul 3, 2024 at 6:43
  • \$\begingroup\$ Since C23 they are only "potentially reserved", which means they're not reserved until the standard standardizes them. <open-std.org/jtc1/sc22/WG14/www/docs/n2625.pdf>. So, I can define my own stpsep(), which I'll show in an answer, and be happy while the standard (or an implementation) doesn't call anything like that. \$\endgroup\$ Commented Jul 3, 2024 at 9:19

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.