String splitting in C

Question 1

I am trying to improve my C skills, and I hope someone might be able to provide me with some feedback on the following code.

It is just basic string splitting, it should behave the similar to Ruby's String#split or Clojure's clojure.string.split. I couldn't think of a simple/efficient way to create an array of variable-size strings so I went the callback route.

Anyway, any and all feedback is greatly appreciated, thank you! Check out the code:

void strsplit(char *str, char *delim, int limit, void (*cb)(char *s, int idx))
{
 char *search = strdup(str);
 if (limit == 1) {
 cb(search, 0);
 }
 else {
 int i = 0, count = 0, len = strlen(str), len_delim = strlen(delim);
 char *segment = NULL, *leftover = NULL;
 limit = limit > 0 ? limit - 1 : len;
 segment = strtok(search, delim);
 for (i = 0; segment != NULL && i < limit; i++) {
 count += strlen(segment) + len_delim;
 cb(segment, i);
 segment = strtok(NULL, delim);
 }
 if (len != limit && count < len) {
 leftover = (char*) malloc(len - count + 1);
 memcpy(leftover, str + count, len - count);
 leftover[len - count] = '0円';
 cb(leftover, i);
 free(leftover);
 }
 }
}

Also see the code with a test framework.

Question 2

Hmm...at least for me, strdup doesn't lead to warm fuzzy feelings, and strtok makes me feel...less than happy.

Question 3

@JerryCoffin Can you point me to some literature as to why I should avoid these guys?

Question 4

strdup should be avoided because it's non-standard (though, in fairness, it is pretty widely available). strtok should be avoided because it's ugly to use, requires heroic efforts to be thread safe, and can't be used on string literals. Duplicating the input avoids the last problem, but not the other two.

Question 5

I might be missing the call here, but you need to be sure to call free on your search string since you called strdup to allocate it.

Question 6

Here's a few comments. You need a comment giving a better definition of what the function should do. Otherwise we have to guess what your approximation to the Ruby function might be. An alternative interface might be to pass in a pointer array and its length instead of the callback.

I prefer to use strspn and strcspn or strsep if available to locate the tokens (just my preference).
str and delim should be const char *
I'd prefer to see the limit == 1 case handled in the main clause. Is there a good reason to treat it separately?
Each variable on its own line
I think you can modify the algorithm to avoid having to count the length of the string (strlen(str)). I don't think either len or count is necessary.
efficiency: you traverse the string with strlen at the start, then strtok traverses each word and then you do it again with strlen on the word.
why limit - 1 ??
not sure count will be computed correctly (len_delim added but the actual sequence of separators present may not be of that length)
prefer len < limit to len != limit (more robust to future changes)
is the condition for the final if clause correct? When do you want it to be executed?
prefer brackets round multiple conditions ((len != limit) && (count < len))
why use both strdup and malloc ?
malloc return should not be cast (C not C++)
why malloc the remaining string from str instead of just modifying search - it is already writeable because you strdup-ed it
the duped string should be freed on return, else it is a leak. Depends what the callback does with the bits of course.

Possibly missed other things - but there are enough things above to be going on with I guess :-)

Question 7

Awesome, thank you, this is exactly the critical feedback I was hoping for.

Question 8

The reason for limit == 1 is to handle the case that no splitting is required, this mimics the behaviour of the Ruby and Clojure behaviour.

Question 9

I then use limit - 1 to match zero-based iteration with the limit.

Question 10

I believe I require len and count to track the offset of the last match, so I can then return the rest of the unsplit string in case of a limit.

Question 11

I incorrectly assumed that strdup allocated on the stack, so I did not free it. How embarrassing.

Question 12

I'd suggest not rolling your own implementation; just use strtok_r properly and save yourself some time. For memory allocation, you can either use the offsets into the string in-place, or use strndup to get copies of each token as you find it.

Question 13

This is purely for learning experience. It isn't actually being used anywhere.

William Morris William Morris 9,38219 silver badges42 bronze badges · Accepted Answer · 2012-04-30 04:42:54Z

Here's a few comments. You need a comment giving a better definition of what the function should do. Otherwise we have to guess what your approximation to the Ruby function might be. An alternative interface might be to pass in a pointer array and its length instead of the callback.

I prefer to use strspn and strcspn or strsep if available to locate the tokens (just my preference).
str and delim should be const char *
I'd prefer to see the limit == 1 case handled in the main clause. Is there a good reason to treat it separately?
Each variable on its own line
I think you can modify the algorithm to avoid having to count the length of the string (strlen(str)). I don't think either len or count is necessary.
efficiency: you traverse the string with strlen at the start, then strtok traverses each word and then you do it again with strlen on the word.
why limit - 1 ??
not sure count will be computed correctly (len_delim added but the actual sequence of separators present may not be of that length)
prefer len < limit to len != limit (more robust to future changes)
is the condition for the final if clause correct? When do you want it to be executed?
prefer brackets round multiple conditions ((len != limit) && (count < len))
why use both strdup and malloc ?
malloc return should not be cast (C not C++)
why malloc the remaining string from str instead of just modifying search - it is already writeable because you strdup-ed it
the duped string should be freed on return, else it is a leak. Depends what the callback does with the bits of course.

Possibly missed other things - but there are enough things above to be going on with I guess :-)

Awesome, thank you, this is exactly the critical feedback I was hoping for.
The reason for limit == 1 is to handle the case that no splitting is required, this mimics the behaviour of the Ruby and Clojure behaviour.
I then use limit - 1 to match zero-based iteration with the limit.
I believe I require len and count to track the offset of the last match, so I can then return the rest of the unsplit string in case of a limit.
I incorrectly assumed that strdup allocated on the stack, so I did not free it. How embarrassing.

Stack Exchange Network

String splitting in C

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

String splitting in C

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions