I am trying to improve my C skills, and I hope someone might be able to provide me with some feedback on the following code.
It is just basic string splitting, it should behave the similar to Ruby's String#split or Clojure's clojure.string.split. I couldn't think of a simple/efficient way to create an array of variable-size strings so I went the callback route.
Anyway, any and all feedback is greatly appreciated, thank you! Check out the code:
void strsplit(char *str, char *delim, int limit, void (*cb)(char *s, int idx))
{
char *search = strdup(str);
if (limit == 1) {
cb(search, 0);
}
else {
int i = 0, count = 0, len = strlen(str), len_delim = strlen(delim);
char *segment = NULL, *leftover = NULL;
limit = limit > 0 ? limit - 1 : len;
segment = strtok(search, delim);
for (i = 0; segment != NULL && i < limit; i++) {
count += strlen(segment) + len_delim;
cb(segment, i);
segment = strtok(NULL, delim);
}
if (len != limit && count < len) {
leftover = (char*) malloc(len - count + 1);
memcpy(leftover, str + count, len - count);
leftover[len - count] = '0円';
cb(leftover, i);
free(leftover);
}
}
}
Also see the code with a test framework.
2 Answers 2
Here's a few comments. You need a comment giving a better definition of what the function should do. Otherwise we have to guess what your approximation to the Ruby function might be. An alternative interface might be to pass in a pointer array and its length instead of the callback.
I prefer to use
strspn
andstrcspn
orstrsep
if available to locate the tokens (just my preference).str
anddelim
should be const char *I'd prefer to see the
limit == 1
case handled in the main clause. Is there a good reason to treat it separately?Each variable on its own line
I think you can modify the algorithm to avoid having to count the length of the string (
strlen(str)
). I don't think eitherlen
orcount
is necessary.efficiency: you traverse the string with
strlen
at the start, thenstrtok
traverses each word and then you do it again withstrlen
on the word.why limit - 1 ??
not sure
count
will be computed correctly (len_delim
added but the actual sequence of separators present may not be of that length)prefer
len < limit
tolen != limit
(more robust to future changes)is the condition for the final
if
clause correct? When do you want it to be executed?prefer brackets round multiple conditions
((len != limit) && (count < len))
why use both
strdup
andmalloc
?malloc return should not be cast (C not C++)
why malloc the remaining string from
str
instead of just modifyingsearch
- it is already writeable because you strdup-ed itthe duped string should be freed on return, else it is a leak. Depends what the callback does with the bits of course.
Possibly missed other things - but there are enough things above to be going on with I guess :-)
-
\$\begingroup\$ Awesome, thank you, this is exactly the critical feedback I was hoping for. \$\endgroup\$gf3– gf32012年05月05日 20:03:15 +00:00Commented May 5, 2012 at 20:03
-
\$\begingroup\$ The reason for
limit == 1
is to handle the case that no splitting is required, this mimics the behaviour of the Ruby and Clojure behaviour. \$\endgroup\$gf3– gf32012年05月05日 20:05:29 +00:00Commented May 5, 2012 at 20:05 -
\$\begingroup\$ I then use
limit - 1
to match zero-based iteration with the limit. \$\endgroup\$gf3– gf32012年05月05日 20:06:06 +00:00Commented May 5, 2012 at 20:06 -
\$\begingroup\$ I believe I require
len
andcount
to track the offset of the last match, so I can then return the rest of the unsplit string in case of a limit. \$\endgroup\$gf3– gf32012年05月05日 20:08:03 +00:00Commented May 5, 2012 at 20:08 -
\$\begingroup\$ I incorrectly assumed that
strdup
allocated on the stack, so I did not free it. How embarrassing. \$\endgroup\$gf3– gf32012年05月05日 20:09:22 +00:00Commented May 5, 2012 at 20:09
I'd suggest not rolling your own implementation; just use strtok_r
properly and save yourself some time. For memory allocation, you can either use the offsets into the string in-place, or use strndup
to get copies of each token as you find it.
-
1\$\begingroup\$ This is purely for learning experience. It isn't actually being used anywhere. \$\endgroup\$gf3– gf32012年04月27日 21:30:12 +00:00Commented Apr 27, 2012 at 21:30
strdup
doesn't lead to warm fuzzy feelings, andstrtok
makes me feel...less than happy. \$\endgroup\$strdup
should be avoided because it's non-standard (though, in fairness, it is pretty widely available).strtok
should be avoided because it's ugly to use, requires heroic efforts to be thread safe, and can't be used on string literals. Duplicating the input avoids the last problem, but not the other two. \$\endgroup\$free
on yoursearch
string since you calledstrdup
to allocate it. \$\endgroup\$