Question
Recently I've written the following splitv()
function, without using strtok
. Any way to make it shorter and more efficient?
Descrption
The function calculates the index-th
element of the string str
split by delim
. The element is written into buff
and 0
is returned. When an index error is occurred, any non 0
value is returned. This function was inspired by the python split function: buff = str.split(delim)[index]
, but I think it works too hard and can be improved.
int splitv(char buff[], char str[], char delim[], int index) {
// returns 0 if an element is written. Else non 0.
char *start = str, *end;
int i = 0, buff_i = 0;
// check if delim not in str
if ((end = strstr(str, delim)) == NULL) {
if (index == 0) strcpy(buff, str);
return index;
}
// delim is in str
for (i = 0; i < index; i++) {
start = end + strlen(delim);
end = strstr(start, delim);
if (end == NULL) {
// reached the last element
if (i == index - 1) {
end = str + strlen(str);
break;
}
// index error: index >= elements_count
return -1;
}
}
// copy element to buffer
while (start != end) {
buff[buff_i] = *start;
start++;
buff_i++;
}
buff[buff_i] = '0円';
return 0;
}
Thanks in advance!
2 Answers 2
Before we talk about improving the efficiency of your function, lets talk about safety:
- You never check the validity of your function parameters.
- You have no way of knowing how big
buff
is.
For the first issue, if str
or delim
is null, passing them directly in to strstr
will cause a segfault and crash your program.
The second issue is worse: if buff
is smaller then str
, your function may end up clobbering memory when it tries to copy from str
to buff
.
This is undefined behavior, and the results are unpredictable.
Now for some efficiency suggestions:
Calculate string length only once.
Sincestrlen
works by running over the string looking for '0円' terminator, it is more efficient to call it only once at the beginning of your function and store the result in a variable.
This can also allow you to implement a simple shortcut for edge cases such as wherestr
ordelimiter
are empty strings, or delimiter is longer thenstr
.You don't need your own
while
loop at the end and thebuff_i
variable.
You can simply usememcpy
orstrncpy
to copy the element in to the buffer.
memcpy
would be slightly more efficient, since it does not check for '0円', whilestrncpy
does.Since your logic is that if delimiter is never found, but the requested element is 0 the result is "success", you don't need to handle this as a separate edge case.
That means you can get rid of the firstif
block, and handle everything in one loop.
You can also get rid of the special checks in theif
loop and just focus on the search.
The following code example does not include the suggested validity checks:
int i;
char *start = str;
char *end = start;
size_t delim_len = strlen(delim);
size_t copy_len;
for (i = 0; i <= index && end != NULL; i++) {
end = strstr(start, delim);
if (end != NULL && i != index) start = end + delim_len;
}
/* element not found */
if (i != index) return index;
if (end == NULL) {
copy_len = strlen(start);
} else {
copy_len = (ssize_t)(end - start);
}
memcpy(buff, start, copy_len);
buff[copy_len] = '0円';
return 0;
-
1\$\begingroup\$ @chux-ReinstateMonica typo on my side, thanks for pointing it out, fixed. \$\endgroup\$Lev M.– Lev M.2021年02月10日 20:14:19 +00:00Commented Feb 10, 2021 at 20:14
int splitv(char buff[], char str[], char delim[], int index) { // returns 0 if an element is written. Else non 0.
A few points about this interface:
- Shouldn't
delim
be a pointer toconst char
? I think the function ought not to be modifying it. Similarly, if we can avoid modifying the input string (which is one major advantage to not usingstrtok()
), then take that as pointer-to-const, too. - Please accept an argument to indicate the capacity of
buff()
so that we can write code without buffer overflows! index
ought to be unsigned, as negative values make no sense here.- We could make the return value more useful by returning the length of the match (or negative on failure). Like the interface to
snprintf()
, this is very useful in the event the buffer is too small. I'd even go as far as acceptingNULL, 0
for the buffer arguments in the same manner. - It's more conventional to write the character pointer arguments as
char *arg
than with the indefinite array syntax. The two are identical as far as the compiler is concerned, of course, but keeping with convention makes code easier for other programmers to read.
Let's see how we'd use the existing interface:
#include <stdio.h>
int main(void)
{
char list[] = "Alice, Bob, Charlie, Don";
char delim[] = ", ";
char buff[10]; /* is that enough?? */
if (splitv(buff, list, delim, 2)) {
fputs("Couldn't split the string.\n", stderr);
return 1;
}
printf("Found %s\n", buff);
}
See how much easier the new interface is:
int splitv(char *buff, size_t buff_len,
const char *str, const char *delim,
unsigned int index);
#include <stdio.h>
int main(void)
{
const char *list = "Alice, Bob, Charlie, Don";
char buff[10];
int len = splitv(buff, sizeof buff, list, ", ", 2);
if (len < 0) {
fputs("Couldn't split the string.\n", stderr);
return 1;
}
if ((size_t)len >= sizeof buff) {
fputs("Substring too long.\n", stderr);
return 1;
}
printf("Found %s\n", buff);
}
We don't have to guess if our buffer is big enough (and adaptive code can use the result to allocate a big enough buffer), and we can pass string literals as arguments.
const char *start = str, *end; int i = 0, buff_i = 0;
Prefer to declare one variable per line, and prefer to declare where you can also initialise:
const char *start = str;
// check if delim not in str
const char *end = strstr(str, delim);
if (!end) {
for (int i = 0; i < index; ++i) {
We have two bits of code doing the same thing when we reach the last element. It may help to rethink what this code does:
- Skip over zero or more substrings ending in
delim
. - Find the next occurrence of
delim
, or the end of string. - Copy the substring.
Here's a simplified implementation that conforms to my suggested interface, and is structured according to those steps:
#include <stdint.h>
#include <string.h>
int splitv(char *buff, size_t buff_len,
const char *str, const char *delim,
unsigned int index)
{
/* returns length of the substring if it exists */
/* returns negative on failure */
/* useful constant */
const size_t delim_len = strlen(delim);
while (index--) {
/* advance past next delimiter */
const char *next = strstr(str, delim);
if (!next) {
/* not enough delimiters */
return -1;
}
str = next + delim_len;
}
/* okay, we've found the right substring; now find its end */
const char *end = strstr(str, delim);
if (!end) {
end = str + strlen(str);
}
/* decide how much to copy, and write it */
size_t len = (size_t)(end - str);
if (buff && buff_len > 0) {
size_t write_size = len < buff_len ? len : buff_len - 1;
/* write the output */
memcpy(buff, str, write_size);
buff[write_size] = 0;
}
/* return the substring length (not including null terminator) */
return (int)len;
}
Note that I've hoisted strlen(delim)
into a constant, as we know this doesn't need to be recomputed every time round the loop. I've also used standard memcpy
function instead of the hand-rolled loop to write to buff
.
-
1\$\begingroup\$ Further than making
index
unsigned, it should be asize_t
, as this function won't work on most (all?) 64-bit systems with more than 4B occurrences \$\endgroup\$lights0123– lights01232021年02月06日 04:07:34 +00:00Commented Feb 6, 2021 at 4:07 -
2\$\begingroup\$ Yes, I think so. In fact, in an early iteration I did exactly that, and I'm not sure why I changed it back. And
int
as return type is wrong, too (probably should besize_t
, with(size_t)-1
used as the error sentinel. Can I argue that my code isn't the finished item, just a step on the way? \$\endgroup\$Toby Speight– Toby Speight2021年02月06日 11:14:52 +00:00Commented Feb 6, 2021 at 11:14 -
\$\begingroup\$ Yes, quite right. Thanks @chux \$\endgroup\$Toby Speight– Toby Speight2021年02月10日 07:15:13 +00:00Commented Feb 10, 2021 at 7:15