I am trying to implement substring in C. How efficient and standard is this code? I have a hard time understanding how C functions are usually meant to work, like how printf() returns an int or size_t.
size_t substring(char *destination, char const *string, int start, int len) {
    // writes a substring to the destinaton starting at start index of string until end stepping by step
    // follows the same start at 0 off by one on the end format as Java substring method
    int substringLength = len;
    int stringLength = strlen(string);
    if (start > stringLength || len > stringLength || start < 0 || len < 0) {
        fputs("start and len must be 0 < start/len < length of string", stderr);
        return -1;
    }
    if (start >= len) {
        fputs("start must be start < len", stderr);
        return -1;
    }
    memcpy(destination, &string[start], substringLength);
    if (destination[stringLength] != '\x0') {
        destination[stringLength] = '\x0';
    }
    return substringLength;
}
6 Answers
Argument confusion
If you were trying to replicate the Java substring function, you should have had a start and end argument. As it is, you have a start and length argument, but you didn't handle the length argument correctly.
Bug 1
This line doesn't correctly check for the case of copying off the end of the source string:
if (start > stringLength || len > stringLength || start < 0 || len < 0) {
It needs to be:
if (start + len > stringLength || start < 0 || len < 0) {
Edit: If you want to check for integer overflow, you can also throw in || start + len < 0, and use size_t everywhere instead of int.
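One way to sidestep the overflow question entirely is to never compute start + len at all. The sketch below assumes size_t arguments; range_fits is a hypothetical helper name, not something from the original code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true when [start, start + len) lies inside source. */
static bool range_fits(const char *source, size_t start, size_t len)
{
    size_t source_len = strlen(source);

    /* Check start first so that source_len - start cannot wrap around. */
    if (start > source_len)
        return false;

    /* Compare against the remaining length instead of adding start + len. */
    return len <= source_len - start;
}
```

Because the subtraction only happens after start has been checked, no intermediate value can overflow, even when len is close to SIZE_MAX.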
Bug 2
This check is unnecessary and can return an error incorrectly:
if (start >= len) { fputs("start must be start < len", stderr); return -1; }
For example, if the string has 10 bytes, start is 8 and len is 1, this check will return an error.
Bug 3
When you terminate the destination string, you use the wrong index. It should be len, not stringLength. Also, I'm not sure why you do a check before you write to it:
if (destination[stringLength] != '\x0') { destination[stringLength] = '\x0'; }
should be:
destination[len] = '\0';
Also, '\0' is the standard way to write the null character. I've never seen it written as '\x0' before.
Unnecessary local variable
The variable substringLength is unnecessary because it is always the same as len. You can just use len instead.
Return value
The Java substring function returns the string. Your function returns the length, but I don't think this will be useful to the caller. The reason you might want to return a string is so you can chain calls like this:
strcat(substring(buf, path, dir_start, dir_length), basename);
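Putting the points of this answer together, a revised function might look roughly like the sketch below. It assumes a start/length interface with size_t arguments and signals failure with NULL, which is one possible convention rather than the only reasonable one.

```c
#include <stddef.h>
#include <string.h>

/* Sketch: copies len bytes of source, starting at start, into destination.
 * Returns destination on success, or NULL if the range is out of bounds.
 * The caller must provide room for len + 1 bytes in destination. */
char *substring(char *destination, const char *source, size_t start, size_t len)
{
    size_t source_len = strlen(source);

    /* Overflow-safe form of "start + len > source_len". */
    if (start > source_len || len > source_len - start)
        return NULL;

    memcpy(destination, source + start, len);
    destination[len] = '\0';
    return destination;
}
```

Note that chaining straight into strcat() is only safe when the range is known to be valid, since a NULL return would be passed on to strcat().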
- Note: start + len > stringLength is not safe against integer overflow. – Stack Exchange Broke The Law, Dec 3, 2015 at 7:35
- On the other hand, if trying to emulate the .NET Substring(), then passing the length would be right. – 200_success, Dec 4, 2015 at 9:19
- @200_success But in the comments in the code, he specifically says he is trying to emulate the java version. – JS1, Dec 4, 2015 at 18:36
I have a hard time understanding how C functions are usually meant to work
The best thing to do is to read the documentation, which is only a Google search away.
For instance, you'll determine that strlen() returns a size_t, so stringLength should also be a size_t, same with len. Mixing this with an int or other signed type can cause "possible loss of data" warnings.
Also, if you have a string called destination, then the other one should be called source.
At first glance it looks kind of OK, but there are at least a few bugs:
- Given a start > 0 and len < strlen(string), but still large, your memcpy might end up writing beyond allocated memory.
- Why the start < len test? Given start = 5 and len = 2, but string = 'A somewhat long string', it should be allowed to copy 2 characters from that string. I think what you mean to check for is something like (start + len) <= stringLength.
You should also be aware that C also has functions in its standard library that can copy part of a string, like strncpy or strcpy, so you're kind of reinventing the wheel.
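Another standard-library route, not mentioned in the answer above, is snprintf() with a precision on %s. The sketch below assumes that start is no larger than strlen(source) and that len fits in an int; substring_snprintf is a made-up name for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Sketch: formats at most len characters of source + start into dest.
 * Assumes start <= strlen(source) and len <= INT_MAX.
 * Returns what snprintf() returns: the length the result would have
 * had if dest were large enough. */
int substring_snprintf(char *dest, size_t dsize, const char *source,
                       size_t start, size_t len)
{
    return snprintf(dest, dsize, "%.*s", (int)len, source + start);
}
```

snprintf() always terminates the output when dsize is nonzero, so a return value of dsize or more tells the caller that the result was truncated.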
Printing an error message to stderr and returning -1 if argument validation fails isn't very good practice. For one thing, there's no way to distinguish different error cases; for another, there's no way to avoid the log spam.
If an error you detect is always programmer error then you should catch it with an assertion that stops the program dead in its tracks (or better yet, a static assertion that stops the program from compiling). Otherwise, if the error could be caused by user input, you should return an error code (preferably a different error code for each case) and let your caller decide how the error should be handled. Printing a canned error to the screen may not be what the user of your function wants!
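As a rough illustration of that split, the sketch below asserts on what it treats as programmer error (NULL pointers) and returns a distinct code for each range problem. The enum and function names are hypothetical, and which cases count as programmer error is a judgment call for your own code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes, one per failure case. */
enum substring_status {
    SUBSTRING_OK = 0,
    SUBSTRING_ERR_START,  /* start is past the end of source */
    SUBSTRING_ERR_LENGTH  /* start + len runs past the end of source */
};

enum substring_status substring_checked(char *destination, const char *source,
                                        size_t start, size_t len)
{
    /* NULL pointers are treated as programmer error in this sketch. */
    assert(destination != NULL && source != NULL);

    size_t source_len = strlen(source);
    if (start > source_len)
        return SUBSTRING_ERR_START;
    if (len > source_len - start)
        return SUBSTRING_ERR_LENGTH;

    memcpy(destination, source + start, len);
    destination[len] = '\0';
    return SUBSTRING_OK;
}
```

Nothing is printed here; the caller decides whether to log, retry, or ignore each status.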
Finally, you have the opportunity to decide that some cases aren't errors at all. It can be convenient, and reduce the number of special cases in calling code, to allow the boundaries of a substring to be outside of the source string, or to allow the end position to be less than the start position, and to just trim the output string appropriately. Many standard substring functions allow some or all of these combinations.
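A permissive version along those lines could be sketched as follows. It assumes a start/length interface and simply trims both values to the source string; substring_clamped is an illustrative name.

```c
#include <stddef.h>
#include <string.h>

/* Sketch: out-of-range arguments are trimmed instead of rejected.
 * Returns the number of characters copied.
 * The caller must provide room for len + 1 bytes in destination. */
size_t substring_clamped(char *destination, const char *source,
                         size_t start, size_t len)
{
    size_t source_len = strlen(source);

    if (start > source_len)
        start = source_len;        /* start past the end: empty result */
    if (len > source_len - start)
        len = source_len - start;  /* trim to what is actually there */

    memcpy(destination, source + start, len);
    destination[len] = '\0';
    return len;
}
```

With this behavior there is no error path at all for the range arguments, so calling code needs no special cases for short inputs.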
For example, you might use the substring function to get the "first 80 characters of headline" to display a summary, but it doesn't need to be an error if the headline is less than 80 characters to begin with; the summary in that case would just be the whole headline. You might also get the rest of the headline by asking for a substring beginning at character 80; for a short headline this can just be the empty string instead of returning an error. This preserves the property that headline = summary + remainder. There are tradeoffs involved in stronger and weaker validation, but my point is that if you can come up with meaningful behavior for a set of arguments, then your function might be more useful by allowing it, rather than being needlessly restrictive.
char *destination needs a size
Any new string function can be made far more robust by passing in the size of the destination array. The minor performance saving from skipping the size checks is far outweighed by the grief caused by message overruns.
Use size_t for all string sizes and indexes. Type int may be of insufficient range, especially for this function whose goal is to make sub-strings of larger strings.
Avoid an argument name like string. Use a name that describes its role.
// size_t substring(char *destination, char const *string, int start, int len) {
size_t substring(char *destination, size_t dsize, char const *source,
size_t start, size_t len) {
Investigate restrict. Your present code has trouble if destination and string overlap. Either adjust the code to handle that or use restrict to indicate your function cannot cope with overlap.
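For the restrict route, the declaration might look like the sketch below. It assumes the caller has already validated the range and passes a nonzero dsize; substring_restrict is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* restrict documents that destination and source must not overlap;
 * calling this with overlapping buffers is undefined behavior. */
size_t substring_restrict(char *restrict destination, size_t dsize,
                          const char *restrict source,
                          size_t start, size_t len)
{
    /* Assumption for this sketch: the caller has checked that
     * start + len <= strlen(source) and that dsize > 0. */
    if (len > dsize - 1)
        len = dsize - 1;  /* truncate to fit the destination */

    memcpy(destination, source + start, len);
    destination[len] = '\0';
    return len;
}
```

If overlapping buffers need to work, drop restrict and use memmove() instead of memcpy().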
Define corner cases
Instead of returning (size_t)-1 on "start and len must be 0 < start/len < length of string", define a plausible functionality.
Along with using size_t dsize, there should be no error conditions possible. All possible legitimate inputs should result in a defined result.
// dest and source may overlap
void substring1(char *dest, size_t dsize, const char *source, size_t start, size_t length) {
    size_t source_len = strlen(source);
    if (start > source_len) start = source_len;
    if (start + length*1ull > source_len) length = source_len - start;
    if (length + 1 > dsize) {
        if (dsize == 0) {
            return;
        }
        length = dsize - 1;
    }
    memmove(dest, &source[start], length);
    dest[length] = 0;
}
- It's the caller's responsibility to provide a destination buffer that is long enough to contain the result. It seems silly to have to pass dsize just to reassure substring1() of that fact. – 200_success, Dec 4, 2015 at 19:33
- @200_success C's standard library has grown with size restricted functions like snprintf(), strftime(), many others too and has dropped the unrestricted size function like gets(). The standard's direction to using size_t in more functions is not silly. Since my coding style reflects that model, checking size is a legitimate coding style. We may professionally disagree on this, but neither approach is silly - the larger program's coding goals often drive the best approach. – chux, Dec 4, 2015 at 19:47
- We already have a length argument. Adding another one makes the API more cumbersome but not a bit safer. – Deduplicator, Dec 5, 2015 at 12:41
Interface
Your function looks more like the .NET Substring function (for which you specify the length) than the Java substring function (for which you specify the exclusive endIndex).
Validation
Calling strlen() anywhere in this function is, in my opinion, not acceptable. If I'm trying to extract the first 5 bytes of a megabyte-long string, why should the operation require the entire string to be traversed?
Polluting standard error is unconventional in a library function like this: no function in the standard C library reports errors that way. Errors should be reported by returning an error code, returning an error code via an out-parameter, or via a global error flag (which is less preferred, but still acceptable). In this case, you could also use assertions instead, since out-of-bounds errors are programmer errors, not user errors.
If you want to return an error code, you should not return -1 as a size_t, because size_t is unsigned.
Alternatively, consider redefining the behaviour such that out-of-bounds access just produces an empty or shorter result, rather than an error. (That's not how the .NET and Java functions behave, though.) Then, you could design the function such that it returns the length of the string that was actually copied (which would ideally be the same as len, but might be shorter if the bounds were wrong).
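A strlen()-free version with that return convention could be sketched with two plain loops, so that at most start + len characters of the source are ever examined. The name substring_lazy is made up for this illustration.

```c
#include <stddef.h>

/* Sketch: copies up to len characters of source starting at start,
 * without calling strlen(). Returns the number of characters copied.
 * The caller must provide room for len + 1 bytes in destination. */
size_t substring_lazy(char *destination, const char *source,
                      size_t start, size_t len)
{
    size_t i;

    /* Walk to start, stopping early if the string ends first. */
    for (i = 0; i < start && source[i] != '\0'; i++)
        ;
    source += i;

    /* Copy until len characters are done or the terminator is reached. */
    for (i = 0; i < len && source[i] != '\0'; i++)
        destination[i] = source[i];
    destination[i] = '\0';

    return i;
}
```

A return value smaller than len tells the caller that the requested range ran past the end of the source string.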
Implementation
Instead of memcpy(), the function you want to use is strncpy():
strncpy(destination, string + start, len);
destination[len] = '\0'; /* Important! */
Or, better yet, if stpncpy() is available on your target platform:
char *termination = stpncpy(destination, string + start, len);
*termination = '\0'; /* Important! */
return termination - destination; /* The length of the result */
- [...] memmove() or memcpy() to accomplish this.
- [...] memcpy(). What's your point?
- [...] memmove() or memcpy() directly, since those functions don't handle NUL-termination.