Substring function in C

Question 1

I am trying to implement substring in C. How efficient and standard is this code? I have a hard time understanding how C functions are usually meant to work, like how printf() returns an int or size_t.

size_t substring(char *destination, char const *string, int start, int len) {
 // writes a substring to the destinaton starting at start index of string until end stepping by step
 // follows the same start at 0 off by one on the end format as Java substring method
 int substringLength = len;
 int stringLength = strlen(string);
 if (start > stringLength || len > stringLength || start < 0 || len < 0) {
 fputs("start and len must be 0 < start/len < length of string", stderr);
 return -1;
 }
 if (start >= len) {
 fputs("start must be start < len", stderr);
 return -1;
 }
 memcpy(destination, &string[start], substringLength);
 if (destination[stringLength] != '\x0') {
 destination[stringLength] = '\x0';
 }
 return substringLength;
}

It would be good to note that you could just use the standard memmove() or memcpy() to accomplish this.
@syb0rg We are being asked to review this wrapper for memcpy(). What's your point?
@200_success I guess it was poorly communicated in my comment, but in terms of efficiency alone I was thinking he could discard the wrapper and just use the standard functions.
@syb0rg There is significant benefit to using a wrapper instead of calling memmove() or memcpy() directly, since those functions don't handle NUL-termination.


JS1 · Answer 1 · 2015-12-03 03:11:51Z

Argument confusion

If you were trying to replicate the Java substring function, you should have had start and end arguments. As it is, you have start and length arguments, but you didn't handle the length argument correctly.

Bug 1

This line doesn't correctly check for the case of copying off the end of the source string:

if (start > stringLength || len > stringLength || start < 0 || len < 0) {

It needs to be:

if (start + len > stringLength || start < 0 || len < 0) {

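Edit: The sum start + len can itself overflow. Adding || start + len < 0 is not a reliable guard, because signed integer overflow is undefined behaviour in C. It is safer to use size_t everywhere instead of int and to test len > stringLength - start once start <= stringLength is known.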
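
As a rough sketch of an overflow-safe bounds check, the comparison can be written with a subtraction instead of a sum. The helper name range_is_valid and its parameter names are made up for illustration and are not part of the reviewed code:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true when [start, start + len) lies inside a
 * string of length string_length. Written with a subtraction so that
 * no intermediate sum can wrap around. */
static bool range_is_valid(size_t string_length, size_t start, size_t len)
{
    if (start > string_length) {
        return false;
    }
    /* string_length - start cannot underflow after the check above. */
    return len <= string_length - start;
}
```

With this form, the case from Bug 2 (a 10-byte string, start 8, len 1) is accepted, while a len close to the maximum size_t value is rejected rather than wrapping.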

Bug 2

This check is unnecessary and can return an error incorrectly:

if (start >= len) {
 fputs("start must be start < len", stderr);
 return -1;
}

For example, if the string has 10 bytes, start is 8 and len is 1, this check will return an error.

Bug 3

When you terminate the destination string, you use the wrong index. It should be len, not stringLength. Also, I'm not sure why you do a check before you write to it:

if (destination[stringLength] != '\x0') {
 destination[stringLength] = '\x0';
}

should be:

destination[len] = '\0';

Also, '\0' is the standard way to write the null character. I've never seen it written as '\x0' before.

Unnecessary local variable

The variable substringLength is unnecessary because it is always the same as len. You can just use len instead.

Return value

The Java substring function returns the string. Your function returns the length, but I don't think that this will be useful to the caller. The reason you might want to return a string is so you can chain calls like this:

strcat(substring(buf, path, dir_start, dir_length), basename);
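
For illustration, a variant that returns the destination pointer (as strcpy() does) might look like the sketch below. The name substring_to is hypothetical, and the sketch assumes the caller has already validated the range and sized the buffer:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical variant that returns destination, like strcpy() does.
 * Assumes start + len <= strlen(source), that destination has room for
 * len + 1 bytes, and that the two buffers do not overlap. */
static char *substring_to(char *destination, const char *source,
                          size_t start, size_t len)
{
    memcpy(destination, source + start, len);
    destination[len] = '\0';
    return destination;
}
```

A call such as strcat(substring_to(buf, path, 0, 9), "libm.so") then works in a single expression, provided buf is large enough for the combined result.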

Note: start + len > stringLength is not safe against integer overflow.
On the other hand, if trying to emulate the .NET Substring(), then passing the length would be right.
@200_success But in the comments in the code, he specifically says he is trying to emulate the java version.

Jamal · Answer 2 · 2015-12-03 01:23:21Z

I have a hard time understanding how C functions are usually meant to work

The best thing to do is to read the documentation, which is only a Google search away.

For instance, you'll determine that strlen() returns a size_t, so stringLength should also be a size_t, same with len. Mixing this with an int or other signed type can cause "possible loss of data" warnings.
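
One concrete pitfall of mixing the types, sketched under the assumption that the length becomes a size_t while len stays an int: in a comparison the int is converted to size_t, so a negative value turns into a very large one. The function name exceeds_length_int is invented for this example:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical check: nonzero when len exceeds the string length.
 * The explicit cast mirrors the conversion the compiler would apply
 * implicitly when comparing an int with a size_t. */
static int exceeds_length_int(const char *s, int len)
{
    return (size_t)len > strlen(s);
}
```

Here exceeds_length_int("abc", -1) reports that -1 exceeds the length, which is rarely what the author of such a check intended. Declaring every size and index as size_t avoids the mixed comparison altogether.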

Also, if you have a string called destination, then the other one should be called source.

holroy · Answer 3 · 2015-12-03 02:27:18Z

At first glance it looks kind of OK, but there are at least a few bugs:

Given start > 0 and len < strlen(string) but still large, your memcpy might end up accessing memory beyond what was allocated.
Why the start < len test? Given start = 5 and len = 2, with string = 'A somewhat long string', it should be allowed to copy 2 characters from that string. I think what you mean to check for is something like (start + len) <= stringLength.

You should also be aware that C already has substring-like functions in its standard library, such as strncpy or strcpy, so you're kind of reinventing the wheel.

hobbs · Answer 4 · 2015-12-03 08:29:04Z

Printing an error message to stderr and returning -1 if argument validation fails isn't very good practice. For one thing, there's no way to distinguish different error cases; for another, there's no way to avoid the log spam.

If an error you detect is always programmer error then you should catch it with an assertion that stops the program dead in its tracks (or better yet, a static assertion that stops the program from compiling). Otherwise, if the error could be caused by user input, you should return an error code (preferably a different error code for each case) and let your caller decide how the error should be handled. Printing a canned error to the screen may not be what the user of your function wants!
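
A minimal sketch of that split, assuming null pointers count as programmer errors and out-of-range indexes may come from user input. The enum, its constants, and the function name substring_checked are all hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical error codes, one per failure case. */
enum substring_status {
    SUBSTRING_OK = 0,
    SUBSTRING_START_OUT_OF_RANGE,
    SUBSTRING_LENGTH_OUT_OF_RANGE
};

static enum substring_status substring_checked(char *destination,
                                               const char *source,
                                               size_t start, size_t len)
{
    /* Passing a null pointer is treated as a programmer error. */
    assert(destination != NULL && source != NULL);

    size_t source_length = strlen(source);
    if (start > source_length) {
        return SUBSTRING_START_OUT_OF_RANGE;
    }
    if (len > source_length - start) {
        return SUBSTRING_LENGTH_OUT_OF_RANGE;
    }
    memcpy(destination, source + start, len);
    destination[len] = '\0';
    return SUBSTRING_OK;
}
```

Nothing is printed here; the caller sees which check failed and can decide whether to log, retry, or abort.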

Finally, you have the opportunity to decide that some cases aren't errors at all. It can be convenient, and reduce the number of special cases in calling code, to allow the boundaries of a substring to be outside of the source string, or to allow the end position to be less than the start position, and to just trim the output string appropriately. Many standard substring functions allow some or all of these combinations.

For example, you might use the substring function to get the "first 80 characters of headline" to display a summary, but it doesn't need to be an error if the headline is less than 80 characters to begin with; the summary in that case would just be the whole headline. You might also get the rest of the headline by asking for a substring beginning at character 80; for a short headline this can just be the empty string instead of an error. This preserves the property that headline = summary + remainder. There are tradeoffs involved in stronger and weaker validation, but my point is that if you can come up with meaningful behavior for a set of arguments, then your function might be more useful by allowing it, rather than being needlessly restrictive.
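
The headline case can be sketched as follows. The function name substring_clamped is hypothetical, and the sketch assumes the destination can hold len + 1 bytes:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical lenient substring: out-of-range requests are trimmed
 * instead of rejected. destination must hold at least len + 1 bytes.
 * Returns the number of characters copied. */
static size_t substring_clamped(char *destination, const char *source,
                                size_t start, size_t len)
{
    size_t source_length = strlen(source);
    if (start > source_length) {
        start = source_length;
    }
    if (len > source_length - start) {
        len = source_length - start;
    }
    memcpy(destination, source + start, len);
    destination[len] = '\0';
    return len;
}
```

For a 14-character headline, asking for (0, 80) yields the whole headline and asking for (80, 80) yields the empty string, so summary followed by remainder still reproduces the headline.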

chux · Answer 5 · 2015-12-04 05:25:53Z

char *destination needs a size

Any new string function can be made far more robust by passing in the size of the destination array. The minor performance saving from skipping the size check is far outweighed by the grief caused by buffer overruns.

Use size_t for all string sizes and indexes. Type int may be of insufficient range especially for this function whose goal is to make sub-strings of larger strings.

Avoid an argument name like string. Use a name that describes its role.

// size_t substring(char *destination, char const *string, int start, int len) {
size_t substring(char *destination, size_t dsize, char const *source, 
 size_t start, size_t len) {

Investigate restrict. Your present code has trouble if destination and string overlap. Either adjust code to handle that or use restrict to indicate your function can not cope with overlap.
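
If the function is not meant to support overlapping buffers, the restrict route might look like this sketch. The name substring_restrict is hypothetical, and the sketch assumes the range and buffer size were validated by the caller:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical signature: restrict (C99) documents that destination and
 * source must not overlap, which is also a precondition of memcpy().
 * Assumes start + len <= strlen(source) and room for len + 1 bytes. */
static void substring_restrict(char *restrict destination,
                               const char *restrict source,
                               size_t start, size_t len)
{
    memcpy(destination, source + start, len);
    destination[len] = '\0';
}
```

The qualifier does not add a run-time check; it is a promise made by the caller. Supporting in-place use would instead call for memmove().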

Define corner cases

Instead of returning (size_t)-1, on "start and len must be 0 < start/len < length of string", define a plausible functionality.

Along with using size_t dsize, there should be no error conditions possible. All possible legitimate inputs should result in a defined result.

#include <string.h>

// dest and source may overlap
void substring1(char *dest, size_t dsize, const char *source, size_t start, size_t length) {
    size_t source_len = strlen(source);
    // Clamp start to the end of the source string
    if (start > source_len) start = source_len;
    // Compare by subtraction so that start + length cannot wrap around
    if (length > source_len - start) length = source_len - start;
    // Leave room for the terminator in the destination
    if (length + 1 > dsize) {
        if (dsize == 0) {
            return;
        }
        length = dsize - 1;
    }
    memmove(dest, &source[start], length);
    dest[length] = 0;
}

Ref: Sample implementations

It's the caller's responsibility to provide a destination buffer that is long enough to contain the result. It seems silly to have to pass dsize just to reassure substring1() of that fact.
@200_success C's standard library has grown with size restricted functions like snprintf(), strftime(), many others too and has dropped the unrestricted size function like gets(). The standard's direction to using size_t in more functions is not silly. Since my coding style reflects that model, checking size is a legitimate coding style. We may professionally disagree on this, but neither approach is silly - the larger program's coding goals often drive the best approach.
We already have a length argument. Adding another one makes the API more cumbersome but not a bit safer.

200_success · Answer 6 · 2015-12-04 19:29:07Z

Interface

Your function looks more like the .NET Substring function (for which you specify the length) than the Java substring function (for which you specify the exclusive endIndex).

Validation

Calling strlen() anywhere in this function is, in my opinion, not acceptable. If I'm trying to extract the first 5 bytes of a megabyte-long string, why should the operation require the entire string to be traversed?
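
One possible way to validate the range without walking the whole string is a bounded scan with memchr(). The helper names bounded_length and range_in_string are invented for this sketch, and it relies on the C11 wording that memchr() behaves as if it stops at the first match:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of s, but never looks past limit bytes. */
static size_t bounded_length(const char *s, size_t limit)
{
    const char *end = memchr(s, '\0', limit);
    return end ? (size_t)(end - s) : limit;
}

/* Hypothetical check: nonzero when source has at least start + len
 * characters before its terminator. Scans at most start + len bytes. */
static int range_in_string(const char *source, size_t start, size_t len)
{
    if (len > (size_t)-1 - start) {
        return 0; /* start + len would wrap around */
    }
    return bounded_length(source, start + len) == start + len;
}
```

Extracting the first 5 bytes of a megabyte-long string then inspects only 5 bytes of it.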

Polluting standard error is unconventional in a library function like this: no function in the standard C library reports errors that way. Errors should be reported by returning an error code, returning an error code via an out-parameter, or via a global error flag (which is less preferred, but still acceptable). In this case, you could also use assertions instead, since out-of-bounds errors are programmer errors, not user errors.

If you want to return an error code, you should not return -1 as a size_t, because size_t is unsigned.

Alternatively, consider redefining the behaviour such that out-of-bounds access just produces an empty or shorter result, rather than an error. (That's not how the .NET and Java functions behave, though.) Then, you could design the function such that it returns the length of the string that was actually copied (which would ideally be the same as len, but might be shorter if the bounds were wrong).
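
A sketch of that lenient design, written with plain loops so that it never scans past what it copies. The name substring_copy is hypothetical, and the destination is assumed to hold at least len + 1 bytes:

```c
#include <stddef.h>

/* Hypothetical lenient substring that returns the number of characters
 * actually copied. Stops at the terminator, so out-of-range requests
 * produce a shorter or empty result instead of an error. */
static size_t substring_copy(char *destination, const char *source,
                             size_t start, size_t len)
{
    /* Skip start characters, stopping early at the terminator. */
    while (start > 0 && *source != '\0') {
        source++;
        start--;
    }

    size_t copied = 0;
    while (copied < len && source[copied] != '\0') {
        destination[copied] = source[copied];
        copied++;
    }
    destination[copied] = '\0';
    return copied;
}
```

The caller can compare the return value with len to find out whether the request was trimmed.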

Implementation

Instead of memcpy(), the function you want to use is strncpy():

strncpy(destination, string + start, len);
destination[len] = '\0'; /* Important! */

Or, better yet, if stpncpy() is available on your target platform:

char *termination = stpncpy(destination, string + start, len);
*termination = '\0'; /* Important! */
return termination - destination; /* The length of the result */
