Replace n or all occurrences of substring in C

Question 1

I wrote this code as an answer to a question. But I'd like you to have a look at it. This post is basically a copy of the answer I posted.

The code does no error checking. It assumes that the output buffer exists and is big enough and that src, orig and new are valid strings. The headers needed:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stddef.h>
#include <stdint.h>

The main replace function

// Replace the first occurrence of orig in src with new and write result to dest
// Return pointer to first character after the end of first match in src and
// NULL if no match
const char *replace(char *dest, const char *src,
 const char *orig, const char *new) {
 char *ptr = strstr(src, orig); // First match
 // If no match, we want dest to contain a copy of src
 if(!ptr) {
 strcpy(dest, src);
 return NULL;
 }
 const ptrdiff_t offset = ptr - src; // Calculate offset
 const int origlen = strlen(orig);
 
 strncpy(dest, src, offset); // Copy everything before match
 strcpy(&dest[offset], new); // Copy replacement
 strcpy(&dest[offset + strlen(new)], &src[offset + origlen]); // Copy rest
 return src + offset + origlen;
}

Then we just add a function to handle N occurrences

// Replace maximum n occurrences. Stops when no more matches.
// Returns number of replacements
size_t replaceN(char *dest, const char *src, const char *orig, 
 const char *new, size_t n) {
 size_t ret = 0;
 // Maybe an unnecessary optimization to avoid multiple calls in 
 // loop, but it also adds clarity
 const int newlen = strlen(new);
 const int origlen = strlen(orig);
 
 do {
 const char *ptr = replace(dest, src, orig, new); // Replace
 if(!ptr) return ret; // Quit if no more matches
 // Length of the part of src before first match
 const ptrdiff_t offset = ptr - src;
 
 src = ptr; // Move src past what we have already copied.
 
 ret++;
 dest += offset - origlen + newlen; // Advance pointer to dest to the end
 } while(n > ret);
 return ret;
}

And a simple wrapper for all occurrences. Note that it's safe to use SIZE_MAX here, because there will never be more occurrences than that.

// Replace all. Returns the number of replacements, because why not?
size_t replaceAll(char *dest, const char *src, const char *orig, const char *new) {
 return replaceN(dest, src, orig, new, SIZE_MAX); 
}

It's easy to write a wrapper for the allocation. We borrow some code from https://stackoverflow.com/q/9052490/6699433

size_t countOccurrences(const char *str, const char *substr) {
 size_t count = 0;
 while((str = strstr(str, substr)) != NULL) {
 count++;
 str++; // We're standing at the match, so we need to advance
 }
 return count;
}

Then some code to calculate size and allocate a buffer

// Allocate a buffer big enough to hold src with n replacements 
// of orig to new
char *allocateBuffer(const char *src, const char *orig, 
 const char *new, size_t n) {
 return malloc(strlen(src) + 
 n * (strlen(new) - strlen(orig)) +
 1 // Remember the zero terminator
 );
}

And the two final functions

// Allocates a buffer and replaces max n occurrences of orig with new and
// writes it to the allocated buffer.
// Returns the buffer and NULL if allocation failed
char *replaceNAndAllocate(const char *src, const char *orig, 
 const char *new, size_t n) {
 const size_t count = countOccurrences(src, orig);
 n = n < count ? n : count; // Min of n and count
 char *buf = allocateBuffer(src, orig, new, n);
 if(!buf) return NULL;
 replaceN(buf, src, orig, new, n);
 return buf;
}
// Allocates a buffer and replaces all occurrences of orig with new and
// writes it to the allocated buffer.
// Returns the buffer and NULL if allocation failed
char *replaceAllAndAllocate(const char *src, const char *orig, const char *new) {
 return replaceNAndAllocate(src, orig, new, SIZE_MAX);
}

And finally, a simple main with a test with multiple occurrences and with original string being a substring of replacement string:

int main(void) {
 char src[] = "!!!asdf!!!asdf!!!asdf!!!asdf!!!";
 char orig[] = "asdf";
 char new[] = "asdfaaaaaaaaasdf";
 
 puts(replaceAllAndAllocate(src, orig, new));
}

No warnings with -Wall -Wextra -pedantic and the output is:

$ ./a.out 
!!!asdfaaaaaaaaasdf!!!asdfaaaaaaaaasdf!!!asdfaaaaaaaaasdf!!!asdfaaaaaaaaasdf!!!

Question 2

The interface

There are a number of useful things to return. Let's list them in order of which is least surprising:
- The pointer to the start of the destination. This follows the standard library, but is least useful.
- The pointer to the end of the destination. This follows an alternative convention embodies in the stpcpy() function and similar added to POSIX in 2008.
- Number of characters produced / to produce. Generally only used if the sink can accept any number of characters, or the destinations size is also passed. countOccurrences() seems your flawed attempt to begin implementing the alternative approach to sizing the buffer.
- Return a count of replacements done. Degenerates to boolean if maximum number of replacements is 1.
- Try to combine bool (not count) with one of the other options. strstr() does it, arguable returning a pointer to the terminator might have been better.
None of those options reflects your choice.
Provide some way to properly size the destination buffer. As a bonus, replaceNAndAllocate() becomes fairly trivial.

The algorithm

replaceN()'s use of replace() results in a standard Shlemiel The Painter-algorithm. It should be linear, but ends up quadratic.
countOccurrences() can fail if some proper suffix (neither empty nor the full string) is a prefix of the needle, specifically if the resultant string is found. Skip over the full match found, not the first character.

Implementation

Yes, strncpy() (which is not actually a string-function) will behave identically to memcpy() if the first n - 1 bytes are non-null. It is still inefficient and false advertisement though.
&pointer[offset_calculation] might not be longer than pointer + offset_calculation, but it is a bit less idiomatic. You decide whether it should be changed.
replaceN() should be the one to implement directly, replace() being a convenience-function. The overhead is trivial for the latter, and fixes the current problem with its runtime.
Most of the time, you use pointer directly, but sometimes you use pointer != NULL. Fix the inconsistency, preferably in favor of the former, which is more idiomatic and shorter.
countOccurrences() and allocateBuffer() seem to be internal helpers. Mark them static so they are not exported.
Your test-program fails to free the buffer replaceAllAndAllocate() allocates.

Question 3

I'm pretty sure we're not using Shlemiel's algorithm, because src gets advanced within the loop (that being the reason the worker function returns the end of the first match). However, if you missed that, it says something about the clarity of that part of the code!

Question 4

Oh, actually I see that while the next search starts from the right place, we repeatedly copy the string tail, and that's the bit that's quadratic. Good catch!

Question 5

Very good points. I'm going to write a more efficient variant. This was mostly intended to be clear code with most focus on reusing functions. I wrote replace with the intention of using it in replaceN and that's why I chose the return value that makes more sense in that regard. And it can still be used as a boolean. But I'm not sure what you mean by "Provide some way to properly size the destination buffer." I have a function for that, which is allocateBuffer

Question 6

Algorithm 1) Yes, I agree. I'll write another more efficient variant. 2) Thanks for bug report.

Question 7

Implementation 1) Will fix. Makes more sense. 2) I'll think about it. I often mix them, it it depends on if I focus on the array or the pointer. 3) You certainly have a point here. replace is not needed. 4) I needed to put the exp in parenthesis to suppress warnings. That's why I added != NULL. Because then it does not look like you have too many parenthesis. Yes, I know it does not make sense. 5) Maybe 6) Did not care about that. It's not really a part of the whole. Basically just added it so others can have a quick test case.

Question 8

First impression is pretty good - care taken with exception conditions and thought to the interface.

Headers: I find it helpful to have a consistent order for headers, to quickly see if the header I need is already included. For this, I group the Standard Library headers in alphabetical order. You might find that helpful too.

char *ptr = strstr(src, orig); // First match

I'd write const *ptr there, to avoid accidents.

strncpy(dest, src, offset); // Copy everything before match

Probably memcpy is more appropriate here, given we know that src doesn't end before src + offset.

strcpy(&dest[offset], new); // Copy replacement
strcpy(&dest[offset + strlen(new)], &src[offset + origlen]); // Copy rest

It's strange to index an array, only to extract a pointer. I'd write those as:

strcpy(dest + offset, new); // Copy replacement
strcpy(dest + offset + strlen(new), src + offset + origlen); // Copy rest

Arguably, we could measure strlen(new) first, then use memcpy for the first of those.

If countOccurrences() and allocateBuffer() aren't intended to be part of the public interface, declare them with internal linkage (static).

The test program is an optimist. If you've taken the care to safely return a null pointer from the function, the caller ought to have the decency to set a good example by testing the value rather than blindly passing it on!

Question 9

I group the Standard Library headers like I do my own headers: in increasing order of specificity. Although that is, admittedly, a bit subjective. But I find this easier to understand than alphabetical order. If I want to know whether a specific header is already included, I'm almost certainly going to just search for it to ensure an exact match.

Question 10

I updated the original post now stackoverflow.com/a/68237099/6699433

Deduplicator Deduplicator 19.6k1 gold badge32 silver badges65 bronze badges · Answer 1 · 2021-07-06 01:34:39Z

The interface

There are a number of useful things to return. Let's list them in order of which is least surprising:
- The pointer to the start of the destination. This follows the standard library, but is least useful.
- The pointer to the end of the destination. This follows an alternative convention embodies in the stpcpy() function and similar added to POSIX in 2008.
- Number of characters produced / to produce. Generally only used if the sink can accept any number of characters, or the destinations size is also passed. countOccurrences() seems your flawed attempt to begin implementing the alternative approach to sizing the buffer.
- Return a count of replacements done. Degenerates to boolean if maximum number of replacements is 1.
- Try to combine bool (not count) with one of the other options. strstr() does it, arguable returning a pointer to the terminator might have been better.
None of those options reflects your choice.
Provide some way to properly size the destination buffer. As a bonus, replaceNAndAllocate() becomes fairly trivial.

The algorithm

replaceN()'s use of replace() results in a standard Shlemiel The Painter-algorithm. It should be linear, but ends up quadratic.
countOccurrences() can fail if some proper suffix (neither empty nor the full string) is a prefix of the needle, specifically if the resultant string is found. Skip over the full match found, not the first character.

Implementation

Yes, strncpy() (which is not actually a string-function) will behave identically to memcpy() if the first n - 1 bytes are non-null. It is still inefficient and false advertisement though.
&pointer[offset_calculation] might not be longer than pointer + offset_calculation, but it is a bit less idiomatic. You decide whether it should be changed.
replaceN() should be the one to implement directly, replace() being a convenience-function. The overhead is trivial for the latter, and fixes the current problem with its runtime.
Most of the time, you use pointer directly, but sometimes you use pointer != NULL. Fix the inconsistency, preferably in favor of the former, which is more idiomatic and shorter.
countOccurrences() and allocateBuffer() seem to be internal helpers. Mark them static so they are not exported.
Your test-program fails to free the buffer replaceAllAndAllocate() allocates.

I'm pretty sure we're not using Shlemiel's algorithm, because src gets advanced within the loop (that being the reason the worker function returns the end of the first match). However, if you missed that, it says something about the clarity of that part of the code!
Oh, actually I see that while the next search starts from the right place, we repeatedly copy the string tail, and that's the bit that's quadratic. Good catch!
Very good points. I'm going to write a more efficient variant. This was mostly intended to be clear code with most focus on reusing functions. I wrote replace with the intention of using it in replaceN and that's why I chose the return value that makes more sense in that regard. And it can still be used as a boolean. But I'm not sure what you mean by "Provide some way to properly size the destination buffer." I have a function for that, which is allocateBuffer
Algorithm 1) Yes, I agree. I'll write another more efficient variant. 2) Thanks for bug report.
Implementation 1) Will fix. Makes more sense. 2) I'll think about it. I often mix them, it it depends on if I focus on the array or the pointer. 3) You certainly have a point here. replace is not needed. 4) I needed to put the exp in parenthesis to suppress warnings. That's why I added != NULL. Because then it does not look like you have too many parenthesis. Yes, I know it does not make sense. 5) Maybe 6) Did not care about that. It's not really a part of the whole. Basically just added it so others can have a quick test case.

Toby Speight Toby Speight 87.2k14 gold badges104 silver badges322 bronze badges · Answer 2 · 2021-07-05 20:15:25Z

First impression is pretty good - care taken with exception conditions and thought to the interface.

Headers: I find it helpful to have a consistent order for headers, to quickly see if the header I need is already included. For this, I group the Standard Library headers in alphabetical order. You might find that helpful too.

char *ptr = strstr(src, orig); // First match

I'd write const *ptr there, to avoid accidents.

strncpy(dest, src, offset); // Copy everything before match

Probably memcpy is more appropriate here, given we know that src doesn't end before src + offset.

strcpy(&dest[offset], new); // Copy replacement
strcpy(&dest[offset + strlen(new)], &src[offset + origlen]); // Copy rest

It's strange to index an array, only to extract a pointer. I'd write those as:

strcpy(dest + offset, new); // Copy replacement
strcpy(dest + offset + strlen(new), src + offset + origlen); // Copy rest

Arguably, we could measure strlen(new) first, then use memcpy for the first of those.

If countOccurrences() and allocateBuffer() aren't intended to be part of the public interface, declare them with internal linkage (static).

The test program is an optimist. If you've taken the care to safely return a null pointer from the function, the caller ought to have the decency to set a good example by testing the value rather than blindly passing it on!

I group the Standard Library headers like I do my own headers: in increasing order of specificity. Although that is, admittedly, a bit subjective. But I find this easier to understand than alphabetical order. If I want to know whether a specific header is already included, I'm almost certainly going to just search for it to ensure an exact match.
I updated the original post now stackoverflow.com/a/68237099/6699433

Stack Exchange Network

Replace n or all occurrences of substring in C

2 Answers 2

The interface

The algorithm

Implementation

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Replace n or all occurrences of substring in C

2 Answers 2

The interface

The algorithm

Implementation

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions