I needed to truncate the trailing end of path strings in a routine that builds new search paths as it recursively searches directories. After not finding what I was looking for, I created the function below.
The expected behavior is that `remove_trailing_chars()` updates `in` to remove any occurrences of the chars contained in `rem`, iff they exist contiguously at the very end of the original version of `in`. Once a character that is not included in `rem` becomes the trailing char of `in`, the function updates `in` with the latest version and returns.
It has been tested with several variations of the input char arrays `in` and `rem`, including these:
char in[] = "this is a string with \\ *\\*";//edit this string as needed to test
char rem[] = "\\* ";//edit this string as needed to test
results in "this is a string with" (without a trailing space)
char in[] = "this is a string with *\\*";//edit this string as needed to test
char rem[] = "\\*";//edit this string as needed to test
results in "this is a string with " (including the trailing space)
I am interested in suggestions for efficiency improvements in speed, and readability improvements. (suggestions on more idiomatic methods are welcome.) I do not believe memory should be an issue with this for my usage, but if there are thoughts on any pitfalls in that area, please include them as well.
Here is the code, including one usage case... (Compiler command line and its disassembly are included further down as well.)
#include <stdbool.h>//bool
#include <string.h>//strlen, strcpy
#include <stdlib.h>

//prototypes
void remove_trailing_chars(char *in, const char *rem);

/// demonstrate removing all chars in 'rem' if trailing in 'in'.
int main(void)
{
    char in[] = "this is a string with \\ *\\*";//edit this string as needed to test
    char rem[] = "\\* ";//edit this string as needed to test

    remove_trailing_chars(in, rem);

    return 0;
}

/// remove all occurrences of chars in 'rem' from end of 'in'
void remove_trailing_chars(char *in, const char *rem)
{
    bool found = true;//when false, last char of 'in' found no matches in 'rem'
    int len = strlen(in);
    char in_dup[len+1];
    strcpy(in_dup, in);

    while(found)
    {
        found = false;//for this element of rem
        len = strlen(in_dup);
        int i = 0;
        while(rem[i])
        {
            if(in_dup[len-1] == rem[i])
            {
                in_dup[len - 1] = 0;
                found = true;
                break;
            }
            else
            {
                i++;
            }
        }
    }
    strcpy(in, in_dup);
}
Using GCC, build was done with:
Release target:
mingw32-gcc.exe -Wall -O2 -Wall -std=c99 -g -c C:\tempExtract\remove_trainling_chars\main.c -o obj\Release\main.o
Debug target: (to allow viewing disassembly)
gcc.exe -Wall -g -Wall -std=c99 -g -c C:\tempExtract\remove_trainling_chars\main.c -o obj\Debug\main.o
2 Answers
Accessing array out of bounds
The code here may try to access `in_dup[-1]` in some cases:

len = strlen(in_dup);
int i = 0;
while(rem[i])
{
    if(in_dup[len-1] == rem[i])
       ^^^^^^^^^^^^^

That is, when the input string is empty, or when the entire input string is made of characters in `rem`, `in_dup` will become empty, `len` becomes 0, and `len - 1` will be an illegal index into `in_dup`.
In short, the code is missing a check on reaching the beginning of the input.
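As a minimal sketch of that missing check, the loop below keeps the original `found` structure but stops as soon as the remaining length reaches zero. The function name `remove_trailing_chars_guarded` is hypothetical, chosen only to keep it apart from the code under review, and it assumes both arguments are valid, NUL-terminated strings with `in` writable.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical variant of remove_trailing_chars(): same contract,
 * plus a guard so in[len - 1] is never read when len == 0. */
void remove_trailing_chars_guarded(char *in, const char *rem)
{
    assert(in != NULL && rem != NULL);

    size_t len = strlen(in);
    bool found = true;

    while (found && len > 0) {          /* guard: stop at the beginning of 'in' */
        found = false;
        for (size_t i = 0; rem[i] != '\0'; i++) {
            if (in[len - 1] == rem[i]) {
                in[--len] = '\0';       /* drop the trailing char */
                found = true;
                break;
            }
        }
    }
}
```

With this guard, an empty input or an input made up entirely of characters from `rem` ends as an empty string instead of reading before the start of the buffer.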
Avoid unnecessary copying
The code copies `in` to `in_dup`, works with `in_dup`, then copies back from it to `in`. This is unnecessary; you could work directly with `in`.
Avoid unnecessary computations
`len = strlen(in_dup)` is executed every time a character is removed from the end. This is inefficient, because `strlen` needs to loop over the entire string. Instead, you could count the number of characters removed, and then you'll know exactly where the input string ends.
Simplify algorithm
Consider this simpler algorithm:
- Loop from the end of the input, going backwards, until the beginning
  - Loop over the characters in `rem`, checking whether each matches the last character of the input
    - If there is a match, delete the last character and break out of this inner loop
  - If there is no match, then we're done; break out of the outer loop
Implementation, including the other tips above applied as well:
void remove_trailing_chars(char *in, const char *rem)
{
    int remLength = strlen(rem);
    for (int i = strlen(in) - 1; i >= 0; i--) {
        int j = 0;
        while (j < remLength) {
            if (in[i] == rem[j]) {
                in[i] = '\0';
                break;
            }
            j++;
        }
        if (j == remLength) break;
    }
}
- I appreciate the detail in your explanations, and follow most of them. I am unclear on your use of `strlen` in the `for` expression though, i.e. you had made the point earlier about removing `strlen` from being called repeatedly in a loop. (ryyker, Nov 3, 2021 at 12:49)
- @ryyker the initializer of the loop is only executed once ;-) (janos, Nov 3, 2021 at 13:41)
- Recomputing `strlen(in_dup)` at each iteration of the `while` loop drives the time complexity to quadratic. Better to compute it once, and subtract 1 per iteration.
- I see no reason to copy `in` to `in_dup`, and then back. The second copy assumes that `in` is writable. Better to operate directly on `in`.
- Use the standard library. `strchr` does precisely the same job as the inner loop, and likely does it better.
- A variable like `bool found` is usually a red flag.
All that said, consider
char * end = in + strlen(in);
while (end > in) {
    char ch = *--end;
    if (strchr(rem, ch)) {
        *end = 0;
    } else {
        break;
    }
}
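For reference, here is one way the fragment above could be wrapped into a complete function. This is only a sketch: the name `remove_trailing_chars_strchr` is hypothetical, the `if`/`else` is folded into the loop condition, and both arguments are assumed to be valid, NUL-terminated strings with `in` writable.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical wrapper around the strchr-based loop: walks 'end' back
 * from the terminator while the char just before it occurs in 'rem'. */
void remove_trailing_chars_strchr(char *in, const char *rem)
{
    assert(in != NULL && rem != NULL);

    char *end = in + strlen(in);        /* points at the terminating NUL */

    /* end > in guarantees end[-1] is inside the string and never NUL,
     * so strchr() cannot match rem's own terminator here. */
    while (end > in && strchr(rem, end[-1]) != NULL) {
        *--end = '\0';
    }
}
```

Because `strlen` is called once and `end` is only decremented, the work is proportional to the length of `in` plus the number of trailing characters removed times the length of `rem`.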
- Great list of suggestions. All good except one question: why does `bool found` usually indicate a red flag? By the way, the suggested code segment has one small typo: `(strchr(rem, ch)` -> `(strchr(rem, ch))`. Other than that, very clean. Thank you for reviewing. (ryyker, Nov 3, 2021 at 13:45)