strstr() implementation

Question 1

In this strstr implementation, I am basically skipping the already matched and checked characters with an if else condition. Is this a right way of implementing it?

char *strstrImp(char *input, char *find) {
 if(*find=='0円')
 return input;
 if(*input=='0円')
 return NULL;
 char *start=input;
 char *p;
 while(*start!='0円')
 {
 p=start;
 char *q=find;
 while(*p!='0円' && *q!='0円' && *p==*q)
 {
 p++;
 q++;
 }
 if(*q=='0円')
 {
 return start;
 }
 //update the starting position based on whether any character matched
 if(p == start) //no match
 start++;
 else
 start = p; //skip characters and start from the previously left place
 }
 return NULL;
}
int main()
{
 char * res;
 res = strstrImp("helloworld","low");
 printf("%s",res);
}

Question 2

Are you trying to implement this in C++ specifically? This looks more like C.

Question 3

Right way? It is a way. What do you mean by "Right Way" there is more than one "Right Way". If you are looking for the optimal way to implement this then you spend time building a state transition table first then you can scan the input in O(n) (or one loop). Unfortunately building the state transition table is non trivial code. Thus this is a perfectly acceptable solution.

Question 4

Const-correctness

Since you tagged this question as c++, you must take care of const-correctness:

cr54045.cpp:36:18: warning: conversion from string literal to 'char *' is deprecated [-Wdeprecated-writable-strings]
 res = strstrImp("helloworld","low");
 ^
cr54045.cpp:36:31: warning: conversion from string literal to 'char *' is deprecated [-Wdeprecated-writable-strings]
 res = strstrImp("helloworld","low");
 ^

There should be a

const char *strstrImp(const char *input, const char *find)
{
 return strstrImp(const_cast<char *>(input), find);
}

Incorrect optimization

Try this test case, which compares your implementation against the standard strstr().

int main()
{
 const char *input = "alfalfalfa romeo", *find = "alfalfa ";
 printf("%s\n", strstrImp(input, find)); // (null)
 printf("%s\n", strstr(input, find)); // alfalfa romeo
}

Logic simplification

Since the start = p optimization is bogus, we will only end the loop with start++. The function body would then look like:

if (*find == '0円')
 return input;
if (*input == '0円')
 return NULL;
char *start = input;
while (*start != '0円')
{
 ...
 start++;
}
return NULL;

Next, notice that the if (*input == '0円') return NULL special case is redundant. Also notice that you never refer to input again, so you might as well think of start and input as the same thing. The outline simplifies to:

if (*find == '0円')
 return input;
while (*input != '0円')
{
 ...
 input++;
}
return NULL;

Next, let's turn our attention inside the outer loop.

Since p is only ever used inside the loop, you should pull its declaration inside the loop. p and q should both be const char *.

while (*input != '0円')
{
 const char *p = input, *q = find;
 while (*p != 0 && *q != '0円' && *p == *q)
 {
 p++;
 q++;
 }
 if (*q=='0円')
 {
 return input;
 }
 input++;
}

It turns out that the inner-loop condition can be simplified, because the *p == '0円' test is redundant:

$$\begin{array}{ccc|c|c} \text{*p == *q} & \text{*q != '0円'} & \text{*p != '0円'} & {\begin{array}{l}\text{*p!='0円' &&}\\\text{*q!='0円' &&}\\\text{*p==*q}\end{array}} & \begin{array}{l}\text{*q != '0円' &&}\\\text{*p == *q}\end{array} \\ \hline \\ \begin{array}{c}T\\T\end{array} & \begin{array}{c}T\\T\end{array} & \begin{array}{c}T\\F\end{array} & \begin{array}{c}T\\F\ (impossible)\end{array} & T \\ \hline \\ \begin{array}{c}T\\T\end{array} & \begin{array}{c}F\\F\end{array} & \begin{array}{c}T\\F\end{array} & \begin{array}{c}F\ (impossible)\\F\end{array} & F \\ \hline \\ \begin{array}{c}F\\F\end{array} & \begin{array}{c}T\\T\end{array} & \begin{array}{c}T\\F\end{array} & \begin{array}{c}F\\F\end{array} & F \\ \hline \\ \begin{array}{c}F\\F\end{array} & \begin{array}{c}F\\F\end{array} & \begin{array}{c}T\\F\end{array} & \begin{array}{c}F\\F\end{array} & F \\ \end{array}$$

The function can now look like:

if (*find=='0円')
 return input;
while (*input != '0円')
{
 const char *p, *q;
 for (p = input, q = find; *q != '0円' && *p == *q; p++, q++) {}
 if (*q == '0円')
 {
 return input;
 }
 input++;
}
return NULL;

A well designed function shouldn't have surprising behaviour, and therefore shouldn't have special cases. Can we get rid of if (*find == '0円') return input;?

The behaviour of strstr() is, if find is "", then the function returns input. In order to reach the return input; statement when find == "" and input == "", we have to get rid of both the if (*find == '0円') special case and the while (*input != '0円') guard condition.

char *strstrImp(char *input, const char *find)
{
 do
 {
 const char *p, *q;
 for (p = input, q = find; *q != '0円' && *p == *q; p++, q++) {}
 if (*q == '0円')
 {
 return input;
 }
 } while (*(input++) != '0円');
 return NULL;
}
const char *strstrImp(const char *input, const char *find)
{
 return strstrImp(const_cast<char *>(input), find);
}

Suppose we make that change — then what about the case where input == "" and find is non-empty? It turns out, that works just fine: it will flow through all of the conditions and return NULL! There you go — a compact implementation of strstr() with reasonably simple logic and without special cases.

Question 5

I would argue that also in C const correctness is useful

Question 6

Get rid of the fugly const_cast and you have my vote.

Question 7

@Emily How would you do it instead?

Question 8

What obvious thing am I missing? Why can you just have the single const char *strstrImp(const char *input, const char *find) function? Is it because you want the ability to return a non const char*? Personally I don't mind adding const-ness but removing it has to be well justified. So I would swap the two bodies around and remove the constness on the return value of the first version.

Question 9

@DeadMG There is a very good reason, and it's called "standards conformance". The question was about re-implementing strstr as per the standard en.cppreference.com/w/cpp/string/byte/strstr.

Question 10

It seems to me the code can be quite a bit simpler than what you've shown above. I think I'd prefer to implement roughly the following algorithm:

for each position in haystack
 if needle matches haystack starting at this position
 return current position
return no position

In C++, that might look something like this:

namespace {
 bool match(char const *input1, char const *input2) {
 for (; *input2 && *input1; ++input1, ++input2)
 if (*input1 != *input2)
 return false;
 return true;
 }
}
char *find_sub(char *haystack, char const *needle) {
 for (; *haystack; ++haystack) 
 if (match(haystack, needle))
 return haystack;
 return NULL;
}
char const *find_sub(char const *haystack, char const *needle) {
 return find_sub(const_cast<char *>(haystack), needle);
}

Note that I've renamed the functions to find_str. At least according to the C standard, any and all names starting with str are reserved for the implementation. At least by my reading, strstrImp is probably allowed in C++, but I'd rather use a name that's clearly allowed in both C and C++ if possible.

For what it's worth, a few test cases:

int main() {
 char *input1 = "asddf";
 assert(find_sub(input1, "dd") == input1 + 2);
 char const *input = "asddf";
 assert(find_sub(input, "f") == input + 4);
 assert(find_sub(input, "x") == NULL);
 assert(find_sub(input, "") == input);
}

Much like the version @200_success posted, this handles searching for an empty string without any special-case code. It also handles the overloading pretty much the same way he did--while the const_cast isn't exactly beautiful, it beats the alternative of duplicating the code for find_sub, once for char * and again for char const * input. I did initially implement the overloading as @Loki Astari suggested in a comment, with the non-const version calling the const version instead of vice versa. After some though, I think this is preferable though. Since we're certain that the other overload of find_sub won't attempt to modify either of the subject strings, it's safe to cast away the constness in this case. If we instead add const-ness to the non-const overload, then we have to cast it away again on the result, so we have to write two explicit casts. By going the other direction, the cast on the return value from char * to char const * can happen implicitly, so the code is marginally less ugly.

Also note that I've used NULL to return a null pointer. This allows that function definition to be compatible with both older C++ compilers and C compilers. If you're sure it'll only be used with (reasonably) current C++ compilers, it's probably better to replace that with nullptr instead.

Question 11

The way I learned C strings was to check for end of string with something like

while (*input) {...};

rather than the more verbose

while (*input != '0円') {...};

Both are correct. It is just that one is more verbose, or long winded. Verbosity is not necessarily a virtue. Brevity is not a vice.

I also prefer

for (;;) {...}

over the similar

while (1) {...}

When I learned C, many decades ago, (for many>= 3) some compilers would actually build a constant 1 for while (1) and test that the constant 1 was not 0. Your preferences may differ.

The above implementation with those additional ... preferences would become

char *strstrImp(const char *input, const char *find) {
 for (;*input;input++) {
 const char *p = input, *q = find;
 while (*p && *q && *p++ == *q++) {}
 if (!*q) return (char *)input;
 }
 return NULL;
}

My simple implementation with the above variable names is

char *strstrImp(const char *input, const char *find) {
 const char *p = input, *q = find;
 for (;;) {
 if (!*q) return (char *)input;
 if (!*p) return NULL;
 if (*p++ == *q++) continue;
 p = ++input;
 q = find;
 }
}

The types are all appropriate for the C standard strstr. I like it because it has a single loop rather than the typical two loop approach. The continue test in the middle of the loop allows us to decide whether to progress on the inner loop, or reset the variables for the outer loop. It's not crazy fast with large input or find strings. For small and medium input and find strings, it is about as fast as any other implementation.

Question 12

Already there are great answers for C++. But in case of C, I would like to add one more point: instead of using two loops we can divide the tasks into functions to make it more readable. In response to the comments the whole point or observation is made for readability purpose.

strstrImp -> find/locate substring in string and returns a pointer points to the first character of the found substr in str otherwise a null pointer.

char const *strstrImp (const char *str, const char *substr)
{
 size_t len=strlen(substr);
 for (int i = 0; str[i] != '0円'; i++)
 {
 if (strncmp((str + i), substr,len) == 0)
 {
 return (char *)(str + i);
 }
 }
 return NULL;
}

200_success 200_success 145k22 gold badges190 silver badges478 bronze badges · Accepted Answer · 2014-06-12 07:53:45Z

Const-correctness

Since you tagged this question as c++, you must take care of const-correctness:

cr54045.cpp:36:18: warning: conversion from string literal to 'char *' is deprecated [-Wdeprecated-writable-strings]
 res = strstrImp("helloworld","low");
 ^
cr54045.cpp:36:31: warning: conversion from string literal to 'char *' is deprecated [-Wdeprecated-writable-strings]
 res = strstrImp("helloworld","low");
 ^

There should be a

const char *strstrImp(const char *input, const char *find)
{
 return strstrImp(const_cast<char *>(input), find);
}

Incorrect optimization

Try this test case, which compares your implementation against the standard strstr().

int main()
{
 const char *input = "alfalfalfa romeo", *find = "alfalfa ";
 printf("%s\n", strstrImp(input, find)); // (null)
 printf("%s\n", strstr(input, find)); // alfalfa romeo
}

Logic simplification

Since the start = p optimization is bogus, we will only end the loop with start++. The function body would then look like:

if (*find == '0円')
 return input;
if (*input == '0円')
 return NULL;
char *start = input;
while (*start != '0円')
{
 ...
 start++;
}
return NULL;

Next, notice that the if (*input == '0円') return NULL special case is redundant. Also notice that you never refer to input again, so you might as well think of start and input as the same thing. The outline simplifies to:

if (*find == '0円')
 return input;
while (*input != '0円')
{
 ...
 input++;
}
return NULL;

Next, let's turn our attention inside the outer loop.

Since p is only ever used inside the loop, you should pull its declaration inside the loop. p and q should both be const char *.

while (*input != '0円')
{
 const char *p = input, *q = find;
 while (*p != 0 && *q != '0円' && *p == *q)
 {
 p++;
 q++;
 }
 if (*q=='0円')
 {
 return input;
 }
 input++;
}

It turns out that the inner-loop condition can be simplified, because the *p == '0円' test is redundant:

$$\begin{array}{ccc|c|c} \text{*p == *q} & \text{*q != '0円'} & \text{*p != '0円'} & {\begin{array}{l}\text{*p!='0円' &&}\\\text{*q!='0円' &&}\\\text{*p==*q}\end{array}} & \begin{array}{l}\text{*q != '0円' &&}\\\text{*p == *q}\end{array} \\ \hline \\ \begin{array}{c}T\\T\end{array} & \begin{array}{c}T\\T\end{array} & \begin{array}{c}T\\F\end{array} & \begin{array}{c}T\\F\ (impossible)\end{array} & T \\ \hline \\ \begin{array}{c}T\\T\end{array} & \begin{array}{c}F\\F\end{array} & \begin{array}{c}T\\F\end{array} & \begin{array}{c}F\ (impossible)\\F\end{array} & F \\ \hline \\ \begin{array}{c}F\\F\end{array} & \begin{array}{c}T\\T\end{array} & \begin{array}{c}T\\F\end{array} & \begin{array}{c}F\\F\end{array} & F \\ \hline \\ \begin{array}{c}F\\F\end{array} & \begin{array}{c}F\\F\end{array} & \begin{array}{c}T\\F\end{array} & \begin{array}{c}F\\F\end{array} & F \\ \end{array}$$

The function can now look like:

if (*find=='0円')
 return input;
while (*input != '0円')
{
 const char *p, *q;
 for (p = input, q = find; *q != '0円' && *p == *q; p++, q++) {}
 if (*q == '0円')
 {
 return input;
 }
 input++;
}
return NULL;

A well designed function shouldn't have surprising behaviour, and therefore shouldn't have special cases. Can we get rid of if (*find == '0円') return input;?

The behaviour of strstr() is, if find is "", then the function returns input. In order to reach the return input; statement when find == "" and input == "", we have to get rid of both the if (*find == '0円') special case and the while (*input != '0円') guard condition.

char *strstrImp(char *input, const char *find)
{
 do
 {
 const char *p, *q;
 for (p = input, q = find; *q != '0円' && *p == *q; p++, q++) {}
 if (*q == '0円')
 {
 return input;
 }
 } while (*(input++) != '0円');
 return NULL;
}
const char *strstrImp(const char *input, const char *find)
{
 return strstrImp(const_cast<char *>(input), find);
}

Suppose we make that change — then what about the case where input == "" and find is non-empty? It turns out, that works just fine: it will flow through all of the conditions and return NULL! There you go — a compact implementation of strstr() with reasonably simple logic and without special cases.

What obvious thing am I missing? Why can you just have the single const char *strstrImp(const char *input, const char *find) function? Is it because you want the ability to return a non const char*? Personally I don't mind adding const-ness but removing it has to be well justified. So I would swap the two bodies around and remove the constness on the return value of the first version.
@DeadMG There is a very good reason, and it's called "standards conformance". The question was about re-implementing strstr as per the standard en.cppreference.com/w/cpp/string/byte/strstr.

Stack Exchange Network

strstr() implementation

4 Answers 4

Const-correctness

Incorrect optimization

Logic simplification

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

strstr() implementation

4 Answers 4

Const-correctness

Incorrect optimization

Logic simplification

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions