Find the first occurrence of a substring in a string

Question 1

This is a simple version from the function strstr. It returns the address of the first occurrence of the substring s2 in s1. I want to know possible problems in the code, and how to improve it, in general.

#include <stdio.h>
/* Version of strstr, which returns the adress of
 the first occurrence of s2 in s1, else, NULL
*/
char *strs(char *s1, char* s2)
{
 char *ptr = NULL;
 int i, n;
 for (i = 0, n = 0; s1[i] != '0円'; i++)
 {
 if (s1[i] == s2[n])
 {
 if (n == 0)
 ptr = &(s1[i]);
 if (s2[n + 1] == '0円')
 return ptr;
 ++n;
 }
 else
 n = 0;
 }
 return NULL;
}
int main(void)
{
 char s1[] = "I'm waiting in my cold cell\nWhen the bell begins to chime\n";
 char s2[] = "waiting";
 char *ptr = strs(s1, s2);
 if (ptr == NULL)
 {
 printf("There is no occurrence of \"%s\" in the first string!\n",
 s2);
 }
 else
 {
 printf("%s\n", ptr);
 }
}

Question 2

Comments about your code:

I believe that you would rename it to something like strstr if you could.

You should accept pointers to const char, since you don't modify them.

You should initialize i and n right inside the loop. It will narrow their scope and make the code more maintainable since it won't pollute the scope of the function. On top of that, they will be deallocated whenever the loop ends.

using int for index is a very bad idea. The biggest issues with it that it is signed and not guaranteed to be large enough. You should use size_t instead, because it is guaranteed that size_t will have enough capacity to index array of maximum possible size.

You're increasing overhead by using subscript ([]) operator.

Loop could be rewritten in other, more readable and straightforward form, one of which is shown in suggested implementation. I know that you would like to have high performance, but I don't think that it is that important in this situation. May be with more context of the usage of the algorithm I could suggest particular algorithm (there are quite a few of them). The best in my opinion is Boyer–Moore string search algorithm.

Suggested implementation:

I think that strncmp should be used. I know that including string.h header defeats the purpose, but you could write your own. It increases code reuse in C!

To make it work we just need to precalculate length of the substring.

const char *strs(const char *text, const char *substr)
{
 size_t substrlen = strlen(substr);
 while (*text != 0)
 {
 if (strncmp(text, substr, substrlen) == 0)
 {
 return text;
 }
 ++text;
 }
 return NULL;
}

The performance gain from the data being const depends on physical constness and the compiler. Further reading.

About performance gain..

Question 3

Thanks for the answer! I really considered using strncmp, but it just felt like cheating. By the way, should I use const char in the pointer because makes the code more readable, or because makes it faster? (or both)

Question 4

First reason + it prevents from occasional modifying string. The semantic of the function is that the function doesn't modify the string, so it will make it easier for people to understand that it for sure (not considering casts) won't modify the string. About the performance, it depends on the data being physically const. Unfortunately I don't have enough knowledge about it. Probably another review will be able to answer the question. I will upvote it so people could see it.

Question 5

@LúcioCardoso, updated the answer with relevant SO posts. Have a nice read.

Question 6

"The algorithm can be written to use less memory and possibly incur less overhead." can you provide an explanation to make this actually meaningful?

Question 7

@JeroenVannevel, all the performance, both space and time, optimizations listed below that sentence are provided to justify it. I wanted to stress on possibly because I don't want to make overstatement since I didn't make any measurements. Suggested implementation probably uses less memory, since size_t is usually equal or greater than int. There are more ways to perform optimizations, so I think making too qualified statement would be inaccurate

Incomputable Incomputable 9,7143 gold badges34 silver badges73 bronze badges · Accepted Answer · 2016-08-07 17:56:46Z

Comments about your code:

I believe that you would rename it to something like strstr if you could.

You should accept pointers to const char, since you don't modify them.

You should initialize i and n right inside the loop. It will narrow their scope and make the code more maintainable since it won't pollute the scope of the function. On top of that, they will be deallocated whenever the loop ends.

using int for index is a very bad idea. The biggest issues with it that it is signed and not guaranteed to be large enough. You should use size_t instead, because it is guaranteed that size_t will have enough capacity to index array of maximum possible size.

You're increasing overhead by using subscript ([]) operator.

Loop could be rewritten in other, more readable and straightforward form, one of which is shown in suggested implementation. I know that you would like to have high performance, but I don't think that it is that important in this situation. May be with more context of the usage of the algorithm I could suggest particular algorithm (there are quite a few of them). The best in my opinion is Boyer–Moore string search algorithm.

Suggested implementation:

I think that strncmp should be used. I know that including string.h header defeats the purpose, but you could write your own. It increases code reuse in C!

To make it work we just need to precalculate length of the substring.

const char *strs(const char *text, const char *substr)
{
 size_t substrlen = strlen(substr);
 while (*text != 0)
 {
 if (strncmp(text, substr, substrlen) == 0)
 {
 return text;
 }
 ++text;
 }
 return NULL;
}

The performance gain from the data being const depends on physical constness and the compiler. Further reading.

About performance gain..

Thanks for the answer! I really considered using strncmp, but it just felt like cheating. By the way, should I use const char in the pointer because makes the code more readable, or because makes it faster? (or both)
First reason + it prevents from occasional modifying string. The semantic of the function is that the function doesn't modify the string, so it will make it easier for people to understand that it for sure (not considering casts) won't modify the string. About the performance, it depends on the data being physically const. Unfortunately I don't have enough knowledge about it. Probably another review will be able to answer the question. I will upvote it so people could see it.
@LúcioCardoso, updated the answer with relevant SO posts. Have a nice read.
"The algorithm can be written to use less memory and possibly incur less overhead." can you provide an explanation to make this actually meaningful?
@JeroenVannevel, all the performance, both space and time, optimizations listed below that sentence are provided to justify it. I wanted to stress on possibly because I don't want to make overstatement since I didn't make any measurements. Suggested implementation probably uses less memory, since size_t is usually equal or greater than int. There are more ways to perform optimizations, so I think making too qualified statement would be inaccurate

Stack Exchange Network

Find the first occurrence of a substring in a string

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Find the first occurrence of a substring in a string

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions