Find substrings in string

Question 1

I'm self-learning C now, and I want to find out if my code is good and clean or kinda messy. For this exercise, the program tells if the text which was inserted has the keyword "asd", without using functions from string.h. If so, it returns how many times it was found. it's not a big program, but I want to make sure it's clean and good so I don't pick up any bad practices.

#include "stdafx.h"
#include <stdio.h>
char findInStr(char *str, char *find) {
 unsigned char howMany = 0, strCounter = 0, findCounter = 0;
 do {
 if (findCounter == 3)
 howMany++;
 if (str[strCounter] == find[findCounter])
 findCounter++;
 else
 findCounter = 0;
 strCounter++;
 } while (str[strCounter-1] != '0円');
 return howMany;
}
int main() {
 char str[60];
 char find[5] = "asd";
 char found;
 printf("enter text: ");
 gets_s(str, 60);
 found = findInStr(str, find);
 if (found)
 printf("found!!! %d\n times", found);
 else
 printf("not found\n");
 return 0;
}

Question 2

The problem statement

There's an ambiguity in the requirements, because you don't specify how to handle overlapping substrings. For example, should findInStr("ababa", "aba") return 1, or 2? Either of those may be useful, but be clear which interpretation is needed. This code appears to count non-overlapping substrings.

Non-standard header

Despite its name, stdafx.h is not a standard header, so you can't use it in portable code. Thankfully this program can easily become standard C, if you replace gets_s() with the equivalent fgets() call:

if (!fgets(str, sizeof str, stdin)) {
 perror("gets");
 return 1;
}

Note: this is not quite equivalent (thanks to Martin R for the comment), as get_s() discards the final newline; that won't make a difference unless we specifically look for a string ending in newline (which could now match once, rather than never).

Naming

I like the variable names you've used - I found it easy to understand your algorithm with the clear names. (They do need to be a more suitable type - see Martin's answer for a very clear explanation).

I think they would be even clearer if you give them a line each for their declarations.

Bug (or undocumented constraint)

This test assumes that find will always be three characters long:

 if (findCounter == 3)
 howMany++;

We can make more versatile by counting until we see the NUL character at the end of find, like this:

 if (find[findCounter] == '0円') {
 howMany++;
 findCounter = 0;
 }

Bug

If a character is repeated in find, we don't start looking again in the right place. We can test this with findInStr("ababaca", "abac") - this doesn't find the string because when we reach the second b, that doesn't match c and we start again from there (instead of going back to the start of the potential match).

The fix for this to subtract findCounter from strCounter before we reset it to zero, so we resume searching from the right place:

 if (str[strCounter] == find[findCounter]) {
 findCounter++;
 } else {
 strCounter -= findCounter;
 findCounter = 0;
 }

Modified version

Keeping your algorithm, but incorporating the fixes from this and other answers, I get:

#include <stdio.h>
/* Count the number of non-overlapping occurrences of find in str.
 * Returns zero if either string is empty.
 */
size_t findInStr(const char *str, const char *find)
{
 size_t howMany = 0;
 size_t strCounter = 0;
 size_t findCounter = 0;
 do {
 if (find[findCounter] == '0円') {
 ++howMany;
 findCounter = 0;
 }
 if (str[strCounter] == find[findCounter]) {
 ++findCounter;
 } else {
 strCounter -= findCounter;
 findCounter = 0;
 }
 } while (str[strCounter++] != '0円');
 return howMany;
}

I changed the test program to accept command-line arguments; this makes for easier testing. I hope that this is educational reading (note that I check the arguments before attempting to use them).

int main(int argc, char **argv)
{
 if (argc != 3) {
 fprintf(stderr, "Usage: %s haystack needle\n", argv[0]);
 return 1;
 }
 const char *const haystack = argv[1];
 const char *const needle = argv[2];
 const size_t found = findInStr(haystack, needle);
 switch (found) {
 case 0:
 printf("'%s' was not found in '%s'.\n", needle, haystack);
 break;
 case 1:
 printf("'%s' was found once in '%s'.\n", needle, haystack);
 break;
 default:
 printf("'%s' was found %zd times in '%s'.\n", needle, found, haystack);
 break;
 }
 return 0;
}

Pointer version

This is a bit more advanced, and harder to get right, but I'll show you what this looks like when converted to use pointers rather than indexes. A good optimizing compiler will likely generate the same code for both, but you'll want to be able to study and work with pointer code when you come across it.

size_t findInStr(const char *str, const char *const find)
{
 size_t howMany = 0;
 for (const char *find_pos = find; *str; ++str) {
 if (*str == *find_pos) {
 /* matched this char */
 if (!*++find_pos) {
 /* reached the end of find; count it and reset */
 ++howMany;
 find_pos = find;
 }
 } else {
 /* not matched - back up and continue */
 str -= find_pos - find;
 find_pos = find;
 }
 }
 return howMany;
}

Here, I've also used simply *str rather than the longer, but equivalent *str != '0円'.

Self-test program

An alternative main() can run a test suite, which can be useful in to avoid introducing bugs as you modify the code. Here's what I accumulated when writing the above:

/* return failure count (0 or 1) */
static int test(const char *haystack, const char *const needle, size_t expected)
{
 const size_t actual = findInStr(haystack, needle);
 if (actual == expected) {
 return 0;
 }
 /* incorrect result */
 fprintf(stderr,
 "FAIL: findInStr(\"%s\", \"%s\") returned %zd instead of %zd\n",
 haystack, needle, actual, expected);
 return 1;
}
int main(void)
{
 return
 + test("abc", "", 0)
 + test("abc", "d", 0)
 + test("abc", "cd", 0)
 + test("abc", "ac", 0)
 + test("abc", "ab", 1)
 + test("ab", "abc", 0)
 + test("abab", "ab", 2)
 + test("ababa", "aba", 1)
 + test("ababac", "bac", 1)
 ;
}

You might choose to use a preprocessor #ifdef/#else to be able to compile the test or interactive version from one source, or you might write two programs sharing a single findInStr() implementation.

Question 3

Goode catch about not finding strings with repeated characters! – Minor remark: gets_s (from the Microsoft C runtime) replaces the newline character with a 0円 character, so it is not completely equivalent to fgets()

Question 4

Thanks @Martin - I've edited to incorporate your observation. (Not having a Microsoft platform, I don't have the man pages for their functions, so just guessed based on the name).

Question 5

I simply looked it up in the MSDN online documentation: msdn.microsoft.com/en-us/library/5b5x9wc7.aspx :)

Question 6

The return type

Don't use char as the return type in

char findInStr(char *str, char *find);

On many (and on all POSIX compliant) platforms, char is a 8-bit integer which may be unsigned or signed, so the maximal return value would be 255 or 127.

Another problem is that you count the number of occurrences in an unsigned char, which means that – for example – a count of 200 would be returned as -56 if the character type is (8 bit) signed.

A better choice would be int (which is guaranteed to have at least 16 bits) or long (at least 32 bits). To be completely on the safe side, use size_t, which is a type than can hold the size of any object.

Const parameters

Since your function does not modify the passed strings, it is a good habit to declare them as constant:

size_t findInStr(const char *str, const char *find);

The compiler can then check that you don't (inadvertently) modify the pointed-to memory, and may be able to do further optimizations on the calling side.

It also documents that the function does not modify the strings.

Array sizes

In

char find[5] = "asd";

the array is one element too large. That does no harm, but there is a risk of forgetting to change the array size if the string on the right-hand side is modified. Better let the compiler determine the size automatically:

char find[] = "asd";

And here

char str[60];
// ...
gets_s(str, 60);

the array size is specified twice, which bears the risk of changing it at one place later but not at the other place. That is avoided with

char str[60];
// ...
gets_s(str, sizeof(str));

Question 7

Bugs

You don't reset findCounter properly, resulting in this miscount:

enter text: asdasd
found!!! 1
 times

Your function tests for findCounter == 3, but accepts a find parameter of any length.

Style

Never omit the "optional" braces for your if statements, especially if the statement spans more than one line. You will eventually contribute to a coding accident, and it will be your fault.

Your loop is awkward, especially the -1 in ... while (str[strCounter-1] != '0円'). A for loop would be more idiomatic and readable:

for (int strCounter = 0; str[strCounter] != '0円'; strCounter++) {
 if (str[strCounter] == find[findCounter]) {
 findCounter++;
 if (findCounter == findLen) {
 findCounter = 0;
 howMany++;
 }
 } else {
 findCounter = 0;
 }
}

Avoid using unsigned integers; they cause more trouble than they are worth.

Question 8

thanks, agree with everything except the -1 problem. if you omit it, the function won't recognize the asd if it's written at the end. unless you have another idea?

Question 9

I've been wondering about signed VS unsigned for a while now and it seems to be a very heated debate.

Question 10

You need to do -1 only because you incremented strCounter prematurely. If you had used the for loop as I suggested, or if you had used ... while (str[strCounter++] != '0円'), then it would have also worked, but less awkwardly.

Question 11

Disagreed on the for loop brackets: it's a basic language feature that everyone should get used to. Also disagreed on the unsigned int advice. Using unsigned ints will make you think properly about what should and shouldn't be unsigned, and put asserts and static casts in the correct places, which means you catch errors as soon as possible, not as late as possible, like that article suggests (while messing around with memset of all things).

Question 12

@user673679, I'm with you on the unsigned types, but with 200_success on the use of braces (but only since the notorious Apple goto fail bug), despite mitigations such as gcc -Wmisleading-indentation). Just shows that these two observations are somewhat opinionated rather than absolute.

Toby Speight Toby Speight 87.1k14 gold badges104 silver badges322 bronze badges · Accepted Answer · 2018-03-09 10:32:19Z

The problem statement

There's an ambiguity in the requirements, because you don't specify how to handle overlapping substrings. For example, should findInStr("ababa", "aba") return 1, or 2? Either of those may be useful, but be clear which interpretation is needed. This code appears to count non-overlapping substrings.

Non-standard header

Despite its name, stdafx.h is not a standard header, so you can't use it in portable code. Thankfully this program can easily become standard C, if you replace gets_s() with the equivalent fgets() call:

if (!fgets(str, sizeof str, stdin)) {
 perror("gets");
 return 1;
}

Note: this is not quite equivalent (thanks to Martin R for the comment), as get_s() discards the final newline; that won't make a difference unless we specifically look for a string ending in newline (which could now match once, rather than never).

Naming

I like the variable names you've used - I found it easy to understand your algorithm with the clear names. (They do need to be a more suitable type - see Martin's answer for a very clear explanation).

I think they would be even clearer if you give them a line each for their declarations.

Bug (or undocumented constraint)

This test assumes that find will always be three characters long:

 if (findCounter == 3)
 howMany++;

We can make more versatile by counting until we see the NUL character at the end of find, like this:

 if (find[findCounter] == '0円') {
 howMany++;
 findCounter = 0;
 }

Bug

If a character is repeated in find, we don't start looking again in the right place. We can test this with findInStr("ababaca", "abac") - this doesn't find the string because when we reach the second b, that doesn't match c and we start again from there (instead of going back to the start of the potential match).

The fix for this to subtract findCounter from strCounter before we reset it to zero, so we resume searching from the right place:

 if (str[strCounter] == find[findCounter]) {
 findCounter++;
 } else {
 strCounter -= findCounter;
 findCounter = 0;
 }

Modified version

Keeping your algorithm, but incorporating the fixes from this and other answers, I get:

#include <stdio.h>
/* Count the number of non-overlapping occurrences of find in str.
 * Returns zero if either string is empty.
 */
size_t findInStr(const char *str, const char *find)
{
 size_t howMany = 0;
 size_t strCounter = 0;
 size_t findCounter = 0;
 do {
 if (find[findCounter] == '0円') {
 ++howMany;
 findCounter = 0;
 }
 if (str[strCounter] == find[findCounter]) {
 ++findCounter;
 } else {
 strCounter -= findCounter;
 findCounter = 0;
 }
 } while (str[strCounter++] != '0円');
 return howMany;
}

I changed the test program to accept command-line arguments; this makes for easier testing. I hope that this is educational reading (note that I check the arguments before attempting to use them).

int main(int argc, char **argv)
{
 if (argc != 3) {
 fprintf(stderr, "Usage: %s haystack needle\n", argv[0]);
 return 1;
 }
 const char *const haystack = argv[1];
 const char *const needle = argv[2];
 const size_t found = findInStr(haystack, needle);
 switch (found) {
 case 0:
 printf("'%s' was not found in '%s'.\n", needle, haystack);
 break;
 case 1:
 printf("'%s' was found once in '%s'.\n", needle, haystack);
 break;
 default:
 printf("'%s' was found %zd times in '%s'.\n", needle, found, haystack);
 break;
 }
 return 0;
}

Pointer version

This is a bit more advanced, and harder to get right, but I'll show you what this looks like when converted to use pointers rather than indexes. A good optimizing compiler will likely generate the same code for both, but you'll want to be able to study and work with pointer code when you come across it.

size_t findInStr(const char *str, const char *const find)
{
 size_t howMany = 0;
 for (const char *find_pos = find; *str; ++str) {
 if (*str == *find_pos) {
 /* matched this char */
 if (!*++find_pos) {
 /* reached the end of find; count it and reset */
 ++howMany;
 find_pos = find;
 }
 } else {
 /* not matched - back up and continue */
 str -= find_pos - find;
 find_pos = find;
 }
 }
 return howMany;
}

Here, I've also used simply *str rather than the longer, but equivalent *str != '0円'.

Self-test program

An alternative main() can run a test suite, which can be useful in to avoid introducing bugs as you modify the code. Here's what I accumulated when writing the above:

/* return failure count (0 or 1) */
static int test(const char *haystack, const char *const needle, size_t expected)
{
 const size_t actual = findInStr(haystack, needle);
 if (actual == expected) {
 return 0;
 }
 /* incorrect result */
 fprintf(stderr,
 "FAIL: findInStr(\"%s\", \"%s\") returned %zd instead of %zd\n",
 haystack, needle, actual, expected);
 return 1;
}
int main(void)
{
 return
 + test("abc", "", 0)
 + test("abc", "d", 0)
 + test("abc", "cd", 0)
 + test("abc", "ac", 0)
 + test("abc", "ab", 1)
 + test("ab", "abc", 0)
 + test("abab", "ab", 2)
 + test("ababa", "aba", 1)
 + test("ababac", "bac", 1)
 ;
}

You might choose to use a preprocessor #ifdef/#else to be able to compile the test or interactive version from one source, or you might write two programs sharing a single findInStr() implementation.

Goode catch about not finding strings with repeated characters! – Minor remark: gets_s (from the Microsoft C runtime) replaces the newline character with a 0円 character, so it is not completely equivalent to fgets()
Thanks @Martin - I've edited to incorporate your observation. (Not having a Microsoft platform, I don't have the man pages for their functions, so just guessed based on the name).
I simply looked it up in the MSDN online documentation: msdn.microsoft.com/en-us/library/5b5x9wc7.aspx :)

Stack Exchange Network

Find substrings in string

3 Answers 3

The problem statement

Non-standard header

Naming

Bug (or undocumented constraint)

Bug

Modified version

Pointer version

Self-test program

The return type

Const parameters

Array sizes

Bugs

Style

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Find substrings in string

3 Answers 3

The problem statement

Non-standard header

Naming

Bug (or undocumented constraint)

Bug

Modified version

Pointer version

Self-test program

The return type

Const parameters

Array sizes

Bugs

Style

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions