Attempt at a Different Variation of the strstr(...) Function

Question 1

I decided to work on an idea I had to 'optimize' the classic C function strstr. Most of the implementations I had seen that did not make use of advanced optimizations usually ran a forward-running comparison loop to confirm an actual match within the initial match detection loop.

I thought we could 'optimize' it somewhat such that instead of performing a forward char-by-char comparison, we would be performing a simplified version of Boyer-Moore. This would allow us to not only.

Exit out of the inner loop earlier: Since for left-to-right text, a string match condition has somewhat of an end bias (no matter how many characters match, even an n - 1 char mismatch at the end would result in a non-match).
Advance further than a single char on a non-match (worst-case: advance by 2).

But I am not sure if I have missed anything or have failed to take something into consideration. I truly apologize for the terrible variable naming.

/* -------------------------------------------------- //
 Parameters:
 1. [c] - The string to search within.
 2. [find] - The string to look for.
 3. [n] - The length of [c].
 4. [findn] - The length of [find].
 Return:
 The pointer offset indicating the address
 of the first characted of the substring
 [find] within [c]. If [find] was not found
 to be within [c], then a NULL pointer is
 returned.
 NOTE: There was a manual cast from
 (const char *) to (char *) which was done
 to match the return signature of the original
 strstr(...). If this is a security flaw, we
 can just change the return signature as it'd
 probably be a good idea to keep the parameter
 signatures as is.
// -------------------------------------------------- */
char *findin(const char *__restrict c, const char *__restrict find, size_t n, size_t findn)
{
 size_t u = 0, uu;
 --findn;
 while((u + findn) < n)
 {
 if(c[u] == find[0])
 {
 uu = findn;
 while((uu > 0) && (c[u + uu] == find[uu]))
 {
 --uu;
 }
 if(!uu)
 {
 return (char *)(c + u);
 }
 else
 {
 u = u + findn;
 }
 }
 else
 {
 ++u;
 }
 }
 return 0;
}

Usage

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
 char str[] = "What an apt day to appreciate an apple!";
 char substr[] = "app";
 size_t str_length = strlen(str);
 size_t substr_length = strlen(substr);
 char *p = findin(str, substr, str_length, substr_length);
 printf("%s\n", p);
 return 0;
}

Needless to say, this is not the exact function signature nor intended as a replacement for strstr. I attempted to implement my initial thought. I don't believe strstr is used to look through a 1MB chunk of text, so any optimization would have negligible impact. But I wonder, is the standard strstr much more optimized & making use of hardware-level intrinsics?

Question 2

You need to include stddef.h> (or perhaps <string.h> or any of several others) to get a definition for size_t.

Question 3

Have you studied many other implementations? This one seems much less sophisticated than [the glibc implementation](hb=e331531a771443edae4135e6bcd016282cf1a3aa), for example.

Question 4

(find "abac" in "ababacac")

Question 5

There isn't really "the standard strstr" but there are plenty of advanced implementations floating around, the naive "just use two nested loops" O(nm) algorithm mostly exists in the imagination of beginner level content and not so much in the serious library implementations (I cannot rule it out though)

Question 6

Sorry my link got mangled - it should point at the glibc implementation.

Question 7

The code shown fails to find "abc" in "aabc":
Design and automate tests. (Printing as a string a function result that may be NULL "invokes undefined behaviour".)

The code documents the parameters to the single function presented.
For the function's purpose it relies on the function's name and the description of the value returned.
It does not tell what happens when there is more than one match.

"truly apologize for the terrible variable naming"
Don't: improve the naming!
Traditional names are haystack and needle.

Algorithm:
So seeing 'b' in the haystack where there is 'c' in the needle, how much can one advance?
The 'b' could happen to be the last 'b' in the needle, which happens to be its second character: the next candidate position is mismatch position - last position of haystack character in needle.
Characters not in needle at all allow to continue after the mismatch.

This is the bad character rule from Boyer-Moore, it can be implemented without the good suffix rule and expected to speed up establishing no match or finding matches depending on needle.

Question 8

thank you for the detailed response. initially, what i try to do with algorithms is i read up on it superficially and test if i am able to come close to "thinking up" a solution based on the "clues" i have. i suppose some algorithms are simply too complex to approach in this manner! i also had the idea to alternate the direction of comparison after you highlighted the initial flaw, but i'm guessing there are other methods to not O(nm) this in the first place... everything has a cost, and i knew it couldn't be this simple, can't get ahead!

Question 9

Implementing the bad character rule with good performance sure was more straightforward when characters were single byte. You can try to establish the shifts just in time.

Question 10

off the top of your head, would alternating the direction of comparison on every iteration get me out of the inconsistent jumping ahead issue and hence still keeping the big O better than O(nm)?

Question 11

I don't think it does. Far as memory serves, worst case faster than O(nm) is annoyingly hard/complicated.

Question 12

(Go about it differently: Try to think up a worst case needle. All characters the same (boooring), palindromes, ...)

mindoverflow – mindoverflow · Accepted Answer · 2025-05-26 06:52:53Z

The code shown fails to find "abc" in "aabc":
Design and automate tests. (Printing as a string a function result that may be NULL "invokes undefined behaviour".)

The code documents the parameters to the single function presented.
For the function's purpose it relies on the function's name and the description of the value returned.
It does not tell what happens when there is more than one match.

"truly apologize for the terrible variable naming"
Don't: improve the naming!
Traditional names are haystack and needle.

Algorithm:
So seeing 'b' in the haystack where there is 'c' in the needle, how much can one advance?
The 'b' could happen to be the last 'b' in the needle, which happens to be its second character: the next candidate position is mismatch position - last position of haystack character in needle.
Characters not in needle at all allow to continue after the mismatch.

This is the bad character rule from Boyer-Moore, it can be implemented without the good suffix rule and expected to speed up establishing no match or finding matches depending on needle.

thank you for the detailed response. initially, what i try to do with algorithms is i read up on it superficially and test if i am able to come close to "thinking up" a solution based on the "clues" i have. i suppose some algorithms are simply too complex to approach in this manner! i also had the idea to alternate the direction of comparison after you highlighted the initial flaw, but i'm guessing there are other methods to not O(nm) this in the first place... everything has a cost, and i knew it couldn't be this simple, can't get ahead!
Implementing the bad character rule with good performance sure was more straightforward when characters were single byte. You can try to establish the shifts just in time.
off the top of your head, would alternating the direction of comparison on every iteration get me out of the inconsistent jumping ahead issue and hence still keeping the big O better than O(nm)?
I don't think it does. Far as memory serves, worst case faster than O(nm) is annoyingly hard/complicated.
(Go about it differently: Try to think up a worst case needle. All characters the same (boooring), palindromes, ...)

Stack Exchange Network

Attempt at a Different Variation of the strstr(...) Function

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Attempt at a Different Variation of the strstr(...) Function

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions