2
\$\begingroup\$

I decided to work on an idea I had to 'optimize' the classic C function strstr. Most of the implementations I had seen that did not make use of advanced optimizations usually ran a forward-running comparison loop to confirm an actual match within the initial match detection loop.

I thought we could 'optimize' it somewhat such that instead of performing a forward char-by-char comparison, we would be performing a simplified version of Boyer-Moore. This would allow us to not only.

  1. Exit out of the inner loop earlier: Since for left-to-right text, a string match condition has somewhat of an end bias (no matter how many characters match, even an n - 1 char mismatch at the end would result in a non-match).
  2. Advance further than a single char on a non-match (worst-case: advance by 2).

But I am not sure if I have missed anything or have failed to take something into consideration. I truly apologize for the terrible variable naming.

/* -------------------------------------------------- //
 Parameters:
 1. [c] - The string to search within.
 2. [find] - The string to look for.
 3. [n] - The length of [c].
 4. [findn] - The length of [find].
 Return:
 The pointer offset indicating the address
 of the first characted of the substring
 [find] within [c]. If [find] was not found
 to be within [c], then a NULL pointer is
 returned.
 NOTE: There was a manual cast from
 (const char *) to (char *) which was done
 to match the return signature of the original
 strstr(...). If this is a security flaw, we
 can just change the return signature as it'd
 probably be a good idea to keep the parameter
 signatures as is.
// -------------------------------------------------- */
char *findin(const char *__restrict c, const char *__restrict find, size_t n, size_t findn)
{
 size_t u = 0, uu;
 --findn;
 while((u + findn) < n)
 {
 if(c[u] == find[0])
 {
 uu = findn;
 while((uu > 0) && (c[u + uu] == find[uu]))
 {
 --uu;
 }
 if(!uu)
 {
 return (char *)(c + u);
 }
 else
 {
 u = u + findn;
 }
 }
 else
 {
 ++u;
 }
 }
 return 0;
}
  • Usage
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    int main()
    {
     char str[] = "What an apt day to appreciate an apple!";
     char substr[] = "app";
     size_t str_length = strlen(str);
     size_t substr_length = strlen(substr);
     char *p = findin(str, substr, str_length, substr_length);
     printf("%s\n", p);
     return 0;
    }
    

Needless to say, this is not the exact function signature nor intended as a replacement for strstr. I attempted to implement my initial thought. I don't believe strstr is used to look through a 1MB chunk of text, so any optimization would have negligible impact. But I wonder, is the standard strstr much more optimized & making use of hardware-level intrinsics?

Peilonrayz
44.4k7 gold badges80 silver badges157 bronze badges
asked May 25 at 7:34
\$\endgroup\$
7
  • 2
    \$\begingroup\$ You need to include stddef.h> (or perhaps <string.h> or any of several others) to get a definition for size_t. \$\endgroup\$ Commented May 25 at 7:54
  • 3
    \$\begingroup\$ Have you studied many other implementations? This one seems much less sophisticated than [the glibc implementation](hb=e331531a771443edae4135e6bcd016282cf1a3aa), for example. \$\endgroup\$ Commented May 25 at 8:03
  • 3
    \$\begingroup\$ (find "abac" in "ababacac") \$\endgroup\$ Commented May 25 at 8:13
  • 3
    \$\begingroup\$ There isn't really "the standard strstr" but there are plenty of advanced implementations floating around, the naive "just use two nested loops" O(nm) algorithm mostly exists in the imagination of beginner level content and not so much in the serious library implementations (I cannot rule it out though) \$\endgroup\$ Commented May 25 at 9:31
  • 2
    \$\begingroup\$ Sorry my link got mangled - it should point at the glibc implementation. \$\endgroup\$ Commented May 26 at 7:05

1 Answer 1

6
\$\begingroup\$

The code shown fails to find "abc" in "aabc":
Design and automate tests. (Printing as a string a function result that may be NULL "invokes undefined behaviour".)

The code documents the parameters to the single function presented.
For the function's purpose it relies on the function's name and the description of the value returned.
It does not tell what happens when there is more than one match.

"truly apologize for the terrible variable naming"
Don't: improve the naming!
Traditional names are haystack and needle.

Algorithm:
So seeing 'b' in the haystack where there is 'c' in the needle, how much can one advance?
The 'b' could happen to be the last 'b' in the needle, which happens to be its second character: the next candidate position is mismatch position - last position of haystack character in needle.
Characters not in needle at all allow to continue after the mismatch.

This is the bad character rule from Boyer-Moore, it can be implemented without the good suffix rule and expected to speed up establishing no match or finding matches depending on needle.

\$\endgroup\$
5
  • \$\begingroup\$ thank you for the detailed response. initially, what i try to do with algorithms is i read up on it superficially and test if i am able to come close to "thinking up" a solution based on the "clues" i have. i suppose some algorithms are simply too complex to approach in this manner! i also had the idea to alternate the direction of comparison after you highlighted the initial flaw, but i'm guessing there are other methods to not O(nm) this in the first place... everything has a cost, and i knew it couldn't be this simple, can't get ahead! \$\endgroup\$ Commented May 27 at 6:14
  • 1
    \$\begingroup\$ Implementing the bad character rule with good performance sure was more straightforward when characters were single byte. You can try to establish the shifts just in time. \$\endgroup\$ Commented May 27 at 6:17
  • \$\begingroup\$ off the top of your head, would alternating the direction of comparison on every iteration get me out of the inconsistent jumping ahead issue and hence still keeping the big O better than O(nm)? \$\endgroup\$ Commented May 27 at 6:24
  • 1
    \$\begingroup\$ I don't think it does. Far as memory serves, worst case faster than O(nm) is annoyingly hard/complicated. \$\endgroup\$ Commented May 27 at 6:26
  • 3
    \$\begingroup\$ (Go about it differently: Try to think up a worst case needle. All characters the same (boooring), palindromes, ...) \$\endgroup\$ Commented May 27 at 6:30

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.