9
\$\begingroup\$

I was recently porting an application that uses memmem from Linux to Windows. While memmem is available as a library function on Linux, MSDN / Windows does not offer it. I could have used strstr but then it has a very different prototype and expects only null-terminated strings as arguments. Googling around, I discovered a few implementations of memmem but none appeared to truly mimic memmem.

I therefore wrote one of my own:

void *memmem(const void *haystack_start, size_t haystack_len, const void *needle_start, size_t needle_len)
{
 const unsigned char *haystack = (const unsigned char *) haystack_start;
 const unsigned char *needle = (const unsigned char *) needle_start;
 const unsigned char *h = NULL;
 const unsigned char *n = NULL;
 size_t x = needle_len;
 /* The first occurrence of the empty string is deemed to occur at
 the beginning of the string. */
 if (needle_len == 0)
 return (void *) haystack_start;
 /* Sanity check, otherwise the loop might search through the whole
 memory. */
 if (haystack_len < needle_len)
 return NULL;
 for (; *haystack && haystack_len--; haystack++) {
 x = needle_len;
 n = needle;
 h = haystack;
 if (haystack_len < needle_len)
 break;
 if ((*haystack != *needle) || ( *haystack + needle_len != *needle + needle_len))
 continue;
 for (; x ; h++ , n++) {
 x--;
 if (*h != *n) 
 break;
 if (x == 0)
 return (void *)haystack;
 }
 }
 return NULL;
}

The code does work, but my question is: could it be done more efficiently, or faster?

I hope one of the awesome hacks here will also be kind enough to point out if this implementation has a vulnerability that I may have not noticed.

Heslacher
50.9k5 gold badges83 silver badges177 bronze badges
asked Dec 6, 2017 at 15:27
\$\endgroup\$
8
  • 2
    \$\begingroup\$ Look up the string searching algorithms in en.wikipedia.org/wiki/String_searching_algorithm -- these will get you much better performance than the Θ(nm) performance of the naive algorithm you're using. \$\endgroup\$ Commented Dec 6, 2017 at 22:14
  • \$\begingroup\$ Hey @snowbody any implementation of Boyer–Moore string search algorithm in C, that I could use on Windows as well? Please do share. \$\endgroup\$ Commented Dec 7, 2017 at 7:04
  • 2
    \$\begingroup\$ I don't understand, the en.wikipedia.org/wiki/… page has a C algorithm on it, didn't it work for you? What compiler are you using? \$\endgroup\$ Commented Dec 7, 2017 at 14:14
  • \$\begingroup\$ @snowbody my bad, I didn't notice the c implementation, discussed below the Python implementation. Thanks for pointing out. \$\endgroup\$ Commented Dec 7, 2017 at 14:52
  • \$\begingroup\$ Ok this has a bug. Fails if needle and haystack are exactly the same. :( \$\endgroup\$ Commented Dec 12, 2017 at 13:42

2 Answers 2

6
\$\begingroup\$
  1. You wrote:

     for (; *haystack && haystack_len--; haystack++) {
     x = needle_len;
     n = needle;
     h = haystack;
     if (haystack_len < needle_len)
     break;
    

A better way would be:

 for (; --haystack_len >= needle_len; haystack++) {
 x = needle_len;
 n = needle;
 h = haystack;

The *haystack in the condition expression is actually a bug as it prevents searching if there are any zero bytes in the range.

  1. I think it is a bad idea to have the second clause in:

     if ((*haystack != *needle) || ( *haystack + needle_len != *needle + needle_len))
    

because you might bust the cache. Keep data access local. Checking the last character is an unnecessary complexity.

Sᴀᴍ Onᴇᴌᴀ
29.5k16 gold badges45 silver badges201 bronze badges
answered Dec 6, 2017 at 22:49
\$\endgroup\$
5
\$\begingroup\$

Declare the Variables When They are Needed
The first five lines of the function memmem() are:

 const unsigned char *haystack = (const unsigned char *) haystack_start;
 const unsigned char *needle = (const unsigned char *) needle_start;
 const unsigned char *h = NULL;
 const unsigned char *n = NULL;
 size_t x = needle_len;

Since x, h and n are loop variants, it would be better to delcare them where their values are assigned:

 for (; *haystack && haystack_len--; haystack++) {
 size_t x = needle_len;
 unsigned char *n = needle;
 unsigned char *h = haystack;

The pointers n and h are variable, they will change inside the loop, the characters won't change.

The variables haystack and needle start aren't needed utill just before the loop, because they also are only used in the loop..

 unsigned char *haystack = (unsigned char *) haystack_start;
 unsigned char *needle = (unsigned char *) needle_start;
 for (; *haystack && haystack_len--; haystack++) {
 size_t x = needle_len;
 unsigned char *n = needle;
 unsigned char *h = haystack;

Simplify the Code
There is no reason to stay in the function or the outer loop once haystack_len is less than needle_len.

 if (haystack_len < needle_len)
 {
 return NULL;
 }

Make Maintenance Easier While the following is perfectly legitimate, it's not the best programming practice, if someone needs to add a second statement to the THEN clause, it might be better to add braces around all THEN and ELSE clauses:

 if (haystack_len < needle_len)
 return NULL;

would be better as:

 if (haystack_len < needle_len)
 {
 return NULL;
 }

One Possible Optimization
Most computers have a decrement and test opcode, the for loop

 for (; x ; h++ , n++) {
 x--;
 if (*h != *n)
 break;
 if (x == 0)
 return (void *)haystack;
 }

would probably be faster this way:

 for (x = needle_len + 1; --x ; )
 {
 if (*h++ != *n++)
 break;
 }
 if (x == 0)
 {
 return (void *)haystack;
 }

The variable x needs to be one greater than the length because --x is tested at the top of the loop.

answered Dec 7, 2017 at 15:52
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.