Optimizing string handling function

Question 1

How can I optimize the following function? It checks whether two strings have a certain number of similar consecutive characters. As it is now, it is very slow, so how can it be improved in terms of speed/performance:

#include <stdio.h>
int overlap(char *str, const char *substr, int length);
int main(void)
{
 if (overlap("abcef01239", "abckef01798", 4))
 {
 puts("Match ...");
 }
 return 0;
}
int overlap(char *str, const char *substr, int length)
{
 const char *_sub = substr;
 while (*str)
 {
 int n = 0;
 int a = *str;
 int b = *substr;
 while (a && b)
 {
 if (a != b)
 break;
 n++;
 if (n == length)
 return 1;
 a = str[n];
 b = substr[n];
 }
 if (!*substr)
 {
 str++;
 substr = _sub;
 }
 else
 {
 substr++;
 }
 }
 return 0;
}

Question 2

This question is incomplete. To help reviewers give you better answers, please add sufficient context to your question. The more you tell us about what your code does and what the purpose of doing that is, the easier it will be for reviewers to help you. The current title states your concerns about the code; it needs an edit to simply state the task; see How to get the best value out of Code Review: Asking Questions for guidance on writing good question titles.

Question 3

@TobySpeight I don't know why you find it incomplete. It describes what the code does: "It checks whether two strings have a certain number of similar consecutive characters" and explains the problem that it is slow.

Question 4

After several re-reads, I think the intent is that it should return true if the two strings have any common substring of length or more characters. Is that the case? If so, you could make it clearer by including more of your tests.

Question 5

@TobySpeight The function is supposed to return true if there are equivalent consecutive characters of length length between two strings. In main(), it returns true because "abcef01239" and "abckef01798 have exactly 4 similar consecutive characters, which is the length passed as the third argument.

Question 6

Const correctness

We don't modify the contents of str, so pass it as a const char*. This allows us to give string literals as arguments without compiler warnings.

Check the arguments

What if str or substr is null? What does it mean for length to be negative? We should use size_t for indexing and lengths.

int overlap(const char *str, const char *substr, size_t length)
{
 if (!str || !substr)
 /* invalid arguments */
 return 0;

Add more tests

I'm not 100% sure from the description what the code is supposed to do. That can be made clearer by providing more tests. I recommend using a test framework to help identify the failures, but we can make our own for this:

int test(const char *a, const char *b, size_t len,
 int expected, const char *file, int line, const char *message)
{
 const int actual = overlap(a, b, len);
 if (actual == expected) return 0;
 fprintf(stderr, "%s:%d: %s should be %d, but got %d\n",
 file, line,
 message, expected, actual);
 return 1;
}
#define TEST(a, b, len, expected) \
 test(a, b, len, expected, __FILE__, __LINE__, "overlap(" #a ", " #b ", " #len ")")
int main(void)
{
 return
 + TEST(NULL, NULL, 1, 0)
 + TEST("abc", "def", 0, 1)
 + TEST("abc", "abc", 4, 0)
 + TEST("abc", "abc", 3, 1)
 + TEST("abc", "bcd", 3, 0)
 + TEST("abc", "bcd", 2, 1)
 + TEST("bcd", "abc", 2, 1)
 + TEST("abcef01239", "abckef01798", 4, 1);
}

Algorithm

We're going more times over the data than we need to - although it looks like we're scaling as O(n2), it's actually O(n3), because of the if/else at the end of the loop that resets substr from time to time.

What we need to do is for each offset of str and substr, walk along the pair of them, counting character matches as we go (reset the count on a non-match). If the count reaches length, then we've found a substring match.

As a worked example, let's test "ab1cdefg" against "bcde2":

ab1cdefg
bcde2
.....

In that first pass, nothing matched. Now shift one of the strings by one character (i.e. index by n+1 instead of by n):

ab1cdefg
 bcde2
 1....

The longest run was 1, when we matched the b. Next iteration:

ab1cdefg
 bcde2
 .123.

Here, we matched c, d and e successively, for a run length of 3.

In code, that looks something like:

#include <string.h>
/* internal method - returns true if a and b are equal for a run of
 length, without shifting */
static int overlap_run(const char *a, const char *b, size_t length)
{
 size_t count = 0;
 while (*a && *b) {
 /* add one to count, or reset it to 0 */
 if (*a == *b)
 ++count;
 else
 count = 0;
 if (count == length)
 /* match found */
 return 1;
 ++a, ++b;
 }
 return 0;
}
int overlap(const char *a, const char *b, size_t length)
{
 if (!a || !b)
 /* invalid arguments */
 return 0;
 if (!length)
 /* every pair of strings has a zero-length match */
 return 1;
 const size_t la = strlen(a);
 const size_t lb = strlen(b);
 if (length > la || length > lb)
 /* strings are too short to match */
 return 0;
 for (size_t offset = 0; offset <= la - length; ++offset)
 if (overlap_run(a+offset, b, length))
 return 1;
 for (size_t offset = 1; offset <= lb - length; ++offset)
 if (overlap_run(a, b+offset, length))
 return 1;
 /* no match */
 return 0;
}

You might be able to avoid pipeline stalls by replacing

 if (*a == *b)
 ++count;
 else
 count = 0;

With

 count += (*a == *b);
 count *= (*a == *b);

But you'll need to benchmark to find out on your hardware.

Question 7

Well, assert() is generally the proper way to check for programmer error, instead of returning an error-code.

Question 8

Unfortunately the spec doesn't say whether passing NULL is an error or not, so I'm sitting on the fence there!

Question 9

@TobySpeight Most standard library C functions do not specify functionality when a string pointer is NULL, thus it is UB when those functions are called with NULL. Of course employing code that does have a NULL test is OK - and reasonably of low cost.

Toby Speight Toby Speight 87.3k14 gold badges104 silver badges322 bronze badges · Answer 1 · 2018-04-19 12:47:21Z

Const correctness

We don't modify the contents of str, so pass it as a const char*. This allows us to give string literals as arguments without compiler warnings.

Check the arguments

What if str or substr is null? What does it mean for length to be negative? We should use size_t for indexing and lengths.

int overlap(const char *str, const char *substr, size_t length)
{
 if (!str || !substr)
 /* invalid arguments */
 return 0;

Add more tests

I'm not 100% sure from the description what the code is supposed to do. That can be made clearer by providing more tests. I recommend using a test framework to help identify the failures, but we can make our own for this:

int test(const char *a, const char *b, size_t len,
 int expected, const char *file, int line, const char *message)
{
 const int actual = overlap(a, b, len);
 if (actual == expected) return 0;
 fprintf(stderr, "%s:%d: %s should be %d, but got %d\n",
 file, line,
 message, expected, actual);
 return 1;
}
#define TEST(a, b, len, expected) \
 test(a, b, len, expected, __FILE__, __LINE__, "overlap(" #a ", " #b ", " #len ")")
int main(void)
{
 return
 + TEST(NULL, NULL, 1, 0)
 + TEST("abc", "def", 0, 1)
 + TEST("abc", "abc", 4, 0)
 + TEST("abc", "abc", 3, 1)
 + TEST("abc", "bcd", 3, 0)
 + TEST("abc", "bcd", 2, 1)
 + TEST("bcd", "abc", 2, 1)
 + TEST("abcef01239", "abckef01798", 4, 1);
}

Algorithm

We're going more times over the data than we need to - although it looks like we're scaling as O(n2), it's actually O(n3), because of the if/else at the end of the loop that resets substr from time to time.

What we need to do is for each offset of str and substr, walk along the pair of them, counting character matches as we go (reset the count on a non-match). If the count reaches length, then we've found a substring match.

As a worked example, let's test "ab1cdefg" against "bcde2":

ab1cdefg
bcde2
.....

In that first pass, nothing matched. Now shift one of the strings by one character (i.e. index by n+1 instead of by n):

ab1cdefg
 bcde2
 1....

The longest run was 1, when we matched the b. Next iteration:

ab1cdefg
 bcde2
 .123.

Here, we matched c, d and e successively, for a run length of 3.

In code, that looks something like:

#include <string.h>
/* internal method - returns true if a and b are equal for a run of
 length, without shifting */
static int overlap_run(const char *a, const char *b, size_t length)
{
 size_t count = 0;
 while (*a && *b) {
 /* add one to count, or reset it to 0 */
 if (*a == *b)
 ++count;
 else
 count = 0;
 if (count == length)
 /* match found */
 return 1;
 ++a, ++b;
 }
 return 0;
}
int overlap(const char *a, const char *b, size_t length)
{
 if (!a || !b)
 /* invalid arguments */
 return 0;
 if (!length)
 /* every pair of strings has a zero-length match */
 return 1;
 const size_t la = strlen(a);
 const size_t lb = strlen(b);
 if (length > la || length > lb)
 /* strings are too short to match */
 return 0;
 for (size_t offset = 0; offset <= la - length; ++offset)
 if (overlap_run(a+offset, b, length))
 return 1;
 for (size_t offset = 1; offset <= lb - length; ++offset)
 if (overlap_run(a, b+offset, length))
 return 1;
 /* no match */
 return 0;
}

You might be able to avoid pipeline stalls by replacing

 if (*a == *b)
 ++count;
 else
 count = 0;

With

 count += (*a == *b);
 count *= (*a == *b);

But you'll need to benchmark to find out on your hardware.

Well, assert() is generally the proper way to check for programmer error, instead of returning an error-code.
Unfortunately the spec doesn't say whether passing NULL is an error or not, so I'm sitting on the fence there!
@TobySpeight Most standard library C functions do not specify functionality when a string pointer is NULL, thus it is UB when those functions are called with NULL. Of course employing code that does have a NULL test is OK - and reasonably of low cost.

Stack Exchange Network

Optimizing string handling function

1 Answer 1

Const correctness

Check the arguments

Add more tests

Algorithm

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Optimizing string handling function

1 Answer 1

Const correctness

Check the arguments

Add more tests

Algorithm

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions