How can I optimize the following function? It checks whether two strings have a certain number of similar consecutive characters. As it is now, it is very slow, so how can it be improved in terms of speed/performance:
#include <stdio.h>
int overlap(char *str, const char *substr, int length);
int main(void)
{
if (overlap("abcef01239", "abckef01798", 4))
{
puts("Match ...");
}
return 0;
}
int overlap(char *str, const char *substr, int length)
{
const char *_sub = substr;
while (*str)
{
int n = 0;
int a = *str;
int b = *substr;
while (a && b)
{
if (a != b)
break;
n++;
if (n == length)
return 1;
a = str[n];
b = substr[n];
}
if (!*substr)
{
str++;
substr = _sub;
}
else
{
substr++;
}
}
return 0;
}
1 Answer 1
Const correctness
We don't modify the contents of str
, so pass it as a const char*
. This allows us to give string literals as arguments without compiler warnings.
Check the arguments
What if str
or substr
is null? What does it mean for length
to be negative? We should use size_t
for indexing and lengths.
int overlap(const char *str, const char *substr, size_t length)
{
if (!str || !substr)
/* invalid arguments */
return 0;
Add more tests
I'm not 100% sure from the description what the code is supposed to do. That can be made clearer by providing more tests. I recommend using a test framework to help identify the failures, but we can make our own for this:
int test(const char *a, const char *b, size_t len,
int expected, const char *file, int line, const char *message)
{
const int actual = overlap(a, b, len);
if (actual == expected) return 0;
fprintf(stderr, "%s:%d: %s should be %d, but got %d\n",
file, line,
message, expected, actual);
return 1;
}
#define TEST(a, b, len, expected) \
test(a, b, len, expected, __FILE__, __LINE__, "overlap(" #a ", " #b ", " #len ")")
int main(void)
{
return
+ TEST(NULL, NULL, 1, 0)
+ TEST("abc", "def", 0, 1)
+ TEST("abc", "abc", 4, 0)
+ TEST("abc", "abc", 3, 1)
+ TEST("abc", "bcd", 3, 0)
+ TEST("abc", "bcd", 2, 1)
+ TEST("bcd", "abc", 2, 1)
+ TEST("abcef01239", "abckef01798", 4, 1);
}
Algorithm
We're going more times over the data than we need to - although it looks like we're scaling as O(n2), it's actually O(n3), because of the if
/else
at the end of the loop that resets substr
from time to time.
What we need to do is for each offset of str
and substr
, walk along the pair of them, counting character matches as we go (reset the count on a non-match). If the count reaches length
, then we've found a substring match.
As a worked example, let's test "ab1cdefg"
against "bcde2"
:
ab1cdefg
bcde2
.....
In that first pass, nothing matched. Now shift one of the strings by one character (i.e. index by n+1
instead of by n
):
ab1cdefg
bcde2
1....
The longest run was 1, when we matched the b
. Next iteration:
ab1cdefg
bcde2
.123.
Here, we matched c
, d
and e
successively, for a run length of 3.
In code, that looks something like:
#include <string.h>
/* internal method - returns true if a and b are equal for a run of
length, without shifting */
static int overlap_run(const char *a, const char *b, size_t length)
{
size_t count = 0;
while (*a && *b) {
/* add one to count, or reset it to 0 */
if (*a == *b)
++count;
else
count = 0;
if (count == length)
/* match found */
return 1;
++a, ++b;
}
return 0;
}
int overlap(const char *a, const char *b, size_t length)
{
if (!a || !b)
/* invalid arguments */
return 0;
if (!length)
/* every pair of strings has a zero-length match */
return 1;
const size_t la = strlen(a);
const size_t lb = strlen(b);
if (length > la || length > lb)
/* strings are too short to match */
return 0;
for (size_t offset = 0; offset <= la - length; ++offset)
if (overlap_run(a+offset, b, length))
return 1;
for (size_t offset = 1; offset <= lb - length; ++offset)
if (overlap_run(a, b+offset, length))
return 1;
/* no match */
return 0;
}
You might be able to avoid pipeline stalls by replacing
if (*a == *b)
++count;
else
count = 0;
With
count += (*a == *b);
count *= (*a == *b);
But you'll need to benchmark to find out on your hardware.
-
\$\begingroup\$ Well,
assert()
is generally the proper way to check for programmer error, instead of returning an error-code. \$\endgroup\$Deduplicator– Deduplicator2018年04月19日 15:04:09 +00:00Commented Apr 19, 2018 at 15:04 -
1\$\begingroup\$ Unfortunately the spec doesn't say whether passing NULL is an error or not, so I'm sitting on the fence there! \$\endgroup\$Toby Speight– Toby Speight2018年04月19日 15:27:13 +00:00Commented Apr 19, 2018 at 15:27
-
1\$\begingroup\$ @TobySpeight Most standard library C functions do not specify functionality when a string pointer is
NULL
, thus it is UB when those functions are called withNULL
. Of course employing code that does have aNULL
test is OK - and reasonably of low cost. \$\endgroup\$chux– chux2018年04月25日 20:15:11 +00:00Commented Apr 25, 2018 at 20:15
length
or more characters. Is that the case? If so, you could make it clearer by including more of your tests. \$\endgroup\$length
between two strings. Inmain()
, it returns true because "abcef01239" and "abckef01798 have exactly4
similar consecutive characters, which is the length passed as the third argument. \$\endgroup\$