Wildcard string search with capture

Question 1

My goal was to create a function as performant as possible with results being:

True/False as to whether wild matches str.
Optional capture of certain patterns.

My purposes required:

Case sensitive match.
Pattern must match entire string to succeed.
Assumption that * and ? do not exist in str.
* matches zero or more characters.
? matches exactly one character.
Lazy * in this case meaning that going from left to right, * would match as little as possible while still ensuring a match if such were possible.

It's been a challenge as many times I almost had it but then found an edge case that didn't work. This is basically globbing, but I need such to work for strings rather than files and I also hoped to make something much faster than using a regex which IMHO would be overkill for my needs.

I have tested the following to work on many patterns but I can't be certain yet if there isn't a permutation where it fails. I would be very interested if anyone can point out a situation where it fails the terms set above. I'm also interested to hear if there are any ways the code can be improved (made faster).

Hacks like (unsigned)(*wild-40)<=1 are probably unnecessary given the ability of gcc to optimize, but I put it in there anyway.

static __attribute__ ((noinline)) int globSearch(char *wild,char *str,char **c)
 {
 int N=-1,n=0,nm=0;
 while(*str)
 {
 if(*wild == '*'){ N=n; wild++; }
 else if((unsigned)(*wild-40)<=1) // faster version of if(*wild=='('||*wild==')')
 {
 c[n++]=str;
 wild++;
 }
 else if(*wild == '?'){ wild++; str++; }
 else if(*wild == *str)
 {
 if(nm){while(N < n){ c[N++]+=nm; }}
 nm=0;
 N=-1;
 wild++;
 str++;
 }
 else if(N > -1){ str++; nm++; }
 else return 0;
 }
 while((unsigned)(*wild-40) <= 2) // while( *wild == '(' || *wild == ')' || *wild == '*' )
 {
 if(*wild++ != '*')c[n++]=str;
 }
 if(nm){while(N < n){ c[N++]+=nm; }}
 return !*wild;
 }

Testing function

int main()
 {
 char tmp[100]={0};
 char *c[20]={0};
 char *m="(hello) (*)(?*?)?(?)?(*)(*)";
 char *str="hello world";
 int t=globSearch(m,str,c);
 if(t)
 {
 for(int x=0;c[x]&&c[x+1];x+=2)
 {
 int z=c[x+1]-c[x];
 memcpy(tmp,c[x],z);
 tmp[z]=0;
 printf("%s:%d\n",tmp,z);
 }
 }
 return 0;
 }

The above code returns successful match with patterns captured: (hello:5) (0) (wo:2) (l:1) (0) (0)

Other patterns and results:

*(hello*)* -> (hello:5)
hello *(d) -> (d:1)
(hello) (*)(d) -> (hello:5) (worl:4) (d:1)
(hello) *(*)r(*)? -> (hello:5) (wo:2) (l:1)
(hello) *(*)r(*)*? -> (hello:5) (wo:2) (0)

All the results are exactly what I was hoping given the parameters. Can it be done faster?

Question 2

__attribute__ ((noinline)) is rarely used. What's your reason behind it?

Question 3

I wouldn't use that in production code; It's merely a means to ensure the code isn't inlined so that I can easily find the function in the disassembled object file. When performance tuning, I like to check what the compiler outputs. I probably shouldn't have left that in when posting here.

Question 4

I see a number of things that may help you improve your code.

Use a `switch` instead of long `if ...else` chain

The pattern matching logic is much easier to see if a swtich statement is used instead of the long if...else chain. The default case can then be used solely for the non-pattern characters and a relatively short if...else chain would go there. On my machine, this also makes the code slightly faster.

Use longer, more meaningful names

Names like n and N are not very descriptive. Instead, you might use capturegroup and groupnum.

Use `const` where practical

The strings should not be altered by the search and the captured array c is simply a list of pointers into the string. For that reason, I would advise changing the signature of the function to this:

int globSearch(const char *pattern, const char *str, const char **capture)

Consider checking for `NULL` pointers

If speed is the primary consideration and if the caller checks, you don't need this, but I'd normally recommend having the called function check for NULL pointers before attempting to dereference them.

Differentiate between `'('` and `')'`

Right now, your program produces the same output for this pattern:

char *m = "(hello) (*)(?*?)?(?)?(*)(*)";

as this pattern:

char *m = ")hello) )*))?*?)?)?)?)*))*)";

Maybe that's OK, but it also produces the same for this pattern:

char *m = "(hello) (*((?*?)?)?)?(*)(*)";

That's a problem because it appears to those familiar with regular expressions that this is pattern contains nested capture patterns, which is not how the code actually interprets the string.

Consider changing the interface

Instead of an array of pointers, it might be more useful to have pointers and counts. Otherwise, any time the values are returned, you'll have to do things like this:

int z = c[x + 1] - c[x];
memcpy(tmp, c[x], z);
tmp[z] = 0;
printf("%s:%d\n", tmp, z);

Pass the size of the array of pointers

Any time you pass an array to a function, you should either use a sentinel value (such as the terminating '0円' character to mark the end of a string) or pass a size with the pointer. Otherwise, the called routine has no way to assure that it won't overrun the end of the array.

Avoid buffer overflow vulnerabilities

I realized it's just demo code, but any time you use memcpy it's important to make sure that the destination buffer has enough space for the number of bytes being copied. That's the case here with these particular fixed strings, but it's better generally to dynamically check that at runtime, especially when the length parameter is not a constant.

Eliminate the `return 0;` in `main`

For many years, the C standard has specified that the compiler must automatically generate the equivalent of return 0; at the end of main so there is no reason to specfically include it manually.

Question 5

I didn't know about return 0 being redundant in main! i agree with most of your suggestions and will implement solutions where it doesn't seriously affect efficiency.

Question 6

I'll comment on some things about your code style that Edward didn't already cover.

Spaces

Put spaces around operators. This makes the code easier to read since one can at a glance distinguish between tokens. This applies to assignment, arithmetic, boolean operators etc. I would also put spaces between if / while and the parenthesized expression after it.

(unsigned) (*wild - 40) <= 1
c[N++] += nm
...

Newlines

It's bad practice to put multiple statements on one line, because you can't put breakpoints on the latter statements. It's also harder to read and does not really achieve anything.

else if (*wild == '?') {
 wild++;
 str++;
}

There was a section on braces, but I edited it out since I hadn't done my research properly.

Question 7

The brace style I use is known as the "Whitesmiths style" and has been used for more than 20 years. Back in the 90's it was one of the most common styles but much less so today. I tend to agree about the newlines in general but prefer where there are only 2 or less simple statements to put them on one line. To my eyes it makes the code more readable as I don't have to scroll so much.

Edward EdwardEdward 67.2k4 gold badges120 silver badges284 bronze badges · Accepted Answer · 2015-06-11 18:05:57Z

I see a number of things that may help you improve your code.

Use a `switch` instead of long `if ...else` chain

The pattern matching logic is much easier to see if a swtich statement is used instead of the long if...else chain. The default case can then be used solely for the non-pattern characters and a relatively short if...else chain would go there. On my machine, this also makes the code slightly faster.

Use longer, more meaningful names

Names like n and N are not very descriptive. Instead, you might use capturegroup and groupnum.

Use `const` where practical

The strings should not be altered by the search and the captured array c is simply a list of pointers into the string. For that reason, I would advise changing the signature of the function to this:

int globSearch(const char *pattern, const char *str, const char **capture)

Consider checking for `NULL` pointers

If speed is the primary consideration and if the caller checks, you don't need this, but I'd normally recommend having the called function check for NULL pointers before attempting to dereference them.

Differentiate between `'('` and `')'`

Right now, your program produces the same output for this pattern:

char *m = "(hello) (*)(?*?)?(?)?(*)(*)";

as this pattern:

char *m = ")hello) )*))?*?)?)?)?)*))*)";

Maybe that's OK, but it also produces the same for this pattern:

char *m = "(hello) (*((?*?)?)?)?(*)(*)";

That's a problem because it appears to those familiar with regular expressions that this is pattern contains nested capture patterns, which is not how the code actually interprets the string.

Consider changing the interface

Instead of an array of pointers, it might be more useful to have pointers and counts. Otherwise, any time the values are returned, you'll have to do things like this:

int z = c[x + 1] - c[x];
memcpy(tmp, c[x], z);
tmp[z] = 0;
printf("%s:%d\n", tmp, z);

Pass the size of the array of pointers

Any time you pass an array to a function, you should either use a sentinel value (such as the terminating '0円' character to mark the end of a string) or pass a size with the pointer. Otherwise, the called routine has no way to assure that it won't overrun the end of the array.

Avoid buffer overflow vulnerabilities

I realized it's just demo code, but any time you use memcpy it's important to make sure that the destination buffer has enough space for the number of bytes being copied. That's the case here with these particular fixed strings, but it's better generally to dynamically check that at runtime, especially when the length parameter is not a constant.

Eliminate the `return 0;` in `main`

For many years, the C standard has specified that the compiler must automatically generate the equivalent of return 0; at the end of main so there is no reason to specfically include it manually.

I didn't know about return 0 being redundant in main! i agree with most of your suggestions and will implement solutions where it doesn't seriously affect efficiency.

Stack Exchange Network

Wildcard string search with capture

2 Answers 2

Use a `switch` instead of long `if ...else` chain

Use longer, more meaningful names

Use `const` where practical

Consider checking for `NULL` pointers

Differentiate between `'('` and `')'`

Consider changing the interface

Pass the size of the array of pointers

Avoid buffer overflow vulnerabilities

Eliminate the `return 0;` in `main`

Spaces

Newlines

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Wildcard string search with capture

2 Answers 2

Use a switch instead of long if ...else chain

Use longer, more meaningful names

Use const where practical

Consider checking for NULL pointers

Differentiate between '(' and ')'

Consider changing the interface

Pass the size of the array of pointers

Avoid buffer overflow vulnerabilities

Eliminate the return 0; in main

Spaces

Newlines

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Use a `switch` instead of long `if ...else` chain

Use `const` where practical

Consider checking for `NULL` pointers

Differentiate between `'('` and `')'`

Eliminate the `return 0;` in `main`