My goal was to create a function as performant as possible with results being:
- True/False as to whether
wild
matchesstr
. - Optional capture of certain patterns.
My purposes required:
- Case sensitive match.
- Pattern must match entire string to succeed.
- Assumption that
*
and?
do not exist instr
. *
matches zero or more characters.?
matches exactly one character.- Lazy
*
in this case meaning that going from left to right,*
would match as little as possible while still ensuring a match if such were possible.
It's been a challenge as many times I almost had it but then found an edge case that didn't work. This is basically globbing, but I need such to work for strings rather than files and I also hoped to make something much faster than using a regex which IMHO would be overkill for my needs.
I have tested the following to work on many patterns but I can't be certain yet if there isn't a permutation where it fails. I would be very interested if anyone can point out a situation where it fails the terms set above. I'm also interested to hear if there are any ways the code can be improved (made faster).
Hacks like (unsigned)(*wild-40)<=1
are probably unnecessary given the ability of gcc to optimize, but I put it in there anyway.
static __attribute__ ((noinline)) int globSearch(char *wild,char *str,char **c)
{
int N=-1,n=0,nm=0;
while(*str)
{
if(*wild == '*'){ N=n; wild++; }
else if((unsigned)(*wild-40)<=1) // faster version of if(*wild=='('||*wild==')')
{
c[n++]=str;
wild++;
}
else if(*wild == '?'){ wild++; str++; }
else if(*wild == *str)
{
if(nm){while(N < n){ c[N++]+=nm; }}
nm=0;
N=-1;
wild++;
str++;
}
else if(N > -1){ str++; nm++; }
else return 0;
}
while((unsigned)(*wild-40) <= 2) // while( *wild == '(' || *wild == ')' || *wild == '*' )
{
if(*wild++ != '*')c[n++]=str;
}
if(nm){while(N < n){ c[N++]+=nm; }}
return !*wild;
}
Testing function
int main()
{
char tmp[100]={0};
char *c[20]={0};
char *m="(hello) (*)(?*?)?(?)?(*)(*)";
char *str="hello world";
int t=globSearch(m,str,c);
if(t)
{
for(int x=0;c[x]&&c[x+1];x+=2)
{
int z=c[x+1]-c[x];
memcpy(tmp,c[x],z);
tmp[z]=0;
printf("%s:%d\n",tmp,z);
}
}
return 0;
}
The above code returns successful match with patterns captured: (hello:5) (0) (wo:2) (l:1) (0) (0)
Other patterns and results:
*(hello*)*
->(hello:5)
hello *(d)
->(d:1)
(hello) (*)(d)
->(hello:5) (worl:4) (d:1)
(hello) *(*)r(*)?
->(hello:5) (wo:2) (l:1)
(hello) *(*)r(*)*?
->(hello:5) (wo:2) (0)
All the results are exactly what I was hoping given the parameters. Can it be done faster?
2 Answers 2
I see a number of things that may help you improve your code.
Use a switch
instead of long if ...else
chain
The pattern matching logic is much easier to see if a swtich
statement is used instead of the long if...else
chain. The default
case can then be used solely for the non-pattern characters and a relatively short if...else
chain would go there. On my machine, this also makes the code slightly faster.
Use longer, more meaningful names
Names like n
and N
are not very descriptive. Instead, you might use capturegroup
and groupnum
.
Use const
where practical
The strings should not be altered by the search and the captured array c
is simply a list of pointers into the string. For that reason, I would advise changing the signature of the function to this:
int globSearch(const char *pattern, const char *str, const char **capture)
Consider checking for NULL
pointers
If speed is the primary consideration and if the caller checks, you don't need this, but I'd normally recommend having the called function check for NULL
pointers before attempting to dereference them.
Differentiate between '('
and ')'
Right now, your program produces the same output for this pattern:
char *m = "(hello) (*)(?*?)?(?)?(*)(*)";
as this pattern:
char *m = ")hello) )*))?*?)?)?)?)*))*)";
Maybe that's OK, but it also produces the same for this pattern:
char *m = "(hello) (*((?*?)?)?)?(*)(*)";
That's a problem because it appears to those familiar with regular expressions that this is pattern contains nested capture patterns, which is not how the code actually interprets the string.
Consider changing the interface
Instead of an array of pointers, it might be more useful to have pointers and counts. Otherwise, any time the values are returned, you'll have to do things like this:
int z = c[x + 1] - c[x];
memcpy(tmp, c[x], z);
tmp[z] = 0;
printf("%s:%d\n", tmp, z);
Pass the size of the array of pointers
Any time you pass an array to a function, you should either use a sentinel value (such as the terminating '0円'
character to mark the end of a string) or pass a size with the pointer. Otherwise, the called routine has no way to assure that it won't overrun the end of the array.
Avoid buffer overflow vulnerabilities
I realized it's just demo code, but any time you use memcpy
it's important to make sure that the destination buffer has enough space for the number of bytes being copied. That's the case here with these particular fixed strings, but it's better generally to dynamically check that at runtime, especially when the length parameter is not a constant.
Eliminate the return 0;
in main
For many years, the C standard has specified that the compiler must automatically generate the equivalent of return 0;
at the end of main
so there is no reason to specfically include it manually.
-
\$\begingroup\$ I didn't know about return 0 being redundant in main! i agree with most of your suggestions and will implement solutions where it doesn't seriously affect efficiency. \$\endgroup\$poby– poby2015年06月11日 23:16:45 +00:00Commented Jun 11, 2015 at 23:16
I'll comment on some things about your code style that Edward didn't already cover.
Spaces
Put spaces around operators. This makes the code easier to read since one can at a glance distinguish between tokens. This applies to assignment, arithmetic, boolean operators etc. I would also put spaces between if
/ while
and the parenthesized expression after it.
(unsigned) (*wild - 40) <= 1
c[N++] += nm
...
Newlines
It's bad practice to put multiple statements on one line, because you can't put breakpoints on the latter statements. It's also harder to read and does not really achieve anything.
else if (*wild == '?') {
wild++;
str++;
}
There was a section on braces, but I edited it out since I hadn't done my research properly.
-
1\$\begingroup\$ The brace style I use is known as the "Whitesmiths style" and has been used for more than 20 years. Back in the 90's it was one of the most common styles but much less so today. I tend to agree about the newlines in general but prefer where there are only 2 or less simple statements to put them on one line. To my eyes it makes the code more readable as I don't have to scroll so much. \$\endgroup\$poby– poby2015年06月14日 00:06:03 +00:00Commented Jun 14, 2015 at 0:06
__attribute__ ((noinline))
is rarely used. What's your reason behind it? \$\endgroup\$