I've created a regular expression (regex) parsing library in C, and would like some feedback on it. Speed is really important to me, but any and all suggestions are acceptable.
#include <ctype.h>
static int regex_matchHere(const char *regex, char *s, int *len);
static int regex_matchGroup(int c, int group);
static int regex_matchQuantity(int quant, int c, const char *regex, char *s, int *len);
int regex_match(const char *regex, char *s, int *len)
{
char *p = s;
/* force match from the beginning of the string */
if (regex[0] == '^') return (regex_matchHere(regex + 1, s, len) ? 0 : -1);
/* iterate the string to find matching position */
do
{
*len = 0;
if (regex_matchHere(regex, p, len)) return (int)(p - s);
} while (*p++ != '0円');
return -1;
}
static int regex_matchHere(const char *regex, char *s, int *len)
{
int c = regex[0];
if (regex[0] == '0円') return 1; /* end of regex = full match */
else if (regex[0] == '$' && regex[1] == '0円') return (*s == '0円'); /* check end of string */
else if (regex[0] == '\\' && regex[1] != '0円') /* check escaped symbol */
{
c = regex[1];
if (c != '^' && c != '$' && c != '\\' && c != '+' && c != '*' && c != '-' && c != '?') c = c | 0x100;
regex = regex + 1;
}
/* check for special operators *,+,?,- */
if (regex[1] == '*' || regex[1] == '+' || regex[1] == '-' || regex[1] == '?') return regex_matchQuantity(regex[1], c, regex+2, s, len);
else if (*s != '0円' && regex_matchGroup(*s, c))
{
*len = *len + 1;
return regex_matchHere(regex+1, s+1, len);
}
return 0;
}
static int regex_matchGroup(int c, int group)
{
if ((group & 0xff) == '.') group ^= 0x100;
if (group < 0x100) return c == group; /* a single char */
/* a meta char, like \d, ... */
switch (group & 0xff)
{
case 'd': return isdigit(c);
case 's': return isspace(c);
case 'D': return !isdigit(c);
case 'S': return !isspace(c);
case '.': return 1;
}
return 0;
}
static int regex_matchQuantity(int quant, int c, const char *regex, char *s, int *len)
{
if (quant == '?')
{
if (regex_matchGroup(*s, c))
{
*len = *len + 1;
s = s + 1;
}
return regex_matchHere(regex, s, len);
}
if (quant == '+' || quant == '*') /* match as much as possible */
{
char *p;
for (p = s; *p != '0円' && regex_matchGroup(*p, c); p++) *len = *len + 1;
if (quant == '+' && p == s) return 0;
do
{
if (regex_matchHere(regex, p, len)) return 1;
*len = *len - 1;
} while (p-- > s);
}
else if (quant == '-') /* match as little as possible */
{
do
{
if (regex_matchHere(regex, s, len)) return 1;
*len = *len + 1;
} while (*s != '0円' && regex_matchGroup(*s++, c));
}
return 0;
}
-
2\$\begingroup\$ Although it can be divined from the code, it wouldn't hurt to explicitly state the exact sort of regular expressions this is intended to parse/match. \$\endgroup\$Jerry Coffin– Jerry Coffin2014年03月11日 14:50:03 +00:00Commented Mar 11, 2014 at 14:50
1 Answer 1
What you did well
The code seems clean and logically organized. I like your 0x100-bit hack to indicate special characters. You could make that convention more obvious in the comments, though.
What you could improve on
The return value of
regex_match()
is weird. I'd like it to return a non-zero value if the match succeeded, and a zero value if the match failed, so that I can call it like this:if (regex_match(...)) { // Do stuff for successful match } else { // Do stuff for failed match }
Trying to return the position of the match just leads to confusion, reminiscent of the way PHP's
strpos()
returns 0 to indicate a successful match at the beginning of the subject (butFALSE
to indicate a non-match). You don't want to be like PHP, do you?I suggest that the signature for
regex_match()
should look like this:/** * Returns 1 if matched, 0 if not matched. * * Pass a pointer to a match_result if you care to find out the * details of the match (its length, position, and possibly other * information supported in the future, such as parenthesized * capture groups), or pass a NULL if you don't care about the details. */ int regex_match(const char *regex, const char *subject, struct match_result *result);
Alternatively, return a pointer to a new
struct match_result
if the match succeeded. The caller would have tofree()
the result later, though, so I don't like it as much.Regular expressions often include modifier flags, such as a case-insensitive flag or a continue-searching-where-the-previous-match-ended flag. You might want to plan your interfaces accordingly. (To support the latter, the
struct match_result*
would probably become an in-out parameter rather than an out-parameter.)For performance, regular expressions are frequently compiled into an automaton. You interpret the regular expressions as you go. You may wish design the library's interface to have a
regex_compile()
function that transforms the expression into a struct that is meaningful to your library but opaque to the user. For now, the "compilation" could just be the identity transformation; you can enhance it later when the need for better performance arises or when you enhance the feature set of the regular expressions.The function name
regex_matchGroup()
confuses me. "Group" implies something like parentheses, I think.regex_matchAtom()
might be a more appropriate name.You need unit tests!
-
4\$\begingroup\$ This is a nice review, except for the part about the return value – returning an index an expecting the caller to do
if (0 <= regex_match(...))
is perfectly natural. The problem with PHP is that it took a C-ish idiom and messed it up by returningfalse
instead of-1
, but that does not discredit the original idiom. \$\endgroup\$amon– amon2014年03月11日 08:50:41 +00:00Commented Mar 11, 2014 at 8:50
Explore related questions
See similar questions with these tags.