Task:
On an alphabet set [a-z], a simplified regular expression is much simpler than the normal regular expression.
It has only two meta characters: '.' and '*'.
'.' -- exact one arbitrary character match.
'*' -- zero or more arbitrary character match.
Note that it is different from the normal regular expression in that '' is on its own and does NOT act as a decorator on the leading character, I.e. in our simplified regular expression. "" is equivalent to ".*" in regular expression. Write a parser which takes a pattern string and a candidate string as input and decides to either accept or reject the candidate string based on the pattern string.
- Only the full match is acceptable. The parser rejects partial matches.
- You can safely assume all the characters in both input pattern string and candidate string are valid characters, i.e. you don't need to validate the input strings.
- Your parser need to at least pass all the following test cases.
- You can write the parser in any programing language you like.
- Write up an analysis on the run-time complexity of your code.
Example:
| pattern string | candidate string | result | | abc | abc | accept | | * | abc | accept | | *abc | abc | accept | | *abc | aaabbbabc | accept | | a*bc | aaabbbabc | accept | | a*bc | abc | accept | | a* | abc | accept | | a* | a | accept | | a* | aa | accept | | a* | abcdef | accept | | *abc* | abc | accept | | ***** | abc | accept | | ... | abc | accept | | .* | abc | accept | | .bc* | abc | accept | | .b*c*a | abca | accept | | * | /EMPTY STRING/ | accept | | abc | abcd | reject | | *a | abcd | reject | | a | /EMPTY STRING/ | reject | | .a*c | abc | reject | | a.*b | abc | reject | | .. | abc | reject | | /EMPTY STRING/ | /EMPTY STRING/ | reject | | /EMPTY STRING/ | abc | reject |
My Solution:
public class RegxEngine
{
public static boolean match(String regx, String candidate)
{
// Empty regx will always return false
if (regx.isEmpty())
{
return false;
}
else
{
if (regx.charAt(0) == '*')
{
// Last * matches everything
if (regx.length() == 1)
{
return true;
}
else
{
return matchStar(regx.substring(1), candidate);
}
}
// Return false if text is empty but pattern is not *
else if (candidate.isEmpty())
{
return false;
}
else if (regx.charAt(0) == '.' || regx.charAt(0) == candidate.charAt(0))
{
// If the last regx matches the last text
if (regx.length() == 1 && candidate.length() == 1)
{
return true;
}
// If hasn't reached the end, try to match the rest strings
else
{
return match(regx.substring(1), candidate.substring(1));
}
}
else
{
return false;
}
}
}
// Otherwise skip as many chars as required
private static boolean matchStar(String regx, String candidate)
{
for (int i = 0; i < candidate.length(); i++)
{
if (match(regx, candidate.substring(i)))
{
return true;
}
else
{
continue;
}
}
return false;
}
}
This was an interview question and I did get through to the next round. I am asking simply looking for more ideas.
My Questions:
- Is my code \$O(m*n)\$, where \$m\$ is the length of the candidate string and \$n\$ is the length of the pattern string?
- In terms of performance/time complexity, could this be done better?
2 Answers 2
In answer to 1) - No. Consider the regex *********a
and the string b
. Your code will have to consider all 9! (362880) cases before returning false.
That being said, this solution seems to be going in the right direction, and for this restricted language we can stop short of actually building the DFA equivalent to the expression. (Which is expensive in space to create, but will be cheaper each time it is run against a candidate string).
We acheive this by caching the result of each regex-substring for future use (the dynamic programming approach). This prevents the problematic recursive explosion in the original problem and will guarantee an O(n*m) runtime, but could cost up to O(n*m) space for each regex candidate pair.
static Map<Pair<String, String>, boolean> result_cache;
//Assume we have a pair, as in:
//http://stackoverflow.com/a/677248/1331457
public static boolean match(String regx, String candidate)
{
Pair<String, String> key = Pair(regx, candidate);
if result_cache.containsKey(key){
return true;
}
boolean result = false;
// Empty regx will always return false
if (regx.isEmpty())
{
result = false;
}
else
{
if (regx.charAt(0) == '*')
{
// Last * matches everything
if (regx.length() == 1)
{
result = true;
}
else
{
result = matchStar(regx.substring(1), candidate);
}
}
// Return false if text is empty but pattern is not *
else if (candidate.isEmpty())
{
result = false;
}
else if (regx.charAt(0) == '.' || regx.charAt(0) == candidate.charAt(0))
{
// If the last regx matches the last text
if (regx.length() == 1 && candidate.length() == 1)
{
result = true;
}
// If hasn't reached the end, try to match the rest strings
else
{
result = match(regx.substring(1), candidate.substring(1));
}
}
else
{
result = false;
}
}
result_cache.put(key, result);
return result;
}
//Similarly for other methods.
-
\$\begingroup\$ Thanks for the answers. You really made me think when you mentioned
*********a
. What if we preprocess the pattern string to be*a
which is equivalent to*********a
? \$\endgroup\$Sicong– Sicong2012年04月14日 15:35:58 +00:00Commented Apr 14, 2012 at 15:35 -
\$\begingroup\$ This is a good first pass optimisation. But it won't help you in cases like
*.*.*.*.*.*a
,b
. The dynamic programming approach that I suggest above should deal with both in linear time. \$\endgroup\$cmh– cmh2012年04月14日 15:48:20 +00:00Commented Apr 14, 2012 at 15:48 -
\$\begingroup\$ @cmh can you explain why its 9!. \$\endgroup\$whokares– whokares2014年12月23日 11:06:43 +00:00Commented Dec 23, 2014 at 11:06
-
\$\begingroup\$ regex *********a and string b,looks like only 10 matches. \$\endgroup\$whokares– whokares2014年12月23日 11:13:01 +00:00Commented Dec 23, 2014 at 11:13
Can you translate the regular expression into ad-hoc Java code, compile it on the fly, and then execute it?
After it is translated and compiled, which should take less than a second, the execution will be a couple orders of magnitude faster than any data-driven parser.
Explore related questions
See similar questions with these tags.