Simplified regular expression engine

Question 1

Task:

On an alphabet set [a-z], a simplified regular expression is much simpler than the normal regular expression.

It has only two meta characters: '.' and '*'.

'.' -- exact one arbitrary character match.

'*' -- zero or more arbitrary character match.

Note that it is different from the normal regular expression in that '' is on its own and does NOT act as a decorator on the leading character, I.e. in our simplified regular expression. "" is equivalent to ".*" in regular expression. Write a parser which takes a pattern string and a candidate string as input and decides to either accept or reject the candidate string based on the pattern string.

Only the full match is acceptable. The parser rejects partial matches.

You can safely assume all the characters in both input pattern string and candidate string are valid characters, i.e. you don't need to validate the input strings.

Your parser need to at least pass all the following test cases.

You can write the parser in any programing language you like.

Write up an analysis on the run-time complexity of your code.

Example:
| pattern string | candidate string | result |
| abc | abc | accept |
| * | abc | accept |
| *abc | abc | accept |
| *abc | aaabbbabc | accept |
| a*bc | aaabbbabc | accept |
| a*bc | abc | accept |
| a* | abc | accept |
| a* | a | accept |
| a* | aa | accept |
| a* | abcdef | accept |
| *abc* | abc | accept |
| ***** | abc | accept |
| ... | abc | accept |
| .* | abc | accept |
| .bc* | abc | accept |
| .b*c*a | abca | accept |
| * | /EMPTY STRING/ | accept |
| abc | abcd | reject |
| *a | abcd | reject |
| a | /EMPTY STRING/ | reject |
| .a*c | abc | reject |
| a.*b | abc | reject |
| .. | abc | reject |
| /EMPTY STRING/ | /EMPTY STRING/ | reject |
| /EMPTY STRING/ | abc | reject |

My Solution:

public class RegxEngine
{
 public static boolean match(String regx, String candidate)
 {
 // Empty regx will always return false
 if (regx.isEmpty())
 {
 return false;
 }
 else
 {
 if (regx.charAt(0) == '*')
 {
 // Last * matches everything
 if (regx.length() == 1)
 {
 return true;
 }
 else
 {
 return matchStar(regx.substring(1), candidate);
 }
 }
 // Return false if text is empty but pattern is not *
 else if (candidate.isEmpty())
 {
 return false;
 }
 else if (regx.charAt(0) == '.' || regx.charAt(0) == candidate.charAt(0))
 {
 // If the last regx matches the last text
 if (regx.length() == 1 && candidate.length() == 1)
 {
 return true;
 }
 // If hasn't reached the end, try to match the rest strings
 else
 {
 return match(regx.substring(1), candidate.substring(1));
 }
 }
 else
 {
 return false; 
 }
 }
 }
 // Otherwise skip as many chars as required
 private static boolean matchStar(String regx, String candidate)
 {
 for (int i = 0; i < candidate.length(); i++)
 {
 if (match(regx, candidate.substring(i)))
 {
 return true;
 }
 else
 {
 continue;
 }
 }
 
 return false;
 }
}

This was an interview question and I did get through to the next round. I am asking simply looking for more ideas.

My Questions:

Is my code \$O(m*n)\$, where \$m\$ is the length of the candidate string and \$n\$ is the length of the pattern string?
In terms of performance/time complexity, could this be done better?

Question 2

In answer to 1) - No. Consider the regex *********a and the string b. Your code will have to consider all 9! (362880) cases before returning false.

That being said, this solution seems to be going in the right direction, and for this restricted language we can stop short of actually building the DFA equivalent to the expression. (Which is expensive in space to create, but will be cheaper each time it is run against a candidate string).

We acheive this by caching the result of each regex-substring for future use (the dynamic programming approach). This prevents the problematic recursive explosion in the original problem and will guarantee an O(n*m) runtime, but could cost up to O(n*m) space for each regex candidate pair.

static Map<Pair<String, String>, boolean> result_cache;
//Assume we have a pair, as in:
//http://stackoverflow.com/a/677248/1331457
public static boolean match(String regx, String candidate)
{
 Pair<String, String> key = Pair(regx, candidate);
 if result_cache.containsKey(key){
 return true;
 }
 boolean result = false;
 // Empty regx will always return false
 if (regx.isEmpty())
 {
 result = false;
 }
 else
 {
 if (regx.charAt(0) == '*')
 {
 // Last * matches everything
 if (regx.length() == 1)
 {
 result = true;
 }
 else
 {
 result = matchStar(regx.substring(1), candidate);
 }
 }
 // Return false if text is empty but pattern is not *
 else if (candidate.isEmpty())
 {
 result = false;
 }
 else if (regx.charAt(0) == '.' || regx.charAt(0) == candidate.charAt(0))
 {
 // If the last regx matches the last text
 if (regx.length() == 1 && candidate.length() == 1)
 {
 result = true;
 }
 // If hasn't reached the end, try to match the rest strings
 else
 {
 result = match(regx.substring(1), candidate.substring(1));
 }
 }
 else
 {
 result = false; 
 }
 }
 result_cache.put(key, result);
 return result;
}
//Similarly for other methods.

Question 3

Thanks for the answers. You really made me think when you mentioned *********a. What if we preprocess the pattern string to be *a which is equivalent to *********a?

Question 4

This is a good first pass optimisation. But it won't help you in cases like *.*.*.*.*.*a, b. The dynamic programming approach that I suggest above should deal with both in linear time.

Question 5

@cmh can you explain why its 9!.

Question 6

regex *********a and string b,looks like only 10 matches.

Question 7

Can you translate the regular expression into ad-hoc Java code, compile it on the fly, and then execute it?

After it is translated and compiled, which should take less than a second, the execution will be a couple orders of magnitude faster than any data-driven parser.

cmh cmh 4953 silver badges9 bronze badges · Accepted Answer · 2012-04-14 00:13:19Z

In answer to 1) - No. Consider the regex *********a and the string b. Your code will have to consider all 9! (362880) cases before returning false.

That being said, this solution seems to be going in the right direction, and for this restricted language we can stop short of actually building the DFA equivalent to the expression. (Which is expensive in space to create, but will be cheaper each time it is run against a candidate string).

We acheive this by caching the result of each regex-substring for future use (the dynamic programming approach). This prevents the problematic recursive explosion in the original problem and will guarantee an O(n*m) runtime, but could cost up to O(n*m) space for each regex candidate pair.

static Map<Pair<String, String>, boolean> result_cache;
//Assume we have a pair, as in:
//http://stackoverflow.com/a/677248/1331457
public static boolean match(String regx, String candidate)
{
 Pair<String, String> key = Pair(regx, candidate);
 if result_cache.containsKey(key){
 return true;
 }
 boolean result = false;
 // Empty regx will always return false
 if (regx.isEmpty())
 {
 result = false;
 }
 else
 {
 if (regx.charAt(0) == '*')
 {
 // Last * matches everything
 if (regx.length() == 1)
 {
 result = true;
 }
 else
 {
 result = matchStar(regx.substring(1), candidate);
 }
 }
 // Return false if text is empty but pattern is not *
 else if (candidate.isEmpty())
 {
 result = false;
 }
 else if (regx.charAt(0) == '.' || regx.charAt(0) == candidate.charAt(0))
 {
 // If the last regx matches the last text
 if (regx.length() == 1 && candidate.length() == 1)
 {
 result = true;
 }
 // If hasn't reached the end, try to match the rest strings
 else
 {
 result = match(regx.substring(1), candidate.substring(1));
 }
 }
 else
 {
 result = false; 
 }
 }
 result_cache.put(key, result);
 return result;
}
//Similarly for other methods.

Thanks for the answers. You really made me think when you mentioned *********a. What if we preprocess the pattern string to be *a which is equivalent to *********a?
This is a good first pass optimisation. But it won't help you in cases like *.*.*.*.*.*a, b. The dynamic programming approach that I suggest above should deal with both in linear time.

Stack Exchange Network

Simplified regular expression engine

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Simplified regular expression engine

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions