The whole question is Simplified regular expression engine. I have solved the question, and in turn felt the need to get it reviewed. Any suggestions for clean up and optimization would help. Also please verify my complexity: Complexity: O (n), where n is the length of the longer or regex or string.
public final class Regex {
private static final char STAR = '*';
private static final char DOT = '.';
private Regex() {};
/**
* The index of first non-star character following first occurence of star
* eg: if regex is a**b, then index is startIndex is 1 and return value is 3 (index of b)
*
* @param regex : the regex string.
* @param starIndex : index of the first occurene of * after a non *, eg: a**b, then starIndex would be 1
* @return : the index of first non-star character following first occurence of star
*/
private static int getNonStarIndex(String regex, int starIndex) {
assert regex != null;
assert starIndex >= 0 && starIndex < regex.length();
for (int k = starIndex + 1; k < regex.length(); k++) {
if (regex.charAt(k) != '*') {
return k;
}
}
return regex.length();
}
/**
* A * means 0 or more number of any char matches
* A . means just a single any char match.
*
* @param regex : The regex to match the string again
* @param str : The string to be matched.
* @return : true / false
*/
public static boolean match(String regex, String str) {
if (regex == null) throw new NullPointerException("The regex cannot be null");
if (str == null) throw new NullPointerException("The str cannot be null");
int i = 0;
int j = 0;
while ((i < regex.length() && j < str.length())) {
if (regex.charAt(i) == str.charAt(j) || regex.charAt(i) == DOT) {
i++; j++;
} else if (regex.charAt(i) == STAR) {
int nextNonStarIndex = getNonStarIndex(regex, i);
/*
* For a case where regex and string, both have not yet reached the last char but regex is all *'s till end.
* Eg: Regext abc*** && String is abcxyz
*/
if (nextNonStarIndex == regex.length()) return true;
/*
* regex: a*.*c, abc
*/
if (regex.charAt(nextNonStarIndex) == DOT) {i = nextNonStarIndex + 1; j++; continue;}
boolean match = false;
for (int k = j; k < str.length(); k++) {
if (regex.charAt(nextNonStarIndex) == str.charAt(k)) {
// now move to the next char, after a match succeeds with first char after adjacent stars.
i = nextNonStarIndex + 1;
j = k + 1;
match = true;
break;
}
}
if (!match) {
return false;
}
} else {
return false;
}
}
/*
* String has reached the end but not the regex.
* regx = "abc**" and str = "abc";
*/
if (i != regex.length()) {
for (int x = i; x < regex.length() ; x++) {
if (regex.charAt(x) != STAR) return false;
}
}
if (j != str.length()) {
return false;
}
return true;
}
public static void main(String[] args) {
String[] regexList = { "abc****", "abc", "a*b*c", "****abc", "**a**b**c****", "abc*"};
String str1 = "abc";
for (String regex : regexList) {
System.out.println(Regex.match(regex, str1));
}
String regex = "abc****";
String[] strList1 = {"abcxyz", "abcx", "abc"};
for (String str : strList1) {
System.out.println(Regex.match(regex, str));
}
regex = "***abc";
String[] strList2 = {"xyzabc", "xabc", "abc"};
for (String str : strList2) {
System.out.println(Regex.match(regex, str));
}
System.out.println(Regex.match("a.c", "abc"));
System.out.println(Regex.match("a*.*c", "abc"));
}
}
1 Answer 1
Style
- Use 'final' keyword where you can. It enables the JIT compiler to make better choices when optimizing, and it also ensures that bugs don't creep in to your code. For example,
private static int getNonStarIndex(String regex, int starIndex) {...}
should beprivate static final int getNonStarIndex(final String regex, final int starIndex) {...}
andpublic static boolean match(String regex, String str) {...}
should bepublic static final boolean match(final String regex, final String str) {...}
- Use better names for non-for-loop variables.
i
is OK forfor (int i = 0; i < ...; i++)
because this is a clear 'standard', but usingi
andj
are poor choices for variables that are manipulated inside the loops.rexpos
andstrpos
are examples of names I would use. - I always put '1-line-blocks' on their own line, but I accept that
if (regex == null) throw new NullPointerException("The regex cannot be null");
type lines are OK. But, the following 1-liner is never OK:if (regex.charAt(nextNonStarIndex) == DOT) {i = nextNonStarIndex + 1; j++; continue;}
Functional
More detailed assessment.... I added the following tests to your code:
System.out.println("Should be false from here");
System.out.println(Regex.match("abc", "abcd"));
System.out.println(Regex.match("*a", "abcd"));
System.out.println(Regex.match("a", ""));
System.out.println(Regex.match(".a*c", "abc"));
System.out.println(Regex.match("a.*b", "abc"));
System.out.println(Regex.match("..", "abc"));
System.out.println(Regex.match("", ""));
System.out.println(Regex.match("", "abc"));
and discovered that your code fails for Regex.match("", "")
. This is just bad form. The 'assignment' gave a whole bunch of positive and negative test cases, but you only chose to test the positive ones. That counts as 'half-a-job' when testing is concerned. It took me a few minutes to copy/paste the tests in and found what should be a 'fail-the-assignment' condition.
Then I looked at your algorithm and figured it was 'failing' early tests too early... let me try to explain.... consider the following:
Regex.match("a*.b.*c", "axxbxxbxxc");
The above is not one of the official test cases, but, it should return true
, but your code returns false
.
This is because you look for only the most 'obvious' match...
Recommendations
The algorithm you have chosen is too simplistic, having only 1 loop inside the 'STAR' handling loop.
I re-worked your code to use recursion instead, and it handles things much better. Without giving the whole algorithm away, consider a recursive function (copy/paste of your code, renamed match
to be matchRecursive
):
public static boolean matchRecursive(final String regex, int regexpos,
final String str, int strpos) {
while ((regexpos < regex.length() && strpos < str.length())) {
if (regex.charAt(regexpos) == str.charAt(strpos) || regex.charAt(regexpos) == DOT) {
regexpos++;
strpos++;
} else if (regex.charAt(regexpos) == STAR) {
regexpos++;
// see if there's anything remaining in the regex.
// if not, return true (star matches rest of str).
// otherwise, recursively check if the rest of the regex
// matches anywhere in the rest of the str....
while (strpos < str.length()) {
if (matchRecursive(regex, regexpos, str, strpos) {
return true;
}
strpos++;
}
} else {
return false;
}
}
// tidy up odd use-cases like:
// regexes ending in * when there's nothing left in str.
// empty regex
// etc.
return ....;
}
-
\$\begingroup\$ Firstly a huge thanks, improved code and re-posted. Secondly "".matches(""); returns true in java, so keeping implementation aligned with it. \$\endgroup\$JavaDeveloper– JavaDeveloper2013年11月18日 03:04:39 +00:00Commented Nov 18, 2013 at 3:04
-
\$\begingroup\$ This is a great answer except for the first suggestion. The
final
modifier has three distinct roles: On classes, it makes them uninheritable, on methods, it makes them non-overridable, and on variables, it makes them assign-once. Disallowing a subclass to override a method is more often than not a stupid idea (compare Open-Closed Principle), whereas single-assignment is a useful technique to reduce bugs. Note that there are few cases where afinal
modifier on a variable would actually affect performance. \$\endgroup\$amon– amon2013年11月18日 07:42:48 +00:00Commented Nov 18, 2013 at 7:42 -
1\$\begingroup\$ I think you underestimate the power of
final
. Apart from the logic-constraints you describe above, it has proven to be a defensive bug-trapping tool, performance enhancing, and intent-revealing mechanism. This is in my experience at least.final
is like a door. It is better to close your doors even when you live in a safe area because an open door is an invitation to come in. Regardless, this is an opinion thing, and will never be resolved. By the way, I prefer vi over emacs, eclipse over intellij, and linux over windows or mac... . ;-) \$\endgroup\$rolfl– rolfl2013年11月18日 14:45:11 +00:00Commented Nov 18, 2013 at 14:45
Explore related questions
See similar questions with these tags.
System.out.println(Regex.match("", ""));
\$\endgroup\$