I want to parse a date string with the following general format: Weekday, DD-Mon-YYYY HH:MM:SS. The parser should also accept:
- a date without the weekday
- spaces instead of dashes
- case-insensitive month (e.g., allow "Jan", "JAN" and "jAn")
- two-digit year
- a missing timezone
- multiple spaces wherever a single space is allowed

Return null when the input isn't a valid date string. Only checks that don't require cross-referencing different fields are needed. For example, the weekday string "XXX" is invalid, but "Fri, 09-Jun-2015" is considered valid even though that date was a Tuesday.
My code:
    public static Calendar parseDate(String input) {
        List<String> months = Arrays.asList("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec");
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("GMT"));
        Pattern pattern = Pattern.compile("(?:[A-Z][a-z][a-z],\\s+)?([0-2][0-9]|[3][0-1])(?:\\s+|-)([a-zA-z]{3})(?:\\s+|-)([0-9]{2,4})(?:\\s+)([0-1][0-9]|[2][0-3]):([0-5][0-9]):([0-9]{2})(?:\\s+GMT|$)");
        Matcher matcher = pattern.matcher(input);
        Calendar calendar = Calendar.getInstance(TimeZone.getTimeZone("GTM"));
        if (!matcher.find()) {
            return null;
        }
        int dayOfMonth = Integer.parseInt(matcher.group(1)); // group 1 is the day of month; the optional weekday prefix is non-capturing
        int month = months.indexOf(matcher.group(2).toLowerCase());
        int year = Integer.parseInt(matcher.group(3));
        if (year >= 0 && year <= 69) {
            year += 2000;
        }
        if (year >= 70 && year <= 99) {
            year += 1900;
        }
        int hours = Integer.parseInt(matcher.group(4));
        int minutes = Integer.parseInt(matcher.group(5));
        int seconds = Integer.parseInt(matcher.group(6));
        cal.set(year, month, dayOfMonth, hours, minutes, seconds);
        return cal;
    }
My question is how to improve the regex "(?:[A-Z][a-z][a-z],\\s+)?([0-2][0-9]|[3][0-1])(?:\\s+|-)([a-zA-z]{3})(?:\\s+|-)([0-9]{2,4})(?:\\s+)([0-1][0-9]|[2][0-3]):([0-5][0-9]):([0-9]{2})(?:\\s+GMT|$)"?

EDIT: I can only import the regex Matcher and Pattern classes and ArrayList, and the only additional methods allowed are months.indexOf, Calendar.set, Integer.parseInt and String.toLowerCase.
Tester:
    public void testParseDate() {
        DateFormat df = DateFormat.getDateTimeInstance();
        df.setTimeZone(TimeZone.getTimeZone("GMT"));
        ArrayList<String> inputs = new ArrayList<String>();
        ArrayList<Long> expect = new ArrayList<Long>();
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("GMT"));
        cal.set(Calendar.MILLISECOND, 0);
        cal.set(2012,4,23,1,23,31);
        long time1 = cal.getTimeInMillis();
        inputs.add("Wed, 23-May-12 01:23:31 GMT"); expect.add(time1);
        inputs.add("23-May-12 01:23:31 GMT"); expect.add(time1);
        inputs.add("23-May-12 01:23:31"); expect.add(time1);
        inputs.add("23 May 2012 01:23:31 GMT"); expect.add(time1);
        inputs.add("23 May 2012 01:23:31 GMT"); expect.add(time1);
        inputs.add("Wed, 23 May 2012 01:23:31 GMT"); expect.add(time1);
        inputs.add("23 mAy 2012 01:23:31 GMT"); expect.add(time1);
        inputs.add("23 maY 12 01:23:31"); expect.add(time1);
        inputs.add("23 jan 12 01:23:31"); cal.set(2012,0,23,1,23,31); expect.add(cal.getTimeInMillis());
        inputs.add("01 AUG 12 15:23:31 GMT"); cal.set(2012,7,1,15,23,31); expect.add(cal.getTimeInMillis());
        inputs.add("01 bla 12 15:23:31 GMT"); expect.add(null);
        inputs.add("1 May 2012 15:23:31 GMT"); expect.add(null);
        inputs.add("01 May 2012 15:23:31 BLA"); expect.add(null);
        inputs.add("01 May 2012-15:23:31"); expect.add(null);
        inputs.add("01 May 2012-15/23:31"); expect.add(null);
        inputs.add("01 May 2012-15/23:31"); expect.add(null);
        for (int i = 0; i < inputs.size(); ++i) {
            Long expectTime = expect.get(i);
            Calendar output = RegexpPractice.parseDate(inputs.get(i));
            if (expectTime == null) {
                assertNull(String.format("Test %d failed: Parsing <<%s>> (should be null)", i, inputs.get(i)), output);
                continue;
            } else {
                assertNotNull(String.format("Test %d failed: Parsing <<%s>> (unexpected null)", i, inputs.get(i)), output);
            }
            output.set(Calendar.MILLISECOND, 0);
            long outTime = output.getTimeInMillis();
            assertEquals(String.format("Test %d failed: Parsing <<%s>> (was %s not %s)",
                    i, inputs.get(i), df.format(outTime), df.format(expectTime)),
                    (long) expectTime, outTime);
        }
    }
1 Answer
You could use named capturing groups, which clarify both the meaning of each capturing group in the regex and the method calls on the matcher. Furthermore, with named capturing groups you no longer depend on the order of the groups, and you don't need to "skip" captured groups that you do not want to read (like the weekday).

You could also break the regex into separate Strings, one per capturing group, to further enhance readability:

    String weekdayRegex = "(?<weekday>[A-Z][a-z][a-z],\\s+)";
    String dayOfMonthRegex = "(?<dayOfMonth>[0-2][0-9]|[3][0-1])";
    ...
    Pattern pattern = Pattern.compile(weekdayRegex + "?" + dayOfMonthRegex + ...

Then, instead of matcher.group(1), you can write matcher.group("dayOfMonth").
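To make the idea concrete, here is a minimal sketch of a complete pattern assembled from named fragments. The class name NamedGroupDateParser, the constant names and every group name other than dayOfMonth are hypothetical. The stricter day, year and seconds ranges, and the null return for an unknown month, are my reading of the requirements rather than something taken from the original code. It assumes Java 7 or later, where named groups are available, and it does not try to honour the import restrictions mentioned in the question.

```java
import java.util.Arrays;
import java.util.Calendar;
import java.util.List;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; the question's tester calls RegexpPractice.parseDate.
final class NamedGroupDateParser {

    private static final List<String> MONTHS = Arrays.asList(
            "jan", "feb", "mar", "apr", "may", "jun",
            "jul", "aug", "sep", "oct", "nov", "dec");

    // One fragment per field, each with its own named group.
    private static final String WEEKDAY = "(?:(?<weekday>[A-Z][a-z][a-z]),\\s+)?";
    private static final String DAY = "(?<dayOfMonth>0[1-9]|[12][0-9]|3[01])";
    private static final String SEP = "(?:\\s+|-)";
    private static final String MONTH = "(?<month>[a-zA-Z]{3})";
    // Assumption: only two-digit or four-digit years are wanted, not three.
    private static final String YEAR = "(?<year>[0-9]{4}|[0-9]{2})";
    private static final String TIME =
            "(?<hours>[01][0-9]|2[0-3]):(?<minutes>[0-5][0-9]):(?<seconds>[0-5][0-9])";
    private static final String ZONE = "(?:\\s+GMT)?";

    private static final Pattern DATE = Pattern.compile(
            WEEKDAY + DAY + SEP + MONTH + SEP + YEAR + "\\s+" + TIME + ZONE);

    static Calendar parseDate(String input) {
        Matcher matcher = DATE.matcher(input);
        // matches() requires the whole input to fit the pattern.
        if (!matcher.matches()) {
            return null;
        }
        int month = MONTHS.indexOf(matcher.group("month").toLowerCase());
        if (month < 0) {
            return null; // three letters, but not a month abbreviation
        }
        int year = Integer.parseInt(matcher.group("year"));
        // Same two-digit pivot as the question's code: 00-69 -> 20xx, 70-99 -> 19xx.
        if (year <= 69) {
            year += 2000;
        } else if (year <= 99) {
            year += 1900;
        }
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("GMT"));
        cal.set(year, month,
                Integer.parseInt(matcher.group("dayOfMonth")),
                Integer.parseInt(matcher.group("hours")),
                Integer.parseInt(matcher.group("minutes")),
                Integer.parseInt(matcher.group("seconds")));
        return cal;
    }
}
```

Because the sketch calls matcher.matches() rather than find(), trailing text such as " BLA" is rejected without explicit ^ and $ anchors, and reordering the fragments would not require renumbering any group lookups.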
TimeZone.getTimeZone("GTM")? Surely you mean "GMT".
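A typo like this is easy to miss because, per the JDK documentation, TimeZone.getTimeZone returns the GMT zone when it cannot understand the given ID instead of throwing. The small check below illustrates that; the class and method names are hypothetical, and it assumes a standard JDK in which "GTM" is not a registered zone ID.

```java
import java.util.TimeZone;

// Hypothetical helper for illustration only.
final class TimeZoneTypoDemo {

    // Returns the ID of the zone that TimeZone.getTimeZone actually resolves.
    static String resolvedId(String requestedId) {
        return TimeZone.getTimeZone(requestedId).getID();
    }
}
```

Calling TimeZoneTypoDemo.resolvedId("GTM") yields "GMT", so the misspelled call happens to behave like the intended one here, but the same silent fallback would hide a misspelling of any other zone.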