Classification of Date/Time/DateTime

Question 1

I'd like to share a methodology I've been using when faced with a classification problem. This particular example is designed to classify time data as Date/Time/DateTime, though I believe it could easily be adapted to many other problems.

The basic idea is:

Create a list of regular expressions (regexes) that match features in the string to be classified. The regexes should be ordered in the list from highest to lowest importance.
Create a match_code: a list recording which regex matched at each position in the string. Where several regexes match at the same position, only the match of highest importance is kept.
Create an answer_table that contains the possible classifications.
Create a response_table that contains known match_codes, each placed at the index of its classification in the answer_table.
Check membership of the match_code in the response_table and return the answer that corresponds to the match.
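
To make the match_code step concrete, here is a minimal sketch. It assumes a shortened, hypothetical pattern list rather than the full one in the code further down, and the names PATTERNS and build_match_code are made up for this illustration.

```python
import itertools
import re

# Hypothetical, shortened pattern list, ordered from highest to lowest importance.
PATTERNS = [
    r'\d+',      # 0: digits
    r':',        # 1: time separator
    r'am|pm',    # 2: meridiem
    r'[a-z]+',   # 3: any other word (lowest importance)
]


def build_match_code(text):
    """Return the pattern indexes in the order their matches start in text."""
    hits = [(match.start(), index)
            for index, pattern in enumerate(PATTERNS)
            for match in re.finditer(pattern, text)]
    # Sorting by (start, index) puts the most important pattern first
    # among matches that begin at the same position.
    hits.sort()
    # Keep only the first (most important) hit for each start position.
    return [next(group)[1]
            for _, group in itertools.groupby(hits, key=lambda hit: hit[0])]


print(build_match_code('10:30 pm'))  # [0, 1, 0, 2]
```

Here 'pm' is matched by both pattern 2 and pattern 3 at the same position, and only the more important pattern 2 ends up in the match_code.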

I'm aware machine learning solutions are more robust. This solution came about when I was attempting to create a data set to teach such a system. In the end I found I didn't actually need to go down that path.

Currently the creation of the response table is manual.

I'm quite new to Python programming and felt feedback would be useful. Review at leisure.

import re
import itertools


def time_classifier(string):
    '''
    classify the string according to features extracted by regular expressions
    '''
    _string = string
    _string = str.lower(_string)
    _string = re.sub(r'\s', ' ', _string).strip()
    #Strings to remove
    match_code = [[]] * 10
    match_code[0] = re.finditer(r'\d+', _string)
    match_code[1] = re.finditer(r'[/\-.|]', _string)
    match_code[2] = re.finditer(r':', _string)
    match_code[3] = re.finditer(r'am|pm', _string)
    match_code[4] = re.finditer(r'jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?'
                                r'|may|jun(e)?|jul(y)?|aug(ust)?'
                                r'|sep(tember)?|oct(ober)?|nov(ember)?|dec(ember)?', _string)
    match_code[5] = re.finditer(r'mon(day)?|tue(sday)?|wed(nesday)?|thu(rsday)?'
                                r'|fri(day)?|sat(urday)?|sun(day)?', _string)
    match_code[6] = re.finditer(r',', _string)
    match_code[7] = re.finditer(r'today|tomorrow|yesterday', _string)
    match_code[8] = re.finditer(r'aest', _string)
    match_code[9] = re.finditer(r'[a-z]+', _string)
    #Convert re items to an ordered list
    match_code = construct_key(match_code)
    #These are the possible classifications
    answer_table = ['Date', 'Time', 'DateTime']
    response_table = [[]] * len(answer_table)
    #Each response_type element correlates to the elements in the answer_key
    ####################DATE####################
    response_table[0] = [[3, 1, 0, 1, 0, 3],
                         [5, 0, 1, 0, 1, 0],
                         [0, 1, 0],
                         [5, 0, 1, 0],
                         [0, 1, 0, 1, 0, 1, 0],
                         [4, 0, 1, 0, 6, 0, 3],
                         [4, 0],
                         [3, 4, 0, 1, 0, 6, 0, 3],
                         [5, 4, 0, 0],
                         [5, 6, 0, 4, 1],
                         [0, 4, 0],
                         [5, 6, 0, 4, 0, 9],
                         [5, 1, 4, 0],
                         [7, 1, 4, 0],
                         [9, 0, 1, 5, 0, 1, 0, 1, 0],
                         [4, 6, 0, 0],
                         [0, 3]]
    ####################TIME####################
    response_table[1] = [[0, 2, 0, 3],
                         [0, 2, 0],
                         [0, 2, 0, 2, 0, 0],
                         [0, 0, 2, 0],
                         [0, 2, 0, 1],
                         [0, 2, 0, 2, 0],
                         [0, 0, 2, 0, 3],
                         [9, 9, 0, 0, 2, 0],
                         [0, 2, 0, 2, 0, 8],
                         [0, 2, 0, 8],
                         [9, 0, 0, 2, 0, 3],
                         [9, 0, 0, 2, 0, 3]]
    ####################DATETIME####################
    response_table[2] = [[5, 0, 4, 1, 0, 2, 0, 3],
                         [5, 0, 4, 0, 2, 0],
                         [5, 0, 4, 0, 0, 2, 0, 3],
                         [0, 2, 0, 3, 0, 4],
                         [0, 1, 0, 1, 0, 0, 2, 0],
                         [5, 0, 4, 1, 0, 2, 0],
                         [1, 5, 0, 4, 0, 2, 0],
                         [5, 6, 0, 4, 0],
                         [0, 1, 0, 1, 0, 0, 2, 0, 3, 0],
                         [0, 4, 0, 0, 2, 0, 3],
                         [5, 0, 4, 1, 0, 2, 0, 3, 9],
                         [0, 4, 6, 0, 2, 0, 8],
                         [4, 0, 6, 0, 6, 0, 1, 0, 3],
                         [4, 6, 0, 0, 6, 0, 2, 0],
                         [4, 6, 0, 9, 0, 0, 2, 0],
                         [4, 6, 0, 9, 0, 0, 2, 0, 3]]
    result = calculate_classification(match_code, response_table, answer_table, '%s' % string)
    if result:
        return result
    else:
        failed_to_classify_output('Date/Time/DateTime',string, _string, match_code)
        return None


def list_to_string(arg):
    '''
    Same as str(arg) but removes square brackets '[' & ']'
    '''
    return re.sub(r'^\[|\]$', '', str(arg))


def find_matches(response_table, match_code, response_key=None):
    '''
    Compare the match code against items in the response table
    '''
    if not response_key:
        response_key = lambda x: str(x)
    result = [(index, len(ii)) for index, i in enumerate(response_table)
              for ii in i if response_key(ii) in str(match_code)]
    #longer matches are considered 'better'
    result = sorted(result, key=lambda x: x[1], reverse=True)
    return result


def construct_key(key):
    '''
    Unpack the match iterators and remove duplicate matches
    '''
    #Unpack generators
    _key = [(index, y.start()) for index, i in enumerate(key) for y in i]
    _key = sorted(_key, key=lambda x: x[1])
    # Matches removes duplicate matches, the first match gets priority over the rest
    _key = [next(group) for i, group in itertools.groupby(_key, key=lambda x: x[1])]
    _key = [i[0] for i in _key]
    return _key


def calculate_classification(match_code, response_table, answer_table, warning_output='None'):
    '''
    Find which item in response_key is the best fit for the given key.
    Return the corresponding value in answer key.
    If no match is found print error msg and string that couldnt be classified
    '''
    #Check for an exact match
    answer_index = find_matches(response_table, match_code)
    if not answer_index: #If no match exists, find the best fit
        answer_index = find_matches(response_table, list_to_string(match_code), response_key= list_to_string)
        print('Warning: Incomplete match on classifying: "{} MATCH CODE: {} "'.format(str(warning_output), match_code))
    #Use the index to look up the answer
    if answer_index:
        answer_index = next(iter(answer_index))[0]
        return answer_table[answer_index]
    else:
        return None


def failed_to_classify_output(classification_type, raw_string, formatted_string, key):
    '''
    Print the failed classification for review
    '''
    def align(text): #Column width
        return ' ' * (30 - len(text))
    row = [[]] *4
    #Column 1
    row[0] = 'FAILED TO CLASSIFY:{}\n\n'.format(classification_type)
    row[1] = 'Raw string:'
    row[2] = 'Formatted string:'
    row[3] = 'key:'
    #Add column 2 to column 1
    row[1] = '{}{}{}\n'.format(row[1], align(row[1]), raw_string)
    row[2] = '{}{}{}\n'.format(row[2], align(row[2]), formatted_string)
    row[3] = '{}{}{}\n'.format(row[3], align(row[3]), key)
    output = ''.join((i for y in row for i in y))
    print(output + '{}'.format('-'*len(max(row, key= lambda x: len(x)))))


if __name__ == '__main__':
    while True:
        string = input("Please enter a time to classify:\n>?")
        print(time_classifier(string))

Answer

Some things I'd do:

Rename _string to something like normalised_string for clarity.
Don't overwrite the normalised string repeatedly. This case is a bit of a grey area, because you're effectively just splitting up a chain, but joining the chain back up again would be nicer:
```
normalised_string = re.sub(r'\s', ' ', string.lower()).strip()
```
On second thought, avoiding the normalisation would be even better. Instead, use character classes (such as \s) and case-insensitive flags (such as re.IGNORECASE) in the regular expressions. This makes the regular expressions slightly more complex, but can avoid subtle bugs because you're not applying assumptions to the input other than those in the regular expressions.
Don't declare the blank list; there's no need for it in Python.
Use list.append() to add items to a list.
Throw an exception in calculate_classification if it fails, rather than returning a falsy None value.
Read the strings to classify from standard input with readline(). This makes your script scriptable: it can be used from other scripts without having to handle a prompt/feedback loop.
In general there are too many magic values. Pull out properly named constants, methods or classes to clarify things.
The response_table should contain references to the actual match_code entries rather than simply the indexes. For example:
```
response_table.append([match_code[3], ...])  # "..." stands for the remaining entries
```
Using lots of numeric references makes the code really hard to follow.
It looks like match_code entries are matched to the input string sequentially using the response_table sequences. A more obvious way of doing this would be to construct more complete regular expressions and match the whole input string at once, such as this answer for ISO dates.
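
A rough sketch of how several of these suggestions might combine: named, case-insensitive patterns instead of numeric indexes, list.append() to build the table, and a custom exception instead of None. The names NUMBER, TIME_SEPARATOR, MERIDIEM, FEATURES, TIME_SEQUENCES, ClassificationError, feature_sequence and classify are all made up for illustration, and only two Time sequences are shown.

```python
import re

# Hypothetical named features; the original code refers to these by index.
NUMBER = re.compile(r'\d+')
TIME_SEPARATOR = re.compile(r':')
MERIDIEM = re.compile(r'am|pm', re.IGNORECASE)  # no prior lower() needed

FEATURES = [NUMBER, TIME_SEPARATOR, MERIDIEM]


class ClassificationError(ValueError):
    """Raised when no known feature sequence fits the input."""


# Known sequences are written with the named features, not bare indexes.
TIME_SEQUENCES = []
TIME_SEQUENCES.append([NUMBER, TIME_SEPARATOR, NUMBER])
TIME_SEQUENCES.append([NUMBER, TIME_SEPARATOR, NUMBER, MERIDIEM])


def feature_sequence(text):
    """List the features found in text, ordered by match position."""
    hits = [(match.start(), index)
            for index, feature in enumerate(FEATURES)
            for match in feature.finditer(text)]
    return [FEATURES[index] for _, index in sorted(hits)]


def classify(text):
    """Return 'Time' for a known sequence, otherwise raise."""
    if feature_sequence(text) in TIME_SEQUENCES:
        return 'Time'
    raise ClassificationError('cannot classify {!r}'.format(text))


print(classify('10:30 PM'))  # Time
```

These three features never match at the same position, so the sketch skips the duplicate-removal step that the full pattern list needs.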

Comment 1

I've been applying those suggestions; they're greatly appreciated. Raising the exception was a great suggestion! Though it's a simple enough thing to do, I now know how to create custom exceptions, and I don't get Nones floating around. As for the ISO regex option, I'm attempting to normalize scraped data that's often not in an ISO format.

Comment 2

I didn't mean that you should use an ISO regex exclusively, but rather that you should build regexes for the complete date formats you are expecting rather than combining partial regexes.
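
As a hedged illustration of what I mean, here is a minimal sketch that matches the whole input against one regex per complete format. The DATE and TIME patterns, the FORMATS table and classify_whole are assumptions made up for this example; real scraped data would need more formats.

```python
import re

# Hypothetical formats for illustration only.
DATE = r'\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}'          # e.g. 22/07/2017
TIME = r'\d{1,2}:\d{2}(?::\d{2})?(?:\s*[ap]m)?'    # e.g. 08:45 or 8:45:23 pm

FORMATS = [
    ('Date', re.compile(DATE, re.IGNORECASE)),
    ('Time', re.compile(TIME, re.IGNORECASE)),
    ('DateTime', re.compile(DATE + r'\s+' + TIME, re.IGNORECASE)),
]


def classify_whole(text):
    """Return the classification whose regex matches all of text."""
    stripped = text.strip()
    for name, pattern in FORMATS:
        # fullmatch requires the pattern to cover the entire string.
        if pattern.fullmatch(stripped):
            return name
    raise ValueError('unrecognised format: {!r}'.format(text))


print(classify_whole('22/07/2017 08:45:23'))  # DateTime
```

Because each pattern must cover the entire string, an input that only partly resembles a date raises an error instead of being classified on a partial match.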

Comment 3

Ah, I see. I guess my solution was constrained by my limited regex ability. What I liked about my solution was that if it fails, I get a printed error message with a match_code I can just copy and paste into the correct classification. If I get better at regex, that solution will become available to me.

l0b0 · 9,117 · 22 silver badges · 36 bronze badges · Accepted Answer · 2017-07-22 08:45:23Z
