I'd like to share a methodology I've been using when faced with a classification problem. This particular example is designed to classify time data as Date/Time/DateTime, though I believe it could easily be adapted to many other problems.
The basic idea is:
- Create a list of regular expressions (re) that match features in the string to be classified. The re's should be ordered in the list from highest to lowest importance.
- Create a match_code: a list representing which re matched where in the string. Where several re's match at the same position, only the match of highest importance is kept.
- Create an answer_table that contains the possible classifications.
- Create a response_table that contains match_codes, each placed at the index of its classification in the answer_table.
- Check membership of the match_code in the response_table and return the answer that corresponds to the match.
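To make the steps concrete, here is a minimal sketch of how a match_code is built and looked up. It assumes a cut-down feature list of five patterns and a tiny response table; the names FEATURES, ANSWER_TABLE, RESPONSE_TABLE, build_match_code and classify are illustrative and not part of the code in this post, and the lookup is a plain exact-membership check.

```python
import itertools
import re

# Hypothetical, cut-down feature list, ordered from highest to lowest importance.
FEATURES = [
    r'\d+',      # 0: a run of digits
    r'[/\-.]',   # 1: date separator
    r':',        # 2: time separator
    r'am|pm',    # 3: meridiem
    r'[a-z]+',   # 4: any other word (lowest importance)
]

# Hypothetical tables: RESPONSE_TABLE[i] lists the match codes that
# map to ANSWER_TABLE[i].
ANSWER_TABLE = ['Date', 'Time']
RESPONSE_TABLE = [
    [[0, 1, 0, 1, 0]],            # e.g. 25/07/2017
    [[0, 2, 0], [0, 2, 0, 3]],    # e.g. 10:30, 10:30 pm
]


def build_match_code(text):
    """Return the feature index matched at each position, left to right."""
    text = text.lower()
    hits = [(m.start(), index)
            for index, pattern in enumerate(FEATURES)
            for m in re.finditer(pattern, text)]
    # Sorting the tuples orders by start position, then by feature index,
    # so the most important feature comes first within each position.
    hits.sort()
    return [next(group)[1]
            for _, group in itertools.groupby(hits, key=lambda hit: hit[0])]


def classify(text):
    """Return the answer whose table row contains the match code, else None."""
    code = build_match_code(text)
    for answer, codes in zip(ANSWER_TABLE, RESPONSE_TABLE):
        if code in codes:
            return answer
    return None


print(build_match_code('10:30 pm'))  # [0, 2, 0, 3]
print(classify('10:30 pm'))          # Time
print(classify('25/07/2017'))        # Date
```

In '10:30 pm', both the am|pm pattern and the catch-all word pattern match at position 6; only the more important one (index 3) survives in the match code.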
I'm aware machine learning solutions are more robust. This solution came about when I was attempting to create a data set to teach such a system. In the end I found I didn't actually need to go down that path.
Currently the creation of the response table is manual.
I'm quite new to Python programming and felt feedback would be useful. Review at leisure.
import re
import itertools


def time_classifier(string):
    '''
    classify the string according to features extracted by regular expressions
    '''
    _string = string
    _string = str.lower(_string)
    _string = re.sub(r'\s', ' ', _string).strip()

    #Strings to remove
    match_code = [[]] * 10
    match_code[0] = re.finditer(r'\d+', _string)
    match_code[1] = re.finditer(r'[/\-.|]', _string)
    match_code[2] = re.finditer(r':', _string)
    match_code[3] = re.finditer(r'am|pm', _string)
    match_code[4] = re.finditer(r'jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?'
                                r'|may|jun(e)?|jul(y)?|aug(ust)?'
                                r'|sep(tember)?|oct(ober)?|nov(ember)?|dec(ember)?', _string)
    match_code[5] = re.finditer(r'mon(day)?|tue(sday)?|wed(nesday)?|thu(rsday)?'
                                r'|fri(day)?|sat(urday)?|sun(day)?', _string)
    match_code[6] = re.finditer(r',', _string)
    match_code[7] = re.finditer(r'today|tomorrow|yesterday', _string)
    match_code[8] = re.finditer(r'aest', _string)
    match_code[9] = re.finditer(r'[a-z]+', _string)

    #Convert re items to an ordered list
    match_code = construct_key(match_code)

    #These are the possible classifications
    answer_table = ['Date', 'Time', 'DateTime']
    response_table = [[]] * len(answer_table)

    #Each response_type element correlates to the elements in the answer_key
    ####################DATE####################
    response_table[0] = [[3, 1, 0, 1, 0, 3],
                         [5, 0, 1, 0, 1, 0],
                         [0, 1, 0],
                         [5, 0, 1, 0],
                         [0, 1, 0, 1, 0, 1, 0],
                         [4, 0, 1, 0, 6, 0, 3],
                         [4, 0],
                         [3, 4, 0, 1, 0, 6, 0, 3],
                         [5, 4, 0, 0],
                         [5, 6, 0, 4, 1],
                         [0, 4, 0],
                         [5, 6, 0, 4, 0, 9],
                         [5, 1, 4, 0],
                         [7, 1, 4, 0],
                         [9, 0, 1, 5, 0, 1, 0, 1, 0],
                         [4, 6, 0, 0],
                         [0, 3]]
    ####################TIME####################
    response_table[1] = [[0, 2, 0, 3],
                         [0, 2, 0],
                         [0, 2, 0, 2, 0, 0],
                         [0, 0, 2, 0],
                         [0, 2, 0, 1],
                         [0, 2, 0, 2, 0],
                         [0, 0, 2, 0, 3],
                         [9, 9, 0, 0, 2, 0],
                         [0, 2, 0, 2, 0, 8],
                         [0, 2, 0, 8],
                         [9, 0, 0, 2, 0, 3],
                         [9, 0, 0, 2, 0, 3]]
    ####################DATETIME####################
    response_table[2] = [[5, 0, 4, 1, 0, 2, 0, 3],
                         [5, 0, 4, 0, 2, 0],
                         [5, 0, 4, 0, 0, 2, 0, 3],
                         [0, 2, 0, 3, 0, 4],
                         [0, 1, 0, 1, 0, 0, 2, 0],
                         [5, 0, 4, 1, 0, 2, 0],
                         [1, 5, 0, 4, 0, 2, 0],
                         [5, 6, 0, 4, 0],
                         [0, 1, 0, 1, 0, 0, 2, 0, 3, 0],
                         [0, 4, 0, 0, 2, 0, 3],
                         [5, 0, 4, 1, 0, 2, 0, 3, 9],
                         [0, 4, 6, 0, 2, 0, 8],
                         [4, 0, 6, 0, 6, 0, 1, 0, 3],
                         [4, 6, 0, 0, 6, 0, 2, 0],
                         [4, 6, 0, 9, 0, 0, 2, 0],
                         [4, 6, 0, 9, 0, 0, 2, 0, 3]]

    result = calculate_classification(match_code, response_table, answer_table, '%s' % string)
    if result:
        return result
    else:
        failed_to_classify_output('Date/Time/DateTime',string, _string, match_code)
        return None


def list_to_string(arg):
    '''
    Same as str(arg) but removes square brackets '[' & ']'
    '''
    return re.sub(r'^\[|\]$', '', str(arg))


def find_matches(response_table, match_code, response_key=None):
    '''
    Compare the match code against items in the response table
    '''
    if not response_key:
        response_key = lambda x: str(x)
    result = [(index, len(ii)) for index, i in enumerate(response_table)
              for ii in i if response_key(ii) in str(match_code)]
    #longer matches are considered 'better'
    result = sorted(result, key=lambda x: x[1], reverse=True)
    return result


def construct_key(key):
    '''
    Unpack the match iterators and remove duplicate matches
    '''
    #Unpack generators
    _key = [(index, y.start()) for index, i in enumerate(key) for y in i]
    _key = sorted(_key, key=lambda x: x[1])
    # Matches removes duplicate matches, the first match gets priority over the rest
    _key = [next(group) for i, group in itertools.groupby(_key, key=lambda x: x[1])]
    _key = [i[0] for i in _key]
    return _key


def calculate_classification(match_code, response_table, answer_table, warning_output='None'):
    '''
    Find which item in response_key is the best fit for the given key.
    Return the corresponding value in answer key.
    If no match is found print error msg and string that couldnt be classified
    '''
    #Check for an exact match
    answer_index = find_matches(response_table, match_code)
    if not answer_index: #If no match exists, find the best fit
        answer_index = find_matches(response_table, list_to_string(match_code), response_key= list_to_string)
        print('Warning: Incomplete match on classifying: "{} MATCH CODE: {} "'.format(str(warning_output), match_code))
    #Use the index to look up the answer
    if answer_index:
        answer_index = next(iter(answer_index))[0]
        return answer_table[answer_index]
    else:
        return None


def failed_to_classify_output(classification_type, raw_string, formatted_string, key):
    '''
    Print the failed classification for review
    '''
    def align(text): #Column width
        return ' ' * (30 - len(text))

    row = [[]] *4
    #Column 1
    row[0] = 'FAILED TO CLASSIFY:{}\n\n'.format(classification_type)
    row[1] = 'Raw string:'
    row[2] = 'Formatted string:'
    row[3] = 'key:'
    #Add column 2 to column 1
    row[1] = '{}{}{}\n'.format(row[1], align(row[1]), raw_string)
    row[2] = '{}{}{}\n'.format(row[2], align(row[2]), formatted_string)
    row[3] = '{}{}{}\n'.format(row[3], align(row[3]), key)
    output = ''.join((i for y in row for i in y))
    print(output + '{}'.format('-'*len(max(row, key= lambda x: len(x)))))


if __name__ == '__main__':
    while True:
        string = input("Please enter a time to classify:\n>?")
        print(time_classifier(string))
1 Answer
Some things I'd do:
- Rename _string to something like normalised_string for clarity. Don't overwrite the normalised string repeatedly. This case is a bit of a grey area, because you're effectively just splitting up a chain, but joining the chain back up again would be nicer:
  normalised_string = re.sub(r'\s', ' ', string.lower()).strip()
  On second thought, avoiding the normalisation would be even better. Instead, use character classes and case-insensitive modifiers on the regular expressions. This makes the regular expressions slightly more complex, but can avoid subtle bugs because you're not applying assumptions to the input other than those in the regular expressions.
- Don't declare the blank list; there's no need for it in Python.
- Use list.append() to add items to a list.
- Throw an exception in calculate_classification if it fails, rather than returning a falsy None value.
- Use readline() from standard input to get the strings to classify. This makes your script scriptable: it can be used from other scripts without having to handle a prompt/feedback loop.
- In general there are too many magic values. Pull out properly named constants, methods or classes to clarify things.
  The response_table should contain references to the actual match_code entries rather than simply the indexes. For example:
  response_table.append([match_code[3], etc.])
  Using lots of numeric references makes the code really hard to follow.
- It looks like match_code entries are matched to the input string sequentially using the response_table sequences. A more obvious way of doing this would be to construct more complete regular expressions and match the whole input string at once, such as this answer for ISO dates.
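As a rough illustration of the last few points, the sketch below matches the whole input against complete, named patterns with re.IGNORECASE instead of lower-casing first, and raises an exception when nothing matches. The names ClassificationError, MONTH, DATE, TIME, PATTERNS and classify are hypothetical, and the patterns cover only a handful of formats; they are assumptions for the example, not a replacement for the full response table.

```python
import re


class ClassificationError(ValueError):
    """Raised when no pattern matches the input (hypothetical name)."""


# Building blocks named once, so the complete patterns stay readable.
MONTH = (r'(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?'
         r'|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)')
DATE = r'\d{1,4}[/\-.]\d{1,2}[/\-.]\d{1,4}|\d{1,2}\s+' + MONTH + r'(?:\s+\d{2,4})?'
TIME = r'\d{1,2}:\d{2}(?::\d{2})?(?:\s*[ap]m)?'

# Illustrative patterns only; tried in order, first full match wins.
PATTERNS = [
    ('DateTime', re.compile(r'(?:{})\s+(?:{})'.format(DATE, TIME), re.IGNORECASE)),
    ('Date', re.compile(DATE, re.IGNORECASE)),
    ('Time', re.compile(TIME, re.IGNORECASE)),
]


def classify(text):
    """Return 'Date', 'Time' or 'DateTime', or raise ClassificationError."""
    candidate = text.strip()
    for label, pattern in PATTERNS:
        # fullmatch requires the pattern to cover the entire string.
        if pattern.fullmatch(candidate):
            return label
    raise ClassificationError('cannot classify {!r}'.format(text))


print(classify('25/07/2017'))            # Date
print(classify('10:30 PM'))              # Time
print(classify('25 Jul 2017 10:30 pm'))  # DateTime
```

To make this scriptable, a caller could iterate over sys.stdin and pass each line to classify, catching ClassificationError for lines that need a new pattern.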
- I've been applying those suggestions, greatly appreciated. Raising the exception was a great suggestion! Though a simple enough thing to do, I now know how to create custom exceptions, and I don't get None's floating around. As for the ISO regex option: I'm attempting to normalize scraped data that's often not in an ISO format. – James Schinner, Jul 25, 2017 at 6:36
- I didn't mean that you should use an ISO regex exclusively, but rather that you should build regexes for the complete date formats you are expecting rather than combining partial regexes. – l0b0, Jul 25, 2017 at 6:43
- Ah, I see. I guess my solution was constrained by my limited regex ability. What I liked about my solution was that if it fails, I get a printed error message with a match_code I can just copy and paste into the correct classification. If I get better at regex, that solution will become available to me. – James Schinner, Jul 25, 2017 at 6:52