
This is an updated version of my previous post, found here, posted mainly as a courtesy for anyone interested. I have taken most of the advice, apart from a couple of things:

  1. I can't figure out how to use readline()
  2. Input normalization is now done by an optional function that is passed in
  3. Not a pure regex solution

The terminology has also changed quite a bit since the previous version.

I'm interested in all feedback, in particular how I can format the regular expressions in the list labelled time_features so that they look 'right'.

Any better way to implement an idea, or to simplify it, is also welcome.

What I don't like is that merely creating the classifier object has the side effect of creating files. I do want the files, as they let me populate them with the tokenized output printed in the error message, but there's more than likely a better way.
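
To make the side effect concrete (an illustration only, using the names defined in the code at the bottom of the post):

# Merely constructing the classifier creates (or leaves untouched, since they
# are opened in append mode) one match-code file per classification:
#   Time_Date.txt, Time_Time.txt, Time_DateTime.txt
# I then paste the keys printed in the failure messages into these files.
time_classifier = Classifier('Time', ['Date', 'Time', 'DateTime'], time_features,
                             normalize_input=normalization, re_flags='i')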

Files associated with this example: https://drive.google.com/open?id=0B3HIB_5rVAxmbUJ4SjRMT3lBNlU

Here are some input/output pairs:


Enter time to classify as Date/Time/DateTime:
>?July 3 2017 8am
DateTime
Enter time to classify as Date/Time/DateTime:
>?2:00
Time
Enter time to classify as Date/Time/DateTime:
>?1.2.2017
Date
Enter time to classify as Date/Time/DateTime:
>?15 september 2017 19:00
DateTime
Enter time to classify as Date/Time/DateTime:
>?August 3 12:00
CLASSIFIER: Time FAILED TO CLASSIFY:
Raw string: August 3 12:00
Formatted string: August 3 12:00
key: [4, 0, 0, 2, 0]
----------------------------------------------
Enter time to classify as Date/Time/DateTime:
>?
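
For context, the key printed in the failure message is just the list of feature indices (positions in the time_features list below), in order of first appearance in the input. Reproducing the failed example by hand, using the time_classifier defined at the bottom of the code:

#Illustration only: reproduce the key from the failure message above
time_classifier.feature_extractor('August 3 12:00')
# -> [4, 0, 0, 2, 0]  i.e. month name, digits, digits, ':', digits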

import re
import itertools
import operator


class ClassificationException(Exception):
    def __init__(self, output):
        pass


class Classifier(object):
    def __init__(self, classifier_name, classifications, features, file_extension='.txt', normalize_input=None, re_flags=''):
        '''
        Create a classifier object that classifies string input based upon features extracted by regular expressions
        :param classifier_name: The prefix for files associated with this classifier
        :param classifications: The potential classifications of the data
        :param features: A list of regular expressions, each expression is a feature. Features should be ordered from
                         highest to lowest importance
        :param file_extension: The file extension of files associated to this classifier, defaults to .txt
        :param normalize_input: Designed to take a function to normalize the string input. Defaults to identity
        :param re_flags: Flags passed to the regular expression engine, defaults to none. It should be a string,
                         similar to the inline syntax. eg. 'ismx' => IGNORECASE|DOTALL|MULTILINE|VERBOSE
        '''
        #This is used to identify files that belong to this classifier
        self.name = classifier_name
        #The potential classifications
        self.classifications = classifications
        #Create a function that extracts the given features
        self.create_extractor_from_table(features, re_flags)
        self.file_extension = file_extension
        #Create files to contain the match_codes. Useful when the user wants to populate the response table
        self.create_files(self.name, self.file_extension, *self.classifications)
        #Read response matches and construct the lookup table
        self.create_response_table()
        #Input normalisation defaults to an identity function
        self.normalize_function(normalize_input)

    def create_extractor_from_table(self, features, re_flags):
        flag_lookup = {'i': re.IGNORECASE, 'm': re.MULTILINE, 's': re.DOTALL, 'x': re.VERBOSE}
        if 2 <= len(re_flags) <= 4:
            flags = [flag_lookup[i] for i in re_flags]
            flags = list(itertools.accumulate(flags, func=operator.or_))[-1]
        elif len(re_flags) == 1:
            flags = flag_lookup[re_flags]
        else:
            flags = 0

        def feature_extractor(string):
            #Create a list of re.finditer generators
            feature_table = [re.finditer(feature, string, flags) for feature in features]
            #Unpack the above generators
            match_code = [(index, ii.start()) for index, i in enumerate(feature_table) for ii in i]
            #Sort the features based upon their start position
            match_code = sorted(match_code, key=lambda x: x[1])
            #Remove duplicate matches, the first match gets priority over the rest
            match_code = [next(group) for i, group in itertools.groupby(match_code, key=lambda x: x[1])]
            match_code = [i[0] for i in match_code]
            return match_code

        self.feature_extractor = feature_extractor

    @staticmethod
    def create_files(prefix, extension, *file_names):
        for file in file_names:
            with open('{}_{}{}'.format(prefix, file, extension), 'a') as f:
                pass

    @staticmethod
    def read_file(file_prefix, file_name, file_extension):
        with open('{}_{}{}'.format(file_prefix, file_name, file_extension), 'r') as f:
            lines = itertools.takewhile(lambda x: x != '', f)
            contents = [[int(ii.group()) for ii in re.finditer(r'\d+', i)] for i in lines]
            return contents
    @staticmethod
    def append_file(file_prefix, file_name, file_extension, data):
        #Use the same <prefix>_<name><extension> naming scheme as create_files and read_file
        with open('{}_{}{}'.format(file_prefix, file_name, file_extension), 'a') as f:
            for i in data:
                f.write('{}{}'.format(str(i), '\n'))
    def create_response_table(self):
        self.response_table = {classification: self.read_file(self.name, classification, self.file_extension)
                               for classification in self.classifications}

    def normalize_function(self, func):
        def identity_function(arg):
            return arg
        if not func:
            self.normalize_function = identity_function
        else:
            self.normalize_function = func

    def failed_to_classify_output(self, raw_string, formatted_string, match_code):
        '''
        Print the failed classification for review
        '''
        def align(text):  # Column width
            return ' ' * (30 - len(text))
        row = [[]] * 4  # I know this not pythonic but it allows me to lay out my code nicely
        # Column 1
        row[0] = '\nCLASSIFIER: {} FAILED TO CLASSIFY:\n\n'.format(self.name)
        row[1] = 'Raw string:'
        row[2] = 'Formatted string:'
        row[3] = 'key:'
        # Add column 2 to column 1
        row[1] = '{}{}{}\n'.format(row[1], align(row[1]), raw_string)
        row[2] = '{}{}{}\n'.format(row[2], align(row[2]), formatted_string)
        row[3] = '{}{}{}\n'.format(row[3], align(row[3]), match_code)
        output = ''.join((i for ii in row for i in ii))
        return (output + '{}'.format('-' * len(max(row, key=lambda x: len(x)))))

    def __call__(self, string):
        #Normalize string
        normalized_string = self.normalize_function(string)
        #Extract features
        match_code = self.feature_extractor(normalized_string)
        #Check for membership of the match_code in the response table
        members = [(key, len(value)) for key, value in self.response_table.items()
                   for i in value if str(i) in str(match_code)]
        #Sort by match length
        result = sorted(members, key=lambda x: x[1], reverse=True)
        # The classification will be first value if it exists
        classification = None
        try:
            classification = next(iter(result))[0]
        except StopIteration:
            pass
        if classification:
            return classification
        else:
            #If it failed to classify print out an error msg and raise a ClassificationException
            error_msg = self.failed_to_classify_output(string, normalized_string, match_code)
            raise ClassificationException(error_msg)
if __name__ == '__main__':
    def normalization(string):
        _string = re.sub(r'\s', ' ', str(string)).strip()
        return _string

    time_features = [r'\d+',
                     r'[/\-.|]',
                     r':',
                     r'am|pm|AM|PM',
                     r'jan(?:uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?'
                     r'|aug(ust)?|sep(tember)?|oct(ober)?|nov(ember)?|dec(ember)?',
                     r'mon(day)?|tue(sday)?|wed(nesday)?|thu(rsday)?|fri(day)?|sat(urday)?|sun(day)?',
                     r',',
                     r'today|tomorrow|yesterday',
                     r'aest',
                     r'\w+']
    time_classifications = ['Date', 'Time', 'DateTime']
    time_classifier = Classifier('Time', time_classifications, time_features,
                                 normalize_input=normalization, re_flags='i')
    while True:
        text = input('Enter time to classify as Date/Time/DateTime:\n>?')
        try:
            result = time_classifier(text)
        except ClassificationException as e:
            print(e)
        else:
            print(result)
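
As an aside, the re_flags string is meant to mirror the inline flag syntax, so for example 'ix' turns on IGNORECASE and VERBOSE, which lets the feature patterns carry comments. A hypothetical illustration (not used above, and it creates a Verbose_Time.txt file as a side effect, as described earlier):

#Hypothetical example: 'ix' -> re.IGNORECASE | re.VERBOSE, so whitespace and
#inline comments are allowed in the feature patterns
verbose_features = [r'\d+       # one or more digits',
                    r'am | pm   # meridiem marker']
verbose_classifier = Classifier('Verbose', ['Time'], verbose_features, re_flags='ix')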
asked Jul 25, 2017 at 12:07

1 Answer


It's a great idea for you to be documenting your functions. Keep that up.

Running a PEP8 linter will tell you, among other things, that your comments should have a space between the # and the first word.

I think you've already smelled the issue where you start to write 'I know this not pythonic'. I'll still recommend that it be changed. At the least, use a literal initializer:

row = [
 '\nCLASSIFIER: {} FAILED TO CLASSIFY:\n\n'.format(self.name),
 'Raw string:',
 'Formatted string:',
 'key:'
]

Finally, it looks like you have no facility to exit the program. That feature seems useful (rather than being forced to Ctrl+C).
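
Something along these lines would do; this is just a sketch, and the 'quit' sentinel is only a suggestion:

while True:
    text = input('Enter time to classify as Date/Time/DateTime (or "quit" to exit):\n>?')
    if text.strip().lower() == 'quit':  # hypothetical exit command
        break
    try:
        result = time_classifier(text)
    except ClassificationException as e:
        print(e)
    else:
        print(result)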

answered Jul 25, 2017 at 16:52
  • Well, I didn't know that; I must admit I've been putting off reading PEP 8... Actually, your solution is nicer. I liked reading row[1], as the index was almost like a comment, but I hadn't thought of the above layout. Commented Jul 25, 2017 at 22:34
