Document classifier

Question 1

Description:

I am working on a classifier that categorizes text based on some criteria. At present it is a map from each category to a list of words: if any of the words appears in the text, that category is assigned to it. A text can have multiple categories, and there can be more than 100 categories.

Problem:

The design needs to be as flexible as possible, so that a new classifier can simply be plugged in.

Code:

from collections import defaultdict
import string
import re
def createIndex(my_dict):
  index = defaultdict(list);
  for category, words in my_dict.items():
    for word in words:
      index[word.lower()].append(category)
  return index
class Document(object):
  def __init__(self, category, text):
    self.category = category
    self._text = text.strip().translate(str.maketrans('', '', string.punctuation))
    self._paragraphs = []
  def parse(self):
    paragraphs = re.split('\n\n', self._text)
    for para in paragraphs:
      self._paragraphs.append(Paragraph(para))
  def paragraphs(self):
    return self._paragraphs
class Paragraph(object):
  def __init__(self, text):
    self.number = None
    self._text = text
  def words(self):
    return self._text.split()
  def text(self):
    return self._text
  def __str__(self):
    return self._text
class ClassifiedText(object):
  def __init__(self, text, categories):
    self.text = text
    self.categories = categories
  def __str__(self):
    return self.text + '->' + str(self.categories)
  def __repr__(self):
    return self.text + '->' + str(self.categories)
class UncassifiedText(object):
  def __init__(self, text):
    self.text = text
    self.weight = 0
    self.words = []
    self.category = None
  def __str__(self):
    return self.text + '[Unclassified]'
  def __repr__(self):
    return self.text + '[Unclassified]'
class WeightedClassifiedText(object):
  def __init__(self, text, weight, category, words):
    self.text = text
    self.weight = weight
    self.words = words
    self.category = category
  def __str__(self):
    return self.text + '[' + str(self.weight) + ']'
  def __repr__(self):
    return self.text + '[' + str(self.weight) + ']'
class CategoryClassifier(object):
  def classify(self, text):
    raise NotImplementedError('subclasses must override classifier()!')
class TimeClassifier(CategoryClassifier):
  def __init__(self):
    self.index = createIndex(wordsByCategory)
    self._key = 'time'
    self._label = 'Time'
  def classify(self, paragraph):
    count = 0
    matched_words = set()
    for word in paragraph.words():
      if 'time' in self.index[word.lower()]:
        matched_words.add(word)
        count += 1
    if count > 0:
      return WeightedClassifiedText(
        paragraph.text(), count, 'Time', list(matched_words))
    else:
      return UncassifiedText(paragraph.text())
class MoodClassifier(CategoryClassifier):
  def __init__(self):
    self.index = createIndex(wordsByCategory)
    self._key = 'mood'
    self._label = 'Mood'
  def classify(self, paragraph):
    count = 0
    matched_words = set()
    for word in paragraph.words():
      if 'mood' in self.index[word.lower()]:
        matched_words.add(word)
        count += 1
    if count > 0:
      return WeightedClassifiedText(
        paragraph.text(), count, 'Mood', list(matched_words))
    else:
      return UncassifiedText(paragraph.text())
class DayClassifier(CategoryClassifier):
  def __init__(self):
    self.index = createIndex(wordsByCategory)
    self._key = 'day'
    self._label = 'Day'
  def classify(self, paragraph):
    count = 0
    matched_words = set()
    for word in paragraph.words():
      if self._key in self.index[word.lower()]:
        matched_words.add(word)
        count += 1
    if count > 0:
      return WeightedClassifiedText(
        paragraph.text(), count, self._label, list(matched_words))
    else:
      return UncassifiedText(paragraph.text())
class LocationClassifier(CategoryClassifier):
  def __init__(self):
    self.index = createIndex(wordsByCategory)
    self._key = 'location'
    self._label = 'Location'
  def classify(self, paragraph):
    count = 0
    matched_words = set()
    for word in paragraph.words():
      if 'location' in self.index[word.lower()]:
        matched_words.add(word)
        count += 1
    if count > 0:
      return WeightedClassifiedText(
        paragraph.text(), count, 'Location', list(matched_words))
    else:
      return UncassifiedText(paragraph.text())
wordsByCategory = {
  'time': ['monday', 'noon', 'morning'],
  'location': ['Here', 'Near', 'City', 'London', 'desk', 'office', 'home'],
  'mood': ['Happy', 'Excited', 'smiling', 'smiled', 'sick'],
  'day': ['sunday', 'monday', 'Friday']
}
raw_text = """
Friday brings joy and happiness. We get together
and have fun in our London office.
Monday is normally boring and makes me sick.
Everyone is busy in meetings and no fun.
She looked and smiled at me again, I am thinking
to have coffee with her.
"""
document = Document('Contract', raw_text)
document.parse()
#print(document.paragraphs[0])
#print(ManualClassifier(','.join(texts[0])).classify())
classifiers = [
  TimeClassifier(),
  MoodClassifier(),
  DayClassifier(),
  LocationClassifier()
]
for text in document.paragraphs():
  result = list(map(lambda x: x.classify(text), classifiers))
  categories = list(filter(
    lambda x: x is not None, list(map(lambda x: x.category, result))))
  print(text, categories)

It works as expected, and to add a new classifier I just have to create a class and add it to the list of classifiers. But I feel it lacks object-oriented design, and since I am new to Python I am not sure whether I am doing it the "Pythonic" way.

Misc:

In the future I need to introduce ML-based classifiers as well, and then for a given text I need to decide between the category chosen by the ML classifier and the one from the manual classification. For that reason, I have added a weight to WeightedClassifiedText.
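As a rough sketch of how that weight could drive the decision, one option is simply to keep the candidate with the highest weight. Everything named here is hypothetical and not part of the code above: the `Result` record, the `pick_best` helper, and the ML weight of 0.7, which is a made-up number.

```python
from collections import namedtuple

# Hypothetical record: who produced the result, the category, and its weight.
Result = namedtuple("Result", ["source", "category", "weight"])


def pick_best(results):
    """Return the result with the highest weight, or None if there are none."""
    return max(results, key=lambda r: r.weight, default=None)


candidates = [
    Result("manual", "Mood", 2),  # e.g. two keyword matches
    Result("ml", "Time", 0.7),    # e.g. a model confidence (made-up value)
]
best = pick_best(candidates)
print(best.category)
```

A match count and a model confidence are not on the same scale, so in practice the weights would have to be normalized before comparing them like this.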

Answer 1

It is indeed not so object oriented, but there's something to work with.

First things first: UncassifiedText -> UnclassifiedText, right?

If we look at all your classifiers, we can see pretty quickly that they are all the same except for the key and the label. One more difference: only DayClassifier actually uses the _key and _label attributes; the other classes hard-code the strings.

Instead of having all these duplicated, why not just use:

class Classifier():
    def __init__(self, key, label):
        self.index = createIndex(wordsByCategory)
        self._key = key
        self._label = label

    def classify(self, paragraph):
        count = 0
        matched_words = set()
        for word in paragraph.words():
            if self._key in self.index[word.lower()]:
                matched_words.add(word)
                count += 1
        if count > 0:
            return WeightedClassifiedText(
                paragraph.text(), count, self._label, list(matched_words))
        else:
            return UncassifiedText(paragraph.text())

Then:

classifiers = [
    Classifier("day", "Day"),
    Classifier("mood", "Mood"),
    ....
]

With only this change, you've removed all the duplication in your code, and that's already pretty damn good OOP. But we can do a little better.
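To see one parameterized class standing in for the four subclasses, here is a minimal self-contained sketch. It is a simplification under a few assumptions, not your original code: `classify` takes a plain list of words and returns a `(label, count)` tuple or `None` instead of the `WeightedClassifiedText` / `UncassifiedText` objects, the index is built once and passed in, and the names `create_index` and `WORDS_BY_CATEGORY` are just the PEP 8 spellings of yours.

```python
from collections import defaultdict

# Trimmed copy of the word lists, only for this sketch.
WORDS_BY_CATEGORY = {
    "time": ["monday", "noon", "morning"],
    "day": ["sunday", "monday", "Friday"],
}


def create_index(words_by_category):
    """Map each lowercased word to the list of categories it belongs to."""
    index = defaultdict(list)
    for category, words in words_by_category.items():
        for word in words:
            index[word.lower()].append(category)
    return index


class Classifier:
    def __init__(self, key, label, index):
        self._key = key
        self._label = label
        self._index = index

    def classify(self, words):
        # .get() avoids adding empty entries to the defaultdict on lookup.
        matched = [w for w in words
                   if self._key in self._index.get(w.lower(), [])]
        return (self._label, len(matched)) if matched else None


index = create_index(WORDS_BY_CATEGORY)
classifiers = [Classifier(key, key.capitalize(), index)
               for key in WORDS_BY_CATEGORY]

words = "Monday is normally boring".split()
results = [c.classify(words) for c in classifiers]
categories = [r[0] for r in results if r is not None]
print(categories)
```

The two list comprehensions at the end do the same job as the `list(map(lambda ...))` / `list(filter(lambda ...))` chain in your loop, and tend to be easier to read.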

One thing that strikes me as not very OOP is the global usage of wordsByCategory and the whole mechanism behind it; we could make it much simpler. As a matter of fact, createIndex is bloated: you create an index for every word in the dictionary, but each classifier only uses the words of its own category. You could make this much simpler with something like this:

class ClassifierV2():
    def __init__(self, key, label, words):
        # Lowercase once so matching stays case-insensitive, as it was with createIndex
        self.words = {word.lower() for word in words}
        self._key = key
        self._label = label

    def classify(self, paragraph):
        count = 0
        matched_words = set()
        for word in paragraph.words():
            # Notice the difference here, we don't need to track if we have the right category
            # it's the only one possible.
            if word.lower() in self.words:
                matched_words.add(word)
                count += 1
        if count > 0:
            return WeightedClassifiedText(
                paragraph.text(), count, self._label, list(matched_words))
        else:
            return UncassifiedText(paragraph.text())


ClassifierV2("day", "Day", ['sunday', 'monday', 'Friday'])

Here's a minor thing: if your label is always the key with a capitalized first letter, you should consider doing this in the constructor instead:

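class ClassifierV2():
    def __init__(self, label, words):
        # Lowercase the words so matching against lowercased text stays case-insensitive
        self.words = {word.lower() for word in words}
        self._key = label.lower()
        self._label = label


ClassifierV2("Day", ['sunday', 'monday', 'Friday'])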

The goal would be that, if you want to add a new classifier, you wouldn't need to touch your code again. How could we achieve this? I think a classifier configuration file is in order. Configuration files can seem scary at first, but we'll keep it very simple:

classifiers.json (to be dead honest, I never write .json files by hand lol, but note that JSON strings need double quotes):

{
    "time": ["monday", "noon", "morning"],
    "location": ["Here", "Near", "City", "London", "desk", "office", "home"],
    "mood": ["Happy", "Excited", "smiling", "smiled", "sick"],
    "day": ["sunday", "monday", "Friday"]
}

Once we have this, if you want to add a new classifier, you add a new line to your JSON file; anyone can do this. You can then load your word dictionary with the json package and feed each key:value pair to a new instance of ClassifierV2.
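Loading the configuration could look roughly like the sketch below. Several things here are assumptions made to keep it self-contained: the JSON text is inlined as a string instead of being read from classifiers.json, `load_classifiers` is a hypothetical helper, and this cut-down `ClassifierV2` takes a plain string and returns a `(label, matched words)` tuple or `None` rather than your result classes.

```python
import json

# Stand-in for the contents of classifiers.json. With a real file you
# would call json.load() on the opened file instead of json.loads().
CONFIG_TEXT = """
{
    "time": ["monday", "noon", "morning"],
    "mood": ["Happy", "Excited", "smiling", "smiled", "sick"]
}
"""


class ClassifierV2:
    def __init__(self, label, words):
        self._label = label
        self._words = {w.lower() for w in words}

    def classify(self, text):
        """Return (label, sorted matched words), or None when nothing matches."""
        matched = sorted({w for w in text.lower().split() if w in self._words})
        return (self._label, matched) if matched else None


def load_classifiers(config_text):
    """Build one classifier per key:value pair in the JSON config."""
    config = json.loads(config_text)
    return [ClassifierV2(key.capitalize(), words)
            for key, words in config.items()]


classifiers = load_classifiers(CONFIG_TEXT)
text = "monday morning makes me sick"
results = [r for r in (c.classify(text) for c in classifiers) if r is not None]
print(results)
```

With this shape, adding a category means adding one line to the JSON file; no Python code changes.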

Answer 2

Apart from the above-mentioned points (which I mainly agree with), my OCD doesn't let me skip the first thing that I spotted:

Use 4 spaces per indentation level.

When the conditional part of an if-statement is long enough to require that it be written across multiple lines, it's worth noting that the combination of a two character keyword (i.e. if), plus a single space, plus an opening parenthesis creates a natural 4-space indent for the subsequent lines of the multiline conditional. This can produce a visual conflict with the indented suite of code nested inside the if-statement, which would also naturally be indented to 4 spaces. This PEP takes no explicit position on how (or whether) to further visually distinguish such conditional lines from the nested suite inside the if-statement.

Also, surround top-level function and class definitions with two blank lines. So, instead of:

class B(A):
    # ...
class C(A):
    # ...

Use:

class B(A):
    # ...


class C(A):
    # ...

You can read more about the subject in PEP 8, which is where the quotes above come from.

IEatBagels · Answer 1 · 2019-08-03 23:55:22Z


score 1 · Answer 2 · 2019-08-05 14:04:56Z

