Redundant processing
Some of the code processes the text multiple times. For example:
```python
def unique_words_count(self):
    return {x : (self.word_search()).count(x) for x in self.unique_words()}
```
scans the text twice and scans a list of all the words:

- `self.word_search()` calls `word_pattern.findall()`, which scans the text
- `.count(x)` scans the list of words, once per unique word
- `self.unique_words()` calls `word_search()`, which calls `word_pattern.findall()` again
All of the analysis can be obtained by processing the text once to build a dictionary of the items in the text. The various methods can then return information based on that dictionary.
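The single-pass idea can be sketched like this (the sample text and the helper name `word_counts` are mine, just for illustration):

```python
import re
from collections import Counter

word_pattern = re.compile(r"\w+(?:[-']\w+)?")

def word_counts(text):
    # One pass over the text builds the whole frequency table;
    # every other statistic is then just a dictionary lookup.
    return Counter(m.group().lower() for m in word_pattern.finditer(text))

counts = word_counts("The cat sat on the mat. The cat slept.")
counts["the"]    # 3 occurrences of "the"
len(counts)      # 6 unique words
```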
collections module
The `collections` library provides a `Counter` class designed for counting things.
```python
counts = Counter(sequence)  # counts items in the sequence
```
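For instance (sample data is mine), a `Counter` gives you lookups, the most common items, and the original items back:

```python
from collections import Counter

words = ["the", "cat", "sat", "on", "the", "mat"]
counts = Counter(words)   # counts items in the sequence

counts["the"]             # 2
counts.most_common(1)     # [('the', 2)]
sorted(counts.elements()) # the counted items, regrouped
```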
regex
The regex patterns can be simplified:
```python
word_pattern = r"\w+(?:[-’']\w+)?"
sentence_ending = r'[.?!](?=\s|"|”|$)'  # . ? ! only if followed by whitespace, a quote, or end-of-string
```
I also added a regex to catch a few abbreviations, so they won't be picked up as sentence endings. Obviously, this can be expanded.
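To illustrate how the combined pattern behaves (the sample sentence is mine, and the lookahead is slightly simplified here), abbreviations are matched whole, so their periods never count as sentence endings:

```python
import re

abbreviations = r"Dr\.|Mr\.|Mrs\."
word_pattern = r"\w+(?:[-']\w+)?"
sentence_ending = r'[.?!](?=\s|"|$)'

# Alternation is tried left to right, so "Dr." is consumed as an
# abbreviation before the sentence-ending alternative can see its period.
pattern = re.compile('|'.join([abbreviations, word_pattern, sentence_ending]))

tokens = [m.group() for m in pattern.finditer("Dr. Smith arrived. He waved!")]
# ['Dr.', 'Smith', 'arrived', '.', 'He', 'waved', '!']
```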
Separate viewing from processing
Rather than directly printing out some data, it is often better for a class to return a string representation of the data. This makes the class more flexible. For example, if you want to use the Analysis class as part of a web server, the string should be sent to the web browser, not printed on the server's screen (although some web frameworks take care of this for you).
revised code
```python
import re
from collections import Counter


class Analyse:
    def __init__(self, text):
        self.text = text

        abbreviations = r"Dr\.|Mr\.|Mrs\."
        word_pattern = r"\w+(?:[-’']\w+)?"
        sentence_ending = r'[.?!](?=\s|"|”|$)'

        pattern_string = '|'.join([abbreviations, word_pattern, sentence_ending])
        search_pattern = re.compile(pattern_string)

        self.counts = Counter(match[0].lower() for match in search_pattern.finditer(text))

        # pulls sentence endings out of self.counts, so the methods don't need
        # to filter them out
        self.sentence_endings = sum(self.counts.pop(p, 0) for p in '.?!')

    def sentence_count(self):
        return self.sentence_endings

    def all_words(self):
        '''Returns all words in the text (repetitions are included).'''
        return list(self.counts.elements())

    def unique_words(self):
        return list(self.counts.keys())

    def word_counts(self, word=None):
        if word:
            return self.counts[word]
        else:
            return dict(self.counts)

    def most_frequent(self, n):
        '''Returns the n most frequent words.'''
        return self.counts.most_common(n)

    def least_frequent(self, n):
        return self.counts.most_common()[-n:]
```
Several of the methods end with `return list(...)` or `return dict(...)`. The calls to `list` or `dict` are probably not needed, but I put them in to match the data structures returned by your code.