Redundant processing

Some of the code processes the text multiple times. For example:

def unique_words_count(self):
    return {x : (self.word_search()).count(x) for x in self.unique_words()}

scans the text twice and scans a list of all the words:

  1. self.word_search() calls word_pattern.findall()
  2. .count(x) scans the list of words
  3. self.unique_words() calls word_search() which calls word_pattern.findall()

All of the analysis can be obtained by processing the text once to build a dictionary of the items in the text. The various methods can then return information based on that dictionary.
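For example, here is a minimal sketch of the one-pass idea (build_counts is hypothetical, and the word pattern is deliberately simplified; see the regex section below):

import re

word_pattern = re.compile(r"\w+")  # simplified pattern, just for illustration

def build_counts(text):
    counts = {}
    for match in word_pattern.finditer(text):
        word = match[0].lower()
        counts[word] = counts.get(word, 0) + 1
    return counts

counts = build_counts("The cat sat. The cat ran.")
# unique words: counts.keys(); count of one word: counts['cat']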

collections module

The collections module provides a Counter class designed for counting things.

from collections import Counter

counts = Counter(sequence) # counts the items in the sequence
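Counter also provides everything the methods below need. For example (the word list is just illustrative):

from collections import Counter

counts = Counter(['the', 'cat', 'sat', 'the', 'cat'])
counts['the']            # 2
counts.most_common(1)    # [('the', 2)]
list(counts.elements())  # ['the', 'the', 'cat', 'cat', 'sat']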

regex

The regex patterns can be simplified:

word_pattern = r"\w+(?:[-’']\w+)?"
sentence_ending = r'[.?!](?=\s|"|”|$)' # .?! only if followed by white space, a quote, or end-of-string.

I also added a regex to catch a few abbreviations, so they won't be picked up as sentence endings. Obviously, this can be expanded.
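As a quick check of the combined pattern (the sample text is made up):

import re

abbreviations = r"Dr\.|Mr\.|Mrs\."
word_pattern = r"\w+(?:[-’']\w+)?"
sentence_ending = r'[.?!](?=\s|"|”|$)'
pattern = re.compile('|'.join([abbreviations, word_pattern, sentence_ending]))

text = 'Dr. Smith isn’t here. "Where?"'
print([m[0] for m in pattern.finditer(text)])
# ['Dr.', 'Smith', 'isn’t', 'here', '.', 'Where', '?']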

Separate viewing from processing

Rather than directly printing out some data, it is often better for a class to return a string representation of the data. This makes the class more flexible. For example, if you want to use the Analyse class as part of a web server, the string should be sent to the web browser, not printed on the server's screen. (Although some web frameworks take care of this for you.)
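A minimal illustration of the pattern (the Report class here is hypothetical):

class Report:
    def __init__(self, data):
        self.data = data

    def as_str(self):
        # build and return the string; the caller decides where it goes
        return '\n'.join(f"{key}: {value}" for key, value in self.data.items())

report = Report({'words': 42, 'sentences': 5})
print(report.as_str())    # console use
# body = report.as_str()  # ...or hand the same string to a web framework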

revised code

import re
from collections import Counter


class Analyse:
    def __init__(self, text):
        self.text = text

        abbreviations = r"Dr\.|Mr\.|Mrs\."
        word_pattern = r"\w+(?:[-’']\w+)?"
        sentence_ending = r'[.?!](?=\s|"|”|$)'

        pattern_string = '|'.join([abbreviations, word_pattern, sentence_ending])
        search_pattern = re.compile(pattern_string)

        self.counts = Counter(match[0].lower() for match in search_pattern.finditer(text))

        # pull sentence endings out of self.counts, so the methods don't need
        # to filter them out
        self.sentence_endings = sum(self.counts.pop(p, 0) for p in '.?!')

        # length of the longest word (0 if the text is empty)
        self.maxlen = max((len(w) for w in self.counts), default=0)

    def sentence_count(self):
        return self.sentence_endings

    def all_words(self):
        '''Returns all words in the text (repetitions are included).'''
        return list(self.counts.elements())

    def unique_words(self):
        return list(self.counts.keys())

    def word_counts(self, word=None):
        if word:
            return self.counts[word]
        else:
            return dict(self.counts)

    def most_frequent(self, n):
        '''Returns the n most frequent words.'''
        return self.counts.most_common(n)

    def least_frequent(self, n):
        return self.counts.most_common()[-n:]

    def most_frequent_as_str(self, n):
        s = [f"Word {i}: {word:{self.maxlen}} |||| Count: {count}"
             for i, (word, count) in enumerate(self.most_frequent(n))]
        return '\n'.join(s)

    def least_frequent_as_str(self, n):
        s = [f"Word {i}: {word:{self.maxlen}} |||| Count: {count}"
             for i, (word, count) in enumerate(self.least_frequent(n))]
        return '\n'.join(s)

Several of the methods end with return list(...) or return dict(...). The calls to list or dict are probably not needed, but I put them in to match the data structures returned by your code.
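For reference, a quick usage sketch of the revised class (the sample text is made up):

text = 'Mr. Jones met Dr. Smith. "Hello, Smith," he said.'
analysis = Analyse(text)
print(analysis.sentence_count())         # 2 (Mr. and Dr. are not counted as endings)
print(analysis.most_frequent_as_str(1))  # Word 0: smith |||| Count: 2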
