I wrote some very rudimentary code that counts sentences and words in arbitrary text.
Code:
import re

class analyse():
    def __init__(self,text):
        self.text = text

    def sentence_count(self):
        punctuation_marks = ['\.','\?','!', '\."','\.”']
        space_newline_end = [' ', '\n', '$']
        combine = ['({}({}|{}|{}))'.format(x,*space_newline_end) for x in punctuation_marks]
        pattern_string = '|'.join(combine)
        sentence_search = re.compile(r'{}'.format(pattern_string))
        count = len(sentence_search.findall(self.text))
        return count

    def word_search(self):
        '''Returns all words in the text (repetitions are included)'''
        word_pattern = re.compile(r'(\w+)((\’\w+)|(-\w+)|(\'\w+))?')
        find_all = word_pattern.findall(self.text)
        return [''.join(x[:2]).lower() for x in find_all]

    def unique_words(self):
        return set(self.word_search())

    def unique_words_count(self):
        return {x : (self.word_search()).count(x) for x in self.unique_words()}

    def most_frequent(self,n,reverse):
        '''Returns n most (reverse == False) / least (reverse == True) frequent words.'''
        uniq_words = self.unique_words_count()
        if n > len(uniq_words):
            print("Number of unique words is less than {}.".format(n))
            return None
        words = list(uniq_words.keys())
        count = [[list(uniq_words.values())[x],x] for x in range(len(uniq_words))]
        count_sorted = sorted(count,key = lambda x: x[0],reverse=True)
        words_sorted = [words[x[1]] for x in count_sorted]
        if reverse == False:
            for y in range(n):
                print('Word: {}. {} \t \t |||| Count: {}'.format(y+1,words_sorted[y],count_sorted[y][0]))
        if reverse == True:
            iterator = 1
            for y in range(len(uniq_words)-1,len(uniq_words)-n-1,-1):
                print('Word: {}. {} \t \t |||| Count: {}'.format(iterator,words_sorted[y],count_sorted[y][0]))
                iterator += 1

    def search_specific_word_count(self,word):
        if word in self.unique_words_count():
            print('Word: {} || Count: {}'.format(word,self.unique_words_count()[word]))
        else:
            print('"{}" is not found.'.format(word))
What can be improved?
If you don't need an example, you can skip the following part.
Example:
Consider the following passage:
When Matt Radwell, a customer support officer for a small local authority in the UK, first started answering queries from the area’s residents, it was a frustrating and time-consuming business. If a resident contacted Aylesbury Vale District Council, 40 miles north of London, about an issue like housing benefit in which he lacked expertise, Mr Radwell might keep the caller waiting as long as 20 minutes. He had to find someone who could give him the relevant information.
Let the variable that holds the passage above be called passage.
First, we instantiate the class:
p1 = analyse(passage)
Now let's have a look at each function:
Input: p1.sentence_count()
Output: 3
Input: p1.word_search()
Output: ['when', 'matt', 'radwell', 'a', 'customer', 'support', 'officer', 'for', 'a', 'small', 'local', 'authority', 'in', 'the', 'uk', 'first', 'started', 'answering', 'queries', 'from', 'the', 'area’s', 'residents', 'it', 'was', 'a', 'frustrating', 'and', 'time-consuming', 'business', 'if', 'a', 'resident', 'contacted', 'aylesbury', 'vale', 'district', 'council', '40', 'miles', 'north', 'of', 'london', 'about', 'an', 'issue', 'like', 'housing', 'benefit', 'in', 'which', 'he', 'lacked', 'expertise', 'mr', 'radwell', 'might', 'keep', 'the', 'caller', 'waiting', 'as', 'long', 'as', '20', 'minutes', 'he', 'had', 'to', 'find', 'someone', 'who', 'could', 'give', 'him', 'the', 'relevant', 'information']
Input: p1.unique_words()
Output: {'20', '40', 'a', 'about', 'an', 'and', 'answering', 'area’s', 'as', 'authority', 'aylesbury', 'benefit', 'business', 'caller', 'contacted', 'could', 'council', 'customer', 'district', 'expertise', 'find', 'first', 'for', 'from', 'frustrating', 'give', 'had', 'he', 'him', 'housing', 'if', 'in', 'information', 'issue', 'it', 'keep', 'lacked', 'like', 'local', 'london', 'long', 'matt', 'might', 'miles', 'minutes', 'mr', 'north', 'of', 'officer', 'queries', 'radwell', 'relevant', 'resident', 'residents', 'small', 'someone', 'started', 'support', 'the', 'time-consuming', 'to', 'uk', 'vale', 'waiting', 'was', 'when', 'which', 'who'}
Input: p1.unique_words_count()
Output: {'was': 1, 'queries': 1, 'radwell': 2, 'the': 4, 'caller': 1, 'waiting': 1, 'area’s': 1, 'and': 1, 'first': 1, 'north': 1, 'a': 4, 'give': 1, 'like': 1, 'housing': 1, 'as': 2, 'him': 1, 'from': 1, 'he': 2, 'might': 1, 'someone': 1, 'who': 1, 'it': 1, 'issue': 1, 'miles': 1, 'lacked': 1, 'started': 1, 'benefit': 1, '20': 1, 'minutes': 1, 'an': 1, 'council': 1, 'time-consuming': 1, 'resident': 1, 'officer': 1, 'uk': 1, 'expertise': 1, 'had': 1, 'support': 1, 'small': 1, 'answering': 1, 'which': 1, 'customer': 1, 'matt': 1, 'if': 1, 'mr': 1, 'in': 2, 'aylesbury': 1, 'london': 1, 'frustrating': 1, 'long': 1, 'when': 1, 'contacted': 1, 'district': 1, 'relevant': 1, '40': 1, 'could': 1, 'information': 1, 'residents': 1, 'about': 1, 'keep': 1, 'of': 1, 'to': 1, 'find': 1, 'authority': 1, 'local': 1, 'business': 1, 'for': 1, 'vale': 1}
Input: p1.most_frequent(1,reverse=False)
Output: 'Word: 1. the |||| Count: 4'
Input: p1.most_frequent(2,reverse=False)
Output: 'Word: 1. the |||| Count: 4
Word: 2. a |||| Count: 4'
Input: p1.most_frequent(100,reverse=False)
Output: 'Number of unique words is less than 100.'
If you want words that occur least often, set reverse to True:
Input: p1.most_frequent(1,reverse=True)
Output: 'Word: 1. vale |||| Count: 1'
You can also check frequency of the specific word, for example:
Input: p1.search_specific_word_count('customer')
Output: 'Word: customer || Count: 1'
If you check a word that is not in the text, the output will be:
Input: p1.search_specific_word_count('president')
Output: '"president" is not found.'
1 Answer
Redundant processing
Some of the code processes the text multiple times. For example:
def unique_words_count(self):
    return {x : (self.word_search()).count(x) for x in self.unique_words()}
scans the text twice and also scans the full list of words:
- self.word_search() calls word_pattern.findall()
- .count(x) scans the list of words
- self.unique_words() calls word_search(), which calls word_pattern.findall()
All of the analysis can be obtained by processing the text once to build a dictionary of items in the text. The various methods can then return information based on the dictionary.
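As a minimal sketch of the single-pass idea (using a plain whitespace split here purely for illustration; the revised code below uses the full regex):

counts = {}
for word in text.lower().split():
    counts[word] = counts.get(word, 0) + 1
# every method can now look up counts instead of re-scanning the text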
collections module
The collections library provides a Counter class designed for counting things.
counts = Counter(sequence) # counts items in the sequence
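For example, a small illustration of Counter's behaviour:

from collections import Counter

counts = Counter(['a', 'b', 'a'])
counts['a']              # 2
counts.most_common(1)    # [('a', 2)]
list(counts.elements())  # ['a', 'a', 'b']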
regex
The regex patterns can be simplified:
word_pattern = r"\w+(?:[-’']\w+)?"
sentence_ending = r'[.?!](?=\s|"|”|$)'  # . ? ! only if followed by whitespace, a quote, or end-of-string
I also added a regex to catch a few abbreviations, so they won't be picked up as sentence endings. Obviously, this can be expanded.
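For instance, a quick check of the two patterns (the sample strings here are made up):

import re

word_pattern = r"\w+(?:[-’']\w+)?"
re.findall(word_pattern, "the area’s time-consuming business")
# ['the', 'area’s', 'time-consuming', 'business']

sentence_ending = r'[.?!](?=\s|"|”|$)'
re.findall(sentence_ending, 'He waited. She asked, "Why?" It ended.')
# ['.', '?', '.']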
Separate viewing from processing
Rather than directly printing out data, it is often better for a class to return a string representation of the data. This makes the class more flexible: for example, if you want to use the Analyse class as part of a web server, the string should be sent to the web browser, not printed on the server's screen. (Although some web frameworks take care of this for you.)
revised code
import re
import itertools as it
from collections import Counter, deque


class Analyse:
    def __init__(self, text):
        self.text = text

        abbreviations = r"Dr\.|Mr\.|Mrs\."
        word_pattern = r"\w+(?:[-’']\w+)?"
        sentence_ending = r'[.?!](?=\s|"|”|$)'
        pattern_string = '|'.join([abbreviations, word_pattern, sentence_ending])
        search_pattern = re.compile(pattern_string)

        self.counts = Counter(match[0].lower() for match in search_pattern.finditer(text))

        # pull sentence endings out of self.counts, so the methods don't need
        # to filter them out
        self.sentence_endings = sum(self.counts.pop(p, 0) for p in '.?!')

        # length of the longest word
        self.maxlen = max(len(w) for w in self.counts)

    def sentence_count(self):
        return self.sentence_endings

    def all_words(self):
        '''Returns all words in the text (repetitions are included).'''
        return list(self.counts.elements())

    def unique_words(self):
        return list(self.counts.keys())

    def word_counts(self, word=None):
        if word:
            return self.counts[word]
        else:
            return dict(self.counts)

    def most_frequent(self, n):
        '''Returns the n most frequent words.'''
        return self.counts.most_common(n)

    def least_frequent(self, n):
        return self.counts.most_common()[-n:]

    def most_frequent_as_str(self, n):
        s = [f"Word {i}: {word:{self.maxlen}} |||| Count: {count}"
             for i, (word, count) in enumerate(self.most_frequent(n), 1)]
        return '\n'.join(s)

    def least_frequent_as_str(self, n):
        s = [f"Word {i}: {word:{self.maxlen}} |||| Count: {count}"
             for i, (word, count) in enumerate(self.least_frequent(n), 1)]
        return '\n'.join(s)
Several of the methods end with return list(...) or return dict(...). The calls to list or dict are probably not needed, but I put them in to match the data structures returned by your code.
- Thanks for the review! One question: you wrote word_pattern = r"\w+(?:[-’']\w+)?". What is the ?: for? – Stokolos Ilya, Oct 1, 2019 at 5:22
- One more thing: the name of my class was "analyse", but you instead wrote "Analyse". Is there a reason for that? (Like, for example, a convention for naming classes?) – Stokolos Ilya, Oct 1, 2019 at 5:47
- And one more: sentence_ending = r'[.?!](?=\s|"|”|$)'. I get everything except the ?= part. What is it for? – Stokolos Ilya, Oct 1, 2019 at 5:53
- @Nelver It is Analyse because of PEP 8; the import is probably an oversight (maybe they used it instead of pop for self.sentence_endings?). ?= is a positive lookahead, so it matches only if the pattern comes afterwards. ?: is quite similar, but not quite the same. – Graipher, Oct 1, 2019 at 9:21
- (?:...) is like using (...), but the regex engine doesn't save the result, so you can't get it using .group(). As Graipher said, classes should start with a capital letter. p1(?=p2) means that p1 matches if p2 would match next, but it doesn't "use up" p2. deque was left over from a previous version of the code. – RootTwo, Oct 1, 2019 at 22:19