I wrote some very rudimentary code that counts sentences and words in arbitrary text.
Code:
import re

class analyse():
    def __init__(self,text):
        self.text = text

    def sentence_count(self):
        punctuation_marks = ['\.','\?','!', '\."','\.”']
        space_newline_end = [' ', '\n', '$']
        combine = ['({}({}|{}|{}))'.format(x,*space_newline_end) for x in punctuation_marks]
        pattern_string = '|'.join(combine)
        sentence_search = re.compile(r'{}'.format(pattern_string))
        count = len(sentence_search.findall(self.text))
        return count

    def word_search(self):
        '''Returns all words in the text (repetitions are included)'''
        word_pattern = re.compile(r'(\w+)((\’\w+)|(-\w+)|(\'\w+))?')
        find_all = word_pattern.findall(self.text)
        return [''.join(x[:2]).lower() for x in find_all]

    def unique_words(self):
        return set(self.word_search())

    def unique_words_count(self):
        return {x : (self.word_search()).count(x) for x in self.unique_words()}

    def most_frequent(self,n,reverse):
        '''Returns n most (reverse == False) / least (reverse == True) frequent words.'''
        uniq_words = self.unique_words_count()
        if n > len(uniq_words):
            print("Number of unique words is less than {}.".format(n))
            return None
        words = list(uniq_words.keys())
        count = [[list(uniq_words.values())[x],x] for x in range(len(uniq_words))]
        count_sorted = sorted(count,key = lambda x: x[0],reverse=True)
        words_sorted = [words[x[1]] for x in count_sorted]
        if reverse == False:
            for y in range(n):
                print('Word: {}. {} \t \t |||| Count: {}'.format(y+1,words_sorted[y],count_sorted[y][0]))
        if reverse == True:
            iterator = 1
            for y in range(len(uniq_words)-1,len(uniq_words)-n-1,-1):
                print('Word: {}. {} \t \t |||| Count: {}'.format(iterator,words_sorted[y],count_sorted[y][0]))
                iterator += 1

    def search_specific_word_count(self,word):
        if word in self.unique_words_count():
            print('Word: {} || Count: {}'.format(word,self.unique_words_count()[word]))
        else:
            print('"{}" is not found.'.format(word))
What can be improved?
If you don't need an example, you can skip the following part.
Example:
Consider the following passage:
When Matt Radwell, a customer support officer for a small local authority in the UK, first started answering queries from the area’s residents, it was a frustrating and time-consuming business. If a resident contacted Aylesbury Vale District Council, 40 miles north of London, about an issue like housing benefit in which he lacked expertise, Mr Radwell might keep the caller waiting as long as 20 minutes. He had to find someone who could give him the relevant information.
Let the variable that holds the passage above be called passage.
First, we instantiate the class:
p1 = analyse(passage)
Now let's have a look at each function:
Input: p1.sentence_count()
Output: 3
Input: p1.word_search()
Output: ['when', 'matt', 'radwell', 'a', 'customer', 'support', 'officer', 'for', 'a', 'small', 'local', 'authority', 'in', 'the', 'uk', 'first', 'started', 'answering', 'queries', 'from', 'the', 'area’s', 'residents', 'it', 'was', 'a', 'frustrating', 'and', 'time-consuming', 'business', 'if', 'a', 'resident', 'contacted', 'aylesbury', 'vale', 'district', 'council', '40', 'miles', 'north', 'of', 'london', 'about', 'an', 'issue', 'like', 'housing', 'benefit', 'in', 'which', 'he', 'lacked', 'expertise', 'mr', 'radwell', 'might', 'keep', 'the', 'caller', 'waiting', 'as', 'long', 'as', '20', 'minutes', 'he', 'had', 'to', 'find', 'someone', 'who', 'could', 'give', 'him', 'the', 'relevant', 'information']
Input: p1.unique_words()
Output: {'20', '40', 'a', 'about', 'an', 'and', 'answering', 'area’s', 'as', 'authority', 'aylesbury', 'benefit', 'business', 'caller', 'contacted', 'could', 'council', 'customer', 'district', 'expertise', 'find', 'first', 'for', 'from', 'frustrating', 'give', 'had', 'he', 'him', 'housing', 'if', 'in', 'information', 'issue', 'it', 'keep', 'lacked', 'like', 'local', 'london', 'long', 'matt', 'might', 'miles', 'minutes', 'mr', 'north', 'of', 'officer', 'queries', 'radwell', 'relevant', 'resident', 'residents', 'small', 'someone', 'started', 'support', 'the', 'time-consuming', 'to', 'uk', 'vale', 'waiting', 'was', 'when', 'which', 'who'}
Input: p1.unique_words_count()
Output: {'was': 1, 'queries': 1, 'radwell': 2, 'the': 4, 'caller': 1, 'waiting': 1, 'area’s': 1, 'and': 1, 'first': 1, 'north': 1, 'a': 4, 'give': 1, 'like': 1, 'housing': 1, 'as': 2, 'him': 1, 'from': 1, 'he': 2, 'might': 1, 'someone': 1, 'who': 1, 'it': 1, 'issue': 1, 'miles': 1, 'lacked': 1, 'started': 1, 'benefit': 1, '20': 1, 'minutes': 1, 'an': 1, 'council': 1, 'time-consuming': 1, 'resident': 1, 'officer': 1, 'uk': 1, 'expertise': 1, 'had': 1, 'support': 1, 'small': 1, 'answering': 1, 'which': 1, 'customer': 1, 'matt': 1, 'if': 1, 'mr': 1, 'in': 2, 'aylesbury': 1, 'london': 1, 'frustrating': 1, 'long': 1, 'when': 1, 'contacted': 1, 'district': 1, 'relevant': 1, '40': 1, 'could': 1, 'information': 1, 'residents': 1, 'about': 1, 'keep': 1, 'of': 1, 'to': 1, 'find': 1, 'authority': 1, 'local': 1, 'business': 1, 'for': 1, 'vale': 1}
Input: p1.most_frequent(1,reverse=False)
Output: 'Word: 1. the |||| Count: 4'
Input: p1.most_frequent(2,reverse=False)
Output: 'Word: 1. the |||| Count: 4
Word: 2. a |||| Count: 4'
Input: p1.most_frequent(100,reverse=False)
Output: 'Number of unique words is less than 100.'
If you want words that occur least often, set reverse to True:
Input: p1.most_frequent(1,reverse=True)
Output: 'Word: 1. vale |||| Count: 1'
You can also check frequency of the specific word, for example:
Input: p1.search_specific_word_count('customer')
Output: 'Word: customer || Count: 1'
If you check a word that is not in the text, the output will be:
Input: p1.search_specific_word_count('president')
Output: '"president" is not found.'
1 Answer
Redundant processing
Some of the code processes the text multiple times. For example:
def unique_words_count(self):
    return {x : (self.word_search()).count(x) for x in self.unique_words()}
scans the text twice and also scans the full list of words:
- self.word_search() calls word_pattern.findall()
- .count(x) scans the list of words
- self.unique_words() calls word_search(), which calls word_pattern.findall()
All of the analysis can be obtained by processing the text once to build a dictionary of items in the text. The various methods can then return information based on the dictionary.
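As a minimal sketch of the single-pass idea (using a plain whitespace split here purely for illustration; the revised code below uses the full regex):

counts = {}
for word in text.lower().split():
    counts[word] = counts.get(word, 0) + 1
# every method can now look up counts instead of re-scanning the text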
collections module
The collections library provides a Counter class designed for counting things.
counts = Counter(sequence) # counts items in the sequence
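For example, a small illustration of Counter's behaviour:

from collections import Counter

counts = Counter(['a', 'b', 'a'])
counts['a']              # 2
counts.most_common(1)    # [('a', 2)]
list(counts.elements())  # ['a', 'a', 'b']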
regex
The regex patterns can be simplified:
word_pattern = r"\w+(?:[-’']\w+)?"
sentence_ending = r'[.?!](?=\s|"|”|$)'  # . ? ! only if followed by whitespace, a quote, or end-of-string
I also added a regex to catch a few abbreviations, so they won't be picked up as sentence endings. Obviously, this can be expanded.
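For instance, a quick check of the two patterns (the sample strings here are made up):

import re

word_pattern = r"\w+(?:[-’']\w+)?"
re.findall(word_pattern, "the area’s time-consuming business")
# ['the', 'area’s', 'time-consuming', 'business']

sentence_ending = r'[.?!](?=\s|"|”|$)'
re.findall(sentence_ending, 'He waited. She asked, "Why?" It ended.')
# ['.', '?', '.']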
Separate viewing from processing
Rather than directly printing out data, it is often better for a class to return a string representation of the data. This makes the class more flexible: for example, if you want to use the Analyse class as part of a web server, the string should be sent to the web browser, not printed on the server's screen. (Although some web frameworks take care of this for you.)
revised code
import re
import itertools as it
from collections import Counter, deque


class Analyse:
    def __init__(self, text):
        self.text = text

        abbreviations = r"Dr\.|Mr\.|Mrs\."
        word_pattern = r"\w+(?:[-’']\w+)?"
        sentence_ending = r'[.?!](?=\s|"|”|$)'
        pattern_string = '|'.join([abbreviations, word_pattern, sentence_ending])
        search_pattern = re.compile(pattern_string)

        self.counts = Counter(match[0].lower() for match in search_pattern.finditer(text))

        # pull sentence endings out of self.counts, so the methods don't need
        # to filter them out
        self.sentence_endings = sum(self.counts.pop(p, 0) for p in '.?!')

        # length of the longest word
        self.maxlen = max(len(w) for w in self.counts)

    def sentence_count(self):
        return self.sentence_endings

    def all_words(self):
        '''Returns all words in the text (repetitions are included).'''
        return list(self.counts.elements())

    def unique_words(self):
        return list(self.counts.keys())

    def word_counts(self, word=None):
        if word:
            return self.counts[word]
        else:
            return dict(self.counts)

    def most_frequent(self, n):
        '''Returns the n most frequent words.'''
        return self.counts.most_common(n)

    def least_frequent(self, n):
        return self.counts.most_common()[-n:]

    def most_frequent_as_str(self, n):
        s = [f"Word {i}: {word:{self.maxlen}} |||| Count: {count}"
             for i, (word, count) in enumerate(self.most_frequent(n), 1)]
        return '\n'.join(s)

    def least_frequent_as_str(self, n):
        s = [f"Word {i}: {word:{self.maxlen}} |||| Count: {count}"
             for i, (word, count) in enumerate(self.least_frequent(n), 1)]
        return '\n'.join(s)
Several of the methods end with return list(...) or return dict(...). The calls to list or dict are probably not needed, but I put them in to match the data structures returned by your code.
- Thanks for the review! One question: you wrote word_pattern = r"\w+(?:[-’']\w+)?". What is the ?: for? – Stokolos Ilya, Oct 1, 2019 at 5:22
- One more thing: the name of my class was "analyse", but you instead wrote "Analyse". Is there a reason for that? (Like, for example, a convention for naming classes?) – Stokolos Ilya, Oct 1, 2019 at 5:47
- And one more: sentence_ending = r'[.?!](?=\s|"|”|$)'. I get everything except the ?= part. What is it for? – Stokolos Ilya, Oct 1, 2019 at 5:53
- @Nelver It is Analyse because of PEP 8; the import is probably an oversight (maybe they used it instead of pop for self.sentence_endings?). ?= is a positive lookahead, so it matches only if the pattern comes afterwards. ?: is quite similar, but not quite the same. – Graipher, Oct 1, 2019 at 9:21
- (?:...) is like using (...), but the regex engine doesn't save the result, so you can't get it using .group(). As Graipher said, classes should start with a capital letter. p1(?=p2) means that p1 matches if p2 would match next, but it doesn't "use up" p2. deque was left over from a previous version of the code. – RootTwo, Oct 1, 2019 at 22:19