I'm new to data analysis and doing some online training. I have a task to extract specific words from a specific column in a data frame, count those words, and then compute some min/max/mean etc. I didn't find any specific method for that in Pandas, so I tried to write a function for it. This is what I have so far:
import re
import string
def countWords(data_frame, selected_words):
    words_dict = {}
    for sentence in data_frame:
        remove = string.punctuation
        remove = remove.replace("'", "")  # keep apostrophes
        pattern = r"[{}]".format(remove)  # character class of punctuation to strip
        test = re.sub(pattern, "", str(sentence))  # remove punctuation
        splited_words = str(test).split(' ')
        for word in splited_words:
            word = word.strip()
            word = word.lower()
            if word in selected_words:
                if word not in words_dict:
                    words_dict[word] = 1
                else:
                    words_dict[word] += 1
    return words_dict
It works as expected, but the performance is not what I expected. What could be done better in this code to improve performance?
It takes roughly ~5 s to process 15,257,065 words in 183,531 sentences.
Sample input
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']
data_frame = 'These flannel wipes are OK, but in my opinion not worth keeping. I also ordered some Imse Vimse Cloth Wipes-Ocean Blue-12 count which are larger, had a nicer, softer texture and just seemed higher quality. I use cloth wipes for hands and faces and have been using Thirsties 6 Pack Fab Wipes, Boy for about 8 months now and need to replace them because they are starting to get rough and have had stink issues for a while that stripping no longer handles.'
1 Answer
If you're dealing with a lot of data, and especially if your data fits in a dataframe, you should use dataframe methods as much as possible.
Your sample data is not a dataframe, but since you specifically mentioned Pandas and dataframes in your post, let's assume that your data is in a dataframe. Using Pandas' str methods for pre-processing will be much faster than looping over each sentence and processing them individually, as Pandas uses a vectorized implementation in C.
Also, since you're trying to count word occurrences, you can use Python's Counter object, which is designed specifically for, wait for it, counting things.
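For example, a standalone illustration of Counter (not tied to the data above), showing that update() accepts any iterable, including a generator expression used as a filter:

```python
from collections import Counter

# Counter counts hashable items; update() accepts any iterable
counts = Counter()
counts.update(["great", "bad", "great"])
counts.update(word for word in ["love", "great"] if word != "bad")

print(counts["great"])        # -> 3
print(counts.most_common(1))  # -> [('great', 3)]
```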
The current code:
def countWords(data_frame, selected_words):
    words_dict = {}
    for sentence in data_frame:
        remove = string.punctuation
        remove = remove.replace("'", "")  # keep apostrophes
        pattern = r"[{}]".format(remove)  # create the pattern
        test = re.sub(pattern, "", str(sentence))  # remove punctuation
        splited_words = str(test).split(' ')
        for word in splited_words:
            word = word.strip()
            word = word.lower()
could be reduced to something like:
def count_words(df, selected_words):
    ...
    pattern = r"[{}]".format(string.punctuation.replace("'", ""))
    df.sentences = df.sentences.str.replace(pattern, "", regex=True)
    df.sentences = df.sentences.str.strip().str.lower().str.split()
    for sentence in df.sentences:
        ...
Then if you were to use a Counter, you could filter and update the Counter in 1 line using a generator expression.
from collections import Counter

def count_words(df, selected_words):
    words_count = Counter()
    ...
    for sentence in df.sentences:
        words_count.update(x for x in sentence if x in selected_words)
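Assembled into a complete, runnable function (the `sentences` column name is an assumption carried over from the sketch above):

```python
import string
from collections import Counter

import pandas as pd

def count_words(df, selected_words):
    selected = set(selected_words)  # set gives O(1) membership tests
    # Strip punctuation (keeping apostrophes), lower-case, and split
    # using vectorized str methods
    pattern = r"[{}]".format(string.punctuation.replace("'", ""))
    cleaned = df.sentences.str.replace(pattern, "", regex=True).str.lower().str.split()
    words_count = Counter()
    for sentence in cleaned:
        words_count.update(word for word in sentence if word in selected)
    return words_count

df = pd.DataFrame({"sentences": ["I love this, it is great!", "Horrible, just horrible."]})
print(count_words(df, ["love", "great", "horrible"]))
# -> Counter({'horrible': 2, 'love': 1, 'great': 1})
```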
Alternatively, if you are going to be searching for different word groups, you could count all the words and filter afterwards.
One more thing to note is that selected_words in the sample input is a list. Searching a list takes O(n) time, where n is the number of selected words, so checking all m words in the sentences takes O(n*m) time. This can be improved by changing the list to a set, which has O(1) lookup, reducing the time complexity of the search to just O(m).
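A quick sanity check of the lookup cost (arbitrary data; absolute timings will vary by machine):

```python
import timeit

words = ["word{}".format(i) for i in range(10_000)]
as_list = list(words)
as_set = set(words)

# Worst case for the list: the probe is the last element,
# so every lookup scans all 10,000 entries
t_list = timeit.timeit(lambda: "word9999" in as_list, number=1_000)
t_set = timeit.timeit(lambda: "word9999" in as_set, number=1_000)
print(t_set < t_list)  # set membership is expected to be much faster
```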
This can be even further improved if you have the physical memory for it, by skipping the for loop and the counter and doing the entire thing in Pandas. Pandas' str.split function takes a parameter, expand, that splits the str into columns in the dataframe. When combined with .stack(), this results in a single column of all the words that occur in all the sentences.
The column can then be masked to filter for just the selected words, and counted with Pandas' Series.value_counts() function, like so:
words = df.sentences.str.split(expand=True).stack()
words = words[words.isin(selected_words)]
return words.value_counts()
In fact, it would probably be faster to skip all the for loops altogether and implement it like this, as vectorized implementations will be much faster than loops. If you don't have enough memory for this, you can process it in chunks and it should still be faster than using for loops.
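A sketch of the fully vectorized version, plus an optional chunked variant for when the expanded word table would not fit in memory (the `sentences` column name is again an assumption):

```python
import pandas as pd

def count_words_vectorized(df, selected_words):
    # One word per row via expand + stack, then mask to the
    # selected words and count
    words = df.sentences.str.split(expand=True).stack()
    words = words[words.isin(set(selected_words))]
    return words.value_counts()

def count_words_chunked(df, selected_words, chunk_size=100_000):
    # Same idea applied per chunk of rows, summing the partial counts
    total = pd.Series(dtype="int64")
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        total = total.add(count_words_vectorized(chunk, selected_words), fill_value=0)
    return total.astype("int64").sort_values(ascending=False)
```

Note that, as in the snippet above, this skips the punctuation stripping and lower-casing; those would be applied with the same .str methods before splitting.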
Comment (simkusr, Mar 26, 2018): That is what I was looking for. Thank you my friend, this helps me a lot!