I'm new to data analysis and doing some online training. I have a task to extract specific words from a specific column in a data frame, count those words, and then compute some min/max/mean etc. I didn't find any specific method for that in Pandas, so I tried to write a function for it. This is what I have so far:
import re
import string
def countWords(data_frame, selected_words):
    words_dict = {}
    for sentence in data_frame:
        remove = string.punctuation
        remove = remove.replace("'", "")  # keep apostrophes
        pattern = r"[{}]".format(remove)  # character class of punctuation to strip
        test = re.sub(pattern, "", str(sentence))  # remove punctuation
        splited_words = str(test).split(' ')
        for word in splited_words:
            word = word.strip()
            word = word.lower()
            if word in selected_words:
                if word not in words_dict:
                    words_dict[word] = 1
                else:
                    words_dict[word] += 1
    return words_dict
It works as expected, but the performance is not what I expected. What could be done better in this code to improve performance?
It takes roughly ~5 s to process 15,257,065 words in 183,531 sentences.
Sample input
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']
data_frame = 'These flannel wipes are OK, but in my opinion not worth keeping. I also ordered some Imse Vimse Cloth Wipes-Ocean Blue-12 count which are larger, had a nicer, softer texture and just seemed higher quality. I use cloth wipes for hands and faces and have been using Thirsties 6 Pack Fab Wipes, Boy for about 8 months now and need to replace them because they are starting to get rough and have had stink issues for a while that stripping no longer handles.'
1 Answer
If you're dealing with a lot of data, and especially if your data fits in a dataframe, you should use dataframe methods as much as possible.
Your sample data is not a dataframe, but since you specifically mentioned Pandas and dataframes in your post, let's assume that your data is in a dataframe. Using Pandas' str methods for pre-processing will be much faster than looping over each sentence and processing them individually, as Pandas uses a vectorized implementation in C.
Also, since you're trying to count word occurrences, you can use Python's Counter object, which is designed specifically for, wait for it, counting things.
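For example, a standalone illustration of Counter (not tied to the data above), showing that update() accepts any iterable, including a generator expression used as a filter:

```python
from collections import Counter

# Counter counts hashable items; update() accepts any iterable
counts = Counter()
counts.update(["great", "bad", "great"])
counts.update(word for word in ["love", "great"] if word != "bad")

print(counts["great"])        # -> 3
print(counts.most_common(1))  # -> [('great', 3)]
```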
The current code:
def countWords(data_frame, selected_words):
    words_dict = {}
    for sentence in data_frame:
        remove = string.punctuation
        remove = remove.replace("'", "")  # keep apostrophes
        pattern = r"[{}]".format(remove)  # create the pattern
        test = re.sub(pattern, "", str(sentence))  # remove punctuation
        splited_words = str(test).split(' ')
        for word in splited_words:
            word = word.strip()
            word = word.lower()
could be reduced to something like:
def count_words(df, selected_words):
    ...
    pattern = r"[{}]".format(string.punctuation.replace("'", ""))
    df.sentences = df.sentences.str.replace(pattern, "", regex=True)
    df.sentences = df.sentences.str.strip().str.lower().str.split()
    for sentence in df.sentences:
        ...
Then if you were to use a Counter, you could filter and update the Counter in 1 line using a generator expression.
from collections import Counter

def count_words(df, selected_words):
    words_count = Counter()
    ...
    for sentence in df.sentences:
        words_count.update(x for x in sentence if x in selected_words)
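Assembled into a complete, runnable function (the `sentences` column name is an assumption carried over from the sketch above):

```python
import string
from collections import Counter

import pandas as pd

def count_words(df, selected_words):
    selected = set(selected_words)  # set gives O(1) membership tests
    # Strip punctuation (keeping apostrophes), lower-case, and split
    # using vectorized str methods
    pattern = r"[{}]".format(string.punctuation.replace("'", ""))
    cleaned = df.sentences.str.replace(pattern, "", regex=True).str.lower().str.split()
    words_count = Counter()
    for sentence in cleaned:
        words_count.update(word for word in sentence if word in selected)
    return words_count

df = pd.DataFrame({"sentences": ["I love this, it is great!", "Horrible, just horrible."]})
print(count_words(df, ["love", "great", "horrible"]))
# -> Counter({'horrible': 2, 'love': 1, 'great': 1})
```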
Alternatively, if you are going to be searching for different word groups, you could count all the words and filter afterwards.
One more thing to note is that selected_words in the sample input is a list. Searching a list takes O(n) time, where n is the number of selected words, so checking all m words in the sentences takes O(n*m) time. This can be improved by changing the list to a set, which has O(1) lookup, reducing the time complexity of the search to just O(m).
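A quick sanity check of the lookup cost (arbitrary data; absolute timings will vary by machine):

```python
import timeit

words = ["word{}".format(i) for i in range(10_000)]
as_list = list(words)
as_set = set(words)

# Worst case for the list: the probe is the last element,
# so every lookup scans all 10,000 entries
t_list = timeit.timeit(lambda: "word9999" in as_list, number=1_000)
t_set = timeit.timeit(lambda: "word9999" in as_set, number=1_000)
print(t_set < t_list)  # set membership is expected to be much faster
```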
This can be even further improved if you have the physical memory for it, by skipping the for loop and the counter and doing the entire thing in Pandas. Pandas' str.split function takes a parameter, expand, that splits the str into columns in the dataframe. When combined with .stack(), this results in a single column of all the words that occur in all the sentences.
The column can then be masked to filter for just the selected words, and counted with Pandas' Series.value_counts() function, like so:
words = df.sentences.str.split(expand=True).stack()
words = words[words.isin(selected_words)]
return words.value_counts()
In fact, it would probably be faster to skip all the for loops altogether and implement it like this, as vectorized implementations will be much faster than loops. If you don't have enough memory for this, you can process it in chunks and it should still be faster than using for loops.
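A sketch of the fully vectorized version, plus an optional chunked variant for when the expanded word table would not fit in memory (the `sentences` column name is again an assumption):

```python
import pandas as pd

def count_words_vectorized(df, selected_words):
    # One word per row via expand + stack, then mask to the
    # selected words and count
    words = df.sentences.str.split(expand=True).stack()
    words = words[words.isin(set(selected_words))]
    return words.value_counts()

def count_words_chunked(df, selected_words, chunk_size=100_000):
    # Same idea applied per chunk of rows, summing the partial counts
    total = pd.Series(dtype="int64")
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        total = total.add(count_words_vectorized(chunk, selected_words), fill_value=0)
    return total.astype("int64").sort_values(ascending=False)
```

Note that, as in the snippet above, this skips the punctuation stripping and lower-casing; those would be applied with the same .str methods before splitting.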
Comment (simkusr, Mar 26, 2018): That is what I was looking for. Thank you my friend, this helps me a lot!