Commonmark migration

edited Jun 10, 2020 at 13:24

I'm new to data analysis and am doing some online training. My task is to extract specific words from a specific column of a data frame, count those words, and then compute some statistics on the counts (min, max, mean, etc.). I didn't find a specific method for this in Pandas, so I wrote a function for it. This is what I have so far:

    import re
    import string

    def countWords(data_frame, selected_words):
        words_dict = {}

        for sentence in data_frame:
            remove = string.punctuation
            remove = remove.replace("'", "")  # keep apostrophes
            pattern = r"[{}]".format(remove)  # character class of punctuation to strip
            test = re.sub(pattern, "", str(sentence))  # strip the punctuation
            splited_words = str(test).split(' ')
            for word in splited_words:
                word = word.strip()
                word = word.lower()
                if word in selected_words:
                    if word not in words_dict:
                        words_dict[word] = 1
                    else:
                        words_dict[word] += 1
        return words_dict

It works as expected, but it is slower than I would like. What could be changed in this code to improve performance?

It takes roughly 5 s to process 15,257,065 words in 183,531 sentences.
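For comparison, here is a minimal sketch of one direction that might reduce the run time. It is not a measured result. It assumes the punctuation set can be built once outside the loop, that `selected_words` can be turned into a `set`, and that splitting on any whitespace is acceptable (the original splits on single spaces only, so words separated by tabs or newlines are handled differently). The name `count_selected_words` is hypothetical.

```python
import string
from collections import Counter

# Translation table built once: deletes all punctuation except the apostrophe.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("'", ""))


def count_selected_words(sentences, selected_words):
    """Count how often each selected word occurs across all sentences."""
    wanted = set(selected_words)  # constant-time membership tests
    counts = Counter()
    for sentence in sentences:
        cleaned = str(sentence).translate(_PUNCT_TABLE).lower()
        # split() without arguments splits on any run of whitespace,
        # so no separate strip() call is needed.
        counts.update(word for word in cleaned.split() if word in wanted)
    return dict(counts)


print(count_selected_words(["I love it, LOVE it!", "Great. bad-ish"],
                           ["love", "great", "bad"]))
# {'love': 2, 'great': 1}
```

Whether this is actually faster on the real data would need to be checked with a timing run on the full column.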

### Sample input

    selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']
    data_frame = 'These flannel wipes are OK, but in my opinion not worth keeping. I also ordered someImse Vimse Cloth Wipes-Ocean Blue-12 countwhich are larger, had a nicer, softer texture and just seemed higher quality. I use cloth wipes for hands and faces and have been usingThirsties 6 Pack Fab Wipes, Boyfor about 8 months now and need to replace them because they are starting to get rough and have had stink issues for a while that stripping no longer handles.'
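One thing to watch with this sample: `data_frame` here is a single string, and a `for` loop over a string yields individual characters, not sentences. A small sketch of a guard, assuming the real input is a column (an iterable of strings); the helper name `iter_sentences` is hypothetical:

```python
def iter_sentences(data):
    """Yield sentences from a single string or from any iterable of strings."""
    if isinstance(data, str):
        # A bare string is treated as one sentence, not as a sequence of characters.
        yield data
    else:
        yield from data


print(list(iter_sentences("I love it")))
# ['I love it']
print(list(iter_sentences(["I love it", "I hate it"])))
# ['I love it', 'I hate it']
```

Wrapping the sample in a list, as in `[data_frame]`, has the same effect for a quick test.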

edited tags; edited title; edited tags

edited Mar 7, 2018 at 18:13

200_success

Python, Extracting specific words from PANDAS dataframe

add sample input from comments

edited Mar 7, 2018 at 16:58

Sᴀᴍ Onᴇᴌᴀ ♦
