Code Review

I'm new to data analysis and am doing some online training. My task is to extract specific words from a specific column of a data frame, count those words, and then compute some statistics on the counts (min, max, mean, etc.). I didn't find a specific method for this in Pandas, so I tried writing a function for it. This is what I have so far:

    import re
    import string


    def countWords(data_frame, selected_words):
        words_dict = {}

        for sentence in data_frame:
            remove = string.punctuation
            remove = remove.replace("'", "")  # don't remove apostrophes
            pattern = r"[{}]".format(remove)  # build a character class of the punctuation to strip
            test = re.sub(pattern, "", str(sentence))  # strip the punctuation
            splited_words = str(test).split(' ')
            for word in splited_words:
                word = word.strip()
                word = word.lower()
                if word in selected_words:
                    if word not in words_dict:
                        words_dict[word] = 1
                    else:
                        words_dict[word] += 1
        return words_dict

It works as expected, but it is slower than I had hoped. What could be done better in this code to improve performance?

It takes roughly 5 s to process 15,257,065 words in 183,531 sentences.
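For comparison, here is a minimal sketch of one direction that might reduce the running time. It is not a measured or verified fix. It assumes the punctuation table can be built once outside the loop, that `selected_words` can be turned into a `set` for constant-time membership tests, and that `collections.Counter` is acceptable for the tallying. The name `count_words_fast` is hypothetical. Note that `split()` here splits on any whitespace, which differs slightly from the `split(' ')` plus `strip()` used above.

```python
import string
from collections import Counter

# Built once: delete all punctuation except the apostrophe,
# mirroring the intent of the original function.
_STRIP_TABLE = str.maketrans("", "", string.punctuation.replace("'", ""))


def count_words_fast(sentences, selected_words):
    """Count how often each selected word occurs across all sentences."""
    wanted = set(selected_words)  # set lookup instead of scanning a list
    counts = Counter()
    for sentence in sentences:
        words = str(sentence).translate(_STRIP_TABLE).lower().split()
        counts.update(word for word in words if word in wanted)
    return dict(counts)
```

Whether this is actually faster on the real data would need to be checked with a timing run, for example with `timeit`.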

### Sample input

    selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']
    data_frame = 'These flannel wipes are OK, but in my opinion not worth keeping. I also ordered someImse Vimse Cloth Wipes-Ocean Blue-12 countwhich are larger, had a nicer, softer texture and just seemed higher quality. I use cloth wipes for hands and faces and have been usingThirsties 6 Pack Fab Wipes, Boyfor about 8 months now and need to replace them because they are starting to get rough and have had stink issues for a while that stripping no longer handles.'
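
Note that the sample above is a single string, while the function iterates over its first argument, so in practice the input would be a list or a Pandas column of such strings. The sketch below illustrates, under that assumption, how per-review totals could feed the min/max/mean step using only the standard library. The helper `selected_word_total` and the three short reviews are hypothetical and made up for illustration; they are not taken from the real data.

```python
import re
import string
from statistics import mean

# Character class of all punctuation except the apostrophe, compiled once.
PUNCT = re.compile("[" + re.escape(string.punctuation.replace("'", "")) + "]")


def selected_word_total(sentence, selected_words):
    """Return how many words in one sentence belong to selected_words."""
    wanted = set(selected_words)
    words = PUNCT.sub("", str(sentence)).lower().split()
    return sum(1 for word in words if word in wanted)


selected_words = ['awesome', 'great', 'bad']

# Hypothetical reviews, one string per row of the column.
reviews = [
    "Great wipes, really great!",
    "These are OK.",
    "Bad texture, but awesome price.",
]

# One total per review, then simple summary statistics over the totals.
totals = [selected_word_total(review, selected_words) for review in reviews]
summary = {"min": min(totals), "max": max(totals), "mean": mean(totals)}
```

With a real data frame, the list of reviews would presumably be replaced by the relevant column.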

200_success
Python, Extracting specific words from PANDAS dataframe

simkusr