I am new to Python. I think this code could be written in a better and more compact form, and it runs quite slowly because of the way I remove stop-words.
I want to find the 10 most frequent words in the column, excluding URL links, special characters, punctuation... and stop-words.
Any criticism or suggestions to improve the efficiency and readability of my code would be greatly appreciated. I would also like to know whether a dedicated Python module exists that gets the desired result more easily.
I have a dataframe df such that:
print(df['text'])
0 If I smelled the scent of hand sanitizers toda...
1 Hey @Yankees @YankeesPR and @MLB - wouldn't it...
2 @diane3443 @wdunlap @realDonaldTrump Trump nev...
3 @brookbanktv The one gift #COVID19 has give me...
4 25 July : Media Bulletin on Novel #CoronaVirus...
...
179103 Thanks @IamOhmai for nominating me for the @WH...
179104 2020! The year of insanity! Lol! #COVID19 http...
179105 @CTVNews A powerful painting by Juan Lucena. I...
179106 More than 1,200 students test positive for #CO...
179107 I stop when I see a Stop\n\n@SABCNews\n@Izinda...
Name: text, Length: 179108, dtype: object
I have done it in the following way:
import pandas as pd
import nltk
import re
import string
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
stop_words = stopwords.words()
def cleaning(text):
    # converting to lowercase, removing URL links, special characters, punctuation...
    text = text.lower()
    text = re.sub('https?://\S+|www\.\S+', '', text)
    text = re.sub('<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\n', '', text)
    text = re.sub('[’“”…]', '', text)

    # removing the emojis
    # https://www.kaggle.com/alankritamishra/covid-19-tweet-sentiment-analysis#Sentiment-analysis
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)

    # removing the stop-words
    text_tokens = word_tokenize(text)
    tokens_without_sw = [word for word in text_tokens if not word in stop_words]
    filtered_sentence = (" ").join(tokens_without_sw)
    text = filtered_sentence
    return text
dt = df['text'].apply(cleaning)
from collections import Counter
p = Counter(" ".join(dt).split()).most_common(10)
rslt = pd.DataFrame(p, columns=['Word', 'Frequency'])
print(rslt)
Word Frequency
0 covid19 104546
1 cases 18150
2 new 14585
3 coronavirus 14189
4 amp 12227
5 people 9079
6 pandemic 7944
7 us 7223
8 deaths 7088
9 health 5231
An example input/output of my cleaning() function:
inp = 'If I smelled the scent of hand sanitizers today on someone in the past, I would think they were so intoxicated that... https://t.co/QZvYbrOgb0'
outp = cleaning(inp)
print('Input:\n', inp)
print('Output:\n', outp)
Input:
If I smelled the scent of hand sanitizers today on someone in the past, I would think they were so intoxicated that... https://t.co/QZvYbrOgb0
Output:
smelled scent hand sanitizers today someone past would think intoxicated
1 Answer
Note: the data you're going through is 370k+ lines. Because I tend to run different versions of code a lot during a review, I've limited my version to 1000 rows.
Your code goes all over the place. Imports, downloads, another import, a variable being loaded, a function definition, the function being called and, oh, another import. In that order. Would you agree it's helpful to sort those, so we can easily find what we're looking for?
The revised head of the file would look like this:
import re
import string
from collections import Counter

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
After that, we'd ordinarily put the function definition. However, there's part of the program that doesn't have to be in the function itself. It only has to be executed once, even if multiple files are handled.
# removing the emojis
# https://www.kaggle.com/alankritamishra/covid-19-tweet-sentiment-analysis#Sentiment-analysis
EMOJI_PATTERN = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
The variable is in UPPER_CASE now, because it's a pseudo-constant (Python doesn't really have constants, but the naming reminds you and other developers that the variable should be set once and only once). It's customary to put pseudo-constants between the imports and the function definitions, so you know where to look for them.
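While we're on module-level names: part of the slowness you mention comes from stop_words = stopwords.words(). Called without a language argument, it returns the stop words of every language NLTK ships, as one long list, and word in some_list scans that list once per token. If English is all you need, a set (another good pseudo-constant) makes each membership test effectively constant-time. A minimal sketch, assuming English-only text:

# Set membership is O(1) on average; list membership is O(len(list)).
STOP_WORDS = frozenset(stopwords.words('english'))

The rest of the code can stay exactly the same, since the in operator works on sets and lists alike.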
Now, the rest of the program is mostly fine already. You could use more functions, but with a program this size that would mostly be an exercise. I'd rename some of the variables, cut up the long lines and use a proper docstring (you had a great start already with the comment at the top of cleaning). I'd also prepare the program for re-use: it would be nice to simply import from this file instead of having to copy code into the next few projects, wouldn't it? And since we don't want to run the specifics of this program every time it's imported, we explicitly run them only when the file is executed directly.
import re
import string
from collections import Counter

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

STOP_WORDS = stopwords.words()

# removing the emojis
# https://www.kaggle.com/alankritamishra/covid-19-tweet-sentiment-analysis#Sentiment-analysis
EMOJI_PATTERN = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
def cleaning(text):
    """
    Convert to lowercase.
    Remove URL links, special characters and punctuation.
    Tokenize and remove stop words.
    """
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub('[’“”…]', '', text)
    text = EMOJI_PATTERN.sub(r'', text)

    # removing the stop-words
    text_tokens = word_tokenize(text)
    tokens_without_sw = [word for word in text_tokens
                         if word not in STOP_WORDS]
    return " ".join(tokens_without_sw)
if __name__ == "__main__":
    max_rows = 1000  # 'None' to read the whole file
    input_file = 'covid19_tweets.csv'
    df = pd.read_csv(input_file,
                     delimiter=',',
                     nrows=max_rows,
                     engine="python")

    dt = df['text'].apply(cleaning)
    word_count = Counter(" ".join(dt).split()).most_common(10)
    word_frequency = pd.DataFrame(word_count, columns=['Word', 'Frequency'])
    print(word_frequency)
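With the guard in place, re-use becomes trivial. If the file above were saved as, say, clean_tweets.py (a hypothetical name), another project could import the cleaning function without triggering the CSV processing:

from clean_tweets import cleaning  # 'clean_tweets' is a hypothetical file name

print(cleaning('Wash your hands! https://t.co/QZvYbrOgb0'))
# wash hands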
Naturally, if you want a more memory-efficient version, you could cut out all the intermediate variables in those last few lines, as sketched below. That would make it a little harder to read, though. As long as you're not reading multiple large files into memory in the same program, it should be fine.
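For illustration, a minimal sketch of that compressed form, same behaviour, fewer names:

word_frequency = pd.DataFrame(
    Counter(" ".join(df['text'].apply(cleaning)).split()).most_common(10),
    columns=['Word', 'Frequency'])
print(word_frequency)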
Some of the advice I've provided comes from PEP 8, the official Python style guide. I can highly recommend taking a look at it.
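As for your question about a dedicated module: you already have one on hand. pandas itself can produce the frequency table once the text is cleaned, so Counter isn't strictly needed. A sketch of that approach, reusing the same cleaning function:

word_frequency = (df['text'].apply(cleaning)
                  .str.split()     # split each cleaned tweet into words
                  .explode()       # one row per word
                  .value_counts()  # count occurrences of each word
                  .head(10)
                  .rename_axis('Word')
                  .reset_index(name='Frequency'))
print(word_frequency)

If you later need per-document counts or n-grams, scikit-learn's CountVectorizer is the usual dedicated tool for this kind of job.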