I have a business requirement where I have to read Excel files, match each cell against a given keyword list, and write the matches into a dataframe.
I have written the code below, but as the Excel file grows (7 MB, 5 sheets, roughly 30k rows x 9 columns) it takes an unusually long time (40 minutes). I know the nested for loops are the likely cause, but is there any way to optimize it?
I have tried several different approaches, but nothing has worked out well.
import pandas as pd
import re
import datetime
import numpy as np

raw_details_list = []

# there are around 10 keys and 600 values
keyword_dic = {"it spend": ['IT OpEx', 'OpEx'], "Infra": ["Dell", "Oracle"]}

def pattern(keys):
    return re.compile('|'.join([r'(?<!\w)%s(?!\w)' % re.escape(keys)]), flags=re.I)

# this reads a single cell of the Excel file and matches it against keyword_dic
def sheet_handler(path, page_count, row):
    try:
        global keyword_dic
        if isinstance(row, str):
            for category, keywords in keyword_dic.items():
                for keys in keywords:
                    r = pattern(keys)
                    words = r.findall(row)
                    # storing the matches in a list of dictionaries
                    for i in words:
                        temp_dic = {"File Path": path,
                                    "Total Number of Page(s)/Slide(s)": page_count,
                                    "Keyword": keys,
                                    "IT Category": category}
                        raw_details_list.append(temp_dic)
    except Exception as e:
        pass

# reading the Excel file (all sheets) and creating a dataframe
def xlsx_handler(path):
    # reading the Excel file
    sheets_dict = pd.read_excel(path, sheet_name=None, header=None)
    sheet_count = len(sheets_dict)
    for name, sheet in sheets_dict.items():
        # removing float type columns
        sheet = sheet.loc[:, sheet.dtypes != np.float64]
        # reading each row; we have to keep track of each cell's row and column details
        for index, row in sheet.iterrows():
            for i in range(len(row)):
                try:
                    # if the content is float, we can skip it
                    float(row[i])
                except Exception as e:
                    sheet_handler(path, sheet_count, row[i])

print(datetime.datetime.now())
xlsx_handler(r"D:\Backend Python\test\test.xlsx")
df = pd.DataFrame(raw_details_list)
print(datetime.datetime.now())
Sample data:

    1   Oracle ERP would be good   United states
    2   Dell Laptops               UK
    3   Vaio Off-limit             India
    4   Oracle 1.2 E-series        Chine
The output for this should look somewhat like this:

    File Path   Pages   Keyword   IT Category
    D://Data    1       Oracle    Infra
    D://Data    1       Oracle    Infra
    D://Data    1       Dell      Infra
Comments:

- First off, I think you should ditch the for loops; Pandas' vectorized methods could help significantly. It would also be helpful if you shared a sample of your data (not pictures or links) along with the keyword list you are trying to filter for, and included your expected output. But really, ditch the for loops. – sammywemmy, Apr 1, 2020
- I have uploaded sample input and expected output. I have already shared some sample keywords. – Sankar, Apr 1, 2020
- engineering.upside.com/… is a great read. I should do more of this. I agree that this is more of a "code review". – David Erickson, Apr 1, 2020
- Sankar, now that this has been migrated, do add more explanation of what your code does. See "How to get the best value out of Code Review - Asking Questions" for what makes a good Code Review question. – Martijn Pieters, Apr 1, 2020
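Following up on sammywemmy's comment about vectorized methods, a rough sketch of what that could look like for this task is shown below. It is only an illustration, not part of the original question or answer: the helper name match_workbook is made up, it reuses the keyword_dic structure from the question, it records the matched text rather than the canonical keyword, and it does not skip numeric cells.

    import re
    import pandas as pd

    # same structure as keyword_dic in the question (around 10 keys, 600 values)
    keyword_dic = {"it spend": ["IT OpEx", "OpEx"], "Infra": ["Dell", "Oracle"]}

    # one compiled pattern per category, alternating all of that category's keywords
    patterns = {
        category: re.compile(
            "|".join(r"(?<!\w)%s(?!\w)" % re.escape(k) for k in keywords),
            flags=re.I,
        )
        for category, keywords in keyword_dic.items()
    }

    def match_workbook(path):
        rows = []
        sheets = pd.read_excel(path, sheet_name=None, header=None)
        for name, sheet in sheets.items():
            # flatten every cell of the sheet into one string Series
            # (numeric cells are stringified here; filter them out first if needed)
            cells = sheet.stack().astype(str)
            for category, pat in patterns.items():
                # str.findall runs over the whole Series instead of cell-by-cell Python loops
                hits = cells.str.findall(pat).explode().dropna()
                for matched_text in hits:
                    rows.append({"File Path": path,
                                 "Total Number of Page(s)/Slide(s)": len(sheets),
                                 "Keyword": matched_text,
                                 "IT Category": category})
        return pd.DataFrame(rows)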
1 Answer
The issue has been resolved.

The problem was that I was creating the pattern (r = pattern(keys)) for every keyword (remember, I had 600 keywords) for every cell of the Excel file, which was completely unnecessary. So for a 10 (rows) * 10 (columns) Excel file, the same 600 patterns were being compiled over and over again: 600 * 10 * 10 = 60,000 compilations in total. Instead, I now create a dictionary of compiled patterns just once and reuse it for all 10 * 10 cells:
    list_of_patterns = {}
    for cat, keywords in keyword_dic.items():
        temp = [pattern(keyword) for keyword in keywords]
        list_of_patterns[cat] = temp
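The post does not show how sheet_handler changed after this, so here is a minimal sketch of how the pre-compiled dictionary might be used, assuming the rest of the code stays as in the question:

    def sheet_handler(path, page_count, row):
        # 'row' is a single cell value, as in the original code
        if isinstance(row, str):
            for category, compiled in list_of_patterns.items():
                # pair each pre-compiled pattern with its keyword so the match
                # can still be reported under that keyword
                for keys, r in zip(keyword_dic[category], compiled):
                    for _ in r.findall(row):
                        raw_details_list.append({
                            "File Path": path,
                            "Total Number of Page(s)/Slide(s)": page_count,
                            "Keyword": keys,
                            "IT Category": category,
                        })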
Thanks everyone, for taking the time to go through the post!