2
\$\begingroup\$

I have a business requirement where I have to read excel files and match each cell with a given keyword list and find the matches and write it into a dataframe.

I have written the code, but when the excel size is increasing (7 MB - 5 sheets, 30k * 9), the code is taking unusually long time (40 minutes), and I know that it should since I have so many for loops, but is there anyway to optimize it?

I have tried many different ways to optimize the same, but nothing worked out well.

import pandas as pd
import re
import datetime
import numpy as np
raw_details_list = []
# there are around 10 keys and 600 values
keyword_dic = {"it spend": ['IT OpEx', 'OpEx'], "Infra": ["Dell", "Oracle"]}
def pattern(keys):
 return re.compile('|'.join([r'(?<!\w)%s(?!\w)' % re.escape(keys)]), flags=re.I)
# this is reading each cell of the excel and doing a match whether with the keyword_dic
def sheet_handler(path, page_count, row):
 try:
 global keyword_dic
 if isinstance(row, str):
 for category, keywords in keyword_dic.items():
 for keys in keywords:
 r = pattern(keys)
 words = r.findall(row)
 # storing the matches in a list of dictionary
 for i in words:
 temp_dic = {"File Path": path, "Total Number of Page(s)/Slide(s)": page_count, "Keyword": keys,
 "IT Category":category}
 raw_details_list.append(temp_dic)
 except Exception as e:
 pass
# reading the excel - all sheets and creating a dataframe
def xlsx_handler(path):
 # reading the excel file
 sheets_dict = pd.read_excel(path, sheet_name=None, header=None)
 sheet_count = len(sheets_dict)
 for name, sheet in sheets_dict.items():
 # removing float type columns
 sheet = sheet.loc[:, sheet.dtypes != np.float64]
 # reading each row, we have keep track of each cell's row and column details
 for index, row in sheet.iterrows():
 for i in range(len(row)):
 try:
 # if the content is float, we can skip it
 float(row[i])
 except Exception as e:
 sheet_handler(path, sheet_count, row[i])
print(datetime.datetime.now())
xlsx_handler(r"D:\Backend Python\test\test.xlsx")
df = pd.DataFrame(raw_details_list)
print(datetime.datetime.now())

Sample data:

1 Oracle ERP would be good United states
2 Dell Laptops UK
3 Vaio Off-limit India
4 Oracle 1.2 E-series Chine

The output for this should somewhat look like this:

File Path Pages Keyword IT Category
D://Data 1 Oracle Infra
D://Data 1 Oracle Infra
D://Data 1 Dell Infra
asked Apr 1, 2020 at 4:25
\$\endgroup\$
4
  • \$\begingroup\$ first off I think you should ditch for loops - Pandas vectorized methods could significantly help. It would also be helpful if u shared a sample of ur data (not pics, or links) with the keyword list u r trying to filter for. and include ur expected output. but really, ditch the for loops \$\endgroup\$ Commented Apr 1, 2020 at 4:27
  • \$\begingroup\$ Have uploaded sample input and expected output. I have already shared some sample keywords. \$\endgroup\$ Commented Apr 1, 2020 at 4:41
  • 2
    \$\begingroup\$ engineering.upside.com/… -- great read. I should do more of this. I agree that this more of a "code review" \$\endgroup\$ Commented Apr 1, 2020 at 5:06
  • 2
    \$\begingroup\$ Sankar, now that this has been migrated, do add more explanation as to what your code does. See How to get the best value out of Code Review - Asking Questions for what makes a good Code Review question. \$\endgroup\$ Commented Apr 1, 2020 at 10:56

1 Answer 1

1
\$\begingroup\$

The issue has been resolved.

The problem was I was creating the pattern r = pattern(keys) for each keyword (remind you i had 600 keywords) for each cell of the excel which was completely unnecessary.

So if i had a 10(rows) * 10 (columns) excel file, the same 600 pattern was getting created 600 * 10 * 10 times.

So I created one dictionary of patterns just once and used the same 10 * 10 times.

for cat,key in keyword_dic.items():
 temp = [pattern(keyword) for keyword in key]
 list_of_patterns[cat] = temp

Thanks everyone, for taking the time to go through the post!

answered Apr 1, 2020 at 14:20
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.