I have a business requirement where I have to read Excel files, match each cell against a given keyword list, and write the matches into a dataframe.
I have written the code below, but as the Excel file grows (7 MB, 5 sheets, roughly 30k rows x 9 columns) it takes an unusually long time (40 minutes). I know the nested for loops are the likely cause, but is there any way to optimize it?
I have tried several different approaches, but nothing has worked out well.
import pandas as pd
import re
import datetime
import numpy as np

raw_details_list = []

# there are around 10 keys and 600 values
keyword_dic = {"it spend": ['IT OpEx', 'OpEx'], "Infra": ["Dell", "Oracle"]}

def pattern(keys):
    return re.compile('|'.join([r'(?<!\w)%s(?!\w)' % re.escape(keys)]), flags=re.I)

# this reads a single cell of the Excel file and matches it against keyword_dic
def sheet_handler(path, page_count, row):
    try:
        global keyword_dic
        if isinstance(row, str):
            for category, keywords in keyword_dic.items():
                for keys in keywords:
                    r = pattern(keys)
                    words = r.findall(row)
                    # storing the matches in a list of dictionaries
                    for i in words:
                        temp_dic = {"File Path": path,
                                    "Total Number of Page(s)/Slide(s)": page_count,
                                    "Keyword": keys,
                                    "IT Category": category}
                        raw_details_list.append(temp_dic)
    except Exception as e:
        pass

# reading the Excel file (all sheets) and creating a dataframe
def xlsx_handler(path):
    # reading the Excel file
    sheets_dict = pd.read_excel(path, sheet_name=None, header=None)
    sheet_count = len(sheets_dict)
    for name, sheet in sheets_dict.items():
        # removing float type columns
        sheet = sheet.loc[:, sheet.dtypes != np.float64]
        # reading each row; we have to keep track of each cell's row and column details
        for index, row in sheet.iterrows():
            for i in range(len(row)):
                try:
                    # if the content is float, we can skip it
                    float(row[i])
                except Exception as e:
                    sheet_handler(path, sheet_count, row[i])

print(datetime.datetime.now())
xlsx_handler(r"D:\Backend Python\test\test.xlsx")
df = pd.DataFrame(raw_details_list)
print(datetime.datetime.now())
Sample data:

    1   Oracle ERP would be good   United states
    2   Dell Laptops               UK
    3   Vaio Off-limit             India
    4   Oracle 1.2 E-series        Chine
The output for this should look somewhat like this:

    File Path   Pages   Keyword   IT Category
    D://Data    1       Oracle    Infra
    D://Data    1       Oracle    Infra
    D://Data    1       Dell      Infra
Comments:

- First off, I think you should ditch the for loops; Pandas' vectorized methods could help significantly. It would also be helpful if you shared a sample of your data (not pictures or links) along with the keyword list you are trying to filter for, and included your expected output. But really, ditch the for loops. – sammywemmy, Apr 1, 2020
- I have uploaded sample input and expected output. I have already shared some sample keywords. – Sankar, Apr 1, 2020
- engineering.upside.com/… is a great read. I should do more of this. I agree that this is more of a "code review". – David Erickson, Apr 1, 2020
- Sankar, now that this has been migrated, do add more explanation of what your code does. See "How to get the best value out of Code Review - Asking Questions" for what makes a good Code Review question. – Martijn Pieters, Apr 1, 2020
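Following up on sammywemmy's comment about vectorized methods, a rough sketch of what that could look like for this task is shown below. It is only an illustration, not part of the original question or answer: the helper name match_workbook is made up, it reuses the keyword_dic structure from the question, it records the matched text rather than the canonical keyword, and it does not skip numeric cells.

    import re
    import pandas as pd

    # same structure as keyword_dic in the question (around 10 keys, 600 values)
    keyword_dic = {"it spend": ["IT OpEx", "OpEx"], "Infra": ["Dell", "Oracle"]}

    # one compiled pattern per category, alternating all of that category's keywords
    patterns = {
        category: re.compile(
            "|".join(r"(?<!\w)%s(?!\w)" % re.escape(k) for k in keywords),
            flags=re.I,
        )
        for category, keywords in keyword_dic.items()
    }

    def match_workbook(path):
        rows = []
        sheets = pd.read_excel(path, sheet_name=None, header=None)
        for name, sheet in sheets.items():
            # flatten every cell of the sheet into one string Series
            # (numeric cells are stringified here; filter them out first if needed)
            cells = sheet.stack().astype(str)
            for category, pat in patterns.items():
                # str.findall runs over the whole Series instead of cell-by-cell Python loops
                hits = cells.str.findall(pat).explode().dropna()
                for matched_text in hits:
                    rows.append({"File Path": path,
                                 "Total Number of Page(s)/Slide(s)": len(sheets),
                                 "Keyword": matched_text,
                                 "IT Category": category})
        return pd.DataFrame(rows)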
1 Answer
The issue has been resolved.

The problem was that I was creating the pattern (r = pattern(keys)) for every keyword (remember, I had 600 keywords) for every cell of the Excel file, which was completely unnecessary. So for a 10 (rows) * 10 (columns) Excel file, the same 600 patterns were being compiled over and over again: 600 * 10 * 10 = 60,000 compilations in total. Instead, I now create a dictionary of compiled patterns just once and reuse it for all 10 * 10 cells:
    list_of_patterns = {}
    for cat, keywords in keyword_dic.items():
        temp = [pattern(keyword) for keyword in keywords]
        list_of_patterns[cat] = temp
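The post does not show how sheet_handler changed after this, so here is a minimal sketch of how the pre-compiled dictionary might be used, assuming the rest of the code stays as in the question:

    def sheet_handler(path, page_count, row):
        # 'row' is a single cell value, as in the original code
        if isinstance(row, str):
            for category, compiled in list_of_patterns.items():
                # pair each pre-compiled pattern with its keyword so the match
                # can still be reported under that keyword
                for keys, r in zip(keyword_dic[category], compiled):
                    for _ in r.findall(row):
                        raw_details_list.append({
                            "File Path": path,
                            "Total Number of Page(s)/Slide(s)": page_count,
                            "Keyword": keys,
                            "IT Category": category,
                        })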
Thanks everyone, for taking the time to go through the post!