3
\$\begingroup\$

I'm somewhat new to python and wrote this piece of code to do a string comparison of accounts that are being requested for import into our data base against accounts that are already present. The issue is that the accounts currently in our DB is over 65K and I'm comparing over 5K accounts for import causing this code to take over 5 hours to run. I suspect this has to do with the loop I'm using but I'm not certain how to improve it.

TLDR; I need help optimizing this code so it has a shorter run time.

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
accounts_DB = pd.read_csv("file.csv") #65,000 rows and 15 columns
accounts_SF = pd.read_csv("Requested Import.csv") #5,000 rows and 30 columns 
def NameComparison(DB_account, choices):
 """Function uses fuzzywuzzy module to perform Levenshtein distance string comparison"""
 return(process.extractBests(DB_account, choices, score_cutoff= 95))
options = accounts_sf["Account Name"]
a_list = []
for i in range(len(accounts_db)):
 a_list.append(NameComparison(accounts_db.at[i,"Company Name"], options))
b_list = pd.DataFrame(a_list)
b_list.to_csv("Matched Accounts.csv")
asked Dec 31, 2018 at 17:05
\$\endgroup\$
2
  • \$\begingroup\$ Welcome to Code Review! What task does this code accomplish? Please tell us, and also make that the title of the question via edit. Maybe you missed the placeholder on the title element: "State the task that your code accomplishes. Make your title distinctive.". Also from How to Ask: "State what your code does in your title, not your main concerns about it.". \$\endgroup\$ Commented Dec 31, 2018 at 18:24
  • 1
    \$\begingroup\$ Thank you, I've gone ahead and adjusted my title. My codes main goal is to compare two strings using fuzzy string comparison \$\endgroup\$ Commented Dec 31, 2018 at 19:26

1 Answer 1

3
\$\begingroup\$

To apply the same function to each row of a dataframe column, you usually use pd.Series.map or pd.Series.apply. You can thus simplify your code to:

from functools import partial
from fuzzywuzzy import process
accounts_DB = pd.read_csv("file.csv") #65,000 rows and 15 columns
accounts_SF = pd.read_csv("Requested Import.csv") #5,000 rows and 30 columns
best_matches = partial(process.extractBests, choices=accounts_SF['Account Name'], score_cutoff=95)
accounts_DB['Company Name'].map(best_matches).to_csv("Matched Accounts.csv")
answered Dec 31, 2018 at 22:42
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.