Name comparison using fuzzy string matching

Question 1

I'm somewhat new to python and wrote this piece of code to do a string comparison of accounts that are being requested for import into our data base against accounts that are already present. The issue is that the accounts currently in our DB is over 65K and I'm comparing over 5K accounts for import causing this code to take over 5 hours to run. I suspect this has to do with the loop I'm using but I'm not certain how to improve it.

TLDR; I need help optimizing this code so it has a shorter run time.

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
accounts_DB = pd.read_csv("file.csv") #65,000 rows and 15 columns
accounts_SF = pd.read_csv("Requested Import.csv") #5,000 rows and 30 columns 
def NameComparison(DB_account, choices):
 """Function uses fuzzywuzzy module to perform Levenshtein distance string comparison"""
 return(process.extractBests(DB_account, choices, score_cutoff= 95))
options = accounts_sf["Account Name"]
a_list = []
for i in range(len(accounts_db)):
 a_list.append(NameComparison(accounts_db.at[i,"Company Name"], options))
b_list = pd.DataFrame(a_list)
b_list.to_csv("Matched Accounts.csv")

Question 2

Welcome to Code Review! What task does this code accomplish? Please tell us, and also make that the title of the question via edit. Maybe you missed the placeholder on the title element: "State the task that your code accomplishes. Make your title distinctive.". Also from How to Ask: "State what your code does in your title, not your main concerns about it.".

Question 3

Thank you, I've gone ahead and adjusted my title. My codes main goal is to compare two strings using fuzzy string comparison

Question 4

To apply the same function to each row of a dataframe column, you usually use pd.Series.map or pd.Series.apply. You can thus simplify your code to:

from functools import partial
from fuzzywuzzy import process
accounts_DB = pd.read_csv("file.csv") #65,000 rows and 15 columns
accounts_SF = pd.read_csv("Requested Import.csv") #5,000 rows and 30 columns
best_matches = partial(process.extractBests, choices=accounts_SF['Account Name'], score_cutoff=95)
accounts_DB['Company Name'].map(best_matches).to_csv("Matched Accounts.csv")

score 3 · Accepted Answer · 2018-12-31 22:42:33Z

To apply the same function to each row of a dataframe column, you usually use pd.Series.map or pd.Series.apply. You can thus simplify your code to:

from functools import partial
from fuzzywuzzy import process
accounts_DB = pd.read_csv("file.csv") #65,000 rows and 15 columns
accounts_SF = pd.read_csv("Requested Import.csv") #5,000 rows and 30 columns
best_matches = partial(process.extractBests, choices=accounts_SF['Account Name'], score_cutoff=95)
accounts_DB['Company Name'].map(best_matches).to_csv("Matched Accounts.csv")

Stack Exchange Network

Name comparison using fuzzy string matching

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Name comparison using fuzzy string matching

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions