I have two dataframes. The first contains company names and their corresponding texts; the texts are stored in lists.
supplier_company_name    Main_Text
JDA SOFTWARE             ['Supply chains', 'The answer is simple -RunJDA!']
PTC                      ['Hello', 'Solution']
The second dataframe contains texts extracted from each company's website.
   Company       Text
0  JDA SOFTWARE  About | JDA Software
1  JDA SOFTWARE  833.JDA.4ROI
2  JDA SOFTWARE  Contact Us
3  JDA SOFTWARE  Customer Support
4  PTC           Training
5  PTC           Partner Advantage
I want to create a new column in the second dataframe: True if the text extracted from the web matches any item in the list in the Main_Text column of the first dataframe, and False otherwise.
Code:
from tqdm import tqdm

target = []
for x in tqdm(range(len(df['supplier_company_name']))):  # company names in df1
    for y in range(len(samp['Company'])):  # company names in df2
        if samp['Company'][y] == df['supplier_company_name'][x]:  # company names match
            # check whether the web text is in this company's list
            if samp['Text'][y] in df['Main_Text'][x]:
                target.append(True)
            else:
                target.append(False)
How can I change my code to run more efficiently?
1 Answer
I'll assume that your first dataframe (`df`) has unique company names. If so, you can index it by company name and extract the one remaining column, `Main_Text`, as a `Series`, which then behaves much like a good old `dict`:
main_text = df.set_index('supplier_company_name')['Main_Text']
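To see why this lookup works like a dict, here is a minimal sketch using the two sample rows from the question. The dataframe is rebuilt by hand here, which is only an assumption about how your `df` is constructed:

```python
import pandas as pd

# Sample rows copied from the question; building df by hand is an assumption.
df = pd.DataFrame({
    "supplier_company_name": ["JDA SOFTWARE", "PTC"],
    "Main_Text": [
        ["Supply chains", "The answer is simple -RunJDA!"],
        ["Hello", "Solution"],
    ],
})

# Index by company name and keep only the Main_Text column.
main_text = df.set_index("supplier_company_name")["Main_Text"]

# A label-based lookup now returns the list stored for that company.
ptc_texts = main_text.loc["PTC"]
print(ptc_texts)
```

This only returns a single list per company if the names are unique; with duplicated names, `.loc` would return a `Series` of lists instead.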
Now we just need to iterate over each row in `samp`, fetch the main text for the company in the first column (`Company`), and compute a boolean from that and the second column (`Text`). This is a job for `apply`:
target = samp.apply(lambda row: row['Text'] in main_text.loc[row['Company']], axis=1)
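Putting it together, here is a small end-to-end sketch. The dataframes are rebuilt by hand from the question's samples, and the last `samp` row (`PTC` / `Solution`) is a hypothetical row I added so that at least one match shows up; it is not in your data. Storing the result in a column named `target` is also my assumption:

```python
import pandas as pd

# First dataframe: one row per company, texts stored as lists (from the question).
df = pd.DataFrame({
    "supplier_company_name": ["JDA SOFTWARE", "PTC"],
    "Main_Text": [
        ["Supply chains", "The answer is simple -RunJDA!"],
        ["Hello", "Solution"],
    ],
})

# Second dataframe: texts scraped from the websites.
# The last row is hypothetical, added only to demonstrate a True result.
samp = pd.DataFrame({
    "Company": ["JDA SOFTWARE", "JDA SOFTWARE", "PTC", "PTC"],
    "Text": ["About | JDA Software", "Contact Us", "Training", "Solution"],
})

# Dict-like lookup table: company name -> list of texts.
main_text = df.set_index("supplier_company_name")["Main_Text"]

# For each row, test whether its Text is in the list stored for its Company.
samp["target"] = samp.apply(
    lambda row: row["Text"] in main_text.loc[row["Company"]], axis=1
)
print(samp)
```

Note that `main_text.loc[...]` raises a `KeyError` if `samp` contains a company that is missing from `df`, so you may want to filter those rows out first.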