How to speed up the search for matching values in a second data frame in a line-by-line iteration over a first data frame?

Question 1

I have a survey response dataframe for which each survey has a code:

df

     code         item_stamp  question_name  question_type  scorable_question  subquestion_name     stage  products_stamp  product_ID  answer_name      respondent_id  answers_identity  answer  Test Code
0  006032  '173303553131045'      Age group         single                1.0               NaN  Screener             NaN        <NA>      31 - 45  '173303561331047'  '11357427731020'       2       6032
1  006032  '173303553131045'      Age group         single                1.0               NaN  Screener             NaN        <NA>      31 - 45  '173303561431047'  '11357427731020'       2       6032

I also have a dataframe with the type of each survey, identified by Test Code:

df_synthesis_clean

 Country Country Code Category Application Gender Program Evaluation Stage Context Packaging Code Test Code Test Completion Agency Deadline Product Type Line Extension Dosage Fragrance House_ID product_ID Liking Mean Liking Scale Olfactive Family Olfactive Subfamily OLFACTIVE CLUSTER EASY FRESH TEXTURED WARM SIGNATURE QUALIFICATION VERT ORANGE ROUGE TOP SELLER TOP TESTER
0 France FR Fine Men Fragrances Perf/Edt/A-Shave/Col (FM) M scent hunter clst - sniff - on glass ball jar Blind NaN 3879 4/15/2016 0:00 NaN Market Product EDT 12.0 817.0 8082451124 5.55 0 to 10 WOODY Floral TEXTURED WARM NaN NaN NaN NaN
1 USA US Fine Men Fragrances Perf/Edt/A-Shave/Col (FM) M scent hunter clst - sniff - on glass ball jar Blind NaN 3855 4/15/2016 0:00 NaN Market Product EDT 12.0 817.0 8082451124 4.88 0 to 10 WOODY Floral TEXTURED WARM NaN NaN NaN NaN

I want to add a column to df giving the type of Program that produced each response (Flash or non-flash).

I have the test id in df and the test type in df_synthesis_clean. So I tried the following in Google Colaboratory, without a GPU (because I don't know how to use one):

for _, row in df.iterrows():
    # Look in the df_synthesis_clean table for the Test Code that matches row['code'].
    # I have to use .iloc[0] because a study corresponds to several tested products, but the responses do not have
    program = df_synthesis_clean.loc[df_synthesis_clean['Test Code'] == row['code']].iloc[0]['Program']
    row['Program'] = program

It works on a small amount of data, but I now have more than three million rows in df, so it takes a long time.

Comment

Did my suggestion work for you?

Answer 1

Iterating over the rows of a data frame is usually very slow. To combine the values of two data frames, you can use pandas.merge instead, like this:

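import pandas as pd

# Keep only the key and the column to bring over from the lookup table
cols = ['Test Code', 'Program']
# pd.merge returns a new data frame, so keep the result
merged = pd.merge(df, df_synthesis_clean[cols], left_on='code', right_on='Test Code')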
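One thing to watch, given that the question says a study corresponds to several tested products: if a Test Code appears on several rows of df_synthesis_clean, a plain merge repeats each response once per matching row. A minimal sketch of one way around that, assuming Program is the same for every row of a given test; the toy frames and their values are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for the question's frames (values invented for illustration)
df = pd.DataFrame({
    'code': [6032, 6032, 3879],
    'answer': [2, 2, 5],
})
df_synthesis_clean = pd.DataFrame({
    'Test Code': [6032, 6032, 3879],  # 6032 appears twice: two tested products
    'Program': ['Flash', 'Flash', 'scent hunter'],
    'product_ID': [1, 2, 3],
})

# Keep one row per test, mirroring the .iloc[0] choice in the original loop
lookup = df_synthesis_clean[['Test Code', 'Program']].drop_duplicates(subset='Test Code')

merged = pd.merge(df, lookup, left_on='code', right_on='Test Code')

# With the de-duplicated lookup, the row count of df is preserved
print(len(df), len(merged))
```

Without the drop_duplicates step, the two responses for test 6032 would each match two lookup rows, so the result would grow from three rows to five.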

When using pd.merge, take care to choose the right value for the optional how parameter. The default, how='inner', keeps only the rows whose key is present in both data frames. If you want to keep every row of df even when its code has no match in df_synthesis_clean, pass how='left' instead.
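To make the effect of how concrete, here is a small sketch comparing the default inner merge with a left merge. The frames and the code 9999 are invented for illustration; indicator=True is an optional pd.merge argument that adds a _merge column showing where each row came from:

```python
import pandas as pd

# Toy frames: code 9999 has no counterpart in the lookup table
df = pd.DataFrame({'code': [6032, 9999], 'answer': [2, 3]})
lookup = pd.DataFrame({'Test Code': [6032], 'Program': ['Flash']})

# Default (inner): the unmatched response is dropped
inner = pd.merge(df, lookup, left_on='code', right_on='Test Code')

# Left: every response is kept; unmatched ones get NaN in Program
left = pd.merge(df, lookup, left_on='code', right_on='Test Code',
                how='left', indicator=True)

print(len(inner), len(left))
print(left[['code', 'Program', '_merge']])
```

Filtering the left merge on _merge == 'left_only' is a quick way to list the responses whose code was not found.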

Please let me know if it works. Unfortunately, I cannot test my code at the moment.
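If the merge raises an error or finds no matches, it may be worth checking that the two key columns have the same type. In the printed sample, code is shown as 006032 while Test Code is shown as 6032, which suggests (this is an assumption, not something the question confirms) that code is stored as a string with leading zeros and Test Code as an integer. A hedged sketch of aligning them first, with made-up values and a hypothetical helper column code_int:

```python
import pandas as pd

# Assumed situation: string codes on one side, integer codes on the other
df = pd.DataFrame({'code': ['006032', '003879']})
lookup = pd.DataFrame({'Test Code': [6032, 3879],
                       'Program': ['Flash', 'scent hunter']})

# Convert the string codes to integers so both keys share a dtype
# (code_int is a hypothetical helper column, not part of the original data)
df['code_int'] = df['code'].astype(int)

merged = pd.merge(df, lookup, left_on='code_int', right_on='Test Code', how='left')
print(merged[['code', 'Program']])
```

You can inspect the actual types with df['code'].dtype and df_synthesis_clean['Test Code'].dtype before deciding whether a conversion is needed.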

Flursch · Answer 1 · 2021-11-17 12:36:55Z

