I have a survey response dataframe in which each survey has a code:
df
code item_stamp question_name question_type scorable_question subquestion_name stage products_stamp product_ID answer_name respondent_id answers_identity answer Test Code
0 006032 '173303553131045' Age group single 1.0 NaN Screener NaN <NA> 31 - 45 '173303561331047' '11357427731020' 2 6032
1 006032 '173303553131045' Age group single 1.0 NaN Screener NaN <NA> 31 - 45 '173303561431047' '11357427731020' 2 6032
I also have a dataframe with the type of each survey, identified by Test Code:
df_synthesis_clean
Country Country Code Category Application Gender Program Evaluation Stage Context Packaging Code Test Code Test Completion Agency Deadline Product Type Line Extension Dosage Fragrance House_ID product_ID Liking Mean Liking Scale Olfactive Family Olfactive Subfamily OLFACTIVE CLUSTER EASY FRESH TEXTURED WARM SIGNATURE QUALIFICATION VERT ORANGE ROUGE TOP SELLER TOP TESTER
0 France FR Fine Men Fragrances Perf/Edt/A-Shave/Col (FM) M scent hunter clst - sniff - on glass ball jar Blind NaN 3879 4/15/2016 0:00 NaN Market Product EDT 12.0 817.0 8082451124 5.55 0 to 10 WOODY Floral TEXTURED WARM NaN NaN NaN NaN
1 USA US Fine Men Fragrances Perf/Edt/A-Shave/Col (FM) M scent hunter clst - sniff - on glass ball jar Blind NaN 3855 4/15/2016 0:00 NaN Market Product EDT 12.0 817.0 8082451124 4.88 0 to 10 WOODY Floral TEXTURED WARM NaN NaN NaN NaN
I want to add a column with the type of Program that produced each response (Flash or non-flash).
I have the test ID in df and the test type in df_synthesis_clean, so I tried the following in Google Colaboratory without a GPU (because I don't know how to use one):
for _, row in df.iterrows():
    # Look in df_synthesis_clean for the rows whose Test Code matches row['code'].
    # .iloc[0] is needed because a study corresponds to several tested products, but the responses do not have
    program = df_synthesis_clean.loc[df_synthesis_clean['Test Code'] == row['code']].iloc[0]['Program']
    row['Program'] = program
It works on a small amount of data, but I now have more than three million rows in df, so it takes a long time.
Comment by Flursch (Nov 27, 2021 at 17:34): Did my suggestion work for you?
1 Answer
Iterating over the rows of a data frame is usually very slow. To combine the values of two data frames, you can use pandas.merge instead, like this:
import pandas as pd
# Keep only the key column and the column to add.
cols = ['Test Code', 'Program']
# pd.merge returns a new data frame, so store the result.
result = pd.merge(df, df_synthesis_clean[cols], left_on='code', right_on='Test Code')
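One thing to watch for: the question mentions that a study corresponds to several tested products, so a Test Code may appear more than once in df_synthesis_clean, and a plain merge would then repeat each response once per matching row. A minimal sketch of one way around this, assuming the first Program per Test Code is the one wanted (as with .iloc[0] in the loop), is to build a one-row-per-key lookup and use Series.map. The toy values below are made up, and the cast assumes that code is a zero-padded string while Test Code is an integer, which is only a guess from the printed frames.

```python
import pandas as pd

# Toy stand-ins for the two frames in the question (values are made up).
df = pd.DataFrame({
    "code": ["006032", "006032", "003879", "009999"],
    "answer": [2, 2, 1, 3],
})
df_synthesis_clean = pd.DataFrame({
    "Test Code": [6032, 6032, 3879],
    "Program": ["Flash", "Flash", "Non-flash"],
})

# One Program per Test Code: keep the first row, like .iloc[0] in the loop.
program_by_test = (
    df_synthesis_clean
    .drop_duplicates(subset="Test Code", keep="first")
    .set_index("Test Code")["Program"]
)

# Assumption: 'code' is a zero-padded string and 'Test Code' is an integer,
# so the key is converted before the lookup.
df["Program"] = df["code"].astype(int).map(program_by_test)
print(df)
```

Because map aligns on the lookup's index, the number of rows in df does not change, and codes with no matching Test Code get NaN instead of raising an error.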
When using pd.merge, take care to choose the correct value for the optional parameter how. By default, only the rows corresponding to keys that are present in both data frames will be part of the resulting data frame. But you might want to change that according to your needs.
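To illustrate the effect of how, here is a small sketch on made-up data (the frame named lookup and its values are hypothetical). It compares the default inner merge with how='left', which keeps every response even when its code has no match. The indicator and validate arguments are optional extras of pd.merge: the first adds a _merge column showing where each row came from, the second raises an error if the right-hand keys are not unique.

```python
import pandas as pd

# Made-up data: code 9999 has no matching Test Code in the lookup.
df = pd.DataFrame({"code": [6032, 6032, 9999], "answer": [2, 2, 3]})
lookup = pd.DataFrame({
    "Test Code": [6032, 3879],
    "Program": ["Flash", "Non-flash"],
})

# Default (how='inner'): responses without a matching Test Code are dropped.
inner = pd.merge(df, lookup, left_on="code", right_on="Test Code")

# how='left': every response is kept; unmatched ones get NaN in 'Program'.
left = pd.merge(
    df, lookup,
    left_on="code", right_on="Test Code",
    how="left",
    validate="many_to_one",  # fail loudly if 'Test Code' is duplicated
    indicator=True,          # adds a '_merge' column for inspection
)

print(len(inner), len(left))
print(left["_merge"].value_counts())
```

With three million responses, checking the count of left_only rows in _merge is a quick way to see how many codes failed to match before relying on the new column.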
Please let me know if it works. Unfortunately, I cannot test my code at the moment.