\$\begingroup\$

I have a dataframe of survey responses in which each survey has a code:

df

 code item_stamp question_name question_type scorable_question subquestion_name stage products_stamp product_ID answer_name respondent_id answers_identity answer Test Code
0 006032 '173303553131045' Age group single 1.0 NaN Screener NaN <NA> 31 - 45 '173303561331047' '11357427731020' 2 6032
1 006032 '173303553131045' Age group single 1.0 NaN Screener NaN <NA> 31 - 45 '173303561431047' '11357427731020' 2 6032

I also have a dataframe giving the type of each survey, identified by Test Code:

df_synthesis_clean

 Country Country Code Category Application Gender Program Evaluation Stage Context Packaging Code Test Code Test Completion Agency Deadline Product Type Line Extension Dosage Fragrance House_ID product_ID Liking Mean Liking Scale Olfactive Family Olfactive Subfamily OLFACTIVE CLUSTER EASY FRESH TEXTURED WARM SIGNATURE QUALIFICATION VERT ORANGE ROUGE TOP SELLER TOP TESTER
0 France FR Fine Men Fragrances Perf/Edt/A-Shave/Col (FM) M scent hunter clst - sniff - on glass ball jar Blind NaN 3879 4/15/2016 0:00 NaN Market Product EDT 12.0 817.0 8082451124 5.55 0 to 10 WOODY Floral TEXTURED WARM NaN NaN NaN NaN
1 USA US Fine Men Fragrances Perf/Edt/A-Shave/Col (FM) M scent hunter clst - sniff - on glass ball jar Blind NaN 3855 4/15/2016 0:00 NaN Market Product EDT 12.0 817.0 8082451124 4.88 0 to 10 WOODY Floral TEXTURED WARM NaN NaN NaN NaN

I want to add a column giving the type of Program that produced each response (Flash or non-flash).

I have the test ID in df and the test type in df_synthesis_clean, so I tried the following in a Google Colaboratory notebook without a GPU (because I don't know how to use one):

for _, row in df.iterrows():
    # Look in the df_synthesis_clean table for the row whose Test Code matches row['code'].
    # iloc[0] is needed because a study corresponds to several tested products but the responses do not have
    program = df_synthesis_clean.loc[df_synthesis_clean['Test Code'] == row['code']].iloc[0]['Program']
    row['Program'] = program

This works on a small amount of data, but df now has more than three million rows, so the loop takes a very long time.

asked Nov 14, 2021 at 22:52
\$\endgroup\$
  • \$\begingroup\$ Did my suggestion work for you? \$\endgroup\$ Commented Nov 27, 2021 at 17:34

1 Answer

\$\begingroup\$

Iterating over the rows of a data frame is usually very slow. To combine the values of two data frames, you can use pandas.merge instead, like this:

import pandas as pd

# Keep only the key column and the column to add.
cols = ['Test Code', 'Program']
df_merged = pd.merge(df, df_synthesis_clean[cols], left_on='code', right_on='Test Code')

When using pd.merge, take care to choose the correct value for the optional parameter how. By default (how='inner'), only rows whose keys are present in both data frames appear in the result. You might want to change that according to your needs; for example, how='left' keeps every row of df even when no matching Test Code exists.

Please let me know if it works. Unfortunately, I cannot test my code at the moment.
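
The merge and the how parameter described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not a tested solution for the real data: the toy values are invented, the helper column code_int is hypothetical, and it assumes that 'code' is a zero-padded string while 'Test Code' is an integer. Because a test can cover several products, the lookup table is deduplicated first (keeping the first row per test code, like iloc[0] in the question) so that the merge cannot multiply response rows.

```python
import pandas as pd

# Toy stand-ins for the two frames; all values here are invented for illustration.
df = pd.DataFrame({
    "code": ["006032", "006032", "003879", "009999"],
    "answer": [2, 2, 1, 3],
})
df_synthesis_clean = pd.DataFrame({
    "Test Code": [6032, 6032, 3879],  # one row per tested product, so codes repeat
    "Program": ["Flash", "Flash", "scent hunter"],
    "product_ID": [1, 2, 3],
})

# Keep one row per test code so each response matches at most one lookup row.
lookup = df_synthesis_clean[["Test Code", "Program"]].drop_duplicates(subset="Test Code")

# Assumption: 'code' is a zero-padded string and 'Test Code' is an integer,
# so the key is converted before merging (code_int is a hypothetical helper column).
df = df.assign(code_int=df["code"].astype(int))

merged = df.merge(
    lookup,
    left_on="code_int",
    right_on="Test Code",
    how="left",              # keep every response, even without a matching test
    validate="many_to_one",  # raise an error if the lookup still has duplicate keys
).drop(columns=["code_int", "Test Code"])

print(merged)
```

With how='left', responses whose code has no match keep a missing value in Program instead of being dropped, which makes unmatched codes easy to spot afterwards.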

answered Nov 17, 2021 at 12:36
\$\endgroup\$
