Pandas updating/adding columns to rows incrementally using dictionary key values

Question 1

I want to populate columns of a dataframe (df) by looping over a list (A_list) and, for each item, generating a dictionary whose keys are the names of the desired columns of df (in the example below, the new columns are 'C', 'D', and 'E'):

import pandas

def gen_data(key):
    # This function is just an example; the columns are not necessarily
    # related to each other and do not necessarily use the key.
    data_dict = {'C': key + key, 'D': key, 'E': key + key + key}
    return data_dict

A_list = ['a', 'b', 'c', 'd', 'f']
df = pandas.DataFrame({'A': ['a', 'b', 'c', 'd', 'f'], 'B': [1, 2, 3, 3, 2]})
for A_value in A_list:
    data_dict = gen_data(A_value)
    for data_key in data_dict:
        # Set the new column's value on every row whose A matches A_value
        df.loc[df.A == A_value, data_key] = data_dict[data_key]

So the result of this should be:

nan = float('nan')
df = pandas.DataFrame({'A': ['a', 'b', 'c', 'd', 'e', 'f'],
                       'B': [1, 2, 3, 3, 2, 1],
                       'C': ['aa', 'bb', 'cc', 'dd', nan, 'ff'],
                       'D': ['a', 'b', 'c', 'd', nan, 'f'],
                       'E': ['aaa', 'bbb', 'ccc', 'ddd', nan, 'fff']})

I feel that

for data_key in data_dict:
    df.loc[df.A == A_value, data_key] = data_dict[data_key]

is really inefficient if df has a lot of rows, and I feel there should be a way to remove the inner for loop from this code:

for A_value in A_list:
    data_dict = gen_data(A_value)
    for data_key in data_dict:
        df.loc[df.A == A_value, data_key] = data_dict[data_key]

Comment 1

Since you're looking for a specific improvement in your code, it belongs on Stack Overflow instead.

Comment 2

Welcome to Code Review! Please see "What to do when someone answers". I have rolled back Rev 3 → 2.

Answer 1

The expected output has a row with 'e' in column A that is missing from the input dataframe you provided, so I have added it:

# input
import pandas as pd

A_list = ['a', 'b', 'c', 'd', 'f']
df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e', 'f'], 'B': [1, 2, 3, 3, 2, 1]})

You can start by joining the list you have into a regex alternation pattern:

pat = '({})'.format('|'.join(A_list))
# pat --> '(a|b|c|d|f)'
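If the real keys can contain regex metacharacters such as "." or "+", the joined pattern would not match them literally. A minimal sketch, assuming the keys are meant as literal strings, escapes each key with re.escape before joining (this variant is not part of the original answer):

```python
import re

A_list = ['a', 'b', 'c', 'd', 'f']

# Escape each key so that characters such as '.' or '+' are matched literally.
pat = '({})'.format('|'.join(re.escape(key) for key in A_list))
print(pat)
```

For the plain single-letter keys used here, the escaped pattern is identical to the unescaped one.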

Then series.str.extract() pulls the matching key out of each value in the series, based on the pattern we created:

s = df.A.str.extract(pat, expand=False)  # expand=False returns a Series for further assignment
print(s)

0      a
1      b
2      c
3      d
4    NaN
5      f
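Note that a regex alternation matches substrings, so a value such as 'ab' would also yield 'a'. If only exact matches against A_list are wanted (an assumption about the intent, not something stated in the question), a sketch using isin() and where() gives the same series for this input without a regex:

```python
import pandas as pd

A_list = ['a', 'b', 'c', 'd', 'f']
df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e', 'f'], 'B': [1, 2, 3, 3, 2, 1]})

# Keep the values of A that appear in A_list; every other row becomes NaN.
s = df['A'].where(df['A'].isin(A_list))
print(s)
```

The result can be used in place of the extracted series in the steps that follow.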

Once you have this series, you can decide what you want to do with it. For example, if I take your function:

def gen_data(key):
    # This function is just an example; the columns are not necessarily
    # related to each other and do not necessarily use the key.
    data_dict = {'C': key * 2, 'D': key, 'E': key * 3}
    return data_dict

And do the below:

df.join(pd.DataFrame(s.apply(gen_data).values.tolist()))

We get the desired output:

   A  B    C    D    E
0  a  1   aa    a  aaa
1  b  2   bb    b  bbb
2  c  3   cc    c  ccc
3  d  3   dd    d  ddd
4  e  2  NaN  NaN  NaN
5  f  1   ff    f  fff

However, I personally wouldn't use apply unless it is necessary, so here is another way using df.assign(), where you pass a dictionary built from the extracted series and assign it to the dataframe:

df = df.assign(**{'C': s * 2, 'D': s, 'E': s * 3})

   A  B    C    D    E
0  a  1   aa    a  aaa
1  b  2   bb    b  bbb
2  c  3   cc    c  ccc
3  d  3   dd    d  ddd
4  e  2  NaN  NaN  NaN
5  f  1   ff    f  fff

Comment 3

Hey anky_91, thank you for your reply. I really like the df.assign example you showed; however, my problem is that my "gen_data" is a bit complex and requires file I/O access, so I won't be able to do any vectorization (i.e. {'C':s*2,'D':s,'E':s*3}) as in your example. However, I have iteratively used assign with df.loc[df.A == key] = df.loc[df.A == key].assign(**metric_dict) and it now only takes 1/3 of the time. Is there a more efficient way of using assign?

Comment 4

@kkawabat If vectorization isn't possible, you're doing it right IMO.

Comment 5

I've edited the submission to use assign(), which seems to finish a bit faster. TY
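For the case raised in these comments, where gen_data cannot be vectorized, one possible sketch is to call gen_data once per key, collect the returned dictionaries into a small lookup table, and attach it to df with a single join on column A. The gen_data below is only a stand-in for the real file-reading function, and the names lookup and result are hypothetical; whether this is faster than the loop depends on the real data and was not measured here.

```python
import pandas as pd

def gen_data(key):
    # Stand-in for the real, expensive function (which does file I/O).
    return {'C': key * 2, 'D': key, 'E': key * 3}

A_list = ['a', 'b', 'c', 'd', 'f']
df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e', 'f'], 'B': [1, 2, 3, 3, 2, 1]})

# One gen_data call per key; rows of the lookup table are indexed by key.
lookup = pd.DataFrame.from_dict({key: gen_data(key) for key in A_list}, orient='index')

# Left join on column A: rows whose key has no entry in the lookup get NaN.
result = df.join(lookup, on='A')
print(result)
```

This replaces the repeated boolean-mask assignments with one join, while still calling gen_data in an ordinary Python loop.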

Answer by anky · Accepted Answer · 2019-07-14 06:37:39Z

