I want to populate columns of a dataframe (df) by looping over a list (A_list). Each iteration generates a dictionary whose keys are the names of the columns to add to df (in the example below, the new columns are 'C', 'D', and 'E'):
import pandas
def gen_data(key):
    # THIS FUNCTION IS JUST AN EXAMPLE; THE COLUMNS ARE NOT NECESSARILY RELATED OR USE THE KEY
    data_dict = {'C': key + key, 'D': key, 'E': key + key + key}
    return data_dict
A_list = ['a', 'b', 'c', 'd', 'f']
df = pandas.DataFrame({'A': ['a', 'b', 'c', 'd', 'f'], 'B': [1, 2, 3, 3, 2]})
for A_value in A_list:
    data_dict = gen_data(A_value)
    for data_key in data_dict:
        # index the dictionary with the loop variable (data_key); `key` is not defined here
        df.loc[df.A == A_value, data_key] = data_dict[data_key]
So the result of this should be:
df = pandas.DataFrame({'A': ['a', 'b', 'c', 'd', 'e', 'f'],
                       'B': [1, 2, 3, 3, 2, 1],
                       'C': ['aa', 'bb', 'cc', 'dd', float('nan'), 'ff'],
                       'D': ['a', 'b', 'c', 'd', float('nan'), 'f'],
                       'E': ['aaa', 'bbb', 'ccc', 'ddd', float('nan'), 'fff']})
I feel that
    for data_key in data_dict:
        df.loc[df.A == A_value, data_key] = data_dict[data_key]
is really inefficient if there are a lot of rows in df, and I feel that there should be a way to remove the for loop in this code:
for A_value in A_list:
    data_dict = gen_data(A_value)
    for data_key in data_dict:
        df.loc[df.A == A_value, data_key] = data_dict[data_key]
- Since you're looking for a specific improvement in your code it belongs on Stack Overflow instead. – l0b0, Jul 13, 2019 at 23:44
- Welcome to Code Review! Please see What to do when someone answers. I have rolled back Rev 3 → 2. – Sᴀᴍ Onᴇᴌᴀ ♦, Jul 16, 2019 at 16:49
1 Answer
Since an e is missing from column A of the input dataframe you provided, I have added it:
#input
import pandas as pd
A_list = ['a', 'b', 'c', 'd', 'f']
df = pd.DataFrame({'A': ['a', 'b', 'c', 'd','e','f'], 'B': [1,2,3,3,2,1]})
You can start by joining the list you have:
pat='({})'.format('|'.join(A_list))
#pat --> '(a|b|c|d|f)'
Then, using series.str.extract(), I extract the matching keys from the series based on the pattern we created:
s=df.A.str.extract(pat,expand=False) #expand=False returns a series for further assignment
print(s)
0      a
1      b
2      c
3      d
4    NaN
5      f
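One caveat worth noting: str.extract searches for the pattern anywhere in each string, so with longer values it can pick up a substring (a value such as 'ab' would yield 'a'). If only exact matches are wanted, combining Series.where with isin is one alternative. A minimal sketch, in which df_demo and the value 'ab' are hypothetical and exist only to show the difference:

```python
import pandas as pd

A_list = ['a', 'b', 'c', 'd', 'f']
# Hypothetical demo frame: 'ab' is not in A_list but contains 'a'
df_demo = pd.DataFrame({'A': ['a', 'ab', 'e', 'f']})

pat = '({})'.format('|'.join(A_list))

# Regex extraction: returns the first substring that matches the pattern
by_regex = df_demo.A.str.extract(pat, expand=False)

# Exact matching: keeps a value only if it is itself a member of A_list
by_isin = df_demo.A.where(df_demo.A.isin(A_list))

print(pd.DataFrame({'A': df_demo.A, 'by_regex': by_regex, 'by_isin': by_isin}))
```

For the single-character keys in this question both approaches give the same series, so the choice only matters if the real keys can overlap.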
Once you have this series, you can decide what you want to do with it. For example, if I take your function:
def gen_data(key):
    #THIS FUNCTION IS JUST AN EXAMPLE THE COLUMNS ARE NOT NECESSARILY RELATED OR USE THE KEY
    data_dict = {'C':key*2, 'D':key, 'E':key*3}
    return data_dict
And do the below:
df.join(pd.DataFrame(s.apply(gen_data).values.tolist()))
We get the desired output:
   A  B    C    D    E
0  a  1   aa    a  aaa
1  b  2   bb    b  bbb
2  c  3   cc    c  ccc
3  d  3   dd    d  ddd
4  e  2  NaN  NaN  NaN
5  f  1   ff    f  fff
However, I personally wouldn't use apply unless it is necessary, so here is another way using df.assign(), where you can pass a dictionary built from the extracted series and assign it to the dataframe:
df=df.assign(**{'C':s*2,'D':s,'E':s*3})
   A  B    C    D    E
0  a  1   aa    a  aaa
1  b  2   bb    b  bbb
2  c  3   cc    c  ccc
3  d  3   dd    d  ddd
4  e  2  NaN  NaN  NaN
5  f  1   ff    f  fff
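If the per-key function cannot be vectorized (the comment inside gen_data in the question hints that the real columns need not be derived from the key), another option is to call it once per key, collect the results into a small lookup frame, and join that frame onto df in a single step. The following is only a sketch under that assumption: the names lookup and result are illustrative, and gen_data here is the toy function from the question standing in for the real one.

```python
import pandas as pd

def gen_data(key):
    # Stand-in for an expensive, non-vectorizable function
    return {'C': key * 2, 'D': key, 'E': key * 3}

A_list = ['a', 'b', 'c', 'd', 'f']
df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e', 'f'], 'B': [1, 2, 3, 3, 2, 1]})

# One gen_data call per key; outer keys become the index, inner keys the columns
lookup = pd.DataFrame.from_dict({key: gen_data(key) for key in A_list}, orient='index')

# Left join of column A against the lookup index; keys without an entry ('e') get NaN
result = df.join(lookup, on='A')
print(result)
```

This replaces the repeated boolean-mask assignments with one join, so each row of df is matched once rather than being scanned once per key and per column.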
- Hey anky_91, thank you for your reply. I really like the df.assign example you showed; however, my "gen_data" is a bit complex and requires file I/O access, so I won't be able to do any vectorization (i.e. {'C':s*2,'D':s,'E':s*3}) as in your example. However, I have iteratively used assign with df.loc[df.A == key] = df.loc[df.A == key].assign(**metric_dict) and it now only takes 1/3 of the time. Is there a more efficient way of using assign? – kkawabat, Jul 16, 2019 at 0:54
- @kkawabat if vectorization isn't possible, you're doing it right IMO. – anky, Jul 16, 2019 at 2:27
- I've edited the submission to use assign(), which seems to finish a bit faster. TY – kkawabat, Jul 16, 2019 at 2:31