I am concatenating columns of a Python pandas DataFrame and want to improve the speed of my code.
My data has the following structure:
Apple  Pear  Cherry
1      2     3
4      5     NaN
7      8     9
I only want to concatenate the contents of the Cherry column if there is actually a value in the respective row. If my code works correctly, the result of the example above should be:
Result
1 :: 2 :: 3
4 :: 5
7 :: 8 :: 9
My code so far is this:
a_dataframe['result'] = a_dataframe.apply(
    lambda r: str(r.loc['apple']) + ' :: ' + str(r.loc['pear']) + ' :: ' + str(r.loc['cherry'])
    if pd.notnull(r.loc['cherry']) & (r.loc['cherry'] != "")
    # if the cherry value is empty, do not add cherry to the result
    else str(r.loc['apple']) + ' :: ' + str(r.loc['pear']),
    axis=1)
Any thoughts on how I can improve the speed of my code? Can I run this without an apply statement using only Pandas column operations?
Thanks in advance for the help.
- Welcome to Code Review. Does your code work exactly as you posted it? :) (Grajdeanu Alex, Feb 6, 2017 at 7:55)
- The code works as I posted it. Regarding the single quote: I changed variable names for simplicity when posting, so I probably lost it in the process :-). (Martin Reindl, Feb 6, 2017 at 8:46)
1 Answer
There's no need to create a lambda for this.
Let's suppose we have the following dataframe:
my_df = pd.DataFrame({
'Apple': ['1', '4', '7'],
'Pear': ['2', '5', '8'],
'Cherry': ['3', np.nan, '9']})
Which is:
Apple  Cherry  Pear
1      3       2
4      NaN     5
7      9       8
An easier way to achieve what you want, without the apply() function, is to:
- use iterrows() to go over the rows one by one;
- use Series() and str.cat() to do the merge.
You'll get this:
l = []
for _, row in my_df.iterrows():
l.append(pd.Series(row).str.cat(sep='::'))
empty_df = pd.DataFrame(l, columns=['Result'])
Doing this, NaN will automatically be taken out, which leads us to the desired result:
 Result
1::3::2
   4::5
7::9::8
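That NaN handling comes from str.cat() itself: when a Series is joined with just a separator, missing values are skipped. A minimal standalone illustration (values chosen to mirror the second row of the example):
import pandas as pd
import numpy as np

# joining a Series with only `sep` set simply omits missing values
s = pd.Series(['4', np.nan, '5'])
print(s.str.cat(sep='::'))  # prints: 4::5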
The entire program may look like:
import pandas as pd
import numpy as np
def merge_columns(my_df):
l = []
for _, row in my_df.iterrows():
l.append(pd.Series(row).str.cat(sep='::'))
empty_df = pd.DataFrame(l, columns=['Result'])
return empty_df.to_string(index=False)
if __name__ == '__main__':
my_df = pd.DataFrame({
'Apple': ['1', '4', '7'],
'Pear': ['2', '5', '8'],
'Cherry': ['3', np.nan, '9']})
print(merge_columns(my_df))
Other things I added to the answer:
- an if __name__ == '__main__' guard;
- the logic moved into its own function so that you can reuse it later.
As @MathiasEttinger suggested, you can also modify the above function to use a list comprehension for slightly better performance:
def merge_columns_1(my_df):
l = [pd.Series(row).str.cat(sep='::') for _, row in my_df.iterrows()]
return pd.DataFrame(l, columns=['Result']).to_string(index=False)
I'll leave the order of the columns as an exercise for the OP.
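As for the follow-up question about doing this with column operations rather than an explicit Python loop: one possible sketch (not benchmarked here, and assuming the same string-typed columns as in the example) uses stack(), which drops NaN, followed by a groupby on the row index:
import pandas as pd
import numpy as np

my_df = pd.DataFrame({
    'Apple': ['1', '4', '7'],
    'Pear': ['2', '5', '8'],
    'Cherry': ['3', np.nan, '9']})

# stack() drops NaN values and keeps the original row label as the first index level,
# so grouping by that level and joining gives one concatenated string per row
result = my_df[['Apple', 'Pear', 'Cherry']].stack().groupby(level=0).agg(' :: '.join)
print(result.to_string(index=False))
Selecting the columns explicitly also pins down the output order, which takes care of the column-order exercise mentioned above.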
- I like this a lot (definitely looks cleaner, and this code could easily be scaled for additional columns), but I just timed my code and don't really see a significant difference to the original code. Since we're still looping through every row (before: using apply(), now: using iterrows()) this seems to make sense. Any thoughts? (Martin Reindl, Feb 6, 2017 at 9:56)
- I don't think you can get any better than this in terms of performance. (Grajdeanu Alex, Feb 6, 2017 at 10:37)
- Why don't you use a list comprehension instead of for + append? (301_Moved_Permanently, Feb 6, 2017 at 10:49)
- @MathiasEttinger good call. I added that too. Thanks :) (Grajdeanu Alex, Feb 6, 2017 at 10:56)
- Guess I'll just leave it here then. Thanks for the help!! (Martin Reindl, Feb 6, 2017 at 11:12)