I am concatenating columns of a Python pandas DataFrame and want to improve the speed of my code.
My data has the following structure:
Apple  Pear  Cherry
1      2     3
4      5     NaN
7      8     9
I only want to concatenate the contents of the Cherry column if there is actually a value in the respective row. If my code works correctly, the result of the example above should be:
Result
1 :: 2 :: 3
4 :: 5
7 :: 8 :: 9
My code so far is this:
a_dataframe['result'] = a_dataframe.apply(
    lambda r: str(r.loc['apple']) + ' :: ' + str(r.loc['pear']) + ' :: ' + str(r.loc['cherry'])
    if pd.notnull(r.loc['cherry']) & (r.loc['cherry'] != "")
    # if the cherry value is empty, do not add cherry to the result
    else str(r.loc['apple']) + ' :: ' + str(r.loc['pear']),
    axis=1)
Any thoughts on how I can improve the speed of my code? Can I run this without an apply statement using only Pandas column operations?
Thanks in advance for the help.
- Welcome to Code Review. Does your code work exactly as you posted it? :) (Grajdeanu Alex, Feb 6, 2017 at 7:55)
- The code works as I posted it. Regarding the single quote: I changed variable names for simplicity when posting, so I probably lost it in the process :-). (Martin Reindl, Feb 6, 2017 at 8:46)
1 Answer
There's no need to create a lambda for this.
Let's suppose we have the following dataframe:
my_df = pd.DataFrame({
'Apple': ['1', '4', '7'],
'Pear': ['2', '5', '8'],
'Cherry': ['3', np.nan, '9']})
Which is:
Apple  Cherry  Pear
1      3       2
4      NaN     5
7      9       8
An easier way to achieve what you want, without the apply() function, is to:
- use iterrows() to go over the rows one by one;
- use Series() and str.cat() to do the merge.
You'll get this:
l = []
for _, row in my_df.iterrows():
l.append(pd.Series(row).str.cat(sep='::'))
empty_df = pd.DataFrame(l, columns=['Result'])
Doing this, NaN will automatically be taken out, which leads us to the desired result:
 Result
1::3::2
   4::5
7::9::8
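That NaN handling comes from str.cat() itself: when a Series is joined with just a separator, missing values are skipped. A minimal standalone illustration (values chosen to mirror the second row of the example):
import pandas as pd
import numpy as np

# joining a Series with only `sep` set simply omits missing values
s = pd.Series(['4', np.nan, '5'])
print(s.str.cat(sep='::'))  # prints: 4::5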
The entire program may look like:
import pandas as pd
import numpy as np
def merge_columns(my_df):
l = []
for _, row in my_df.iterrows():
l.append(pd.Series(row).str.cat(sep='::'))
empty_df = pd.DataFrame(l, columns=['Result'])
return empty_df.to_string(index=False)
if __name__ == '__main__':
my_df = pd.DataFrame({
'Apple': ['1', '4', '7'],
'Pear': ['2', '5', '8'],
'Cherry': ['3', np.nan, '9']})
print(merge_columns(my_df))
Other things I added to the answer:
- an if __name__ == '__main__' guard;
- the logic moved into its own function so that you can reuse it later.
As @MathiasEttinger suggested, you can also modify the above function to use a list comprehension for slightly better performance:
def merge_columns_1(my_df):
l = [pd.Series(row).str.cat(sep='::') for _, row in my_df.iterrows()]
return pd.DataFrame(l, columns=['Result']).to_string(index=False)
I'll leave the order of the columns as an exercise for the OP.
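As for the follow-up question about doing this with column operations rather than an explicit Python loop: one possible sketch (not benchmarked here, and assuming the same string-typed columns as in the example) uses stack(), which drops NaN, followed by a groupby on the row index:
import pandas as pd
import numpy as np

my_df = pd.DataFrame({
    'Apple': ['1', '4', '7'],
    'Pear': ['2', '5', '8'],
    'Cherry': ['3', np.nan, '9']})

# stack() drops NaN values and keeps the original row label as the first index level,
# so grouping by that level and joining gives one concatenated string per row
result = my_df[['Apple', 'Pear', 'Cherry']].stack().groupby(level=0).agg(' :: '.join)
print(result.to_string(index=False))
Selecting the columns explicitly also pins down the output order, which takes care of the column-order exercise mentioned above.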
- I like this a lot (definitely looks cleaner, and this code could easily be scaled for additional columns), but I just timed my code and don't really see a significant difference to the original code. Since we're still looping through every row (before: using apply(), now: using iterrows()) this seems to make sense. Any thoughts? (Martin Reindl, Feb 6, 2017 at 9:56)
- I don't think you can get any better than this in terms of performance. (Grajdeanu Alex, Feb 6, 2017 at 10:37)
- Why don't you use a list comprehension instead of for + append? (301_Moved_Permanently, Feb 6, 2017 at 10:49)
- @MathiasEttinger good call. I added that too. Thanks :) (Grajdeanu Alex, Feb 6, 2017 at 10:56)
- Guess I'll just leave it here then. Thanks for the help!! (Martin Reindl, Feb 6, 2017 at 11:12)