Python Pandas operate on row

Question 1

Hi, my dataframe looks like this:

Store,Dept,Date,Sales
1,1,2010年02月05日,245
1,1,2010年02月12日,449
1,1,2010年02月19日,455
1,1,2010年02月26日,154
1,1,2010年03月05日,29
1,1,2010年03月12日,239
1,1,2010年03月19日,264

Simply put, I need to add another column called '_id' that concatenates Store, Dept and Date, like "1_1_2010年02月05日". I assumed I could do it with df['id'] = df['Store'] +'' +df['Dept'] +'_'+df['Date'], but that did not work.

Similarly, I also need to add a new column with the log of Sales. I tried df['logSales'] = math.log(df['Sales']), but again it did not work.

Answer 1 (accepted)

You can first convert the integer columns to strings before concatenating with +:

In [25]: df['id'] = df['Store'].astype(str) +'_' +df['Dept'].astype(str) +'_'+df['Date']
In [26]: df
Out[26]: 
 Store Dept Date Sales id
0 1 1 2010年02月05日 245 1_1_2010年02月05日
1 1 1 2010年02月12日 449 1_1_2010年02月12日
2 1 1 2010年02月19日 455 1_1_2010年02月19日
3 1 1 2010年02月26日 154 1_1_2010年02月26日
4 1 1 2010年03月05日 29 1_1_2010年03月05日
5 1 1 2010年03月12日 239 1_1_2010年03月12日
6 1 1 2010年03月19日 264 1_1_2010年03月19日
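To make the failure mode concrete, here is a minimal, self-contained sketch (it rebuilds only the first three rows of the sample data; the name `plain_concat_failed` is made up for illustration). It assumes Store and Dept are read as integer columns, which is what `read_csv` infers for this data:

```python
import io
import pandas as pd

csv_text = """Store,Dept,Date,Sales
1,1,2010年02月05日,245
1,1,2010年02月12日,449
1,1,2010年02月19日,455"""
df = pd.read_csv(io.StringIO(csv_text))

# Store and Dept are parsed as integers, so adding a string to them fails.
try:
    df['Store'] + '_' + df['Dept']
    plain_concat_failed = False
except TypeError:
    plain_concat_failed = True

# Casting the integer columns to str first makes + act as string concatenation.
df['id'] = df['Store'].astype(str) + '_' + df['Dept'].astype(str) + '_' + df['Date']

print(plain_concat_failed)
print(df['id'].tolist())
```

The Date column is already text, so only the two integer columns need the cast.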

For the log, it is better to use the NumPy function (np.log, with numpy imported as np). It is vectorized, whereas math.log only works on single scalar values:

In [34]: df['logSales'] = np.log(df['Sales'])
In [35]: df
Out[35]: 
 Store Dept Date Sales id logSales
0 1 1 2010年02月05日 245 1_1_2010年02月05日 5.501258
1 1 1 2010年02月12日 449 1_1_2010年02月12日 6.107023
2 1 1 2010年02月19日 455 1_1_2010年02月19日 6.120297
3 1 1 2010年02月26日 154 1_1_2010年02月26日 5.036953
4 1 1 2010年03月05日 29 1_1_2010年03月05日 3.367296
5 1 1 2010年03月12日 239 1_1_2010年03月12日 5.476464
6 1 1 2010年03月19日 264 1_1_2010年03月19日 5.575949
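As a rough illustration of the scalar versus vectorized difference, the sketch below uses the Sales values from the question (the names `sales`, `math_log_failed` and `log_sales` are chosen here for illustration only):

```python
import math
import numpy as np
import pandas as pd

sales = pd.Series([245, 449, 455, 154, 29, 239, 264], name='Sales')

# math.log expects a single number; a multi-row Series cannot be
# converted to one float, so the call raises TypeError.
try:
    math.log(sales)
    math_log_failed = False
except TypeError:
    math_log_failed = True

# np.log is applied element by element and returns a Series
# with the same index as the input.
log_sales = np.log(sales)

print(math_log_failed)
print(log_sales.round(6).tolist())
```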

Summarizing the comments: for a dataframe of this size, using apply will not differ much in performance from using vectorized functions (which work on the full column), but it will once your real dataframe becomes larger.
Apart from that, I think the above solution also has simpler syntax.

Comment 1

I get 164 us using math vs 151 us using numpy's log. I'm assuming that for a large dataframe numpy's version will eat math's log for breakfast?

Comment 2

Indeed, I get 201 us (np) vs 208 us (math), so almost the same for this dataframe. But for a larger one (this one repeated 100 times), numpy is clearly faster than using apply.

Comment 3

For a dataframe with 7000 rows, math.log takes 2.17 ms versus 240 us for np.log, so a significant performance improvement.

Comment 4

Also for the concatenation: for this dataframe, using apply is not slower (it is even a bit faster, 500 vs 700 us), but for larger dataframes (7000 rows) it is again clearly slower (200 vs 80 ms).
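For anyone who wants to reproduce this kind of comparison, the following is only a sketch of how it might be timed with the standard-library timeit module. The 7000-row frame is built by repeating the seven sample rows 1000 times, and the helper names (`log_apply`, `log_vectorized`, `id_apply`, `id_vectorized`) are invented for this example. Absolute numbers depend on the machine and on the pandas and numpy versions, so they will not match the figures quoted in these comments:

```python
import math
import timeit
import numpy as np
import pandas as pd

base = pd.DataFrame({
    'Store': [1] * 7,
    'Dept': [1] * 7,
    'Date': ['2010年02月05日', '2010年02月12日', '2010年02月19日', '2010年02月26日',
             '2010年03月05日', '2010年03月12日', '2010年03月19日'],
    'Sales': [245, 449, 455, 154, 29, 239, 264],
})
# Repeat the 7 sample rows 1000 times to get a 7000-row frame.
big = pd.concat([base] * 1000, ignore_index=True)

def log_apply():
    # Calls math.log once per element.
    return big['Sales'].apply(math.log)

def log_vectorized():
    # One vectorized call over the whole column.
    return np.log(big['Sales'])

def id_apply():
    # Builds the id string one row at a time.
    return big.apply(
        lambda row: str(row['Store']) + '_' + str(row['Dept']) + '_' + row['Date'],
        axis=1,
    )

def id_vectorized():
    # Builds the id strings column-wise.
    return big['Store'].astype(str) + '_' + big['Dept'].astype(str) + '_' + big['Date']

for fn in (log_apply, log_vectorized, id_apply, id_vectorized):
    seconds = timeit.timeit(fn, number=5) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

Both variants of each operation return the same values; only the time per call differs.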

Comment 5

Yes, I would expect this too. Good to know that the vectorised operations scale well. I still have more to learn about pandas and numpy ;)

Answer 2

In [153]:
import pandas as pd
import io
temp = """Store,Dept,Date,Sales
1,1,2010年02月05日,245
1,1,2010年02月12日,449
1,1,2010年02月19日,455
1,1,2010年02月26日,154
1,1,2010年03月05日,29
1,1,2010年03月12日,239
1,1,2010年03月19日,264"""
df = pd.read_csv(io.StringIO(temp))
df
Out[153]:
 Store Dept Date Sales
0 1 1 2010年02月05日 245
1 1 1 2010年02月12日 449
2 1 1 2010年02月19日 455
3 1 1 2010年02月26日 154
4 1 1 2010年03月05日 29
5 1 1 2010年03月12日 239
6 1 1 2010年03月19日 264
[7 rows x 4 columns]
In [154]:
# apply a lambda function row-wise; Store and Dept must be converted to strings in order to build the new string
df['id'] = df.apply(lambda x: str(x['Store']) + '_' + str(x['Dept']) + '_' + x['Date'], axis=1)
df
Out[154]:
 Store Dept Date Sales id
0 1 1 2010年02月05日 245 1_1_2010年02月05日
1 1 1 2010年02月12日 449 1_1_2010年02月12日
2 1 1 2010年02月19日 455 1_1_2010年02月19日
3 1 1 2010年02月26日 154 1_1_2010年02月26日
4 1 1 2010年03月05日 29 1_1_2010年03月05日
5 1 1 2010年03月12日 239 1_1_2010年03月12日
6 1 1 2010年03月19日 264 1_1_2010年03月19日
[7 rows x 5 columns]
In [155]:
import math
# now apply math.log to each Sales value to create the new column
df['logSales'] = df['Sales'].apply(math.log)
df
Out[155]:
 Store Dept Date Sales id logSales
0 1 1 2010年02月05日 245 1_1_2010年02月05日 5.501258
1 1 1 2010年02月12日 449 1_1_2010年02月12日 6.107023
2 1 1 2010年02月19日 455 1_1_2010年02月19日 6.120297
3 1 1 2010年02月26日 154 1_1_2010年02月26日 5.036953
4 1 1 2010年03月05日 29 1_1_2010年03月05日 3.367296
5 1 1 2010年03月12日 239 1_1_2010年03月12日 5.476464
6 1 1 2010年03月19日 264 1_1_2010年03月19日 5.575949
[7 rows x 6 columns]

Accepted answer (the astype(str) and np.log answer) by joris · 2014-05-30 09:05:30Z

