
I have these packages installed:

python: 2.7.3.final.0
python-bits: 64
OS: Linux
machine: x86_64
processor: x86_64
byteorder: little
pandas: 0.13.1

This is the dataframe info:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 421570 entries, 2010-02-05 00:00:00 to 2012-10-26 00:00:00
Data columns (total 5 columns):
Store 421570 non-null int64
Dept 421570 non-null int64
Weekly_Sales 421570 non-null float64
IsHoliday 421570 non-null bool
Date_Str 421570 non-null object
dtypes: bool(1), float64(1), int64(2), object(1)

This is a sample of what the data looks like:

Store,Dept,Date,Weekly_Sales,IsHoliday
1,1,2010-02-05,24924.5,FALSE
1,1,2010-02-12,46039.49,TRUE
1,1,2010-02-19,41595.55,FALSE
1,1,2010-02-26,19403.54,FALSE
1,1,2010-03-05,21827.9,FALSE
1,1,2010-03-12,21043.39,FALSE
1,1,2010-03-19,22136.64,FALSE
1,1,2010-03-26,26229.21,FALSE
1,1,2010-04-02,57258.43,FALSE

I load the file and index it as follows:

import pandas as pd

df_train = pd.read_csv('train.csv')
df_train['Date_Str'] = df_train['Date']               # keep the original date string
df_train['Date'] = pd.to_datetime(df_train['Date'])
df_train = df_train.set_index(['Date'])
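
Note that this index is not unique, since every Store/Dept combination shares the same weekly dates; a quick check with the standard Index attribute confirms it:

print(df_train.index.is_unique)  # False: many rows share each date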

When I run either of the following operations on the 400K-row file,

df_train['_id'] = df_train['Store'].astype(str) + '_' + df_train['Dept'].astype(str) + '_' + df_train['Date_Str'].astype(str)

or

df_train['try'] = df_train['Store'] * df_train['Dept']

it causes an error:

Traceback (most recent call last):
 File "rock.py", line 85, in <module>
 rock.pandasTest()
 File "rock.py", line 31, in pandasTest
 df_train['_id'] = df_train['Store'].astype(str) +'_' + df_train['Dept'].astype('str')
 File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/ops.py", line 480, in wrapper
 return_indexers=True)
 File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/tseries/index.py", line 976, in join
 return_indexers=return_indexers)
 File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/index.py", line 1304, in join
 return_indexers=return_indexers)
 File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/index.py", line 1345, in _join_non_unique
 how=how, sort=True)
 File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/tools/merge.py", line 465, in _get_join_indexers
 return join_func(left_group_key, right_group_key, max_groups)
 File "join.pyx", line 152, in pandas.algos.full_outer_join (pandas/algos.c:34716)
MemoryError

However, it works fine with a small file.

asked May 30, 2014 at 13:59
  • What is the question? Commented May 30, 2014 at 14:01
  • Also how do you load the data? Add the code etc. Commented May 30, 2014 at 14:03

2 Answers


I can also reproduce this on 0.13.1, but the issue does not occur in 0.12 or in 0.14 (released yesterday), so it seems to be a bug in 0.13.
So try upgrading your pandas version: on 0.14 the vectorized way is much faster than the apply (5 s vs. >1 min on my machine) and uses less peak memory (200 MiB vs. 980 MiB, measured with %memit).
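
If upgrading is not an option, one possible workaround (a sketch on my part, not tested against 0.13; it assumes the failure comes from the non-unique-index alignment that the _join_non_unique frame in the traceback points at) is to do the concatenation on the underlying numpy arrays, which bypasses index alignment entirely:

# build the id from plain object arrays, so no Series alignment (and no
# outer join on the non-unique DatetimeIndex) is triggered
df_train['_id'] = (df_train['Store'].astype(str).values + '_'
                   + df_train['Dept'].astype(str).values + '_'
                   + df_train['Date_Str'].astype(str).values)

Since the right-hand side is a plain array, assigning it back to the DataFrame never goes through the join code path.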

Using your sample data repeated 50,000 times (giving a DataFrame of 450k rows), and using the apply_id function from @jsalonen's answer below:

In [23]: pd.__version__ 
Out[23]: '0.14.0'
In [24]: %timeit df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)
1 loops, best of 3: 5.42 s per loop
In [25]: %timeit df_train.apply(apply_id, 1)
1 loops, best of 3: 1min 11s per loop
In [26]: %load_ext memory_profiler
In [27]: %memit df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)
peak memory: 201.75 MiB, increment: 0.01 MiB
In [28]: %memit df_train.apply(apply_id, 1)
peak memory: 982.56 MiB, increment: 780.79 MiB
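
(For reference: the %memit magic is provided by the memory_profiler package, installed with pip install memory_profiler and loaded via %load_ext memory_profiler as shown above.)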
answered May 31, 2014 at 11:19

2 Comments

BTW, I found an alternative with performance comparable to astype(str) in both memory usage and speed: df_train['Store'].map(str) + '_' + df_train['Dept'].map(str) + '_' + df_train['Date_Str'].map(str)
Thanks for mentioning %memit; I didn't know about it before.

Try generating the _id field with a DataFrame.apply call:

def apply_id(x):
    x['_id'] = "{}_{}_{}".format(x['Store'], x['Dept'], x['Date_Str'])
    return x

df_train = df_train.apply(apply_id, 1)

When using apply, the id generation is performed row by row, resulting in minimal memory-allocation overhead.
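
A related low-memory alternative (my sketch, not part of the original answer) is a plain list comprehension over the three columns, which also avoids index alignment and is usually faster than a row-wise apply:

# format each id from the raw column values, one row at a time
df_train['_id'] = ['{}_{}_{}'.format(s, d, ds)
                   for s, d, ds in zip(df_train['Store'],
                                       df_train['Dept'],
                                       df_train['Date_Str'])]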

answered May 30, 2014 at 14:16

10 Comments

Yeah, this way works. But in this thread, stackoverflow.com/questions/23950658/…, it is said that a vectorized function is faster than an apply call, and from my experiments that seems true. Vectorized functions tend to use more memory than an apply call, but what confuses me is that I still have lots of memory left when the MemoryError occurs.
I'm guessing that vectorized functions need to keep the whole vectors in memory while performing the operation, and in your case that's way too much memory. Also, I think you can get a MemoryError even before you actually run out of memory: Python tries to allocate a huge chunk of memory, fails instantly, and so never shows any increase in memory consumption.
Indeed, but the strange thing is that this should not happen at all for a DataFrame of this size (and I also can't reproduce it).
Well, note that you are not only appending three values but also converting them to strings. Vectorized string values -> BOOM.
