
I have these packages installed:

python: 2.7.3.final.0
python-bits: 64
OS: Linux
machine: x86_64
processor: x86_64
byteorder: little
pandas: 0.13.1

This is the dataframe info:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 421570 entries, 2010-02-05 00:00:00 to 2012-10-26 00:00:00
Data columns (total 5 columns):
Store 421570 non-null int64
Dept 421570 non-null int64
Weekly_Sales 421570 non-null float64
IsHoliday 421570 non-null bool
Date_Str 421570 non-null object
dtypes: bool(1), float64(1), int64(2), object(1)

This is a sample of what the data looks like:

Store,Dept,Date,Weekly_Sales,IsHoliday
1,1,2010-02-05,24924.5,FALSE
1,1,2010-02-12,46039.49,TRUE
1,1,2010-02-19,41595.55,FALSE
1,1,2010-02-26,19403.54,FALSE
1,1,2010-03-05,21827.9,FALSE
1,1,2010-03-12,21043.39,FALSE
1,1,2010-03-19,22136.64,FALSE
1,1,2010-03-26,26229.21,FALSE
1,1,2010-04-02,57258.43,FALSE

I load the file and index it as follows:

import pandas as pd

df_train = pd.read_csv('train.csv')
df_train['Date_Str'] = df_train['Date']               # keep the original date string
df_train['Date'] = pd.to_datetime(df_train['Date'])
df_train = df_train.set_index(['Date'])
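
Note that this index is not unique, since every Store/Dept combination shares the same weekly dates; a quick check with the standard Index attribute confirms it:

print(df_train.index.is_unique)  # False: many rows share each date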

When I run either of the following operations on the 400K-row file,

df_train['_id'] = df_train['Store'].astype(str) + '_' + df_train['Dept'].astype(str) + '_' + df_train['Date_Str'].astype(str)

or

df_train['try'] = df_train['Store'] * df_train['Dept']

it causes an error:

Traceback (most recent call last):
 File "rock.py", line 85, in <module>
 rock.pandasTest()
 File "rock.py", line 31, in pandasTest
 df_train['_id'] = df_train['Store'].astype(str) +'_' + df_train['Dept'].astype('str')
 File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/ops.py", line 480, in wrapper
 return_indexers=True)
 File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/tseries/index.py", line 976, in join
 return_indexers=return_indexers)
 File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/index.py", line 1304, in join
 return_indexers=return_indexers)
 File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/index.py", line 1345, in _join_non_unique
 how=how, sort=True)
 File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/tools/merge.py", line 465, in _get_join_indexers
 return join_func(left_group_key, right_group_key, max_groups)
 File "join.pyx", line 152, in pandas.algos.full_outer_join (pandas/algos.c:34716)
MemoryError

However, it works fine with a small file.

asked May 30, 2014 at 13:59
  • What is the question? Commented May 30, 2014 at 14:01
  • Also how do you load the data? Add the code etc. Commented May 30, 2014 at 14:03

2 Answers


I can also reproduce this on 0.13.1, but the issue does not occur in 0.12 or in 0.14 (released yesterday), so it seems to be a bug in 0.13.
So try upgrading your pandas version: on 0.14 the vectorized way is much faster than the apply (5 s vs. >1 min on my machine) and uses less peak memory (200 MiB vs. 980 MiB, measured with %memit).
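
If upgrading is not an option, one possible workaround (a sketch on my part, not tested against 0.13; it assumes the failure comes from the non-unique-index alignment that the _join_non_unique frame in the traceback points at) is to do the concatenation on the underlying numpy arrays, which bypasses index alignment entirely:

# build the id from plain object arrays, so no Series alignment (and no
# outer join on the non-unique DatetimeIndex) is triggered
df_train['_id'] = (df_train['Store'].astype(str).values + '_'
                   + df_train['Dept'].astype(str).values + '_'
                   + df_train['Date_Str'].astype(str).values)

Since the right-hand side is a plain array, assigning it back to the DataFrame never goes through the join code path.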

Using your sample data repeated 50,000 times (giving a DataFrame of 450k rows), and using the apply_id function from @jsalonen's answer below:

In [23]: pd.__version__ 
Out[23]: '0.14.0'
In [24]: %timeit df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)
1 loops, best of 3: 5.42 s per loop
In [25]: %timeit df_train.apply(apply_id, 1)
1 loops, best of 3: 1min 11s per loop
In [26]: %load_ext memory_profiler
In [27]: %memit df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)
peak memory: 201.75 MiB, increment: 0.01 MiB
In [28]: %memit df_train.apply(apply_id, 1)
peak memory: 982.56 MiB, increment: 780.79 MiB
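
(For reference: the %memit magic is provided by the memory_profiler package, installed with pip install memory_profiler and loaded via %load_ext memory_profiler as shown above.)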
answered May 31, 2014 at 11:19

2 Comments

BTW, I found an alternative with performance comparable to astype(str) in both memory usage and speed: df_train['Store'].map(str) + '_' + df_train['Dept'].map(str) + '_' + df_train['Date_Str'].map(str)
Thanks for mentioning %memit; I didn't know about it before.

Try generating the _id field with a DataFrame.apply call:

def apply_id(x):
    x['_id'] = "{}_{}_{}".format(x['Store'], x['Dept'], x['Date_Str'])
    return x

df_train = df_train.apply(apply_id, 1)

When using apply, the id generation is performed row by row, resulting in minimal memory-allocation overhead.
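
A related low-memory alternative (my sketch, not part of the original answer) is a plain list comprehension over the three columns, which also avoids index alignment and is usually faster than a row-wise apply:

# format each id from the raw column values, one row at a time
df_train['_id'] = ['{}_{}_{}'.format(s, d, ds)
                   for s, d, ds in zip(df_train['Store'],
                                       df_train['Dept'],
                                       df_train['Date_Str'])]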

answered May 30, 2014 at 14:16

10 Comments

Yeah, this way works. But in this thread, stackoverflow.com/questions/23950658/…, it is said that a vectorized function is faster than an apply call, and from my experiments that seems true. Vectorized functions tend to use more memory than an apply call, but what confuses me is that I still have lots of memory left when the MemoryError occurs.
I'm guessing that vectorized functions need to keep the whole vectors in memory while performing the operation, and in your case that's way too much memory. Also, I think you can get a MemoryError even before you actually run out of memory: Python tries to allocate a huge chunk of memory, fails instantly, and so never shows any increase in memory consumption.
Indeed, but the strange thing is that this should not happen at all for a DataFrame of this size (and I also can't reproduce it).
Well, note that you are not only appending three values but also converting them to strings. Vectorized string values -> BOOM.
