
I'm fairly new to both Python and Pandas, and I'm trying to figure out the fastest way to execute a mammoth left outer join between a left dataset with roughly 11 million rows and a right dataset with ~160K rows and four columns. It should be a many-to-one situation, but I'd like the join not to raise an error if there's a duplicate row on the right side. I'm using Canopy Express on a Windows 7 64-bit system with 8 GB of RAM, and I'm pretty much stuck with that.

Here's a model of the code I've put together so far:

import pandas as pd

# neither file has a header row, hence the explicit column names
leftcols = ['a','b','c','d','e','key']
leftdata = pd.read_csv("LEFT.csv", names=leftcols)
rightcols = ['x','y','z','key']
rightdata = pd.read_csv("RIGHT.csv", names=rightcols)

# left outer join on the shared key, then write the result to disk
mergedata = pd.merge(leftdata, rightdata, on='key', how='left')
mergedata.to_csv("FINAL.csv")

This works with small files, but on my system it produces a MemoryError with files two orders of magnitude smaller than the ones I actually need to merge.

I've been browsing through related questions (one, two, three), but none of the answers really get at this basic problem, or if they do, it's not explained well enough for me to recognize the potential solution. The accepted answers are no help either: I'm already on a 64-bit system and using the most current stable version of Canopy (1.5.5 64-bit, with Python 2.7.10).

What is the fastest and/or most pythonic approach to avoiding this MemoryError issue?

asked Sep 17, 2015 at 16:13
  • The most pythonic way is to get more memory. No, seriously. Pandas always keeps your data in memory. If your files are too big, you simply cannot do it with pandas. Commented Sep 17, 2015 at 16:29
  • If a system with 8 GB RAM isn't enough to merge two files each 1/100th the size of the files I need, how much RAM would be enough? 1TB? How many supercomputers would I need to assemble? There must be a different way to proceed that uses less memory. If not pandas, how? Commented Sep 17, 2015 at 16:35
  • AFAIK, all join/merge operations can be done by sort-merges. So it's not necessary to keep things in memory. Commented Sep 17, 2015 at 16:38
  • You probably have repeated values in the merge 'keys', which results in a Cartesian product of such rows in the output Commented Sep 17, 2015 at 17:27
  • When you say "the accepted answers are no help," do you mean to include the pd.concat answer as well? Try setting index_col='key' in your call to read_csv and then concatenating (or joining, if that complains about duplicate indices)? Commented Sep 22, 2015 at 16:17

3 Answers


Why not just read your right file into pandas (or even into a simple dictionary), then loop through your left file using the csv module to read, extend, and write each row? Is processing time a significant constraint (vs your development time)?
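
To illustrate that suggestion (a sketch, not the answerer's code): keep only the small right-hand file in memory as a dictionary keyed on the join column, then stream the big left file row by row with the csv module, appending the looked-up values to each row as it is written. Column positions and file names mirror the question's model, and the dictionary assumes the right file is effectively one row per key.

import csv

# build a lookup from the small right file: key (column 3) -> [x, y, z]
with open("RIGHT.csv", "rU") as f:
    lookup = dict((row[3], row[0:3]) for row in csv.reader(f))

# stream the big left file and write each extended row immediately, so only
# one left-hand row is in memory at a time ('wb' for the csv module on Python 2)
with open("LEFT.csv", "rU") as fin, open("FINAL.csv", "wb") as fout:
    writer = csv.writer(fout)
    for row in csv.reader(fin):
        writer.writerow(row + lookup.get(row[5], ["NaN", "NaN", "NaN"]))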

answered Sep 17, 2015 at 17:30

2 Comments

Processing time is not incredibly significant; I'll try it out. If I come up with a workable solution I'll accept the answer and post the code below. Thank you!
Thank you Jonathan, this approach worked - see below. I think I'll mark the comment below as the answer since it'll provide browsing users who have similar issues a model of the code to work with. But I'd give you a second upvote if I could. Thanks!

This approach ended up working. Here's a model of my code:

import csv

idata = open("KEY_ABC.csv", "rU")
odata = open("KEY_XYZ.csv", "rU")
leftdata = csv.reader(idata)
rightdata = csv.reader(odata)

def gen_chunks(reader, chunksize=1000000):
    # yield successive lists of up to `chunksize` rows from a csv reader
    chunk = []
    for i, line in enumerate(reader):
        if (i % chunksize == 0 and i > 0):
            yield chunk
            del chunk[:]
        chunk.append(line)
    yield chunk

count = 0

# one lookup dictionary per non-key column of the right file (key is column 3)
d1 = dict([(rows[3], rows[0]) for rows in rightdata])
odata.seek(0)
d2 = dict([(rows[3], rows[1]) for rows in rightdata])
odata.seek(0)
d3 = dict([(rows[3], rows[2]) for rows in rightdata])

for chunk in gen_chunks(leftdata):
    # append the three right-hand columns by looking up the key (column 6)
    res = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6],
            d1.get(k[6], "NaN")] for k in chunk]
    res1 = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6], k[7],
             d2.get(k[6], "NaN")] for k in res]
    res2 = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6], k[7], k[8],
             d3.get(k[6], "NaN")] for k in res1]
    # write each completed chunk to its own numbered output file
    namestart = "FINAL_"
    nameend = ".csv"
    count = count + 1
    filename = namestart + str(count) + nameend
    with open(filename, "wb") as csvfile:
        output = csv.writer(csvfile)
        output.writerows(res2)

By splitting the left dataset into chunks, turning the right dataset into one dictionary per non-key column, and adding those columns to the left dataset (filled from the dictionaries via the key match), the script did the whole left join in about four minutes with no memory issues.

Thanks also to user miku, who provided the chunk generator code in a comment on this post.

That said: I highly doubt this is the most efficient way of doing this. If anyone has suggestions to improve this approach, fire away.
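
One possible refinement, sketched here and untested on data of this size: pandas can itself stream the large file via read_csv's chunksize parameter, so each chunk can be merged against the small in-memory right table and appended to a single output file instead of writing one file per chunk. File and column names again mirror the question's model.

import pandas as pd

leftcols = ['a', 'b', 'c', 'd', 'e', 'key']
rightcols = ['x', 'y', 'z', 'key']

# the right table (~160K rows, four columns) fits in memory comfortably
rightdata = pd.read_csv("RIGHT.csv", names=rightcols)

# stream the left file in 1M-row chunks, merge each chunk, and append it
# to one output file, writing the header only for the first chunk
first = True
for chunk in pd.read_csv("LEFT.csv", names=leftcols, chunksize=1000000):
    merged = chunk.merge(rightdata, on='key', how='left')
    merged.to_csv("FINAL.csv", mode='w' if first else 'a',
                  header=first, index=False)
    first = False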

answered Sep 23, 2015 at 12:50

Comments


As suggested in another question ("Large data" workflows using pandas), dask (http://dask.pydata.org) could be an easy option.

A simple example:

import dask.dataframe as dd
df1 = dd.read_csv('df1.csv')
df2 = dd.read_csv('df2.csv')
df_merge = dd.merge(df1, df2, how='left')  # joins on the shared column(s)
# dask is lazy: nothing runs until the result is materialized, for example
# by writing it out per partition (the '*' filename pattern is illustrative)
df_merge.to_csv('merged-*.csv')
answered Jun 4, 2018 at 19:16

Comments
