
I'm fairly new to both Python and Pandas, and I'm trying to figure out the fastest way to execute a mammoth left outer join between a left dataset with roughly 11 million rows and a right dataset with ~160K rows and four columns. It should be a many-to-one situation, but I'd like the join not to raise an error if there's a duplicate row on the right side. I'm using Canopy Express on a Windows 7 64-bit system with 8 GB of RAM, and I'm pretty much stuck with that.

Here's a model of the code I've put together so far:

import pandas as pd

# neither file has a header row, hence the explicit column names
leftcols = ['a','b','c','d','e','key']
leftdata = pd.read_csv("LEFT.csv", names=leftcols)
rightcols = ['x','y','z','key']
rightdata = pd.read_csv("RIGHT.csv", names=rightcols)

# left outer join on the shared key, then write the result to disk
mergedata = pd.merge(leftdata, rightdata, on='key', how='left')
mergedata.to_csv("FINAL.csv")

This works with small files, but on my system it produces a MemoryError with files two orders of magnitude smaller than the ones I actually need to merge.

I've been browsing through related questions (one, two, three), but none of the answers really get at this basic problem, or if they do, it's not explained well enough for me to recognize the potential solution. The accepted answers are no help either: I'm already on a 64-bit system and using the most current stable version of Canopy (1.5.5 64-bit, with Python 2.7.10).

What is the fastest and/or most pythonic approach to avoiding this MemoryError issue?

asked Sep 17, 2015 at 16:13
  • The most pythonic way is to get more memory. No, seriously. Pandas always keeps your data in memory. If your files are too big, you simply cannot do it with pandas. Commented Sep 17, 2015 at 16:29
  • If a system with 8 GB RAM isn't enough to merge two files each 1/100th the size of the files I need, how much RAM would be enough? 1TB? How many supercomputers would I need to assemble? There must be a different way to proceed that uses less memory. If not pandas, how? Commented Sep 17, 2015 at 16:35
  • AFAIK, all join/merge operations can be done by sort-merges. So it's not necessary to keep things in memory. Commented Sep 17, 2015 at 16:38
  • You probably have repeated values in the merge 'keys', which results in a Cartesian product of such rows in the output Commented Sep 17, 2015 at 17:27
  • When you say "the accepted answers are no help," do you mean to include the pd.concat answer as well? Try setting index_col='key' in your call to read_csv and then concatenating (or joining, if that complains about duplicate indices)? Commented Sep 22, 2015 at 16:17

3 Answers


Why not just read your right file into pandas (or even into a simple dictionary), then loop through your left file using the csv module to read, extend, and write each row? Is processing time a significant constraint (vs your development time)?
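
To illustrate that suggestion (a sketch, not the answerer's code): keep only the small right-hand file in memory as a dictionary keyed on the join column, then stream the big left file row by row with the csv module, appending the looked-up values to each row as it is written. Column positions and file names mirror the question's model, and the dictionary assumes the right file is effectively one row per key.

import csv

# build a lookup from the small right file: key (column 3) -> [x, y, z]
with open("RIGHT.csv", "rU") as f:
    lookup = dict((row[3], row[0:3]) for row in csv.reader(f))

# stream the big left file and write each extended row immediately, so only
# one left-hand row is in memory at a time ('wb' for the csv module on Python 2)
with open("LEFT.csv", "rU") as fin, open("FINAL.csv", "wb") as fout:
    writer = csv.writer(fout)
    for row in csv.reader(fin):
        writer.writerow(row + lookup.get(row[5], ["NaN", "NaN", "NaN"]))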

answered Sep 17, 2015 at 17:30

2 Comments

Processing time is not incredibly significant; I'll try it out. If I come up with a workable solution I'll accept the answer and post the code below. Thank you!
Thank you Jonathan, this approach worked - see below. I think I'll mark the comment below as the answer since it'll provide browsing users who have similar issues a model of the code to work with. But I'd give you a second upvote if I could. Thanks!

This approach ended up working. Here's a model of my code:

import csv

idata = open("KEY_ABC.csv", "rU")
odata = open("KEY_XYZ.csv", "rU")
leftdata = csv.reader(idata)
rightdata = csv.reader(odata)

def gen_chunks(reader, chunksize=1000000):
    # yield successive lists of up to `chunksize` rows from a csv reader
    chunk = []
    for i, line in enumerate(reader):
        if (i % chunksize == 0 and i > 0):
            yield chunk
            del chunk[:]
        chunk.append(line)
    yield chunk

count = 0

# one lookup dictionary per non-key column of the right file (key is column 3)
d1 = dict([(rows[3], rows[0]) for rows in rightdata])
odata.seek(0)
d2 = dict([(rows[3], rows[1]) for rows in rightdata])
odata.seek(0)
d3 = dict([(rows[3], rows[2]) for rows in rightdata])

for chunk in gen_chunks(leftdata):
    # append the three right-hand columns by looking up the key (column 6)
    res = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6],
            d1.get(k[6], "NaN")] for k in chunk]
    res1 = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6], k[7],
             d2.get(k[6], "NaN")] for k in res]
    res2 = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6], k[7], k[8],
             d3.get(k[6], "NaN")] for k in res1]
    # write each completed chunk to its own numbered output file
    namestart = "FINAL_"
    nameend = ".csv"
    count = count + 1
    filename = namestart + str(count) + nameend
    with open(filename, "wb") as csvfile:
        output = csv.writer(csvfile)
        output.writerows(res2)

By splitting the left dataset into chunks, turning the right dataset into one dictionary per non-key column, and adding those columns to the left dataset (filled from the dictionaries via the key match), the script did the whole left join in about four minutes with no memory issues.

Thanks also to user miku, who provided the chunk generator code in a comment on this post.

That said: I highly doubt this is the most efficient way of doing this. If anyone has suggestions to improve this approach, fire away.
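
One possible refinement, sketched here and untested on data of this size: pandas can itself stream the large file via read_csv's chunksize parameter, so each chunk can be merged against the small in-memory right table and appended to a single output file instead of writing one file per chunk. File and column names again mirror the question's model.

import pandas as pd

leftcols = ['a', 'b', 'c', 'd', 'e', 'key']
rightcols = ['x', 'y', 'z', 'key']

# the right table (~160K rows, four columns) fits in memory comfortably
rightdata = pd.read_csv("RIGHT.csv", names=rightcols)

# stream the left file in 1M-row chunks, merge each chunk, and append it
# to one output file, writing the header only for the first chunk
first = True
for chunk in pd.read_csv("LEFT.csv", names=leftcols, chunksize=1000000):
    merged = chunk.merge(rightdata, on='key', how='left')
    merged.to_csv("FINAL.csv", mode='w' if first else 'a',
                  header=first, index=False)
    first = False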

answered Sep 23, 2015 at 12:50

Comments


As suggested in another question ("Large data" workflows using pandas), dask (http://dask.pydata.org) could be an easy option.

A simple example:

import dask.dataframe as dd
df1 = dd.read_csv('df1.csv')
df2 = dd.read_csv('df2.csv')
df_merge = dd.merge(df1, df2, how='left')  # joins on the shared column(s)
# dask is lazy: nothing runs until the result is materialized, for example
# by writing it out per partition (the '*' filename pattern is illustrative)
df_merge.to_csv('merged-*.csv')
answered Jun 4, 2018 at 19:16

Comments
