I am trying to process a 3 GB XML file, and I am getting a MemoryError in the middle of a loop that reads the file and stores some data in a dictionary.

from xml.etree import cElementTree  # Python 2; on Python 3 use xml.etree.ElementTree

class Node(object):
    def __init__(self, osmid, latitude, longitude):
        self.osmid = int(osmid)
        self.latitude = float(latitude)
        self.longitude = float(longitude)
        self.count = 0

nodes = {}  # maps osm_id -> Node
context = cElementTree.iterparse(raw_osm_file, events=("start", "end"))
context = iter(context)
event, root = next(context)  # the first event yields the root element
for event, elem in context:
    if event == "end" and elem.tag == "node":
        lat = float(elem.get('lat'))
        lon = float(elem.get('lon'))
        osm_id = int(elem.get('id'))
        nodes[osm_id] = Node(osm_id, lat, lon)
        root.clear()  # drop already-processed children from the tree

I'm using an iterative parsing method so the issue isn't with reading the file. I just want to store the data in a dictionary for later processing, but it seems the dictionary is getting too large. Later in the program I read in links and need to check if the nodes referenced by the links were in the initial batch of nodes, which is why I am storing them in a dictionary.

How can I either greatly reduce the memory footprint (the script isn't even getting close to finishing, so shaving bits and pieces off won't help much) or greatly increase the amount of memory available to Python? Monitoring the memory usage, it looks like Python gives out at about 1950 MB, while my computer still has about 6 GB of RAM available.

asked Apr 7, 2016 at 0:36
4 Comments

  • Are you running 64-bit Python? Commented Apr 7, 2016 at 0:44
  • Oh shoot, I thought I was but just checked and I am actually using 32. It is my understanding that there is a hard cap on memory usage with 32, but not with 64, right? Commented Apr 7, 2016 at 0:59
  • There is also a hard cap on 64-bit, but you're very unlikely to hit it, since it's in the multiples of terabytes. Commented Apr 7, 2016 at 1:08
  • @user2913671: There is a hard cap on 64 bit too. But it's at least 256x larger (so 512 GB instead of 2 GB of address space), and I think at this point it's usually 65536 times larger (so 128 TB of address space, which I'm pretty sure should be enough). :-) Commented Apr 7, 2016 at 1:10

1 Answer

Assuming you have tons of Nodes being created, you might consider using __slots__ to predefine a fixed set of attributes for each Node. This removes the overhead of storing a per-instance __dict__ (in exchange for preventing the creation of undeclared attributes) and can easily cut memory usage per Node by a factor of ~5 (less on Python 3.3+, where shared-key __dict__s reduce the per-instance memory cost for free).

It's easy to do; just change the declaration of Node to:

class Node(object):
    __slots__ = 'osmid', 'latitude', 'longitude', 'count'

    def __init__(self, osmid, latitude, longitude):
        self.osmid = int(osmid)
        self.latitude = float(latitude)
        self.longitude = float(longitude)
        self.count = 0
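The trade-off mentioned earlier (undeclared attributes are rejected) can be seen in a minimal sketch. It assumes Python 3, and the attribute name `elevation` is made up purely for illustration:

```python
class Node(object):
    __slots__ = 'osmid', 'latitude', 'longitude', 'count'

    def __init__(self, osmid, latitude, longitude):
        self.osmid = int(osmid)
        self.latitude = float(latitude)
        self.longitude = float(longitude)
        self.count = 0


n = Node("1", "2.0", "3.0")
n.count += 1  # 'count' is declared in __slots__, so this works

try:
    n.elevation = 12.5  # hypothetical attribute, not listed in __slots__
    rejected = False
except AttributeError:
    rejected = True

# Slotted instances carry no per-instance __dict__ at all.
has_dict = hasattr(n, "__dict__")
print(rejected, has_dict)
```

If you later need another attribute, it has to be added to `__slots__` in the class definition.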

For example, on Python 3.5 (where shared key dictionaries already save you something), the difference in object overhead can be seen with:

>>> import sys
>>> # ... define Node without __slots__
>>> n = Node(1,2,3)
>>> sys.getsizeof(n) + sys.getsizeof(n.__dict__)
248
>>> # ... define Node with __slots__
>>> n = Node(1,2,3)
>>> sys.getsizeof(n)  # It has no __dict__ now
72

And remember, this is Python 3.5 with shared key dictionaries; in Python 2, the per-instance cost with __slots__ would be similar (one pointer-sized variable larger, IIRC), while the cost without __slots__ would go up by a few hundred bytes.
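To get a feel for the effect across many objects rather than one, here is a rough sketch that assumes Python 3.4+ (for `tracemalloc`). The names `PlainNode`, `SlottedNode`, and `bytes_for` are made up for this illustration, and the exact numbers will vary by Python version and platform:

```python
import tracemalloc


class PlainNode(object):
    def __init__(self, osmid, latitude, longitude):
        self.osmid = int(osmid)
        self.latitude = float(latitude)
        self.longitude = float(longitude)
        self.count = 0


class SlottedNode(object):
    __slots__ = 'osmid', 'latitude', 'longitude', 'count'

    def __init__(self, osmid, latitude, longitude):
        self.osmid = int(osmid)
        self.latitude = float(latitude)
        self.longitude = float(longitude)
        self.count = 0


def bytes_for(cls, n=50000):
    """Return the bytes allocated while building a dict of n instances."""
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    nodes = {i: cls(i, 0.5 * i, 0.25 * i) for i in range(n)}
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


plain = bytes_for(PlainNode)
slotted = bytes_for(SlottedNode)
print("without __slots__: %d bytes" % plain)
print("with __slots__:    %d bytes" % slotted)
```

Both runs pay the same cost for the dictionary itself and for the int and float values, so the gap between the two totals comes from the per-instance overhead.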

Also, assuming you're on a 64-bit OS, make sure you've installed the 64-bit version of Python to match; otherwise, Python will be limited to ~2 GB of virtual address space, and your 6 GB of RAM counts for very little.
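If you're not sure which build you're running, a quick check along these lines should tell you (a small sketch using only the standard library; the variable names are arbitrary):

```python
import struct
import sys

# Size of a C pointer in this interpreter, in bits: 32 or 64.
pointer_bits = struct.calcsize("P") * 8

# sys.maxsize exceeds 2**32 only on a 64-bit build.
is_64bit = sys.maxsize > 2**32

print("%d-bit Python" % pointer_bits)
```

A 32-bit result on a 64-bit OS means the interpreter, not the machine, is what limits the address space.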

answered Apr 7, 2016 at 0:46

1 Comment

This worked great! Cut the memory usage a ton & switching to 64 bit version gave me the extra GB that I still needed. Thanks!
