With the help of some people here and there, I wrote the module below, which works very well as long as the file targets.csv is not too big. As soon as this file exceeds 100 MB, a MemoryError is raised.
The problem is that I used numpy to make the computations faster (and they became really fast compared to the previous versions of this module), and I think numpy requires loading the entire file into memory. However, there should be a way to handle the MemoryError that happens here:
targets = np.array([(float(X), float(Y), float(Z))
                    for X, Y, Z in csv.reader(open('targets.csv'))])
Any ideas?
Let me know if you need more details (files, etc.)
import csv
import numpy as np
import scipy.spatial
import cv2
import random
"""loading files"""
points = np.array([(int(R), int(G), int(B), float(X), float(Y), float(Z))
                   for R, G, B, X, Y, Z in csv.reader(open('colorlist.csv'))])
# load X,Y,Z coordinates of 'points' in a np.array
print "colorlist loaded"
targets = np.array([(float(X), float(Y), float(Z))
                    for X, Y, Z in csv.reader(open('targets.csv'))])
# load the XYZ target values in a np.array
print "targets loaded"
img = cv2.imread("MAP.tif", -1)
height, width = img.shape
total = height * width
# load dimensions of tif image
print "MAP loaded"
ppm = file("walladrien.ppm", 'w')
ppm.write("P3" + "\n" + str(height) + " " + str(width) +"\n" + "255" + "\n")
# write PPM file header
"""doing geometry"""
tri = scipy.spatial.Delaunay(points[:,[3,4,5]], furthest_site=False) # True makes an almost BW picture
# Delaunay triangulation
indices = tri.simplices
# indices of vertices
vertices = points[indices]
# the vertices for each tetrahedron
tet = tri.find_simplex(targets)
U = tri.transform[tet,:3]
V = targets - tri.transform[tet,3]
b = np.einsum('ijk,ik->ij', U, V)
bcoords = np.c_[b, 1 - b.sum(axis=1)]
# find the barycentric coordinates of each point
"""looping through data"""
i = 0
for i in range(total):
    if tet[i] == -1:
        # this means that the point lies outside the convex hull
        R = G = B = 255
        ppm.write(str(R) + ' ' + str(G) + ' ' + str(B) + "\n") # writes a pixel
    else:
        R0 = int(vertices[tet[i]][0][0])
        G0 = int(vertices[tet[i]][0][1])
        B0 = int(vertices[tet[i]][0][2])
        R1 = int(vertices[tet[i]][1][0])
        G1 = int(vertices[tet[i]][1][1])
        B1 = int(vertices[tet[i]][1][2])
        R2 = int(vertices[tet[i]][2][0])
        G2 = int(vertices[tet[i]][2][1])
        B2 = int(vertices[tet[i]][2][2])
        R3 = int(vertices[tet[i]][3][0])
        G3 = int(vertices[tet[i]][3][1])
        B3 = int(vertices[tet[i]][3][2])
        rand = random.uniform(0,1)
        BC_0 = bcoords[i][0]
        BC_1 = bcoords[i][1]
        BC_2 = bcoords[i][2]
        BC_3 = bcoords[i][3]
        i += 1
        if rand <= BC_0:
            ppm.write(str(R0) + ' ' + str(G0) + ' ' + str(B0) + "\n") # writes a pixel
        elif rand <= BC_0 + BC_1:
            ppm.write(str(R1) + ' ' + str(G1) + ' ' + str(B1) + "\n") # writes a pixel
        elif rand <= BC_0 + BC_1 + BC_2:
            ppm.write(str(R2) + ' ' + str(G2) + ' ' + str(B2) + "\n") # writes a pixel
        else:
            ppm.write(str(R3) + ' ' + str(G3) + ' ' + str(B3) + "\n") # writes a pixel
ppm.close()
print 'done'
- What does "100 Mo" mean? – Gareth Rees, Feb 10, 2014 at 12:59
- 100 MB, sorry, it's a French habit. – adrienlucca.net, Feb 10, 2014 at 21:12
- Aha, mégaoctet! – Gareth Rees, Feb 10, 2014 at 21:39
- Yes, exactly :) haha – adrienlucca.net, Feb 10, 2014 at 21:50
2 Answers
1. The MemoryError
I can't reproduce the MemoryError. Here's the test I ran:
def make_test(n=10**7, output_file='targets.csv'):
    with open(output_file, 'w') as f:
        for _ in range(n):
            f.write('56.08401,55.19490,25.49292\n')
>>> make_test()
>>> import os
>>> os.stat('targets.csv').st_size
270000000
That's 270 megabytes of data, and it loads fine for me:
>>> import csv, numpy as np
>>> targets = np.array([(float(X), float(Y), float(Z)) for X, Y, Z in csv.reader(open('targets.csv'))])
>>> targets.shape
(10000000, 3)
So I don't really know what is going wrong for you. Can you provide the exact text of the error message and its traceback? There might be more information there.
Regardless of the cause of the MemoryError, Janne is right to suggest that using numpy.loadtxt to get the data directly from the file (with no intermediate Python list) would be better:
targets = np.loadtxt('targets.csv', delimiter=',')
However, numpy.loadtxt is still quite slow, and if this is a bottleneck for you then you should investigate pandas:
>>> from timeit import timeit
>>> timeit(lambda:np.loadtxt('targets.csv', delimiter=','), number=1)
18.912313379812986
>>> import pandas
>>> timeit(lambda:pandas.read_csv('targets.csv', delimiter=',').values, number=1)
0.5977518488653004
2. Other comments on your code
1. The PPM format specification says that width is given before height, but you have:

   ppm.write("P3" + "\n" + str(height) + " " + str(width) +"\n" + "255" + "\n")

   Is that right? Do you really intend to transpose your image?
2. It's conventional to place a comment before the code that it describes. So instead of:

   ppm.write("P3" + "\n" + str(height) + " " + str(width) +"\n" + "255" + "\n") # write PPM file header

   most Python programmers would write it like this:

   # Write PPM file header: see <http://netpbm.sourceforge.net/doc/ppm.html>
   ppm.write("P3\n{} {}\n255\n".format(width, height))
3. You haven't managed to vectorize all your code. Everything up to the computation of bcoords is vectorized and will run fast, but then you have a loop over the data where you pick a random vertex corresponding to each target. This loop runs in slow Python code and so is going to be the bottleneck in your application. See §3 below for how this part of the code can be vectorized, using numpy.cumsum to find the cumulative sums of the barycentric coordinates, numpy.random.uniform to generate random samples for each target, and numpy.argmax to pick the vertex. (A minimal standalone sketch of this weighted-choice trick follows this list.)

4. Similarly, writing the output colours to the PPM would be faster if you used numpy.savetxt to write them all at once (instead of one at a time as at present).

5. You compute all four barycentric coordinates for each target using the recipe from scipy.spatial.Delaunay:

   bcoords = np.c_[b, 1 - b.sum(axis=1)]

   but in fact you don't need the fourth coordinate, since you are only going to use the cumulative sums of these coordinates, and the fourth cumulative sum is always 1.
6. The format of colorlist.csv is quite inconvenient, with a mixture of two types of data (colours and vertices) using a mixture of spaces and commas as delimiters:

   255 63 127,35.5344302104,21.380721966,20.3661095969

   Where does this file come from? If it's under your control, could you generate it in a more convenient format, for example as one file containing colours and another containing vertices? Or if you have to generate it as a single file, then could you consistently use either spaces or commas as delimiters?
7. Where do the targets come from? Your code implies that you have exactly one target for each pixel of the image MAP.tif, so is it the case that the targets are generated by processing this image in some way? If so, shouldn't you just generate the targets directly from MAP.tif instead of via the intermediate targets.csv?
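Here's a minimal standalone sketch of the cumsum/uniform/argmax weighted-choice trick from point 3 above. The weights are made-up toy values standing in for the barycentric coordinates (each row sums to 1):

import numpy as np

# Toy weights: one row per target, one column per tetrahedron vertex.
W = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.7, 0.1, 0.1, 0.1]])

# Running totals along each row; force the last column to exactly 1
# so that every random draw finds a match (guards against rounding).
cum = W.cumsum(axis=1)
cum[:, -1] = 1.0

r = np.random.uniform(0, 1, size=(len(W), 1))  # one draw per row
picks = np.argmax(r <= cum, axis=1)  # first column whose total >= the draw
print(picks)  # e.g. [3 0]: one weighted choice per row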
3. Revised code
Here's the dithering algorithm, fully vectorized:
import numpy as np
import scipy.spatial

WHITE = np.array([[255, 255, 255]], dtype=np.uint8)

def dither_colors(vertices, colors, targets, default_color=WHITE):
    """Randomly dither targets onto vertices and return array of
    corresponding colors.

    Input:
    vertices: Points in 3-space. shape=(N,3), dtype=float
    colors: RGB colors for each point. shape=(N,3), dtype=uint8
    targets: Points in 3-space. shape=(M,3), dtype=float
    default_color: RGB color for targets not found in convex hull of
        vertices. shape=(1,3), dtype=uint8

    Output:
    Array of colors. shape=(M,3), dtype=uint8

    Works by triangulating vertices, getting a complex of tetrahedra.
    For each target, randomly pick one of the vertices of the
    tetrahedron it is found in (weighted according to how close the
    target is to that vertex), and assign the color of that vertex to
    the target. Targets that are not found in any tetrahedron get
    assigned default_color.
    """
    N = len(vertices)
    assert vertices.shape == (N, 3)
    assert colors.shape == (N, 3)
    M = len(targets)
    assert targets.shape == (M, 3)
    assert default_color.shape == (1, 3)

    # Compute Delaunay triangulation of vertices.
    triangulation = scipy.spatial.Delaunay(vertices, furthest_site=False)

    # Find the tetrahedron containing each target (or -1 if not found).
    tetrahedra = triangulation.find_simplex(targets)

    # Affine transformation for the tetrahedron containing each target.
    U = triangulation.transform[tetrahedra, :3]

    # Offset of each target from the origin of its containing tetrahedron.
    V = targets - triangulation.transform[tetrahedra, 3]

    # First three barycentric coordinates of each target in its tetrahedron.
    # The fourth coordinate would be 1 - B.sum(axis=1), but we don't need it.
    B = np.einsum('ijk,ik->ij', U, V)

    # Cumulative sum of barycentric coordinates of each target.
    Bsum = np.c_[B.cumsum(axis=1), np.ones(M)]

    # A uniform random number in [0, 1] for each target.
    R = np.random.uniform(0, 1, size=(M, 1))

    # Randomly choose one of the tetrahedron vertices for each target,
    # weighted according to its barycentric coordinates.
    W = np.argmax(R <= Bsum, axis=1)

    # Get the color corresponding to each vertex just chosen.
    C = colors[triangulation.simplices[tetrahedra, W]]

    # Mask out the targets where we failed to find a tetrahedron.
    C[tetrahedra == -1] = default_color

    assert C.shape == (M, 3)
    return C
And here's a function that handles the input and output:
def dither_image(width, height,
                 vertices_file='colorlist.csv',
                 targets_file='targets.csv',
                 output_file='output.ppm'):
    """Read colors and vertices from vertices_file. Read targets from
    targets_file. Dither each target to a randomly chosen nearby
    vertex; output the colors for the dithered vertices to output_file
    as a PPM with the given width and height.
    """
    vertices = np.loadtxt(vertices_file, usecols=(1, 2, 3), delimiter=',')
    colors = np.loadtxt(vertices_file, usecols=(0,), delimiter=',',
                        converters={0: lambda s: s.split()}, dtype=np.uint8)
    targets = np.loadtxt(targets_file, delimiter=',')
    C = dither_colors(vertices, colors, targets)

    # Write output as image in PPM format.
    # See <http://netpbm.sourceforge.net/doc/ppm.html>
    with open(output_file, 'wb') as ppm:
        ppm.write("P3\n{} {}\n255\n".format(width, height).encode('ascii'))
        np.savetxt(ppm, C, fmt='%d')
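A minimal driver might then look like this, assuming (as your original code does) that the output dimensions come from MAP.tif:

import cv2

img = cv2.imread("MAP.tif", -1)
height, width = img.shape
dither_image(width, height)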
4. Emacs bug
In the course of preparing this answer, I discovered a bug in the way Emacs handles PPM images on OS X, which I reported as bug #16683.
- Amazing answer, thanks! I learned a lot by reading it. Now I get what is meant by vectorizing. The file colorlist.csv contains 2 types of color coordinates: "machine code" RGB, and real colorimetric coordinates computed from spectrophotometric measurements and the CIE 1931 system: XYZ. But you're right, I'll restructure this file as R, G, B, X, Y, Z instead of R G B, X, Y, Z. And no, the targets are not coming from the .tiff file. Thanks for showing how amazingly speedy pandas is! All my best. – adrienlucca.net, Feb 10, 2014 at 21:29
Here you read your data into a Python list that you pass to array:
targets = np.array([(float(X), float(Y), float(Z))
                    for X, Y, Z in csv.reader(open('targets.csv'))])
A Python list of floats uses much more memory than a NumPy array, because each float is a separate Python object. You can avoid that overhead by reading directly into an array, e.g. using np.loadtxt:
targets = np.loadtxt('targets.csv', delimiter=',')
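To see the overhead concretely, here's a rough illustration (exact object sizes are CPython- and platform-specific, so treat the numbers as approximate):

import sys
import numpy as np

a = np.zeros(10**6)  # a million float64 values
print(a.nbytes)      # 8000000: 8 bytes per value, stored contiguously

lst = a.tolist()
# Each Python float is a separate object (about 24 bytes on 64-bit
# CPython), plus roughly 8 bytes per list slot for the pointer to it.
print(sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst))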