With the help of some people here and there, I wrote the module below, which works very well as long as the file targets.csv is not too big. As soon as this file exceeds 100 MB, a MemoryError is raised.
The problem is that I used numpy to make the computations faster (and they became really fast compared to the previous versions of this module), and I think numpy requires loading the entire file into memory. However, there should be a way to handle the MemoryError that happens here:
targets = np.array([(float(X), float(Y), float(Z))
                    for X, Y, Z in csv.reader(open('targets.csv'))])
Any ideas?
Let me know if you need more details (files, etc.)
import csv
import numpy as np
import scipy.spatial
import cv2
import random
"""loading files"""
points = np.array([(int(R), int(G), int(B), float(X), float(Y), float(Z))
                   for R, G, B, X, Y, Z in csv.reader(open('colorlist.csv'))])
# load X,Y,Z coordinates of 'points' in a np.array
print "colorlist loaded"
targets = np.array([(float(X), float(Y), float(Z))
                    for X, Y, Z in csv.reader(open('targets.csv'))])
# load the XYZ target values in a np.array
print "targets loaded"
img = cv2.imread("MAP.tif", -1)
height, width = img.shape
total = height * width
# load dimensions of tif image
print "MAP loaded"
ppm = file("walladrien.ppm", 'w')
ppm.write("P3" + "\n" + str(height) + " " + str(width) +"\n" + "255" + "\n")
# write PPM file header
"""doing geometry"""
tri = scipy.spatial.Delaunay(points[:,[3,4,5]], furthest_site=False) # True makes an almost BW picture
# Delaunay triangulation
indices = tri.simplices
# indices of vertices
vertices = points[indices]
# the vertices for each tetrahedron
tet = tri.find_simplex(targets)
U = tri.transform[tet,:3]
V = targets - tri.transform[tet,3]
b = np.einsum('ijk,ik->ij', U, V)
bcoords = np.c_[b, 1 - b.sum(axis=1)]
# find the barycentric coordinates of each point
"""looping through data"""
i = 0
for i in range(total):
    if tet[i] == -1:
        # this means that the point lies outside the convex hull
        R = G = B = 255
        ppm.write(str(R) + ' ' + str(G) + ' ' + str(B) + "\n") # writes a pixel
    else:
        R0 = int(vertices[tet[i]][0][0])
        G0 = int(vertices[tet[i]][0][1])
        B0 = int(vertices[tet[i]][0][2])
        R1 = int(vertices[tet[i]][1][0])
        G1 = int(vertices[tet[i]][1][1])
        B1 = int(vertices[tet[i]][1][2])
        R2 = int(vertices[tet[i]][2][0])
        G2 = int(vertices[tet[i]][2][1])
        B2 = int(vertices[tet[i]][2][2])
        R3 = int(vertices[tet[i]][3][0])
        G3 = int(vertices[tet[i]][3][1])
        B3 = int(vertices[tet[i]][3][2])
        rand = random.uniform(0,1)
        BC_0 = bcoords[i][0]
        BC_1 = bcoords[i][1]
        BC_2 = bcoords[i][2]
        BC_3 = bcoords[i][3]
        i += 1
        if rand <= BC_0:
            ppm.write(str(R0) + ' ' + str(G0) + ' ' + str(B0) + "\n") # writes a pixel
        elif rand <= BC_0 + BC_1:
            ppm.write(str(R1) + ' ' + str(G1) + ' ' + str(B1) + "\n") # writes a pixel
        elif rand <= BC_0 + BC_1 + BC_2:
            ppm.write(str(R2) + ' ' + str(G2) + ' ' + str(B2) + "\n") # writes a pixel
        else:
            ppm.write(str(R3) + ' ' + str(G3) + ' ' + str(B3) + "\n") # writes a pixel
ppm.close()
print 'done'
- What does "100 Mo" mean? – Gareth Rees, Feb 10, 2014 at 12:59
- 100 MB, sorry, it's a French habit. – adrienlucca.net, Feb 10, 2014 at 21:12
- Aha, mégaoctet! – Gareth Rees, Feb 10, 2014 at 21:39
- Yes, exactly :) haha – adrienlucca.net, Feb 10, 2014 at 21:50
2 Answers
1. The MemoryError
I can't reproduce the MemoryError. Here's the test I ran:
def make_test(n=10**7, output_file='targets.csv'):
    with open(output_file, 'w') as f:
        for _ in range(n):
            f.write('56.08401,55.19490,25.49292\n')
>>> make_test()
>>> import os
>>> os.stat('targets.csv').st_size
270000000
That's 270 megabytes of data, and it loads fine for me:
>>> import csv, numpy as np
>>> targets = np.array([(float(X), float(Y), float(Z)) for X, Y, Z in csv.reader(open('targets.csv'))])
>>> targets.shape
(10000000, 3)
So I don't really know what is going wrong for you. Can you provide the exact text of the error message and its traceback? There might be more information there.
Regardless of the cause of the MemoryError, Janne is right to suggest that using numpy.loadtxt to get the data directly from the file (with no intermediate Python list) would be better:
targets = np.loadtxt('targets.csv', delimiter=',')
However, numpy.loadtxt is still quite slow, and if this is a bottleneck for you then you should investigate pandas:
>>> from timeit import timeit
>>> timeit(lambda:np.loadtxt('targets.csv', delimiter=','), number=1)
18.912313379812986
>>> import pandas
>>> timeit(lambda:pandas.read_csv('targets.csv', delimiter=',').values, number=1)
0.5977518488653004
2. Other comments on your code
1. The PPM format specification says that width is given before height, but you have:

   ppm.write("P3" + "\n" + str(height) + " " + str(width) +"\n" + "255" + "\n")

   Is that right? Do you really intend to transpose your image?
2. It's conventional to place a comment before the code that it describes. So instead of:

   ppm.write("P3" + "\n" + str(height) + " " + str(width) +"\n" + "255" + "\n") # write PPM file header

   most Python programmers would write it like this:

   # Write PPM file header: see <http://netpbm.sourceforge.net/doc/ppm.html>
   ppm.write("P3\n{} {}\n255\n".format(width, height))
3. You haven't managed to vectorize all your code. Everything up to the computation of bcoords is vectorized and will run fast, but then you have a loop over the data where you pick a random vertex corresponding to each target. This loop runs in slow Python code and so is going to be the bottleneck in your application. See §3 below for how this part of the code can be vectorized, using numpy.cumsum to find the cumulative sums of the barycentric coordinates, numpy.random.uniform to generate random samples for each target, and numpy.argmax to pick the vertex. (A minimal standalone sketch of this weighted-choice trick follows this list.)

4. Similarly, writing the output colours to the PPM would be faster if you used numpy.savetxt to write them all at once (instead of one at a time as at present).

5. You compute all four barycentric coordinates for each target using the recipe from scipy.spatial.Delaunay:

   bcoords = np.c_[b, 1 - b.sum(axis=1)]

   but in fact you don't need the fourth coordinate, since you are only going to use the cumulative sums of these coordinates, and the fourth cumulative sum is always 1.
6. The format of colorlist.csv is quite inconvenient, with a mixture of two types of data (colours and vertices) using a mixture of spaces and commas as delimiters:

   255 63 127,35.5344302104,21.380721966,20.3661095969

   Where does this file come from? If it's under your control, could you generate it in a more convenient format, for example as one file containing colours and another containing vertices? Or if you have to generate it as a single file, then could you consistently use either spaces or commas as delimiters?
7. Where do the targets come from? Your code implies that you have exactly one target for each pixel of the image MAP.tif, so is it the case that the targets are generated by processing this image in some way? If so, shouldn't you just generate the targets directly from MAP.tif instead of via the intermediate targets.csv?
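Here's a minimal standalone sketch of the cumsum/uniform/argmax weighted-choice trick from point 3 above. The weights are made-up toy values standing in for the barycentric coordinates (each row sums to 1):

import numpy as np

# Toy weights: one row per target, one column per tetrahedron vertex.
W = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.7, 0.1, 0.1, 0.1]])

# Running totals along each row; force the last column to exactly 1
# so that every random draw finds a match (guards against rounding).
cum = W.cumsum(axis=1)
cum[:, -1] = 1.0

r = np.random.uniform(0, 1, size=(len(W), 1))  # one draw per row
picks = np.argmax(r <= cum, axis=1)  # first column whose total >= the draw
print(picks)  # e.g. [3 0]: one weighted choice per row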
3. Revised code
Here's the dithering algorithm, fully vectorized:
import numpy as np
import scipy.spatial

WHITE = np.array([[255, 255, 255]], dtype=np.uint8)

def dither_colors(vertices, colors, targets, default_color=WHITE):
    """Randomly dither targets onto vertices and return array of
    corresponding colors.

    Input:
    vertices: Points in 3-space. shape=(N,3), dtype=float
    colors: RGB colors for each point. shape=(N,3), dtype=uint8
    targets: Points in 3-space. shape=(M,3), dtype=float
    default_color: RGB color for targets not found in convex hull of
        vertices. shape=(1,3), dtype=uint8

    Output:
    Array of colors. shape=(M,3), dtype=uint8

    Works by triangulating vertices, getting a complex of tetrahedra.
    For each target, randomly pick one of the vertices of the
    tetrahedron it is found in (weighted according to how close the
    target is to that vertex), and assign the color of that vertex to
    the target. Targets that are not found in any tetrahedron get
    assigned default_color.
    """
    N = len(vertices)
    assert vertices.shape == (N, 3)
    assert colors.shape == (N, 3)
    M = len(targets)
    assert targets.shape == (M, 3)
    assert default_color.shape == (1, 3)

    # Compute Delaunay triangulation of vertices.
    triangulation = scipy.spatial.Delaunay(vertices, furthest_site=False)

    # Find the tetrahedron containing each target (or -1 if not found).
    tetrahedra = triangulation.find_simplex(targets)

    # Affine transformation for the tetrahedron containing each target.
    U = triangulation.transform[tetrahedra, :3]

    # Offset of each target from the origin of its containing tetrahedron.
    V = targets - triangulation.transform[tetrahedra, 3]

    # First three barycentric coordinates of each target in its tetrahedron.
    # The fourth coordinate would be 1 - B.sum(axis=1), but we don't need it.
    B = np.einsum('ijk,ik->ij', U, V)

    # Cumulative sum of barycentric coordinates of each target.
    Bsum = np.c_[B.cumsum(axis=1), np.ones(M)]

    # A uniform random number in [0, 1] for each target.
    R = np.random.uniform(0, 1, size=(M, 1))

    # Randomly choose one of the tetrahedron vertices for each target,
    # weighted according to its barycentric coordinates.
    W = np.argmax(R <= Bsum, axis=1)

    # Get the color corresponding to each vertex just chosen.
    C = colors[triangulation.simplices[tetrahedra, W]]

    # Mask out the targets where we failed to find a tetrahedron.
    C[tetrahedra == -1] = default_color

    assert C.shape == (M, 3)
    return C
And here's a function that handles the input and output:
def dither_image(width, height,
                 vertices_file='colorlist.csv',
                 targets_file='targets.csv',
                 output_file='output.ppm'):
    """Read colors and vertices from vertices_file. Read targets from
    targets_file. Dither each target to a randomly chosen nearby
    vertex; output the colors for the dithered vertices to output_file
    as a PPM with the given width and height.
    """
    vertices = np.loadtxt(vertices_file, usecols=(1, 2, 3), delimiter=',')
    colors = np.loadtxt(vertices_file, usecols=(0,), delimiter=',',
                        converters={0: lambda s: s.split()}, dtype=np.uint8)
    targets = np.loadtxt(targets_file, delimiter=',')
    C = dither_colors(vertices, colors, targets)

    # Write output as image in PPM format.
    # See <http://netpbm.sourceforge.net/doc/ppm.html>
    with open(output_file, 'wb') as ppm:
        ppm.write("P3\n{} {}\n255\n".format(width, height).encode('ascii'))
        np.savetxt(ppm, C, fmt='%d')
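A minimal driver might then look like this, assuming (as your original code does) that the output dimensions come from MAP.tif:

import cv2

img = cv2.imread("MAP.tif", -1)
height, width = img.shape
dither_image(width, height)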
4. Emacs bug
In the course of preparing this answer, I discovered a bug in the way Emacs handles PPM images on OS X, which I reported as bug #16683.
- Amazing answer, thanks! I learned a lot by reading it. Now I get what is meant by vectorizing. The file colorlist.csv contains 2 types of color coordinates: "machine code" RGB, and real colorimetric coordinates computed from spectrophotometric measurements and the CIE 1931 system: XYZ. But you're right, I'll restructure this file as R, G, B, X, Y, Z instead of R G B, X, Y, Z. And no, the targets are not coming from the .tiff file. Thanks for showing how amazingly speedy pandas is! All my best. – adrienlucca.net, Feb 10, 2014 at 21:29
Here you read your data into a Python list that you pass to array:
targets = np.array([(float(X), float(Y), float(Z))
                    for X, Y, Z in csv.reader(open('targets.csv'))])
A Python list of floats uses much more memory than a NumPy array, because each float is a separate Python object. You can avoid that overhead by reading directly into an array, e.g. using np.loadtxt:
targets = np.loadtxt('targets.csv', delimiter=',')
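To see the overhead concretely, here's a rough illustration (exact object sizes are CPython- and platform-specific, so treat the numbers as approximate):

import sys
import numpy as np

a = np.zeros(10**6)  # a million float64 values
print(a.nbytes)      # 8000000: 8 bytes per value, stored contiguously

lst = a.tolist()
# Each Python float is a separate object (about 24 bytes on 64-bit
# CPython), plus roughly 8 bytes per list slot for the pointer to it.
print(sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst))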