So I'm writing some code to perform a quite specific task given a large numpy array with N rows and 3 columns representing points in 3D. The 3D points are to be binned along one of the dimensions between specified bin edges. For each of these bins, there is a set fraction by which I must reduce the number of points in that bin, perfectly (pseudo)randomly.
Here is the code I have written to perform this task. I found myself spending a lot of time trying to figure out the most 'pythonic' way to achieve this. It still seems very clunky, so I'm sure there must be a more elegant way that capitalises on numpy's array performance. Any tips?
# Initialise the array of 3D points
points = np.zeros((N, 3))
# grab the point data from elsewhere
points[:, 0] = ...
points[:, 1] = ...
points[:, 2] = ...  # this is the dimension we bin in, call it z
# Create an array of M + 1 elements for the edges of M bins
binedges = ...  # (they do not span all of the space of points by the way)
# Find the counts per bin
H = np.histogram(points[:, 2], bins=binedges)
# The number to downsample to in each bin is already known
num_down = ("""some M-dimensional array of fractions""") * H[0]
# initialise a mask for the final array for my analysis with dimension N
finmask = np.zeros(points.shape[0], dtype=bool)
# loop over bins (do I really need to do this??)
for i, nd in enumerate(num_down):
    # First get the array ids of the points in each bin
    zbin_ids = np.where((binedges[i] < points[:, 2]) &
                        (points[:, 2] <= binedges[i + 1]))
    # Choose ids at random without replacement
    keep = np.random.choice(zbin_ids[0], size=int(nd), replace=False)
    # What's left is turned on in the mask for the final array
    finmask[keep] = True
points = points[finmask]
1 Answer
You could use numpy.digitize to determine which bin each point belongs in, allowing for a simpler expression inside where.
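A minimal sketch of that approach, using toy data in place of the real points and fractions (the array sizes, edge values, and the name fractions are all illustrative, not from the question):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real data: 1000 points, z uniform in [0, 10),
# M = 6 bins spanning only part of that range, keep 50% per bin.
points = rng.uniform(0.0, 10.0, size=(1000, 3))
binedges = np.linspace(2.0, 8.0, 7)
fractions = np.full(len(binedges) - 1, 0.5)

# digitize labels each point with its bin: 0 means below the first edge,
# len(binedges) means above the last, so in-range points get 1..M.
bin_idx = np.digitize(points[:, 2], binedges)

# Per-bin counts fall straight out of the labels; the [1:-1] slice
# drops the two out-of-range buckets.
counts = np.bincount(bin_idx, minlength=len(binedges) + 1)[1:-1]
num_down = (fractions * counts).astype(int)

finmask = np.zeros(points.shape[0], dtype=bool)
for i, nd in enumerate(num_down):
    # Simpler expression inside where: a single equality test per bin
    in_bin = np.where(bin_idx == i + 1)[0]
    keep = rng.choice(in_bin, size=nd, replace=False)
    finmask[keep] = True

points = points[finmask]
```

The loop over bins remains, but each iteration reduces to one equality test on the precomputed labels rather than a pair of edge comparisons, and the counts come for free from the same labels via bincount.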