I'am writing on a spatial clustering algorithm using pandas and scipy's kdtree. I profiled the code and the .loc
part takes most time for bigger datasets. I wonder if its possible to speed up the points.loc[idx, 'cluster'] = clusterNr
somehow.
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
def getRandomCoordinates(samples=1000, offsetX=52.2, offsetY=13.1, width=0.5):
points = np.random.rand(samples, 2) * width
#points = da.random.random(size=(samples, 2), chunks=(500, 500))
data = pd.DataFrame(points, columns=['lat', 'lng'])
data.lat += offsetX
data.lng += offsetY
# set spatial properties
data.columnX = 'lat'
data.columnY = 'lng'
return data
radius = 0.01
points = getRandomCoordinates(25)
samples = points.sample(10)
tree = NearestNeighbors(n_neighbors=2, radius=0.1, leaf_size=30, algorithm="ball_tree", n_jobs=1).fit(points)
nngrf = tree.radius_neighbors_graph(samples, radius, mode='connectivity').toarray().astype(np.bool)
points['cluster'] = -1
for clusterNr, idx in enumerate(nngrf):
points.loc[idx, 'cluster'] = clusterNr
Input data:
lng lat
0 12.988426 52.343361
1 13.055824 52.396462
2 13.353571 52.347457
3 12.980915 52.339021
4 13.232137 52.339155
5 12.877804 52.385926
6 13.220915 52.378951
7 13.479688 52.424455
8 13.324399 52.637530
9 13.052958 52.398084
10 13.087653 52.413064
11 13.330557 52.637883
12 13.354927 52.380040
13 13.163061 52.514445
14 13.371755 52.520665
15 13.698472 52.389397
16 13.405825 52.507757
17 13.239793 52.391341
18 13.369102 52.525122
19 13.322234 52.511453
20 13.326276 52.515045
21 13.318642 52.296283
22 13.411129 52.478509
23 13.207719 52.283844
24 13.222899 52.381747
and the result:
lng lat cluster
0 12.988426 52.343361 9
1 13.055824 52.396462 6
2 13.353571 52.347457 -1
3 12.980915 52.339021 9
4 13.232137 52.339155 4
5 12.877804 52.385926 -1
6 13.220915 52.378951 7
7 13.479688 52.424455 -1
8 13.324399 52.637530 -1
9 13.052958 52.398084 6
10 13.087653 52.413064 5
11 13.330557 52.637883 -1
12 13.354927 52.380040 0
13 13.163061 52.514445 -1
14 13.371755 52.520665 2
15 13.698472 52.389397 -1
16 13.405825 52.507757 1
17 13.239793 52.391341 -1
18 13.369102 52.525122 2
19 13.322234 52.511453 8
20 13.326276 52.515045 8
21 13.318642 52.296283 -1
22 13.411129 52.478509 -1
23 13.207719 52.283844 3
24 13.222899 52.381747 7
and the nearest neighbors graph:
[[False False False False False False False False False False False False
False False True False False False True False False False False False
False]
[False False True False False False False False False False False False
False False False False False False False False False False False False
False]
[False False False False False False False False False False False False
False False False False False False False False False False True False
False]
[False False False False False False False False False False False False
True False False False False False False False False False False False
False]
[False False False False False False True False False False False False
False False False False False False False False False False False False
True]
[False False False False False False False False False False False False
False False False False False False False True True False False False
False]
[False False False False False True False False False False False False
False False False False False False False False False False False False
False]
[False False False False False False False False False False False False
False False False True False False False False False False False False
False]
[False False False False False False False False False False False False
False False False False False False False False False False False True
False]
[False False False False False False False False False False False False
False False False False False False False True True False False False
False]]
1 Answer 1
Assignments to Pandas dataframes always work better when doing entire columns at once, instead of filling a single row or a small number of rows at a time. In the code below I first completely define a NumPy array of cluster numbers and only after it is completely defined do I pass it into the Pandas DataFrame.
Even aside from the speed issue, your
for
loop where you assign cluster numbers is very confusing because of a misuse ofenumerate
: you haveclusterNr, idx in enumerate(nngrf)
but the idiom is alwaysidx, value in enumerate(values)
. You have the position of the index and the value switched. In your particular case this works and is a nice trick but you should document how this works in your code if anyone besides you is ever meant to read it.I changed variable names to be PEP8 compliant and to be easier to read. I also added some comments, docstrings, and formatting. Particular changes of note were defining
n_points
andn_samples
as variables, wrapping the code you wrote into a function (so that I could line profile it), makingRADIUS
a variable defined in all-caps (which in Python suggests it is a constant), and defining the radius in the call to.radius_neighbors_graph
to be in terms ofRADIUS
rather than just be another magic number hard-coded into the code. I think most of these changes improve your code's readability and make it more in line with style guidelines for Python.Minor point: coercing to boolean with
.astype('bool')
works on SciPy sparse CSR matrices, and so doing the coercion before converting to a non-sparse NumPy array with.toarray()
should be slightly faster and use less memory than doing things the other way around -- no time is wasted on converting zeroes.import numpy as np import pandas as pd from sklearn.neighbors import NearestNeighbors %load_ext line_profiler def knn_clusters(n_points, n_samples): """ Finds the which of n_samples points is closest to each of n_points randomly defined points. """ RADIUS = 0.01 def get_random_coords(samples=1000, offsetX=52.2, offsetY=13.1, width=0.5): """Generate random coordinates in a pandas dataframe""" points = np.random.rand(samples, 2) * width data = pd.DataFrame(points, columns=['latitude', 'longitude']) data.latitude += offsetX data.longitude += offsetY # set spatial properties data.columnX = 'latitude' data.columnY = 'longitude' return data # some random points points = get_random_coords(n_points) # a subset of those points samples = points.sample(n_samples) # KNN tree = NearestNeighbors(n_neighbors=2, radius=RADIUS, leaf_size=30, algorithm="ball_tree", n_jobs=1, ).fit(points) # nn_graph = tree.radius_neighbors_graph(samples, radius = 10 * RADIUS, mode = 'connectivity', ).astype('bool').toarray() # faster assigment to pandas dataframes: entire columns at once cluster_number, point_number = np.where(nn_graph) cluster_assignments = -np.ones(n_points) cluster_assignments[point_number] = cluster_number points.loc[:, 'cluster'] = cluster_assignments # for comparison: assignment by specific rows is slow points['cluster_alt'] = -1 for clusterNr, idx in enumerate(nn_graph): points.loc[idx, 'cluster_alt'] = clusterNr # ensure equality of both approaches assert np.all(np.equal(points.cluster, points.cluster_alt)) return points, nn_graph
The results of line profiling with %lprun -f knn_clusters knn_clusters(10000, 1000)
were:
Timer unit: 1e-06 s
Total time: 0.999948 s
File: <ipython-input-76-7be6bc3ea686>
Function: knn_clusters at line 6
Line # Hits Time Per Hit % Time Line Contents
==============================================================
6 def knn_clusters(n_points, n_samples):
7 """
8 Finds the which of n_samples points is closest to each of n_points randomly defined points.
9 """
10 1 2 2.0 0.0 RADIUS = 0.01
11
12 1 2 2.0 0.0 def get_random_coords(samples=1000, offsetX=52.2, offsetY=13.1, width=0.5):
13 """Generate random coordinates in a pandas dataframe"""
14 points = np.random.rand(samples, 2) * width
15 data = pd.DataFrame(points, columns=['latitude', 'longitude'])
16 data.latitude += offsetX
17 data.longitude += offsetY
18
19 # set spatial properties
20 data.columnX = 'latitude'
21 data.columnY = 'longitude'
22
23 return data
24
25 # some random points
26 1 3710 3710.0 0.4 points = get_random_coords(n_points)
27
28 # a subset of those points
29 1 7687 7687.0 0.8 samples = points.sample(n_samples)
30
31 # KNN
32 1 4 4.0 0.0 tree = NearestNeighbors(n_neighbors=2,
33 1 2 2.0 0.0 radius=RADIUS,
34 1 2 2.0 0.0 leaf_size=30,
35 1 2 2.0 0.0 algorithm="ball_tree",
36 1 275 275.0 0.0 n_jobs=1,
37 1 6029 6029.0 0.6 ).fit(points)
38
39 #
40 1 7 7.0 0.0 nn_graph = tree.radius_neighbors_graph(samples,
41 1 4 4.0 0.0 radius=10*RADIUS,
42 1 90594 90594.0 9.1 mode='connectivity',
43 1 36823 36823.0 3.7 ).astype('bool').toarray()
44
45 # faster assigment to pandas dataframes: entire columns at once
46 1 73476 73476.0 7.3 cluster_number, point_number = np.where(nn_graph)
47 1 74 74.0 0.0 cluster_assignments = -np.ones(n_points)
48 1 11675 11675.0 1.2 cluster_assignments[point_number] = cluster_number
49 1 2548 2548.0 0.3 points.loc[:, 'cluster'] = cluster_assignments
50
51 # for comparison: assignment by specific rows is slow
52 1 889 889.0 0.1 points['cluster_alt'] = -1
53 1001 2464 2.5 0.2 for clusterNr, idx in enumerate(nn_graph):
54 1000 763337 763.3 76.3 points.loc[idx, 'cluster_alt'] = clusterNr
55
56 # ensure equality of both approaches
57 1 341 341.0 0.0 assert np.all(np.equal(points.cluster, points.cluster_alt))
58
59 1 1 1.0 0.0 return points, nn_graph
Thus assigning an entire column at once was just under 10x faster.
Explore related questions
See similar questions with these tags.