PANDAS spatial clustering

Question 1

I'am writing on a spatial clustering algorithm using pandas and scipy's kdtree. I profiled the code and the .loc part takes most time for bigger datasets. I wonder if its possible to speed up the points.loc[idx, 'cluster'] = clusterNr somehow.

import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
def getRandomCoordinates(samples=1000, offsetX=52.2, offsetY=13.1, width=0.5):
 points = np.random.rand(samples, 2) * width
 #points = da.random.random(size=(samples, 2), chunks=(500, 500))
 data = pd.DataFrame(points, columns=['lat', 'lng'])
 data.lat += offsetX
 data.lng += offsetY
 # set spatial properties
 data.columnX = 'lat'
 data.columnY = 'lng'
 return data
radius = 0.01
points = getRandomCoordinates(25)
samples = points.sample(10)
tree = NearestNeighbors(n_neighbors=2, radius=0.1, leaf_size=30, algorithm="ball_tree", n_jobs=1).fit(points)
nngrf = tree.radius_neighbors_graph(samples, radius, mode='connectivity').toarray().astype(np.bool)
points['cluster'] = -1
for clusterNr, idx in enumerate(nngrf):
 points.loc[idx, 'cluster'] = clusterNr

Input data:

 lng lat
0 12.988426 52.343361
1 13.055824 52.396462
2 13.353571 52.347457
3 12.980915 52.339021
4 13.232137 52.339155
5 12.877804 52.385926
6 13.220915 52.378951
7 13.479688 52.424455
8 13.324399 52.637530
9 13.052958 52.398084
10 13.087653 52.413064
11 13.330557 52.637883
12 13.354927 52.380040
13 13.163061 52.514445
14 13.371755 52.520665
15 13.698472 52.389397
16 13.405825 52.507757
17 13.239793 52.391341
18 13.369102 52.525122
19 13.322234 52.511453
20 13.326276 52.515045
21 13.318642 52.296283
22 13.411129 52.478509
23 13.207719 52.283844
24 13.222899 52.381747

and the result:

 lng lat cluster
0 12.988426 52.343361 9
1 13.055824 52.396462 6
2 13.353571 52.347457 -1
3 12.980915 52.339021 9
4 13.232137 52.339155 4
5 12.877804 52.385926 -1
6 13.220915 52.378951 7
7 13.479688 52.424455 -1
8 13.324399 52.637530 -1
9 13.052958 52.398084 6
10 13.087653 52.413064 5
11 13.330557 52.637883 -1
12 13.354927 52.380040 0
13 13.163061 52.514445 -1
14 13.371755 52.520665 2
15 13.698472 52.389397 -1
16 13.405825 52.507757 1
17 13.239793 52.391341 -1
18 13.369102 52.525122 2
19 13.322234 52.511453 8
20 13.326276 52.515045 8
21 13.318642 52.296283 -1
22 13.411129 52.478509 -1
23 13.207719 52.283844 3
24 13.222899 52.381747 7

and the nearest neighbors graph:

[[False False False False False False False False False False False False
 False False True False False False True False False False False False
 False]
 [False False True False False False False False False False False False
 False False False False False False False False False False False False
 False]
 [False False False False False False False False False False False False
 False False False False False False False False False False True False
 False]
 [False False False False False False False False False False False False
 True False False False False False False False False False False False
 False]
 [False False False False False False True False False False False False
 False False False False False False False False False False False False
 True]
 [False False False False False False False False False False False False
 False False False False False False False True True False False False
 False]
 [False False False False False True False False False False False False
 False False False False False False False False False False False False
 False]
 [False False False False False False False False False False False False
 False False False True False False False False False False False False
 False]
 [False False False False False False False False False False False False
 False False False False False False False False False False False True
 False]
 [False False False False False False False False False False False False
 False False False False False False False True True False False False
 False]]

Question 2

Assignments to Pandas dataframes always work better when doing entire columns at once, instead of filling a single row or a small number of rows at a time. In the code below I first completely define a NumPy array of cluster numbers and only after it is completely defined do I pass it into the Pandas DataFrame.
Even aside from the speed issue, your for loop where you assign cluster numbers is very confusing because of a misuse of enumerate: you have clusterNr, idx in enumerate(nngrf) but the idiom is always idx, value in enumerate(values). You have the position of the index and the value switched. In your particular case this works and is a nice trick but you should document how this works in your code if anyone besides you is ever meant to read it.
I changed variable names to be PEP8 compliant and to be easier to read. I also added some comments, docstrings, and formatting. Particular changes of note were defining n_points and n_samples as variables, wrapping the code you wrote into a function (so that I could line profile it), making RADIUS a variable defined in all-caps (which in Python suggests it is a constant), and defining the radius in the call to .radius_neighbors_graph to be in terms of RADIUS rather than just be another magic number hard-coded into the code. I think most of these changes improve your code's readability and make it more in line with style guidelines for Python.

Minor point: coercing to boolean with .astype('bool') works on SciPy sparse CSR matrices, and so doing the coercion before converting to a non-sparse NumPy array with .toarray() should be slightly faster and use less memory than doing things the other way around -- no time is wasted on converting zeroes.

import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
%load_ext line_profiler
def knn_clusters(n_points, n_samples):
 """
 Finds the which of n_samples points is closest to each of n_points randomly defined points.
 """
 RADIUS = 0.01
 def get_random_coords(samples=1000, offsetX=52.2, offsetY=13.1, width=0.5):
 """Generate random coordinates in a pandas dataframe"""
 points = np.random.rand(samples, 2) * width
 data = pd.DataFrame(points, columns=['latitude', 'longitude'])
 data.latitude += offsetX
 data.longitude += offsetY
 # set spatial properties
 data.columnX = 'latitude'
 data.columnY = 'longitude'
 return data
 # some random points
 points = get_random_coords(n_points)
 # a subset of those points
 samples = points.sample(n_samples)
 # KNN
 tree = NearestNeighbors(n_neighbors=2, 
 radius=RADIUS, 
 leaf_size=30, 
 algorithm="ball_tree", 
 n_jobs=1,
 ).fit(points)
 # 
 nn_graph = tree.radius_neighbors_graph(samples, 
 radius = 10 * RADIUS, 
 mode = 'connectivity',
 ).astype('bool').toarray()
 # faster assigment to pandas dataframes: entire columns at once
 cluster_number, point_number = np.where(nn_graph)
 cluster_assignments = -np.ones(n_points)
 cluster_assignments[point_number] = cluster_number
 points.loc[:, 'cluster'] = cluster_assignments
 # for comparison: assignment by specific rows is slow
 points['cluster_alt'] = -1
 for clusterNr, idx in enumerate(nn_graph):
 points.loc[idx, 'cluster_alt'] = clusterNr
 # ensure equality of both approaches
 assert np.all(np.equal(points.cluster, points.cluster_alt))
 return points, nn_graph

The results of line profiling with %lprun -f knn_clusters knn_clusters(10000, 1000) were:

Timer unit: 1e-06 s
Total time: 0.999948 s
File: <ipython-input-76-7be6bc3ea686>
Function: knn_clusters at line 6
Line # Hits Time Per Hit % Time Line Contents
==============================================================
 6 def knn_clusters(n_points, n_samples):
 7 """
 8 Finds the which of n_samples points is closest to each of n_points randomly defined points.
 9 """
 10 1 2 2.0 0.0 RADIUS = 0.01
 11 
 12 1 2 2.0 0.0 def get_random_coords(samples=1000, offsetX=52.2, offsetY=13.1, width=0.5):
 13 """Generate random coordinates in a pandas dataframe"""
 14 points = np.random.rand(samples, 2) * width
 15 data = pd.DataFrame(points, columns=['latitude', 'longitude'])
 16 data.latitude += offsetX
 17 data.longitude += offsetY
 18 
 19 # set spatial properties
 20 data.columnX = 'latitude'
 21 data.columnY = 'longitude'
 22 
 23 return data
 24 
 25 # some random points
 26 1 3710 3710.0 0.4 points = get_random_coords(n_points)
 27 
 28 # a subset of those points
 29 1 7687 7687.0 0.8 samples = points.sample(n_samples)
 30 
 31 # KNN
 32 1 4 4.0 0.0 tree = NearestNeighbors(n_neighbors=2, 
 33 1 2 2.0 0.0 radius=RADIUS, 
 34 1 2 2.0 0.0 leaf_size=30, 
 35 1 2 2.0 0.0 algorithm="ball_tree", 
 36 1 275 275.0 0.0 n_jobs=1,
 37 1 6029 6029.0 0.6 ).fit(points)
 38 
 39 # 
 40 1 7 7.0 0.0 nn_graph = tree.radius_neighbors_graph(samples, 
 41 1 4 4.0 0.0 radius=10*RADIUS, 
 42 1 90594 90594.0 9.1 mode='connectivity',
 43 1 36823 36823.0 3.7 ).astype('bool').toarray()
 44 
 45 # faster assigment to pandas dataframes: entire columns at once
 46 1 73476 73476.0 7.3 cluster_number, point_number = np.where(nn_graph)
 47 1 74 74.0 0.0 cluster_assignments = -np.ones(n_points)
 48 1 11675 11675.0 1.2 cluster_assignments[point_number] = cluster_number
 49 1 2548 2548.0 0.3 points.loc[:, 'cluster'] = cluster_assignments
 50 
 51 # for comparison: assignment by specific rows is slow
 52 1 889 889.0 0.1 points['cluster_alt'] = -1
 53 1001 2464 2.5 0.2 for clusterNr, idx in enumerate(nn_graph):
 54 1000 763337 763.3 76.3 points.loc[idx, 'cluster_alt'] = clusterNr
 55 
 56 # ensure equality of both approaches
 57 1 341 341.0 0.0 assert np.all(np.equal(points.cluster, points.cluster_alt))
 58 
 59 1 1 1.0 0.0 return points, nn_graph

Thus assigning an entire column at once was just under 10x faster.

Curt F. 1,65611 silver badges22 bronze badges · Answer 1 · 2016-01-31 10:48:01Z

Assignments to Pandas dataframes always work better when doing entire columns at once, instead of filling a single row or a small number of rows at a time. In the code below I first completely define a NumPy array of cluster numbers and only after it is completely defined do I pass it into the Pandas DataFrame.
Even aside from the speed issue, your for loop where you assign cluster numbers is very confusing because of a misuse of enumerate: you have clusterNr, idx in enumerate(nngrf) but the idiom is always idx, value in enumerate(values). You have the position of the index and the value switched. In your particular case this works and is a nice trick but you should document how this works in your code if anyone besides you is ever meant to read it.
I changed variable names to be PEP8 compliant and to be easier to read. I also added some comments, docstrings, and formatting. Particular changes of note were defining n_points and n_samples as variables, wrapping the code you wrote into a function (so that I could line profile it), making RADIUS a variable defined in all-caps (which in Python suggests it is a constant), and defining the radius in the call to .radius_neighbors_graph to be in terms of RADIUS rather than just be another magic number hard-coded into the code. I think most of these changes improve your code's readability and make it more in line with style guidelines for Python.

Minor point: coercing to boolean with .astype('bool') works on SciPy sparse CSR matrices, and so doing the coercion before converting to a non-sparse NumPy array with .toarray() should be slightly faster and use less memory than doing things the other way around -- no time is wasted on converting zeroes.

import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
%load_ext line_profiler
def knn_clusters(n_points, n_samples):
 """
 Finds the which of n_samples points is closest to each of n_points randomly defined points.
 """
 RADIUS = 0.01
 def get_random_coords(samples=1000, offsetX=52.2, offsetY=13.1, width=0.5):
 """Generate random coordinates in a pandas dataframe"""
 points = np.random.rand(samples, 2) * width
 data = pd.DataFrame(points, columns=['latitude', 'longitude'])
 data.latitude += offsetX
 data.longitude += offsetY
 # set spatial properties
 data.columnX = 'latitude'
 data.columnY = 'longitude'
 return data
 # some random points
 points = get_random_coords(n_points)
 # a subset of those points
 samples = points.sample(n_samples)
 # KNN
 tree = NearestNeighbors(n_neighbors=2, 
 radius=RADIUS, 
 leaf_size=30, 
 algorithm="ball_tree", 
 n_jobs=1,
 ).fit(points)
 # 
 nn_graph = tree.radius_neighbors_graph(samples, 
 radius = 10 * RADIUS, 
 mode = 'connectivity',
 ).astype('bool').toarray()
 # faster assigment to pandas dataframes: entire columns at once
 cluster_number, point_number = np.where(nn_graph)
 cluster_assignments = -np.ones(n_points)
 cluster_assignments[point_number] = cluster_number
 points.loc[:, 'cluster'] = cluster_assignments
 # for comparison: assignment by specific rows is slow
 points['cluster_alt'] = -1
 for clusterNr, idx in enumerate(nn_graph):
 points.loc[idx, 'cluster_alt'] = clusterNr
 # ensure equality of both approaches
 assert np.all(np.equal(points.cluster, points.cluster_alt))
 return points, nn_graph

The results of line profiling with %lprun -f knn_clusters knn_clusters(10000, 1000) were:

Timer unit: 1e-06 s
Total time: 0.999948 s
File: <ipython-input-76-7be6bc3ea686>
Function: knn_clusters at line 6
Line # Hits Time Per Hit % Time Line Contents
==============================================================
 6 def knn_clusters(n_points, n_samples):
 7 """
 8 Finds the which of n_samples points is closest to each of n_points randomly defined points.
 9 """
 10 1 2 2.0 0.0 RADIUS = 0.01
 11 
 12 1 2 2.0 0.0 def get_random_coords(samples=1000, offsetX=52.2, offsetY=13.1, width=0.5):
 13 """Generate random coordinates in a pandas dataframe"""
 14 points = np.random.rand(samples, 2) * width
 15 data = pd.DataFrame(points, columns=['latitude', 'longitude'])
 16 data.latitude += offsetX
 17 data.longitude += offsetY
 18 
 19 # set spatial properties
 20 data.columnX = 'latitude'
 21 data.columnY = 'longitude'
 22 
 23 return data
 24 
 25 # some random points
 26 1 3710 3710.0 0.4 points = get_random_coords(n_points)
 27 
 28 # a subset of those points
 29 1 7687 7687.0 0.8 samples = points.sample(n_samples)
 30 
 31 # KNN
 32 1 4 4.0 0.0 tree = NearestNeighbors(n_neighbors=2, 
 33 1 2 2.0 0.0 radius=RADIUS, 
 34 1 2 2.0 0.0 leaf_size=30, 
 35 1 2 2.0 0.0 algorithm="ball_tree", 
 36 1 275 275.0 0.0 n_jobs=1,
 37 1 6029 6029.0 0.6 ).fit(points)
 38 
 39 # 
 40 1 7 7.0 0.0 nn_graph = tree.radius_neighbors_graph(samples, 
 41 1 4 4.0 0.0 radius=10*RADIUS, 
 42 1 90594 90594.0 9.1 mode='connectivity',
 43 1 36823 36823.0 3.7 ).astype('bool').toarray()
 44 
 45 # faster assigment to pandas dataframes: entire columns at once
 46 1 73476 73476.0 7.3 cluster_number, point_number = np.where(nn_graph)
 47 1 74 74.0 0.0 cluster_assignments = -np.ones(n_points)
 48 1 11675 11675.0 1.2 cluster_assignments[point_number] = cluster_number
 49 1 2548 2548.0 0.3 points.loc[:, 'cluster'] = cluster_assignments
 50 
 51 # for comparison: assignment by specific rows is slow
 52 1 889 889.0 0.1 points['cluster_alt'] = -1
 53 1001 2464 2.5 0.2 for clusterNr, idx in enumerate(nn_graph):
 54 1000 763337 763.3 76.3 points.loc[idx, 'cluster_alt'] = clusterNr
 55 
 56 # ensure equality of both approaches
 57 1 341 341.0 0.0 assert np.all(np.equal(points.cluster, points.cluster_alt))
 58 
 59 1 1 1.0 0.0 return points, nn_graph

Thus assigning an entire column at once was just under 10x faster.

Stack Exchange Network

PANDAS spatial clustering

1 Answer 1

You must log in to answer this question.

Hot Network Questions

PANDAS spatial clustering

1 Answer 1

You must log in to answer this question.

Related

Hot Network Questions