I'm attempting to speed up some Python code that automatically picks the `min_samples` argument for DBSCAN. Currently the execution time grows much faster than linearly with the number of training samples:
- 0.129 seconds for 100 training examples
- 0.282 seconds for 1000 training examples
- 11.531 seconds for 10000 training examples
`min_samples` is the number of samples (or total weight) in a neighborhood for a point to be considered a core point. This includes the point itself.
Can anyone make some suggestions for how to speed up the code?
As a bonus: passing the `n_jobs` argument to DBSCAN doesn't seem to affect its execution time.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import euclidean_distances
from time import time

def autoselect_minsamps_orig(D_sym, eps, K, kmedian_thresh):
    min_samps = 10
    kmedian_dist_prev = 0
    for k in range(2, K):
        k_avg = []
        # For each row, compute median distance to k nearest neighbors
        for row in D_sym:
            flat_row = np.squeeze(np.asarray(row))
            row_nonzero = flat_row[np.flatnonzero(flat_row)]
            # Some nodes aren't connected to anything
            if len(row_nonzero) == 0:
                continue
            sorted_row = np.sort(row_nonzero)
            sorted_row_t = sorted_row[0:k]
            k_avg.append(np.median(sorted_row_t))
        kmedian_dist = np.median(k_avg)
        kmedian_delta = abs(kmedian_dist - kmedian_dist_prev)
        if kmedian_delta <= kmedian_thresh:
            min_samps = k
            break
        kmedian_dist_prev = kmedian_dist
    return min_samps
num_training_examples = 100
num_features = 10
X = np.random.randint(5, size=(num_training_examples, num_features))
# generate a symmetric distance matrix
D = euclidean_distances(X,X)
eps = 0.25
kmedian_thresh = 0.005
K = num_training_examples
start = time()
result = autoselect_minsamps_orig(D,eps,K,kmedian_thresh)
end = time()
total_time = end - start
print('autoselect_minsamps took {} seconds for {} training examples'.format(total_time,num_training_examples))
num_training_examples = 1000
num_features = 10
X = np.random.randint(5, size=(num_training_examples, num_features))
# generate a symmetric distance matrix
D = euclidean_distances(X,X)
eps = 0.25
kmedian_thresh = 0.005
K = num_training_examples
start = time()
result = autoselect_minsamps_orig(D,eps,K,kmedian_thresh)
end = time()
total_time = end - start
print('autoselect_minsamps took {} seconds for {} training examples'.format(total_time,num_training_examples))
num_training_examples = 10000
num_features = 10
X = np.random.randint(5, size=(num_training_examples, num_features))
# generate a symmetric distance matrix
D = euclidean_distances(X,X)
eps = 0.25
kmedian_thresh = 0.005
K = num_training_examples
start = time()
min_samples = autoselect_minsamps_orig(D,eps,K,kmedian_thresh)
end = time()
total_time = end - start
print('autoselect_minsamps took {} seconds for {} training examples'.format(total_time,num_training_examples))
# min_samples is used here
db = DBSCAN(eps=eps,
            min_samples=min_samples,
            metric='precomputed',
            n_jobs=-1).fit(D)
1 Answer
Sorting the whole row only to take the `k` smallest values is wasteful, especially as the rows get longer. Using `numpy.partition` should always be faster:
sorted_row_t = np.partition(row_nonzero, k)[:k]
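For illustration (with a made-up row, not data from the question): `np.partition` only guarantees that the `k` smallest values occupy the first `k` slots, not that they are sorted, which is all the subsequent `np.median` call needs:

```python
import numpy as np

a = np.array([7.0, 2.0, 9.0, 1.0, 5.0, 3.0])
k = 3

# The k smallest values land in the first k slots, in no particular order.
smallest_k = np.partition(a, k)[:k]

# Same set of values as fully sorting and slicing...
assert set(smallest_k) == set(np.sort(a)[:k])  # {1.0, 2.0, 3.0}

# ...and the median is identical either way, since median sorts internally.
assert np.median(smallest_k) == np.median(np.sort(a)[:k])  # 2.0
```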
With the examples you give, this line is not needed, since each row is already a flat `numpy` array:

flat_row = np.squeeze(np.asarray(row))
Similarly, you can just use `numpy.nonzero`; there is no need to flatten the already-flat row again:

row_nonzero = flat_row[np.nonzero(flat_row)]
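On a 1-D array all of these spellings select the same elements. A small sketch (with a made-up row) comparing them, including a plain boolean mask, which is arguably the most direct:

```python
import numpy as np

flat_row = np.array([0.0, 1.5, 0.0, 2.5, 0.75])

a = flat_row[np.flatnonzero(flat_row)]  # original
b = flat_row[np.nonzero(flat_row)]      # suggested
c = flat_row[flat_row != 0]             # boolean mask
assert np.array_equal(a, b) and np.array_equal(b, c)  # all [1.5, 2.5, 0.75]
```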
However, this whole `for` loop can be done entirely in `numpy`:

First, mask all zeros by setting them to `numpy.nan`:

D = np.where(D == 0, np.nan, D)

Then get the `k` smallest elements of each row. Since `numpy.nan` compares as larger than any number, it will only show up in the first `k` slots if (almost) the whole row was zeros:

k_smallest = np.partition(D, k, axis=1)[:, :k]

Finally, compute the median of each row, and then the median of those medians, ignoring `numpy.nan` values by using `numpy.nanmedian`:

kmedian_dist = np.nanmedian(np.nanmedian(k_smallest, axis=1))
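The NaN handling can be verified on a tiny made-up row (this example is mine, not from the original answer): `numpy` sorts and partitions `nan` to the end, so it only leaks into the first `k` slots when there are fewer than `k` finite values:

```python
import numpy as np

row = np.array([0.0, 3.0, 0.0, 1.0, 2.0])
row = np.where(row == 0, np.nan, row)   # [nan, 3., nan, 1., 2.]

# nan partitions to the end, so the two smallest finite values win:
k_smallest = np.partition(row, 2)[:2]   # some ordering of [1., 2.]
assert np.nanmedian(k_smallest) == 1.5

# nan only appears in the slice once k exceeds the number of finite values:
assert np.isnan(np.partition(row, 4)[:4]).sum() == 1
```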
The first two steps can even be pulled outside of the `for k in ...` loop, which will save some more time. In that case the whole array has to be sorted, though, because `numpy.partition` does not guarantee that the values are sorted, only that the `k` smallest values are in the slice up to `k`. Since `K` seems to be `max(D_sym.shape)` in all cases anyway, this should not make such a big difference. It will if the `k` at which the `for` loop stops is small enough, though, so you will have to test this.
There's no need to `break` and then immediately `return` afterwards; just `return` directly.
Python has an official style guide, PEP 8. It recommends using 4 spaces for indentation (not 3) and spaces after commas.
In summary, this should do the same as your code:
def autoselect_minsamps_graipher(D_sym, eps, K, kmedian_thresh):
    D = np.sort(np.where(D_sym == 0, np.nan, D_sym), axis=1)
    kmedian_dist_prev = 0
    for k in range(2, K):
        kmedian_dist = np.nanmedian(np.nanmedian(D[:, :k], axis=1))
        if abs(kmedian_dist - kmedian_dist_prev) <= kmedian_thresh:
            return k
        kmedian_dist_prev = kmedian_dist
    return 10  # fallback value
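As a quick sanity check that the rewrite runs end to end, here is a self-contained sketch (mine, not from the original answer) that builds the symmetric distance matrix with plain NumPy instead of `euclidean_distances`, purely so the snippet has no extra dependencies:

```python
import numpy as np

def autoselect_minsamps_graipher(D_sym, eps, K, kmedian_thresh):
    D = np.sort(np.where(D_sym == 0, np.nan, D_sym), axis=1)
    kmedian_dist_prev = 0
    for k in range(2, K):
        kmedian_dist = np.nanmedian(np.nanmedian(D[:, :k], axis=1))
        if abs(kmedian_dist - kmedian_dist_prev) <= kmedian_thresh:
            return k
        kmedian_dist_prev = kmedian_dist
    return 10  # fallback value

rng = np.random.default_rng(0)
X = rng.random((100, 10))
# symmetric Euclidean distance matrix with zeros on the diagonal,
# standing in for sklearn's euclidean_distances(X, X)
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

min_samples = autoselect_minsamps_graipher(D, eps=0.25, K=100, kmedian_thresh=0.005)
assert 2 <= min_samples <= 100  # either an early-stopping k or the fallback
```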
This is faster than your implementation, but suffers from similar scaling behaviour:
n autoselect_minsamps_orig autoselect_minsamps_graipher
100 0.028 s 0.0088 s
1000 0.32 s 0.064 s
10000 9.5 s 3.6 s
You could also make timing your code easier by using a context manager:

from time import perf_counter

class Timer:
    def __init__(self, name=""):
        self.name = name
        self.start = None

    def __enter__(self):
        self.start = perf_counter()

    def __exit__(self, *args, **kwargs):
        delta = perf_counter() - self.start
        print(f"{self.name} finished in {delta:.2f} seconds")
Or, if you are stuck on Python 2:
from time import time

class Timer:
    def __init__(self, name=""):
        self.name = name
        self.start = None

    def __enter__(self):
        self.start = time()

    def __exit__(self, *args, **kwargs):
        delta = time() - self.start
        print "{} finished in {:.2f} seconds".format(self.name, delta)
Which you can use like this:
D = euclidean_distances(X,X)
eps = 0.25
kmedian_thresh = 0.005
K = num_training_examples
with Timer(f"autoselect_minsamps_orig, K = {K}"):
    min_samples = autoselect_minsamps_orig(D, eps, K, kmedian_thresh)
Note that I used `time.perf_counter`, which is better suited for timing, but which is Python 3 only.
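If the same code has to run under both Python 2 and 3, one common pattern (a sketch of mine, not part of the original answer) is to fall back at import time:

```python
try:
    from time import perf_counter  # Python 3.3+: monotonic, high resolution
except ImportError:
    from time import time as perf_counter  # Python 2 fallback: wall clock

start = perf_counter()
total = sum(range(10000))  # some stand-in work to time
elapsed = perf_counter() - start
assert elapsed >= 0
```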
Comments:

- random_dsp_guy (Feb 17, 2020 at 17:24): I'm having trouble importing `perf_counter` from `time`... is it a Python 3 feature? I'm stuck with 2.7...
- Graipher (Feb 17, 2020 at 17:26): @random_dsp_guy It is. You can use `time.time()` instead. In that case, you should also do `print "...".format(...)` wherever I did `print(f"...")`.
- random_dsp_guy (Feb 17, 2020 at 17:33): Could you post a Python 2.7 compatible solution? I'm having trouble getting the `name` argument to `Timer` to work (`AttributeError: Timer instance has no attribute 'name'`)
- Graipher (Feb 17, 2020 at 18:07): @random_dsp_guy That is actually a typo independent of the Python version, fixed. I'll see if I have some time later to do a Python 2 version.
- Graipher (Feb 17, 2020 at 18:12): @random_dsp_guy Ok, I quickly did it for the class; you will have to adapt the calling code yourself (though it is just the name string). Code is untested, so let me know if there are further typos hidden ;-) In the future, if you are asking a question and are still stuck on Python 2, we have the tag python-2.x, which you can add in addition to the normal Python tag.