I'm attempting to speed up some Python code that automatically picks the `min_samples` argument for DBSCAN. Currently the execution time grows much faster than linearly with the number of training samples:
- 0.129 seconds for 100 training examples
- 0.282 seconds for 1000 training examples
- 11.531 seconds for 10000 training examples
`min_samples` is the number of samples (or total weight) in a neighborhood for a point to be considered a core point. This includes the point itself.
Can anyone make some suggestions for how to speed up the code?
As a bonus: passing the `n_jobs` argument to DBSCAN doesn't seem to affect its execution time.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import euclidean_distances
from time import time

def autoselect_minsamps_orig(D_sym, eps, K, kmedian_thresh):
    min_samps = 10
    kmedian_dist_prev = 0
    for k in range(2, K):
        k_avg = []
        # For each row, compute median distance to k nearest neighbors
        for row in D_sym:
            flat_row = np.squeeze(np.asarray(row))
            row_nonzero = flat_row[np.flatnonzero(flat_row)]
            # Some nodes aren't connected to anything
            if len(row_nonzero) == 0:
                continue
            sorted_row = np.sort(row_nonzero)
            sorted_row_t = sorted_row[0:k]
            k_avg.append(np.median(sorted_row_t))
        kmedian_dist = np.median(k_avg)
        kmedian_delta = abs(kmedian_dist - kmedian_dist_prev)
        if kmedian_delta <= kmedian_thresh:
            min_samps = k
            break
        kmedian_dist_prev = kmedian_dist
    return min_samps
num_training_examples = 100
num_features = 10
X = np.random.randint(5, size=(num_training_examples, num_features))
# generate a symmetric distance matrix
D = euclidean_distances(X,X)
eps = 0.25
kmedian_thresh = 0.005
K = num_training_examples
start = time()
result = autoselect_minsamps_orig(D,eps,K,kmedian_thresh)
end = time()
total_time = end - start
print('autoselect_minsamps took {} seconds for {} training examples'.format(total_time,num_training_examples))
num_training_examples = 1000
num_features = 10
X = np.random.randint(5, size=(num_training_examples, num_features))
# generate a symmetric distance matrix
D = euclidean_distances(X,X)
eps = 0.25
kmedian_thresh = 0.005
K = num_training_examples
start = time()
result = autoselect_minsamps_orig(D,eps,K,kmedian_thresh)
end = time()
total_time = end - start
print('autoselect_minsamps took {} seconds for {} training examples'.format(total_time,num_training_examples))
num_training_examples = 10000
num_features = 10
X = np.random.randint(5, size=(num_training_examples, num_features))
# generate a symmetric distance matrix
D = euclidean_distances(X,X)
eps = 0.25
kmedian_thresh = 0.005
K = num_training_examples
start = time()
min_samples = autoselect_minsamps_orig(D,eps,K,kmedian_thresh)
end = time()
total_time = end - start
print('autoselect_minsamps took {} seconds for {} training examples'.format(total_time,num_training_examples))
# min_samples is used here
db = DBSCAN(eps=eps,
            min_samples=min_samples,
            metric='precomputed',
            n_jobs=-1).fit(D)
1 Answer
Sorting the whole row only to take the `k` smallest values is wasteful, especially as the rows get longer. Using `numpy.partition` should always be faster:
sorted_row_t = np.partition(row_nonzero, k)[:k]
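For illustration (with a made-up row, not data from the question): `np.partition` only guarantees that the `k` smallest values occupy the first `k` slots, not that they are sorted, which is all the subsequent `np.median` call needs:

```python
import numpy as np

a = np.array([7.0, 2.0, 9.0, 1.0, 5.0, 3.0])
k = 3

# The k smallest values land in the first k slots, in no particular order.
smallest_k = np.partition(a, k)[:k]

# Same set of values as fully sorting and slicing...
assert set(smallest_k) == set(np.sort(a)[:k])  # {1.0, 2.0, 3.0}

# ...and the median is identical either way, since median sorts internally.
assert np.median(smallest_k) == np.median(np.sort(a)[:k])  # 2.0
```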
With the examples you give, this line is not needed, since each row is already a flat `numpy` array:

flat_row = np.squeeze(np.asarray(row))
Similarly, you can just use `numpy.nonzero`; there is no need to flatten the already-flat row again:

row_nonzero = flat_row[np.nonzero(flat_row)]
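On a 1-D array all of these spellings select the same elements. A small sketch (with a made-up row) comparing them, including a plain boolean mask, which is arguably the most direct:

```python
import numpy as np

flat_row = np.array([0.0, 1.5, 0.0, 2.5, 0.75])

a = flat_row[np.flatnonzero(flat_row)]  # original
b = flat_row[np.nonzero(flat_row)]      # suggested
c = flat_row[flat_row != 0]             # boolean mask
assert np.array_equal(a, b) and np.array_equal(b, c)  # all [1.5, 2.5, 0.75]
```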
However, this whole `for` loop can be done entirely in `numpy`:

First, mask all zeros by setting them to `numpy.nan`:

D = np.where(D == 0, np.nan, D)

Then get the `k` smallest elements of each row. Since `numpy.nan` compares as larger than any number, it will only show up in the first `k` slots if (almost) the whole row was zeros:

k_smallest = np.partition(D, k, axis=1)[:, :k]

Finally, compute the median of each row, and then the median of those medians, ignoring `numpy.nan` values by using `numpy.nanmedian`:

kmedian_dist = np.nanmedian(np.nanmedian(k_smallest, axis=1))
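The NaN handling can be verified on a tiny made-up row (this example is mine, not from the original answer): `numpy` sorts and partitions `nan` to the end, so it only leaks into the first `k` slots when there are fewer than `k` finite values:

```python
import numpy as np

row = np.array([0.0, 3.0, 0.0, 1.0, 2.0])
row = np.where(row == 0, np.nan, row)   # [nan, 3., nan, 1., 2.]

# nan partitions to the end, so the two smallest finite values win:
k_smallest = np.partition(row, 2)[:2]   # some ordering of [1., 2.]
assert np.nanmedian(k_smallest) == 1.5

# nan only appears in the slice once k exceeds the number of finite values:
assert np.isnan(np.partition(row, 4)[:4]).sum() == 1
```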
The first two steps can even be pulled outside of the `for k in ...` loop, which will save some more time. In that case the whole array has to be sorted, though, because `numpy.partition` does not guarantee that the values are sorted, only that the `k` smallest values are in the slice up to `k`. Since `K` seems to be `max(D_sym.shape)` in all cases anyway, this should not make such a big difference. It will if the `k` at which the `for` loop stops is small enough, though, so you will have to test this.
There's no need to `break` and then immediately `return` afterwards; just `return` directly.
Python has an official style guide, PEP 8. It recommends using 4 spaces for indentation (not 3) and spaces after commas.
In summary, this should do the same as your code:
def autoselect_minsamps_graipher(D_sym, eps, K, kmedian_thresh):
    D = np.sort(np.where(D_sym == 0, np.nan, D_sym), axis=1)
    kmedian_dist_prev = 0
    for k in range(2, K):
        kmedian_dist = np.nanmedian(np.nanmedian(D[:, :k], axis=1))
        if abs(kmedian_dist - kmedian_dist_prev) <= kmedian_thresh:
            return k
        kmedian_dist_prev = kmedian_dist
    return 10  # fallback value
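As a quick sanity check that the rewrite runs end to end, here is a self-contained sketch (mine, not from the original answer) that builds the symmetric distance matrix with plain NumPy instead of `euclidean_distances`, purely so the snippet has no extra dependencies:

```python
import numpy as np

def autoselect_minsamps_graipher(D_sym, eps, K, kmedian_thresh):
    D = np.sort(np.where(D_sym == 0, np.nan, D_sym), axis=1)
    kmedian_dist_prev = 0
    for k in range(2, K):
        kmedian_dist = np.nanmedian(np.nanmedian(D[:, :k], axis=1))
        if abs(kmedian_dist - kmedian_dist_prev) <= kmedian_thresh:
            return k
        kmedian_dist_prev = kmedian_dist
    return 10  # fallback value

rng = np.random.default_rng(0)
X = rng.random((100, 10))
# symmetric Euclidean distance matrix with zeros on the diagonal,
# standing in for sklearn's euclidean_distances(X, X)
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

min_samples = autoselect_minsamps_graipher(D, eps=0.25, K=100, kmedian_thresh=0.005)
assert 2 <= min_samples <= 100  # either an early-stopping k or the fallback
```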
This is faster than your implementation, but suffers from similar scaling behaviour:
n autoselect_minsamps_orig autoselect_minsamps_graipher
100 0.028 s 0.0088 s
1000 0.32 s 0.064 s
10000 9.5 s 3.6 s
You could also make timing your code easier by using a context manager:

from time import perf_counter

class Timer:
    def __init__(self, name=""):
        self.name = name
        self.start = None

    def __enter__(self):
        self.start = perf_counter()

    def __exit__(self, *args, **kwargs):
        delta = perf_counter() - self.start
        print(f"{self.name} finished in {delta:.2f} seconds")
Or, if you are stuck on Python 2:
from time import time

class Timer:
    def __init__(self, name=""):
        self.name = name
        self.start = None

    def __enter__(self):
        self.start = time()

    def __exit__(self, *args, **kwargs):
        delta = time() - self.start
        print "{} finished in {:.2f} seconds".format(self.name, delta)
Which you can use like this:
D = euclidean_distances(X,X)
eps = 0.25
kmedian_thresh = 0.005
K = num_training_examples
with Timer(f"autoselect_minsamps_orig, K = {K}"):
    min_samples = autoselect_minsamps_orig(D, eps, K, kmedian_thresh)
Note that I used `time.perf_counter`, which is better suited for timing, but which is Python 3 only.
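If the same code has to run under both Python 2 and 3, one common pattern (a sketch of mine, not part of the original answer) is to fall back at import time:

```python
try:
    from time import perf_counter  # Python 3.3+: monotonic, high resolution
except ImportError:
    from time import time as perf_counter  # Python 2 fallback: wall clock

start = perf_counter()
total = sum(range(10000))  # some stand-in work to time
elapsed = perf_counter() - start
assert elapsed >= 0
```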
Comments:

- random_dsp_guy (Feb 17, 2020 at 17:24): I'm having trouble importing `perf_counter` from `time`... is it a Python 3 feature? I'm stuck with 2.7...
- Graipher (Feb 17, 2020 at 17:26): @random_dsp_guy It is. You can use `time.time()` instead. In that case, you should also do `print "...".format(...)` wherever I did `print(f"...")`.
- random_dsp_guy (Feb 17, 2020 at 17:33): Could you post a Python 2.7 compatible solution? I'm having trouble getting the `name` argument to `Timer` to work (`AttributeError: Timer instance has no attribute 'name'`)
- Graipher (Feb 17, 2020 at 18:07): @random_dsp_guy That is actually a typo independent of the Python version, fixed. I'll see if I have some time later to do a Python 2 version.
- Graipher (Feb 17, 2020 at 18:12): @random_dsp_guy Ok, I quickly did it for the class; you will have to adapt the calling code yourself (though it is just the name string). Code is untested, so let me know if there are further typos hidden ;-) In the future, if you are asking a question and are still stuck on Python 2, we have the tag python-2.x, which you can add in addition to the normal Python tag.