K-Nearest Neighbors in pure Python

Question 1

I want a general criticism on this code. Using external modules is not an option, I can only use what comes with CPython.

from collections import Counter
from typing import Sequence, NamedTuple

Coordinates = Sequence[float]


class KNNPoint(NamedTuple):
    coords: Coordinates
    classif: str


def predict(target: Coordinates, points: Sequence[KNNPoint], k: int) -> str:
    '''
    Applies the K-Nearest Neighbors algorithm in order to find the
    classification of target.
    - target: Single data point to be classified.
    - points: Collection of data points whose classifications are known.
    - k: The number of closest neighbors to be used.
    '''
    def distance(p: KNNPoint) -> float:
        # Squared Euclidean distance; the square root is not needed for ranking.
        return sum((a - b) ** 2 for a, b in zip(target, p.coords))

    neighbors = sorted(points, key=distance)
    counter = Counter(x.classif for x in neighbors[:k])
    return counter.most_common(1)[0][0]

If you'd like to run it, this gist has everything ready. It uses a dataset of mobile phones. (The gist itself is not up for review.)

Question 2

If you want any speed from this, you will have to use external modules. Otherwise, I don't think you will get even close to the performance of sklearn.neighbors.KDTree, which is a better data structure for this than a list and is also implemented in C.

Question 3

@Graipher Yes. Also, running the same thing on PyPy3 takes only 35% of the time taken on CPython.

Question 4

I know you explicitly stated the gist shall not be reviewed ... but ... you really need to normalize your data! The clock speed (GHz) has a range of 2.5, whereas the RAM (MB) has a range of 3742. When you compute the distance between points, any difference in clock speed is drowned out by even a small difference in RAM. For example, the entire clock-speed range of 0.5 GHz to 3.0 GHz contributes a squared distance of only 6.25, whereas a tiny change in RAM, say from 256 MB to 512 MB, contributes a squared distance roughly 10,000 times larger.
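One common way to do the normalization described above is min-max scaling, which maps every feature to the range [0, 1]. The following is only a sketch under assumptions: the helper name fit_min_max and the three sample rows (clock speed in GHz, RAM in MB) are made up for illustration and are not taken from the gist. Note that the target must be scaled with the same bounds as the training points.

```python
from typing import Callable, List, Sequence


def fit_min_max(rows: Sequence[Sequence[float]]) -> Callable[[Sequence[float]], List[float]]:
    """Return a function that rescales a row using the per-column
    minimum and range observed in `rows` (hypothetical helper)."""
    columns = list(zip(*rows))                 # transpose: one tuple per feature
    lows = [min(col) for col in columns]
    # A constant column has range 0; fall back to 1.0 to avoid dividing by zero.
    spans = [(max(col) - low) or 1.0 for col, low in zip(columns, lows)]

    def scale(row: Sequence[float]) -> List[float]:
        return [(v - low) / span for v, low, span in zip(row, lows, spans)]

    return scale


# Illustrative data only: (clock speed in GHz, RAM in MB).
raw = [(0.5, 256.0), (3.0, 512.0), (1.75, 3998.0)]
scale = fit_min_max(raw)
scaled = [scale(row) for row in raw]
scaled_target = scale((2.0, 1024.0))           # scale the query the same way
```

After scaling, a full-range difference in clock speed and a full-range difference in RAM both contribute at most 1.0 to the squared distance.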

Question 5

@AJNeufeld I didn't think about that. I'll normalize the data and see if it helps, however I won't report my results here because it's not part of the review. Thanks for pointing it out!

Question 6

Assumption: k << N, where N = len(points)

There is no need to sort the entire list of points!

Instead, take the first k points, determine their distance values, and sort them. Then, for each successive point:

- determine its distance;
- if it is smaller than the current maximum, discard the maximum and insert the new point into the sorted list.

Sorting N points by distance is O(N log N); maintaining only the k smallest elements takes O(N log k) comparisons, which should be considerably faster.

I'm not sure if heapq.nsmallest() is built into CPython or not ...

import heapq

k_neighbours = heapq.nsmallest(k, points, key=distance)
counter = Counter(x.classif for x in k_neighbours)

Well, I'm disappointed to see heapq.nsmallest() performed up to 40% worse than sorted() on CPython, but I'm happy to see PyPy validates my assertion that you don't need to sort the entire list.
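For anyone who wants to repeat this comparison, here is a rough timing sketch. It is an assumption-laden stand-in, not the benchmark that produced the numbers above: it uses random synthetic points rather than the gist's data set, and the sizes (2000 points, 4 features, k = 5) are arbitrary. Results will differ between machines and interpreters.

```python
import heapq
import random
import timeit

random.seed(0)  # reproducible synthetic data (not the gist's data set)
target = [random.random() for _ in range(4)]
points = [[random.random() for _ in range(4)] for _ in range(2000)]
k = 5


def distance(p):
    # Squared Euclidean distance to the fixed target.
    return sum((a - b) ** 2 for a, b in zip(target, p))


def with_sorted():
    # Sort everything, then keep the first k.
    return sorted(points, key=distance)[:k]


def with_nsmallest():
    # Let heapq select only the k smallest.
    return heapq.nsmallest(k, points, key=distance)


if __name__ == "__main__":
    for func in (with_sorted, with_nsmallest):
        seconds = timeit.timeit(func, number=20)
        print(f"{func.__name__}: {seconds:.3f} s for 20 runs")
```

Both functions return the same k points, so any difference in the timings comes from the selection strategy alone.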

Continuing with that thought, bisect.insort() may be used to maintain a list of the k-nearest neighbours so far:

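import bisect

# Sentinels at infinite distance; assumes len(points) >= k, so all get replaced.
neighbours = [(float('inf'), None)] * k
for pnt in points:
    dist = distance(pnt)
    if dist < neighbours[-1][0]:
        # Closer than the current k-th nearest: drop the farthest, insert in order.
        neighbours.pop()
        bisect.insort(neighbours, (dist, pnt))
counter = Counter(pnt.classif for dist, pnt in neighbours)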

This gave me a 4% speedup over sorted()[:k] with your gist sample set.

Significant, but not impressive. Still, it was enough encouragement to press on and look for other inefficiencies.

How about the distance() code? It gets called a lot; can we speed it up? Sure!

def predict(target: Coordinates, points: Sequence[KNNPoint], k: int, *,
            sum=sum, zip=zip) -> str:
    def distance(p: KNNPoint) -> float:
        return sum((a - b) ** 2 for a, b in zip(target, p.coords))
    # ...

Instead of looking up sum and zip in the global scope (and then the built-in scope) on every call, they are bound once as keyword-only parameters of predict(), so distance() reaches them, along with target, as variables of the enclosing function. Total improvement: 6%.

Applying the same sum=sum, zip=zip change to the original code, without the bisect.insort() change, also speeds it up by 2%.
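To see what the sum=sum, zip=zip trick changes, one can inspect the bytecode with the standard dis module. This is a sketch only: the two factory functions are hypothetical stand-ins for predict(), and opcode names are a CPython implementation detail that may vary between versions.

```python
import dis


def make_global_lookup(target):
    def distance(coords):
        # sum and zip are resolved through the global/built-in scopes.
        return sum((a - b) ** 2 for a, b in zip(target, coords))
    return distance


def make_closure_lookup(target, *, sum=sum, zip=zip):
    def distance(coords):
        # sum and zip are now variables of the enclosing function.
        return sum((a - b) ** 2 for a, b in zip(target, coords))
    return distance


def lookup_opcodes(func):
    """Map 'sum' and 'zip' to the opcode used to load them in `func`."""
    return {
        ins.argval: ins.opname
        for ins in dis.get_instructions(func)
        if ins.argval in ("sum", "zip")
    }


if __name__ == "__main__":
    print(lookup_opcodes(make_global_lookup((0.0, 0.0))))
    print(lookup_opcodes(make_closure_lookup((0.0, 0.0))))
```

On CPython the first function loads both names with LOAD_GLOBAL, while the second avoids LOAD_GLOBAL entirely. Both compute the same distances; only the name lookup differs.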

Further, adding insort=bisect.insort to the function declaration, and using insort(neighbours, (dist, pnt)) in the function body also provides a minor improvement.

Finally, I was concerned about neighbours[-1][0]. Looking up the first element of the last tuple in the list on every iteration seemed inefficient. We could keep track of this in a local threshold variable. Final total speedup: 7.7%.

neighbours = [(float('inf'), None)] * k
threshold = neighbours[-1][0]    # distance of the current k-th nearest
for pnt in points:
    dist = distance(pnt)
    if dist < threshold:
        neighbours.pop()
        insort(neighbours, (dist, pnt))    # insort=bisect.insort parameter
        threshold = neighbours[-1][0]

YMMV
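Putting the pieces above together, a complete function might look roughly like the sketch below. It is assembled from the fragments in this answer and is not a verbatim copy of the code that was timed. Two things are additions of this sketch: skipping leftover None sentinels when there are fewer than k points, and the toy data at the end.

```python
import bisect
from collections import Counter
from typing import NamedTuple, Sequence

Coordinates = Sequence[float]


class KNNPoint(NamedTuple):
    coords: Coordinates
    classif: str


def predict(target: Coordinates, points: Sequence[KNNPoint], k: int, *,
            sum=sum, zip=zip, insort=bisect.insort) -> str:
    def distance(p: KNNPoint) -> float:
        return sum((a - b) ** 2 for a, b in zip(target, p.coords))

    # Sorted list of the k nearest so far, seeded with sentinels.
    neighbours = [(float('inf'), None)] * k
    threshold = neighbours[-1][0]
    for pnt in points:
        dist = distance(pnt)
        if dist < threshold:
            neighbours.pop()
            # Equal distances fall back to comparing the KNNPoint tuples.
            insort(neighbours, (dist, pnt))
            threshold = neighbours[-1][0]

    # Addition in this sketch: ignore sentinels left over when len(points) < k.
    counter = Counter(pnt.classif for dist, pnt in neighbours if pnt is not None)
    return counter.most_common(1)[0][0]


# Toy data, made up for illustration.
sample = [
    KNNPoint((1.0, 1.0), "a"),
    KNNPoint((1.2, 0.8), "a"),
    KNNPoint((5.0, 5.0), "b"),
    KNNPoint((5.5, 4.5), "b"),
    KNNPoint((0.9, 1.1), "a"),
]
near_a = predict((1.0, 1.0), sample, k=3)
near_b = predict((5.2, 4.8), sample, k=3)
```

Like the original, this raises an IndexError if points is empty, and it expects k to be at least 1.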

Question 7

In CPython, nsmallest performed on par with sorted for small k, and 40% worse for large k. Using PyPy, nsmallest performed 50% better than sorted for both small and large k.

Question 8

And yes, it's built into CPython.

Question 9

Thanks, I didn't know about insort. I just found out that changing (a - b) ** 2 to (a - b) * (a - b) gives a 28% performance improvement on CPython, while it makes no difference on PyPy (I guess PyPy's JIT compiles both versions to the same thing).
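A rough way to check this on your own interpreter is a small timeit comparison like the sketch below. The vector length and repeat count are arbitrary assumptions, and the measured difference will vary by Python version and machine.

```python
import random
import timeit

random.seed(0)  # reproducible synthetic vectors
target = [random.random() for _ in range(8)]
coords = [random.random() for _ in range(8)]


def dist_pow(t, c):
    # Squares each difference with the power operator.
    return sum((a - b) ** 2 for a, b in zip(t, c))


def dist_mul(t, c):
    # Squares each difference with a plain multiplication.
    return sum((a - b) * (a - b) for a, b in zip(t, c))


if __name__ == "__main__":
    for func in (dist_pow, dist_mul):
        seconds = timeit.timeit(lambda: func(target, coords), number=100_000)
        print(f"{func.__name__}: {seconds:.3f} s for 100,000 calls")
```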

Question 10

Wow! Good find! Makes my micro-optimizations almost laughable. Nice low-hanging fruit!

AJNeufeld · Accepted Answer · 2018-07-10 05:21:04Z

