Perhaps the most common use of nearest-neighbor search is nearest-neighbor classification. It turns out, however, that solving this problem need not be as hard as actually finding the k nearest neighbors and then taking the majority label. To determine the correct class label for a query point, one need only determine whether more of its k nearest neighbors belong to the positive class or to the negative class.
We demonstrate for the first time that this can be achieved exactly using
bounds on distances, in less time than it takes to actually identify
the
k points in question using the best state-of-the-art
algorithms. We show, for example, that this makes efficient search possible for a dataset with over 1 million dimensions and a few hundred thousand points, something that cannot be achieved with full
nearest-neighbor search. (Liu, Moore, and Gray, New Algorithms for Efficient High-dimensional Nonparametric Classification, NIPS 2003.) A journal version has been accepted to JMLR but hasn't appeared yet. Ting Liu has also developed a multi-class generalization of this algorithm in a separate paper.
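To make the decision question concrete, here is a small illustrative sketch, not the paper's algorithm: for odd k and distinct distances, the k nearest neighbors of a query contain a positive majority exactly when the ((k+1)/2)-th nearest positive point is closer to the query than the ((k+1)/2)-th nearest negative point. The Python code below checks that condition by brute force; the function and variable names (knn_majority_is_positive, X_pos, X_neg) are illustrative only, and the point of the paper's approach is that this kind of comparison can be resolved from bounds on distances rather than from the exhaustive distance computations done here.

import numpy as np

def knn_majority_is_positive(X_pos, X_neg, q, k):
    """Decide whether the k nearest neighbors of q are mostly positive,
    without identifying which k points those are.

    Illustrative sketch only: assumes odd k, distinct distances, and at
    least k points in total. Positives form the majority iff the m-th
    nearest positive point is closer to q than the m-th nearest negative
    point, where m = (k + 1) // 2.
    """
    assert k % 2 == 1, "odd k keeps the vote from tying"
    m = (k + 1) // 2  # number of positives needed for a majority

    d_pos = np.linalg.norm(X_pos - q, axis=1)
    d_neg = np.linalg.norm(X_neg - q, axis=1)

    # np.partition finds the m-th smallest distance without a full sort;
    # a class with fewer than m points cannot supply the majority.
    mth_pos = np.partition(d_pos, m - 1)[m - 1] if d_pos.size >= m else np.inf
    mth_neg = np.partition(d_neg, m - 1)[m - 1] if d_neg.size >= m else np.inf

    return mth_pos < mth_neg

# Sanity check against an explicit k-NN majority vote on synthetic data.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+1.0, size=(500, 20))
X_neg = rng.normal(loc=-1.0, size=(500, 20))
q = rng.normal(size=20)
k = 9
d_all = np.concatenate([np.linalg.norm(X_pos - q, axis=1),
                        np.linalg.norm(X_neg - q, axis=1)])
labels = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])
vote_positive = labels[np.argsort(d_all)[:k]].sum() > k / 2
print(knn_majority_is_positive(X_pos, X_neg, q, k), vote_positive)  # should agree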