BK-tree
A BK-tree (short for Burkhard–Keller tree) is a metric tree suggested by Walter Austin Burkhard and Robert M. Keller [1], specifically adapted to discrete metric spaces. Given a way to measure the distance between any two elements of a set, a BK-tree is built from a single root node and several subtrees, each connected to the root as a child. All nodes in a subtree have the same distance to the root node, and the weight of the edge connecting the subtree to the root equals that distance. Each subtree of a BK-tree is itself a BK-tree.
BK-trees can be used for approximate string matching in a dictionary [2]. The problem is formulated as follows: given a pattern string P = p_1 p_2 ... p_m and a text string T = t_1 t_2 ... t_n, we need to find a substring T_{j',j} = t_{j'} ... t_j of T which, among all substrings of T, has the smallest edit distance to the pattern P.
A straightforward way to use a BK-tree here is to insert all Θ(n^2) substrings of T into the tree and then look up the stored substring with the smallest distance to the pattern P. This leads to a time complexity of Θ(n^2 log n). However, the algorithm is exact: it can report every substring with the smallest edit distance to P.
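As a rough sketch of this approach, the snippet below builds a BK-tree over all substrings of a small text and queries it with the pattern. The helper names (levenshtein, BKNode, insert, lookup) are our own minimal implementations of the primitives described in the following sections, not a standard library API:

```python
import math

def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKNode:
    def __init__(self, word):
        self.word = word
        self.children = {}          # arc weight -> child node

def insert(root, word):
    """Insert word into the BK-tree rooted at root; return the root."""
    if root is None:
        return BKNode(word)
    u = root
    while True:
        k = levenshtein(u.word, word)
        if k == 0:
            return root             # already stored
        if k not in u.children:
            u.children[k] = BKNode(word)
            return root
        u = u.children[k]

def lookup(root, word, d_max=math.inf):
    """Return (closest stored element, its distance to word)."""
    best, S = (None, d_max), [root]
    while S:
        u = S.pop()
        d_u = levenshtein(word, u.word)
        if d_u < best[1]:
            best = (u.word, d_u)
        S.extend(v for k, v in u.children.items() if abs(k - d_u) < best[1])
    return best

def closest_substring(T, P):
    """Insert every substring of T, then find the one closest to P."""
    root = None
    for i in range(len(T)):
        for j in range(i + 1, len(T) + 1):
            root = insert(root, T[i:j])
    return lookup(root, P)

print(closest_substring("abcdef", "cde"))   # → ('cde', 0)
```

Note that enumerating all Θ(n^2) substrings dominates the memory cost; this is what limits the approach to short texts.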
Example
Consider the BK-tree for the set W of words {"book", "books", "cake", "boo", "boon", "cook", "cape", "cart"}, obtained by using the Levenshtein distance, in which:
- each node u is labeled by a string w_u ∈ W;
- each arc (u, v) is labeled by d_uv = d(w_u, w_v), where w_u denotes the word assigned to u.
The BK-tree is built so that:
- for every node u of the BK-tree, the weights assigned to its outgoing arcs are distinct;
- for every arc e = (u, v) labeled by k, each descendant v' of v satisfies the equation d(w_u, w_{v'}) = k:
- Example 1: Consider the arc from "book" to "books". The distance between "book" and any word in {"books", "boo", "boon", "cook"} is equal to 1.
- Example 2: Consider the arc from "books" to "boo". The distance between "books" and any word in {"boo", "boon", "cook"} is equal to 2.
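The two examples can be checked directly with a short Python sketch; the levenshtein function below is a standard dynamic-programming edit distance, written here for illustration:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Example 1: every word below the weight-1 arc leaving "book"
# is at distance exactly 1 from "book".
assert all(levenshtein("book", w) == 1
           for w in ["books", "boo", "boon", "cook"])

# Example 2: every word below the weight-2 arc leaving "books"
# is at distance exactly 2 from "books".
assert all(levenshtein("books", w) == 2
           for w in ["boo", "boon", "cook"])
```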
Insertion
The insertion primitive is used to populate a BK-tree t according to a discrete metric d.
Input:
- t: the BK-tree;
- d_uv: the weight assigned to an arc (u, v);
- w_u: the word assigned to a node u;
- d: the discrete metric used by t (e.g. the Levenshtein distance);
- w: the element to be inserted into t.
Output:
- The node of t corresponding to w.
Algorithm:
- If t is empty:
    - Create a root node r in t
    - w_r ← w
    - Return r
- Set u to the root of t
- While u exists:
    - k ← d(w_u, w)
    - If k = 0:
        - Return u
    - Find the child v of u such that d_uv = k
    - If v is not found:
        - Create the node v
        - w_v ← w
        - Create the arc (u, v)
        - d_uv ← k
        - Return v
    - u ← v
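The pseudocode above maps naturally onto a small Python sketch. The node class and helper names below are illustrative choices, not part of the original description; children are keyed by arc weight, so finding the child at distance k is a dictionary lookup:

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKNode:
    def __init__(self, word):
        self.word = word
        self.children = {}   # arc weight k -> child node at distance k

def insert(root, word, dist=levenshtein):
    """Insert word into the BK-tree; return the node holding it.
    If root is None the tree is empty and the new node becomes the root."""
    if root is None:
        return BKNode(word)
    u = root
    while True:
        k = dist(u.word, word)
        if k == 0:
            return u                     # word is already stored at u
        v = u.children.get(k)
        if v is None:                    # no child at distance k yet
            v = u.children[k] = BKNode(word)
            return v
        u = v                            # descend into the subtree at distance k

# Building the example tree from the previous section:
root = BKNode("book")
for w in ["books", "cake", "boo", "boon", "cook", "cape", "cart"]:
    insert(root, w)
```

With this insertion order, "books" ends up under the weight-1 arc from "book" and "cake" under the weight-4 arc, matching the example above.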
Lookup
Given a searched element w, the lookup primitive traverses the BK-tree to find the element closest to w. The key idea is to restrict the exploration of t to nodes that could still improve on the best candidate found so far, taking advantage of the BK-tree organization and of the triangle inequality (cut-off criterion).
Input:
- t: the BK-tree;
- d: the corresponding discrete metric (e.g. the Levenshtein distance);
- w: the searched element;
- d_max: the maximum allowed distance between the best match and w; defaults to +∞.
Output:
- w_best: the element stored in t that is closest to w according to d, or ⊥ if none is found.
Algorithm:
- If t is empty:
    - Return ⊥
- Create S, a set of nodes to process, and insert the root of t into S
- (w_best, d_best) ← (⊥, d_max)
- While S ≠ ∅:
    - Pop an arbitrary node u from S
    - d_u ← d(w, w_u)
    - If d_u < d_best:
        - (w_best, d_best) ← (w_u, d_u)
    - For each outgoing arc (u, v):
        - If |d_uv − d_u| < d_best: (cut-off criterion)
            - Insert v into S
- Return w_best
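A matching Python sketch of the lookup primitive, reusing the minimal node representation from the insertion sketch (names are ours; S is a plain list used as a stack):

```python
import math

def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKNode:
    def __init__(self, word):
        self.word = word
        self.children = {}              # arc weight d_uv -> child node v

def insert(root, word):
    """Insert word; return the node of the tree corresponding to it."""
    u = root
    while True:
        k = levenshtein(u.word, word)
        if k == 0:
            return u
        if k not in u.children:
            u.children[k] = BKNode(word)
            return u.children[k]
        u = u.children[k]

def lookup(root, word, d_max=math.inf):
    """Return (w_best, d_best), or (None, d_max) when no element
    lies strictly within d_max of word."""
    if root is None:
        return None, d_max
    w_best, d_best = None, d_max
    S = [root]                          # the set of nodes to process
    while S:
        u = S.pop()                     # pop an arbitrary node
        d_u = levenshtein(word, u.word)
        if d_u < d_best:
            w_best, d_best = u.word, d_u
        for d_uv, v in u.children.items():
            if abs(d_uv - d_u) < d_best:    # cut-off criterion
                S.append(v)
    return w_best, d_best

root = BKNode("book")
for w in ["books", "cake", "boo", "boon", "cook", "cape", "cart"]:
    insert(root, w)
print(lookup(root, "cool"))             # → ('cook', 1)
```

Passing a finite d_max turns the search into a range-limited query: anything at distance d_max or more is reported as not found.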
Example of the lookup algorithm
Consider the eight-node BK-tree built in the example above and set w = "cool". S is initialized to contain the root of the tree, which is popped as the first value of u, with w_u = "book". Then d_u = 2, since the distance from "book" to "cool" is 2, and d_best = 2, as this is the best (i.e. smallest) distance found so far. Next, each outgoing arc from the root is considered in turn: the arc from "book" to "books" has weight 1, and since |1 − 2| = 1 is less than d_best = 2, the node containing "books" is inserted into S for further processing. The next arc, from "book" to "cake", has weight 4, and since |4 − 2| = 2 is not less than d_best = 2, the node containing "cake" is not inserted into S. The subtree rooted at "cake" is therefore pruned from the search, because the word closest to "cool" cannot appear in that subtree. To see why this pruning is correct, note that every word c in the subtree of "cake" is at distance exactly 4 from "book"; if such a word were at distance less than 2 from "cool", the triangle inequality would be violated, since d("book", "cool") + d("cool", c) < 2 + 2 = 4 = d("book", c). It is therefore safe to disregard the entire subtree rooted at "cake".
Next, the node containing "books" is popped from S, and now d_u = 3, the distance from "cool" to "books". Since d_u > d_best, d_best remains 2; the single outgoing arc from "books" (to "boo", with weight 2) satisfies |2 − 3| = 1 < d_best, so the node containing "boo" is inserted into S and subsequently popped, with d_u = 2, the distance from "cool" to "boo". This again does not improve upon d_best = 2. Each outgoing arc from "boo" is now considered: the arc from "boo" to "boon" has weight 1, and since |2 − 1| = 1 < d_best = 2, "boon" is added to S; similarly, since |2 − 2| = 0 < d_best, "cook" is also added to S.
Finally, the two remaining elements of S are considered in arbitrary order: suppose the node containing "cook" is popped first, improving d_best to 1; the node containing "boon" is then popped last, and since its distance 2 from "cool" does not improve the best result, "cook" is returned as the answer w_best with d_best = 1.
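The distances and pruning decisions in this walkthrough can be verified with a few lines of Python, using a standard dynamic-programming edit distance:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

d_best = levenshtein("cool", "book")
assert d_best == 2                        # first candidate: "book" at distance 2

# Arc "book" -> "books" (weight 1): |1 - 2| = 1 < 2, so the subtree is explored.
assert abs(1 - d_best) < d_best
# Arc "book" -> "cake" (weight 4): |4 - 2| = 2 is not < 2, so the subtree is pruned.
assert not (abs(4 - d_best) < d_best)

# Distances seen while processing S, matching the walkthrough:
assert levenshtein("cool", "books") == 3
assert levenshtein("cool", "boo") == 2
assert levenshtein("cool", "boon") == 2
assert levenshtein("cool", "cook") == 1   # the final answer
```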
Time complexity
The efficiency of BK-trees depends strongly on the structure of the tree and on the distribution of distances between stored elements.
In the average case, both insertion and lookup operations take Θ(log n) time, assuming the tree remains relatively balanced and the distance metric distributes elements evenly.[3][4]
In the worst case, when the data or the distance function causes the tree to become highly unbalanced (for example, when many elements are at similar distances), both insertion and lookup can degrade to Θ(n).[5][6]
In practical applications, the actual performance depends on the choice of distance metric (e.g., the Levenshtein distance) and on the allowed search radius during approximate matching.[7]
See also
- Levenshtein distance – the distance metric commonly used when building a BK-tree
- Damerau–Levenshtein distance – a modified form of Levenshtein distance that allows transpositions
References
- ^ W. Burkhard and R. Keller. Some approaches to best-match file searching. Communications of the ACM, 16(4):230–236, 1973.
- ^ R. Baeza-Yates, W. Cunto, U. Manber, and S. Wu. Proximity matching using fixed queries trees. In M. Crochemore and D. Gusfield, editors, 5th Combinatorial Pattern Matching, LNCS 807, pages 198–212, Asilomar, CA, June 1994.
- ^ Ricardo Baeza-Yates and Gonzalo Navarro. Fast Approximate String Matching in a Dictionary. Proc. SPIRE'98
- ^ W. A. Burkhard and R. M. Keller. Some approaches to best match file searching. Communications of the ACM, 16(4):230–236, 1973.
- ^ R. Baeza-Yates and G. Navarro. Fast Approximate String Matching in a Dictionary. In Proceedings of SPIRE'98, 1998.
- ^ H. Hyyro. Explaining and Extending the Burkhard–Keller Tree for Fast String Searching. In Proceedings of the 7th Symposium on String Processing and Information Retrieval (SPIRE), pages 1–10, 2003.
- ^ S. Mihov and K. Schulz. Fast approximate search in large dictionaries. Computational Linguistics, 30(4):451–477, 2004.