Find approximate nearest neighbors (ANN) and query vector embeddings

This page describes how to find approximate nearest neighbors (ANN) and query vector embeddings using the ANN distance functions.

When a dataset is small, you can use K-nearest neighbors (KNN) to find the exact k-nearest vectors. However, as your dataset grows, the latency and cost of a KNN search also increase. You can use ANN to find the approximate k-nearest neighbors with significantly reduced latency and cost.
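To make the cost of an exact search concrete, here is a minimal Python sketch of a brute-force KNN search. It is an illustration only, not how Spanner runs queries; the `docs` dictionary and the helper names are hypothetical. Every stored vector is compared with the query, so the work grows linearly with the dataset.

```python
import heapq
import math


def euclidean_distance(a, b):
    """Exact Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_knn(query, vectors, k):
    """Return the ids of the k vectors closest to `query`.

    `vectors` maps an id to its embedding. Every entry is visited,
    so the cost is proportional to the number of stored vectors.
    """
    return heapq.nsmallest(
        k, vectors, key=lambda doc_id: euclidean_distance(query, vectors[doc_id])
    )


# Hypothetical three-dimensional embeddings keyed by document id.
docs = {
    1: [1.0, 2.0, 3.0],
    2: [1.5, 2.0, 3.0],
    3: [9.0, 9.0, 9.0],
    4: [0.0, 0.0, 0.0],
}
nearest = exact_knn([1.0, 2.0, 3.0], docs, k=2)
```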

In an ANN search, the k returned vectors might not be the true top k nearest neighbors, because the search calculates approximate distances and might not look at every vector in the dataset. Occasionally, a few vectors that aren't among the top k nearest neighbors are returned. This is known as recall loss. How much recall loss is acceptable depends on your use case, but in most cases, trading a small amount of recall for improved database performance is acceptable.
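Recall can be measured by comparing an approximate result with the exact result for the same query. The following Python sketch assumes you already have both id lists; the function name and the sample ids are hypothetical.

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)


# Hypothetical result sets for k = 5: the approximate search missed
# document 17 and returned document 42 instead.
exact_top_5 = [3, 8, 11, 17, 20]
approx_top_5 = [3, 8, 11, 20, 42]
recall = recall_at_k(exact_top_5, approx_top_5)  # 4 of the 5 true neighbors found
```

A recall of 1.0 means the approximate search returned exactly the true top k; the recall loss is the difference from 1.0.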

For more details about the approximate distance functions supported in Spanner, see the GoogleSQL reference pages for those functions.

Query vector embeddings

Spanner accelerates approximate nearest neighbor (ANN) vector searches by using a vector index. To query vector embeddings, first create a vector index on the embedding column, and then use one of the three approximate distance functions to find the approximate nearest neighbors.
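This page uses APPROX_EUCLIDEAN_DISTANCE in its examples; the sketch below assumes that the other two functions are the cosine-distance and dot-product variants. It shows, in plain Python, the exact measures that such functions approximate. The helper names are hypothetical and the code is independent of Spanner.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; larger values mean more similar vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance; smaller values mean closer vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; depends only on direction, not on length."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot_product(a, b) / (norm_a * norm_b)


query = [1.0, 2.0, 3.0]
doc = [2.0, 4.0, 6.0]  # same direction as the query, twice the length

d_euclid = euclidean_distance(query, doc)  # nonzero: the lengths differ
d_cosine = cosine_distance(query, doc)     # about zero: the directions match
d_dot = dot_product(query, doc)
```

Because a larger dot product indicates a closer match, a nearest-neighbor ordering on dot product sorts in the opposite direction from the two distance measures.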

Restrictions when using the approximate distance functions include the following:

  • The approximate distance function must calculate the distance between an embedding column and a constant expression (for example, a parameter or a literal).
  • The approximate distance function output must be used in an ORDER BY clause as the sole sort key, and a LIMIT must be specified after the ORDER BY.
  • The query must explicitly filter out rows that aren't indexed. In most cases, this means that the query must include a WHERE <column_name> IS NOT NULL clause that matches the vector index definition, unless the column is already marked as NOT NULL in the table definition.

For a detailed list of limitations, see the approximate distance function reference page.

Examples

Consider a Documents table that has a DocEmbedding column of precomputed text embeddings from the DocContents bytes column, and a NullableDocEmbedding column populated from other sources that might be null.

CREATE TABLE Documents (
  UserId INT64 NOT NULL,
  DocId INT64 NOT NULL,
  Author STRING(1024),
  DocContents BYTES(MAX),
  DocEmbedding ARRAY<FLOAT32> NOT NULL,
  NullableDocEmbedding ARRAY<FLOAT32>,
  WordCount INT64
) PRIMARY KEY (UserId, DocId);

To search for the nearest 100 vectors to [1.0, 2.0, 3.0]:

SELECT DocId
FROM Documents
WHERE WordCount > 1000
ORDER BY APPROX_EUCLIDEAN_DISTANCE(
  ARRAY<FLOAT32>[1.0, 2.0, 3.0], DocEmbedding,
  options => JSON '{"num_leaves_to_search": 10}')
LIMIT 100

If the embedding column is nullable:

SELECT DocId
FROM Documents
WHERE NullableDocEmbedding IS NOT NULL AND WordCount > 1000
ORDER BY APPROX_EUCLIDEAN_DISTANCE(
  ARRAY<FLOAT32>[1.0, 2.0, 3.0], NullableDocEmbedding,
  options => JSON '{"num_leaves_to_search": 10}')
LIMIT 100
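Both queries pass a num_leaves_to_search option. To build intuition for why searching fewer leaves is faster but can lose recall, here is a toy Python sketch of a partitioned index. It is not Spanner's implementation: the fixed centroids, the two-dimensional vectors, and all function names are assumptions made for illustration.

```python
import heapq
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_leaves(vectors, centroids):
    """Assign each vector id to the leaf whose centroid is closest."""
    leaves = [[] for _ in centroids]
    for doc_id, vec in vectors.items():
        nearest = min(range(len(centroids)),
                      key=lambda i: euclidean_distance(vec, centroids[i]))
        leaves[nearest].append(doc_id)
    return leaves


def ann_search(query, vectors, centroids, leaves, k, num_leaves_to_search):
    """Scan only the leaves whose centroids are closest to the query."""
    ranked = sorted(range(len(centroids)),
                    key=lambda i: euclidean_distance(query, centroids[i]))
    candidates = [doc_id
                  for i in ranked[:num_leaves_to_search]
                  for doc_id in leaves[i]]
    return heapq.nsmallest(
        k, candidates,
        key=lambda doc_id: euclidean_distance(query, vectors[doc_id]))


# Hypothetical data: two leaves, with documents 3 and 4 on either side
# of the boundary between them.
centroids = [[0.0, 0.0], [10.0, 0.0]]
vectors = {1: [1.0, 0.0], 2: [2.0, 0.0], 3: [4.9, 0.0],
           4: [5.1, 0.0], 5: [9.0, 0.0]}
leaves = build_leaves(vectors, centroids)
query = [4.8, 0.0]

# Searching one leaf misses document 4, which sits in the other leaf.
narrow = ann_search(query, vectors, centroids, leaves, k=2, num_leaves_to_search=1)
# Searching both leaves scans every vector and recovers the exact top 2.
wide = ann_search(query, vectors, centroids, leaves, k=2, num_leaves_to_search=2)
```

In this toy setup, raising the number of leaves searched trades extra distance computations for higher recall, which is the same tradeoff described at the top of this page.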

Last updated 2025-11-10 UTC.