Embeddings: How AI Knows Things Are Similar

That is semantic search. It is the foundation for a surprising number of AI applications.

What embeddings enable

The most common way people encounter embeddings is through RAG, but embeddings are useful on their own, without a language model involved at all.

Semantic search. The example I just described. Instead of matching keywords, you match meaning. This is how many modern search engines, documentation sites, and knowledge bases find relevant results even when your query uses different terminology than the source material.

Deduplication. If you have a database of support tickets and you want to find near-duplicates, you can embed each ticket and cluster the ones with high similarity. Two tickets that describe the same bug in different words will land close together.
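
To make the deduplication idea concrete, here is a minimal sketch. It assumes the tickets have already been embedded; the ticket IDs, the three-dimensional vectors, and the 0.9 threshold are made up for illustration, and real embeddings have hundreds or thousands of dimensions.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_near_duplicates(vectors, threshold=0.9):
    """Return (id_a, id_b, score) for every pair at or above the threshold."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(vectors.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs


# Hypothetical ticket vectors. In practice these come from an embedding model.
tickets = {
    "T-1": [0.80, 0.60, 0.00],
    "T-2": [0.79, 0.61, 0.02],  # same bug as T-1, described in different words
    "T-3": [0.00, 0.10, 0.99],  # unrelated ticket
}

duplicates = find_near_duplicates(tickets)
```

Comparing every pair is quadratic in the number of tickets, which is fine for a few thousand rows. Beyond that you would likely want an approximate nearest-neighbor index instead.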

Classification and clustering. Embed a set of documents and group them by similarity. Customer feedback sorts itself into themes without you defining the categories upfront. Product reviews cluster into topics. The structure emerges from the data.
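
A rough sketch of grouping without predefined categories, using a deliberately simple greedy approach rather than a proper clustering algorithm such as k-means: each item joins the first group whose seed item is similar enough, otherwise it starts a new group. The feedback strings, two-dimensional vectors, and threshold are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_cluster(vectors, threshold=0.9):
    """Group item ids; the first id in each cluster acts as its seed."""
    clusters = []
    for item_id, vec in vectors.items():
        for cluster in clusters:
            seed_vec = vectors[cluster[0]]
            if cosine_similarity(vec, seed_vec) >= threshold:
                cluster.append(item_id)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([item_id])
    return clusters


# Hypothetical feedback vectors standing in for real model output.
feedback = {
    "slow checkout": [0.90, 0.10],
    "payment page lags": [0.85, 0.20],
    "love the new design": [0.10, 0.95],
    "great redesign": [0.15, 0.90],
}

themes = greedy_cluster(feedback)
```

The result depends on insertion order and on the threshold, so treat this as a way to see the idea, not as a recommendation for production clustering.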

Anomaly detection. If most of your data points cluster together but one sits far away, that outlier might be worth investigating. Fraud detection, content moderation, and quality control all use this pattern.
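
One simple way to express "sits far away" in code is to measure each vector's cosine distance from the centroid of the whole set. This is only a sketch: the vectors and the 0.5 cutoff are assumptions, and real systems often use more robust methods.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def find_outliers(vectors, max_distance=0.5):
    """Return ids whose vector is farther than max_distance from the centroid."""
    center = centroid(list(vectors.values()))
    return [
        item_id
        for item_id, vec in vectors.items()
        if cosine_distance(vec, center) > max_distance
    ]


# Hypothetical vectors: three similar items and one that points elsewhere.
items = {
    "a": [1.00, 0.00],
    "b": [0.90, 0.10],
    "c": [0.95, 0.05],
    "d": [0.00, 1.00],
}

outliers = find_outliers(items)
```

Note that the outlier itself pulls the centroid toward it, so with very small datasets the cutoff needs some care.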

Recommendation. "If you liked this article, here are similar ones." Embed the articles, find the nearest neighbors to the one the user just read. This can complement the collaborative filtering you may already have in place.
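
The nearest-neighbor lookup behind that kind of recommendation can be sketched in a few lines. The article slugs and vectors below are hypothetical stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(query_id, vectors, k=2):
    """Return the k ids most similar to query_id, best match first."""
    query = vectors[query_id]
    scored = [
        (cosine_similarity(query, vec), other_id)
        for other_id, vec in vectors.items()
        if other_id != query_id  # never recommend the article just read
    ]
    scored.sort(reverse=True)
    return [other_id for _, other_id in scored[:k]]


# Hypothetical article vectors.
articles = {
    "intro-to-embeddings": [0.9, 0.1, 0.1],
    "vector-databases": [0.8, 0.3, 0.1],
    "rag-basics": [0.7, 0.2, 0.4],
    "sourdough-recipe": [0.0, 0.1, 0.9],
}

recommended = nearest_neighbors("intro-to-embeddings", articles)
```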

The part that changes how you think about code

Here is where this gets practical for anyone who writes software or works with data.

Anywhere you have fuzzy matching logic in code, embeddings might be a better solution. I mean the kind of code where you are trying to determine if two strings are "close enough" to be considered the same thing.

Think about:

A customer types "NYC" and you need to match it to "New York City" in your database
Searching product descriptions when the user's query does not match your exact product names
Matching job postings to resumes when the terminology differs between industries
Finding related articles when titles and tags do not overlap

Traditional approaches use Levenshtein distance, regex patterns, synonym lists, or elaborate normalization pipelines. They work until they do not. Every edge case requires another rule. The rule list grows. Maintenance becomes painful.

Embeddings can often match or beat those results with far less code: embed both strings, compute cosine similarity, threshold at a score you choose. The matching is based on meaning, not character patterns. "NYC" and "New York City" are close. "I need to fix a bug in my Python code" and "there is an error in my script" are close. No lookup table required.
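
The embed, compare, threshold recipe looks roughly like this. The lookup table of vectors is a hand-written stand-in for a real embedding model call, and the 0.8 threshold is an arbitrary starting point you would tune on your own data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_match(vec_a, vec_b, threshold=0.8):
    """Treat two embedded strings as 'the same thing' above the threshold."""
    return cosine_similarity(vec_a, vec_b) >= threshold


# Hypothetical 3-dimensional vectors. A real model would return these for
# each string, with far more dimensions.
FAKE_EMBEDDINGS = {
    "NYC": [0.90, 0.10, 0.00],
    "New York City": [0.85, 0.15, 0.05],
    "Los Angeles": [0.10, 0.90, 0.10],
}

print(is_match(FAKE_EMBEDDINGS["NYC"], FAKE_EMBEDDINGS["New York City"]))  # True
print(is_match(FAKE_EMBEDDINGS["NYC"], FAKE_EMBEDDINGS["Los Angeles"]))    # False
```

The whole matching rule is one function and one number. Compare that with a synonym list that has to know about every abbreviation in advance.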

This is not hypothetical for me. I replaced keyword-based search in my own memory system with embedding-based search and the improvement was immediate. Queries that returned nothing before started finding exactly the right entries.

Embedding models are not language models

This is a distinction worth understanding. When you use ChatGPT or Claude, you are using a language model. It generates text, reasons through problems, and holds conversations.

An embedding model does one thing: it converts text into a vector. It does not generate text or have conversations. It is a different kind of model, trained specifically to produce useful numerical representations of meaning.

You can use embedding models from OpenAI, Google, Voyage AI, Cohere, and others. Some are general purpose. Some are optimized for specific domains like code, legal documents, or financial text. The choice of model matters because different models capture different nuances. A model trained heavily on code will usually produce better embeddings for code search than a general-purpose model.

The cost is also dramatically different from language models. Embedding a million tokens of text might cost a few cents. Generating a million tokens of text with a language model costs dollars. Embeddings are cheap to produce and cheap to store.
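
The arithmetic is easy to run for your own workload. The two prices below are placeholders I picked to match "a few cents" and "dollars"; they are not any provider's actual rates, so substitute current pricing before relying on the result.

```python
def cost_usd(tokens, price_per_million_usd):
    """Cost of processing `tokens` at a given per-million-token price."""
    return tokens / 1_000_000 * price_per_million_usd


# Placeholder prices, NOT real quotes from any provider.
EMBEDDING_PRICE_PER_MILLION = 0.02
GENERATION_PRICE_PER_MILLION = 10.00

corpus_tokens = 5_000_000  # hypothetical corpus size

embedding_cost = cost_usd(corpus_tokens, EMBEDDING_PRICE_PER_MILLION)
generation_cost = cost_usd(corpus_tokens, GENERATION_PRICE_PER_MILLION)
ratio = generation_cost / embedding_cost
```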

The practical tradeoffs

Embeddings are not magic. A few things worth knowing before you reach for them:

You need to embed everything upfront. Before you can search your data semantically, every piece of text needs to be converted to a vector and stored. For a small dataset, this is trivial. For millions of documents, it takes planning.

Embedding quality depends on the model. A model that was not trained on your domain might produce mediocre representations of your specific terminology. If you work in a specialized field, test a few models before committing.

Vectors are opaque. You cannot look at a vector and understand what it means. If the similarity score is wrong, debugging is harder than with keyword search. You cannot just add a synonym to fix it.

Context length matters. Most embedding models have a maximum input length. If you need to embed a 50-page document, you will need to chunk it into smaller pieces first. How you chunk affects quality. This is where the nuance lives in production systems.
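
A common starting point for chunking is fixed-size windows with some overlap, so a sentence that straddles a boundary still appears whole in at least one chunk. The sketch below counts words for simplicity. Real models measure input in tokens, and the default sizes here are arbitrary assumptions.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into word windows of max_words, sharing `overlap` words."""
    if max_words <= 0 or overlap < 0 or overlap >= max_words:
        raise ValueError("need max_words > overlap >= 0")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Tiny example: ten "words", windows of four, one word of overlap.
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, max_words=4, overlap=1)
```

Each chunk is then embedded and stored separately, usually with a pointer back to its source document. Splitting on paragraphs or headings instead of fixed windows is often worth testing, since it keeps related sentences together.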

Where this leads

Tomorrow: the question that ties this all together. You have a million-token context window. You have embeddings that let you search semantically. When should you load the whole book into the context, and when should you retrieve just the relevant pieces? That is the RAG decision, and it is one of the most important architectural choices in AI applications right now.