What AI Native Infrastructure Looks Like in Practice

DEV Community

Retrieval and re-ranking is the search step itself, followed by a second pass that re-scores the top candidates for relevance before passing them to the model. Bi-encoder models (the ones that power standard vector search) optimize for broad recall. Cross-encoder re-rankers optimize for precision among the top results. Running both — retrieve broadly with the bi-encoder, then re-rank the top 20 results with a cross-encoder before selecting the final context — produces meaningfully better retrieval quality than either approach alone, at a latency cost that is usually under 50 milliseconds.
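
The two-stage pattern can be sketched in a few lines. This is a minimal illustration, not a production retriever: `retrieve_and_rerank`, `token_overlap`, and `phrase_match` are hypothetical names, and the two toy scorers stand in for a real bi-encoder similarity and a real cross-encoder model.

```python
from typing import Callable, Sequence


def retrieve_and_rerank(
    query: str,
    docs: Sequence[str],
    recall_score: Callable[[str, str], float],
    precision_score: Callable[[str, str], float],
    recall_k: int = 20,
    final_k: int = 5,
) -> list[str]:
    """Two-stage retrieval: cheap broad recall, then a costlier re-rank of the top candidates."""
    # Stage 1: score every document with the cheap scorer and keep the top recall_k.
    candidates = sorted(docs, key=lambda d: recall_score(query, d), reverse=True)[:recall_k]
    # Stage 2: re-score only those candidates with the expensive scorer.
    reranked = sorted(candidates, key=lambda d: precision_score(query, d), reverse=True)
    return reranked[:final_k]


# Toy stand-ins (assumptions): a real system would use vector similarity from a
# bi-encoder for stage 1 and a cross-encoder model for stage 2.
def token_overlap(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def phrase_match(query: str, doc: str) -> float:
    # Rewards documents that contain the query as an exact phrase.
    return token_overlap(query, doc) + (1.0 if query.lower() in doc.lower() else 0.0)
```

The expensive scorer only ever sees `recall_k` documents, which is what keeps the added latency bounded regardless of corpus size.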

Context assembly is the final step before the prompt. Which chunks to include, in what order, how to handle redundancy across chunks, whether to add metadata like document date or source type: these decisions shape what the model sees. Models tend to perform better when the most relevant context appears at the beginning or end of the context window, not buried in the middle. Position matters more than engineers typically expect.
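
A rough sketch of those assembly decisions follows. The `Chunk` fields and the `assemble_context` helper are assumptions made for illustration, and the deduplication only catches exact repeats after normalising case and whitespace; near-duplicate detection would need a similarity check on top.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    score: float  # relevance score, e.g. from the re-ranker
    source: str   # source type, e.g. "wiki" or "ticket"
    date: str     # document date as an ISO string


def assemble_context(chunks: list[Chunk], max_chunks: int = 4) -> str:
    """Drop duplicate chunks, order by relevance, and prefix each chunk with metadata."""
    seen: set[str] = set()
    unique: list[Chunk] = []
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        key = " ".join(chunk.text.lower().split())  # normalise case and whitespace
        if key in seen:
            continue  # duplicate text; the higher-scored copy is already kept
        seen.add(key)
        unique.append(chunk)
    selected = unique[:max_chunks]
    # Most relevant chunk first, so it sits at the start of the context window.
    return "\n\n".join(f"[{c.source} | {c.date}]\n{c.text}" for c in selected)
```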

Layer Three: Context Management

This is where most teams discover that they had an implicit assumption they never examined.

They assumed context would stay small enough not to matter.

Context management is the layer that tracks what the model needs to know within a session, across sessions, and at the system level, and makes deliberate choices about what to include, what to compress, and what to discard. It sounds simple. In practice, it is the layer that silently determines whether the system feels coherent or amnesiac, expensive or cost-efficient.

The clearest failure mode is context stuffing: including everything the system might need, in full, on every request, because it is easier than deciding what to exclude. At low traffic volumes this is fine. At scale, the token cost compounds fast, latency climbs as the context window fills, and the model's attention degrades on long-context inputs. An enterprise application routing ten thousand requests per hour through a 128K context window, when 60K of that context is the same static background information repeated verbatim on every call, is not a data architecture problem — it is an engineering decision that has simply not been made yet.
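
The arithmetic behind that example is worth making explicit. The request rate and the 60K static tokens come from the scenario above; the price of 1 USD per million input tokens is a placeholder assumption, not a quote from any provider, and `redundant_token_cost` is a hypothetical helper.

```python
def redundant_token_cost(
    requests_per_hour: int,
    static_tokens: int,
    usd_per_million_tokens: float,
) -> tuple[int, float]:
    """Tokens and dollars spent per hour re-sending the same static context."""
    tokens = requests_per_hour * static_tokens
    cost = tokens / 1_000_000 * usd_per_million_tokens
    return tokens, cost


# Assumed price: 1 USD per million input tokens (placeholder, not a real rate).
tokens, cost = redundant_token_cost(10_000, 60_000, usd_per_million_tokens=1.0)
print(f"{tokens:,} repeated tokens/hour, about ${cost:,.2f}/hour")
# prints: 600,000,000 repeated tokens/hour, about $600.00/hour
```

Substitute the actual input price for the model in use; the token count of 600 million per hour holds regardless.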

Effective context management has three components. A session layer tracks the immediate conversation and recent user actions, kept compact, summarized aggressively after the first few turns rather than appended indefinitely. A memory layer handles what the system should retain across sessions — user preferences, prior decisions, domain-specific facts about this user's context — stored as structured records, not as raw conversation history. And a system layer manages the baseline context that every request needs: the product's core knowledge, current configuration, and any real-time state the model should be aware of.
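
One way to picture the three components is as a small data structure. Everything here is an illustrative assumption: the class names, the `max_turns` cutoff, and especially the summariser, which simply truncates old turns where a real system would call a model.

```python
from dataclasses import dataclass, field


@dataclass
class SessionLayer:
    summary: str = ""
    recent_turns: list[str] = field(default_factory=list)
    max_turns: int = 3

    def add_turn(self, turn: str) -> None:
        self.recent_turns.append(turn)
        while len(self.recent_turns) > self.max_turns:
            oldest = self.recent_turns.pop(0)
            # Placeholder summariser (assumption): keep the first 40 characters.
            self.summary = (self.summary + " " + oldest[:40]).strip()


@dataclass
class ContextManager:
    system_facts: list[str]                                 # system layer
    memory: dict[str, str] = field(default_factory=dict)    # memory layer: structured records
    session: SessionLayer = field(default_factory=SessionLayer)

    def build(self) -> str:
        """Render the context in a fixed order: system, memory, summary, recent turns."""
        parts = ["SYSTEM: " + "; ".join(self.system_facts)]
        if self.memory:
            parts.append("MEMORY: " + "; ".join(f"{k}={v}" for k, v in sorted(self.memory.items())))
        if self.session.summary:
            parts.append("SUMMARY: " + self.session.summary)
        parts.extend("TURN: " + t for t in self.session.recent_turns)
        return "\n".join(parts)
```

The session never grows past `max_turns` raw turns, and memory is stored as key-value records instead of raw conversation history, mirroring the split described above.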

The goal is not minimalism for its own sake. It is precision. The right context, fresh, in the right position, without padding.

Layer Four: The Eval Framework

Everything built so far produces outputs that cannot be verified with a simple pass/fail unit test. The model might return a factually correct response in the wrong format. It might answer the literal question while missing the user's actual intent. It might perform well on the examples in your test suite and drift on the long tail of real queries that you have not seen yet.

Eval infrastructure is what makes AI Native systems improvable, rather than just deployed.

The production pattern that engineering teams are converging on in 2026 uses two tools with a clear division of labor. A lightweight open-source framework handles CI/CD gating at the PR level: DeepEval is the closest thing the LLM eval world has to pytest, running assertion-style tests against model outputs on every code change. RAGAS handles retrieval-specific metrics — context precision, answer faithfulness, answer relevance — for RAG-heavy systems. These run in the pipeline, automatically, before any change ships.
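
To show the shape of an assertion-style eval without depending on any framework, here is a plain-Python sketch. It does not use the DeepEval or RAGAS APIs. `fake_model`, `check_format`, and `check_grounded` are hypothetical, and the word-overlap score is only a crude proxy for the faithfulness metrics those tools provide.

```python
import json


def fake_model(question: str) -> str:
    # Stand-in for a real model call (assumption), so the sketch runs offline.
    return json.dumps(
        {"answer": "Refunds are processed within 5 business days.", "source": "refund-policy"}
    )


def check_format(output: str) -> bool:
    """The output must be a JSON object with 'answer' and 'source' keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and {"answer", "source"}.issubset(data)


def check_grounded(output: str, context: str) -> float:
    """Crude faithfulness proxy: fraction of answer words that appear in the context."""
    words = json.loads(output)["answer"].lower().rstrip(".").split()
    ctx = set(context.lower().replace(".", "").split())
    return sum(w in ctx for w in words) / len(words)


def test_refund_answer() -> None:
    context = "Policy: refunds are processed within 5 business days of approval."
    output = fake_model("How long do refunds take?")
    assert check_format(output)                     # right format
    assert check_grounded(output, context) >= 0.8   # threshold is an assumed value
```

A test like this can run under pytest on every change; the threshold is the part that has to be decided per use case.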

A second tier handles production monitoring and regression tracking: Braintrust for dataset-first prompt regression workflows with human annotation, or Arize Phoenix for teams that need production observability alongside evaluation. The two tiers run together. Unit-level evals catch regressions before deployment. Production evals catch drift after it.

The discipline that separates teams who use evals from teams who have eval infrastructure is this: the metrics are defined before the system is built, not after. What does "correct" mean for this use case? What does "faithful" mean? What does "hallucinated" mean, specifically, for this domain? These are design questions, not measurement questions. Teams that get this right start their architecture work at the eval layer. Teams that get this wrong discover they cannot measure progress at the point when it matters most.

Layer Five: The Gateway

The LLM gateway is the layer that most teams add last. It should be among the first decisions made.

A gateway sits between your application and every model provider. It handles routing, cost controls, caching, failover, and observability — functions that are not optional at any meaningful production scale, but that most teams implement as ad hoc logic scattered across application code until a provider outage or a cost spike forces the issue.

At scale, the case is not theoretical. Teams running production AI workloads that skip this layer see token spend compound 30 to 40 percent faster than necessary from redundant identical requests that a semantic cache would have served without an inference call. They carry outsized operational risk during provider outages that proper failover configuration would absorb. They cannot attribute costs to teams or features because there is no central point of control.
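
The caching idea can be sketched as follows. This is an assumption-laden toy: `SemanticCache` and `toy_embed` are hypothetical names, the bag-of-letters vector stands in for a real embedding model, the 0.95 threshold is arbitrary, and the linear scan would be replaced by a vector index in practice.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], list[float]], threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold  # assumed similarity cutoff
        self.entries: list[tuple[list[float], str]] = []
        self.hits = 0
        self.misses = 0

    def get_or_call(self, prompt: str, call_model: Callable[[str], str]) -> str:
        vec = self.embed(prompt)
        # Linear scan over cached prompts; serve the first sufficiently similar one.
        for cached_vec, cached_response in self.entries:
            if cosine(vec, cached_vec) >= self.threshold:
                self.hits += 1
                return cached_response
        self.misses += 1
        response = call_model(prompt)  # only reached on a cache miss
        self.entries.append((vec, response))
        return response


def toy_embed(text: str) -> list[float]:
    # Bag-of-letters vector (assumption): a stand-in for a real embedding model.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec
```

Every hit is an inference call that never happens, which is where the avoidable spend described above comes from.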

Bifrost, an open-source gateway built in Go, handles 5,000 requests per second at 11 microseconds of overhead — low enough that it adds no perceptible latency to the inference call. LiteLLM is the most widely deployed open-source option for teams that want a Python-native solution with broad provider coverage. Cloudflare AI Gateway is the lowest-friction managed option for teams that want zero infrastructure maintenance. Kong AI Gateway integrates into existing API management infrastructure for enterprise environments already running Kong.

The right choice between them matters less than the decision to have one. Without a gateway, every team inevitably rebuilds fragments of it at the application layer: manual retry logic, cost tracking spreadsheets, per-feature model selection buried in function calls. The gateway consolidates that logic into a single, auditable layer. When a provider goes down at 2am, the failover runs automatically. When a new model releases and you want to test it on five percent of traffic, you change one configuration line.
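
A stripped-down sketch of the routing and failover behaviour is shown below. It is not the API of Bifrost, LiteLLM, or any other gateway named here; `Gateway`, `ProviderError`, and the route format are hypothetical, and providers are plain callables so the example runs offline.

```python
import random
from typing import Callable, Optional


class ProviderError(Exception):
    """Raised by a provider callable to signal an outage or failed request."""


class Gateway:
    def __init__(
        self,
        routes: list[tuple[str, float]],                 # (provider name, traffic weight)
        providers: dict[str, Callable[[str], str]],      # name -> callable(prompt) -> response
        rng: Optional[random.Random] = None,
    ):
        self.routes = routes
        self.providers = providers
        self.rng = rng or random.Random()

    def complete(self, prompt: str) -> tuple[str, str]:
        """Return (provider name, response), failing over if the chosen provider errors."""
        names = [name for name, _ in self.routes]
        weights = [weight for _, weight in self.routes]
        first = self.rng.choices(names, weights=weights, k=1)[0]
        # Try the weighted pick first, then the remaining providers in listed order.
        order = [first] + [name for name in names if name != first]
        for name in order:
            try:
                return name, self.providers[name](prompt)
            except ProviderError:
                continue  # this provider is down; try the next one
        raise ProviderError("all providers failed")
```

In this sketch the five percent experiment is the single `routes` entry `("new-model", 0.05)`, and failover needs no application code beyond raising `ProviderError`.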

The Right Build Order

The mistake is not in the individual layer choices. Most teams are thoughtful about which embedding store they pick, which eval framework they try. The mistake is in the order.

Teams that start with the model end up retrofitting the infrastructure around a system that was already making assumptions about what the data layer would eventually provide. The embedding store gets added to support a retrieval pattern that the prompt design has already locked in. The eval framework gets added when the system is already live and there is no baseline to regress against. The gateway gets added when the first cost spike arrives.

Teams that start with the data layer make different decisions. They define what "good retrieval" means before they write a prompt. They choose their embedding store based on the query patterns their system will actually need to support. They design the context management strategy before they know how often it will need to run.

The model sits at the top of this stack, not the bottom. It is the most visible layer. It is the layer that produces the output the user sees. But it is the last thing to configure, not the first.

Starting with the model is like designing a building by choosing the facade material before you know the load-bearing structure. The facade is what people will look at. The structure is what holds it up.

Build the structure first.