context engineering, and it is a true software and system architecture discipline that replaces prompt writing.
As someone with over 20 years in system and backend development, I can tell you: nearly all the problems we encounter when integrating AI into enterprise systems stem not from "wrong prompts," but from "dirty and uncontrolled context." The model processes whatever you put in front of it. If you indiscriminately dump the entire database into the context, you inflate the bill and confuse the model, which leads to incorrect decisions (hallucinations).
Why is prompt engineering becoming insufficient?
Prompt engineering was a temporary solution that worked when models had small context windows and their reasoning abilities were in their infancy. Back then, guiding the model with templates like "Think step-by-step" or "You are a financial expert" made a difference because these words were the only guidance the model had. However, as models evolved and context windows reached hundreds of thousands of tokens, the impact of these word games significantly diminished.
The biggest problem we face today is not the prompt itself, but the "lost in the middle" phenomenon that comes with larger context windows. Research and our field tests point the same way: LLMs tend to attend most to information at the beginning and end of the context they are given, and they often overlook the large pile of data in the middle. If you write a fancy prompt and then indiscriminately paste 50 pages of documentation below it, there's a very high chance the model will miss a critical business rule written somewhere in the middle.
⚠️ Lost in the Middle Effect
Even as context windows and processing capacity grow, the model's attention is not distributed evenly across the context. Always place the most critical rules and instructions at the very beginning or very end of the context.
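A minimal sketch of that placement rule, assuming a plain-text prompt assembled from three inputs. The function name `assemble_prompt` and the section labels are illustrative, not part of any library.

```python
def assemble_prompt(critical_rules: list[str], reference_docs: list[str], question: str) -> str:
    """Put critical rules first, bulk material in the middle, and a reminder plus the question last."""
    rules = "\n".join(f"- {rule}" for rule in critical_rules)
    docs = "\n\n".join(reference_docs)
    return (
        f"# CRITICAL RULES\n{rules}\n\n"
        f"# REFERENCE MATERIAL\n{docs}\n\n"
        f"# REMINDER: CRITICAL RULES\n{rules}\n\n"
        f"# QUESTION\n{question}"
    )


prompt = assemble_prompt(
    critical_rules=["Never approve refunds over 500 EUR."],
    reference_docs=["Doc page 1", "Doc page 2"],
    question="Can I refund order 17?",
)
print(prompt)
```

Repeating the rules costs a few extra tokens, which is usually cheap compared with the bulk material sitting between the two copies.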
What is context engineering and what does it aim for?
Context engineering is the process of filtering, structuring, prioritizing, and packaging the data to be sent to the model in the most optimized way. Its goal is to provide the model with only the most refined information necessary to complete the current task, thereby reducing latency, lowering costs, and maximizing output quality. This is not an art of words, but a design of data pipelines.
When designing a good context, we don't leave the data in its raw form. We clean rows from databases, logs, or documents, removing unnecessary noise (boilerplate code, repetitive headers, redundant metadata fields). Then, we convert this data into Markdown or structured JSON format, which the model can parse most quickly and accurately.
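As a rough illustration of that cleaning step, the sketch below drops noise fields and empty values from database rows and renders the rest as a compact Markdown list. The field names in `NOISE_FIELDS` and both function names are hypothetical; a real pipeline would derive the noise list from its own schema.

```python
# Hypothetical metadata fields that carry no value for the model
NOISE_FIELDS = {"_etag", "created_by", "updated_by", "row_version"}


def clean_record(record: dict) -> dict:
    """Drop noise fields and empty values; keep everything else in original order."""
    return {
        key: value
        for key, value in record.items()
        if key not in NOISE_FIELDS and value not in (None, "", [])
    }


def to_markdown(title: str, records: list[dict]) -> str:
    """Render cleaned records as a Markdown section, one bullet per record."""
    lines = [f"## {title}"]
    for record in records:
        cleaned = clean_record(record)
        lines.append("- " + ", ".join(f"{key}: {value}" for key, value in cleaned.items()))
    return "\n".join(lines)


rows = [{"order_id": 17, "status": "shipped", "_etag": "abc", "note": ""}]
print(to_markdown("Orders", rows))
```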
How is context designed in RAG architectures?
The biggest mistake made when building Retrieval-Augmented Generation (RAG) systems is directly feeding the first 5-10 results from the vector database to the model. If you rely solely on cosine similarity to create context, you might overwhelm the model with repetitive or completely irrelevant data. This increases token costs and confuses the model.
When designing context in a RAG architecture, we must follow these steps:
- Semantic Chunking: Splitting documents only by character limits (e.g., every 1000 characters) breaks semantic integrity. Instead, we should use intelligent chunking strategies that follow paragraphs, Markdown headings, or code blocks.
- Metadata Enrichment: Each data chunk should be tagged with information such as which document it belongs to, its creation date, and its authorization level. The model should be able to read the context of the information presented to it from these tags.
- Re-ranking: The results returned from the vector database should be passed through a reranker model (e.g., Cohere or BAAI reranker) that optimizes keyword and semantic alignment, selecting the top 3 most relevant results.
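The three steps above can be sketched in plain Python. This is a toy under stated assumptions: chunks are split on Markdown headings, metadata is limited to source and heading, and `rerank` scores by simple term overlap as a stand-in for a real reranker model such as the ones named above. All names here are illustrative.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str   # metadata: which document the chunk came from
    heading: str  # metadata: the section it belongs to


def chunk_by_heading(markdown: str, source: str) -> list[Chunk]:
    """Split on Markdown headings so each chunk keeps one section intact."""
    chunks: list[Chunk] = []
    heading, buffer = "", []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6} ", line):
            if any(part.strip() for part in buffer):
                chunks.append(Chunk("\n".join(buffer).strip(), source, heading))
            heading, buffer = line.lstrip("#").strip(), []
        else:
            buffer.append(line)
    if any(part.strip() for part in buffer):
        chunks.append(Chunk("\n".join(buffer).strip(), source, heading))
    return chunks


def rerank(query: str, candidates: list[Chunk], top_k: int = 3) -> list[Chunk]:
    """Toy reranker: drop exact duplicates, score by query-term overlap, keep top_k."""
    terms = set(query.lower().split())
    seen, unique = set(), []
    for chunk in candidates:
        if chunk.text not in seen:
            seen.add(chunk.text)
            unique.append(chunk)
    scored = sorted(unique, key=lambda c: len(terms & set(c.text.lower().split())), reverse=True)
    return scored[:top_k]


doc = "# Returns\nRefunds take 5 days.\n# Shipping\nShipping is free over 50 EUR.\n"
chunks = chunk_by_heading(doc, source="policy.md")
best = rerank("how long do refunds take", chunks + chunks, top_k=1)
print(best[0].heading, "|", best[0].text)
```

In production the overlap score would be replaced by a call to the reranker model, but the surrounding shape (deduplicate, score, truncate to a small top-k) stays the same.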
How to manage token economics and context window limits?
A larger context window does not mean we can fill it freely: token costs grow linearly with input size, and latency grows with it as well (the compute for standard self-attention can grow quadratically with sequence length). Sending 100,000 tokens with every request in a production system will quickly drain your wallet and ruin the user experience by inflating TTFT (Time to First Token). Therefore, dynamically managing context size is a critical system engineering task.
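To make the cost side concrete, here is a back-of-the-envelope calculation. The price of 3.0 currency units per million input tokens and the volume of 1,000 requests per day are made-up figures for illustration only; substitute your provider's actual pricing and your own traffic.

```python
def monthly_input_cost(
    tokens_per_request: int,
    requests_per_day: int,
    price_per_million_tokens: float,
    days: int = 30,
) -> float:
    """Input-token cost for one month, assuming a flat per-token price."""
    total_tokens = tokens_per_request * requests_per_day * days
    return total_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical price and traffic, chosen only to show the ratio
full_context = monthly_input_cost(100_000, 1_000, 3.0)
trimmed_context = monthly_input_cost(4_000, 1_000, 3.0)
print(full_context, trimmed_context)
```

Because the cost is linear in input tokens, trimming the context from 100,000 to 4,000 tokens cuts this line item by the same factor of 25, whatever the actual price is.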
To manage this, we use "Prompt Caching" mechanisms. If the system instructions and fixed documents we send to the model do not change between requests, we cache them with API providers that support it (e.g., Anthropic or OpenAI) to gain significant cost and speed advantages in subsequent requests. Additionally, when storing user history, we should use a "sliding window" approach, keeping only the last N messages in the context and condensing older messages into a single summary block.
# Simple sliding window and summarization logic for context optimization
def count_tokens(text):
    # Placeholder estimate (assumption): swap in your provider's tokenizer
    return len(text.split())

def build_context(user_history, current_query, max_tokens=4000):
    context = []
    current_tokens = count_tokens(current_query)
    # Prioritize the latest messages by iterating backward
    for message in reversed(user_history):
        msg_tokens = count_tokens(message["content"])
        if current_tokens + msg_tokens < max_tokens:
            context.insert(0, message)
            current_tokens += msg_tokens
        else:
            # Older messages exceed the budget: replace them with a summary
            # (placeholder text here; call a summary service in practice)
            context.insert(0, {"role": "system", "content": "[Summary of old conversations...]"})
            break
    context.append({"role": "user", "content": current_query})
    return context
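For the prompt caching side, the sketch below only builds a request payload and makes no network call. It assumes the shape Anthropic documents for its Messages API, where a `cache_control` marker of type `ephemeral` on a system block asks the provider to cache the prefix up to that block. Treat the field names as something to verify against the current provider documentation; the function name and the `max_tokens` value are illustrative, and the model identifier is left to the caller.

```python
def build_cached_request(
    system_instructions: str,
    fixed_docs: str,
    messages: list[dict],
    model: str,
) -> dict:
    """Build a request body whose stable prefix (instructions + fixed docs) is marked cacheable."""
    return {
        "model": model,  # supplied by the caller; no model name is assumed here
        "max_tokens": 1024,  # illustrative value
        "system": [
            {"type": "text", "text": system_instructions},
            {
                "type": "text",
                "text": fixed_docs,
                # Assumption: marker format as documented by the provider
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # The changing part (sliding-window history + current query) goes last
        "messages": messages,
    }
```

The design point is ordering: everything that never changes goes first so it can be cached as a prefix, and the per-request history comes after it.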
How should state and memory management be handled in an LLM agent architecture?
When designing autonomous agents, the agent needs to remember its past actions and their outcomes. However, if this "memory" grows uncontrollably, the agent will eventually get lost in its own loops. Memory management is one of the most complex areas of context engineering.
In agent architectures, we store state data in fast and persistent data stores like Redis or PostgreSQL. Instead of sending the entire history to the agent at each step, we design a minimalist JSON object representing the agent's current "state." For example, if we are designing an e-commerce return agent, the agent's context should contain only a clean state object like this, rather than the entire chat history:
{"current_step":"verify_invoice","invoice_id":"INV-2026-0042","verification_status":"pending_user_signature","attempts":2}
This structured data allows the agent to focus directly on business logic without getting confused about what to do next.
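One way to keep such a state object disciplined is to give it a fixed schema and serialize it on every step. The sketch below is an assumption-laden illustration: `ReturnAgentState`, `InMemoryStateStore`, and `state_as_context` are hypothetical names, and the in-memory dictionary merely stands in for Redis or PostgreSQL.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ReturnAgentState:
    current_step: str
    invoice_id: str
    verification_status: str
    attempts: int = 0


class InMemoryStateStore:
    """Stand-in for a real store such as Redis or PostgreSQL; holds serialized state per session."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def save(self, session_id: str, state: ReturnAgentState) -> None:
        self._data[session_id] = json.dumps(asdict(state))

    def load(self, session_id: str) -> ReturnAgentState:
        return ReturnAgentState(**json.loads(self._data[session_id]))


def state_as_context(state: ReturnAgentState) -> str:
    """Compact JSON that is sent to the model instead of the full chat history."""
    return json.dumps(asdict(state), separators=(",", ":"))


store = InMemoryStateStore()
store.save("session-1", ReturnAgentState("verify_invoice", "INV-2026-0042", "pending_user_signature", 1))

# One agent step: load, update, persist, then expose only the state to the model
state = store.load("session-1")
state.attempts += 1
store.save("session-1", state)
print(state_as_context(state))
```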
Context engineering application in a real production ERP
In a manufacturing ERP, we designed an AI assistant for operators to analyze machine error codes and automatically open work orders for maintenance teams. In our initial attempts, we sent all sensor logs and historical maintenance documents from the machine to the LLM in their raw form. The result was a complete disaster: the model constantly made incorrect fault diagnoses, and each query took seconds.
To solve the problem, we stopped classic prompt modification and redesigned the context pipeline from scratch. First, we normalized the sensor data; we only added anomalous values to the context. For maintenance documents, we indexed them by error codes and only retrieved paragraphs matching the current error code.
# ERP Context Preparation Pipeline Example
# get_active_anomalies and query_maintenance_db are project-specific helpers defined elsewhere
def prepare_operator_context(machine_id, error_code):
    # 1. Get only active and anomalous sensor data (reduce noise)
    telemetry = get_active_anomalies(machine_id)
    # 2. Filter historical maintenance records specific to the error code
    history = query_maintenance_db(error_code, limit=2)
    # 3. Combine the context in Markdown, which the model parses reliably
    context = f"""# MACHINE STATUS: {machine_id}
Active Anomalies: {telemetry}

# RELEVANT MAINTENANCE HISTORY FOR ERROR {error_code}:
{history}"""
    return context
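The two filtering ideas behind helpers like `get_active_anomalies` and `query_maintenance_db` can be sketched as follows. This is not the production implementation: the sensor names, the normal ranges, the error code `E-104`, and both function names are invented for illustration.

```python
# Hypothetical normal operating ranges per sensor: (low, high)
NORMAL_RANGES = {
    "spindle_temp_c": (20.0, 80.0),
    "vibration_mm_s": (0.0, 4.5),
}


def filter_anomalies(readings: dict[str, float], normal_ranges: dict = NORMAL_RANGES) -> dict[str, float]:
    """Keep only readings outside their normal range; sensors without a range are never flagged."""
    anomalies = {}
    for sensor, value in readings.items():
        low, high = normal_ranges.get(sensor, (float("-inf"), float("inf")))
        if not low <= value <= high:
            anomalies[sensor] = value
    return anomalies


def index_by_error_code(paragraphs: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group maintenance paragraphs by error code so lookups return only matching text."""
    index: dict[str, list[str]] = {}
    for code, text in paragraphs:
        index.setdefault(code, []).append(text)
    return index


readings = {"spindle_temp_c": 95.0, "vibration_mm_s": 2.1}
docs = index_by_error_code([
    ("E-104", "Check coolant pump and filter."),
    ("E-220", "Recalibrate axis encoder."),
])
print(filter_anomalies(readings))
print(docs.get("E-104", [])[:2])
```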
After this structural change, the model's accuracy in diagnosis significantly increased, and we reduced token consumption by roughly a third.
Conclusion
My clear stance is this: success in AI projects comes from data engineering, not wordplay. Put prompt writing aside; focus on filtering, prioritizing, and presenting data to the model in its most refined form. When you manage context correctly, even the most mediocre model can turn into a genius; when you pollute the context, even the most advanced model will only produce garbage for you.