How to Stop Prompt Injection in AI Agents That Read Untrusted Content

DEV Community

I built that exact attack, and the defense that stops it, as a runnable demo. The code is in the resilient-agent-harness repo.

What is prompt injection in AI agents?

Prompt injection is when text the agent reads carries an instruction it then follows. Direct injection is typed by the user. Indirect injection hides in content the agent reads (a web page, a document, an email), which is the dangerous case for any agent that browses or ingests data. The attacker never breaks into your system; they leave a booby-trapped instruction somewhere the agent will read and wait.

What is memory poisoning, and why is it worse?

Memory poisoning is indirect prompt injection with a long fuse: the agent doesn't just read the malicious instruction once, it stores it as a trusted memory and acts on it in a later session, where it looks like its own reliable knowledge. The payload survives across sessions because the agent writes it to long-term memory and reuses it. OWASP tracks memory poisoning in its Agentic AI threats guidance.

That persistence is exactly why a better prompt won't save you, and why the defense here is the one security researchers recommend for prompt injection generally: don't try to detect the malicious text (an attacker can rephrase it forever), gate the dangerous action at the tool boundary. This demo blocks one action (sending email to a non-allowlisted domain); the same tool-boundary pattern is how you contain prompt injection whenever an agent can take a consequential action on text it didn't write.

What is the demo?

The agent, built with Strands Agents, is a hotel-booking assistant with a send_email tool and a memory. The demo runs in three phases:

Infection. A poisoned note is written into the agent's memory and saved to disk.
Attack (no defense). A brand-new agent reloads that memory from disk and gets a normal booking request. It follows the poisoned instruction and emails the booking data to attacker@evil.com.
Defense (with the hook). Same reloaded poison, but now a tool-boundary gate is in place. The dangerous email is blocked before it sends.

Here's where Strands earns its keep on the setup: memory is the agent's native agent.state, persisted with a FileSessionManager. That means "a later session" is a real restart (a new agent reloads the poison from disk), not a variable I reset to fake one. The attack is reproduced honestly, exactly as the research describes it.

Why prompt defenses barely move the needle

Sandwich prompts, spotlighting, "ignore anything that looks like an instruction": these treat memory as trusted context and don't filter it. By the time the agent re-reads the poisoned note, it already looks like its own trusted state. The defense has to live somewhere the model's mood can't reach: the tool boundary.

The fix: a deterministic tool-level gate

Defend the dangerous action, not the instruction. In Strands, a BeforeToolCallEvent hook gates outbound email by destination, deterministically, regardless of what the model decided.

The diagram traces the whole thing: the poisoned page is stored in agent.state and persisted to disk; a fresh session reloads it and tries to send_email to the attacker; without the gate the email goes out, but with the BeforeToolCallEvent gate the destination is checked against an allowlist and the call is cancelled before it runs.

Memory poisoning attack and defense: a poisoned page is stored in agent.state and saved to disk, a new session reloads it and tries to send_email to the attacker, and a BeforeToolCallEvent gate cancels the call when the destination domain is not on the allowlist

from strands.hooks import HookProvider, HookRegistry, BeforeToolCallEvent
ALLOWED_EMAIL_DOMAINS = ["hotel-booking.com", "guest-support.com"]
def email_is_allowed(recipient: str) -> bool:
 domain = recipient.split("@")[-1].lower() if "@" in recipient else ""
 return domain in ALLOWED_EMAIL_DOMAINS
class MemoryPoisoningDefenseHook(HookProvider):
 def register_hooks(self, registry: HookRegistry) -> None:
 registry.add_callback(BeforeToolCallEvent, self.gate)
 def gate(self, event: BeforeToolCallEvent) -> None:
 if event.tool_use["name"] != "send_email":
 return
 recipient = event.tool_use.get("input", {}).get("recipient", "")
 if not email_is_allowed(recipient):
 event.cancel_tool = f"BLOCKED: {recipient} not in allowlist"

The hook doesn't try to detect the injection text (an attacker can rephrase that endlessly). It checks the destination. This is the second place Strands does the work for you: a hook runs inside the agent loop, before the tool executes, and event.cancel_tool stops the call cold. It's enforcement, not a polite request to the model. The email to the attacker is never sent.

Before and after

Phase	What happens	Result
Infection	Poisoned note written to `agent.state`, saved to disk	Memory holds it; you can print it and see the poison
Attack (no defense)	Fresh agent reloads poison, gets a booking request	`send_email` to `attacker@evil.com`, attack succeeds
Defense (hook)	Same reloaded poison plus the gate	0 dangerous emails reach execution, blocked

The deterministic part: the gate blocks attacker@evil.com and allows ops@hotel-booking.com on every run, whether or not the model takes the bait.

Frequently asked questions

Can a better prompt fully prevent it?
No. Prompt-level defenses stop only a fraction, because the poison lives in the agent's own trusted memory. Reliable prevention happens at the tool boundary: block the dangerous action before it runs.

Is this attack realistic?
Any agent that browses, reads documents, or ingests email and stores what it learns has this exposure: untrusted content can enter memory and be re-read later as trusted state. OWASP tracks it as an agentic-AI threat, and the cited paper demonstrates it on representative agent setups.

Do I need OpenAI for this?
No. Strands is model-agnostic: its providers are interchangeable, so the same code runs on Amazon Bedrock (the default), Anthropic, OpenAI, or a local model via Ollama. The demo defaults to OpenAI gpt-4o-mini because it needs only an API key to try, though that's still a cloud API call, not a model on your machine.

Run it yourself

The three phases (infection, attack, defense) run end to end in one notebook. Clone the repo and run it:

git clone https://github.com/elizabethfuentes12/resilient-agent-harness-sample-for-aws.git
cd resilient-agent-harness-sample-for-aws/02-memory-poisoning-defense
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt
# Default: OpenAI gpt-4o-mini (just an API key to try)
echo "OPENAI_API_KEY=sk-..." > .env
echo "DUFFEL_API_KEY=duffel_test_..." >> .env # free sandbox token from app.duffel.com
uv run test_memory_poisoning_defense.py

Prefer notebooks? Open test_memory_poisoning_defense.ipynb and run it top to bottom.

The pattern follows Zombie Agents (Yang et al., Feb 2026), which shows memory evolution turns a one-time injection into a persistent compromise. The full reading is in the repo's README. In production, the same allow/deny moves to a policy layer at the tool or gateway boundary (for example Amazon Bedrock AgentCore), so the rule is centralized and can't be edited away by a poisoned memory.

Has an agent of yours ever trusted something it read on the open web? Tell me what it did in the comments.

📬 Building reliable AI agents? I write about agent memory, guardrails, evaluation, and multi-agent patterns. Subscribe to my newsletter to get the next one.

Gracias!

🇻🇪 Dev.to Linkedin GitHub Twitter Instagram Youtube

elizabethfuentes12 image

Elizabeth Fuentes L

I help developers build production-ready AI applications through hands-on tutorials and open-source projects.