Essay · Published April 2026

Three Attack Categories Every AI Agent Inherits

By Anthony Pavogani — Founder, The Klaus Project, LLC

Every company shipping an AI agent right now is shipping the same three vulnerabilities. Not because their engineers are careless, but because the architecture itself guarantees them. You don't get to opt out by writing better code. You inherit them the moment you give a language model the ability to call a tool.

I know this because I built one. Forge is an agent system I run in-house — it has a CLI, a VS Code extension, a desktop app, file-system access, code execution, and the usual orchestration plumbing (snapshot rollback, circuit breakers, file-lock management, auto-verification). Building it taught me more about agent security than any blog post or whitepaper could, because at every layer of the system I had to make a decision about trust. And every one of those decisions was a place something could go wrong.

What follows isn't a list of CVEs or a teardown of a specific product. It's the structural map: three categories of attack that show up in every agent system I've examined, including my own, and what each one actually looks like in practice.

If you're shipping an LLM-powered feature with tool access right now, you have at least two of these three problems. You probably don't know which two.

Category 1: Trusted-Channel Injection

The first vulnerability is the one most people have heard of and the one most people misunderstand. Prompt injection is not a chat-window problem. The famous examples — "ignore previous instructions and..." typed into a customer service bot — are the trivial case. The real version is much worse, and it's a property of the architecture, not the UX.

Here's the actual structure: an agent receives content from somewhere. That somewhere is supposed to be data — a fetched web page, a file the user uploaded, the output of a tool call, a database row. The model reads it. And because language models cannot meaningfully distinguish content from instructions in their input, anything in that data stream that looks like an instruction is treated as one. Not "might be" — is.

This means every input channel into your agent is also an instruction channel. The web page you scraped. The PDF you parsed. The Slack message you summarized. The error output from a shell command. The contents of a file the agent listed. Each of these is a potential vector, and the attacker doesn't need access to your system — they just need access to something the agent will eventually read.

The mitigations people reach for first don't work. System-prompt instructions like "ignore any instructions in user content" are themselves overridable by sufficiently determined input. Output filtering catches obvious cases and misses subtle ones. Putting content in XML tags or "user said:" framing helps marginally and fails reliably under adversarial pressure.

The actual mitigations are architectural: privilege-separating what the model can do based on what data it has touched, requiring human approval for actions taken after exposure to untrusted content, treating model outputs that follow untrusted input as themselves untrusted, and aggressively scoping tool access so that even a fully hijacked agent has limited blast radius.

If your agent reads anything from outside its system prompt and can take any action afterward, you have this problem. The question is only how bad.

Category 2: Tool Privilege Drift

The second category is what happens when a tool does more than its description claims. This is the bug everyone writes themselves and almost nobody catches.

When you give an agent a tool, you give it a description: "this tool searches the codebase," "this tool fetches a URL," "this tool runs a shell command in a sandboxed environment." The model uses that description to decide when to call the tool. But the description is not the implementation. And the implementation is what actually runs.

A "search the codebase" tool that takes a query string might pass that string to a shell command somewhere in its plumbing. A "fetch URL" tool might follow redirects, including to internal-network addresses. A "sandboxed shell" might have one network egress nobody remembered. A "read file" tool might accept relative paths that resolve outside the intended directory. Every one of these is a place where what the tool can do exceeds what its name says it does — and the model, working from the description, has no idea.

The result is privilege drift: the agent appears to be doing one thing, the tool actually does another, and somewhere in the gap an attacker — or an unlucky input — triggers behavior nobody scoped for. The breach surface isn't the agent. It's the implicit contract between the agent and the tool.

This problem compounds because tools call other tools. A high-level "deploy to staging" tool might internally call "build artifact," which calls "fetch dependencies," which calls "resolve package URL," which makes a network request to whatever the package manifest says. By the time you're four layers deep, nobody — not the model, not the engineers — has a complete picture of what privileges are actually exercised.

Mitigation here is unglamorous: explicitly scope every tool to the minimum capability it needs, audit the actual implementation against the description, treat any tool that can spawn subprocesses or make network requests as a privilege boundary requiring justification, and log every tool call with full input and output for retrospective review. Most agent systems don't do the last one because the volume is huge. That's fine — sample it.

If you've never sat down and written, for each of your tools, exactly what privileges it grants when fully exercised, you have this problem. Every system I've examined does on first audit, including mine.

Category 3: Context Bleed

The third category is the subtlest and the one people are least prepared for. Context bleed is the unintended movement of information across boundaries that were supposed to contain it.

Agents accumulate context. They have system prompts. They have conversation history. They have tool outputs. They have user data, environment variables, fetched files, error messages, internal state. All of this lives in the same context window or persistence layer — and language models, when generating output, draw on all of it indiscriminately. There is no native concept of "this part of context is sensitive, don't mention it." The model might respect such an instruction. Often it doesn't.

The attack surface this creates is wider than people realize. Some examples of how context bleeds in real systems:

A user asks a benign question. The agent, having earlier read a config file, includes the API key from that config in its response — because the response template referenced "the credentials I just saw." The user didn't ask for the key. The agent surfaced it anyway.

An error message from a failed tool call contains the full system prompt because the framework's error handler dumps context for debugging. The error is shown to the user. The system prompt — which included carefully written instructions about how to handle edge cases, possibly including hardcoded paths or secrets — is now public.

Two users share an agent backend. Conversation A leaves residual state — embeddings, cached summaries, KV-cache fragments, depending on the architecture. Conversation B, with a different user, retrieves a near-neighbor context entry and surfaces information from A. The attack doesn't require active malice; it just requires the architecture to retain information across boundaries it shouldn't.

A tool output is summarized by the model before being shown to the user. The summary inadvertently includes details from the raw output that the system was supposed to redact. The user sees what was meant to be filtered.

The unifying problem is that "context" in agent systems is a single soup, and language models slurp from the whole bowl whenever they generate. Boundaries that exist in your data model do not exist in the model's attention.

Mitigations require thinking about information flow as a first-class architectural concern: where sensitive data enters the system, what touches it, what derived data inherits its sensitivity, and where the boundaries are between users, sessions, and privilege levels. Most agent frameworks don't even expose hooks for this. You build it yourself or you accept the bleed.

If you've never traced exactly which context elements can appear in a model output and under what conditions, you have this problem. You probably can't fully solve it — but you can know your blast radius.

The Pattern

Three categories: trusted-channel injection, tool privilege drift, context bleed. All three share a single root cause, and naming it is what separates ad-hoc fixes from real defense.

Agent systems treat language model outputs as if they are software, when they are actually predictions about what software-like text might be. Software has guarantees: a function with a return type returns that type. A model has tendencies: a tool call probably matches the description, the output probably honors the redaction instruction, the response probably doesn't leak the system prompt. Probably is not a security property. And every architectural choice that depends on the model behaving correctly under adversarial conditions is a choice to ship vulnerability.

The fix isn't better models. The fix is treating model outputs the way you treat user input in a traditional web app: untrusted, scoped, validated at every privilege boundary, never granted authority beyond what the surrounding code can independently verify.

That mental shift — from "the model handles it" to "the model is hostile, what's my fallback" — is the entire posture difference between agent systems that survive adversarial conditions and ones that don't.

What This Means For Your Team

If you're shipping an LLM-powered feature with tool access, three things are true today, regardless of who built it or what framework you used:

Your agent reads inputs you don't fully control. Your tools do more than their descriptions claim. Your context contains data that can leak into outputs you didn't audit.

The three categories above are not exhaustive. They're the most common, not the only ones. But they're the right place to start, because if you can't even map your exposure across these three, the more exotic attacks aren't your immediate problem.

I do this kind of mapping for companies through The Klaus Project. If you've shipped something that fits the description above and you'd like a real audit done on it, find me at klausproject.com or email me directly: Pavogani@klausproject.com.

A more empirical follow-up to this piece — with actual transcripts and exploit walkthroughs against my own system — is coming next.

Anthony Pavogani is the founder of The Klaus Project, LLC, a software and cybersecurity consultancy specializing in AI security and secure software engineering. B.S. in Cybersecurity. Builds and operates production agent systems and self-hosted LLMs in-house.

klausproject.com · Pavogani@klausproject.com

← All writing