Attack Methods and Defenses in LLM-Based Agentic Systems

Why agents are a new threat model

I start from a simple observation: the moment we give a language model autonomy, meaning the right to call tools, change files, and make transactions, we create threats with no analogue in ordinary software. Production systems, from orchestration platforms to developer tooling, regularly turn out to have critical vulnerabilities. Existing surveys fixate on prompt injection and overlook memory attacks, protocol holes, multimodal threats, and tool-chain attacks. My goal is to map that whole picture in one place.

An extended attack taxonomy: seven classes

The paper's main contribution is a taxonomy organized around an expanding attack surface. I distinguish seven classes:
(1) Prompt injection: the root problem is that a model cannot reliably separate instructions from data in one text stream, and indirect injection through external data and RAG is the most dangerous variant.
(2) Memory attacks: MINJA, for instance, poisons long-term memory through ordinary queries with over 95% success, while the Zombie Agents concept achieves cross-session persistence and turns an agent into a permanent puppet.
(3) Tool and protocol attacks: three fundamental holes in the Model Context Protocol (MCP), spoofed tool descriptions, exfiltration disguised as logging.
(4) Multi-agent attacks: intercepting inter-agent messages (Agent-in-the-Middle) and exploiting inter-agent trust, where a model refuses a direct malicious command yet runs the very same payload when it arrives from a "trusted" agent.
(5) Multimodal attacks: coordinated signals planted in both image and text.
(6) Tool-chain and supply-chain attacks: each call passes its safety check on its own, yet the composition of calls compromises the system (STAC, attack success rate above 90%), up to a self-propagating "worm".
(7) Temporal attacks of the TOCTOU (time-of-check to time-of-use) kind, which exploit the race between check and use.
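Class (6) is easier to see in code. The sketch below is an illustration under my own assumptions, not the STAC method: the `ToolCall` labels, the allow-list, and both check functions are hypothetical names invented for the example. It shows a chain in which every call passes an isolated check, while a check over the whole sequence flags a sensitive read followed by an external send.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str
    reads_sensitive: bool = False  # hypothetical label: call touches private data
    sends_external: bool = False   # hypothetical label: call leaves the trust boundary


# Hypothetical allow-list used by the per-call check.
ALLOWED_TOOLS = {"read_file", "summarize", "http_post"}


def per_call_check(call: ToolCall) -> bool:
    """Approve one call in isolation: only asks whether the tool is allow-listed."""
    return call.tool in ALLOWED_TOOLS


def chain_check(calls: list[ToolCall]) -> bool:
    """Approve a sequence only if no external send happens after a sensitive read."""
    tainted = False
    for call in calls:
        if call.reads_sensitive:
            tainted = True
        if call.sends_external and tainted:
            return False
    return True


chain = [
    ToolCall("read_file", reads_sensitive=True),
    ToolCall("summarize"),
    ToolCall("http_post", sends_external=True),
]

per_call_results = [per_call_check(c) for c in chain]  # every call is allow-listed
chain_result = chain_check(chain)                      # the sequence as a whole is rejected
```

The taint flag here is deliberately coarse: once any sensitive data has been read, every later external send is blocked. A real system would track which data actually flows into which call.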

Defenses by intervention level

I organize defenses the same way, by the level at which they intervene. At the text level, LLM guardrails such as PromptArmor catch injections. At the model level, defenses analyze internal activations: ICON spots the "over-focusing" of an attacked model, and ARGUS steers activations against multimodal injections. At the tool level, the work is privilege control, with Progent supplying a DSL for policies, least privilege enforced at every step, and execution traces analyzed as dependency graphs. The protocol level adds MCP security extensions, the firewall level adds agentic firewalls that cut data leakage from 70% to near zero, and the system level brings partial formal verification and cryptographic guarantees.
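To make tool-level least privilege concrete, here is a minimal sketch of a default-deny policy check. It is not Progent's DSL; the `POLICY` table, the tool names, and `is_permitted` are assumptions made up for this example. Each tool may only touch paths under one allowed root, and any tool missing from the table is refused.

```python
import posixpath

# Hypothetical policy: tool name -> the only directory tree it may touch.
POLICY = {
    "read_file": "/workspace",
    "write_file": "/workspace/out",
}


def is_permitted(tool: str, path: str, policy: dict[str, str] = POLICY) -> bool:
    """Default-deny check, meant to run before every single tool call."""
    root = policy.get(tool)
    if root is None:
        return False  # tool is not listed, so it has no privileges at all
    # Collapse "." and ".." segments so "/workspace/../etc" cannot slip through.
    # This is purely lexical: symlinks are NOT resolved here.
    normalized = posixpath.normpath(path)
    return normalized == root or normalized.startswith(root + "/")
```

For example, `is_permitted("read_file", "/workspace/notes.txt")` is allowed, while `is_permitted("read_file", "/workspace/../etc/passwd")` and any call to an unlisted tool are refused. Because the check is lexical, a deployment that follows symlinks would need to resolve the real path first.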

Key takeaways

Two claims matter most to me. First, existing defenses against indirect injection do not hold up against an adaptive adversary, so I argue that testing defenses against an adaptive attacker should be standard practice rather than an option. Second, I propose a "security trilemma": no single approach delivers high protection, high utility, and low latency at once. To keep that from sounding abstract, I tie it to real CVEs: remote code execution in GitHub Copilot through invisible Unicode characters, a full takeover of the n8n platform (CVSS 10.0), and privilege escalation in ServiceNow through inter-agent trust. The message is that it is time to move from collecting individual attacks and patches to designing secure agentic architectures with guarantees we can formally verify. That is the core of what I work on now.
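The invisible-character vector mentioned above suggests a simple pre-filter. The sketch below is only an assumption about what such a filter could look like: I do not claim it matches the characters used in the Copilot issue, and `find_invisible` is a hypothetical helper. It flags code points in Unicode general category Cf (format characters), which includes zero-width characters, bidirectional overrides, and tag characters.

```python
import unicodedata


def find_invisible(text: str) -> list[tuple[int, str]]:
    """Return (index, code point) for every Unicode format character (category Cf)."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


# A zero-width space hidden between two visible letters.
suspicious = "a\u200bb"
hits = find_invisible(suspicious)
```

Category Cf also covers legitimate characters, such as the zero-width joiner inside some emoji sequences, so hits are better treated as a signal for review than as grounds for automatic rejection.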
