Introduction
Your multi-agent system probably has one orchestrator with access to every tool, every database, every API. If that agent gets injected, the entire toolchain is compromised. Guardrails โ string filters, classifiers, content scanners โ won't save you. Real security comes from architectural patterns that bound what an agent can do regardless of what it's convinced to attempt.
Why this matters
- Indirect prompt injection (instructions hidden in retrieved data) is increasingly the dominant attack vector.
- Guardrails are necessary but insufficient โ adversaries iterate faster than your filters.
- Compromise of an over-privileged agent escalates to the entire toolchain.
- Regulators are starting to ask about this; security is becoming a procurement question.
Core concepts
Capability isolation
Agents get only the tools they absolutely need for their role. The "research" agent has no shell, the "execute" agent has no email send. Compromise contains.
Runtime least privilege
Permissions are not just design-time configuration; they are evaluated per call against the current task context. A compromised agent can't escalate to capabilities it didn't already have authorisation for.
Cryptographic boundaries
Use signed tokens, not prompts, to authorise actions. The agent presents a token; the tool validates it cryptographically. Prompts can be social-engineered; signatures cannot.
Trust zones for context
Mark every piece of context with its trust level: system prompt (trusted), user input (semi-trusted), retrieved/tool output (untrusted). Untrusted content cannot issue trusted instructions.
Practical patterns
Privileged orchestrator + untrusted workers
The orchestrator never executes user content; it just routes. Workers execute, with no access to each other or to system tools.
Plan / execute split
A planning agent (sees no untrusted data) issues signed instructions to an execution agent (handles untrusted data, but only follows signed instructions).
Egress firewalls
Outbound network calls are filtered against an allow-list scoped to the current task. Data exfiltration becomes impossible by construction.
Capability tokens
Tools require short-lived, scope-bound capability tokens issued by a trusted authority โ not by an agent.
Pitfalls to avoid
- Believing a "no, you're not allowed to do that" system prompt will hold against motivated injection.
- Storing API keys in agent context โ if the agent leaks, so do the keys.
- Letting one agent both read external content and call privileged tools.
- Logging entire prompts including secrets to your observability stack.
Key takeaways
- 1No god agents. Privilege follows the task, not the actor.
- 2Treat prompts as untrusted; treat tokens as authoritative.
- 3Architect for compromise โ assume one agent is owned, ensure damage is bounded.
- 4Guardrails are a defence-in-depth layer, not the primary defence.
Go deeper ยท external resources
Curated reading list to take you from primer to practitioner. All links are external and free to read.