Securing agentic workflows — a practical framework for deploying AI agents without losing control.
Agents aren't APIs. They're autonomous actors with tools, memory, and credentials — each a new vector.
Agents wield terminal, browser, send_message. A jailbroken coding agent running curl | sh on production infrastructure isn't hypothetical — it's the default capability set.
Cross-session memory means compromise outlives the conversation. An adversary who poisons context today owns the agent's decisions next week.
Agent → sub-agent → sub-sub-agent → tool. Each hop increases distance from human oversight. A leaf agent with file write access may be three delegation layers away from the CISO's approval.
Skills, MCP servers, plugins: arbitrary code by design. The agent framework trusts them implicitly. To that framework, a malicious skill fork that exfiltrates credentials to an external host is indistinguishable from a legitimate one.
Agents talking to agents via email, Slack, webhooks — invisible to human monitoring. Two agents negotiating and executing an approved financial transaction without a single human in the loop is an architectural possibility today.
The same layered model from the offensive side — flipped to defensive framing.
Your firewall doesn't know what a sub-agent delegation chain looks like. Your SIEM doesn't log tool calls. Your IAM wasn't built to issue scoped, per-instance credentials to autonomous agents. Each layer of the stack closes a gap that your existing security infrastructure was never designed to address.
Your agents authenticate as humans. That's the first thing to fix.
.env files
Vendor question: "Does your agent framework support scoped identity per agent instance? Can I bind a toolset to a credential? Can I revoke a single agent's access without affecting others?"
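To make the vendor question concrete, here is a minimal sketch of what per-agent scoped identity could look like. It is an illustration under stated assumptions, not any framework's actual API: CredentialRegistry, AgentCredential, and the tool names are hypothetical, and a real deployment would back this with an identity provider rather than an in-memory dict.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class AgentCredential:
    """One credential per agent instance, bound to a fixed toolset."""
    agent_id: str
    allowed_tools: frozenset
    token: str = field(default_factory=lambda: secrets.token_hex(16))
    revoked: bool = False


class CredentialRegistry:
    """Hypothetical in-memory registry (assumption: no persistence, no expiry)."""

    def __init__(self):
        self._by_token = {}

    def issue(self, agent_id, allowed_tools):
        # Each agent instance gets its own token; nothing is shared.
        cred = AgentCredential(agent_id, frozenset(allowed_tools))
        self._by_token[cred.token] = cred
        return cred

    def revoke(self, agent_id):
        # Revoking one agent leaves every other agent's credential untouched.
        for cred in self._by_token.values():
            if cred.agent_id == agent_id:
                cred.revoked = True

    def authorize(self, token, tool):
        # Unknown tokens, revoked credentials, and unbound tools all fail closed.
        cred = self._by_token.get(token)
        return cred is not None and not cred.revoked and tool in cred.allowed_tools
```

A platform that can answer "yes" to all three questions should be able to express something equivalent: issue per instance, bind tools at issue time, revoke by agent.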
Prompt engineering is not security. Policy enforcement is.
Telling an LLM "don't do bad things" is advisory. Jailbreaks, prompt injection, and context manipulation all bypass it. This is not security — it's a hope.
Allow/deny at the tool layer, not the prompt layer. A constitution evaluated deterministically before any tool executes — no LLM in the decision path.
Don't just block. Escalate to a human with context: what the agent tried, why, and what's at stake. The human decides — and the decision is logged.
How Iron Curtain does it: Write your guardrails in plain English. "Agents may not access hosts outside 10.0.0.0/8." "Destructive operations require human approval." "No outbound network connections to non-allowlisted IPs." These rules are enforced deterministically at the MCP tool layer — the agent can't talk its way around them.
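A deterministic check of this kind can be sketched in a few lines. The example below is an illustration only and does not show how Iron Curtain is implemented: the evaluate function, the Decision enum, and the tool names in DESTRUCTIVE_TOOLS are assumptions made for the sketch. It encodes two of the quoted rules (hosts must fall inside 10.0.0.0/8, destructive operations escalate to a human) with no LLM in the decision path.

```python
import ipaddress
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    ESCALATE = "escalate"  # hand off to a human with context


ALLOWED_NETWORK = ipaddress.ip_network("10.0.0.0/8")
# Hypothetical tool names; a real policy would enumerate its own.
DESTRUCTIVE_TOOLS = {"delete_file", "drop_table"}


def evaluate(tool, args):
    """Decide before the tool runs. Pure function: same input, same decision."""
    host = args.get("host")
    if host is not None:
        try:
            addr = ipaddress.ip_address(host)
        except ValueError:
            # Assumption: hostnames are not resolved here, so fail closed.
            return Decision.DENY
        if addr not in ALLOWED_NETWORK:
            return Decision.DENY
    if tool in DESTRUCTIVE_TOOLS:
        return Decision.ESCALATE
    return Decision.ALLOW
```

Because the function never consults the model, no prompt can change its answer; the only way to change behavior is to change the policy itself.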
If you can't audit it, you can't authorize it.
Actor · timestamp · input · output · approval state · delegation depth
User → orchestrator → sub-agent → leaf → tool. Every hop visible.
Agent actions in the same pipeline as human actions. No separate monitoring silo.
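The audit fields listed above map naturally onto a structured log record. The following is a minimal sketch, assuming JSON lines as the transport into an existing log pipeline; the AuditRecord class, its field names, and the definition of delegation depth as "hops from the user to the acting agent" are assumptions for illustration.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class AuditRecord:
    actor: str              # the agent (or human) that made the tool call
    timestamp: str          # ISO 8601, UTC
    tool_input: str
    tool_output: str
    approval_state: str     # e.g. "auto", "approved", "denied"
    delegation_chain: list  # user first, acting agent last

    @property
    def delegation_depth(self):
        # Assumption: depth counts hops from the user to the acting agent.
        return len(self.delegation_chain) - 1


def to_log_line(record):
    """Serialize one record as a JSON line, with depth made explicit."""
    payload = asdict(record)
    payload["delegation_depth"] = record.delegation_depth
    return json.dumps(payload, sort_keys=True)


record = AuditRecord(
    actor="leaf",
    timestamp=datetime.now(timezone.utc).isoformat(),
    tool_input="write_file /tmp/report.txt",
    tool_output="ok",
    approval_state="auto",
    delegation_chain=["user", "orchestrator", "sub-agent", "leaf"],
)
line = to_log_line(record)
```

Keeping the full chain in every record, rather than only the depth, is what lets a reviewer trace a leaf action back to the user who started it.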
Autonomy is a dial, not a switch. Turn it up as trust is earned.
Research, analysis, code review. No write, no network, no side effects. Safe for any deployment.
Read + write, but confined to sandboxed/containerized environments. No production access.
Read + write + network, but only to explicitly scoped targets. IP ranges, domains, approved APIs.
Full capability, but destructive operations require human approval. Escalation gates with context.
No gates. Red-team and security research only. Must be air-gapped from production.
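The five tiers above can be expressed as an ordered enum with a capability table. This is a sketch under assumptions: the tier names (READ_ONLY through UNGATED) and the capability labels are invented for illustration, and the sandbox confinement and target scoping described for the middle tiers are not modeled here; they would be enforced by the runtime, not by this lookup.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    # Hypothetical names for the five tiers, lowest autonomy first.
    READ_ONLY = 0
    SANDBOXED = 1
    SCOPED_NETWORK = 2
    GATED_FULL = 3
    UNGATED = 4


CAPABILITIES = {
    Autonomy.READ_ONLY: {"read"},
    Autonomy.SANDBOXED: {"read", "write"},
    Autonomy.SCOPED_NETWORK: {"read", "write", "network"},
    Autonomy.GATED_FULL: {"read", "write", "network", "destructive"},
    Autonomy.UNGATED: {"read", "write", "network", "destructive"},
}

# Only the gated tier routes destructive operations to a human.
NEEDS_APPROVAL = {Autonomy.GATED_FULL: {"destructive"}}


def check(level, capability):
    """Return 'allow', 'deny', or 'escalate' for a capability at a tier."""
    if capability not in CAPABILITIES[level]:
        return "deny"
    if capability in NEEDS_APPROVAL.get(level, set()):
        return "escalate"
    return "allow"
```

Turning the dial up then means changing one value per agent, for example from Autonomy.SANDBOXED to Autonomy.SCOPED_NETWORK, rather than rewriting prompts.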
The agent's capabilities are only as trustworthy as its supply chain.
Red flag: Any agent platform that installs community code (skills, plugins, MCP servers) without signature verification, sandboxing, or an audit trail is running arbitrary code as a feature. Treat it like you'd treat a package manager with no checksums.
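The package-manager comparison suggests the simplest possible control: refuse to load a skill whose content does not match a pinned digest. The sketch below shows checksum pinning with SHA-256, which is weaker than the signature verification mentioned above (it proves the bytes are unchanged, not who published them). The verify_skill function and the pinned_digests mapping are hypothetical names for illustration.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_skill(name, payload, pinned_digests):
    """Return True only if the skill is pinned and its bytes match the pin."""
    expected = pinned_digests.get(name)
    if expected is None:
        # Unpinned community code is rejected, not trusted by default.
        return False
    # compare_digest avoids leaking match position through timing.
    return hmac.compare_digest(sha256_hex(payload), expected)


# Example: pin a reviewed version, then check a candidate before loading it.
reviewed = b"def run():\n    return 'hello'\n"
pins = {"greeter": sha256_hex(reviewed)}
tampered = reviewed + b"import os\n"
```

With these values, verify_skill("greeter", reviewed, pins) passes, while the tampered payload and any unpinned skill name are rejected.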
Your IR plan assumes a human caused the incident. Update it for when an agent did.
A user with legitimate access directs an agent to perform an unauthorized action within the agent's scope.
An agent bypasses its policy constraints and performs an action outside its authorized scope.
A malicious skill, plugin, or MCP server exfiltrates data or executes unauthorized commands.
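The three scenarios above differ in two observable facts: whether the action originated in third-party code, and whether it stayed inside the agent's authorized scope. A first-pass triage could be sketched as follows; the IncidentType names and the classify function are assumptions for illustration, and real triage would draw these two facts from the audit trail rather than take them as arguments.

```python
from enum import Enum


class IncidentType(Enum):
    USER_DIRECTED_MISUSE = "user directed an unauthorized action within agent scope"
    POLICY_BYPASS = "agent acted outside its authorized scope"
    SUPPLY_CHAIN = "malicious skill, plugin, or MCP server"


def classify(from_third_party_component, within_agent_scope):
    """Map two audit facts onto one of the three scenarios."""
    if from_third_party_component:
        return IncidentType.SUPPLY_CHAIN
    if not within_agent_scope:
        return IncidentType.POLICY_BYPASS
    # In scope and first-party: the agent did what it was allowed to do,
    # so the question shifts to who directed it.
    return IncidentType.USER_DIRECTED_MISUSE
```

The ordering is a design choice: supply-chain origin is checked first because compromised third-party code can also produce out-of-scope actions, and the response (pulling the component) differs from the response to a policy bypass.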
Copy this. Send it to every agentic platform vendor you evaluate.
Three maturity levels. Pick where you are.
You're evaluating agentic tools or already have them in limited use.
You're deploying agents and need safety rails.
You're running agents in production at scale.
The foundational governance framework for AI risk. NIST AI RMF 1.0
Threat taxonomy for LLM-integrated applications. OWASP, 2025
Adversarial Threat Landscape for AI Systems — mapping AI-specific attack techniques. MITRE ATLAS
Evaluations of frontier model capabilities for offensive cyber operations. Anthropic, 2024
Systematic evaluation of frontier model risks including offensive cyber capabilities. OpenAI, 2024
Deterministic policy enforcement for AI agents at the MCP tool layer. Iron Curtain