Introduction
We've spent a decade codifying infrastructure (Terraform, Pulumi, CDK). The next step is treating infrastructure as a queryable data layer: cloud state exposed as facts, agents armed with reusable operational knowledge, and LLM gateways routing and governing model access. Policy-as-code, observability, and identity become first-class citizens of the agent loop, and the result is an "AI control plane" where agents act on infrastructure safely.
Why this matters
- IaC alone tells you what should be there; control planes tell you what is.
- Agents need queryable infrastructure state to reason about cost, security, and capacity.
- Identity and policy must be first-class to prevent accidental agent misuse of privileged tools.
- The control plane is the only sane place to enforce per-agent quotas and audits.
Core concepts
Infrastructure as a data layer
Cloud state (accounts, resources, configs, policies) exposed as a queryable, unified data model rather than a pile of provider-specific APIs.
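To make "queryable, unified data model" concrete, here is a minimal sketch that uses an in-memory SQLite table as a stand-in for the data layer. The schema, column names, and sample rows are all hypothetical; a real control plane would populate such a table from provider inventories.

```python
import sqlite3

# Hypothetical unified schema: one row per resource, whatever the provider.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resources ("
    "provider TEXT, account TEXT, kind TEXT, name TEXT, "
    "public INTEGER, monthly_cost_usd REAL)"
)
conn.executemany(
    "INSERT INTO resources VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("aws", "prod", "bucket", "logs-archive", 0, 12.5),
        ("aws", "prod", "bucket", "public-assets", 1, 3.0),
        ("gcp", "analytics", "vm", "etl-worker", 0, 140.0),
    ],
)


def public_buckets(db):
    """Security question: which buckets are publicly exposed?"""
    rows = db.execute(
        "SELECT provider, account, name FROM resources "
        "WHERE kind = 'bucket' AND public = 1 ORDER BY name"
    )
    return rows.fetchall()


def cost_by_account(db):
    """Cost question: what does each account spend per month?"""
    rows = db.execute(
        "SELECT account, SUM(monthly_cost_usd) FROM resources "
        "GROUP BY account ORDER BY account"
    )
    return dict(rows.fetchall())
```

The point of the sketch is that an agent asks one kind of question (a query) regardless of which provider owns the resource.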
Skills and reusable runbooks
Operational knowledge codified as skills: "diagnose pod CrashLoopBackOff," "rotate a leaked credential." Agents pick the right skill, not the right command.
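One way to picture "pick the right skill, not the right command" is a small catalog that maps symptoms to named runbooks. This is a sketch under assumptions: the `Skill` type, the keyword matching, and the step descriptions are invented for illustration, and a real agent would likely select skills with something richer than keyword overlap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Skill:
    name: str
    symptoms: frozenset  # keywords this skill is meant to handle
    steps: tuple         # ordered runbook steps, described in prose


# Hypothetical catalog built from the two examples in the text.
SKILLS = [
    Skill(
        "diagnose-crashloop",
        frozenset({"crashloopbackoff", "pod"}),
        ("fetch pod events", "read last container logs", "check recent deploys"),
    ),
    Skill(
        "rotate-leaked-credential",
        frozenset({"leaked", "credential"}),
        ("revoke old credential", "issue replacement", "update dependents"),
    ),
]


def pick_skill(request: str) -> Optional[Skill]:
    """Return the skill whose symptoms overlap the request most, or None."""
    words = set(request.lower().split())
    best = max(SKILLS, key=lambda s: len(s.symptoms & words))
    return best if best.symptoms & words else None
```

When nothing matches, `pick_skill` returns `None` rather than falling back to a free-form command, which keeps the agent inside the set of codified runbooks.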
LLM gateway
A proxy that routes, authenticates, audits, and rate-limits all model traffic from agents. Single point of policy and observability.
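The four gateway duties named above (route, authenticate, audit, rate-limit) can be sketched in a few lines. Everything here is an assumption made for illustration: the `Gateway` class, the API keys, the task types, and the model names are hypothetical, and the actual call to a model is left out.

```python
import time
from collections import defaultdict, deque


class Gateway:
    """Toy gateway: authenticate, rate-limit, route, and audit each request."""

    def __init__(self, api_keys, routes, max_per_minute=2):
        self.api_keys = api_keys          # API key -> agent id
        self.routes = routes              # task type -> model name; needs "default"
        self.max_per_minute = max_per_minute
        self.recent = defaultdict(deque)  # agent id -> timestamps of recent requests
        self.audit_log = []

    def handle(self, api_key, task, prompt, now=None):
        now = time.time() if now is None else now
        agent = self.api_keys.get(api_key)
        if agent is None:
            return self._record(now, None, task, "denied: unknown key")
        # Sliding one-minute window per agent.
        window = self.recent[agent]
        while window and now - window[0] >= 60:
            window.popleft()
        if len(window) >= self.max_per_minute:
            return self._record(now, agent, task, "denied: rate limit")
        window.append(now)
        model = self.routes.get(task, self.routes["default"])
        # A real gateway would forward `prompt` to `model` here.
        return self._record(now, agent, task, f"routed: {model}")

    def _record(self, now, agent, task, outcome):
        # Every request is audited, including the denied ones.
        self.audit_log.append(
            {"ts": now, "agent": agent, "task": task, "outcome": outcome}
        )
        return outcome
```

Because every request passes through `handle`, the audit log and the quota live in one place instead of being reimplemented in each agent.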
Policy as code in the loop
Every agent action runs through a policy engine (OPA, Cedar) before execution. The agent proposes; the policy disposes.
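The "agent proposes; the policy disposes" flow can be illustrated with a plain-Python stand-in for the policy engine. This is not OPA or Cedar: in a real deployment the rules would be written in the engine's own policy language and evaluated by the engine. The `Action` fields and the specific rules below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    agent: str
    verb: str          # e.g. "read", "scale", "delete"
    resource: str
    environment: str   # e.g. "staging", "prod"


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def evaluate(action: Action) -> Decision:
    """Hypothetical rules; anything not explicitly allowed is denied."""
    if action.verb == "read":
        return Decision(True, "reads are always allowed")
    if action.verb == "delete":
        return Decision(False, "agents may never delete")
    if action.environment != "prod":
        return Decision(True, "non-prod writes are allowed")
    return Decision(False, "no rule allows this action")


def execute(action: Action, run):
    """Run the action only if the policy check passes."""
    decision = evaluate(action)
    if not decision.allowed:
        return f"blocked: {decision.reason}"
    return run(action)
```

The agent never calls `run` itself; it hands a proposed `Action` to `execute`, so there is no code path that skips the check.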
Practical patterns
Read-only first
Phase 1: agents only query infra. Phase 2: agents propose changes via PRs. Phase 3: bounded direct action with policy gates.
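The three phases can be encoded as an ordered gate, so that moving to the next phase is a one-line configuration change rather than a code rewrite. The operation names (`query`, `open_pr`, `apply`) are assumptions chosen for this sketch.

```python
from enum import IntEnum


class Phase(IntEnum):
    QUERY = 1    # agents only read infrastructure state
    PROPOSE = 2  # agents open pull requests for humans to review
    ACT = 3      # agents act directly, behind policy gates


# Hypothetical mapping from operation to the earliest phase that permits it.
REQUIRED_PHASE = {
    "query": Phase.QUERY,
    "open_pr": Phase.PROPOSE,
    "apply": Phase.ACT,
}


def permitted(operation: str, current: Phase) -> bool:
    """Unknown operations are never permitted, whatever the phase."""
    required = REQUIRED_PHASE.get(operation)
    return required is not None and current >= required
```

For example, with the rollout at `Phase.PROPOSE`, `permitted("open_pr", Phase.PROPOSE)` is true while `permitted("apply", Phase.PROPOSE)` is false.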
Skill registry
Versioned, auditable library of operational tasks; agents must use registered skills, not free-form commands.
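A registry along these lines might look like the sketch below. The `SkillRegistry` class and its method names are hypothetical; the sketch only aims to show the three properties named above: skills are pinned to a version, every registry event is recorded, and anything unregistered or deprecated cannot be invoked.

```python
class SkillRegistry:
    """Toy registry: versioned skills with an audit trail."""

    def __init__(self):
        self._skills = {}         # (name, version) -> callable
        self._deprecated = set()  # (name, version) pairs that may no longer run
        self.audit = []           # ordered (event, name, version) tuples

    def register(self, name, version, fn):
        key = (name, version)
        if key in self._skills:
            # Published versions are immutable; ship a new version instead.
            raise ValueError(f"{name}@{version} is already registered")
        self._skills[key] = fn
        self.audit.append(("register", name, version))

    def deprecate(self, name, version):
        if (name, version) not in self._skills:
            raise LookupError(f"{name}@{version} is not a registered skill")
        self._deprecated.add((name, version))
        self.audit.append(("deprecate", name, version))

    def invoke(self, name, version, **kwargs):
        key = (name, version)
        if key not in self._skills:
            raise LookupError(f"{name}@{version} is not a registered skill")
        if key in self._deprecated:
            raise LookupError(f"{name}@{version} is deprecated")
        self.audit.append(("invoke", name, version))
        return self._skills[key](**kwargs)
```

An agent that asks for a free-form command gets a `LookupError`, because the only way to run anything is through `invoke` with a registered name and version.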
Just-in-time credentials
Short-lived credentials minted per-action against per-action policies; no long-lived secrets in agent context.
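As an illustration of per-action, short-lived credentials, the sketch below mints an HMAC-signed token that names one agent, one action, and an expiry time. The token format, the `mint` and `verify` helpers, and the hard-coded signing key are assumptions for the example; a production system would more likely rely on its cloud provider's or secrets manager's short-lived credential mechanism.

```python
import hashlib
import hmac
import time

# Demo-only key; a real system would keep this in a secrets manager.
SIGNING_KEY = b"demo-only-key"


def _sign(payload: str) -> str:
    return hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()


def mint(agent, action, ttl_seconds=60, now=None):
    """Mint a token valid for one action; agent and action must not contain '|'."""
    now = int(time.time() if now is None else now)
    payload = f"{agent}|{action}|{now + ttl_seconds}"
    return f"{payload}|{_sign(payload)}"


def verify(token, action, now=None):
    """Accept only an untampered, unexpired token scoped to this exact action."""
    now = time.time() if now is None else now
    try:
        agent, scoped_action, expires, sig = token.split("|")
        expires_at = int(expires)
    except ValueError:
        return False  # wrong number of fields or a non-numeric expiry
    expected = _sign(f"{agent}|{scoped_action}|{expires}")
    return (
        hmac.compare_digest(sig.encode(), expected.encode())
        and scoped_action == action
        and now < expires_at
    )
```

The token carries no standing privilege: it is useless for any other action and useless after `ttl_seconds`, so leaking it from the agent's context has a bounded cost.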
Observability everywhere
Every infra-touching agent action is logged with input, output, decision rationale, and policy check results.
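The four logged elements (input, output, decision rationale, policy check results) map naturally onto one structured record per action. The field names below are hypothetical; the sketch emits one JSON line per action so the log can be shipped to whatever log store is already in use.

```python
import json
from datetime import datetime, timezone


def audit_record(agent, skill, inputs, output, rationale,
                 policy_allowed, policy_reason):
    """Build one JSON log line describing a single agent action."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "skill": skill,
        "input": inputs,
        "output": output,
        "rationale": rationale,
        "policy": {"allowed": policy_allowed, "reason": policy_reason},
    }
    return json.dumps(record, sort_keys=True)


# Example values are invented for illustration.
line = audit_record(
    agent="capacity-agent",
    skill="scale-deployment@1.2.0",
    inputs={"deployment": "web", "replicas": 5},
    output="proposed via pull request",
    rationale="p95 latency above target for 30 minutes",
    policy_allowed=True,
    policy_reason="non-prod writes are allowed",
)
```

Keeping the rationale and the policy result in the same record as the input and output means a reviewer can reconstruct why an action happened without joining several logs.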
Pitfalls to avoid
- Letting agents call provider APIs directly; bypasses your control plane and policy.
- Skipping a diff-and-approve step on changes; unreviewed agent edits let production drift from its declared state over weeks.
- Treating skills as static; they need versioning, deprecation, and updates.
- Mixing read and write privileges in one agent; widens the blast radius unnecessarily.
Key takeaways
- Build a unified data model for infra; agents need a queryable substrate.
- Policy as code is non-negotiable for action.
- Skills > raw commands; they're what makes agents safe at scale.
- Phase your rollout: read, propose, act, in that order.
Go deeper · external resources
Curated reading list to take you from primer to practitioner. All links are external and free to read.