The AI Control Plane: Infrastructure as Data

Introduction

We've spent a decade codifying infrastructure (Terraform, Pulumi, CDK). The next step is treating infrastructure as a queryable data layer: cloud state exposed as facts, agents armed with reusable operational knowledge, and LLM gateways routing and governing model access. Policy-as-code, observability, and identity become first-class citizens of the agent loop — and the result is an "AI control plane" where agents act on infrastructure safely.

Why this matters

IaC alone tells you what should be there; control planes tell you what is.
Agents need queryable infrastructure state to reason about cost, security, and capacity.
Identity and policy must be first-class to prevent accidental agent misuse of privileged tools.
The control plane is the only sane place to enforce per-agent quotas and audits.

Core concepts

Infrastructure as a data layer

Cloud state — accounts, resources, configs, policies — exposed as a queryable, unified data model rather than a pile of provider-specific APIs.

Skills and reusable runbooks

Operational knowledge codified as skills: "diagnose pod CrashLoopBackOff," "rotate a leaked credential." Agents pick the right skill, not the right command.

LLM gateway

A proxy that routes, authenticates, audits, and rate-limits all model traffic from agents. Single point of policy and observability.

Policy as code in the loop

Every agent action runs through a policy engine (OPA, Cedar) before execution. The agent proposes; the policy disposes.

Practical patterns

Read-only first

Phase 1: agents only query infra. Phase 2: agents propose changes via PRs. Phase 3: bounded direct action with policy gates.

Skill registry

Versioned, auditable library of operational tasks; agents must use registered skills, not free-form commands.

Just-in-time credentials

Short-lived credentials minted per-action against per-action policies; no long-lived secrets in agent context.

Observability everywhere

Every infra-touching agent action is logged with input, output, decision rationale, and policy check results.

Pitfalls to avoid

Letting agents call provider APIs directly; bypasses your control plane and policy.
No diff-and-approve step on changes; agents drift production over weeks.
Treating skills as static; they need versioning, deprecation, and updates.
Mixing read and write privileges in one agent; widens the blast radius unnecessarily.