Introduction
The shift from mainframes to PCs in the 1980s removed gatekeepers by putting computing power directly in engineers' hands. Today, on-prem AI hardware (NVIDIA DGX Spark, Apple Silicon clusters, AMD MI-series rigs) does the same for AI. Building a multi-agent system that runs entirely air-gapped is no longer aspirational; it's available to any engineer who wants to take back the means of inference.
Why this matters
- Sovereignty: data and model never leave your premises.
- Latency: zero-network inference is fast and predictable.
- Compliance: hard guarantees beat policy promises.
- Cost predictability: capex over usage-based pricing for steady workloads.
Core concepts
Air-gap topology
No outbound network from inference hosts. Models are pre-loaded; updates arrive via signed offline channels.
Hardware tiers
Workstation-class (DGX Spark, Mac Studio clusters), rack-class (a single 8x H100/H200 node), datacentre-class. Pick a tier by target parameter count and concurrency.
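A rough sizing sketch can make "parameter count + concurrency" concrete. The figures below are illustrative, not vendor specs: weight memory is roughly parameters times bytes per parameter, and the KV cache grows with context length and concurrent requests. The model shape (80 layers, 8 KV heads, head dim 128) is assumed, loosely Llama-70B-like.

```python
def model_memory_gb(params_b: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB for a dense model."""
    return params_b * bytes_per_param  # params in billions cancels the 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, concurrency: int,
                bytes_per_elem: float = 2.0) -> float:
    """KV cache: 2 tensors (K and V) per layer, per token, per request."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * concurrency / 1e9

# Example: a 70B model quantised to 4-bit weights (0.5 bytes/param),
# serving 16 concurrent requests at 8K context (all figures illustrative).
weights = model_memory_gb(70, 0.5)        # ~35 GB of weights
kv = kv_cache_gb(80, 8, 128, 8192, 16)    # ~43 GB of KV cache
total_gb = weights + kv                   # drives the hardware tier choice
```

The point of the exercise: at steady concurrency, KV cache can rival or exceed the weights themselves, which is why concurrency target belongs in the purchase decision alongside parameter count.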
Multi-agent on local hardware
Agent roles share the GPU pool via vLLM/SGLang continuous batching; for many workloads, several smaller specialised models (one per role) beat a single large generalist.
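A minimal sketch of per-role routing, assuming a local vLLM deployment exposing its OpenAI-compatible API. The base URL and the model names in the routing table are hypothetical placeholders, not recommendations:

```python
import json
import urllib.request

# Hypothetical role-to-model routing table: each agent role gets the
# smallest model that handles its job. Model names are illustrative.
ROLE_MODELS = {
    "planner":   "qwen2.5-14b-instruct",
    "coder":     "qwen2.5-coder-7b",
    "retriever": "qwen2.5-3b-instruct",
}

def build_request(role: str, prompt: str,
                  base_url: str = "http://localhost:8000/v1") -> urllib.request.Request:
    """Build an OpenAI-compatible chat request for a local inference server.

    The server batches concurrent requests across the GPU pool, so all
    roles can share one deployment without leaving the air gap.
    """
    if role not in ROLE_MODELS:
        raise ValueError(f"unknown role: {role}")
    body = {
        "model": ROLE_MODELS[role],
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
```

Because every role speaks the same local endpoint, swapping a role's model is a one-line change to the routing table rather than a redeploy of the agents.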
Practical patterns
Model registry on a signed share
Air-gap-friendly model distribution; each model has a signed manifest.
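A sketch of manifest verification before a model is admitted to the registry. The manifest layout is an assumption, and the HMAC here stands in for a real asymmetric signature (e.g. Ed25519 via minisign), which is what an actual offline channel should use:

```python
import hashlib
import hmac
import json
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(manifest_path: Path, key: bytes) -> bool:
    """Check the manifest signature, then every file digest it lists.

    Assumed manifest shape: {"files": {name: sha256_hex}, "signature": hex}.
    HMAC is a stand-in for an asymmetric signature in this sketch.
    """
    manifest = json.loads(manifest_path.read_text())
    payload = json.dumps(manifest["files"], sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, manifest["signature"]):
        return False
    root = manifest_path.parent
    return all(sha256_file(root / name) == digest
               for name, digest in manifest["files"].items())
```

Verifying the signature before the digests means a tampered manifest is rejected without reading any model files at all.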
Local observability stack
Self-hosted Langfuse / OTel; telemetry never leaves the gap.
Capex/opex modelling
Compare 18-month total cost of ownership against projected cloud token bills before committing.
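The comparison is plain arithmetic once you pin your assumptions. All numbers below are illustrative placeholders, not quotes:

```python
def onprem_tco(capex: float, monthly_opex: float, months: int = 18) -> float:
    """On-prem total cost: purchase price plus monthly power,
    cooling, rack space, and ops labour."""
    return capex + monthly_opex * months

def cloud_tco(tokens_per_month: float, usd_per_million_tokens: float,
              months: int = 18) -> float:
    """Cloud total cost at a steady monthly token volume."""
    return tokens_per_month / 1e6 * usd_per_million_tokens * months

# Illustrative scenario: a $60k rig with $1.5k/month of ops overhead
# vs. a steady 2B tokens/month at $3 per million tokens.
onprem = onprem_tco(60_000, 1_500)   # 60,000 + 27,000 = 87,000
cloud = cloud_tco(2e9, 3.0)          # 2,000 * 3 * 18 = 108,000
onprem_wins = onprem < cloud
```

The steady-workload caveat matters: if token volume is spiky or uncertain, the cloud column shrinks while the capex column does not, so run the model across your pessimistic volume estimate too.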
Pitfalls to avoid
- Buying hardware that fits today's model, not next year's.
- Underestimating ops complexity: drivers, CUDA versions, cooling, networking.
- No update plan; the air gap becomes a stagnation gap.
Key takeaways
1. Sovereign AI is now within reach for many workloads.
2. Plan for the lifecycle: ingest, run, observe, update, all behind the air gap.
3. Model the economics honestly before you commit to hardware.
Go deeper · external resources
Curated reading list to take you from primer to practitioner. All links are external and free to read.