Introduction
It is now possible to run state-of-the-art, ultra-large Mixture-of-Experts models (hundreds of billions of parameters) entirely offline on consumer-grade desktop hardware. The trick is a storage-centric inference engine that exploits MoE sparse activation to bypass the traditional VRAM bottleneck, so workloads that once demanded a cloud cluster can run privately and securely on a single desktop PC.
Why this matters
- Privacy and sovereignty: large-model inference without the cloud.
- Cost: a single capex outlay versus per-token bills.
- Capability: tasks that previously required rented GPUs now run locally.
- Latency: removes the network round-trip to a remote provider.
Core concepts
Mixture-of-Experts (MoE) sparsity
MoE models activate only a subset of "experts" per token. Total parameters can be huge while compute per token stays manageable.
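As a rough sketch of how top-k routing keeps compute per token small: the expert count, layer sizes, and function names below are illustrative assumptions, not any particular model's architecture.

```python
# Illustrative top-k expert routing. Sizes and names are made up for this sketch.
import numpy as np

def route_token(token_hidden, router_weights, k=2):
    """Pick the k experts with the highest router score for this token."""
    scores = token_hidden @ router_weights           # (num_experts,)
    top_k = np.argsort(scores)[-k:]                  # indices of the k best experts
    gate = np.exp(scores[top_k] - scores[top_k].max())
    gate /= gate.sum()                               # softmax over the chosen experts only
    return top_k, gate

# 64 experts, hidden size 1024: only k of them run per token,
# so compute per token stays modest even when total parameters are huge.
rng = np.random.default_rng(0)
hidden = rng.standard_normal(1024)
router = rng.standard_normal((1024, 64))
experts, weights = route_token(hidden, router, k=2)
print(experts, weights)
```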
Storage-centric inference
Stream expert weights from fast SSDs into a smaller GPU/CPU memory budget on demand, exploiting which experts each token activates.
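One way to picture this is an LRU cache of expert weights backed by per-expert files on the SSD. The class, file layout, and capacity below are assumptions for illustration, not a real engine's API.

```python
# Minimal sketch of on-demand expert streaming with an LRU cache.
from collections import OrderedDict
import numpy as np

class ExpertCache:
    def __init__(self, capacity, weight_dir):
        self.capacity = capacity          # how many experts fit in the RAM/VRAM budget
        self.weight_dir = weight_dir      # per-expert weight files stored on NVMe
        self.cache = OrderedDict()        # expert_id -> weights, kept in LRU order

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)                 # mark as recently used
            return self.cache[expert_id]
        weights = np.load(f"{self.weight_dir}/expert_{expert_id}.npy")  # stream from SSD
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)                    # evict the coldest expert
        return weights
```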
Quantisation choices
Q4/Q5 quantisation shrinks models enough to fit consumer hardware budgets while keeping quality acceptable for many workloads, closing much of the gap to frontier capability.
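A toy symmetric 4-bit scheme shows why Q4 roughly quarters the bytes per weight relative to FP16. The block size and rounding here are illustrative, not the exact GGUF Q4/Q5 formats.

```python
# Toy symmetric 4-bit block quantisation and the error it introduces.
import numpy as np

def quantise_q4(weights, block=32):
    w = weights.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0    # map each block to [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantise_q4(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, s = quantise_q4(w)
err = np.abs(dequantise_q4(q, s) - w).mean()
print(f"mean abs error: {err:.4f}")   # small but nonzero: quality can shift on some tasks
```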
Practical patterns
NVMe-first builds
Multi-TB fast NVMe is more important than maxed-out RAM for storage-centric inference.
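A back-of-envelope calculation, using made-up but plausible numbers, shows why SSD read bandwidth rather than RAM capacity tends to be the binding constraint.

```python
# Back-of-envelope: how fast must the SSD be to keep up with token generation?
# Every number below is an illustrative assumption, not a measurement.
active_params_per_token = 20e9     # active (not total) parameters per token in a sparse MoE
bytes_per_param = 0.5              # ~4-bit quantisation
cache_hit_rate = 0.7               # fraction of expert reads served from RAM/VRAM
tokens_per_second_target = 5

bytes_streamed_per_token = active_params_per_token * bytes_per_param * (1 - cache_hit_rate)
required_gb_per_s = bytes_streamed_per_token * tokens_per_second_target / 1e9
print(f"required SSD read bandwidth: {required_gb_per_s:.1f} GB/s")
# ~15 GB/s under these assumptions: more than a single Gen4 drive delivers,
# which is why NVMe-first, multi-drive builds matter more than extra RAM.
```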
Per-task model selection
Bigger isn't always better — match model to task and quantisation.
Batching cold sessions
Group requests to keep frequently-used experts warm.
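One crude way to approximate this is to order the queue so similar requests run back-to-back. The tagging heuristic below is invented for illustration; real expert routing depends on the model, not on prompt text.

```python
# Sketch: group queued requests by a workload tag so runs of similar prompts
# tend to reuse the same experts while they are still cached.
from itertools import groupby

def tag(request):
    # crude illustrative heuristic, not a real router
    return "code" if "```" in request or "def " in request else "chat"

def batch_by_tag(queue):
    ordered = sorted(queue, key=tag)
    return [list(group) for _, group in groupby(ordered, key=tag)]

queue = ["def fib(n): ...", "Summarise this memo", "Refactor def parse()", "Draft an email"]
for batch in batch_by_tag(queue):
    print(batch)   # process each batch back-to-back so warm experts stay warm
```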
Pitfalls to avoid
- Cheap SSDs that wear out under inference workloads.
- Quantisation that silently breaks specific capabilities (math, code); see the regression-check sketch after this list.
- Underestimating cooling and power draw on sustained loads.
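A lightweight guard against the silent-quantisation pitfall is a small regression suite run against both builds. `run_model` is a placeholder to wire to your own runtime, and the tasks and threshold are illustrative.

```python
# Sketch of a quantisation regression check: run a tiny task suite against the
# full-precision and quantised builds and compare pass rates.
TASKS = [
    ("What is 17 * 24?", "408"),
    ("Python: reverse a list `xs` in place", "xs.reverse()"),
]

def pass_rate(run_model, tasks):
    hits = sum(expected in run_model(prompt) for prompt, expected in tasks)
    return hits / len(tasks)

def check_quantisation(run_full, run_q4, tasks=TASKS, max_drop=0.05):
    drop = pass_rate(run_full, tasks) - pass_rate(run_q4, tasks)
    if drop > max_drop:
        raise SystemExit(f"quantised build regressed by {drop:.0%} on the task suite")
```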
Key takeaways
1. Frontier-scale local inference is a reality for sparse architectures.
2. Engineering trade-offs are real: storage tier, quantisation, batching.
3. Re-evaluate the cloud-only assumption for many workloads.