
Multi-Armed Bandits: A Scientific Shotgun for LLM Evals

A/B testing is too rigid for AI. Bandits adapt traffic in real time as evidence accumulates.

Introduction

A/B testing is too rigid for AI systems. You're stuck serving worse results for the duration of the experiment, getting billed for slower models while three providers release SOTA updates in the same week. Multi-armed bandits steal a trick from data science: adapt traffic allocation in real time as evidence accumulates, surfacing the best-performing models, prompts, and configurations without rebuilding your stack.

Why this matters

  • Frontier model release cadence is faster than typical A/B test windows.
  • A/B "losers" cost real money during the experiment; bandits minimise that exposure.
  • Multi-arm trade-offs (model × prompt × params) explode the design matrix beyond classical A/B.
  • Evaluation feedback is slow and expensive to collect, so every observation counts; bandits use each one to decide where to send the next request.

Core concepts

1. Exploit vs. explore

At every step the algorithm decides: send traffic to the current best (exploit) or to less-tested arms (explore). Bandits balance these; A/B fixes the split up front.
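A minimal epsilon-greedy sketch of that decision in Python; the arm names, running means, and 10% exploration rate are illustrative assumptions:

import random

# Illustrative running state: arm name -> (mean reward observed so far, pull count).
arm_stats = {"model_a": (0.72, 40), "model_b": (0.65, 35), "model_c": (0.70, 12)}
EPSILON = 0.10  # fraction of traffic reserved for exploration

def choose_arm():
    """Mostly exploit the current best arm, occasionally explore at random."""
    if random.random() < EPSILON:
        return random.choice(list(arm_stats))               # explore
    return max(arm_stats, key=lambda a: arm_stats[a][0])    # exploit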

2. Common algorithms

Epsilon-greedy (simple, robust), UCB (theoretically tight), Thompson sampling (Bayesian, plays well with priors). Thompson sampling is usually the default for production.
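A minimal Thompson-sampling sketch for binary rewards (pass/fail on an eval), assuming uniform Beta(1, 1) priors; the arm names are placeholders:

import random

# Beta posterior per arm, stored as [successes + 1, failures + 1].
posteriors = {"model_a": [1, 1], "model_b": [1, 1], "model_c": [1, 1]}

def choose_arm():
    """Sample a plausible win rate from each arm's posterior and play the best draw."""
    draws = {arm: random.betavariate(a, b) for arm, (a, b) in posteriors.items()}
    return max(draws, key=draws.get)

def update(arm, success):
    """Fold an observed binary reward back into that arm's posterior."""
    posteriors[arm][0 if success else 1] += 1

Because the update is just two counters per arm, this runs comfortably in the request path.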

3. Reward design is the hard part

The reward function is your eval. Multi-objective rewards (quality + latency + cost) require either a weighted sum or constrained optimisation.
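One common way to scalarise the trade-off is a weighted sum with latency and cost normalised against budgets; the weights and budgets below are illustrative assumptions, not recommendations:

def reward(quality, latency_s, cost_usd,
           latency_budget_s=2.0, cost_budget_usd=0.02,
           w_quality=0.7, w_latency=0.2, w_cost=0.1):
    """Collapse eval quality (0..1), latency, and cost into a single 0..1 reward."""
    latency_score = max(0.0, 1.0 - latency_s / latency_budget_s)
    cost_score = max(0.0, 1.0 - cost_usd / cost_budget_usd)
    return w_quality * quality + w_latency * latency_score + w_cost * cost_score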

4. Non-stationarity

Models change, prompts drift, traffic shifts. Use windowed bandits or restart policies; pure stationary bandits will mislead.
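A sliding window is the simplest fix: only the most recent observations per arm count, so an arm that degrades gets re-evaluated rather than coasting on old evidence. A sketch, with the window size as an assumption to tune:

from collections import defaultdict, deque

WINDOW = 200  # illustrative: only the last 200 rewards per arm count
recent = defaultdict(lambda: deque(maxlen=WINDOW))

def record(arm, reward):
    recent[arm].append(reward)

def windowed_mean(arm):
    """Mean reward over the window; stale evidence falls off automatically."""
    obs = recent[arm]
    return sum(obs) / len(obs) if obs else 0.0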

Practical patterns

Bandit over models

Each model is an arm; reward is a composite of eval score and latency/cost. Best arm wins; new arms can be added at any time.
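Because each arm is just an entry in the bandit's state, a newly released model can join mid-experiment; a hedged sketch, assuming the Beta-posterior representation above (the model names and counts are placeholders):

posteriors = {"model_a": [41, 13], "model_b": [36, 21]}  # existing arms (illustrative counts)

def add_arm(name, prior_successes=1, prior_failures=1):
    """Register a new model as an arm; a mildly optimistic prior buys it early exploration."""
    posteriors.setdefault(name, [prior_successes, prior_failures])

# A provider ships a new model mid-experiment.
add_arm("model_d", prior_successes=3, prior_failures=1)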

Bandit over prompts

Variant prompts as arms. The feedback loop is tighter here, since prompt changes affect output more directly than swapping models.

Contextual bandits

Per-request features (user segment, prompt length, tool inventory) inform which arm to pick. Better fit when one model isn't globally best.
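A minimal contextual variant keeps one posterior table per context bucket, so the best arm can differ by request; the user-segment and prompt-length bucketing here is an illustrative assumption:

import random
from collections import defaultdict

ARMS = ["small_fast_model", "large_capable_model"]  # placeholder arm names

# One Beta posterior table per context key.
posteriors = defaultdict(lambda: {arm: [1, 1] for arm in ARMS})

def context_key(user_segment, prompt_len):
    return (user_segment, "long" if prompt_len > 2000 else "short")

def choose_arm(user_segment, prompt_len):
    table = posteriors[context_key(user_segment, prompt_len)]
    draws = {arm: random.betavariate(a, b) for arm, (a, b) in table.items()}
    return max(draws, key=draws.get)

def update(user_segment, prompt_len, arm, success):
    posteriors[context_key(user_segment, prompt_len)][arm][0 if success else 1] += 1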

Shadow bandit

Run a bandit offline against logged traffic and only graduate winners to production. Avoids early-experiment user impact.
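One way to do this is replay-style evaluation: step the bandit over the log and only score rounds where its choice matches the arm that was actually served. A sketch, assuming a log of (arm_served, success) pairs:

import random

def replay(logged, posteriors):
    """Replay a Thompson bandit over logged (arm_served, success) pairs.

    Only rounds where the bandit would have picked the logged arm count,
    which keeps the estimate honest when the logged traffic was served uniformly.
    """
    matched, wins = 0, 0
    for arm_served, success in logged:
        draws = {a: random.betavariate(s, f) for a, (s, f) in posteriors.items()}
        if max(draws, key=draws.get) == arm_served:
            matched += 1
            wins += success
            posteriors[arm_served][0 if success else 1] += 1
    return wins / matched if matched else 0.0

# Illustrative usage on a tiny fake log.
print(replay([("model_a", 1), ("model_b", 0), ("model_a", 1)],
             {"model_a": [1, 1], "model_b": [1, 1]}))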

Pitfalls to avoid

  • Reward functions that don't reflect actual user value; the bandit converges on the wrong winner.
  • Ignoring non-stationarity; yesterday's winner is today's loser, and the bandit doesn't notice.
  • Confidence intervals too tight too early; the bandit commits before it's really learned.
  • Treating the bandit as fire-and-forget; it needs monitoring like any production system.

Key takeaways

  1. Bandits are a better fit than A/B for AI experiments where the design space is huge and the cost of losing arms is real.
  2. The reward function is the experiment; design it carefully.
  3. Account for non-stationarity from day one.
  4. Start with Thompson sampling; it's a solid default.
