RAG is not enough: when to fine-tune, when to ground

"Should we use RAG or fine-tune?" is the wrong question. It assumes the two techniques compete for the same job. They don't. Across the 12 LLM systems we've taken to production, the pattern is consistent: RAG and fine-tuning solve different problems, teams that pick one ideologically end up fighting their architecture, and the deployments that perform best usually use both — each doing the job it's actually good at.

Here's the framework we run clients through before a line of code is written. It's short, because the decision is simpler than the discourse around it suggests.

What RAG is actually good at

Retrieval-augmented generation injects knowledge at query time. That single property drives everything it wins at.

Knowledge freshness: when a policy changes, you update a document and the system knows it within one sync cycle, with no retraining and no redeployment. For a logistics client whose rate cards change weekly, this alone settled the architecture.
Auditability: every answer traces to specific source chunks, which is the difference between "the model said so" and an answer a compliance team will sign off on.
Citations: users can click through to the source, which is the single biggest trust-builder we've measured in adoption data.

If your problem is "the model doesn't know our facts," RAG is the answer, full stop.
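To make "injects knowledge at query time" concrete, here is a minimal sketch, not a production retriever. It assumes a toy word-overlap ranking in place of a real vector search, and the `Chunk` type, file names and rate values are all hypothetical. It shows the two properties described above: each chunk carries its source so the answer can cite it, and swapping a document changes what the model sees without touching the model.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # hypothetical document identifier used for citations
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase word set; a stand-in for real embedding-based matching."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, corpus: list[Chunk], k: int = 2) -> list[Chunk]:
    """Return up to k chunks that share at least one word with the query."""
    terms = tokenize(query)
    scored = [(len(terms & tokenize(c.text)), c) for c in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]


def build_prompt(query: str, chunks: list[Chunk]) -> str:
    """Inject retrieved text at query time, tagging each chunk with its source."""
    context = "\n".join(
        f"[{i}] ({c.source}) {c.text}" for i, c in enumerate(chunks, 1)
    )
    return (
        "Answer using only the sources below and cite them by number.\n"
        f"{context}\n\nQuestion: {query}"
    )


question = "What is the pallet rate to Rotterdam?"
corpus = [
    Chunk("rate-card-week-12.txt", "Pallet rate to Rotterdam is 140 EUR"),
    Chunk("returns-policy.txt", "Returns are accepted within 30 days"),
]
before = build_prompt(question, retrieve(question, corpus))

# A rate card changes: replace the document, leave the model alone.
corpus[0] = Chunk("rate-card-week-13.txt", "Pallet rate to Rotterdam is 155 EUR")
after = build_prompt(question, retrieve(question, corpus))
```

The prompt built after the update carries the new rate and the new source name, which is the whole freshness argument in miniature: the change was a data edit, not a training run.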

What fine-tuning is actually good at

Fine-tuning changes behavior, not knowledge. Trying to stuff facts into model weights is the most common misuse we see: facts go stale, and the model blends them unpredictably with its pretraining. But for shaping how a model responds, fine-tuning is unmatched.

Format and style: when output must conform to a strict schema or a house style every single time, a fine-tuned model does natively what would otherwise take a 1,500-token prompt to approximate.
Domain reasoning patterns: teaching a model how your underwriters structure a risk assessment, not what today's rates are.
Latency-critical paths: those 1,500 prompt tokens cost time and money on every call; we cut one client's p95 latency by 40% by moving instructions from the prompt into the weights.
Vocabulary: jargon, product codes and internal shorthand that a base model mangles.
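The token-cost side of that latency argument is simple arithmetic. The sketch below takes the 1,500-token instruction prompt from the text; the call volume and the per-token price are made-up inputs for illustration only, so substitute your own.

```python
def instruction_overhead(
    prompt_tokens: int,
    calls_per_day: int,
    usd_per_million_input_tokens: float,
) -> tuple[int, float]:
    """Daily tokens and daily cost spent re-sending the same instructions."""
    tokens_per_day = prompt_tokens * calls_per_day
    cost_per_day = tokens_per_day / 1_000_000 * usd_per_million_input_tokens
    return tokens_per_day, cost_per_day


# 1,500 tokens comes from the article; 100,000 calls/day and 1.00 USD per
# million input tokens are hypothetical placeholders, not real pricing.
tokens, cost = instruction_overhead(1_500, 100_000, 1.0)
print(f"{tokens:,} instruction tokens/day, about {cost:.2f} USD/day")
```

Under those assumed inputs the instructions alone account for 150 million input tokens a day. Whether that justifies a training run depends on your real volume and pricing, which is why the numbers are parameters rather than constants.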

RAG teaches the model what's true today. Fine-tuning teaches it how to behave.

The decision list

Choose RAG when… the knowledge changes faster than you'd retrain, answers must cite sources, auditors need to trace claims to documents, or you're handling per-customer or per-region knowledge that can't live in shared weights.
Fine-tune when… output format or tone must be consistent and the prompt enforcing it has grown unwieldy, the task needs domain-specific reasoning steps, latency or token cost on a high-volume path justifies one-time training spend, or the model keeps fumbling your vocabulary.
Combine when… you need both reliable behavior and fresh facts — which, in our experience, is most serious enterprise use cases. Fine-tune for the behavior, ground with RAG for the facts.
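The three lines above can be read as a small decision function. This is a sketch of the list as written, not a scoring model: the `Workload` fields are hypothetical names for the signals in each line, and any single signal is treated as sufficient, which is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # Signals from the "Choose RAG when" line
    knowledge_changes_faster_than_retrain: bool = False
    must_cite_sources: bool = False
    auditors_trace_claims: bool = False
    per_customer_or_region_knowledge: bool = False
    # Signals from the "Fine-tune when" line
    strict_format_or_tone: bool = False
    domain_reasoning_steps: bool = False
    latency_or_token_cost_critical: bool = False
    vocabulary_problems: bool = False


def recommend(w: Workload) -> str:
    """Map the decision list onto 'rag', 'fine-tune', 'combine' or 'undecided'."""
    needs_rag = any([
        w.knowledge_changes_faster_than_retrain,
        w.must_cite_sources,
        w.auditors_trace_claims,
        w.per_customer_or_region_knowledge,
    ])
    needs_fine_tune = any([
        w.strict_format_or_tone,
        w.domain_reasoning_steps,
        w.latency_or_token_cost_critical,
        w.vocabulary_problems,
    ])
    if needs_rag and needs_fine_tune:
        return "combine"
    if needs_rag:
        return "rag"
    if needs_fine_tune:
        return "fine-tune"
    return "undecided"  # no signal from the list applies


# Hypothetical workload: answers must cite sources and follow a strict format.
print(recommend(Workload(must_cite_sources=True, strict_format_or_tone=True)))
```

Writing it this way makes one point visible: the two sets of signals are independent, so "combine" falls out whenever a workload trips at least one signal from each side.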

The hybrid pattern that keeps winning

Of our 12 production deployments, 7 ended up hybrid: a lightly fine-tuned model that knows how to be your assistant — format, tone, refusal behavior, domain reasoning — grounded at runtime on a governed corpus that knows what's currently true. The fine-tune is small and rarely retrained because behavior is stable; the corpus updates continuously because facts aren't. Each layer absorbs the kind of change it's cheap to absorb. A claims-processing copilot we run on this pattern has had its corpus updated more than 200 times since launch; the fine-tuned model has been retrained twice.

Key takeaways

RAG and fine-tuning solve different problems: knowledge versus behavior. Don't make them compete.
Default to RAG for anything that changes, needs citations, or faces an auditor.
Fine-tune for format, style, domain reasoning, vocabulary, and latency-critical paths — never to store facts.
The hybrid (fine-tuned behavior + RAG-grounded facts) won in 7 of our 12 production deployments.
Price the maintenance, not the build: a fine-tune carrying knowledge must be retrained every time that knowledge changes.

The cost conversation nobody has up front

Fine-tuning's training run is the cheap part. The expensive part is what you've signed up for afterwards: if any knowledge lives in the weights, every meaningful change to that knowledge means rebuilding the training set, retraining, re-running the full evaluation suite, and redeploying — typically days of work and a fresh round of risk review. We've watched a team outside one of our engagements burn a quarter doing monthly retrains to keep product facts current, solving with GPUs what a retrieval index solves with a cron job. Before any fine-tune, we make clients write down the answer to one question: how often does what this model is learning actually change? If the answer is "quarterly or faster," that material belongs in the corpus, not the weights — a sequencing question our AI & ML practice settles in the first week of an engagement.
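The "quarterly or faster" rule is easy to write down as a check you can run against an inventory of what the model is supposed to learn. A minimal sketch follows, assuming a quarter is at most 92 days; the inventory entries and their intervals are illustrative, apart from the weekly rate cards and monthly product facts mentioned in this article.

```python
from datetime import timedelta

QUARTER = timedelta(days=92)  # assumption: length of the longest calendar quarter


def placement(change_interval: timedelta) -> str:
    """Material that changes quarterly or faster goes in the corpus."""
    if change_interval <= QUARTER:
        return "corpus"
    return "fine-tune candidate"


# Hypothetical inventory: name -> how often the material actually changes.
inventory = {
    "rate cards": timedelta(weeks=1),
    "product facts": timedelta(days=30),
    "house style": timedelta(days=365),
}
plan = {name: placement(interval) for name, interval in inventory.items()}
print(plan)
```

Anything labeled "fine-tune candidate" still has to pass the behavior-versus-knowledge test from earlier sections; the check only rules material out of the weights, it does not rule it in.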

If you're staring at this decision for a real workload, bring it to us — mapping a use case onto this framework takes an afternoon, not a discovery phase.
