"Should we use RAG or fine-tune?" is the wrong question. It assumes the two techniques compete for the same job. They don't. Across the 12 LLM systems we've taken to production, the pattern is consistent: RAG and fine-tuning solve different problems, teams that pick one ideologically end up fighting their architecture, and the deployments that perform best usually use both — each doing the job it's actually good at.
Here's the framework we run clients through before a line of code is written. It's short, because the decision is simpler than the discourse around it suggests.
What RAG is actually good at
Retrieval-augmented generation injects knowledge at query time. That single property drives everything it wins at.
- Knowledge freshness: when a policy changes, you update a document and the system knows it within one sync cycle, with no retraining and no redeployment. For a logistics client whose rate cards change weekly, this alone settled the architecture.
- Auditability: every answer traces to specific source chunks, which is the difference between "the model said so" and an answer a compliance team will sign off on.
- Citations: users can click through to the source, which is the single biggest trust-builder we've measured in adoption data.
If your problem is "the model doesn't know our facts," RAG is the answer, full stop.
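To make "injects knowledge at query time" concrete, here is a minimal sketch of the retrieval step. It is illustrative only: the word-overlap scoring is a stand-in for whatever vector search a real system would use, and the `Chunk` type, the source ids and the corpus contents are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source_id: str  # hypothetical document identifier used for citations
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return {w.strip(".,:;!?\"'()").lower() for w in text.split()} - {""}


def retrieve(query: str, corpus: list[Chunk], k: int = 2) -> list[Chunk]:
    """Rank chunks by word overlap with the query (stand-in for vector search)."""
    q = tokenize(query)
    scored = [(len(q & tokenize(c.text)), c) for c in corpus]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Drop chunks that share no words with the query.
    return [c for score, c in ranked[:k] if score > 0]


def build_prompt(query: str, chunks: list[Chunk]) -> str:
    """Assemble a grounded prompt; every chunk keeps its source id."""
    context = "\n".join(f"[{c.source_id}] {c.text}" for c in chunks)
    return (
        "Answer using only the sources below and cite their ids.\n"
        f"Sources:\n{context}\n"
        f"Question: {query}"
    )


# Made-up corpus entries, for illustration only.
corpus = [
    Chunk("rates-2024-w14", "Pallet rate to Zone 3 is 42 EUR per pallet."),
    Chunk("returns-policy", "Returns are accepted within 30 days of delivery."),
]
question = "What is the pallet rate to Zone 3?"
hits = retrieve(question, corpus, k=1)
prompt = build_prompt(question, hits)
print(prompt)
```

Updating what the system "knows" means replacing an entry in `corpus`; nothing about the model changes. The source id travelling with each chunk is what makes the citation and the audit trail possible.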
What fine-tuning is actually good at
Fine-tuning changes behavior, not knowledge. Trying to stuff facts into model weights is the most common misuse we see: facts go stale, and the model blends them unpredictably with its pretraining. But for shaping how a model responds, fine-tuning is unmatched.
- Format and style: when output must conform to a strict schema or a house style every single time, a fine-tuned model does natively what would otherwise take a 1,500-token prompt to approximate.
- Domain reasoning patterns: teaching a model how your underwriters structure a risk assessment, not what today's rates are.
- Latency-critical paths: a 1,500-token instruction prompt costs time and money on every call; we cut one client's p95 latency by 40% by moving instructions from the prompt into the weights.
- Vocabulary: jargon, product codes and internal shorthand that a base model mangles.
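The token-cost argument is simple arithmetic, sketched below. Only the 1,500-token figure comes from the text; the call volume, the per-token price and the training cost are assumptions chosen to make the numbers easy to follow, and the sketch ignores any difference in inference pricing between a base and a fine-tuned model.

```python
def instruction_overhead_cost(
    instruction_tokens: int,
    calls_per_month: int,
    price_per_million_input_tokens: float,
) -> float:
    """Monthly spend on instruction tokens that are repeated on every call."""
    tokens_per_month = instruction_tokens * calls_per_month
    return tokens_per_month / 1_000_000 * price_per_million_input_tokens


def months_to_break_even(training_cost: float, monthly_saving: float) -> float:
    """How long the prompt savings take to repay a one-time training spend."""
    return training_cost / monthly_saving


# Assumed figures: 2M calls per month, $3 per 1M input tokens, $4,500 training spend.
monthly = instruction_overhead_cost(1_500, 2_000_000, 3.0)
payback = months_to_break_even(4_500.0, monthly)
print(f"${monthly:,.2f} per month, break-even after {payback:.1f} months")
# prints: $9,000.00 per month, break-even after 0.5 months
```

On a low-volume path the same calculation usually points the other way, which is why the volume figure matters more than the prompt length.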
RAG teaches the model what's true today. Fine-tuning teaches it how to behave.
The decision list
- Choose RAG when… the knowledge changes faster than you'd retrain, answers must cite sources, auditors need to trace claims to documents, or you're handling per-customer or per-region knowledge that can't live in shared weights.
- Fine-tune when… output format or tone must be consistent and the prompt enforcing it has grown unwieldy, the task needs domain-specific reasoning steps, latency or token cost on a high-volume path justifies one-time training spend, or the model keeps fumbling your vocabulary.
- Combine when… you need both reliable behavior and fresh facts — which, in our experience, is most serious enterprise use cases. Fine-tune for the behavior, ground with RAG for the facts.
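The list above can be written down as a small checklist function. This is a sketch of the decision logic only, not a tool we ship; the `Workload` fields are hypothetical names for the signals in the three bullets.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # Signals that point to RAG
    knowledge_changes_faster_than_retrain: bool = False
    needs_citations: bool = False
    needs_audit_trail: bool = False
    per_customer_or_region_knowledge: bool = False
    # Signals that point to fine-tuning
    format_or_tone_prompt_unwieldy: bool = False
    needs_domain_reasoning_steps: bool = False
    high_volume_latency_or_cost_pressure: bool = False
    model_fumbles_vocabulary: bool = False


def recommend(w: Workload) -> str:
    """Map the signals onto 'rag', 'fine-tune', 'combine', or 'no strong signal'."""
    wants_rag = any([
        w.knowledge_changes_faster_than_retrain,
        w.needs_citations,
        w.needs_audit_trail,
        w.per_customer_or_region_knowledge,
    ])
    wants_fine_tune = any([
        w.format_or_tone_prompt_unwieldy,
        w.needs_domain_reasoning_steps,
        w.high_volume_latency_or_cost_pressure,
        w.model_fumbles_vocabulary,
    ])
    if wants_rag and wants_fine_tune:
        return "combine"
    if wants_rag:
        return "rag"
    if wants_fine_tune:
        return "fine-tune"
    return "no strong signal"


# Example: strict output format plus facts that must be cited.
print(recommend(Workload(needs_citations=True, format_or_tone_prompt_unwieldy=True)))
# prints: combine
```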
The hybrid pattern that keeps winning
Of our 12 production deployments, 7 ended up hybrid: a lightly fine-tuned model that knows how to be your assistant — format, tone, refusal behavior, domain reasoning — grounded at runtime on a governed corpus that knows what's currently true. The fine-tune is small and rarely retrained because behavior is stable; the corpus updates continuously because facts aren't. Each layer absorbs the kind of change it's cheap to absorb. A claims-processing copilot we run on this pattern has had its corpus updated more than 200 times since launch; the fine-tuned model has been retrained twice.
Key takeaways
- RAG and fine-tuning solve different problems: knowledge versus behavior. Don't make them compete.
- Default to RAG for anything that changes, needs citations, or faces an auditor.
- Fine-tune for format, style, domain reasoning, vocabulary, and latency-critical paths — never to store facts.
- The hybrid (fine-tuned behavior + RAG-grounded facts) won in 7 of our 12 production deployments.
- Price the maintenance, not the build: a fine-tune carrying knowledge must be retrained every time that knowledge changes.
The cost conversation nobody has up front
Fine-tuning's training run is the cheap part. The expensive part is what you've signed up for afterwards: if any knowledge lives in the weights, every meaningful change to that knowledge means rebuilding the training set, retraining, re-running the full evaluation suite, and redeploying. That is typically days of work and a fresh round of risk review. We've watched a team outside our own engagements burn a quarter on monthly retrains to keep product facts current, solving with GPUs what a retrieval index solves with a cron job. Before any fine-tune, we make clients write down the answer to one question: how often does what this model is learning actually change? If the answer is "quarterly or faster," that material belongs in the corpus, not the weights. It's a sequencing question our AI & ML practice settles in the first week of an engagement.
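The "quarterly or faster" rule is easy to apply to an inventory of what the model is supposed to learn. The sketch below assumes a made-up inventory and an assumed three days of work per retrain cycle; only the quarterly threshold comes from the text.

```python
QUARTERLY = 4  # changes per year; the threshold named in the text


def belongs_in(changes_per_year: float) -> str:
    """Quarterly-or-faster material goes to the corpus, not the weights."""
    return "corpus" if changes_per_year >= QUARTERLY else "fine-tune candidate"


def annual_retrain_days(changes_per_year: float, days_per_retrain: float) -> float:
    """Working days per year spent retraining if this material lived in the weights."""
    return changes_per_year * days_per_retrain


# Hypothetical inventory: (what the model would learn, changes per year).
inventory = [
    ("product facts", 12),
    ("rate cards", 52),
    ("output schema", 1),
    ("house style", 0.5),
]

for name, changes in inventory:
    # 3 days per retrain cycle is an assumed figure, not a measurement.
    days = annual_retrain_days(changes, days_per_retrain=3)
    print(f"{name:14} -> {belongs_in(changes):19} ({days:g} retrain days/yr if in weights)")
```

Under these assumptions, monthly-changing product facts would cost 36 working days a year to keep current through retraining, which is the maintenance bill the build estimate tends to leave out.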
If you're staring at this decision for a real workload, bring it to us — mapping a use case onto this framework takes an afternoon, not a discovery phase.