RAG vs Finetuning — The Decision Framework | AI Engineering for Builders

The question every team hits

You've built a working prompt. You've run evals. You know where it fails. And the failures cluster around a specific pattern: the model doesn't know something it needs to know, or it doesn't behave the way you need it to behave consistently enough.

At this point, most teams jump to one of two solutions — RAG or finetuning — based on instinct or what they saw in a blog post. Both are real tools with legitimate uses. But they solve different problems, and applying the wrong one wastes weeks on infrastructure that won't move your quality metrics.

This chapter gives you the decision framework. It starts with the question you should ask before reaching for either.

Start here: can prompting alone solve it?

Before investing in RAG or finetuning, exhaust prompting and context engineering. Both RAG and finetuning take days to implement and iterate on. A prompt change takes minutes. You want a strong prior that you've actually hit prompting's ceiling before spending engineering time on infrastructure.

Signs you've hit prompting's ceiling:

The model is failing on knowledge it genuinely doesn't have — not knowledge it's failing to recall, but information that postdates its training or never existed in public data
You've tried multiple prompt formulations and few-shot examples and quality has plateaued
Your system prompt is already long and dense with special-case handling
The failure mode is systematic, not random — the same class of input fails every time

If you're still seeing random failures across diverse inputs, that's usually a prompt design or evaluation problem, not a prompting limitation. Fix the specification before reaching for bigger tools.

When to use RAG

RAG (Retrieval-Augmented Generation) solves one class of problem: the model lacks the information it needs to answer correctly, and that information exists in a corpus you control.

The three cases where RAG is the right answer:

Private data. The model has never seen your internal documentation, your customer records, your proprietary knowledge base, or your product catalog. It can't know this through any amount of prompting — the information was never in its training data. RAG makes this information available at inference time.

Current information. Models have training cutoffs. GPT-4o's knowledge, for example, ends in late 2023. If your application needs to reason about recent events, current pricing, or anything that changes faster than model release cycles, you need retrieval.

Precise factual grounding. For applications where hallucination is costly — legal, medical, compliance — you want the model working from a defined set of source documents rather than its internal approximation of the world. RAG constrains the model to a known source, and you can verify citations.

How RAG works

A RAG pipeline has three stages:

User query
    ↓
RETRIEVAL: Query → embedding → vector search → top-k documents
    ↓
AUGMENTATION: Retrieved documents → formatted and inserted into context
    ↓
GENERATION: Model produces answer grounded in retrieved content

Retrieval starts by embedding the user's query — converting it to a vector — and searching a vector store for the most semantically similar document chunks. The top-k results (typically 3–10) are returned as candidates.

Augmentation formats those candidates into the context. Usually: a prompt instruction ("Answer based only on the following documents"), the retrieved documents in order, then the user's question.

Generation produces the response. If retrieval worked correctly, the model has what it needs in context and can answer accurately without falling back to training-data approximations.
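The three stages can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector store, and the sample chunks and function names are invented for the example. The generation call is left out; the printed prompt is what you would send to the model.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """RETRIEVAL: rank every chunk by similarity to the query, keep the top k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def augment(query: str, docs: list[str]) -> str:
    """AUGMENTATION: instruction, then retrieved documents in order, then the question."""
    numbered = "\n".join(f"[{i}] {d}" for i, d in enumerate(docs, start=1))
    return (
        "Answer based only on the following documents.\n\n"
        f"{numbered}\n\nQuestion: {query}"
    )

# Hypothetical corpus, already split into chunks.
CHUNKS = [
    "Refunds are issued within 14 days of a return request.",
    "The Pro plan includes priority support and SSO.",
    "Shipping to Canada takes 5 to 7 business days.",
]

query = "How long do refunds take?"
prompt = augment(query, retrieve(query, CHUNKS, k=2))
print(prompt)  # GENERATION: this string is what gets sent to the model
```

In a real system, `embed` would call an embedding model and `retrieve` would query a vector index rather than scoring every chunk, but the shape of the pipeline is the same.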

What makes retrieval fail

Most RAG failures aren't generation failures. The model is generating fine — but what it's generating from is wrong. Retrieval is the component that breaks most often, in predictable ways.

Bad chunking. Documents fed into a vector store need to be split into chunks for indexing. Chunk too small and individual chunks lose meaning — a sentence fragment isn't enough context to be useful. Chunk too large and you waste context window space and dilute the relevance signal. The right chunk size depends on your document structure. Code documentation chunks differently than legal contracts. There's no universal answer — test a few strategies against your eval set.
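To make the size trade-off testable, it helps to have chunking behind a single function whose parameters you can sweep. The sketch below splits on words and lets consecutive chunks share a few words; the overlap is a common convention rather than something this chapter prescribes, and the function name and defaults are assumptions for illustration.

```python
def chunk_words(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of `size` words; consecutive chunks share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

print(chunk_words("a b c d e f g h i j", size=4, overlap=1))
# ['a b c d', 'd e f g', 'g h i j']
```

Run the same eval set against a few `size` and `overlap` settings and keep the one with the best retrieval results. Structure-aware splitting (by heading, clause, or function) is often worth trying as well, but it depends on the document type.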

Mismatched embedding model. The model you use to embed documents and the model you use to embed queries need to share a semantic space, so use the same embedding model for both. Beyond that, most embedding models are trained on general text. If your corpus is highly specialized (medical literature, legal documents, proprietary terminology), a general embedding model may not capture domain-specific similarity well. Specialized embedding models exist for some domains.

Top-k too small. If the right document isn't in your top-k results, the model can't use it. A common failure mode: you set k=3 to keep context short, but the relevant document is consistently ranked 4th or 5th. Increasing k improves recall at the cost of context window space and noise. Reranking is a useful middle ground: retrieve a larger candidate set, re-score it with a slower but more precise model, and pass only the best few to the generator.
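A retrieve-then-rerank step might look like the sketch below. Both scorers are toy stand-ins: `cheap_score` plays the role of vector similarity and `precise_score` plays the role of a dedicated reranking model. The names, the sample chunks, and the `fetch_k` and `final_k` parameters are assumptions for illustration.

```python
from typing import Callable

Scorer = Callable[[str, str], float]

def retrieve_then_rerank(
    query: str,
    chunks: list[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    fetch_k: int = 20,
    final_k: int = 3,
) -> list[str]:
    # Stage 1: a wide, cheap pass that optimizes for recall.
    candidates = sorted(chunks, key=lambda c: cheap_score(query, c), reverse=True)[:fetch_k]
    # Stage 2: a narrow, more precise pass over the candidates only.
    return sorted(candidates, key=lambda c: precise_score(query, c), reverse=True)[:final_k]

def cheap_score(query: str, doc: str) -> float:
    """Toy first-stage scorer: how many document words also appear in the query."""
    q_words = set(query.lower().split())
    return float(sum(1 for w in doc.lower().split() if w in q_words))

def precise_score(query: str, doc: str) -> float:
    """Toy reranker: rewards documents that contain the query as an exact phrase."""
    return 1.0 if query.lower() in doc.lower() else 0.0

chunks = [
    "the password policy says reset tokens expire and each password is audited",
    "open settings and choose reset password",
    "invoices are emailed monthly",
]

# The cheap scorer alone ranks the policy chunk first; the reranker promotes the how-to chunk.
print(retrieve_then_rerank("reset password", chunks, cheap_score, precise_score, fetch_k=2, final_k=1))
# ['open settings and choose reset password']
```

The point of the two stages is cost: the precise scorer only ever sees `fetch_k` candidates, so it can afford to be much slower per document than the first pass.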

Query-document mismatch. Users ask questions. Your documents contain answers. Embedding a question and an answer doesn't always produce similar vectors — they're different kinds of text. Techniques like HyDE (Hypothetical Document Embeddings) work around this by generating a hypothetical answer to the query, then searching for documents similar to the hypothetical answer rather than the original question.
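A sketch of the HyDE idea, under heavy simplification: `similarity` is a toy bag-of-words cosine standing in for an embedding model, and `fake_llm` returns a canned passage where a real system would call a language model. All names and sample texts are assumptions for illustration.

```python
import math
import re
from collections import Counter
from typing import Callable

def similarity(a: str, b: str) -> float:
    """Toy stand-in for embedding similarity: cosine over word counts."""
    va, vb = (Counter(re.findall(r"[a-z0-9]+", t.lower())) for t in (a, b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def search(text: str, chunks: list[str], k: int = 1) -> list[str]:
    return sorted(chunks, key=lambda c: similarity(text, c), reverse=True)[:k]

def hyde_search(query: str, chunks: list[str], generate: Callable[[str], str], k: int = 1) -> list[str]:
    # Ask the model for a plausible answer, then search with that answer instead of the question.
    hypothetical = generate(f"Write a short passage that answers: {query}")
    return search(hypothetical, chunks, k)

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; returns a canned hypothetical answer."""
    return "Refunds are issued to your payment method within a few days."

chunks = [
    "Refunds are issued to the original payment method within 14 days.",
    "How do I change my delivery address? Open your account page.",
]
query = "How do I get my money back?"

print(search(query, chunks))                 # the question matches the other question
print(hyde_search(query, chunks, fake_llm))  # the hypothetical answer matches the real answer
```

The hypothetical answer does not need to be correct. It only needs to look like the kind of text that contains the answer, so that it lands near the right documents.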

No retrieval signal at all. Some queries don't benefit from retrieval because no document in your corpus is relevant. The model should recognize this and say so — not hallucinate an answer because you're forcing it to use the retrieved context. Your system prompt should instruct the model explicitly: "If the retrieved documents don't contain information sufficient to answer the question, say so. Do not use information not present in the documents."
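One way to wire this in is to put the instruction in the prompt and, in addition, drop candidates whose similarity score is too low to be worth showing to the model. The sketch below assumes retrieval returns `(score, text)` pairs; the `min_score` threshold of 0.3 is an arbitrary placeholder you would tune against your eval set, and the function name is hypothetical.

```python
REFUSAL_RULE = (
    "If the retrieved documents don't contain information sufficient to answer "
    "the question, say so. Do not use information not present in the documents."
)

def build_prompt(query: str, scored_docs: list[tuple[float, str]], min_score: float = 0.3) -> str:
    """Keep only candidates at or above min_score; tell the model when nothing qualified."""
    kept = [doc for score, doc in scored_docs if score >= min_score]
    docs = "\n".join(f"- {d}" for d in kept) if kept else "(no relevant documents were retrieved)"
    return f"{REFUSAL_RULE}\n\nDocuments:\n{docs}\n\nQuestion: {query}"

print(build_prompt("What is the refund window?", [(0.82, "Refunds are issued within 14 days.")]))
print(build_prompt("Who won the match last night?", [(0.05, "Refunds are issued within 14 days.")]))
```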

Evaluating retrieval separately from generation

A common mistake: evaluating the end-to-end RAG system as a unit. When quality is bad, you can't tell if retrieval failed (wrong documents) or generation failed (right documents, wrong answer).

Evaluate them separately:

Retrieval evaluation: For a set of test queries, does the correct document appear in the top-k results? This is measurable — it's a recall metric over your document corpus. Fix retrieval quality before optimizing generation.
Generation evaluation: Given the correct documents in context, does the model produce a correct answer? This is your standard AI-as-judge evaluation from Chapter 3.

Most RAG quality problems are retrieval problems. Investing in generation quality before retrieval is solid will produce diminishing returns.
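The retrieval half of that evaluation is a few lines of code once you have labeled test queries. The sketch below assumes one known-correct document per query and retrieval results stored as ranked lists of document ids; the ids and data are made up for illustration.

```python
def recall_at_k(results: dict[str, list[str]], relevant: dict[str, str], k: int) -> float:
    """Fraction of test queries whose known-correct document id appears in the top k results."""
    hits = sum(1 for q, doc_id in relevant.items() if doc_id in results.get(q, [])[:k])
    return hits / len(relevant)

# Hypothetical labels: the one document that answers each test query.
relevant = {"q1": "d1", "q2": "d7", "q3": "d4"}

# Hypothetical retrieval output: ranked document ids per query.
results = {
    "q1": ["d1", "d2", "d3", "d9", "d5"],
    "q2": ["d3", "d2", "d9", "d7", "d1"],  # correct document is ranked 4th
    "q3": ["d8", "d2", "d6", "d5", "d3"],  # correct document was never retrieved
}

print(recall_at_k(results, relevant, k=3))  # 1 of 3 queries hit
print(recall_at_k(results, relevant, k=5))  # 2 of 3 queries hit
```

Comparing recall at several values of k tells you whether a larger k (or a reranker) would help, as with q2, or whether the problem is upstream in chunking or embeddings, as with q3.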

When to use finetuning

Finetuning updates the model's weights — you're literally changing the parameters to encode your examples. The result is a model that produces a different distribution of outputs than the base model, not just a base model conditioned on a long prompt.

Most teams that try finetuning end up disappointed. The reasons cluster into predictable failure patterns:

Teams finetune to teach knowledge. This is the most common mistake. You want the model to know about your company, your product, your domain. So you collect documents and finetune on them. The model doesn't reliably learn facts from finetuning data — it learns patterns and style. For knowledge, you need RAG.

Teams finetune without a baseline. You can't know if finetuning helped without measuring what the base model + prompting achieves first. Teams skip evaluation, finetune, get qualitatively "better-seeming" responses, and ship. Whether they actually improved — and whether the improvement was worth the infrastructure cost — is unknown.

Teams use too little data. Finetuning on a few hundred examples often degrades the model's general capabilities without producing consistent target behavior. Meaningful finetuning typically requires thousands of high-quality examples.

Three conditions that justify finetuning

Despite these failure modes, finetuning is the right tool in specific situations.

Consistent style, tone, or format that's hard to specify. You want every response to sound like it was written by a specific persona. Or you want output that reliably matches a document format your users recognize. Or you're producing code in a framework with conventions that few-shot examples can't capture at the required consistency. When the target behavior is a style rather than a knowledge gap, and few-shot examples aren't producing it reliably, finetuning is the lever.

Latency or cost constraints at scale. A long system prompt with extensive few-shot examples costs tokens on every call. At high enough volume, those input tokens are expensive. Finetuning can encode behavior that would otherwise live in a long prompt — meaning you can use a shorter prompt with a finetuned model and get equivalent or better quality.
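The arithmetic is worth doing before you commit. Every number in the sketch below is a made-up placeholder (prompt lengths, call volume, and per-token prices), and it assumes the finetuned model is billed at a higher per-token rate, which you should check against your provider's current pricing.

```python
def monthly_input_cost(prompt_tokens: int, calls_per_month: int, usd_per_million_tokens: float) -> float:
    """Input-token spend per month for a fixed-size prompt sent on every call."""
    return prompt_tokens * calls_per_month * usd_per_million_tokens / 1_000_000

CALLS = 10_000_000  # hypothetical monthly volume

# Base model: long system prompt with few-shot examples (all values hypothetical).
long_prompt = monthly_input_cost(prompt_tokens=3_000, calls_per_month=CALLS, usd_per_million_tokens=0.50)

# Finetuned model: short prompt, assumed to be billed at twice the per-token rate.
short_prompt = monthly_input_cost(prompt_tokens=300, calls_per_month=CALLS, usd_per_million_tokens=1.00)

print(long_prompt)                 # 15000.0
print(short_prompt)                # 3000.0
print(long_prompt - short_prompt)  # 12000.0
```

Set the monthly difference against the one-time cost of data preparation and training, plus the cost of retraining whenever the desired behavior changes. At low volume the savings rarely cover it.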

Specialized task performance with sufficient data. For a narrow, well-defined task where you have thousands of high-quality examples and you've verified that prompting + RAG has hit a ceiling, finetuning can close the remaining gap. Medical coding, legal clause classification, domain-specific entity extraction — these are cases where task specialization genuinely justifies the investment.

What finetuning won't fix: knowledge gaps (use RAG), prompt design problems (fix the prompt), or low-quality training data (finetuning amplifies the pattern in your data, including the mistakes).

PEFT and LoRA: the practical entry point

Full finetuning, which updates all of a model's parameters, is expensive and usually unnecessary. Frontier models' parameter counts are mostly undisclosed, but open-weight models already range from billions to hundreds of billions of parameters. Retraining all of them on your dataset requires compute that isn't available through most APIs and infrastructure that most teams don't have.

PEFT (Parameter-Efficient Fine-Tuning) is a family of techniques that update a small fraction of model parameters while leaving the rest frozen. The most widely used method is LoRA (Low-Rank Adaptation).

LoRA works by inserting small trainable matrices (adapters) at specific points in the transformer architecture. During finetuning, only these adapter weights are updated — the original model weights stay frozen. In practice this means:

Training data requirements are lower
Compute requirements are a fraction of full finetuning
The base model's general capabilities are mostly preserved (a common failure mode of full finetuning is "catastrophic forgetting" — the model gets better at your task and worse at everything else)
Adapters can be switched, combined, or disabled without re-downloading the full model
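To see where the savings come from, here is LoRA's core idea in plain Python. For a frozen weight matrix W of shape d_out × d_in, LoRA trains two small matrices, B (d_out × r) and A (r × d_in), and computes W·x + (alpha / r)·B·(A·x). The layer size, rank, and values below are illustrative choices, and real implementations use tensor libraries rather than nested lists.

```python
def full_params(d_out: int, d_in: int) -> int:
    return d_out * d_in

def lora_params(d_out: int, d_in: int, rank: int) -> int:
    # B is d_out x rank, A is rank x d_in
    return rank * (d_out + d_in)

def matvec(m: list[list[float]], v: list[float]) -> list[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha: float, rank: int) -> list[float]:
    """Frozen base output plus the scaled low-rank update: W x + (alpha / rank) * B (A x)."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * d for b, d in zip(base, delta)]

# Trainable fraction for one 4096 x 4096 layer at rank 8.
print(lora_params(4096, 4096, 8) / full_params(4096, 4096))  # 0.00390625

W = [[1.0, 2.0], [3.0, 4.0]]  # frozen pretrained weights
A = [[0.5, -0.5]]             # rank-1 adapter, shape 1 x 2
B_init = [[0.0], [0.0]]       # B starts at zero, so the adapter is a no-op before training
print(lora_forward(W, A, B_init, [1.0, 1.0], alpha=8, rank=1))  # [3.0, 7.0], same as W x

B_trained = [[1.0], [2.0]]    # made-up "trained" values
print(lora_forward(W, A, B_trained, [2.0, 0.0], alpha=2, rank=1))  # [4.0, 10.0]
```

Because W is never modified, disabling the adapter is just skipping the second term, and swapping adapters means swapping the small A and B matrices.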

Major platforms have made finetuning accessible. OpenAI offers supervised finetuning via API for GPT-4o mini (as a managed service, without exposing the underlying method); Together AI and Replicate provide LoRA finetuning infrastructure for open-source models; Hugging Face's PEFT library makes it straightforward to run LoRA locally on models like Llama 3.

The practical implication: if you're going to explore finetuning, start with LoRA/PEFT on a smaller open-source model rather than full finetuning on a frontier model. The cost and iteration speed are more forgiving, and the technique transfers.

The decision tree

Can prompting alone solve it?
    YES → Improve prompts and context design. Evaluate.
    NO  ↓

Does it fail because the model lacks information (recency, private data)?
    YES → RAG
    NO  ↓

Is the failure a style, format, or behavior consistency problem?
    Has prompting + few-shot hit a ceiling?
        YES, and you have thousands of quality examples → Finetuning (start with LoRA)
        NO  → More prompt iteration and eval
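If it helps to keep the order of the checks explicit, for example in a team checklist or a design-review template, the tree reduces to a small function. The function and parameter names are invented for illustration; the returned strings are the tree's own outcomes.

```python
def recommend(
    *,
    prompting_solves_it: bool,
    lacks_information: bool,
    style_or_consistency_problem: bool,
    few_shot_plateaued: bool,
    has_thousands_of_examples: bool,
) -> str:
    """Walk the decision tree top to bottom and return the first matching outcome."""
    if prompting_solves_it:
        return "Improve prompts and context design. Evaluate."
    if lacks_information:
        return "RAG"
    if style_or_consistency_problem and few_shot_plateaued and has_thousands_of_examples:
        return "Finetuning (start with LoRA)"
    return "More prompt iteration and eval"

print(recommend(
    prompting_solves_it=False,
    lacks_information=True,
    style_or_consistency_problem=True,
    few_shot_plateaued=True,
    has_thousands_of_examples=True,
))  # RAG: the knowledge check comes before the finetuning check
```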

One more check before finetuning: make sure you have an eval set, a baseline measurement, and a clear quality target. If you can't answer "how will I know if finetuning worked?" you're not ready to start.

What the options cost

Not just in dollars — in iteration speed. Every engineering decision about data infrastructure has a time cost that doesn't show up in pricing tables.

Approach	Time to first result	Ongoing iteration speed	Primary cost
Prompting	Minutes	Fast	API tokens
RAG	Days (pipeline)	Moderate	Infrastructure + tokens
Finetuning	Weeks (data + training)	Slow	Data labeling + training compute

This is why the decision tree starts with prompting. Even if RAG or finetuning is ultimately the right answer, validating the problem against a simple prompting baseline first is almost always worth the few hours it takes.

The teams that ship AI products fastest are the ones who stay in the prompting loop as long as possible — and move to retrieval and finetuning only when they've confirmed those tools are actually necessary for their specific quality problem.

Chapter 6 takes you to production: how all of these pieces fit into a system that can be monitored, improved, and relied on.

Primary sources: Chip Huyen, AI Engineering (O'Reilly, 2025), Chapters 6–8. Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Meta AI, 2020). Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (Microsoft, 2021). Gao et al., "Precise Zero-Shot Dense Retrieval without Relevance Labels" (HyDE, CMU, 2022).
