You need an AI system that knows your company's data. Should you fine-tune a model or use retrieval-augmented generation (RAG)? Here's a decision framework based on what actually works in production.
Quick definitions
RAG (Retrieval-Augmented Generation): Store your documents in a vector database. At runtime, search for relevant chunks and pass them to a general-purpose LLM in the prompt. The model answers based on the retrieved context.
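The retrieval step can be sketched in a few lines. This is a toy illustration, not a production pipeline: a bag-of-words cosine similarity stands in for a real embedding model and vector database, the documents are invented, and the final LLM call is omitted — in practice you would send `build_prompt`'s output to your model's API.

```python
from collections import Counter
import math

# Toy corpus standing in for an indexed document store.
DOCS = [
    "Refunds are processed within 14 days of the return request.",
    "Premium plans include priority support and a 99.9% uptime SLA.",
    "Password resets can be triggered from the account settings page.",
]

def _vector(text: str) -> Counter:
    # Bag-of-words stand-in for a real embedding model.
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank documents by similarity to the query and keep the top k chunks.
    qv = _vector(query)
    ranked = sorted(DOCS, key=lambda d: _cosine(qv, _vector(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    # Inject the retrieved chunks into the prompt so the model answers from them.
    context = "\n".join(f"- {chunk}" for chunk in retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("How long do refunds take?")
```

The key design point is visible even in the toy version: the model never sees your whole corpus, only the few chunks the retriever selected for this query.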
Fine-tuning: Take a base model (GPT-3.5, Llama, Mistral) and train it further on your specific examples. The model internalizes your data and style.
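Fine-tuning work is dominated by data preparation. A sketch of turning past support tickets into training examples, using the chat-style JSONL layout common to hosted fine-tuning APIs — the tickets and system prompt here are invented for illustration:

```python
import json

# Hypothetical past tickets: (customer message, approved agent reply).
tickets = [
    ("Customer can't log in after password change.",
     "Thanks for reaching out! Please clear your browser cache and retry the reset link."),
    ("Invoice shows the wrong billing address.",
     "Sorry about that! I've updated your billing address; the corrected invoice is attached."),
]

def to_training_example(question: str, answer: str) -> dict:
    # One chat transcript per example: system prompt fixes the voice,
    # the assistant turn is the target the model learns to imitate.
    return {
        "messages": [
            {"role": "system", "content": "You are a concise, empathetic support agent."},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }

# One JSON object per line — the usual upload format for fine-tuning jobs.
jsonl = "\n".join(json.dumps(to_training_example(q, a)) for q, a in tickets)
```

Note that the model learns the *style* of the assistant turns, which is exactly why fine-tuning shines for tone and format but is a poor fit for fast-changing facts.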
When to choose RAG
Use cases that fit RAG
- Knowledge bases and documentation: Product manuals, internal wikis, support articles.
- Frequently changing content: Pricing, policies, compliance docs that update monthly.
- Large, diverse documents: Legal contracts, research papers, technical specifications.
- Need for citations: You must show which document supported an answer.
- Multi-tenant systems: Each customer has separate data that shouldn't leak.
Advantages of RAG
- Fast to build: Production-ready in 2–4 weeks with proper chunking and retrieval.
- Easy updates: Add new documents by indexing them—no retraining required.
- Transparency: You can audit which chunks were retrieved for each answer.
- Lower upfront cost: No expensive training runs.
- Flexible: Swap models (GPT-4 → Claude) without rebuilding everything.
Disadvantages of RAG
- Latency: The two-step process (retrieve, then generate) adds roughly 200–800 ms per query.
- Retrieval quality: If the retriever misses the right chunk, the model hallucinates or refuses to answer.
- Ongoing inference cost: Passing long retrieved chunks every request costs tokens.
- Limited style control: Can't deeply change the model's tone or structure.
Typical RAG costs
- Development: €8,000–€25,000 (parsing, chunking, vector DB, retrieval tuning)
- Hosting: €100–€500/month (vector DB, caching)
- Inference: €0.02–€0.10 per query depending on context length
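The per-query inference figure follows directly from token counts and token prices. A back-of-envelope calculator — the token prices below are illustrative placeholders, not current rates for any specific provider:

```python
def rag_query_cost(context_tokens: int, question_tokens: int, answer_tokens: int,
                   price_in_per_1k: float = 0.01,
                   price_out_per_1k: float = 0.03) -> float:
    """Rough per-query cost in euros. Prices are illustrative assumptions."""
    # RAG pays for the retrieved context on every single request.
    input_cost = (context_tokens + question_tokens) / 1000 * price_in_per_1k
    output_cost = answer_tokens / 1000 * price_out_per_1k
    return input_cost + output_cost

# ~3,000 tokens of retrieved context, a short question, a medium answer.
cost = rag_query_cost(context_tokens=3000, question_tokens=50, answer_tokens=300)
```

With these assumed prices the example lands around €0.04 per query — inside the €0.02–€0.10 band above, and dominated by the retrieved context, which is why context length is the main RAG cost lever.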
When to choose fine-tuning
Use cases that fit fine-tuning
- Consistent style or format: Generate reports, emails, or summaries in a specific voice.
- Domain-specific language: Medical codes, legal jargon, technical terminology.
- Behavior modification: Make the model more concise, formal, or creative by default.
- Classification and extraction: Label customer intent, extract entities, or score sentiment.
- Cost-sensitive, high-volume: Millions of inferences where prompt length matters.
Advantages of fine-tuning
- Lower latency: Single inference, no retrieval step.
- Smaller prompts: Knowledge is baked in, so you use fewer input tokens.
- Better style consistency: The model "speaks" your language natively.
- Works offline: Deploy a self-hosted model without external dependencies.
Disadvantages of fine-tuning
- Upfront effort: Prepare 500–10,000 high-quality training examples.
- Expensive: Training runs cost €1,000–€10,000+ depending on model size.
- Hard to update: Incorporating new information requires another training run — you can't simply "add a document."
- Risk of overfitting: Small datasets can make the model brittle.
- No citations: You can't trace why the model gave a specific answer.
Typical fine-tuning costs
- Data preparation: €5,000–€15,000 (labeling, formatting, validation)
- Training: €1,000–€10,000 per run (more for larger models)
- Hosting: €200–€2,000/month (self-hosted) or pay-per-token (API fine-tunes)
- Inference: Lower per-query cost if prompts are short
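The "lower per-query cost" claim only pays off past a certain volume, because fine-tuning front-loads a fixed cost. A simple break-even sketch, using illustrative numbers drawn from the ranges above:

```python
def breakeven_queries(ft_fixed_cost: float,
                      rag_cost_per_query: float,
                      ft_cost_per_query: float) -> float:
    """Queries needed before fine-tuning's fixed cost is recouped
    by its cheaper per-query inference."""
    saving = rag_cost_per_query - ft_cost_per_query
    if saving <= 0:
        # Fine-tuned inference isn't cheaper: it never breaks even.
        return float("inf")
    return ft_fixed_cost / saving

# Assumed: €10k fixed (data prep + training), €0.04/query RAG vs €0.01/query fine-tuned.
n = breakeven_queries(10_000, 0.04, 0.01)  # roughly 333k queries
```

At tens of queries per day the break-even point is years away; at millions per month it arrives in weeks — which is why "cost-sensitive, high-volume" heads the fine-tuning use-case list.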
Comparison matrix
| Factor | RAG | Fine-Tuning |
|---|---|---|
| Time to production | 2–4 weeks | 6–12 weeks |
| Upfront cost | €8k–€25k | €15k–€50k |
| Ongoing cost | Higher (long prompts) | Lower (short prompts) |
| Latency | Higher (2 steps) | Lower (1 step) |
| Updates | Easy (add docs) | Hard (retrain) |
| Citations | Yes | No |
| Style control | Limited | Strong |
| Best for | Knowledge Q&A | Format/style tasks |
Can you combine them?
Yes. Fine-tune for style and structure, then use RAG to inject current facts. Example: A customer support bot fine-tuned to write concise, empathetic responses, with RAG pulling the latest product documentation.
This gives you style consistency (fine-tuning) and up-to-date information (RAG), but adds complexity and cost.
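Structurally, the hybrid is just the RAG pipeline with the generic LLM swapped for your fine-tuned model. A skeleton with both external calls stubbed out — `retrieve_kb` and `call_finetuned_model` are hypothetical placeholders for your vector search and model endpoint:

```python
def retrieve_kb(query: str) -> list[str]:
    # Placeholder for the RAG step: vector search over the latest KB articles.
    return ["KB-142: Restart the sync agent after a failed update."]

def call_finetuned_model(prompt: str) -> str:
    # Placeholder for the fine-tuned model, which supplies tone and structure.
    return f"[styled response based on prompt of {len(prompt)} chars]"

def answer(query: str) -> str:
    # RAG provides the current facts; the fine-tuned model provides the voice.
    context = "\n".join(retrieve_kb(query))
    prompt = f"Context:\n{context}\n\nCustomer question: {query}"
    return call_finetuned_model(prompt)
```

The division of labor is the point: retraining is never needed for a new KB article, and retrieval quality never affects the response style.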
Decision framework
Start with RAG if:
- Your data changes frequently
- You need citations and transparency
- You want to ship fast and iterate
- You don't have a large labeled dataset
Choose fine-tuning if:
- You need a specific tone, format, or style
- You have 500+ high-quality training examples
- You're doing classification or extraction at scale
- Latency and token costs are critical
Use both if:
- You need style + current facts
- You have budget and time for complexity
Real examples
Company A – Legal document Q&A: 10,000 contracts, updated quarterly. Used RAG with GPT-4. Built in 3 weeks, €12k dev cost, €800/month run cost. Citations required for compliance.
Company B – Customer support email generation: 50,000 past tickets, consistent tone required. Fine-tuned GPT-3.5 Turbo. Built in 8 weeks, €30k total cost, €400/month hosting. Responses feel "on-brand."
Company C – Technical support bot: Combined fine-tuning (for structured troubleshooting format) + RAG (for latest KB articles). Built in 10 weeks, €40k dev cost, €1,200/month run cost. Best of both worlds.
Next steps
Identify your goal: Are you answering questions from documents (RAG) or generating content in a specific style (fine-tuning)? Start with RAG unless you have clear evidence that style/format matters more than content freshness.
Build a small proof-of-concept with 10–50 examples. Measure accuracy, latency, and user satisfaction. Expand only after you validate the approach.
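A proof-of-concept evaluation doesn't need tooling — a loop over labeled examples is enough to get the accuracy and latency numbers above. A minimal harness, with a toy system and keyword-match scoring standing in for your real pipeline and a proper grading rubric:

```python
import time

def evaluate(system, examples) -> dict:
    """Run a QA callable over (question, expected_keyword) pairs;
    report keyword-match accuracy and mean latency."""
    correct, total_latency = 0, 0.0
    for question, expected_keyword in examples:
        start = time.perf_counter()
        response = system(question)
        total_latency += time.perf_counter() - start
        if expected_keyword.lower() in response.lower():
            correct += 1
    return {"accuracy": correct / len(examples),
            "mean_latency_s": total_latency / len(examples)}

# Toy stand-in that always gives the same answer, to show the mechanics.
def demo_system(question: str) -> str:
    return "Refunds are processed within 14 days."

result = evaluate(demo_system, [
    ("How long do refunds take?", "14 days"),
    ("What is the uptime SLA?", "99.9%"),
])
```

Run the same harness against both a RAG and a fine-tuned prototype on identical examples and the comparison matrix above stops being abstract — you get your own numbers for accuracy and latency.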
Most businesses start with RAG and layer fine-tuning later if needed. It's faster, cheaper, and easier to explain to stakeholders.