Ever searched for "deployment guide" and got nothing, even though there's an article about "publishing to production"? RAG (Retrieval-Augmented Generation) solves this by understanding meaning, not just keywords. This series shows you how RAG came about, how it works under the hood, and how to build production systems. From semantic search to AI-powered Q&A with citations—all with working C# code examples.
📖 Series Navigation: This is Part 1 of the RAG (Retrieval-Augmented Generation) series:
RAG (Retrieval-Augmented Generation) was developed to make AI smarter—giving LLMs access to information they weren't trained on. But here's what's interesting: the technology opens opportunities far beyond AI chatbots. It powers semantic search on websites, content recommendation, writing assistance, and knowledge management.
The dual nature: RAG can help customers (better search, accurate answers with citations) or exploit them (manipulative recommendations, burying negative reviews, surfacing upsell content). The difference isn't the technology—it's intent. A semantic search that helps users find what they actually need? Great. One that prioritizes what makes you the most money while appearing helpful? That's dark pattern territory, and it's why understanding how this works matters.
Here's the truth about RAG: It sounds intimidating. Vector embeddings? Transformer models? KV caches? But like everything else in software, it's just about understanding how it works. You don't need to know the math behind transformer architectures any more than you need to understand assembly to write C#.
RAG in three steps:

1. Index: split your documents into chunks, embed them, and store the vectors.
2. Retrieve: when a question comes in, find the chunks most similar to it.
3. Generate: hand those chunks to the LLM as context and let it answer.
That's it. The rest is implementation details.
This series shows you how to build RAG systems with working C# code. No handwaving. No assumptions. Just the pieces and how they fit together.
What you'll learn in this series:
Later, I'll also show you how to build complete RAG systems including:
Retrieval-Augmented Generation: Find relevant information, then use it.
```mermaid
flowchart LR
    A[User Question] --> B[Retrieve Relevant Info]
    B --> C[Retrieved Documents/Context]
    C --> D[Generate Response]
    A --> D
    D --> E[Grounded, Accurate Answer]
    style B stroke:#f9f,stroke-width:2px
    style D stroke:#bbf,stroke-width:2px
```
Without RAG: User asks → LLM guesses from memory → might hallucinate
With RAG: User asks → Find relevant docs → LLM answers using those docs → grounded in reality
```csharp
// Without RAG: Hope the LLM knows
var answer = await llm.GenerateAsync("How do I deploy Docker?");
// Risk: Might make up outdated or wrong steps
```

```csharp
// With RAG: Give it the docs
var relevantDocs = await vectorSearch.FindSimilar("How do I deploy Docker?");
var context = string.Join("\n", relevantDocs.Select(d => d.Text));
var answer = await llm.GenerateAsync($"Context: {context}\n\nQuestion: How do I deploy Docker?");
// Result: Answer based on YOUR actual Docker deployment docs
```
Key insight: Separate knowledge storage (search) from reasoning (LLM). Update your docs, search stays current. No retraining needed.
RAG builds on decades of search and NLP research. Understanding this history helps you appreciate why RAG is designed the way it is—and what problems it solves.
Keyword-based search:
The problem: These matched characters, not meaning. Search "container orchestration" and you won't find "Docker Swarm" unless those exact words appear. They could handle typos but not semantics.
Watson (IBM, 2011):
Reading comprehension models:
Transformers (2017): "Attention is All You Need"
BERT (2018):
GPT-2/3 (2019/2020):
Dense vector representations:
BART (Facebook AI, October 2019):
M2M-100 (Facebook AI, October 2020):
Real-world example: My neural machine translation tool uses BART as a fallback translation model when primary services are unavailable, demonstrating how these transformer-based models became practical building blocks for production systems.
The seminal paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" by Patrick Lewis et al. (Facebook AI Research, 2020) formally introduced RAG, building directly on BART:
What they combined:
The results: RAG systems outperformed much larger models on knowledge-intensive tasks while being more efficient and up-to-date. You could update the knowledge base without retraining the model.
ChatGPT, GPT-4, and Claude made RAG essential:
Today (2024-2025): RAG is the de facto standard for production AI systems that need accuracy and auditability. Every major AI company offers RAG tooling.
Before diving deep into the technical details (which we'll cover in Part 2), let's understand the high-level workflow.
RAG systems operate in three distinct phases:
Phase 1 - Indexing:

```mermaid
flowchart LR
    A[Your Documents] --> B[Split into Chunks]
    B --> C[Generate Embeddings]
    C --> D[Store in Vector DB]
    style C stroke:#f9f,stroke-width:2px
    style D stroke:#bbf,stroke-width:2px
```
What happens:
Key concept: Similar meanings produce similar vectors, so "Docker container" and "containerization platform" end up close together in vector space.
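To make the indexing phase concrete, here's a minimal sketch. The `Chunk` record and the `embedder` delegate are hypothetical stand-ins for your own storage model and embedding API (OpenAI, Azure OpenAI, a local ONNX model, and so on):

```csharp
// Hypothetical storage model for one indexed chunk
public record Chunk(string Id, string Text, float[] Embedding);

// Indexing phase: split documents, embed each piece, keep the vectors
public static async Task<List<Chunk>> IndexAsync(
    IEnumerable<string> documents,
    Func<string, Task<float[]>> embedder,
    int maxChunkChars = 1000)
{
    var store = new List<Chunk>();
    var id = 0;
    foreach (var doc in documents)
    {
        // Naive fixed-size chunking; real systems split on
        // paragraph/sentence boundaries, often with overlap
        for (var start = 0; start < doc.Length; start += maxChunkChars)
        {
            var text = doc.Substring(start, Math.Min(maxChunkChars, doc.Length - start));
            var embedding = await embedder(text);
            store.Add(new Chunk($"chunk-{id++}", text, embedding));
        }
    }
    return store;
}
```

Fixed-size slicing is just the simplest thing that works; the chunking strategy is one of the main levers you'll tune later.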
Phase 2 - Retrieval:

```mermaid
flowchart LR
    A[User Question] --> B[Generate Query Embedding]
    B --> C[Search Vector DB]
    C --> D[Top K Most Similar Chunks]
    style B stroke:#f9f,stroke-width:2px
    style C stroke:#bbf,stroke-width:2px
```
What happens:
Why it works: "How do I deploy containers?" (query) is semantically similar to chunks about Docker deployment, even if the exact words differ.
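Under the hood, "most similar" typically means cosine similarity between embedding vectors. Here's a brute-force sketch against the in-memory store from the indexing example; production vector databases replace the linear scan with approximate nearest-neighbor indexes:

```csharp
// Cosine similarity between two embedding vectors
public static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, magA = 0, magB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
}

// Retrieval phase: embed the query, rank every stored chunk, take the top K
public static async Task<List<Chunk>> FindSimilarAsync(
    string query,
    List<Chunk> store,
    Func<string, Task<float[]>> embedder,
    int topK = 3)
{
    var queryEmbedding = await embedder(query);
    return store
        .OrderByDescending(c => CosineSimilarity(queryEmbedding, c.Embedding))
        .Take(topK)
        .ToList();
}
```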
Phase 3 - Generation:

```mermaid
flowchart TB
    A[User Question] --> B[Build Prompt]
    C[Retrieved Context] --> B
    B --> D[LLM]
    D --> E[Generated Answer with Citations]
    style B stroke:#f9f,stroke-width:2px
    style D stroke:#bbf,stroke-width:2px
```
What happens:
The magic: the LLM is instructed to answer from the provided context, so it has far less room to hallucinate. Its job shifts from recalling facts to synthesizing and explaining what's provided.
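A minimal sketch of that prompt-assembly step, reusing the hypothetical `Chunk` record from the indexing example; the exact instruction wording is an assumption you'd tune for your model:

```csharp
using System.Text;

// Build a grounded prompt with numbered sources the model can cite
public static string BuildPrompt(string question, IReadOnlyList<Chunk> chunks)
{
    var sb = new StringBuilder();
    sb.AppendLine("Context:");
    for (var i = 0; i < chunks.Count; i++)
        sb.AppendLine($"[{i + 1}] {chunks[i].Text}");
    sb.AppendLine();
    sb.AppendLine($"Question: {question}");
    sb.AppendLine("Answer using ONLY the context above. Cite sources like [1].");
    return sb.ToString();
}
```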
Let's trace a query through the system:
User asks: "How do I use Docker Compose?"
Step 1 - Retrieval:
```text
Query embedding: [0.234, -0.891, 0.567, ...]

Search vector DB for similar embeddings...

Retrieved chunks:
1. "Docker Compose is a tool for defining multi-container applications..." (similarity: 0.92)
2. "To use Docker Compose, create a docker-compose.yml file..." (similarity: 0.87)
3. "The docker-compose up command starts all services..." (similarity: 0.83)
```
Step 2 - Generation:
Prompt to LLM:
"Context:
[1] Docker Compose is a tool for defining multi-container applications...
[2] To use Docker Compose, create a docker-compose.yml file...
[3] The docker-compose up command starts all services...
Question: How do I use Docker Compose?
Answer (use the context above):"
LLM Response:
"To use Docker Compose [1], start by creating a docker-compose.yml file [2] that
defines your services. Then run 'docker-compose up' to start all services [3]..."
Result: Accurate answer with inline citations from your documentation.
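Tying the trace together, here's a sketch of the whole ask-a-question path, reusing `FindSimilarAsync` and `BuildPrompt` from the earlier sketches; the `llm` delegate stands in for whatever chat-completion client you use:

```csharp
// End-to-end query path: retrieve, build a grounded prompt, generate
public static async Task<string> AskAsync(
    string question,
    List<Chunk> store,
    Func<string, Task<float[]>> embedder,
    Func<string, Task<string>> llm)
{
    var topChunks = await FindSimilarAsync(question, store, embedder, topK: 3);
    var prompt = BuildPrompt(question, topChunks);
    return await llm(prompt);
}
```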
Understanding when to use RAG (and when not to) requires comparing it to alternatives.
| Aspect | RAG | Fine-Tuning |
|---|---|---|
| Knowledge Updates | Instant (just update the knowledge base) | Requires retraining |
| Cost | Low (storage + embedding) | High (GPU training time) |
| Accuracy | Grounded in sources | Can hallucinate |
| Customization | Limited to retrieval | Deep model adaptation |
| Explainability | High (can cite sources) | Low (black box) |
| Best For | Knowledge-intensive tasks | Style/format adaptation |
When to use Fine-Tuning:
When to use RAG:
Can you combine both? Yes! Fine-tune for style, RAG for facts.
Modern LLMs boast huge context windows (GPT-4: 128K tokens, Claude: 200K tokens). Why not just dump all your documents into the context?
Problems with long context:
When long context makes sense:
When RAG makes sense:
Best practice: Use RAG to select the most relevant content, then use long context for that subset.
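One way to apply that best practice, sketched under two assumptions: the hypothetical `Chunk` record from earlier, and a crude characters-per-token estimate standing in for a real tokenizer:

```csharp
// Take the highest-ranked chunks until a token budget is exhausted.
// ~4 characters per token is a rough heuristic for English text;
// swap in a real tokenizer for production use.
public static List<Chunk> SelectWithinBudget(
    IEnumerable<Chunk> rankedChunks,
    int maxTokens = 8000,
    double charsPerToken = 4.0)
{
    var selected = new List<Chunk>();
    var usedTokens = 0;
    foreach (var chunk in rankedChunks)
    {
        var estimatedTokens = (int)Math.Ceiling(chunk.Text.Length / charsPerToken);
        if (usedTokens + estimatedTokens > maxTokens) break;
        selected.Add(chunk);
        usedTokens += estimatedTokens;
    }
    return selected;
}
```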
Few-shot prompting (giving examples in the prompt) is a simple baseline.
Example prompt:
```text
Examples:

Q: What is Docker?
A: Docker is a containerization platform...

Q: How does Kubernetes work?
A: Kubernetes orchestrates containers...

Q: What is my new question?
A: [LLM generates answer]
```
Limitations:
RAG improvement:
You can think of RAG as "automated few-shot prompting at scale."
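Taken literally, that means retrieving the examples per query instead of hard-coding them. A sketch, assuming you've already used the similarity search from earlier to pull the closest previously answered Q&A pairs:

```csharp
using System.Text;

// Build a few-shot prompt from retrieved examples instead of hand-picked ones
public static string BuildFewShotPrompt(
    string question,
    IReadOnlyList<(string Question, string Answer)> retrievedExamples)
{
    var sb = new StringBuilder();
    sb.AppendLine("Examples:");
    foreach (var (q, a) in retrievedExamples)
    {
        sb.AppendLine($"Q: {q}");
        sb.AppendLine($"A: {a}");
        sb.AppendLine();
    }
    sb.AppendLine($"Q: {question}");
    sb.Append("A:");
    return sb.ToString();
}
```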
You can combine RAG with traditional full-text search using Reciprocal Rank Fusion (RRF).
Why hybrid?
```csharp
public async Task<List<SearchResult>> HybridSearchAsync(string query)
{
    // Run both searches in parallel
    var semanticTask = SemanticSearchAsync(query, limit: 20);
    var keywordTask = KeywordSearchAsync(query, limit: 20);
    await Task.WhenAll(semanticTask, keywordTask);

    var semanticResults = await semanticTask;
    var keywordResults = await keywordTask;

    // Combine using Reciprocal Rank Fusion
    return ApplyRRF(semanticResults, keywordResults);
}

private List<SearchResult> ApplyRRF(
    List<SearchResult> list1,
    List<SearchResult> list2,
    int k = 60)
{
    var scores = new Dictionary<string, double>();

    // Score from first list: 1 / (k + rank), with rank starting at 1
    for (int i = 0; i < list1.Count; i++)
    {
        var id = list1[i].Id;
        scores[id] = scores.GetValueOrDefault(id, 0) + 1.0 / (k + i + 1);
    }

    // Score from second list
    for (int i = 0; i < list2.Count; i++)
    {
        var id = list2[i].Id;
        scores[id] = scores.GetValueOrDefault(id, 0) + 1.0 / (k + i + 1);
    }

    // Merge, deduplicate, and sort by combined score
    var allResults = list1.Concat(list2)
        .GroupBy(r => r.Id)
        .Select(g => g.First())
        .OrderByDescending(r => scores[r.Id])
        .ToList();

    return allResults;
}
```
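A hypothetical call site (the query string is just an example):

```csharp
// Exact identifiers favor the keyword list; paraphrases favor the semantic list
var results = await HybridSearchAsync("docker-compose up fails with exit code 137");
foreach (var result in results.Take(5))
    Console.WriteLine(result.Id);
```

The k = 60 constant is the conventional default from the original RRF paper; it damps the score differences between ranks so a single top spot in one list can't dominate the fused ordering.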
Now that you understand what RAG is, where it came from, and how it compares to alternatives, here's why it matters:
1. Democratization of AI
2. Practical Accuracy
3. Always Up-to-Date
4. Privacy and Control
5. Cost-Effective
6. Versatile Applications
We've traced RAG's evolution:
Key insights from Part 1:
The three-step mental model: index your content, retrieve what's relevant, generate from it.
Everything else is optimization.
You now understand what RAG is, why it matters, and where it came from. But how does it actually work under the hood?
In Part 2: RAG Architecture and Internals, we dive deep into the technical details:
Complete RAG pipeline:
LLM internals:
Technical deep dives:
Continue to Part 2: Architecture and Internals →
After Part 2, you'll be ready for Part 3, where we build real systems, solve common challenges, and explore advanced techniques like HyDE, multi-query RAG, and contextual compression.
Foundational Papers:

- Vaswani et al., "Attention Is All You Need" (2017)
- Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020)
Further Reading:
Next in this series: