In Part 1 and Part 2 of this series, we covered RAG's origins, fundamentals, and technical architecture. You understand what RAG is, why it matters, and how it works under the hood. Now it's time to put that knowledge into practice. This article shows you how to build real RAG systems with working C# code, solve common challenges, and use advanced techniques from recent research.
📖 Series Navigation: This is Part 3 of the RAG series:
If you haven't read Parts 1 and 2, I recommend starting there to understand:
This article assumes you understand those fundamentals and focuses on implementation, optimization, and real-world patterns.
I've built several RAG-powered features on this blog. Let me show you concrete examples.
Every blog post can show "Related Posts" using semantic similarity.
How it works:
Why it's better than tags:
Code snippet:
public async Task<List<SearchResult>> GetRelatedPostsAsync(
string currentPostSlug,
string language,
int limit = 5)
{
// Get the current post's embedding
var currentPost = await _vectorStore.GetByIdAsync(currentPostSlug);
if (currentPost == null)
return new List<SearchResult>();
// Find similar posts
var similarPosts = await _vectorStore.SearchAsync(
currentPost.Embedding,
limit: limit + 1, // +1 as a safety margin; the MustNot filter below already excludes the current post
filter: new Filter
{
Must =
{
new Condition
{
Field = "language",
Match = new Match { Keyword = language }
}
},
MustNot =
{
new Condition
{
Field = "slug",
Match = new Match { Keyword = currentPostSlug }
}
}
}
);
return similarPosts.Take(limit).ToList();
}
The search box on this blog uses RAG-style semantic search (though without the generation part - it's just retrieval).
User experience:
Implementation: I'll cover building this in an upcoming article on vector databases.
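That said, the retrieval half falls out of the pieces already shown. Here's a minimal, hypothetical sketch that reuses the same IEmbeddingService and IVectorStoreService abstractions from the related-posts example - retrieval only, no LLM involved:
// Retrieval-only semantic search: embed the query, return the closest chunks.
// Assumes the IEmbeddingService and IVectorStoreService abstractions used elsewhere in this article.
public async Task<List<SearchResult>> SearchPostsAsync(string query, string language, int limit = 10)
{
    var queryEmbedding = await _embeddingService.GenerateEmbeddingAsync(query);

    var results = await _vectorStore.SearchAsync(
        queryEmbedding,
        limit: limit,
        filter: new Filter
        {
            Must =
            {
                new Condition { Field = "language", Match = new Match { Keyword = language } }
            }
        }
    );

    return results;
}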
I'm building a complete RAG system to help me write new blog posts.
Use case: When I start writing about "adding authentication to ASP.NET Core", the system:
Full RAG pipeline:
public async Task<WritingAssistanceResponse> GetSuggestionsAsync(
string currentDraft,
string topic)
{
// 1. Embed the current draft
var draftEmbedding = await _embeddingService.GenerateEmbeddingAsync(
currentDraft
);
// 2. Retrieve related past content
var relatedPosts = await _vectorStore.SearchAsync(
draftEmbedding,
limit: 5
);
// 3. Build context for LLM
var prompt = BuildWritingAssistancePrompt(
currentDraft,
topic,
relatedPosts
);
// 4. Generate suggestions using local LLM
var suggestions = await _llmService.GenerateAsync(prompt);
// 5. Extract and format citations
var response = ExtractCitations(suggestions, relatedPosts);
return response;
}
This is RAG in action - retrieval (semantic search) + augmentation (adding context) + generation (LLM suggestions).
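The BuildWritingAssistancePrompt helper is glossed over above. A minimal sketch - assuming each retrieved SearchResult exposes a Title and Text, as in the other examples in this article - could look like this:
// Hypothetical sketch of the prompt builder used in GetSuggestionsAsync.
// Assumes SearchResult exposes Title and Text, as in the other examples.
private string BuildWritingAssistancePrompt(
    string currentDraft,
    string topic,
    List<SearchResult> relatedPosts)
{
    var context = string.Join("\n\n", relatedPosts.Select((p, i) =>
        $"[{i + 1}] {p.Title}\n{p.Text}"));

    return $@"
You are helping me write a blog post about: {topic}

Here are excerpts from my past posts (cite them as [1], [2], ...):
{context}

Current draft:
{currentDraft}

Suggest improvements, related points I've already covered, and internal links I could add:";
}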
Building production RAG systems isn't trivial. Here are challenges I've encountered and how to solve them.
Problem: How do you split documents? Too small = loss of context. Too large = irrelevant information.
Solution: Hybrid chunking based on document structure.
public class SmartChunker
{
public List<Chunk> ChunkDocument(string markdown, string sourceId)
{
var chunks = new List<Chunk>();
// Parse markdown into sections
var document = Markdown.Parse(markdown);
var sections = ExtractSections(document);
foreach (var section in sections)
{
var wordCount = CountWords(section.Content);
if (wordCount < MinChunkSize)
{
// Merge small sections
MergeWithPrevious(chunks, section);
}
else if (wordCount > MaxChunkSize)
{
// Split large sections
var subChunks = SplitSection(section);
chunks.AddRange(subChunks);
}
else
{
// Just right
chunks.Add(CreateChunk(section, sourceId));
}
}
return chunks;
}
}
Best practices:
Problem: Generic embedding models may not capture domain-specific semantics.
Solutions:
Option 1: Fine-tune embeddings (advanced)
# Using sentence-transformers in Python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
model = SentenceTransformer('all-MiniLM-L6-v2')
# Create training examples from your domain
train_examples = [
InputExample(texts=['Docker Compose', 'container orchestration'], label=0.9),
InputExample(texts=['Entity Framework', 'ORM database'], label=0.9),
InputExample(texts=['Docker', 'apple fruit'], label=0.1)
]
# Fine-tune
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
Option 2: Hybrid embeddings (combine multiple models)
public async Task<float[]> GenerateHybridEmbeddingAsync(string text)
{
var semantic = await _semanticModel.GenerateEmbeddingAsync(text);
var keyword = await _keywordModel.GenerateEmbeddingAsync(text);
// Concatenate or weighted average
return CombineEmbeddings(semantic, keyword);
}
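CombineEmbeddings is left undefined above. One simple option (an assumption, not the only approach) is a weighted average followed by L2 normalization, which requires both models to produce vectors of the same dimensionality:
// Hypothetical helper for the hybrid example above: weighted average of two
// equal-length embeddings, followed by L2 normalization so cosine similarity
// behaves consistently downstream.
private static float[] CombineEmbeddings(float[] a, float[] b, float weightA = 0.7f)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Embeddings must have the same dimensionality to be averaged.");

    var combined = new float[a.Length];
    for (int i = 0; i < a.Length; i++)
        combined[i] = weightA * a[i] + (1 - weightA) * b[i];

    var norm = MathF.Sqrt(combined.Sum(x => x * x));
    if (norm > 0)
        for (int i = 0; i < combined.Length; i++)
            combined[i] /= norm;

    return combined;
}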
Option 3: Add metadata filtering
var results = await _vectorStore.SearchAsync(
queryEmbedding,
limit: 10,
filter: new Filter
{
Must =
{
new Condition { Field = "category", Match = new Match { Keyword = "ASP.NET" } },
new Condition { Field = "date", Range = new Range { Gte = "2024-01-01" } }
}
}
);
Problem: LLMs have token limits. How do you fit query + context + prompt in the window?
Solution: Dynamic context selection and summarization.
public async Task<string> BuildContextAwarePromptAsync(
string query,
List<SearchResult> retrievedDocs,
int maxTokens = 4096)
{
var promptTemplate = GetPromptTemplate();
var queryTokens = CountTokens(query);
var templateTokens = CountTokens(promptTemplate);
// Reserve tokens for: prompt + query + response
var availableForContext = maxTokens - queryTokens - templateTokens - 500; // 500 for response
// Add context until we hit limit
var selectedContext = new List<SearchResult>();
var currentTokens = 0;
foreach (var doc in retrievedDocs.OrderByDescending(d => d.Score))
{
var docTokens = CountTokens(doc.Text);
if (currentTokens + docTokens <= availableForContext)
{
selectedContext.Add(doc);
currentTokens += docTokens;
}
else
{
// Try summarizing the doc if it's important
if (doc.Score > 0.85)
{
var summary = await SummarizeAsync(doc.Text, maxTokens: 200);
var summaryTokens = CountTokens(summary);
if (currentTokens + summaryTokens <= availableForContext)
{
selectedContext.Add(new SearchResult
{
Text = summary,
Title = doc.Title,
Score = doc.Score
});
currentTokens += summaryTokens;
}
}
}
}
return FormatPrompt(query, selectedContext);
}
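CountTokens is assumed above. Exact counts depend on your model's tokenizer, but a rough character-based heuristic is usually good enough for budgeting context:
// Rough token estimate: ~4 characters per token for English text.
// For exact counts, use the tokenizer that matches your model; this heuristic
// just keeps the context budget conservative.
private static int CountTokens(string text)
{
    if (string.IsNullOrEmpty(text))
        return 0;

    return (int)Math.Ceiling(text.Length / 4.0);
}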
Problem: Even with context, LLMs sometimes ignore it and hallucinate.
Solutions:
1. Prompt engineering:
var systemPrompt = @"
You are a technical assistant.
CRITICAL RULES:
1. ONLY use information from the provided CONTEXT sections
2. If the context doesn't contain the answer, say 'I don't have enough information in the provided context to answer that'
3. DO NOT use your training data to supplement answers
4. Always cite the source using [1], [2] notation
5. If you're unsure, say so
CONTEXT:
{context}
QUESTION: {query}
ANSWER (following all rules above):
";
2. Post-generation validation:
public async Task<bool> ValidateResponseAgainstContext(
string response,
List<SearchResult> context)
{
// Check if response contains claims not in context
var responseSentences = SplitIntoSentences(response);
foreach (var sentence in responseSentences)
{
var isSupported = await IsClaimSupportedByContext(sentence, context);
if (!isSupported)
{
_logger.LogWarning("Hallucination detected: {Sentence}", sentence);
return false;
}
}
return true;
}
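IsClaimSupportedByContext is left abstract here. A minimal sketch that uses the LLM itself as a judge (a dedicated NLI model or an embedding-similarity check are stronger alternatives) might be:
// Hypothetical helper: ask a (preferably small and cheap) model whether the claim
// is entailed by the retrieved context.
private async Task<bool> IsClaimSupportedByContext(
    string claim,
    List<SearchResult> context)
{
    var contextText = string.Join("\n\n", context.Select(c => c.Text));

    var verdict = await _llm.GenerateAsync($@"
Context:
{contextText}

Claim: {claim}

Is the claim fully supported by the context above? Answer only YES or NO.");

    return verdict.Trim().StartsWith("YES", StringComparison.OrdinalIgnoreCase);
}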
3. Iterative refinement:
public async Task<string> GenerateWithValidationAsync(
string query,
List<SearchResult> context,
int maxAttempts = 3)
{
for (int attempt = 0; attempt < maxAttempts; attempt++)
{
var response = await _llm.GenerateAsync(
BuildPrompt(query, context)
);
var isValid = await ValidateResponseAgainstContext(response, context);
if (isValid)
return response;
// Refine prompt for next attempt
query = $"{query}\n\nPrevious attempt hallucinated. Stick strictly to the context.";
}
return "I couldn't generate a reliable answer. Please rephrase your question.";
}
Problem: As you add new documents, the vector database needs to stay current.
Solution: Automated indexing pipeline.
public class BlogIndexingBackgroundService : BackgroundService
{
private readonly IVectorStoreService _vectorStore;
private readonly IMarkdownService _markdownService;
private readonly IEmbeddingService _embeddingService;
private readonly SmartChunker _chunker;
private readonly ILogger<BlogIndexingBackgroundService> _logger;
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
while (!stoppingToken.IsCancellationRequested)
{
try
{
await IndexNewPostsAsync(stoppingToken);
// Check for updates every hour
await Task.Delay(TimeSpan.FromHours(1), stoppingToken);
}
catch (Exception ex)
{
_logger.LogError(ex, "Error in indexing service");
}
}
}
private async Task IndexNewPostsAsync(CancellationToken ct)
{
var allPosts = await _markdownService.GetAllPostsAsync();
foreach (var post in allPosts)
{
var existingDoc = await _vectorStore.GetByIdAsync(post.Slug);
// Check if content changed
var currentHash = ComputeHash(post.Content);
if (existingDoc == null || existingDoc.ContentHash != currentHash)
{
_logger.LogInformation("Indexing updated post: {Title}", post.Title);
var chunks = _chunker.ChunkDocument(post.Content, post.Slug);
foreach (var chunk in chunks)
{
var embedding = await _embeddingService.GenerateEmbeddingAsync(chunk.Text);
await _vectorStore.UpsertAsync(
id: $"{post.Slug}_{chunk.Index}",
embedding: embedding,
metadata: new Dictionary<string, object>
{
["slug"] = post.Slug,
["title"] = post.Title,
["chunk_index"] = chunk.Index,
["content_hash"] = currentHash
},
ct: ct
);
}
}
}
}
}
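ComputeHash isn't shown above; any stable content hash works for change detection. A SHA-256 sketch:
// Hypothetical helper for change detection: a stable hash of the post content.
private static string ComputeHash(string content)
{
    using var sha256 = System.Security.Cryptography.SHA256.Create();
    var bytes = sha256.ComputeHash(System.Text.Encoding.UTF8.GetBytes(content));
    return Convert.ToHexString(bytes);
}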
Let's explore cutting-edge RAG techniques from recent research.
Problem: User queries are often short and poorly formed. Document chunks are detailed and well-written. This mismatch hurts retrieval.
Solution: Generate a hypothetical ideal document that would answer the query, embed that, then search.
public async Task<List<SearchResult>> HyDESearchAsync(string query)
{
// Generate hypothetical answer (even if hallucinated)
var hypotheticalAnswer = await _llm.GenerateAsync($@"
Write a detailed, technical paragraph that would perfectly answer this question:
Question: {query}
Paragraph:"
);
// Embed the hypothetical answer
var embedding = await _embeddingService.GenerateEmbeddingAsync(
hypotheticalAnswer
);
// Search using this embedding
return await _vectorStore.SearchAsync(embedding);
}
Why it works: The hypothetical answer uses similar language and structure to actual documents, improving retrieval.
Problem: User queries often mix semantic search with metadata filters.
Example: "Recent posts about Docker" = semantic("Docker") + filter(date > 2024-01-01)
Solution: Use LLM to parse the query into semantic + metadata filters.
public async Task<SearchQuery> ParseSelfQueryAsync(string naturalLanguageQuery)
{
var parsingPrompt = $@"
Parse this search query into:
1. Semantic search query (what the user is looking for)
2. Metadata filters (category, date range, etc.)
User Query: {naturalLanguageQuery}
Output JSON:
{{
""semantic_query"": ""the core concept"",
""filters"": {{
""category"": ""...",
""date_after"": ""..."",
""date_before"": ""...""
}}
}}
";
var jsonResponse = await _llm.GenerateAsync(parsingPrompt);
var parsed = JsonSerializer.Deserialize<SearchQuery>(jsonResponse);
return parsed;
}
// Use parsed query
var parsedQuery = await ParseSelfQueryAsync("Recent ASP.NET posts about authentication");
// semantic_query: "authentication"
// filters: { category: "ASP.NET", date_after: "2024-01-01" }
var results = await _vectorStore.SearchAsync(
embedding: await _embeddingService.GenerateEmbeddingAsync(parsedQuery.SemanticQuery),
filter: BuildFilter(parsedQuery.Filters)
);
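BuildFilter maps the parsed filters onto the vector store's Filter type. A sketch, assuming Filters deserializes into a simple SearchFilters class with Category, DateAfter, and DateBefore strings (hypothetical names), and that the Range type supports Lte alongside the Gte shown earlier:
// Hypothetical mapping from parsed filters onto the Filter type used throughout this article.
private static Filter BuildFilter(SearchFilters filters)
{
    var filter = new Filter();

    if (!string.IsNullOrEmpty(filters.Category))
        filter.Must.Add(new Condition { Field = "category", Match = new Match { Keyword = filters.Category } });

    if (!string.IsNullOrEmpty(filters.DateAfter))
        filter.Must.Add(new Condition { Field = "date", Range = new Range { Gte = filters.DateAfter } });

    if (!string.IsNullOrEmpty(filters.DateBefore))
        filter.Must.Add(new Condition { Field = "date", Range = new Range { Lte = filters.DateBefore } });

    return filter;
}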
Problem: A single query might miss relevant documents due to phrasing.
Solution: Generate multiple variations of the query, search with all, combine results.
public async Task<List<SearchResult>> MultiQuerySearchAsync(string query)
{
// Generate query variations
var variations = await _llm.GenerateAsync($@"
Generate 3 different ways to phrase this search query:
Original: {query}
Variations (one per line):
");
var queries = variations.Split('\n', StringSplitOptions.RemoveEmptyEntries)
.Prepend(query) // Include original
.ToList();
// Search with all variations
var allResults = new List<SearchResult>();
foreach (var q in queries)
{
var embedding = await _embeddingService.GenerateEmbeddingAsync(q);
var results = await _vectorStore.SearchAsync(embedding, limit: 10);
allResults.AddRange(results);
}
// Deduplicate and merge scores
var merged = allResults
.GroupBy(r => r.Id)
.Select(g => new SearchResult
{
Id = g.Key,
Text = g.First().Text,
Title = g.First().Title,
Score = g.Max(r => r.Score) // Take best score
})
.OrderByDescending(r => r.Score)
.ToList();
return merged;
}
Problem: Retrieved chunks contain irrelevant information. Sending it all wastes tokens.
Solution: Use a smaller LLM to compress retrieved context to only relevant parts.
public async Task<string> CompressContextAsync(
string query,
List<SearchResult> retrievedDocs)
{
var compressed = new List<string>();
foreach (var doc in retrievedDocs)
{
var compressionPrompt = $@"
Extract only the sentences from this document that are relevant to answering the question.
Question: {query}
Document:
{doc.Text}
Relevant excerpts (maintain original wording):
";
var relevantExcerpt = await _smallLLM.GenerateAsync(compressionPrompt);
if (!string.IsNullOrWhiteSpace(relevantExcerpt))
{
compressed.Add($"From '{doc.Title}':\n{relevantExcerpt}");
}
}
return string.Join("\n\n", compressed);
}
Problem: Complex questions require information from multiple sources that need to be connected.
Example: "What database does the blog use and how is semantic search implemented?"
Solution: Iterative retrieval and synthesis.
public async Task<string> MultiHopRAGAsync(string complexQuery, int maxHops = 3)
{
var currentQuery = complexQuery;
var allContext = new List<SearchResult>();
for (int hop = 0; hop < maxHops; hop++)
{
// Retrieve for current query
var results = await SearchAsync(currentQuery, limit: 5);
allContext.AddRange(results);
// Check if we have enough information
var synthesisPrompt = $@"
Original question: {complexQuery}
Context so far:
{FormatContext(allContext)}
Can you answer the original question with this context?
If yes, provide the answer.
If no, what additional information do you need? (be specific)
";
var synthesis = await _llm.GenerateAsync(synthesisPrompt);
if (synthesis.Contains("yes", StringComparison.OrdinalIgnoreCase))
{
// We have enough information
return ExtractAnswer(synthesis);
}
// Extract what we need for next hop
currentQuery = ExtractNextQuery(synthesis);
}
// Final synthesis with all gathered context
return await GenerateFinalAnswerAsync(complexQuery, allContext);
}
Problem: How do you build AI systems that remember conversations from months or years ago? Traditional chatbots lose context after each session.
Solution: Combine RAG with progressive summarization to create persistent, searchable memory.
This is the approach used in DiSE (Directed Synthetic Evolution) - an advanced system I'm building that uses RAG-based context memory to maintain shared conversational history indefinitely.
Example scenario:
User (Today): "Remember George's specs?"
AI: "Yes, you discussed George's prescription requirements in our conversation
from 5 years ago (2019-03-15). He needed progressive lenses with..."
How it works:
flowchart TB
A[User Message] --> B[Store in RAG Memory]
B --> C[Extract Key Entities & Topics]
C --> D[Link to Past Conversations]
E[Periodic Summarization] --> F[Summarize Old Conversations]
F --> G[Store Summary with High-Level Tags]
G --> H[Keep Original for Retrieval]
I[Future Query: 'George's specs'] --> J[Semantic Search in RAG]
J --> K[Find: 2019 conversation]
K --> L[Retrieve Original Context]
L --> M[LLM generates response with 5-year-old context!]
style B stroke:#f9f,stroke-width:3px
style J stroke:#bbf,stroke-width:3px
Implementation approach:
public class LongTermConversationalMemory
{
private readonly IVectorStoreService _vectorStore;
private readonly IEmbeddingService _embeddingService;
private readonly ILlmService _llm; // assumed LLM client, used for summarization below
public async Task StoreConversationAsync(
string conversationId,
string userId,
List<ConversationTurn> turns,
DateTime timestamp)
{
// Extract key entities and topics
var entities = await ExtractEntitiesAsync(turns);
var topics = await ExtractTopicsAsync(turns);
// Create searchable representation
var conversationText = string.Join("\n", turns.Select(t =>
$"{t.Speaker}: {t.Message}"));
// Generate embedding
var embedding = await _embeddingService.GenerateEmbeddingAsync(
conversationText);
// Store in RAG with rich metadata
await _vectorStore.IndexDocumentAsync(
id: $"conv_{conversationId}",
embedding: embedding,
metadata: new Dictionary<string, object>
{
["user_id"] = userId,
["timestamp"] = timestamp.ToString("O"),
["entities"] = entities, // ["George", "specs", "prescription"]
["topics"] = topics, // ["healthcare", "eyewear"]
["full_text"] = conversationText,
["turn_count"] = turns.Count
}
);
}
public async Task<List<PastContext>> RetrieveRelevantPastAsync(
string currentQuery,
string userId,
int limit = 5)
{
// Embed the current query
var queryEmbedding = await _embeddingService.GenerateEmbeddingAsync(
currentQuery);
// Search past conversations
var results = await _vectorStore.SearchAsync(
queryEmbedding,
limit: limit,
filter: new Filter
{
Must =
{
new Condition { Field = "user_id", Match = new Match { Keyword = userId } }
}
}
);
return results.Select(r => new PastContext
{
ConversationId = r.Id,
Timestamp = DateTime.Parse(r.Metadata["timestamp"].ToString()),
Entities = (List<string>)r.Metadata["entities"],
FullText = r.Metadata["full_text"].ToString(),
Relevance = r.Score
}).ToList();
}
// Periodic summarization to keep memory manageable
public async Task SummarizeOldConversationsAsync(DateTime olderThan)
{
var oldConversations = await _vectorStore.FindByDateRangeAsync(
endDate: olderThan);
foreach (var conv in oldConversations)
{
// Generate summary using LLM
var summary = await _llm.GenerateAsync($@"
Summarize this conversation, preserving key facts and entities:
{conv.FullText}
Summary:");
// Update document with summary while keeping original
await _vectorStore.UpdateAsync(
id: conv.Id,
additionalMetadata: new Dictionary<string, object>
{
["summary"] = summary,
["summarized_at"] = DateTime.UtcNow.ToString("O")
}
);
}
}
}
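ExtractEntitiesAsync and ExtractTopicsAsync are left abstract above. A minimal LLM-based sketch for entity extraction (topic extraction follows the same pattern; a dedicated NER model would be another option):
// Hypothetical entity extraction for the memory example: ask the LLM for a short
// JSON array of the people, products, and organizations mentioned in the conversation.
private async Task<List<string>> ExtractEntitiesAsync(List<ConversationTurn> turns)
{
    var text = string.Join("\n", turns.Select(t => $"{t.Speaker}: {t.Message}"));

    var json = await _llm.GenerateAsync($@"
List the named entities (people, products, organizations) mentioned in this conversation
as a JSON array of strings, e.g. [""George"", ""progressive lenses""].

Conversation:
{text}

JSON array:");

    try
    {
        return JsonSerializer.Deserialize<List<string>>(json) ?? new List<string>();
    }
    catch (JsonException)
    {
        // LLMs don't always return clean JSON; fall back to an empty list rather than failing the pipeline.
        return new List<string>();
    }
}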
Why this is powerful:
Real-world example from DiSE:
DiSE uses this approach to remember:
This creates an AI system that genuinely "learns" from every interaction and builds institutional memory, rather than starting fresh each session.
Challenges to consider:
This technique transforms RAG from "search my documents" into "remember everything we've ever discussed" - a game-changer for long-term AI assistants.
RAG isn't always the answer. Here's when to avoid it:
1. General knowledge questions
2. Creative writing
3. Real-time data needs
4. Mathematical reasoning
5. Very small knowledge bases
6. When you control the LLM's training
Want to build your own RAG system? Here's a step-by-step approach.
Goal: Get basic retrieval working with no LLM.
// 1. Choose an embedding service (start with API for simplicity)
var openAI = new OpenAIClient(apiKey);
// 2. Embed a few test documents
var docs = new[]
{
"Docker is a containerization platform",
"Kubernetes orchestrates containers",
"Entity Framework is an ORM for .NET"
};
var embeddings = new List<float[]>();
foreach (var doc in docs)
{
var response = await openAI.GetEmbeddingsAsync(
new EmbeddingsOptions("text-embedding-3-small", new[] { doc })
);
embeddings.Add(response.Value.Data[0].Embedding.ToArray());
}
// 3. Implement basic search (in-memory for now)
var query = "container orchestration";
var queryEmbedding = await GetEmbeddingAsync(query);
var results = embeddings
.Select((emb, idx) => new
{
Text = docs[idx],
Score = CosineSimilarity(queryEmbedding, emb)
})
.OrderByDescending(r => r.Score)
.ToList();
// 4. Verify search works
foreach (var result in results)
{
Console.WriteLine($"{result.Score:F3}: {result.Text}");
}
// Expected: Kubernetes scores highest
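CosineSimilarity and GetEmbeddingAsync are assumed here; GetEmbeddingAsync just wraps the same embeddings call shown above, and cosine similarity itself is only a few lines:
// Cosine similarity: dot product divided by the product of vector magnitudes.
static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}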
Goal: Scale to real document collections.
Next steps to implement:
I'll cover this in detail in an upcoming article on vector databases.
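Until then, here's a hypothetical sketch that ties together the abstractions from earlier in this article - SmartChunker, IEmbeddingService, and IVectorStoreService.UpsertAsync - to index a batch of documents:
// Hypothetical Phase 2 sketch: chunk each document, embed each chunk, and upsert it
// into a real vector store, reusing the SmartChunker and IVectorStoreService shown earlier.
public async Task IndexDocumentsAsync(
    IEnumerable<(string Id, string Markdown)> documents,
    CancellationToken ct = default)
{
    foreach (var (id, markdown) in documents)
    {
        var chunks = _chunker.ChunkDocument(markdown, sourceId: id);

        foreach (var chunk in chunks)
        {
            var embedding = await _embeddingService.GenerateEmbeddingAsync(chunk.Text);

            await _vectorStore.UpsertAsync(
                id: $"{id}_{chunk.Index}",
                embedding: embedding,
                metadata: new Dictionary<string, object>
                {
                    ["source_id"] = id,
                    ["chunk_index"] = chunk.Index
                },
                ct: ct
            );
        }
    }
}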
Goal: Complete the RAG pipeline.
// 1. Retrieve context
var context = await SearchAsync(query, limit: 3);
// 2. Build prompt
var prompt = $@"
Answer the question using this context:
{FormatContext(context)}
Question: {query}
Answer:";
// 3. Generate (start with API)
var response = await openAI.GetChatCompletionsAsync(new ChatCompletionsOptions
{
Messages =
{
new ChatMessage(ChatRole.System, "You are a helpful assistant."),
new ChatMessage(ChatRole.User, prompt)
},
Temperature = 0.7f,
MaxTokens = 500
});
return response.Value.Choices[0].Message.Content;
Once the basics work, migrate to local inference (I'll cover this in upcoming articles):
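As a taste of that migration, here's a hedged sketch that swaps the hosted API call in step 3 for a local Ollama server's REST endpoint (assuming Ollama is running on its default port 11434 with a model such as llama3 pulled; uses System.Net.Http.Json):
// Hedged sketch: call a local Ollama server instead of a hosted API.
using var http = new HttpClient { BaseAddress = new Uri("http://localhost:11434") };

var request = new
{
    model = "llama3",
    prompt = prompt,   // the RAG prompt built in step 2 above
    stream = false
};

var httpResponse = await http.PostAsJsonAsync("/api/generate", request);
httpResponse.EnsureSuccessStatusCode();

var json = await httpResponse.Content.ReadFromJsonAsync<JsonElement>();
return json.GetProperty("response").GetString();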
RAG (Retrieval-Augmented Generation) is a powerful technique for making LLMs more accurate, up-to-date, and trustworthy by grounding their responses in actual documents. Instead of relying on the model's training data alone, RAG systems:
Key advantages of RAG:
When to use RAG:
When to avoid RAG:
The field is evolving rapidly with advanced techniques like HyDE, multi-query retrieval, and contextual compression, but the core concept remains simple: give LLMs access to the right information at the right time.
Start simple, measure results, and iterate. RAG is one of the most practical ways to build reliable AI systems today.
You've now completed the three-part RAG series:
Part 1: Origins and Fundamentals
Part 2: Architecture and Internals
Part 3: RAG in Practice (this article)
You now have the complete picture: From understanding RAG's origins to building production systems with advanced optimizations.
Now that you understand RAG from theory to practice, upcoming articles will show you how to build complete, production-ready RAG systems in C#:
Coming soon:
These articles will take you from theory to practice, with complete working code, deployment strategies, and real-world optimizations based on running these systems in production on this blog.
Stay tuned for hands-on implementation guides that turn this RAG knowledge into working systems!
Foundational Papers:
Tools and Frameworks:
Further Reading:
This RAG Series:
Happy building!