In Part 1, we covered RAG's origins, fundamentals, and why it matters. You understand the high-level concept: retrieve relevant information, then use it to generate responses. Now we dive deep into the technical architecture—exactly how RAG systems work under the hood, from chunking strategies to LLM internals like tokens and KV caches.
📖 Series Navigation: This is Part 2 of the RAG series:
If you haven't read Part 1, I recommend starting there to understand:
This article assumes you understand those basics and focuses on technical architecture, implementation details, and LLM internals.
Let's break down exactly what happens in a RAG system, from the moment you add a document to when a user gets an answer.
Before RAG can retrieve anything, you need to index your knowledge base. This is a one-time process (though you can add new documents later).
flowchart TB
A[Source Documents] -->|1. Extract Text| B[Text Extraction]
B -->|2. Split into Chunks| C[Chunking Service]
C -->|3. Generate Embeddings| D[Embedding Model]
D -->|4. Store Vectors| E[Vector Database]
B -.Metadata.-> E
subgraph "Example: Blog Post"
F["Understanding Docker: A containerization platform..."]
end
subgraph "Chunks"
G["Chunk 1: Title + Intro"]
H["Chunk 2: Benefits Section"]
end
subgraph "Embeddings"
I["0.234, 0.891, 0.567, ..."]
J["0.445, 0.123, 0.789, ..."]
end
F --> G
F --> H
G --> I
H --> J
style D stroke:#f9f,stroke-width:2px
style E stroke:#bbf,stroke-width:2px
Extract plain text from your source documents. This could be:
Example from my blog:
// From MarkdownRenderingService
public string ExtractPlainText(string markdown)
{
// Remove code blocks
var withoutCode = Regex.Replace(markdown, @"```[\s\S]*?```", "");
// Convert markdown to plain text
var document = Markdown.Parse(withoutCode);
var plainText = document.ToPlainText();
return plainText.Trim();
}
This is where most RAG implementations fail. You can't just split text at arbitrary fixed lengths or paragraph boundaries - you need semantically coherent chunks.
Why chunking matters:
Bad chunking:
Chunk 1: "Docker is a containerization platform. It allows you"
Chunk 2: "to package applications with their dependencies. This"
Chunk 3: "ensures consistency across environments."
Good chunking:
Chunk 1: "Docker is a containerization platform. It allows you to package applications with their dependencies. This ensures consistency across environments."
Chunk 2: "Benefits of Docker:
- Isolation: Each container runs in its own environment
- Portability: Containers run anywhere Docker is installed
- Efficiency: Lightweight compared to virtual machines"
Example from my semantic search implementation:
public class TextChunker
{
private const int TargetChunkSize = 500; // ~500 words
private const int ChunkOverlap = 50; // 50 words overlap
public List<Chunk> ChunkDocument(string text, string sourceId)
{
var chunks = new List<Chunk>();
// Split on section boundaries first (## headers in markdown)
var sections = SplitOnHeaders(text);
foreach (var section in sections)
{
// If section is small enough, keep it whole
if (section.WordCount < TargetChunkSize)
{
chunks.Add(new Chunk
{
Text = section.Text,
SourceId = sourceId,
SectionHeader = section.Header
});
}
else
{
// Split large sections on sentence boundaries
var subChunks = SplitOnSentences(section.Text, TargetChunkSize, ChunkOverlap);
chunks.AddRange(subChunks.Select(c => new Chunk
{
Text = c,
SourceId = sourceId,
SectionHeader = section.Header
}));
}
}
return chunks;
}
}
Common chunking strategies:
Embeddings are the magic that makes semantic search possible. An embedding is a vector (array of numbers) that represents the meaning of text.
Key concept: Similar meanings → similar vectors
"Docker container" → [0.234, -0.891, 0.567, ..., 0.123]
"containerization platform" → [0.221, -0.903, 0.534, ..., 0.119]
"apple fruit" → [0.891, 0.234, -0.567, ..., -0.789]
The first two vectors would be "close" in vector space (high cosine similarity), while the third is far away.
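As a concrete (if toy) check of that claim, you can embed the three phrases and compare them. The snippet below assumes the GenerateEmbeddingAsync service shown later in this article, and the similarity values are illustrative only:
// Embed the three example phrases (GenerateEmbeddingAsync is shown later in this article)
var docker = await _embeddingService.GenerateEmbeddingAsync("Docker container");
var platform = await _embeddingService.GenerateEmbeddingAsync("containerization platform");
var apple = await _embeddingService.GenerateEmbeddingAsync("apple fruit");
// Cosine similarity of the resulting vectors (exact numbers depend on the model):
// similarity(docker, platform) → high, e.g. ~0.8 (related meaning)
// similarity(docker, apple)    → low,  e.g. ~0.1 (unrelated meaning)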
How embeddings are generated: Modern embedding models are neural networks trained on massive text datasets to learn semantic relationships. Popular models:
Example from my ONNX embedding service:
public async Task<float[]> GenerateEmbeddingAsync(string text)
{
// Tokenize the input text
var tokens = Tokenize(text);
// Create input tensors for ONNX model
var inputIds = CreateInputTensor(tokens);
var attentionMask = CreateAttentionMaskTensor(tokens.Length);
var tokenTypeIds = CreateTokenTypeIdsTensor(tokens.Length);
// Run ONNX inference
var inputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor("input_ids", inputIds),
NamedOnnxValue.CreateFromTensor("attention_mask", attentionMask),
NamedOnnxValue.CreateFromTensor("token_type_ids", tokenTypeIds)
};
using var results = _session.Run(inputs);
// Extract the output (sentence embedding)
var output = results.First().AsTensor<float>();
var embedding = output.ToArray();
// L2 normalize the vector for cosine similarity
return NormalizeVector(embedding);
}
Why normalization matters: After L2 normalization, cosine similarity becomes a simple dot product, making search much faster.
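The NormalizeVector helper isn't shown above; a minimal sketch of L2 normalization (my naming, not necessarily the exact implementation) looks like this:
private static float[] NormalizeVector(float[] vector)
{
    // L2 norm: square root of the sum of squared components
    var norm = (float)Math.Sqrt(vector.Sum(v => v * v));
    if (norm == 0f)
        return vector; // avoid dividing by zero for an all-zero vector
    return vector.Select(v => v / norm).ToArray();
}
// After normalization, ||v|| = 1, so cosine(a, b) reduces to the dot product a · b.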
Vector databases are optimized for storing and searching high-dimensional vectors. Unlike traditional databases that use SQL queries, vector databases use similarity search.
Key operations:
Example Qdrant implementation:
public async Task IndexDocumentAsync(
string id,
float[] embedding,
Dictionary<string, object> metadata)
{
var point = new PointStruct
{
Id = new PointId { Uuid = id },
Vectors = embedding,
Payload =
{
["title"] = metadata["title"],
["source"] = metadata["source"],
["chunk_index"] = metadata["chunk_index"],
["created_at"] = DateTime.UtcNow.ToString("O")
}
};
await _client.UpsertAsync(
collectionName: "blog_posts",
points: new[] { point }
);
}
Popular vector databases:
We'll explore setting up these databases in upcoming articles.
When a user asks a question, the RAG system needs to find the most relevant information from the knowledge base.
flowchart LR
A["User Query:<br/>'How do I use Docker Compose?'"] --> B[Generate Query Embedding]
B --> C["Query Vector:<br/>[0.445, -0.123, ...]"]
C --> D[Vector Search]
D --> E[Vector Database]
E --> F[Top K Similar Chunks]
F --> G["Results:<br/>1. Docker Compose Basics 0.92<br/>2. Multi-Container Setup 0.87<br/>3. Service Configuration 0.83"]
style B stroke:#f9f,stroke-width:3px
style D stroke:#bbf,stroke-width:3px
The user's question gets converted to a vector using the same embedding model used for indexing. This is critical - different models produce incompatible vectors.
public async Task<List<SearchResult>> SearchAsync(string query, int limit = 10)
{
// Same embedding model used for indexing
var queryEmbedding = await _embeddingService.GenerateEmbeddingAsync(query);
// Search in vector store
var results = await _vectorStoreService.SearchAsync(
queryEmbedding,
limit
);
return results;
}
The vector database computes similarity between the query vector and all stored vectors. Common metrics:
Cosine Similarity (most popular for normalized vectors):
similarity = (A · B) / (||A|| × ||B||)
Range: -1 to 1 (higher = more similar)
Euclidean Distance (for unnormalized vectors):
distance = sqrt(Σ(Ai - Bi)²)
Range: 0 to ∞ (lower = more similar)
Dot Product (when vectors are pre-normalized):
similarity = A · B
Range: -1 to 1 (higher = more similar)
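As a rough sketch, the three metrics translate to C# like this (plain loops, no SIMD, assuming both vectors have the same length):
public static class SimilarityMetrics
{
    // Dot product: the cheapest option when both vectors are already L2-normalized
    public static float DotProduct(float[] a, float[] b)
    {
        float sum = 0f;
        for (int i = 0; i < a.Length; i++)
            sum += a[i] * b[i];
        return sum;
    }

    // Cosine similarity: dot product scaled by both magnitudes
    public static float Cosine(float[] a, float[] b)
    {
        float dot = 0f, normA = 0f, normB = 0f;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / ((float)Math.Sqrt(normA) * (float)Math.Sqrt(normB));
    }

    // Euclidean distance: straight-line distance, lower = more similar
    public static float Euclidean(float[] a, float[] b)
    {
        float sum = 0f;
        for (int i = 0; i < a.Length; i++)
        {
            var diff = a[i] - b[i];
            sum += diff * diff;
        }
        return (float)Math.Sqrt(sum);
    }
}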
Example from my Qdrant service:
var searchResults = await _client.SearchAsync(
collectionName: "blog_posts",
vector: queryEmbedding,
limit: (ulong)limit,
scoreThreshold: 0.7f, // Only return results with >70% similarity
payloadSelector: true // Include all metadata
);
return searchResults.Select(hit => new SearchResult
{
Text = hit.Payload["text"].StringValue,
Title = hit.Payload["title"].StringValue,
Score = hit.Score,
Source = hit.Payload["source"].StringValue
}).ToList();
The initial retrieval is fast but approximate. Reranking uses a more sophisticated model to rescore the top K results.
flowchart LR
A[Vector Search:<br/>Top 50 Results] --> B[Reranking Model]
B --> C[Reranked:<br/>Top 10 Results]
style B stroke:#f9f,stroke-width:3px
Why reranking helps:
Example reranking implementation:
public async Task<List<SearchResult>> SearchWithRerankAsync(
string query,
int initialLimit = 50,
int finalLimit = 10)
{
// Stage 1: Fast vector search
var candidates = await SearchAsync(query, initialLimit);
// Stage 2: Precise reranking
var rerankedResults = await _rerankingService.RerankAsync(
query,
candidates
);
return rerankedResults.Take(finalLimit).ToList();
}
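The _rerankingService above is a black box in this snippet. As a hedged sketch of what it might do internally, a typical reranker runs a cross-encoder model over each (query, passage) pair and sorts by the resulting score; the scorePair delegate below stands in for whatever model call you actually use (ONNX, an HTTP API, etc.):
public class RerankingService
{
    private readonly Func<string, string, Task<float>> _scorePair;

    // scorePair runs a cross-encoder over one (query, passage) pair and returns a relevance score
    public RerankingService(Func<string, string, Task<float>> scorePair)
        => _scorePair = scorePair;

    public async Task<List<SearchResult>> RerankAsync(
        string query,
        List<SearchResult> candidates)
    {
        var scored = new List<(SearchResult Result, float Score)>();
        foreach (var candidate in candidates)
        {
            // Cross-encoders see the query and passage together, so they capture
            // interactions that separate bi-encoder embeddings miss
            var score = await _scorePair(query, candidate.Text);
            scored.Add((candidate, score));
        }
        return scored
            .OrderByDescending(s => s.Score)
            .Select(s => s.Result)
            .ToList();
    }
}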
Now that we have relevant information, we feed it to the LLM along with the user's question.
flowchart TB
A[User Query] --> B[Retrieved Context 1]
A --> C[Retrieved Context 2]
A --> D[Retrieved Context 3]
B --> E[Construct Prompt]
C --> E
D --> E
A --> E
E --> F["System: You are a helpful assistant...\n\nContext:\n1. Docker Compose allows...\n2. Services are defined...\n3. Volumes persist data...\n\nQuestion: How do I use Docker Compose?\n\nAnswer:"]
F --> G[LLM]
G --> H[Generated Answer with Citations]
style E stroke:#f9f,stroke-width:2px
style G stroke:#bbf,stroke-width:2px
This is where RAG becomes an art. You need to structure the prompt so the LLM:
Example prompt template from my Lawyer GPT system:
public string BuildRAGPrompt(string query, List<SearchResult> context)
{
var sb = new StringBuilder();
sb.AppendLine("You are a technical writing assistant. Your task is to answer the user's question using ONLY the provided context from past blog posts.");
sb.AppendLine();
sb.AppendLine("CONTEXT:");
sb.AppendLine("========");
for (int i = 0; i < context.Count; i++)
{
sb.AppendLine($"[{i + 1}] {context[i].Title}");
sb.AppendLine($"Source: {context[i].Source}");
sb.AppendLine($"Content: {context[i].Text}");
sb.AppendLine($"Relevance: {context[i].Score:P0}");
sb.AppendLine();
}
sb.AppendLine("========");
sb.AppendLine();
sb.AppendLine("INSTRUCTIONS:");
sb.AppendLine("- Answer the question using the provided context");
sb.AppendLine("- Cite sources using [1], [2], etc.");
sb.AppendLine("- If the context doesn't contain enough information, say so");
sb.AppendLine("- Maintain the technical, practical tone of the blog");
sb.AppendLine();
sb.AppendLine($"QUESTION: {query}");
sb.AppendLine();
sb.AppendLine("ANSWER:");
return sb.ToString();
}
The constructed prompt goes to the LLM for generation. This can be:
Example using local LLM:
public async Task<string> GenerateResponseAsync(string prompt)
{
var result = await _llamaSharp.InferAsync(prompt, new InferenceParams
{
Temperature = 0.7f, // Creativity (0 = deterministic, 1 = creative)
TopP = 0.9f, // Nucleus sampling
MaxTokens = 500, // Response length limit
StopSequences = new[] { "\n\n", "User:", "Question:" }
});
return result.Text.Trim();
}
Key parameters explained:
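In short: temperature rescales the logits before sampling (lower values sharpen the distribution toward the most likely token), and top-p keeps only the smallest set of tokens whose cumulative probability reaches p. A conceptual sketch of that sampling logic (not the actual LLamaSharp sampler) looks like this:
public static int SampleToken(float[] logits, float temperature, float topP, Random rng)
{
    // 1. Temperature: divide logits, then softmax. Lower temperature sharpens the distribution.
    var scaled = logits.Select(l => l / Math.Max(temperature, 1e-5f)).ToArray();
    var max = scaled.Max();
    var exp = scaled.Select(l => Math.Exp(l - max)).ToArray();
    var sum = exp.Sum();
    var probs = exp.Select(e => e / sum).ToArray();

    // 2. Top-p (nucleus): keep the highest-probability tokens until their cumulative probability reaches p
    var ranked = probs
        .Select((p, i) => (Prob: p, Index: i))
        .OrderByDescending(t => t.Prob)
        .ToList();
    var nucleus = new List<(double Prob, int Index)>();
    double cumulative = 0;
    foreach (var t in ranked)
    {
        nucleus.Add(t);
        cumulative += t.Prob;
        if (cumulative >= topP) break;
    }

    // 3. Sample a token from the (renormalized) nucleus
    var pick = rng.NextDouble() * cumulative;
    double running = 0;
    foreach (var t in nucleus)
    {
        running += t.Prob;
        if (running >= pick) return t.Index;
    }
    return nucleus[^1].Index;
}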
After the LLM generates a response, we often need to:
Example post-processing:
public RAGResponse PostProcess(string llmOutput, List<SearchResult> sources)
{
var response = new RAGResponse
{
Answer = llmOutput,
Sources = new List<Source>()
};
// Extract citations like [1], [2]
var citations = Regex.Matches(llmOutput, @"\[(\d+)\]");
foreach (Match match in citations)
{
int index = int.Parse(match.Groups[1].Value) - 1;
if (index >= 0 && index < sources.Count)
{
var source = sources[index];
response.Sources.Add(new Source
{
Title = source.Title,
Url = GenerateUrl(source.Source),
RelevanceScore = source.Score
});
}
}
// Convert markdown citations to hyperlinks
response.FormattedAnswer = Regex.Replace(
llmOutput,
@"\[(\d+)\]",
m => {
int index = int.Parse(m.Groups[1].Value) - 1;
if (index >= 0 && index < sources.Count)
{
var url = GenerateUrl(sources[index].Source);
return $"[[{m.Groups[1].Value}]]({url})";
}
return m.Value;
}
);
return response;
}
Before we move to practical applications, it's essential to understand how LLMs work internally. This knowledge helps you optimize RAG systems and avoid common pitfalls.
Tokens are the fundamental units that LLMs process. Text isn't fed directly to models - it's first broken down into tokens.
Example tokenization:
Input: "Understanding Docker containers"
Tokens: ["Under", "standing", " Docker", " containers"]
Different models use different tokenization strategies:
Why tokenization matters for RAG:
public class TokenCounter
{
// Rough approximation: 1 token ≈ 0.75 words (English)
public int EstimateTokens(string text)
{
var wordCount = text.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;
return (int)(wordCount / 0.75);
}
public int EstimateTokensAccurate(string text, ITokenizer tokenizer)
{
// Use actual tokenizer for precision
return tokenizer.Encode(text).Count;
}
}
Context window limits:
In RAG systems, you must fit:
Total tokens = System prompt + Retrieved context + User query + Response buffer
If your RAG retrieves 10 documents of 500 tokens each, that's 5,000 tokens just for context - before the query and response!
Practical RAG token management:
public class ContextWindowManager
{
private readonly int _maxContextTokens;
private readonly int _systemPromptTokens;
private readonly int _responseBufferTokens;
public ContextWindowManager(
int totalContextWindow = 4096,
int systemPromptTokens = 300,
int responseBufferTokens = 500)
{
_maxContextTokens = totalContextWindow;
_systemPromptTokens = systemPromptTokens;
_responseBufferTokens = responseBufferTokens;
}
public List<SearchResult> FitContextInWindow(
List<SearchResult> retrievedDocs,
string query)
{
var queryTokens = EstimateTokens(query);
// Available tokens for retrieved context
var availableForContext = _maxContextTokens
- _systemPromptTokens
- queryTokens
- _responseBufferTokens;
var selectedDocs = new List<SearchResult>();
var currentTokens = 0;
foreach (var doc in retrievedDocs.OrderByDescending(d => d.Score))
{
var docTokens = EstimateTokens(doc.Text);
if (currentTokens + docTokens <= availableForContext)
{
selectedDocs.Add(doc);
currentTokens += docTokens;
}
else
{
break; // Context window full
}
}
return selectedDocs;
}
private int EstimateTokens(string text)
{
// Rule of thumb: 1 token ≈ 4 characters
return text.Length / 4;
}
}
When an LLM generates text, it doesn't reprocess everything from scratch for each token. It uses a Key-Value (KV) cache to remember what it has already computed.
Transformers use an "attention" mechanism where each token "attends to" (looks at) all previous tokens to understand context.
flowchart TB
subgraph "Generation Step 1: 'Docker'"
A1[Input: 'Docker'] --> B1[Compute K,V for 'Docker']
B1 --> C1[Store in KV Cache]
C1 --> D1[Generate: 'is']
end
subgraph "Generation Step 2: 'is'"
A2[Input: 'is'] --> B2[Compute K,V for 'is']
B2 --> C2[Store in KV Cache]
C2 --> E2[Retrieve KV for 'Docker']
E2 --> F2[Attend: 'is' to 'Docker']
F2 --> D2[Generate: 'a']
end
subgraph "Generation Step 3: 'a'"
A3[Input: 'a'] --> B3[Compute K,V for 'a']
B3 --> C3[Store in KV Cache]
C3 --> E3[Retrieve KV for 'Docker', 'is']
E3 --> F3[Attend: 'a' to all previous]
F3 --> D3[Generate: 'container']
end
D1 --> A2
D2 --> A3
style C1 stroke:#f9f,stroke-width:3px
style C2 stroke:#f9f,stroke-width:3px
style C3 stroke:#f9f,stroke-width:3px
Without KV cache:
With KV cache:
This makes generation dramatically faster - the difference between 10 tokens/second and 100 tokens/second.
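To see why, count the K,V computations, using roughly the same numbers as the end-to-end example later in this article (a ~2,800-token prompt generating ~400 tokens). These are back-of-the-envelope figures, not benchmarks:
// Back-of-the-envelope comparison: counts K,V computations, not wall-clock time
int promptTokens = 2812; // prompt size from the end-to-end example below
int newTokens = 400;     // generated response length

// Without a cache, every generation step reprocesses the whole sequence so far
long withoutCache = 0;
for (int t = 0; t < newTokens; t++)
    withoutCache += promptTokens + t; // ≈ 1.2 million K,V computations in total

// With a cache, each token's K,V is computed exactly once
long withCache = promptTokens + newTokens; // 3,212 K,V computations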
The KV cache is layered rather than a single flat buffer: each layer in the model maintains its own K and V matrices, holding entries for every token processed so far.
graph TB
A[Input Tokens:<br/>'What is Docker?'] --> B[Layer 1 Attention]
B --> C[Layer 1 KV Cache]
B --> D[Layer 2 Attention]
D --> E[Layer 2 KV Cache]
D --> F[Layer 3 Attention]
F --> G[Layer 3 KV Cache]
F --> H[... up to Layer N]
H --> I[Output: 'Docker is']
C -.Key-Value pairs<br/>for all input tokens.-> C
E -.Key-Value pairs<br/>for all input tokens.-> E
G -.Key-Value pairs<br/>for all input tokens.-> G
style C stroke:#bbf,stroke-width:2px
style E stroke:#bbf,stroke-width:2px
style G stroke:#bbf,stroke-width:2px
Each layer stores:
For a model with:
- 32 layers
- 4096 hidden dimensions
- an 8192-token sequence
- FP16 precision (2 bytes per value)
The KV cache for one sequence is:
2 (K and V) × 32 layers × 4096 dimensions × 8192 tokens × 2 bytes (FP16)
≈ 4.3 GB of VRAM!
This is why long context windows are memory-intensive.
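The same arithmetic as a small helper, so you can plug in your own model's numbers (assumes 2 bytes per value, i.e. FP16):
public static class KvCacheEstimator
{
    // 2 (K and V) × layers × hidden dimension × sequence length × bytes per value
    public static double EstimateBytes(
        int layers, int hiddenDim, int sequenceLength, int bytesPerValue = 2)
        => 2.0 * layers * hiddenDim * sequenceLength * bytesPerValue;
}

// Example from above: EstimateBytes(32, 4096, 8192) ≈ 4.29e9 bytes ≈ 4.3 GB per sequence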
RAG systems can leverage KV cache optimization in clever ways:
Prompt caching (supported by some APIs like Anthropic Claude):
public class CachedRAGService
{
// System prompt and retrieved context can be cached!
public async Task<string> GenerateWithCachedContextAsync(
string systemPrompt, // Cached
List<SearchResult> context, // Cached
string userQuery) // Not cached, changes each time
{
var contextText = FormatContext(context);
// The KV cache for systemPrompt + contextText is reused across queries
var prompt = $@"
{systemPrompt}
CONTEXT:
{contextText}
QUERY: {userQuery}
ANSWER:";
return await _llm.GenerateAsync(prompt, useCaching: true);
}
}
Why this is powerful:
Practical example:
Query 1: "How do I use Docker?" → 2 seconds (no cache)
Query 2: "What are Docker benefits?" → 0.2 seconds (cache hit!)
Query 3: "Docker vs VMs?" → 0.2 seconds (cache hit!)
All three queries use the same retrieved context, so the KV cache for that context is reused.
Understanding tokens and KV cache informs your RAG architecture decisions:
Smaller chunks = more precise retrieval, but more overhead:
// Option A: Small chunks (200 tokens each)
// Retrieve 20 chunks = 4,000 tokens
// Pro: Very precise, only relevant info
// Con: More KV cache entries, slower attention
// Option B: Larger chunks (500 tokens each)
// Retrieve 8 chunks = 4,000 tokens
// Pro: Better context coherence, fewer KV entries
// Con: More noise, less precise
public class AdaptiveChunker
{
public int DetermineChunkSize(int contextWindowSize)
{
if (contextWindowSize <= 4096)
return 200; // Small chunks for limited windows
if (contextWindowSize <= 16384)
return 500; // Medium chunks
return 1000; // Large chunks for big windows
}
}
Don't max out the context window - leave room for generation:
public class SafeContextManager
{
public int GetSafeContextLimit(int totalContextWindow)
{
// Use only 75% for input, reserve 25% for output
return (int)(totalContextWindow * 0.75);
}
// Example: 4K model
// Total: 4096 tokens
// Safe input: 3072 tokens
// Reserved for output: 1024 tokens
}
In chatbots, the conversation history grows with each turn:
Turn 1:
System + Context + Query1 = 3000 tokens
Response1 = 300 tokens
Total: 3300 tokens
Turn 2:
System + Context + Query1 + Response1 + Query2 = 3650 tokens
Response2 = 300 tokens
Total: 3950 tokens
Turn 3:
System + Context + Query1 + Response1 + Query2 + Response2 + Query3 = 4250 tokens
ERROR: Context window exceeded!
Solution: Sliding window with re-retrieval
public class ConversationalRAG
{
private readonly int _maxHistoryTokens = 1000;
public async Task<string> ChatAsync(
List<ConversationTurn> history,
string newQuery)
{
// Re-retrieve context based on current query
var context = await RetrieveContextAsync(newQuery);
// Keep only recent conversation history
var relevantHistory = TrimHistory(history, _maxHistoryTokens);
var prompt = BuildPrompt(context, relevantHistory, newQuery);
return await _llm.GenerateAsync(prompt);
}
private List<ConversationTurn> TrimHistory(
List<ConversationTurn> history,
int maxTokens)
{
var trimmed = new List<ConversationTurn>();
var currentTokens = 0;
// Keep most recent turns
foreach (var turn in history.AsEnumerable().Reverse()) // Enumerable.Reverse, not the in-place List<T>.Reverse()
{
var turnTokens = EstimateTokens(turn.Query) + EstimateTokens(turn.Response);
if (currentTokens + turnTokens <= maxTokens)
{
trimmed.Insert(0, turn);
currentTokens += turnTokens;
}
else
{
break;
}
}
return trimmed;
}
}
API-based LLMs charge per token, and RAG can explode costs if you're not careful:
public class CostAwareRAG
{
// OpenAI GPT-4 pricing (example):
// Input: $0.03 per 1K tokens
// Output: $0.06 per 1K tokens
public decimal EstimateQueryCost(
int systemPromptTokens,
int retrievedContextTokens,
int queryTokens,
int expectedResponseTokens)
{
var inputTokens = systemPromptTokens + retrievedContextTokens + queryTokens;
var outputTokens = expectedResponseTokens;
var inputCost = (inputTokens / 1000m) * 0.03m;
var outputCost = (outputTokens / 1000m) * 0.06m;
return inputCost + outputCost;
}
// Example:
// System: 300 tokens
// Context: 3000 tokens (10 retrieved docs)
// Query: 50 tokens
// Response: 500 tokens
//
// Cost = ((300 + 3000 + 50) / 1000 * 0.03) + (500 / 1000 * 0.06)
// = (3350 / 1000 * 0.03) + (500 / 1000 * 0.06)
// = $0.1005 + $0.03
// = $0.1305 per query
//
// At 1000 queries/day = $130/day = $3,900/month!
}
Cost reduction strategies:
Here's how tokens, KV cache, and RAG fit together:
flowchart TB
A[User Query:<br/>'How does Docker work?'<br/>≈ 12 tokens] --> B[Generate Query Embedding]
B --> C[Vector Search]
C --> D[Retrieved Docs:<br/>5 docs × 500 tokens<br/>= 2,500 tokens]
D --> E[Construct Prompt]
A --> E
E --> F["Complete Prompt:<br/>System: 300 tokens<br/>Context: 2,500 tokens<br/>Query: 12 tokens<br/>Total: 2,812 tokens"]
F --> G[Tokenize Prompt]
G --> H["Token IDs:<br/>[245, 1034, 8829, ...]<br/>2,812 token IDs"]
H --> I[LLM Layer 1]
I --> J[Compute K,V]
J --> K[KV Cache Layer 1:<br/>2,812 K,V pairs]
I --> L[LLM Layer 2]
L --> M[Compute K,V]
M --> N[KV Cache Layer 2:<br/>2,812 K,V pairs]
L --> O[... Layers 3-32]
O --> P[Generate Token 1: 'Docker']
P --> Q[Add to KV Cache]
Q --> R[Generate Token 2: 'is']
R --> S[Add to KV Cache]
S --> T[... until completion]
T --> U["Response: 'Docker is a containerization platform...'<br/>≈ 400 tokens"]
style K stroke:#f9f,stroke-width:2px
style N stroke:#f9f,stroke-width:2px
style Q stroke:#bbf,stroke-width:2px
style S stroke:#bbf,stroke-width:2px
Key insights:
Understanding tokens and KV cache leads to better RAG design:
1. Pre-compute and cache common contexts:
// Cache KV for frequently used system prompts + static context
var cachedSystemContext = await _llm.PrecomputeKVCache(systemPrompt + staticContext);
// Reuse for each query (much faster)
foreach (var query in userQueries)
{
var response = await _llm.GenerateAsync(query, reuseKVCache: cachedSystemContext);
}
2. Optimize chunk boundaries:
// Bad: Arbitrary 500-character chunks
var chunks = text.Chunk(500);
// Good: Chunk on sentence boundaries, measure in tokens
public List<string> ChunkByTokens(string text, int maxTokensPerChunk)
{
var sentences = SplitIntoSentences(text);
var chunks = new List<string>();
var currentChunk = new StringBuilder();
var currentTokens = 0;
foreach (var sentence in sentences)
{
var sentenceTokens = EstimateTokens(sentence);
if (currentTokens + sentenceTokens > maxTokensPerChunk && currentTokens > 0)
{
chunks.Add(currentChunk.ToString());
currentChunk.Clear();
currentTokens = 0;
}
currentChunk.Append(sentence).Append(" ");
currentTokens += sentenceTokens;
}
if (currentTokens > 0)
chunks.Add(currentChunk.ToString());
return chunks;
}
3. Monitor token usage in production:
public class RAGTelemetry
{
public void LogRAGQuery(
string query,
List<SearchResult> retrievedDocs,
string response)
{
var queryTokens = EstimateTokens(query);
var contextTokens = retrievedDocs.Sum(d => EstimateTokens(d.Text));
var responseTokens = EstimateTokens(response);
var totalTokens = queryTokens + contextTokens + responseTokens;
_logger.LogInformation(
"RAG Query: {Query} | Context: {ContextTokens} tokens from {DocCount} docs | " +
"Response: {ResponseTokens} tokens | Total: {TotalTokens} tokens",
query, contextTokens, retrievedDocs.Count, responseTokens, totalTokens
);
// Alert if approaching context limit
if (totalTokens > _maxTokens * 0.9)
{
_logger.LogWarning("Approaching token limit: {TotalTokens}/{MaxTokens}",
totalTokens, _maxTokens);
}
}
}
We've covered the complete technical architecture of RAG systems:
Phase 1: Indexing
Phase 2: Retrieval
Phase 3: Generation
LLM Internals
Key technical insights:
You now understand how RAG works at the technical level. But theory only gets you so far. How do you actually build these systems? What challenges will you face? What advanced techniques can you use?
In Part 3: RAG in Practice, we move from architecture to implementation:
Real-world applications:
Common challenges and solutions:
Advanced techniques:
Getting started:
Continue to Part 3: RAG in Practice →
Foundational Papers:
Tools and Frameworks:
Further Reading:
Series navigation: