If you're brand new to RAG, start with RAG Explained and RAG Architecture. This post is for the point where you've built a RAG pipeline that mostly works… and now you're paying for it in cost, fragility, and "why did it answer that?" debugging sessions.
What this covers: The architectural pattern behind production RAG. For hands-on implementation, see Building a Document Summarizer with RAG.
Where it comes from: I built DocSummarizer (Document RAG engine), DataSummarizer (Data RAG engine), ImageSummarizer (Image RAG engine), and lucidRAG (multi-document Q&A). The same pattern kept emerging across documents, data, and images.
NOTE: You may have rightly noticed the semantic search on this site is broken...and yeah it's a config issue I haven't had time to fix yet. NOT bad theory 🤓
| Term | What It Means | Example |
|---|---|---|
| RAG | Retrieval-Augmented Generation | Giving an LLM your company docs before answering |
| Chunking | Splitting documents into smaller pieces | Breaking a PDF into paragraphs |
| Top-k | Get the k "best" results from a search | "Show me the 5 most relevant chunks" |
| Vector search | Finding similar text using embeddings | "Find docs similar to this error message" |
| BM25 | Keyword search | "Find docs containing 'timeout' AND 'database'" |
| Context window | How much text you can fit in a prompt | Finite and expensive |
| Signals | Facts extracted without AI | "Created: 2025-01-15", "Priority: High" |
Retrieval-Augmented Generation (RAG) has become one of those phrases that means everything and nothing.
For a lot of teams, “doing RAG” now means: chunk everything, embed every chunk, run a vector search, and paste the top-k results into the prompt for the model to sort out.
It works. Until it doesn’t.
It’s expensive, it’s hard to reason about, and it quietly hands responsibility to the model: parse the structure, enforce the filters, decide what matters, reconcile contradictions, and be confident about it.
When people say “RAG hallucinates” or “RAG doesn’t filter well”, they’re usually blaming the wrong component.
The problem isn’t RAG. The problem is lazy RAG.
This post describes a different default: Reduced RAG - use LLMs less, not more, by extracting deterministic signals up front and treating models as synthesis engines, not datastores.
A context window is a working buffer: finite, expensive, and paid for again on every single call.
It is not a datastore.
When you “just stick more in the prompt”, you’re asking the model to re-derive the structure, re-apply the filters, and re-decide what matters - on every query, at inference prices.
That’s why RAG demos look great and production RAG systems quietly rot.
Reduced RAG flips the default.
Instead of:
Retrieve text and let the model decide what matters.
You do:
Decide what matters once, store it, and only involve the model when you need synthesis.
In practice that means: deterministic ingestion, cheap retrieval, bounded generation.
If this sounds familiar, it’s the same systems shape as Constrained Fuzziness: let probabilistic components propose; let deterministic systems decide.
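If you want that shape in code terms, it's roughly three contracts. A sketch only - the type names (SourceDocument, Evidence) are illustrative, not a framework:

```csharp
// Placeholder types for the sketch.
public record SourceDocument(string Id, string Text);
public record Evidence(string DocumentId, string Text);

public interface IIngestor
{
    // Runs once per document: parse structure, extract signals, embed, store.
    Task IngestAsync(SourceDocument document);
}

public interface IRetriever
{
    // Runs per query, no LLM: structured filters + BM25 + vector search.
    Task<IReadOnlyList<Evidence>> RetrieveAsync(string question);
}

public interface ISynthesizer
{
    // The only LLM call: bounded synthesis over a small evidence pack.
    Task<string> SynthesizeAsync(string question, IReadOnlyList<Evidence> evidence);
}
```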
First, let's see what most teams do (and why it's expensive):
flowchart LR
subgraph Traditional["❌ Traditional RAG: Everything Through the LLM"]
T1[Documents] --> T2[Chunk Everything]
T2 --> T3["Embed All Chunks<br/>(once)"]
T3 --> T4[Vector Search]
T4 --> T5["Paste Top-K<br/>(per query)"]
T5 --> T6["LLM Decides<br/>(per query)"]
T6 --> T7[Answer]
end
style Traditional stroke:#ef4444,stroke-width:3px
style T6 stroke:#ef4444,stroke-width:3px
style T5 stroke:#ef4444,stroke-width:2px
Problems:
- Every query pays for the full context (50 chunks, every time)
- Filtering lives in the prompt, where the model is free to ignore it
- When an answer is wrong, you can't tell which of the 50 chunks caused it
Now here's the Reduced RAG approach:
flowchart TB
subgraph Ingestion["✅ Ingestion (Once - Pay Upfront)"]
I1[Source Docs] --> I2[Parse Structure]
I2 --> I3["Extract Signals<br/>(deterministic)"]
I3 --> I4[(Structured Fields)]
I2 --> I5[Semantic Units]
I5 --> I6[(Vector Store)]
I2 --> I7[(Evidence Store)]
end
subgraph Query["Query Time (Cheap Per Query)"]
Q1[User Question] --> Q2{Extract Filters}
Q2 --> Q3["Filter Database<br/>(no LLM)"]
Q2 --> Q4["BM25 Search<br/>(no LLM)"]
Q2 --> Q5["Vector Search<br/>(no LLM)"]
Q3 --> R[Candidate Set]
Q4 --> R
Q5 --> R
R --> S["Small Evidence Pack<br/>(5 sources not 50)"]
end
subgraph Generation["LLM (Bounded Synthesis Per Query)"]
S --> L[Synthesize Answer]
L --> A[Answer + Citations]
end
style Ingestion stroke:#22c55e,stroke-width:3px
style Query stroke:#3b82f6,stroke-width:3px
style Generation stroke:#f59e0b,stroke-width:2px
style I3 stroke:#22c55e,stroke-width:3px
style Q3 stroke:#3b82f6,stroke-width:3px
style L stroke:#f59e0b,stroke-width:3px
Key differences:
| Phase | Traditional RAG | Reduced RAG |
|---|---|---|
| Ingestion | Just chunk and embed | Extract dates, categories, entities, quality flags |
| Filtering | Ask LLM to filter in prompt | Database WHERE clauses |
| Search | Vector only | BM25 + Vector + Structured filters |
| LLM Role | Parse, filter, AND answer | Just synthesize answer |
| Cost | Every query pays for full context | Ingestion once, queries cheap |
At ingestion time, parse what you can without an LLM.
Example: Support Ticket System
Instead of just chunking tickets into text:
Traditional: "Ticket #1234: Customer complained about slow loading..."
→ Embed entire text
→ Hope the LLM figures out it's about performance
Extract signals up front:
// Parse once during ingestion
var ticket = new SupportTicket
{
Id = "1234",
CreatedDate = DateTime.Parse("2025-01-15"),
Category = "Performance", // ← Deterministic field
Product = "WebApp", // ← Filterable
Priority = "High", // ← Sortable
Customer = "Enterprise", // ← Segment filter
SentimentScore = -0.3, // ← Computed once
Tags = ["slow-loading", "timeout"], // ← Searchable
Text = "Customer complained about..." // ← Still keep for RAG
};
What you can extract deterministically:
| Signal Type | Examples | Why It Matters |
|---|---|---|
| Temporal | Created date, updated date, expiry | "Only show last week's errors" |
| Categorical | Status, priority, department, product | "Filter to Mobile team tickets" |
| Numeric | Price, quantity, score, confidence | "Sort by severity" |
| Identity | Author, customer ID, SKU, invoice # | "All tickets from Customer X" |
| Quality | OCR confidence, auto-generated flag | "Exclude low-confidence OCR" |
| Provenance | Source system, file path, URL | "Only from production logs" |
Do it once. Keep the output. That's the substrate your system can rely on.
The hard part of RAG isn't the model - it's everything before the model. For concrete implementations, see Building a Document Summarizer with RAG and DocSummarizer: Building RAG Pipelines.
Instead of treating "chunks" as the unit of truth, store structured fields, signal-dense units, embeddings, and pointers back to source.
Text is still part of the system - but it becomes evidence, not "whatever we happened to paste into the prompt".
This is the difference between "RAG answers are vibes" and "RAG answers are inspectable".
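One way to make "evidence" concrete is to store each retrievable unit with a pointer back to exactly where it came from. A minimal sketch - the field names are illustrative:

```csharp
// A retrievable unit of evidence: the text plus a pointer back to its source,
// so every answer can cite the exact span a claim came from.
public record EvidenceUnit(
    string DocumentId,   // which source document
    string SourcePath,   // file path, URL, or source system
    int StartOffset,     // character span within the source...
    int EndOffset,       // ...so citations point at exact text
    string Text);        // the evidence itself
```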
How do you know which signals you'll need? You don't. Not upfront.
That's the whole point of storing raw signals instead of LLM-processed summaries.
Traditional RAG (expensive to change):
// ❌ You asked the LLM to "summarize the key points"
var summary = await llm.Summarize(ticket.Text); // Expensive
await db.SaveAsync(summary); // Threw away the original structure
// Later: "Actually, we need sentiment scores too"
// 😱 Have to re-process 10,000 tickets through LLM again!
Reduced RAG (cheap to change):
// ✅ Store raw signals extracted deterministically
var signals = new TicketSignals
{
Text = ticket.Text, // ← Keep original
WordCount = ticket.Text.Split().Length, // ← Cheap to compute
ContainsErrorCode = Regex.IsMatch(ticket.Text, @"ERR-\d+"),
MentionedProducts = ExtractProducts(ticket.Text), // ← Heuristic
SentimentWords = CountSentimentWords(ticket.Text), // ← Word lists
CreatedHour = ticket.Created.Hour // ← Maybe useful later?
};
// Later: "We need to prioritize by sentiment"
// ✅ Just add a computed column - no LLM re-run needed!
await db.Database.ExecuteSqlRawAsync(@"
    ALTER TABLE Tickets ADD COLUMN SentimentScore int GENERATED ALWAYS AS (
        (SentimentWords->>'positive')::int - (SentimentWords->>'negative')::int
    ) STORED
");
What just happened: the new signal came from data you already stored - no LLM re-run over 10,000 tickets, no re-embedding, just a computed column.
This is the same “keep the raw data, derive views later” pattern you already trust elsewhere in data engineering.
The rule: remember well before the LLM ever gets involved, and you can update your "memory" (the retrieval logic) for free.
Key architectural benefit: Evolvability. Because you store deterministic signals alongside embeddings (not instead of them), each part of the system can change independently. Swap embedding models? Re-embed without touching signals. Add a new signal? Compute it from stored text without re-embedding. Tune ranking? Adjust signal weights and scoring logic without reindexing.
With multi-vector stores (like Qdrant), you can even add multiple embeddings per document. Want to try a new embedding model? Add it alongside the existing one and gradually transition. Test both in production, compare quality, then deprecate the old one — all without reindexing or disrupting service.
Each component can evolve without forcing a full pipeline rebuild — and without degrading the others.
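As a small example of that independence: adding a signal later is just a backfill over text you already stored. A sketch only - the MentionsSlaBreach field is hypothetical:

```csharp
// Backfill a brand-new signal from stored text: no LLM calls, no re-embedding.
var docs = await db.Documents.ToListAsync();
foreach (var doc in docs)
{
    // New requirement that arrived after ingestion: flag SLA-breach mentions.
    doc.MentionsSlaBreach =
        doc.Text.Contains("SLA", StringComparison.OrdinalIgnoreCase) &&
        doc.Text.Contains("breach", StringComparison.OrdinalIgnoreCase);
}
await db.SaveChangesAsync();
```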
Store anything you can compute cheaply and deterministically.
Important: Signals are just short text strings, numbers, and booleans—not giant data structures. You're storing "Performance" (8 chars), not "Customer complained about slow loading times..." (50+ words).
public class DocumentSignals
{
// Always extract (almost free)
public int CharCount { get; set; } // Example: 1247
public int WordCount { get; set; } // Example: 203
public int ParagraphCount { get; set; } // Example: 5
public string[] UniqueWords { get; set; } // Example: ["timeout", "error", "api"]
// Structural (parse once)
public bool HasCodeBlocks { get; set; } // true/false
public bool HasLinks { get; set; } // true/false
public int HeadingCount { get; set; } // Example: 3
// Heuristic (simple patterns)
public string[] MentionedProducts { get; set; } // Example: ["WebApp", "API"]
public string[] ErrorCodes { get; set; } // Example: ["ERR-404", "ERR-500"]
public Dictionary<string, int> SentimentWords { get; set; } // { "positive": 3, "negative": 7 }
// Metadata (already available)
public DateTime Created { get; set; } // Example: 2025-01-15T14:23:00
public string Author { get; set; } // Example: "user@example.com"
public string Category { get; set; } // Example: "Performance" (not essay-length)
// Computed (cheap math)
public double ReadingTimeMinutes { get; set; } // Example: 4.2
public double KeywordDensity { get; set; } // Example: 0.034
}
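The heuristic fields above (MentionedProducts, ErrorCodes, SentimentWords) don't need a model at all. Here's a minimal sketch of the kind of helpers that could populate them - the product names and word lists are stand-ins for your own:

```csharp
using System.Text.RegularExpressions;

public static class SignalHelpers
{
    // Stand-in lists: swap in your real product catalogue and sentiment word lists.
    static readonly string[] KnownProducts = { "WebApp", "API", "MobileApp" };
    static readonly string[] PositiveWords = { "fast", "great", "resolved" };
    static readonly string[] NegativeWords = { "slow", "broken", "timeout" };

    public static string[] ExtractProducts(string text) =>
        KnownProducts
            .Where(p => text.Contains(p, StringComparison.OrdinalIgnoreCase))
            .ToArray();

    public static string[] ExtractErrorCodes(string text) =>
        Regex.Matches(text, @"ERR-\d+")
            .Select(m => m.Value)
            .Distinct()
            .ToArray();

    public static Dictionary<string, int> CountSentimentWords(string text)
    {
        var words = text.ToLowerInvariant()
            .Split(new[] { ' ', '\n', '\t', '.', ',', '!', '?' },
                   StringSplitOptions.RemoveEmptyEntries);
        return new Dictionary<string, int>
        {
            ["positive"] = words.Count(w => PositiveWords.Contains(w)),
            ["negative"] = words.Count(w => NegativeWords.Contains(w))
        };
    }
}
```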
Storage cost comparison:
| What You Store | Size per Document | 10k Documents |
|---|---|---|
| Full text | ~5 KB | 50 MB |
| All these signals | ~500 bytes | 5 MB |
| LLM-generated summary | ~2 KB | 20 MB |
Signals are 10x smaller than text, 4x smaller than LLM summaries, and infinitely cheaper to recompute (because you don't need an LLM).
When you realize you needed a different signal: Just compute it from the stored raw data. Storage is cheap; LLM inference is expensive.
This is why "over-extraction" is safe: you pay once in storage (negligible), not repeatedly in inference (substantial).
Most "RAG queries" are really just search queries with constraints:
Traditional RAG (expensive, unreliable):
// ❌ Paste everything into prompt and hope
var chunks = await vectorSearch.SearchAsync(query, k: 50); // 50 chunks!
var prompt = $@"
Given these 50 chunks of text, answer the question but ONLY use
docs from last week and ONLY for UK customers.
Chunks: {string.Join("\n", chunks)}
Question: {userQuestion}
";
var answer = await llm.GenerateAsync(prompt); // Expensive + unreliable
Reduced RAG (cheap, deterministic):
// ✅ Filter first, retrieve less, synthesize last
var candidates = await db.Tickets
.Where(t => t.CreatedDate > DateTime.Now.AddDays(-7)) // ← Database does this
.Where(t => t.Region == "UK") // ← Not the LLM!
.ToListAsync();
// Hybrid search on the filtered set
// (In production: push BM25 to database/index, not in-memory LINQ)
var bm25Results = candidates.Where(c => c.Text.Contains(keyword));
var vectorResults = await vectorSearch.SearchAsync(query, k: 5, filter: candidates);
// Small evidence pack
var evidence = RRF.Merge(bm25Results, vectorResults).Take(5);
// LLM only synthesizes
var answer = await llm.GenerateAsync($@"
Synthesize an answer using ONLY these 5 sources:
{FormatEvidence(evidence)}
Question: {userQuestion}
");
What just happened: the database enforced the filters, hybrid search narrowed the candidates, and the LLM only ever saw five sources to synthesize - not fifty to police.
This is why hybrid search matters in production. If you haven't built it yet, Hybrid Search & Auto-Indexing is the most practical piece of the whole RAG series.
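The RRF.Merge helper used above isn't spelled out in this post, but Reciprocal Rank Fusion itself is small. A minimal sketch over document IDs (the signature is illustrative; a real helper would carry typed results through):

```csharp
public static class RRF
{
    // Each input list is ordered best-first by its own scorer (BM25, vectors, ...).
    // A document's fused score is the sum over lists of 1 / (rrfK + rank).
    public static List<string> Merge(IReadOnlyList<IReadOnlyList<string>> rankedLists,
                                     int take = 5, int rrfK = 60)
    {
        var scores = new Dictionary<string, double>();
        foreach (var list in rankedLists)
        {
            for (int rank = 0; rank < list.Count; rank++)
            {
                var id = list[rank];
                scores[id] = scores.GetValueOrDefault(id) + 1.0 / (rrfK + rank + 1);
            }
        }
        return scores
            .OrderByDescending(kv => kv.Value)
            .Select(kv => kv.Key)
            .Take(take)
            .ToList();
    }
}
```

Documents that rank reasonably well in both lists float to the top even if neither scorer put them first - which is exactly why hybrid search beats either signal on its own.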
If a query is genuinely messy, you can use a small model to extract intent/filters ("compare vs explain vs troubleshoot"):
// Use small model to propose filters (ephemeral - discarded after use)
var intent = await smallModel.ExtractIntent(userQuestion);
// Returns: { intent: "troubleshoot", filters: { priority: "high", product: "api" } }
// Validate the proposal against known fields (deterministic)
var validatedFilters = ValidateAgainstSchema(intent.filters);
// Use validated filters for retrieval
var results = await db.Tickets.Where(validatedFilters).ToListAsync();
Core principle: The LLM proposes, the deterministic layer validates.
This is the foundation of the Constrained Fuzziness pattern.
For a deeper dive into ephemeral execution patterns where you use LLMs temporarily without persisting their outputs, see Fire and Don't Quite Forget.
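The ValidateAgainstSchema step above is where the safety lives. A minimal sketch, assuming a hand-maintained whitelist of known fields and values (all names illustrative):

```csharp
public static class FilterValidator
{
    // Only fields and values we already know about ever reach the database.
    static readonly Dictionary<string, HashSet<string>> AllowedFilters = new()
    {
        ["priority"] = new(StringComparer.OrdinalIgnoreCase) { "low", "medium", "high" },
        ["product"]  = new(StringComparer.OrdinalIgnoreCase) { "webapp", "api", "mobile" }
    };

    public static Dictionary<string, string> ValidateAgainstSchema(
        Dictionary<string, string> proposed)
    {
        var validated = new Dictionary<string, string>();
        foreach (var (field, value) in proposed)
        {
            // Anything the small model invented gets dropped here, not downstream.
            if (AllowedFilters.TryGetValue(field, out var allowed) && allowed.Contains(value))
                validated[field.ToLowerInvariant()] = value.ToLowerInvariant();
        }
        return validated;
    }
}
```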
Only after retrieval do you call the LLM. You give it a small, explicit set of facts with evidence pointers and clear uncertainty instructions.
The model's job: synthesize, explain, compare. Not: enforce constraints, decide truth, invent structure.
If you care about “LLMs dragging the whole context window into the answer”, you’ve already seen the failure mode. I called that out explicitly in Constrained Fuzzy Context Dragging: it’s not malicious - it’s what you asked it to do.
Order-of-magnitude comparison for a typical customer support RAG system:
Scenario: 10,000 support tickets, 1,000 queries per day
| Approach | Ingestion Cost | Per-Query Cost | Daily Cost |
|---|---|---|---|
| Traditional RAG | Embed 10k tickets | 50 chunks × large context | ~$75/day |
| Reduced RAG | Embed + extract signals | 5 chunks × small context | ~$7.50/day |
Order of magnitude: ~10x cost reduction per day
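Rough arithmetic behind those figures, under one set of assumptions (~500 tokens per chunk, ~$3 per million input tokens - swap in your own prices):

```csharp
const int QueriesPerDay = 1_000;
const int TokensPerChunk = 500;              // assumption
const double DollarsPerMillionTokens = 3.0;  // assumption - model-dependent

double DailyCost(int chunksPerQuery) =>
    QueriesPerDay * chunksPerQuery * TokensPerChunk / 1_000_000.0 * DollarsPerMillionTokens;

Console.WriteLine(DailyCost(50)); // ≈ $75/day   (traditional: 50 chunks per query)
Console.WriteLine(DailyCost(5));  // ≈ $7.50/day (reduced: 5 chunks per query)
```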
Even better: With good evidence and deterministic filtering, you likely don't need a frontier model at all. A local Ollama model (free) with a small evidence pack often outperforms GPT-4 fed 50 chunks it has to parse itself.
Plus: reduced debugging time, fewer support escalations, and model flexibility (swap models without architectural changes).
Traditional RAG (hope-based filtering):
// ❌ Prompt says "only UK customers" but model can ignore it
var answer = await llm.Generate(prompt); // No guarantee
Reduced RAG (enforced filtering):
// ✅ Database physically prevents non-UK results
var results = db.Tickets.Where(t => t.Region == "UK"); // Guaranteed
Why it's safer: Filters are enforced before generation (model never owned them, can't forget them), and hallucinations are bounded by the evidence you control.
Traditional RAG: "The model said this, but I'm not sure why or which chunk it came from."
Reduced RAG:
// You know exactly why each result matched
var result = new SearchResult
{
Text = "Server timeout error",
MatchedBecause = new[]
{
"Region = UK (database filter)",
"Created in last 7 days (date filter)",
"BM25 score: 4.2 (keyword 'timeout')",
"Vector similarity: 0.89 (semantic match)"
},
SourceChunks = [chunk1, chunk2], // ← Audit trail
ConfidenceScore = 0.89
};
You can inspect why each result matched, show evidence when confidence is low, and debug production issues by examining the candidate set before LLM synthesis.
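One way to act on "show evidence when confidence is low" - a fragment-level sketch, where `results` is the merged candidate list and both the RagResponse shape and the 0.5 threshold are illustrative:

```csharp
public record RagResponse(string Answer, List<SearchResult> Evidence, bool Synthesized);

const double ConfidenceThreshold = 0.5; // illustrative - tune against your own data

// If nothing matched strongly, return the evidence itself instead of forcing an answer.
if (results.All(r => r.ConfidenceScore < ConfidenceThreshold))
{
    return new RagResponse(
        Answer: "No high-confidence match - here is the closest evidence instead.",
        Evidence: results.Take(3).ToList(),
        Synthesized: false); // no LLM call made at all
}
```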
If you already have a working RAG system, here's how to migrate:
Before:
public class Document
{
public string Id { get; set; }
public string Text { get; set; } // ← Only unstructured text
public float[] Embedding { get; set; }
}
After:
public class Document
{
public string Id { get; set; }
public string Text { get; set; } // ← Keep for evidence
public float[] Embedding { get; set; }
// Add deterministic signals (extract once during ingestion)
public DateTime CreatedDate { get; set; } // ← Parse from metadata
public string Category { get; set; } // ← Extract from filename/tags
public string Author { get; set; } // ← From file properties
public string[] Tags { get; set; } // ← Parse from content/metadata
public double QualityScore { get; set; } // ← Compute heuristics
}
Before:
var prompt = "Only use docs from last month for product 'API'. Query: " + userQuery;
var chunks = await vectorStore.Search(userQuery, k: 50);
var answer = await llm.Generate(prompt + chunks); // ❌ LLM might ignore filters
After:
// ✅ Database enforces filters
var candidates = await db.Documents
.Where(d => d.CreatedDate > DateTime.Now.AddMonths(-1))
.Where(d => d.Category == "API")
.ToListAsync();
// Search only the filtered candidates
var results = await vectorStore.Search(userQuery, k: 5, filter: candidates.Select(c => c.Id));
var answer = await llm.Generate(FormatEvidence(results)); // Much smaller context
// Combine keyword and semantic search
var keywordResults = await db.Documents
.Where(d => EF.Functions.ToTsVector("english", d.Text)
.Matches(EF.Functions.ToTsQuery("english", keywords)))
.ToListAsync();
var vectorResults = await vectorStore.Search(userQuery, k: 20);
// Merge using Reciprocal Rank Fusion (RRF)
var merged = RRF.Merge(keywordResults, vectorResults, k: 5);
See Hybrid Search & Auto-Indexing for a complete implementation.
Track these metrics before and after:
public class RAGMetrics
{
public int PromptTokens { get; set; } // Should drop by 80-90%
public int CandidatesRetrieved { get; set; } // Should drop from 50+ to 5-10
public TimeSpan QueryLatency { get; set; } // Should improve
public bool FiltersEnforced { get; set; } // Should be true
public List<string> EvidenceSources { get; set; } // Should be traceable
}
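A rough sketch of capturing those numbers around the query path (RetrieveCandidatesAsync stands in for your filter + hybrid search step, and the token count is a chars/4 approximation, not a real tokenizer):

```csharp
var stopwatch = System.Diagnostics.Stopwatch.StartNew();

var candidates = await RetrieveCandidatesAsync(userQuestion);   // filters + BM25 + vectors
var evidence = candidates.Take(5).ToList();
var prompt = FormatEvidence(evidence) + "\nQuestion: " + userQuestion;
var answer = await llm.GenerateAsync(prompt);

stopwatch.Stop();
var metrics = new RAGMetrics
{
    PromptTokens = prompt.Length / 4,                 // crude approximation
    CandidatesRetrieved = candidates.Count,
    QueryLatency = stopwatch.Elapsed,
    FiltersEnforced = true,                           // enforced in the database, not the prompt
    EvidenceSources = evidence.Select(e => e.Id).ToList()
};
```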
If your system depends on ever-larger context windows to stay accurate, you don't have a retrieval problem. You have an ingestion problem.
| Aspect | Summary |
|---|---|
| Intent | Make RAG predictable, cheap, and debuggable |
| Forces | High token costs, weak filtering, silent hallucinations, untyped prompts |
| Solution | Extract signals once → enforce constraints deterministically → retrieve evidence → let the LLM synthesise within a budget |
| Consequences | More engineering up front; dramatically better operational behaviour |
Reduced RAG isn't anti-LLM. It's anti-waste.
Think of it this way:
Traditional RAG = "LLM, here's 50 chunks. Figure out what matters and answer."
Reduced RAG = "Database, filter to 100. BM25, find keywords. Vector, find similar.
Now LLM, here are the 5 most relevant sources. Synthesize an answer."
The shift: stop asking the model to decide what matters at query time - decide it once at ingestion and store the result.
The primitives: deterministic signals, database filters, hybrid search (BM25 + vectors), small evidence packs, bounded synthesis.
The pattern: extract once → filter deterministically → retrieve a little → synthesize within a budget.
It's what happens when you apply normal software engineering discipline to systems that happen to include language models.
If you're ready to build this: start with Building a Document Summarizer with RAG for the hands-on pipeline and Hybrid Search & Auto-Indexing for the retrieval layer.
Reference implementations: DocSummarizer (documents), DataSummarizer (data), ImageSummarizer (images), and lucidRAG (multi-document Q&A).
The infrastructure is already there. You just have to stop treating the context window like a database.