Saturday, 17 January 2026
NOTE: This is not a conventional blog article. It is a design spec written for a concrete feature in lucidRAG. I iteratively feed this document to code-focused LLMs during development to reason about trade-offs, validate assumptions, and converge on a rational implementation.
This document describes a subsystem from lucidRAG, a project I’m actively developing.
One core requirement of lucidRAG is the ability to extract segments of evidence (sentences, paragraphs, headings, captions, frames, or structured blocks) and ensure those segments are deduplicated without destroying useful signal.
lucidRAG works by analysing and extracting the best available evidence from documents, images, audio, and structured data. Unlike most RAG implementations, it does not store LLM-generated summaries as the primary artefact. In many cases, ingestion requires no LLM at all (though one can be used when escalation is justified).
Instead, lucidRAG applies a wide range of deterministic and probabilistic techniques to extract, score, and deduplicate that evidence.
Deduplication is a critical part of this process.
Simple string equality is not enough. The same concept is frequently expressed using different wording, structure, or modality. Treating those as distinct leads to redundant storage and poor downstream behaviour.
The problem compounds at retrieval time.
When results are retrieved (via SQL, vector embeddings, BM25, or hybrids), feeding an LLM multiple segments that all express the same underlying idea produces dull, repetitive answers. Five near-identical chunks from different documents do not add clarity — they dilute it.
To address this, lucidRAG treats deduplication as a first-class compilation problem, not a post-hoc filter.
The remainder of this document describes how that deduplication strategy was designed.
lucidRAG uses a two-phase deduplication strategy to eliminate redundant content while preserving important signals. This is a signal-preserving filter, not content normalization.
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ INGESTION (Per Document)                                                    │
│                                                                             │
│ Document → Extract → Embed → DEDUPE (intra-doc) → Index to Vector Store    │
│                                 │                                           │
│                                 ├─ Near-duplicates: boost salience          │
│                                 └─ Exact duplicates: drop (no boost)        │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ RETRIEVAL (Cross Document)                                                  │
│                                                                             │
│ Query → Search → Rank (RRF) → DEDUPE (cross-doc) → Top K → LLM Synthesis   │
│                                 │                                           │
│                                 └─ Keep segment with highest RRF score     │
└─────────────────────────────────────────────────────────────────────────────┘
```
These invariants are maintained by the deduplication system:
| Guarantee | Description |
|---|---|
| Ordering preserved | Deduplication never reorders results after RRF ranking |
| No concept loss | Deduplication never removes all instances of a concept |
| Document boundary respected | Deduplication never crosses document boundaries at ingestion |
| Content immutable | Deduplication never alters embeddings or text content, only selection and salience scores |
| Deterministic | Given identical inputs and config, deduplication produces identical outputs |
This system explicitly does not attempt to normalize, rewrite, or merge content: it only selects which segments survive and adjusts their salience scores.
Deduplication is fully deterministic: given identical inputs and configuration, it always produces identical outputs, with no randomness in selection or ordering.
Why this matters: users debugging retrieval results can trust that re-running with the same inputs produces the same outputs. This aligns with lucidRAG's broader "constrained fuzziness" philosophy: fuzzy matching with deterministic behavior.
Goal: Reduce storage and prevent intra-document redundancy.
Problem: Documents often contain repeated content: boilerplate such as contact blocks and footers, plus the same concept restated in different wording.
Solution: Deduplicate within each document before indexing.
Key Insight: Near-duplicates are treated as independent evidence of importance, not redundancy. If an author explains a concept three different ways, that concept matters. We capture this signal by boosting salience.
Goal: Prevent the LLM from receiving redundant information across documents.
Problem: When querying across multiple documents, similar paragraphs may appear in different sources. The LLM shouldn't describe the same information multiple times.
Solution: After ranking by RRF (which combines semantic similarity, keyword match, salience, and freshness), deduplicate across documents keeping the highest-scoring segment.
Why after RRF? The RRF score represents the best holistic measure of relevance. Deduplicating before ranking would lose this signal.
The separation is intentional:
| Phase | What it captures |
|---|---|
| Ingestion boost | Author emphasis — how much the document stresses a concept |
| Retrieval score | Query relevance — how well content matches user intent |
Mixing these at retrieval would entangle document intent with user intent. A concept repeated 5 times in a document is important to that document, but may not be relevant to this query. By boosting at ingestion, we preserve the author's signal without biasing query results.
Implementation: `DeduplicateSegments()` in `src/Mostlylucid.DocSummarizer.Core/Services/BertRagSummarizer.cs`
1. Filter segments by minimum salience threshold (0.05)
2. Sort by salience score (highest first)
3. For each segment:
a. If no embedding → keep (can't compare)
b. Check cosine similarity against all selected segments
c. If similarity >= 0.90:
- If same ContentHash → exact duplicate, drop silently
- If different ContentHash → near-duplicate, boost kept segment's salience
d. If no match → add to selected list
4. Apply salience boosts: +15% per near-duplicate merged (linear mode; the default logarithmic decay is described under Boost Decay Modes below)
5. Cap salience at 1.0 to prevent any single concept from dominating
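A minimal sketch of the steps above (the `Segment` class and `Cosine` helper are illustrative stand-ins; the real logic lives in `DeduplicateSegments()`):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class Segment
{
    public required string ContentHash { get; init; }
    public float[]? Embedding { get; init; }
    public double Salience { get; set; }
}

public static class IngestionDedup
{
    const double SimilarityThreshold = 0.90; // duplicates at or above this
    const double SalienceThreshold = 0.05;   // noise floor
    const double BoostPerNearDup = 0.15;     // linear mode, as in step 4

    public static List<Segment> Deduplicate(IEnumerable<Segment> segments)
    {
        var selected = new List<Segment>();
        var nearDupCounts = new Dictionary<Segment, int>();

        foreach (var seg in segments
                     .Where(s => s.Salience >= SalienceThreshold) // step 1: filter noise
                     .OrderByDescending(s => s.Salience))         // step 2: salience order
        {
            if (seg.Embedding is null) { selected.Add(seg); continue; } // step 3a

            var match = selected.FirstOrDefault(kept =>
                kept.Embedding is not null &&
                Cosine(kept.Embedding, seg.Embedding) >= SimilarityThreshold); // 3b

            if (match is null)
                selected.Add(seg);                                // step 3d: novel concept
            else if (match.ContentHash != seg.ContentHash)        // step 3c: near-duplicate
                nearDupCounts[match] = nearDupCounts.GetValueOrDefault(match) + 1;
            // else: same ContentHash → exact duplicate, dropped silently, no boost
        }

        foreach (var (kept, count) in nearDupCounts)              // steps 4-5: boost + cap
            kept.Salience = Math.Min(1.0, kept.Salience + BoostPerNearDup * count);

        return selected;
    }

    static double Cosine(float[] a, float[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb) + 1e-12);
    }
}
```

Because candidates are visited in descending salience order, the kept segment is always the most salient phrasing of its concept, which is exactly the order-sensitivity mitigation noted under Failure Modes below.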
Parameters:

| Parameter | Default | Description |
|---|---|---|
| `similarityThreshold` | 0.90 | Cosine similarity above which segments are considered duplicates |
| `salienceThreshold` | 0.05 | Minimum salience to consider (filters noise) |
| `boostPerNearDuplicate` | 0.15 | Salience boost per near-duplicate merged (+15%) |
Exact Duplicate (No Boost)
Segment A: "Contact us at support@example.com" [hash: abc123]
Segment B: "Contact us at support@example.com" [hash: abc123] ← same hash
Result: Keep A, drop B, no boost (likely boilerplate)
Near-Duplicate (Boost Applied)
Segment A: "Machine learning models require training data" [hash: abc123]
Segment B: "ML systems need data for training" [hash: def456] ← different hash, 0.92 similarity
Segment C: "Training data is essential for ML" [hash: ghi789] ← different hash, 0.91 similarity
Result: Keep A with +30% salience boost (concept emphasized 3 ways)
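The examples hinge on `ContentHash` separating exact duplicates from paraphrases. The hashing scheme isn't specified in this document; a sketch assuming the common approach (a hash of whitespace- and case-normalized text) looks like this:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ContentHasher
{
    public static string Hash(string text)
    {
        // Collapse whitespace and case so trivial formatting differences still
        // collide (exact-duplicate path), while any real rewording produces a
        // different hash (near-duplicate path).
        var normalized = string.Join(' ',
            text.Trim().ToLowerInvariant()
                .Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries));

        var bytes = SHA256.HashData(Encoding.UTF8.GetBytes(normalized));
        return Convert.ToHexString(bytes)[..12].ToLowerInvariant(); // short stable id
    }
}
```

Under this assumed scheme, "Contact us at support@example.com" and "contact us at  support@example.com" hash identically, while "ML systems need data for training" does not, which routes it down the near-duplicate path instead.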
Implementation: `DeduplicateByEmbeddingPostRanking()` in `src/LucidRAG.Core/Services/AgenticSearchService.cs`
1. Receive ranked results (already sorted by RRF or dense score)
2. For each segment (in score order):
a. If no embedding → keep
b. Check cosine similarity against all selected segments
c. If similarity >= 0.90 → skip (higher-scored duplicate already selected)
d. If no match → add to selected list
3. Return deduplicated list (maintains score ordering)
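A sketch of this pass (again with stand-in types; the real code is `DeduplicateByEmbeddingPostRanking()`):

```csharp
using System;
using System.Collections.Generic;

public sealed record RankedSegment(string Text, float[]? Embedding, double RrfScore);

public static class RetrievalDedup
{
    const double SimilarityThreshold = 0.90;

    // Input must already be sorted by RRF score, highest first (step 1).
    public static List<RankedSegment> Deduplicate(IReadOnlyList<RankedSegment> ranked)
    {
        var selected = new List<RankedSegment>();
        foreach (var seg in ranked)
        {
            if (seg.Embedding is null) { selected.Add(seg); continue; } // step 2a

            bool isDuplicate = selected.Exists(kept =>                  // step 2b
                kept.Embedding is not null &&
                Cosine(kept.Embedding, seg.Embedding) >= SimilarityThreshold);

            // Step 2c/2d: a duplicate is skipped because a higher-scored
            // version was already selected earlier in the walk.
            if (!isDuplicate) selected.Add(seg);
        }
        return selected; // step 3: never re-sorted, so score ordering is preserved
    }

    static double Cosine(float[] a, float[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb) + 1e-12);
    }
}
```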
Parameters:

| Parameter | Default | Description |
|---|---|---|
| `similarityThreshold` | 0.90 | Cosine similarity threshold for cross-doc dedup |
RRF (Reciprocal Rank Fusion) combines four signals: semantic similarity, keyword match, salience, and freshness.
Deduplicating AFTER RRF means we keep the segment that best matches the query across all dimensions, not just semantic similarity.
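For concreteness, this is standard Reciprocal Rank Fusion; the k = 60 constant is the conventional default from the literature and an assumption here, as lucidRAG's exact fusion weights aren't shown:

```csharp
using System.Collections.Generic;

public static class Rrf
{
    const double K = 60.0; // conventional RRF constant; an assumption here

    // Each dictionary maps segment id → 1-based rank from one signal
    // (semantic similarity, keyword match, salience, freshness).
    public static Dictionary<string, double> Fuse(params Dictionary<string, int>[] rankings)
    {
        var scores = new Dictionary<string, double>();
        foreach (var ranking in rankings)
            foreach (var (id, rank) in ranking)
                scores[id] = scores.GetValueOrDefault(id) + 1.0 / (K + rank);
        return scores;
    }
}

// A segment ranked 1st for semantics, 2nd for keywords, 1st for salience and
// 3rd for freshness scores 1/61 + 1/62 + 1/61 + 1/63 ≈ 0.065, the same
// magnitude as the RRF scores in the worked example below.
```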
Query: "How do I configure authentication?"
Results before dedup:
1. [Doc A] "Authentication is configured via config.yaml..." (RRF: 0.052)
2. [Doc B] "Configure auth using the config.yaml file..." (RRF: 0.048, similarity to #1: 0.93)
3. [Doc A] "Set the API key in environment variables..." (RRF: 0.041)
Results after dedup:
1. [Doc A] "Authentication is configured via config.yaml..." (RRF: 0.052)
2. [Doc A] "Set the API key in environment variables..." (RRF: 0.041)
Doc B's similar paragraph is dropped: Doc A's version had the higher RRF score.
| Failure Mode | Description | Mitigation |
|---|---|---|
| False positive (over-dedup) | Two distinct but closely related concepts may exceed 0.90 similarity | Accepted trade-off favoring reduced redundancy; threshold is tunable |
| False negative (under-dedup) | Very short segments may embed poorly, missing semantic similarity | Hash-based exact duplicate detection catches identical text |
| Embedding drift | Changing embedding models invalidates dedup assumptions | Requires full re-ingestion; embeddings are immutable once stored |
| Order sensitivity | Greedy selection means first high-salience segment wins | Mitigated by stable sorting; highest-salience segment is kept |
Deduplication behavior with multilingual content:
| Scenario | Behavior | Rationale |
|---|---|---|
| Same language | Normal dedup applies | Embeddings capture semantic similarity |
| Cross-language paraphrases | NOT deduplicated | Preserves source-language diversity |
| Mixed-language document | Dedup within language clusters | Embedding similarity naturally separates languages |
Design choice: Cross-lingual dedup is explicitly not supported. This preserves the ability to retrieve the same fact in the user's preferred language or compare how different sources phrase things.
Future enhancement: Cross-lingual dedup could be layered via translation-invariant embeddings if needed.
The deduplication system includes implicit protections:
| Attack Vector | Protection |
|---|---|
| Salience inflation via repetition | Boost capped at 1.0; exact duplicates don't boost |
| Copy-paste spam across documents | Cross-doc dedup at retrieval removes redundant results |
| Score manipulation via duplicate injection | Dedup occurs AFTER ranking, preventing score inflation |
| Boilerplate flooding | Exact hash match detection drops without boost |
Note: Deduplication is not a security boundary. Malicious content that passes ingestion filters will be indexed. Content filtering should happen upstream.
Deduplication and GraphRAG are intentionally orthogonal:
| System | Operates on | Purpose |
|---|---|---|
| Deduplication | Segments (text chunks) | Remove redundant retrieval results |
| GraphRAG | Entities & Relations | Build knowledge graph, resolve references |
Why separate: segments and entities live at different granularities and serve different goals. Segment deduplication keeps retrieval results non-redundant, while GraphRAG resolves references and builds structure; coupling the two would entangle retrieval hygiene with graph construction.
Both phases log a summary of what they removed. Ingestion:

```
[dim]Deduplication: 150 → 98 segments[/]
```

Retrieval:

```
Post-ranking deduplication: 50 → 42 segments (removed 8 cross-doc duplicates)
```
For production monitoring, consider tracking:
| Metric | Description | Healthy Range |
|---|---|---|
| `dedup_ratio_ingestion` | % segments removed at ingestion | 10-40% |
| `dedup_ratio_retrieval` | % segments removed at retrieval | 5-20% |
| `avg_salience_boost` | Average boost applied per document | 0.05-0.20 |
| `max_salience_boost` | Highest boost in a document | < 0.60 (else one concept dominates) |
| `dedup_by_doc_type` | Dedup rate segmented by document type | Varies |
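These ratios fall out of simple before/after counters. A sketch (the `DeduplicationResult<T>` record mentioned under implementation status is the real carrier, but its shape isn't shown here, so this is illustrative):

```csharp
// Before/after counters are enough to derive every ratio in the table.
public readonly record struct DedupStats(int Before, int After)
{
    public int Removed => Before - After;
    public double Ratio => Before == 0 ? 0 : (double)Removed / Before;
}

// Ingestion log line above: 150 → 98 segments → Ratio ≈ 0.35, inside 10-40%.
// Retrieval log line above: 50 → 42 segments → Ratio = 0.16, inside 5-20%.
```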
Deduplication is configured via the `Deduplication` section under `DocSummarizer` in `appsettings.json`:
```json
{
  "DocSummarizer": {
    "Deduplication": {
      "Ingestion": {
        "Enabled": true,
        "SimilarityThreshold": 0.90,
        "SalienceThreshold": 0.05,
        "EnableSalienceBoost": true,
        "BoostPerNearDuplicate": 0.15,
        "MaxSalienceBoost": 1.0,
        "BoostDecayMode": "Logarithmic",
        "LogBase": 2.0
      },
      "Retrieval": {
        "Enabled": true,
        "SimilarityThreshold": 0.90,
        "MinRelevanceScore": 0.25
      },
      "Analytics": {
        "EnableLogging": true,
        "EnableMetrics": true,
        "HighIngestionDedupThreshold": 0.50,
        "HighRetrievalDedupThreshold": 0.30,
        "HighSalienceBoostThreshold": 0.60
      }
    }
  }
}
```
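A plausible shape for the binding POCO (the real `DeduplicationConfig` class may differ in layout; the binding call shown is the standard .NET options pattern, not a quote from the codebase):

```csharp
// Mirrors the JSON keys above; defaults match the documented defaults.
public class DeduplicationConfig
{
    public IngestionDedupConfig Ingestion { get; set; } = new();
    public RetrievalDedupConfig Retrieval { get; set; } = new();
    public DedupAnalyticsConfig Analytics { get; set; } = new();
}

public class IngestionDedupConfig
{
    public bool Enabled { get; set; } = true;
    public double SimilarityThreshold { get; set; } = 0.90;
    public double SalienceThreshold { get; set; } = 0.05;
    public bool EnableSalienceBoost { get; set; } = true;
    public double BoostPerNearDuplicate { get; set; } = 0.15;
    public double MaxSalienceBoost { get; set; } = 1.0;
    public string BoostDecayMode { get; set; } = "Logarithmic";
    public double LogBase { get; set; } = 2.0;
}

public class RetrievalDedupConfig
{
    public bool Enabled { get; set; } = true;
    public double SimilarityThreshold { get; set; } = 0.90;
    public double MinRelevanceScore { get; set; } = 0.25;
}

public class DedupAnalyticsConfig
{
    public bool EnableLogging { get; set; } = true;
    public bool EnableMetrics { get; set; } = true;
    public double HighIngestionDedupThreshold { get; set; } = 0.50;
    public double HighRetrievalDedupThreshold { get; set; } = 0.30;
    public double HighSalienceBoostThreshold { get; set; } = 0.60;
}

// Bound with the standard options pattern, e.g. in Program.cs:
// builder.Services.Configure<DeduplicationConfig>(
//     builder.Configuration.GetSection("DocSummarizer:Deduplication"));
```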
Ingestion parameters:

| Parameter | Default | Description |
|---|---|---|
| `Enabled` | `true` | Enable/disable ingestion deduplication |
| `SimilarityThreshold` | `0.90` | Cosine similarity threshold for duplicate detection |
| `SalienceThreshold` | `0.05` | Minimum salience to consider (filters noise) |
| `EnableSalienceBoost` | `true` | Boost salience for near-duplicates |
| `BoostPerNearDuplicate` | `0.15` | Base boost per near-duplicate (+15%) |
| `MaxSalienceBoost` | `1.0` | Maximum salience cap |
| `BoostDecayMode` | `Logarithmic` | `Linear` or `Logarithmic` decay |
| `LogBase` | `2.0` | Base for logarithmic decay |
Retrieval parameters:

| Parameter | Default | Description |
|---|---|---|
| `Enabled` | `true` | Enable/disable retrieval deduplication |
| `SimilarityThreshold` | `0.90` | Cosine similarity threshold |
| `MinRelevanceScore` | `0.25` | Minimum RRF score to include |
Analytics parameters:

| Parameter | Default | Description |
|---|---|---|
| `EnableLogging` | `true` | Log deduplication operations |
| `EnableMetrics` | `true` | Collect metrics for monitoring |
| `HighIngestionDedupThreshold` | `0.50` | Warn if >50% deduplicated at ingestion |
| `HighRetrievalDedupThreshold` | `0.30` | Warn if >30% deduplicated at retrieval |
| `HighSalienceBoostThreshold` | `0.60` | Warn if boost exceeds 60% |
Linear Mode (simple, predictable):
boost = boostPerNearDuplicate × count
Example: 3 near-dupes × 0.15 = +45% boost
Logarithmic Mode (diminishing returns, default):
boost = boostPerNearDuplicate × log₂(1 + count)
Example: 3 near-dupes → 0.15 × log₂(4) = +30% boost
The logarithmic mode is recommended because it gives diminishing returns: the second and third restatements of a concept add genuine signal, while the tenth adds little, so repetition alone cannot inflate salience without bound.
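Both modes in code (a sketch; `Math.Log(x, base)` computes the log in an arbitrary base):

```csharp
using System;

public static class SalienceBoost
{
    // Mirrors the formulas above: linear grows without bound, logarithmic saturates.
    public static double Compute(int nearDupCount, double perDup = 0.15,
        bool logarithmic = true, double logBase = 2.0) =>
        logarithmic
            ? perDup * Math.Log(1 + nearDupCount, logBase) // diminishing returns
            : perDup * nearDupCount;                       // fixed step per duplicate
}

// SalienceBoost.Compute(3, logarithmic: false) → 0.45 (+45%, linear)
// SalienceBoost.Compute(3)                     → 0.15 × log₂(4) = 0.30 (+30%)
// SalienceBoost.Compute(50)                    → ≈ 0.85: heavy repetition saturates
// Final salience is then Math.Min(1.0, salience + boost), per step 5 of the algorithm.
```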
| Phase | Complexity | Typical Size | Impact |
|---|---|---|---|
| Ingestion | O(n²) | 50-500 segments | < 100ms |
| Retrieval | O(m²) | 20-100 segments | < 10ms |
For very large documents (10,000+ segments), the O(n²) comparison would become expensive, and approximate techniques such as LSH would be the natural next step. The current implementation is optimized for typical document sizes; LSH adds complexity without benefit for most use cases.
| Approach | Threshold | Source |
|---|---|---|
| lucidRAG | 0.90 | This implementation |
| NVIDIA NeMo Curator | 0.90-0.92 | SemDeDup docs |
| MinHash LSH (standard) | 0.80 Jaccard | Google C4, GPT-3 paper |
| SemHash | 0.90-0.95 | GitHub |
Our threshold of 0.90 aligns with industry best practices for semantic deduplication.
Some content is intentionally never deduplicated:

| Content Type | Reason |
|---|---|
| Cross-document at ingestion | Preserves per-source resolution and attribution |
| Low-similarity content (<0.90) | Considered semantically distinct |
| Different segment types | Heading vs paragraph have different structural roles |
| Cross-language paraphrases | Preserves language diversity |
Implementation status:

| Feature | Status | Location |
|---|---|---|
| Configurable thresholds | ✅ Implemented | DeduplicationConfig class |
| Boost decay (log scale) | ✅ Implemented | BoostDecayMode.Logarithmic |
| Dedup analytics/metrics | ✅ Implemented | DeduplicationResult<T> record |
| DI service integration | ✅ Implemented | IDeduplicationService |
This deduplication strategy treats intra-document repetition as evidence of importance, removes cross-document redundancy only after ranking has had its say, and stays deterministic and configurable end to end.