﻿# Deduplication of Graphitic RAG Evidence Segments in ***lucid*RAG**

<datetime class="hidden">2026-01-17T14:00</datetime>
<!-- category -- C#, lucidRAG, Vector Search, Machine Learning, Knowledge Graphs -->

> **NOTE:** This is not a conventional blog article. It is a **design spec** written for a concrete feature in ***lucid*RAG**.
> I iteratively feed this document to code-focused LLMs during development to reason about trade-offs, validate assumptions, and converge on a rational implementation.

This document describes a subsystem from ***lucid*RAG**, a project I’m actively developing.

One core requirement of ***lucid*RAG** is the ability to extract **segments of evidence** - sentences, paragraphs, headings, captions, frames, or structured blocks — and ensure those segments are **deduplicated** without destroying useful signal.

***lucid*RAG** works by analysing and extracting the *best* available evidence from documents, images, audio, and structured data. Unlike most RAG implementations, it does **not** store LLM-generated summaries as the primary artefact. In many cases, ingestion requires no LLM at all (though one can be used when escalation is justified).

Instead, ***lucid*RAG** applies a wide range of deterministic and probabilistic techniques to:

* extract candidate segments
* evaluate their informational value
* and retain only the strongest representatives

**Deduplication is a critical part of this process.**

Simple string equality is not enough. The same concept is frequently expressed using different wording, structure, or modality. Treating those as distinct leads to redundant storage and poor downstream behaviour.

**The problem compounds at retrieval time.**

When results are retrieved (via SQL, vector embeddings, BM25, or hybrids), feeding an LLM multiple segments that all express the same underlying idea produces dull, repetitive answers. Five near-identical chunks from different documents do not add clarity — they dilute it.

To address this, ***lucid*RAG** treats deduplication as a **first-class compilation problem**, not a post-hoc filter.

The remainder of this document describes how that deduplication strategy was designed.


[See more about ***lucid*RAG** here. ](https://www.lucidrag.com)

# Deduplication Strategy

***lucid*RAG** uses a two-phase deduplication strategy to eliminate redundant content while preserving important signals. This is a **signal-preserving filter**, not content normalization.

## Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                           INGESTION (Per Document)                          │
│                                                                             │
│  Document → Extract → Embed → DEDUPE (intra-doc) → Index to Vector Store   │
│                                   │                                         │
│                                   ├─ Near-duplicates: boost salience        │
│                                   └─ Exact duplicates: drop (no boost)      │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                          RETRIEVAL (Cross Document)                         │
│                                                                             │
│  Query → Search → Rank (RRF) → DEDUPE (cross-doc) → Top K → LLM Synthesis  │
│                                   │                                         │
│                                   └─ Keep segment with highest RRF score    │
└─────────────────────────────────────────────────────────────────────────────┘
```

---

## Design Guarantees

These invariants are maintained by the deduplication system:

| Guarantee | Description |
|-----------|-------------|
| **Ordering preserved** | Deduplication never changes semantic ordering after RRF ranking |
| **No concept loss** | Deduplication never removes all instances of a concept |
| **Document boundary respected** | Deduplication never crosses document boundaries at ingestion |
| **Content immutable** | Deduplication never alters embeddings or text content, only selection and salience scores |
| **Deterministic** | Given identical inputs and config, deduplication produces identical outputs |

---

## Non-Goals

This system explicitly does **not** attempt to:

- **Detect factual contradiction** — Two segments saying opposite things are not deduplicated
- **Canonicalize truth** — We don't pick a "correct" version across sources
- **Collapse paraphrases across documents at ingestion** — Each document keeps its own segments
- **Normalize terminology** — "ML" and "machine learning" in different docs are preserved separately
- **Replace entity resolution** — That's GraphRAG's job, operating at a different level

---

## Determinism & Reproducibility

Deduplication is fully deterministic:

- **Embeddings are immutable** once computed at extraction time
- **Sorting is stable** — segments with equal salience maintain original order
- **No randomness** — no sampling, no approximate ANN, no probabilistic thresholds
- **No external state** — dedup decisions depend only on the current segment set

**Why this matters:** Users debugging retrieval results can trust that re-running with the same inputs produces the same outputs. This aligns with ***lucid*RAG**'s broader "constrained fuzziness" philosophy — fuzzy matching with deterministic behavior.

---

## Why Two Phases?

### Phase 1: Ingestion Deduplication

**Goal:** Reduce storage and prevent intra-document redundancy.

**Problem:** Documents often contain repeated content:
- Boilerplate text (headers, footers, disclaimers)
- Copy-pasted sections
- The same concept explained multiple ways

**Solution:** Deduplicate within each document before indexing.

**Key Insight:** Near-duplicates are treated as *independent evidence of importance*, not redundancy. If an author explains a concept three different ways, that concept matters. We capture this signal by boosting salience.

### Phase 2: Retrieval Deduplication

**Goal:** Prevent the LLM from receiving redundant information across documents.

**Problem:** When querying across multiple documents, similar paragraphs may appear in different sources. The LLM shouldn't describe the same information multiple times.

**Solution:** After ranking by RRF (which combines semantic similarity, keyword match, salience, and freshness), deduplicate across documents keeping the highest-scoring segment.

**Why after RRF?** The RRF score represents the best holistic measure of relevance. Deduplicating before ranking would lose this signal.

### Why Salience Boost Only at Ingestion

The separation is intentional:

| Phase | What it captures |
|-------|------------------|
| **Ingestion boost** | Author emphasis — how much the document stresses a concept |
| **Retrieval score** | Query relevance — how well content matches user intent |

Mixing these at retrieval would entangle document intent with user intent. A concept repeated 5 times in a document is important *to that document*, but may not be relevant *to this query*. By boosting at ingestion, we preserve the author's signal without biasing query results.

---

## Phase 1: Ingestion Deduplication

### Location
`src/Mostlylucid.DocSummarizer.Core/Services/BertRagSummarizer.cs`

### Method
`DeduplicateSegments()`

### Algorithm

```
1. Filter segments by minimum salience threshold (0.05)
2. Sort by salience score (highest first)
3. For each segment:
   a. If no embedding → keep (can't compare)
   b. Check cosine similarity against all selected segments
   c. If similarity >= 0.90:
      - If same ContentHash → exact duplicate, drop silently
      - If different ContentHash → near-duplicate, boost kept segment's salience
   d. If no match → add to selected list
4. Apply salience boosts: +15% per near-duplicate merged
5. Cap salience at 1.0 to prevent any single concept from dominating
```

### Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `similarityThreshold` | 0.90 | Cosine similarity above which segments are considered duplicates |
| `salienceThreshold` | 0.05 | Minimum salience to consider (filters noise) |
| `boostPerNearDuplicate` | 0.15 | Salience boost per near-duplicate merged (+15%) |

### Examples

**Exact Duplicate (No Boost)**
```
Segment A: "Contact us at support@example.com" [hash: abc123]
Segment B: "Contact us at support@example.com" [hash: abc123]  ← same hash
Result: Keep A, drop B, no boost (likely boilerplate)
```

**Near-Duplicate (Boost Applied)**
```
Segment A: "Machine learning models require training data" [hash: abc123]
Segment B: "ML systems need data for training" [hash: def456]  ← different hash, 0.92 similarity
Segment C: "Training data is essential for ML" [hash: ghi789]  ← different hash, 0.91 similarity
Result: Keep A with +30% salience boost (concept emphasized 3 ways)
```

### Rationale for Thresholds

- **0.90 similarity:** Based on research (NVIDIA NeMo uses 0.90-0.92, industry standard for semantic dedup)
- **0.05 salience:** Filters very low-value segments while keeping most content
- **15% boost:** Meaningful signal without over-weighting repeated concepts

---

## Phase 2: Retrieval Deduplication

### Location
`src/LucidRAG.Core/Services/AgenticSearchService.cs`

### Method
`DeduplicateByEmbeddingPostRanking()`

### Algorithm

```
1. Receive ranked results (already sorted by RRF or dense score)
2. For each segment (in score order):
   a. If no embedding → keep
   b. Check cosine similarity against all selected segments
   c. If similarity >= 0.90 → skip (higher-scored duplicate already selected)
   d. If no match → add to selected list
3. Return deduplicated list (maintains score ordering)
```

### Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `similarityThreshold` | 0.90 | Cosine similarity threshold for cross-doc dedup |

### Why Post-RRF?

RRF (Reciprocal Rank Fusion) combines four signals:
1. **Dense score:** Semantic similarity to query
2. **BM25 score:** Lexical/keyword match
3. **Salience score:** Importance within document
4. **Freshness score:** Recency boost

Deduplicating AFTER RRF means we keep the segment that best matches the query across all dimensions, not just semantic similarity.

### Example

```
Query: "How do I configure authentication?"

Results before dedup:
1. [Doc A] "Authentication is configured via config.yaml..." (RRF: 0.052)
2. [Doc B] "Configure auth using the config.yaml file..." (RRF: 0.048, similarity to #1: 0.93)
3. [Doc A] "Set the API key in environment variables..." (RRF: 0.041)

Results after dedup:
1. [Doc A] "Authentication is configured via config.yaml..." (RRF: 0.052)
2. [Doc A] "Set the API key in environment variables..." (RRF: 0.041)

Doc B's similar paragraph dropped - Doc A's version had higher RRF score.
```

---

## Failure Modes & Trade-offs

### Known Limitations

| Failure Mode | Description | Mitigation |
|--------------|-------------|------------|
| **False positive (over-dedup)** | Two distinct but closely related concepts may exceed 0.90 similarity | Accepted trade-off favoring reduced redundancy; threshold is tunable |
| **False negative (under-dedup)** | Very short segments may embed poorly, missing semantic similarity | Hash-based exact duplicate detection catches identical text |
| **Embedding drift** | Changing embedding models invalidates dedup assumptions | Requires full re-ingestion; embeddings are immutable once stored |
| **Order sensitivity** | Greedy selection means first high-salience segment wins | Mitigated by stable sorting; highest-salience segment is kept |

### Accepted Trade-offs

- **Precision over recall:** We prefer occasionally keeping near-duplicates over accidentally removing distinct content
- **Storage over accuracy:** We store per-document rather than global dedup to preserve source attribution
- **Simplicity over optimization:** O(n²) is acceptable for typical document sizes; LSH adds complexity

---

## Multilingual Considerations

Deduplication behavior with multilingual content:

| Scenario | Behavior | Rationale |
|----------|----------|-----------|
| **Same language** | Normal dedup applies | Embeddings capture semantic similarity |
| **Cross-language paraphrases** | NOT deduplicated | Preserves source-language diversity |
| **Mixed-language document** | Dedup within language clusters | Embedding similarity naturally separates languages |

**Design choice:** Cross-lingual dedup is explicitly not supported. This preserves the ability to retrieve the same fact in the user's preferred language or compare how different sources phrase things.

Future enhancement: Cross-lingual dedup could be layered via translation-invariant embeddings if needed.

---

## Security & Adversarial Considerations

The deduplication system includes implicit protections:

| Attack Vector | Protection |
|---------------|------------|
| **Salience inflation via repetition** | Boost capped at 1.0; exact duplicates don't boost |
| **Copy-paste spam across documents** | Cross-doc dedup at retrieval removes redundant results |
| **Score manipulation via duplicate injection** | Dedup occurs AFTER ranking, preventing score inflation |
| **Boilerplate flooding** | Exact hash match detection drops without boost |

**Note:** Deduplication is not a security boundary. Malicious content that passes ingestion filters will be indexed. Content filtering should happen upstream.

---

## Interaction with GraphRAG

Deduplication and GraphRAG are intentionally orthogonal:

| System | Operates on | Purpose |
|--------|-------------|---------|
| **Deduplication** | Segments (text chunks) | Remove redundant retrieval results |
| **GraphRAG** | Entities & Relations | Build knowledge graph, resolve references |

**Why separate:**
- A segment mentioning "Apple" and another mentioning "the company" may dedupe as similar text but represent the same entity — that's GraphRAG's job to resolve
- Dedup doesn't need entity awareness; it operates purely on semantic similarity
- Entity-aware dedup could be added later as an enhancement, not a replacement

---

## Observability & Metrics

### Current Logging

Ingestion:
```
[dim]Deduplication: 150 → 98 segments[/]
```

Retrieval:
```
Post-ranking deduplication: 50 → 42 segments (removed 8 cross-doc duplicates)
```

### Recommended Metrics

For production monitoring, consider tracking:

| Metric | Description | Healthy Range |
|--------|-------------|---------------|
| `dedup_ratio_ingestion` | % segments removed at ingestion | 10-40% |
| `dedup_ratio_retrieval` | % segments removed at retrieval | 5-20% |
| `avg_salience_boost` | Average boost applied per document | 0.05-0.20 |
| `max_salience_boost` | Highest boost in a document | < 0.60 (else one concept dominates) |
| `dedup_by_doc_type` | Dedup rate segmented by document type | Varies |

### Debugging Tips

- **High ingestion dedup (>50%):** Document may have excessive boilerplate or be auto-generated
- **Low ingestion dedup (<5%):** Document has diverse content (good) or embeddings are poor (investigate)
- **High retrieval dedup (>30%):** Query may be too broad, or corpus has many similar documents
- **Salience approaching 1.0:** Concept was heavily emphasized; verify it's legitimate, not spam

---

## Configuration

Deduplication is configured via `DocSummarizerConfig.Deduplication` section in `appsettings.json`:

```json
{
  "DocSummarizer": {
    "Deduplication": {
      "Ingestion": {
        "Enabled": true,
        "SimilarityThreshold": 0.90,
        "SalienceThreshold": 0.05,
        "EnableSalienceBoost": true,
        "BoostPerNearDuplicate": 0.15,
        "MaxSalienceBoost": 1.0,
        "BoostDecayMode": "Logarithmic",
        "LogBase": 2.0
      },
      "Retrieval": {
        "Enabled": true,
        "SimilarityThreshold": 0.90,
        "MinRelevanceScore": 0.25
      },
      "Analytics": {
        "EnableLogging": true,
        "EnableMetrics": true,
        "HighIngestionDedupThreshold": 0.50,
        "HighRetrievalDedupThreshold": 0.30,
        "HighSalienceBoostThreshold": 0.60
      }
    }
  }
}
```

### Ingestion Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| `Enabled` | `true` | Enable/disable ingestion deduplication |
| `SimilarityThreshold` | `0.90` | Cosine similarity threshold for duplicate detection |
| `SalienceThreshold` | `0.05` | Minimum salience to consider (filters noise) |
| `EnableSalienceBoost` | `true` | Boost salience for near-duplicates |
| `BoostPerNearDuplicate` | `0.15` | Base boost per near-duplicate (+15%) |
| `MaxSalienceBoost` | `1.0` | Maximum salience cap |
| `BoostDecayMode` | `Logarithmic` | `Linear` or `Logarithmic` decay |
| `LogBase` | `2.0` | Base for logarithmic decay |

### Retrieval Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| `Enabled` | `true` | Enable/disable retrieval deduplication |
| `SimilarityThreshold` | `0.90` | Cosine similarity threshold |
| `MinRelevanceScore` | `0.25` | Minimum RRF score to include |

### Analytics Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| `EnableLogging` | `true` | Log deduplication operations |
| `EnableMetrics` | `true` | Collect metrics for monitoring |
| `HighIngestionDedupThreshold` | `0.50` | Warn if >50% deduplicated at ingestion |
| `HighRetrievalDedupThreshold` | `0.30` | Warn if >30% deduplicated at retrieval |
| `HighSalienceBoostThreshold` | `0.60` | Warn if boost exceeds 60% |

### Boost Decay Modes

**Linear Mode** (simple, predictable):
```
boost = boostPerNearDuplicate × count
```
Example: 3 near-dupes × 0.15 = +45% boost

**Logarithmic Mode** (diminishing returns, default):
```
boost = boostPerNearDuplicate × log₂(1 + count)
```
Example: 3 near-dupes → 0.15 × log₂(4) = +30% boost

The logarithmic mode is recommended because:
- First few duplicates have the strongest signal (author emphasis)
- Many duplicates may indicate boilerplate, not importance
- Prevents runaway salience inflation

---

## Performance Considerations

### Complexity

| Phase | Complexity | Typical Size | Impact |
|-------|------------|--------------|--------|
| Ingestion | O(n²) | 50-500 segments | < 100ms |
| Retrieval | O(m²) | 20-100 segments | < 10ms |

### Scaling Path

For very large documents (10,000+ segments):

1. **LSH (Locality-Sensitive Hashing):** O(n) approximate dedup
2. **Batch comparison:** Process in chunks to reduce memory
3. **Early filtering:** More aggressive salience threshold

Current implementation is optimized for typical document sizes. LSH adds complexity without benefit for most use cases.

---

## Comparison with Research

| Approach | Threshold | Source |
|----------|-----------|--------|
| lucidRAG | 0.90 | This implementation |
| NVIDIA NeMo Curator | 0.90-0.92 | [SemDeDup docs](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/semdedup.html) |
| MinHash LSH (standard) | 0.80 Jaccard | [Google C4, GPT-3 paper](https://huggingface.co/blog/dedup) |
| SemHash | 0.90-0.95 | [GitHub](https://github.com/MinishLab/semhash) |

Our threshold of 0.90 aligns with industry best practices for semantic deduplication.

---

## What Is NOT Deduplicated

| Content Type | Reason |
|--------------|--------|
| Cross-document at ingestion | Preserves per-source resolution and attribution |
| Low-similarity content (<0.90) | Considered semantically distinct |
| Different segment types | Heading vs paragraph have different structural roles |
| Cross-language paraphrases | Preserves language diversity |

---

## Implementation Status

| Feature | Status | Location |
|---------|--------|----------|
| Configurable thresholds | ✅ Implemented | `DeduplicationConfig` class |
| Boost decay (log scale) | ✅ Implemented | `BoostDecayMode.Logarithmic` |
| Dedup analytics/metrics | ✅ Implemented | `DeduplicationResult<T>` record |
| DI service integration | ✅ Implemented | `IDeduplicationService` |

## Future Enhancements

1. **Dedup analytics dashboard:** Visual tracking of rates per document type
2. **Cross-lingual option:** Translation-invariant embeddings for multilingual dedup
3. **Entity-informed dedup:** Use GraphRAG entities as additional signal (not replacement)
4. **Prometheus/OpenTelemetry:** Export metrics for monitoring dashboards

---

## Summary

This deduplication strategy:

- **Preserves signal** — Near-duplicates boost importance rather than being discarded
- **Respects boundaries** — Documents maintain independent segment sets
- **Ranks then filters** — Uses full RRF signal before cross-doc dedup
- **Fails safely** — Prefers keeping content over aggressive removal
- **Stays deterministic** — Same inputs always produce same outputs
