Deduplication of RAG Evidence Segments in lucidRAG

Saturday, 17 January 2026

NOTE: This is not a conventional blog article. It is a design spec written for a concrete feature in lucidRAG. I iteratively feed this document to code-focused LLMs during development to reason about trade-offs, validate assumptions, and converge on a rational implementation.

This document describes a subsystem from lucidRAG, a project I’m actively developing.

One core requirement of lucidRAG is the ability to extract segments of evidence (sentences, paragraphs, headings, captions, frames, or structured blocks) and to deduplicate those segments without destroying useful signal.

lucidRAG works by analysing and extracting the best available evidence from documents, images, audio, and structured data. Unlike most RAG implementations, it does not store LLM-generated summaries as the primary artefact. In many cases, ingestion requires no LLM at all (though one can be used when escalation is justified).

Instead, lucidRAG applies a wide range of deterministic and probabilistic techniques to:

  • extract candidate segments
  • evaluate their informational value
  • and retain only the strongest representatives

Deduplication is a critical part of this process.

Simple string equality is not enough. The same concept is frequently expressed using different wording, structure, or modality. Treating those as distinct leads to redundant storage and poor downstream behaviour.

The problem compounds at retrieval time.

When results are retrieved (via SQL, vector embeddings, BM25, or hybrids), feeding an LLM multiple segments that all express the same underlying idea produces dull, repetitive answers. Five near-identical chunks from different documents do not add clarity — they dilute it.

To address this, lucidRAG treats deduplication as a first-class compilation problem, not a post-hoc filter.

The remainder of this document describes how that deduplication strategy was designed.

See more about lucidRAG here.

Deduplication Strategy

lucidRAG uses a two-phase deduplication strategy to eliminate redundant content while preserving important signals. This is a signal-preserving filter, not content normalization.

Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                           INGESTION (Per Document)                          │
│                                                                             │
│  Document → Extract → Embed → DEDUPE (intra-doc) → Index to Vector Store   │
│                                   │                                         │
│                                   ├─ Near-duplicates: boost salience        │
│                                   └─ Exact duplicates: drop (no boost)      │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                          RETRIEVAL (Cross Document)                         │
│                                                                             │
│  Query → Search → Rank (RRF) → DEDUPE (cross-doc) → Top K → LLM Synthesis  │
│                                   │                                         │
│                                   └─ Keep segment with highest RRF score    │
└─────────────────────────────────────────────────────────────────────────────┘

Design Guarantees

These invariants are maintained by the deduplication system:

Guarantee                    Description
---------                    -----------
Ordering preserved           Deduplication never changes semantic ordering after RRF ranking
No concept loss              Deduplication never removes all instances of a concept
Document boundary respected  Deduplication never crosses document boundaries at ingestion
Content immutable            Deduplication never alters embeddings or text content, only selection and salience scores
Deterministic                Given identical inputs and config, deduplication produces identical outputs

Non-Goals

This system explicitly does not attempt to:

  • Detect factual contradiction — Two segments saying opposite things are not deduplicated
  • Canonicalize truth — We don't pick a "correct" version across sources
  • Collapse paraphrases across documents at ingestion — Each document keeps its own segments
  • Normalize terminology — "ML" and "machine learning" in different docs are preserved separately
  • Replace entity resolution — That's GraphRAG's job, operating at a different level

Determinism & Reproducibility

Deduplication is fully deterministic:

  • Embeddings are immutable once computed at extraction time
  • Sorting is stable — segments with equal salience maintain original order
  • No randomness — no sampling, no approximate ANN, no probabilistic thresholds
  • No external state — dedup decisions depend only on the current segment set

Why this matters: Users debugging retrieval results can trust that re-running with the same inputs produces the same outputs. This aligns with lucidRAG's broader "constrained fuzziness" philosophy — fuzzy matching with deterministic behavior.
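
As a small illustration of the stable-sort point: LINQ's OrderByDescending is documented as a stable sort, which is what keeps tie-breaking deterministic. A minimal C# sketch (the Segment shape here is illustrative, not the actual lucidRAG type):

using System.Collections.Generic;
using System.Linq;

// Illustrative shape only; the real lucidRAG segment type differs.
record Segment(string Text, double Salience, int SourceOrder);

static class DeterministicOrdering
{
    // OrderByDescending is a stable sort: segments with equal salience
    // keep their relative input (document) order, so repeated runs over
    // the same input always produce the same sequence.
    public static List<Segment> BySalience(IEnumerable<Segment> segments) =>
        segments.OrderByDescending(s => s.Salience).ToList();
}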


Why Two Phases?

Phase 1: Ingestion Deduplication

Goal: Reduce storage and prevent intra-document redundancy.

Problem: Documents often contain repeated content:

  • Boilerplate text (headers, footers, disclaimers)
  • Copy-pasted sections
  • The same concept explained multiple ways

Solution: Deduplicate within each document before indexing.

Key Insight: Near-duplicates are treated as independent evidence of importance, not redundancy. If an author explains a concept three different ways, that concept matters. We capture this signal by boosting salience.

Phase 2: Retrieval Deduplication

Goal: Prevent the LLM from receiving redundant information across documents.

Problem: When querying across multiple documents, similar paragraphs may appear in different sources. The LLM shouldn't describe the same information multiple times.

Solution: After ranking by RRF (which combines semantic similarity, keyword match, salience, and freshness), deduplicate across documents, keeping the highest-scoring segment.

Why after RRF? The RRF score represents the best holistic measure of relevance. Deduplicating before ranking would lose this signal.

Why Salience Boost Only at Ingestion

The separation is intentional:

Phase            What it captures
-----            ----------------
Ingestion boost  Author emphasis: how much the document stresses a concept
Retrieval score  Query relevance: how well content matches user intent

Mixing these at retrieval would entangle document intent with user intent. A concept repeated 5 times in a document is important to that document, but may not be relevant to this query. By boosting at ingestion, we preserve the author's signal without biasing query results.


Phase 1: Ingestion Deduplication

Location

src/Mostlylucid.DocSummarizer.Core/Services/BertRagSummarizer.cs

Method

DeduplicateSegments()

Algorithm

1. Filter segments by minimum salience threshold (0.05)
2. Sort by salience score (highest first)
3. For each segment:
   a. If no embedding → keep (can't compare)
   b. Check cosine similarity against all selected segments
   c. If similarity >= 0.90:
      - If same ContentHash → exact duplicate, drop silently
      - If different ContentHash → near-duplicate, boost kept segment's salience
   d. If no match → add to selected list
4. Apply salience boosts: +15% per near-duplicate merged in linear mode (the default Logarithmic mode applies diminishing returns; see Boost Decay Modes below)
5. Cap salience at 1.0 to prevent any single concept from dominating
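
A condensed C# sketch of this loop, simplified from the real DeduplicateSegments() (the Segment shape and field names are illustrative, and linear boosting is shown for brevity):

using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative segment shape; the real lucidRAG type differs.
record Segment(string Text, string ContentHash, float[]? Embedding)
{
    public double Salience { get; set; }
    public int NearDuplicates { get; set; }   // merges credited to this segment
}

static class IngestionDedup
{
    public static List<Segment> Deduplicate(
        IEnumerable<Segment> segments,
        double similarityThreshold = 0.90,
        double salienceThreshold = 0.05,
        double boostPerNearDuplicate = 0.15)
    {
        var selected = new List<Segment>();

        // Steps 1-2: filter noise, then stable-sort by salience (highest first).
        foreach (var seg in segments
                     .Where(s => s.Salience >= salienceThreshold)
                     .OrderByDescending(s => s.Salience))
        {
            // Step 3a: no embedding means nothing to compare against; keep it.
            if (seg.Embedding is null) { selected.Add(seg); continue; }

            var match = selected.FirstOrDefault(s => s.Embedding is not null &&
                CosineSimilarity(s.Embedding, seg.Embedding) >= similarityThreshold);

            if (match is null) { selected.Add(seg); continue; }  // Step 3d: distinct, keep.

            // Step 3c: same hash = exact duplicate, dropped silently (no boost);
            // different hash = near-duplicate, credit the kept segment.
            if (match.ContentHash != seg.ContentHash) match.NearDuplicates++;
        }

        // Steps 4-5: linear boost (+15% per merge), capped at 1.0.
        foreach (var s in selected)
            s.Salience = Math.Min(1.0, s.Salience + boostPerNearDuplicate * s.NearDuplicates);

        return selected;
    }

    static double CosineSimilarity(float[] a, float[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
    }
}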

Parameters

Parameter              Default  Description
---------              -------  -----------
similarityThreshold    0.90     Cosine similarity above which segments are considered duplicates
salienceThreshold      0.05     Minimum salience to consider (filters noise)
boostPerNearDuplicate  0.15     Salience boost per near-duplicate merged (+15%)

Examples

Exact Duplicate (No Boost)

Segment A: "Contact us at support@example.com" [hash: abc123]
Segment B: "Contact us at support@example.com" [hash: abc123]  ← same hash
Result: Keep A, drop B, no boost (likely boilerplate)

Near-Duplicate (Boost Applied)

Segment A: "Machine learning models require training data" [hash: abc123]
Segment B: "ML systems need data for training" [hash: def456]  ← different hash, 0.92 similarity
Segment C: "Training data is essential for ML" [hash: ghi789]  ← different hash, 0.91 similarity
Result: Keep A with a +30% salience boost in linear mode (two near-duplicates merged; the concept was expressed 3 ways)

Rationale for Thresholds

  • 0.90 similarity: Based on research (NVIDIA NeMo uses 0.90-0.92, industry standard for semantic dedup)
  • 0.05 salience: Filters very low-value segments while keeping most content
  • 15% boost: Meaningful signal without over-weighting repeated concepts

Phase 2: Retrieval Deduplication

Location

src/LucidRAG.Core/Services/AgenticSearchService.cs

Method

DeduplicateByEmbeddingPostRanking()

Algorithm

1. Receive ranked results (already sorted by RRF or dense score)
2. For each segment (in score order):
   a. If no embedding → keep
   b. Check cosine similarity against all selected segments
   c. If similarity >= 0.90 → skip (higher-scored duplicate already selected)
   d. If no match → add to selected list
3. Return deduplicated list (maintains score ordering)
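
A matching sketch of the post-ranking pass (simplified; the real DeduplicateByEmbeddingPostRanking() lives in AgenticSearchService, and the RankedResult shape is illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

// Assumes results arrive already sorted by RRF score, best first.
record RankedResult(string DocId, string Text, float[]? Embedding, double RrfScore);

static class RetrievalDedup
{
    public static List<RankedResult> DeduplicatePostRanking(
        IReadOnlyList<RankedResult> ranked, double similarityThreshold = 0.90)
    {
        var selected = new List<RankedResult>();
        foreach (var r in ranked)   // descending RRF order
        {
            // Results without embeddings are kept: nothing to compare against.
            bool duplicate = r.Embedding is not null && selected.Any(s =>
                s.Embedding is not null &&
                CosineSimilarity(s.Embedding, r.Embedding) >= similarityThreshold);

            // Any match was selected earlier, i.e. at a higher RRF score,
            // so the current result is the one to skip.
            if (!duplicate) selected.Add(r);
        }
        return selected;   // never re-sorted: score ordering is preserved
    }

    static double CosineSimilarity(float[] a, float[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
    }
}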

Parameters

Parameter            Default  Description
---------            -------  -----------
similarityThreshold  0.90     Cosine similarity threshold for cross-doc dedup

Why Post-RRF?

RRF (Reciprocal Rank Fusion) combines four signals:

  1. Dense score: Semantic similarity to query
  2. BM25 score: Lexical/keyword match
  3. Salience score: Importance within document
  4. Freshness score: Recency boost

Deduplicating AFTER RRF means we keep the segment that best matches the query across all dimensions, not just semantic similarity.
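
For reference, a sketch of the standard reciprocal rank fusion formula, using the conventional k = 60 damping constant from the original RRF paper (how lucidRAG weights its four signal lists is an assumption here; a plain unweighted sum is shown):

// Standard reciprocal rank fusion: each of the four ranked lists
// (dense, BM25, salience, freshness) contributes 1 / (k + rank).
static double RrfScore(IEnumerable<int> ranksAcrossSignals, int k = 60) =>
    ranksAcrossSignals.Sum(rank => 1.0 / (k + rank));

// Example: a segment ranked 1st, 3rd, 10th and 25th by the four signals
// scores 1/61 + 1/63 + 1/70 + 1/85 ≈ 0.058, the same order of magnitude
// as the RRF values in the example below.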

Example

Query: "How do I configure authentication?"

Results before dedup:
1. [Doc A] "Authentication is configured via config.yaml..." (RRF: 0.052)
2. [Doc B] "Configure auth using the config.yaml file..." (RRF: 0.048, similarity to #1: 0.93)
3. [Doc A] "Set the API key in environment variables..." (RRF: 0.041)

Results after dedup:
1. [Doc A] "Authentication is configured via config.yaml..." (RRF: 0.052)
2. [Doc A] "Set the API key in environment variables..." (RRF: 0.041)

Doc B's similar paragraph is dropped: Doc A's version had the higher RRF score.

Failure Modes & Trade-offs

Known Limitations

  • False positive (over-dedup): two distinct but closely related concepts may exceed 0.90 similarity. Mitigation: accepted trade-off favoring reduced redundancy; the threshold is tunable.
  • False negative (under-dedup): very short segments may embed poorly, missing semantic similarity. Mitigation: hash-based exact duplicate detection catches identical text.
  • Embedding drift: changing embedding models invalidates dedup assumptions. Mitigation: requires full re-ingestion; embeddings are immutable once stored.
  • Order sensitivity: greedy selection means the first high-salience segment wins. Mitigation: stable sorting ensures the highest-salience segment is kept.

Accepted Trade-offs

  • Precision over recall: We prefer occasionally keeping near-duplicates over accidentally removing distinct content
  • Storage over accuracy: We store per-document rather than global dedup to preserve source attribution
  • Simplicity over optimization: O(n²) is acceptable for typical document sizes; LSH adds complexity

Multilingual Considerations

Deduplication behavior with multilingual content:

Scenario                    Behavior                        Rationale
--------                    --------                        ---------
Same language               Normal dedup applies            Embeddings capture semantic similarity
Cross-language paraphrases  NOT deduplicated                Preserves source-language diversity
Mixed-language document     Dedup within language clusters  Embedding similarity naturally separates languages

Design choice: Cross-lingual dedup is explicitly not supported. This preserves the ability to retrieve the same fact in the user's preferred language or compare how different sources phrase things.

Future enhancement: Cross-lingual dedup could be layered via translation-invariant embeddings if needed.


Security & Adversarial Considerations

The deduplication system includes implicit protections:

Attack Vector                               Protection
-------------                               ----------
Salience inflation via repetition           Boost capped at 1.0; exact duplicates don't boost
Copy-paste spam across documents            Cross-doc dedup at retrieval removes redundant results
Score manipulation via duplicate injection  Dedup occurs AFTER ranking, preventing score inflation
Boilerplate flooding                        Exact hash match detection drops without boost

Note: Deduplication is not a security boundary. Malicious content that passes ingestion filters will be indexed. Content filtering should happen upstream.


Interaction with GraphRAG

Deduplication and GraphRAG are intentionally orthogonal:

System         Operates on             Purpose
------         -----------             -------
Deduplication  Segments (text chunks)  Remove redundant retrieval results
GraphRAG       Entities & Relations    Build knowledge graph, resolve references

Why separate:

  • A segment mentioning "Apple" and another mentioning "the company" may refer to the same entity without being textually similar; resolving that reference is GraphRAG's job
  • Dedup doesn't need entity awareness; it operates purely on semantic similarity
  • Entity-aware dedup could be added later as an enhancement, not a replacement

Observability & Metrics

Current Logging

Ingestion:

[dim]Deduplication: 150 → 98 segments[/]

Retrieval:

Post-ranking deduplication: 50 → 42 segments (removed 8 cross-doc duplicates)

For production monitoring, consider tracking:

Metric                 Description                            Healthy Range
------                 -----------                            -------------
dedup_ratio_ingestion  % segments removed at ingestion        10-40%
dedup_ratio_retrieval  % segments removed at retrieval        5-20%
avg_salience_boost     Average boost applied per document     0.05-0.20
max_salience_boost     Highest boost in a document            < 0.60 (else one concept dominates)
dedup_by_doc_type      Dedup rate segmented by document type  Varies
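
These ratios fall straight out of the before/after counts the logs already emit; a minimal sketch (the metric names above are suggestions, not an existing lucidRAG API):

// (before - after) / before, e.g. the "150 → 98 segments" ingestion log
// line gives (150 - 98) / 150 ≈ 0.35, inside the 10-40% healthy range;
// the retrieval line gives (50 - 42) / 50 = 0.16, inside 5-20%.
static double DedupRatio(int before, int after) =>
    before == 0 ? 0.0 : (double)(before - after) / before;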

Debugging Tips

  • High ingestion dedup (>50%): Document may have excessive boilerplate or be auto-generated
  • Low ingestion dedup (<5%): Document has diverse content (good) or embeddings are poor (investigate)
  • High retrieval dedup (>30%): Query may be too broad, or corpus has many similar documents
  • Salience approaching 1.0: Concept was heavily emphasized; verify it's legitimate, not spam

Configuration

Deduplication is configured via the DocSummarizerConfig.Deduplication section in appsettings.json:

{
  "DocSummarizer": {
    "Deduplication": {
      "Ingestion": {
        "Enabled": true,
        "SimilarityThreshold": 0.90,
        "SalienceThreshold": 0.05,
        "EnableSalienceBoost": true,
        "BoostPerNearDuplicate": 0.15,
        "MaxSalienceBoost": 1.0,
        "BoostDecayMode": "Logarithmic",
        "LogBase": 2.0
      },
      "Retrieval": {
        "Enabled": true,
        "SimilarityThreshold": 0.90,
        "MinRelevanceScore": 0.25
      },
      "Analytics": {
        "EnableLogging": true,
        "EnableMetrics": true,
        "HighIngestionDedupThreshold": 0.50,
        "HighRetrievalDedupThreshold": 0.30,
        "HighSalienceBoostThreshold": 0.60
      }
    }
  }
}
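
A plausible shape for the options classes that JSON binds to, as a sketch; the real DeduplicationConfig in lucidRAG may differ in detail:

// Hypothetical options classes mirroring the JSON above. With the standard
// .NET options pattern they would bind via something like:
//   services.Configure<DeduplicationConfig>(
//       configuration.GetSection("DocSummarizer:Deduplication"));
public class DeduplicationConfig
{
    public IngestionDedupOptions Ingestion { get; set; } = new();
    public RetrievalDedupOptions Retrieval { get; set; } = new();
    public DedupAnalyticsOptions Analytics { get; set; } = new();
}

public class IngestionDedupOptions
{
    public bool Enabled { get; set; } = true;
    public double SimilarityThreshold { get; set; } = 0.90;
    public double SalienceThreshold { get; set; } = 0.05;
    public bool EnableSalienceBoost { get; set; } = true;
    public double BoostPerNearDuplicate { get; set; } = 0.15;
    public double MaxSalienceBoost { get; set; } = 1.0;
    public string BoostDecayMode { get; set; } = "Logarithmic";
    public double LogBase { get; set; } = 2.0;
}

public class RetrievalDedupOptions
{
    public bool Enabled { get; set; } = true;
    public double SimilarityThreshold { get; set; } = 0.90;
    public double MinRelevanceScore { get; set; } = 0.25;
}

public class DedupAnalyticsOptions
{
    public bool EnableLogging { get; set; } = true;
    public bool EnableMetrics { get; set; } = true;
    public double HighIngestionDedupThreshold { get; set; } = 0.50;
    public double HighRetrievalDedupThreshold { get; set; } = 0.30;
    public double HighSalienceBoostThreshold { get; set; } = 0.60;
}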

Ingestion Configuration

Parameter              Default      Description
---------              -------      -----------
Enabled                true         Enable/disable ingestion deduplication
SimilarityThreshold    0.90         Cosine similarity threshold for duplicate detection
SalienceThreshold      0.05         Minimum salience to consider (filters noise)
EnableSalienceBoost    true         Boost salience for near-duplicates
BoostPerNearDuplicate  0.15         Base boost per near-duplicate (+15%)
MaxSalienceBoost       1.0          Maximum salience cap
BoostDecayMode         Logarithmic  Linear or Logarithmic decay
LogBase                2.0          Base for logarithmic decay

Retrieval Configuration

Parameter            Default  Description
---------            -------  -----------
Enabled              true     Enable/disable retrieval deduplication
SimilarityThreshold  0.90     Cosine similarity threshold
MinRelevanceScore    0.25     Minimum RRF score to include

Analytics Configuration

Parameter                    Default  Description
---------                    -------  -----------
EnableLogging                true     Log deduplication operations
EnableMetrics                true     Collect metrics for monitoring
HighIngestionDedupThreshold  0.50     Warn if >50% deduplicated at ingestion
HighRetrievalDedupThreshold  0.30     Warn if >30% deduplicated at retrieval
HighSalienceBoostThreshold   0.60     Warn if boost exceeds 60%

Boost Decay Modes

Linear Mode (simple, predictable):

boost = boostPerNearDuplicate × count

Example: 3 near-dupes × 0.15 = +45% boost

Logarithmic Mode (diminishing returns, default):

boost = boostPerNearDuplicate × log₂(1 + count)

Example: 3 near-dupes → 0.15 × log₂(4) = +30% boost

The logarithmic mode is recommended because:

  • First few duplicates have the strongest signal (author emphasis)
  • Many duplicates may indicate boilerplate, not importance
  • Prevents runaway salience inflation
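
Both modes fit in one small helper (a sketch; the real BoostDecayMode handling may differ):

using System;

static class BoostDecay
{
    // Total salience boost for `count` merged near-duplicates.
    //   Linear:      0.15 × count            → 3 merges = +0.45
    //   Logarithmic: 0.15 × log₂(1 + count)  → 3 merges = 0.15 × log₂(4) = +0.30
    public static double ComputeBoost(int count, double perDuplicate = 0.15,
                                      bool logarithmic = true, double logBase = 2.0) =>
        logarithmic
            ? perDuplicate * Math.Log(1 + count, logBase)
            : perDuplicate * count;
}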

Performance Considerations

Complexity

Phase      Complexity  Typical Size     Impact
-----      ----------  ------------     ------
Ingestion  O(n²)       50-500 segments  < 100ms
Retrieval  O(m²)       20-100 segments  < 10ms
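
As a rough sanity check on those numbers: at the top of the typical range, 500 segments imply about 500 × 499 / 2 ≈ 125,000 pairwise comparisons. Assuming embeddings of a few hundred dimensions, each cosine similarity costs on the order of a thousand floating-point operations, so the whole pass is roughly 10⁸ operations, comfortably within the < 100ms budget on a single modern core.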

Scaling Path

For very large documents (10,000+ segments):

  1. LSH (Locality-Sensitive Hashing): O(n) approximate dedup
  2. Batch comparison: Process in chunks to reduce memory
  3. Early filtering: More aggressive salience threshold

Current implementation is optimized for typical document sizes. LSH adds complexity without benefit for most use cases.


Comparison with Research

Approach             Threshold     Source
--------             ---------     ------
lucidRAG             0.90          This implementation
NVIDIA NeMo Curator  0.90-0.92     SemDeDup docs
MinHash LSH          0.80 Jaccard  Google C4, GPT-3 paper
SemHash              0.90-0.95     GitHub

Our threshold of 0.90 aligns with industry best practices for semantic deduplication.


What Is NOT Deduplicated

Content Type                    Reason
------------                    ------
Cross-document at ingestion     Preserves per-source resolution and attribution
Low-similarity content (<0.90)  Considered semantically distinct
Different segment types         Heading vs paragraph have different structural roles
Cross-language paraphrases      Preserves language diversity

Implementation Status

Feature                  Status         Location
-------                  ------         --------
Configurable thresholds  ✅ Implemented  DeduplicationConfig class
Boost decay (log scale)  ✅ Implemented  BoostDecayMode.Logarithmic
Dedup analytics/metrics  ✅ Implemented  DeduplicationResult<T> record
DI service integration   ✅ Implemented  IDeduplicationService

Future Enhancements

  1. Dedup analytics dashboard: Visual tracking of rates per document type
  2. Cross-lingual option: Translation-invariant embeddings for multilingual dedup
  3. Entity-informed dedup: Use GraphRAG entities as additional signal (not replacement)
  4. Prometheus/OpenTelemetry: Export metrics for monitoring dashboards

Summary

This deduplication strategy:

  • Preserves signal — Near-duplicates boost importance rather than being discarded
  • Respects boundaries — Documents maintain independent segment sets
  • Ranks then filters — Uses full RRF signal before cross-doc dedup
  • Fails safely — Prefers keeping content over aggressive removal
  • Stays deterministic — Same inputs always produce same outputs