GraphRAG Part 2: Minimum Viable GraphRAG (No Per-Chunk LLM Calls)

Saturday, 27 December 2025

11 minute read

In Part 1, we explored why GraphRAG matters. Now let's build a minimum viable GraphRAG that works without per-chunk LLM calls - pragmatic, offline-first, and cheap enough to run on a laptop:

  • DuckDB for unified storage (vectors + graph in a single file)
  • IDF + structural signals for entity extraction (no LLM per chunk)
  • BM25 + BERT hybrid search via RRF fusion
  • Ollama for synthesis and optional batch classification (zero API costs)

Code: Mostlylucid.GraphRag on GitHub

Architecture Overview

flowchart LR
    subgraph Indexing
        MD[Markdown Files] --> CH[Chunker]
        CH --> EMB[BERT Embeddings]
        CH --> EXT[Entity Extractor]
        EXT --> |heuristics + links| ENT[Entities]
        ENT --> REL[Relationships]
        REL --> COM[Communities]
    end
    
    subgraph Storage
        EMB --> DB[(DuckDB)]
        ENT --> DB
        REL --> DB
        COM --> DB
    end
    
    subgraph Query
        Q[Query] --> CLASS{Classify}
        CLASS --> |local| HS[Hybrid Search]
        CLASS --> |global| CS[Community Search]
        CLASS --> |drift| BOTH[Both + Synthesis]
        HS --> LLM[Ollama]
        CS --> LLM
        BOTH --> LLM
    end
    
    DB --> HS
    DB --> CS
    
    style MD stroke:#22c55e,stroke-width:2px
    style DB stroke:#3b82f6,stroke-width:2px
    style LLM stroke:#a855f7,stroke-width:2px

Why DuckDB?

Microsoft's GraphRAG uses separate storage for vectors (LanceDB), entities (Parquet), and relationships (more Parquet). DuckDB simplifies this:

  • Single .duckdb file for everything
  • Native HNSW vector search via VSS extension
  • SQL for both vector search and graph traversal
  • Zero deployment complexity

DuckDB is not a graph database - and that's the point. This isn't Neo4j. Traversals are shallow, SQL-based, and deliberate. That constraint keeps the system debuggable and cheap.
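For example, a one-hop traversal is just a pair of joins. A minimal sketch in the style of the GraphRagDb.cs snippets below (the source_id/target_id column names are assumptions - the schema diagram doesn't list the relationship columns):

// Sketch: entities directly connected to a named entity (one hop)
cmd.CommandText = """
    SELECT e2.name, r.type
    FROM entities e1
    JOIN relationships r ON r.source_id = e1.id
    JOIN entities e2 ON e2.id = r.target_id
    WHERE e1.name = $1
    """;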

Storage Schema

The schema uses join tables for provenance - we can query "which chunks mention entity X?" directly:

erDiagram
    documents ||--o{ chunks : contains
    chunks ||--o{ entity_mentions : has
    chunks ||--o{ relationship_mentions : has
    entities ||--o{ entity_mentions : mentioned_in
    entities ||--o{ relationships : source
    entities ||--o{ relationships : target
    relationships ||--o{ relationship_mentions : mentioned_in
    communities ||--o{ community_members : contains
    entities ||--o{ community_members : belongs_to
    
    chunks {
        varchar id PK
        varchar document_id FK
        text text
        float[] embedding
    }
    
    entities {
        varchar id PK
        varchar name
        varchar type
        int mention_count
    }
    
    entity_mentions {
        varchar entity_id FK
        varchar chunk_id FK
    }

Key design decision: no VARCHAR[] for provenance. Join tables (entity_mentions, relationship_mentions) enable efficient queries like "get all chunks mentioning Docker".
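That query is a single join through the provenance table. A minimal sketch (command setup omitted):

// Sketch: all chunks that mention a given entity, via entity_mentions
cmd.CommandText = """
    SELECT c.id, c.text
    FROM chunks c
    JOIN entity_mentions em ON em.chunk_id = c.id
    JOIN entities e ON e.id = em.entity_id
    WHERE e.name = $1
    """;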

Vector Search: The HNSW Gotcha

DuckDB's HNSW index only triggers with array_cosine_distance + ORDER BY + LIMIT:

// GraphRagDb.cs - SearchChunksAsync
cmd.CommandText = $"""
    SELECT id, document_id, text, chunk_index, 
           array_cosine_distance(embedding, $1::FLOAT[{_dim}]) as distance
    FROM chunks 
    WHERE embedding IS NOT NULL
    ORDER BY distance
    LIMIT $2
    """;
// Convert distance to similarity: 1.0f - distance

Using array_cosine_similarity will not trigger the HNSW index. On non-trivial corpora, this turns a ~5ms indexed query into a full table scan.
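The index itself comes from DuckDB's VSS extension. A minimal setup sketch (the index name is arbitrary; HNSW requires a fixed-size FLOAT[n] column, and per the DuckDB docs the experimental-persistence flag is needed for file-backed databases):

// Sketch: one-time VSS setup for a file-backed .duckdb database
cmd.CommandText = """
    INSTALL vss;
    LOAD vss;
    SET hnsw_enable_experimental_persistence = true; -- required for .duckdb files
    CREATE INDEX chunks_hnsw ON chunks USING HNSW (embedding)
    WITH (metric = 'cosine');
    """;
await cmd.ExecuteNonQueryAsync(ct);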

Entity Extraction

This is where we diverge from Microsoft's approach. Instead of the LLM-per-chunk extraction passes used in Microsoft's reference GraphRAG pipeline, we use IDF-based statistical extraction. The goal isn't perfect entities - it's stable, corpus-relative signals that don't require an LLM to produce. This trades some recall for determinism, auditability, and predictable cost — a deliberate choice for technical corpora:

flowchart TB
    subgraph "Phase 1: Signal Collection"
        TEXT[All Chunks] --> IDF[Compute IDF Scores]
        TEXT --> STRUCT[Structural Signals]
        STRUCT --> HEAD[Headings]
        STRUCT --> CODE[Inline Code]
        STRUCT --> LINKS[Links]
        IDF --> RARE[High-IDF = Rare Terms]
        RARE --> CAND[Candidates]
        HEAD --> CAND
        CODE --> CAND
        LINKS --> |explicit rels| LINKREL[Link Relationships]
    end
    
    subgraph "Phase 2: Dedup"
        CAND --> EMBED[BERT Embeddings]
        EMBED --> SIM[Similarity > 0.85]
        SIM --> MERGE[Merge Duplicates]
    end
    
    subgraph "Phase 3: Classify"
        MERGE --> LLM{LLM Available?}
        LLM --> |yes| BATCH[Single Batch Call]
        LLM --> |no| HEUR[Heuristic Types]
    end
    
    style IDF stroke:#f59e0b,stroke-width:2px
    style BATCH stroke:#a855f7,stroke-width:2px

Why IDF, not Hardcoded Lists?

The naive approach is a hardcoded HashSet<string> KnownTech = { "Docker", "Kubernetes", ... }. This breaks for:

  • New technologies (you'd need to update the list)
  • Domain-specific terms (different corpus = different entities)
  • Misspellings and variations

IDF (Inverse Document Frequency) solves this statistically. A term's IDF is:

\(\text{IDF}(t) = \log\frac{N}{df(t)}\)

Where:

  • \(N\) = total chunks
  • \(df(t)\) = documents containing term \(t\)

High IDF = rare term = likely an entity. "Docker" appearing in 5 of 100 chunks has higher IDF than "the" appearing in 100 of 100.
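A minimal sketch of the computation (Tokenize is a hypothetical helper and chunks a List<string> of chunk texts - not the post's EntityExtractor API):

// Sketch: corpus-relative IDF over chunk texts
var df = new Dictionary<string, int>();
foreach (var chunk in chunks)
{
    // Count each term once per chunk (document frequency, not term frequency)
    foreach (var term in Tokenize(chunk).ToHashSet())
        df[term] = df.GetValueOrDefault(term) + 1;
}

double Idf(string term) => Math.Log((double)chunks.Count / df[term]);
// Idf("docker"): df = 5,   N = 100 → log(20) ≈ 3.0 (strong entity signal)
// Idf("the"):    df = 100, N = 100 → log(1)  = 0   (no signal)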

For more on TF-IDF and BM25, see my post on hybrid search with BM25.

Structural Signals

Markdown structure tells us what's important:

  • Headings (## Docker Setup) → entity
  • Inline code (`docker-compose`) → entity
  • Links ([Docker](https://docker.com)) → entity + relationship

// EntityExtractor.cs - structural signal extraction
private void ExtractStructuralEntities(string chunk, string chunkId)
{
    // Headings: ## Docker Compose Setup → "Docker Compose Setup"
    foreach (Match m in Regex.Matches(chunk, @"^#{1,3}\s+(.+)$", RegexOptions.Multiline))
    {
        var heading = m.Groups[1].Value.Trim();
        AddCandidate(heading, chunkId, weight: 2.0); // Higher weight
    }
    
    // Inline code: `docker-compose` → "docker-compose"
    foreach (Match m in Regex.Matches(chunk, @"`([^`]+)`"))
    {
        AddCandidate(m.Groups[1].Value, chunkId, weight: 1.5);
    }
}

Markdown links provide explicit relationships that don't require LLM inference:

// EntityExtractor.cs - ExtractLinks  
foreach (Match m in Regex.Matches(chunk, @"\[([^\]]+)\]\((/blog/[^)]+)\)"))
{
    var linkText = m.Groups[1].Value;  // "semantic search"
    var slug = m.Groups[2].Value;       // "/blog/semantic-search-with-qdrant"
    yield return new Relationship(linkText, $"blog:{slug}", "references", chunkId);
}

Deduplication via BERT Embeddings

Entity names like "Docker Compose", "docker-compose", and "DockerCompose" should be merged. We use BERT embeddings to detect semantic similarity:

// EntityExtractor.cs - DeduplicateAsync
var embeddings = await _embedder.EmbedBatchAsync(candidates.Select(c => c.Name), ct);

for (int i = 0; i < candidates.Count; i++)
{
    for (int j = i + 1; j < candidates.Count; j++)
    {
        var similarity = CosineSimilarity(embeddings[i], embeddings[j]);
        if (similarity > 0.85)
        {
            // Merge into the canonical entity (the one with the higher mention count)
            var (canonical, duplicate) = candidates[i].MentionCount >= candidates[j].MentionCount
                ? (candidates[i], candidates[j])
                : (candidates[j], candidates[i]);
            canonical.MentionCount += duplicate.MentionCount;
            canonical.ChunkIds.UnionWith(duplicate.ChunkIds);
        }
    }
}

This step is O(n²) in the number of candidates, but the candidate count is capped by IDF filtering and structural signals - not corpus size. For details on BERT embeddings, see semantic search with ONNX and BERT.
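The CosineSimilarity helper used above is the standard dot-product form - a minimal sketch:

// Sketch: cosine similarity between two embedding vectors
static float CosineSimilarity(float[] a, float[] b)
{
    float dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot   += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}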

Hybrid Search: BM25 + BERT

Hybrid search combines two complementary approaches:

  • Dense (BERT): understands meaning. "Docker containers" matches "containerization".
  • Sparse (BM25): matches exact terms. "HNSW" only matches "HNSW".

flowchart LR
    Q[Query] --> BERT[BERT Embedding]
    Q --> BM25[BM25 Tokenize]
    
    BERT --> DENSE[Dense Search<br/>HNSW Index]
    BM25 --> SPARSE[Sparse Search<br/>TF-IDF Scoring]
    
    DENSE --> RRF[RRF Fusion]
    SPARSE --> RRF
    
    RRF --> TOP[Top K Results]
    TOP --> ENR[Enrich with<br/>Entities + Rels]
    
    style RRF stroke:#f59e0b,stroke-width:2px

What is BM25?

BM25 (Best Match 25) scores documents based on query term frequency. The formula:

\(\text{score}(D,Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{avgdl})}\)

Key intuitions:

  • IDF term: Rare words matter more ("HNSW" > "the")
  • TF saturation: Word appearing 10x isn't 10x more relevant than 1x
  • Length normalization: Long documents don't get unfair advantage
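
The per-term contribution maps directly to code. A minimal sketch with the usual defaults (k₁ = 1.2, b = 0.75); see the linked post below for a full implementation:

// Sketch: BM25 contribution of one query term to one document
static double Bm25Term(double idf, int termFreq, int docLen, double avgDocLen,
                       double k1 = 1.2, double b = 0.75)
{
    var lengthNorm = 1 - b + b * (docLen / avgDocLen);
    return idf * (termFreq * (k1 + 1)) / (termFreq + k1 * lengthNorm);
}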

For a full BM25 implementation, see hybrid search and indexing.

Reciprocal Rank Fusion (RRF)

RRF merges rankings from different retrieval systems. Each rank position gets a score:

\(\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + r(d)}\)

Where \(k\) (typically 60) prevents overweighting the top result. Documents appearing in both rankings get boosted:

// SearchService.cs - RRF fusion
const int k = 60;

foreach (var (chunk, rank) in denseResults.Select((c, i) => (c, i)))
    scores[chunk.Id] = 1.0 / (k + rank + 1);

foreach (var (chunk, rank) in sparseResults.Select((c, i) => (c, i)))
{
    var rrfScore = 1.0 / (k + rank + 1);
    if (scores.TryGetValue(chunk.Id, out var existing))
        scores[chunk.Id] = existing + rrfScore;  // Boost for appearing in both!
    else
        scores[chunk.Id] = rrfScore;
}

Example: A document ranked #1 in dense and #3 in sparse:

  • Dense: \(1/(60+1) = 0.0164\)
  • Sparse: \(1/(60+3) = 0.0159\)
  • Combined: 0.0323 (higher than either alone)

Query Modes

flowchart TB
    Q[Query] --> CLASS[Classify Query]
    
    CLASS --> |"How do I use X?"| LOCAL[Local Search]
    CLASS --> |"What are the themes?"| GLOBAL[Global Search]  
    CLASS --> |"How does X relate to Y?"| DRIFT[DRIFT Search]
    
    LOCAL --> HS[Hybrid Search] --> CTX1[Chunk + Entity Context]
    GLOBAL --> CS[Community Summaries] --> MAP[Map-Reduce]
    DRIFT --> BOTH[Local + Communities] --> SYN[Synthesize]
    
    CTX1 --> LLM[LLM Answer]
    MAP --> LLM
    SYN --> LLM
    
    style LOCAL stroke:#22c55e,stroke-width:2px
    style GLOBAL stroke:#3b82f6,stroke-width:2px
    style DRIFT stroke:#a855f7,stroke-width:2px

Query Classification

// QueryEngine.cs
private static QueryMode ClassifyQuery(string query)
{
    var q = query.ToLowerInvariant();
    if (q.Contains("main theme") || q.Contains("summarize") || q.Contains("overview"))
        return QueryMode.Global;
    if (q.Contains("relate") || q.Contains("connect") || q.Contains("compare"))
        return QueryMode.Drift;
    return QueryMode.Local;
}

This classifier is deliberately simple - and easy to replace with a small intent model later. If no entities match, the system degrades cleanly to pure hybrid retrieval.
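The no-entity fallback is just a guard. A hypothetical sketch (method names are assumptions, not the post's QueryEngine API):

// Hypothetical sketch: degrade to pure hybrid retrieval when no entities match
var matched = await _db.FindEntitiesAsync(query, ct);     // assumed lookup
var chunks = await _search.HybridSearchAsync(query, topK: 5, ct);
var context = matched.Count > 0
    ? EnrichWithEntities(chunks, matched)                 // assumed enrichment
    : BuildContext(chunks);                               // plain hybrid results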

CLI Usage

Indexing

dotnet run --project Mostlylucid.GraphRag -- index ./test-markdown
GraphRAG Indexer
  Source: test-markdown
  Database: graphrag.duckdb
  Model: llama3.2:3b

Initializing...
Indexing docker-development-deep-dive.md: 0%
Indexing docker-swarm-cluster-guide.md: 40%
Indexing dockercomposedevdeps.md: 80%
Indexing complete: 100%
Classifying entities...: 0%
Extracted 168 entities, 315 relationships: 100%
Found 10 communities: 100%
Summarizing c_0_2 (12 entities): 20%
Summarizing c_0_8 (4 entities): 80%

────────────────── Indexing Complete ───────────────────
┌───────────────┬───────┐
│ Metric        │ Count │
├───────────────┼───────┤
│ Documents     │ 5     │
│ Chunks        │ 62    │
│ Entities      │ 168   │
│ Relationships │ 312   │
│ Communities   │ 10    │
└───────────────┴───────┘

Querying

dotnet run --project Mostlylucid.GraphRag -- query "How do I use Docker Compose?"
──────────────────── Local Search ────────────────────

Query: How do I use Docker Compose?

╭─Answer────────────────────────────────────────────────╮
│ To run the services defined in the                    │
│ devdeps-docker-compose.yml file, you need to run the  │
│ following command in the same directory as the file:  │
│                                                       │
│ docker compose -f .\devdeps-docker-compose.yml up -d  │
│                                                       │
│ This command will start the containers in detached    │
│ mode.                                                 │
╰───────────────────────────────────────────────────────╯

Related Entities: Docker, container, services, image

Sources: 5 chunks (top score: 0.016)

Stats

dotnet run --project Mostlylucid.GraphRag -- stats
─────────────── GraphRAG Database Stats ────────────────
┌───────────────┬───────┐
│ Metric        │ Count │
├───────────────┼───────┤
│ Documents     │     5 │
│ Chunks        │    62 │
│ Entities      │   168 │
│ Relationships │   312 │
│ Communities   │    10 │
└───────────────┴───────┘

Database size: 7.76 MB

Cost Comparison

For 100 blog posts (~500 chunks):

| Operation           | Microsoft GraphRAG        | This Implementation  |
|---------------------|---------------------------|----------------------|
| Entity extraction   | 2 × 500 = 1,000 LLM calls | 0-1 (optional batch) |
| Relationships       | Included above            | Co-occurrence (free) |
| Community summaries | ~20 calls                 | Same                 |
| Total indexing      | ~1,020 calls              | ~20 calls            |
| Cost (gpt-4o-mini)  | ~$5-10                    | ~$0.10               |
| Cost (Ollama)       | N/A                       | $0                   |

One to two orders of magnitude cheaper for structured technical content (markdown with headings, code blocks, and links). These are rough order-of-magnitude figures; exact cost depends on chunk size and prompt shape.

Tradeoffs

| This Implementation          | LLM-Heavy GraphRAG       |
|------------------------------|--------------------------|
| Fast, cheap indexing         | Higher entity quality    |
| Good for tech docs           | Better for general text  |
| Co-occurrence relationships  | Semantic relationships   |
| Works offline                | Requires API access      |

Conceptually, this is the same pipeline as DocSummarizer: build structure first, then let an LLM narrate it.

Where this breaks down: Fiction or narrative text without structural markup. Implicit relationships with no lexical signal. Highly ambiguous entity names that require world knowledge to disambiguate.

Code

The implementation is minimal - 11 files, ~1,500 lines:

Mostlylucid.GraphRag/
├── Storage/GraphRagDb.cs        # DuckDB with HNSW + provenance
├── Services/EmbeddingService.cs # ONNX BERT wrapper
├── Services/OllamaClient.cs     # LLM client
├── Extraction/EntityExtractor.cs # Heuristic + link extraction
├── Search/SearchService.cs      # BM25 + BERT hybrid
├── Graph/CommunityDetector.cs   # Leiden + summarization
├── Query/QueryEngine.cs         # Local/Global/DRIFT
├── Indexing/MarkdownIndexer.cs  # Chunking
├── GraphRagPipeline.cs          # Orchestration
├── Models.cs                    # Shared types
└── Program.cs                   # CLI

Source: Mostlylucid.GraphRag/
