In Part 1, we explored why GraphRAG matters. Now let's build a minimum viable GraphRAG that works without per-chunk LLM calls - pragmatic, offline-first, and cheap enough to run on a laptop.
Series Navigation:
Code: Mostlylucid.GraphRag on GitHub
flowchart LR
subgraph Indexing
MD[Markdown Files] --> CH[Chunker]
CH --> EMB[BERT Embeddings]
CH --> EXT[Entity Extractor]
EXT --> |heuristics + links| ENT[Entities]
ENT --> REL[Relationships]
REL --> COM[Communities]
end
subgraph Storage
EMB --> DB[(DuckDB)]
ENT --> DB
REL --> DB
COM --> DB
end
subgraph Query
Q[Query] --> CLASS{Classify}
CLASS --> |local| HS[Hybrid Search]
CLASS --> |global| CS[Community Search]
CLASS --> |drift| BOTH[Both + Synthesis]
HS --> LLM[Ollama]
CS --> LLM
BOTH --> LLM
end
DB --> HS
DB --> CS
style MD stroke:#22c55e,stroke-width:2px
style DB stroke:#3b82f6,stroke-width:2px
style LLM stroke:#a855f7,stroke-width:2px
Microsoft's GraphRAG uses separate storage for vectors (LanceDB), entities (Parquet), and relationships (more Parquet). DuckDB simplifies this:
One `.duckdb` file for everything.

DuckDB is not a graph database - and that's the point. This isn't Neo4j. Traversals are shallow, SQL-based, and deliberate. That constraint keeps the system debuggable and cheap.
The schema uses join tables for provenance - we can query "which chunks mention entity X?" directly:
erDiagram
documents ||--o{ chunks : contains
chunks ||--o{ entity_mentions : has
chunks ||--o{ relationship_mentions : has
entities ||--o{ entity_mentions : mentioned_in
entities ||--o{ relationships : source
entities ||--o{ relationships : target
relationships ||--o{ relationship_mentions : mentioned_in
communities ||--o{ community_members : contains
entities ||--o{ community_members : belongs_to
chunks {
varchar id PK
varchar document_id FK
text text
float[] embedding
}
entities {
varchar id PK
varchar name
varchar type
int mention_count
}
entity_mentions {
varchar entity_id FK
varchar chunk_id FK
}
Key design decision: no VARCHAR[] for provenance. Join tables (entity_mentions, relationship_mentions) enable efficient queries like "get all chunks mentioning Docker".
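As a sketch of what that buys us, the "chunks mentioning Docker" query is a plain two-table join over the schema above (the entity name is hardcoded here for illustration; the actual method lives in GraphRagDb.cs):

```csharp
// "Which chunks mention Docker?" - join tables make this a straight SQL join,
// no array unnesting or graph traversal required.
cmd.CommandText = """
    SELECT c.id, c.text
    FROM entities e
    JOIN entity_mentions em ON em.entity_id = e.id
    JOIN chunks c           ON c.id = em.chunk_id
    WHERE e.name = 'Docker'
    """;
```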
DuckDB's HNSW index only triggers with array_cosine_distance + ORDER BY + LIMIT:
// GraphRagDb.cs - SearchChunksAsync
cmd.CommandText = $"""
SELECT id, document_id, text, chunk_index,
array_cosine_distance(embedding, $1::FLOAT[{_dim}]) as distance
FROM chunks
WHERE embedding IS NOT NULL
ORDER BY distance
LIMIT $2
""";
// Convert distance to similarity: 1.0f - distance
Using array_cosine_similarity instead will not trigger the HNSW index. On a non-trivial corpus, that turns a ~5ms indexed query into a full table scan.
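Building the index is a one-time cost. A minimal setup sketch, assuming DuckDB's vss extension and a fixed-dimension FLOAT array column (the actual initialization lives in GraphRagDb.cs):

```csharp
// One-time setup: load the vss extension and build the HNSW index.
// Persisting HNSW indexes in file-backed databases is still gated behind a flag.
cmd.CommandText = """
    INSTALL vss;
    LOAD vss;
    SET hnsw_enable_experimental_persistence = true;
    CREATE INDEX IF NOT EXISTS idx_chunks_embedding
        ON chunks USING HNSW (embedding)
        WITH (metric = 'cosine');
    """;
await cmd.ExecuteNonQueryAsync(ct);
```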
This is where we diverge from Microsoft's approach. Instead of the LLM-per-chunk extraction passes used in Microsoft's reference GraphRAG pipeline, we use IDF-based statistical extraction. The goal isn't perfect entities - it's stable, corpus-relative signals that don't require an LLM to produce. This trades some recall for determinism, auditability, and predictable cost — a deliberate choice for technical corpora:
flowchart TB
subgraph "Phase 1: Signal Collection"
TEXT[All Chunks] --> IDF[Compute IDF Scores]
TEXT --> STRUCT[Structural Signals]
STRUCT --> HEAD[Headings]
STRUCT --> CODE[Inline Code]
STRUCT --> LINKS[Links]
IDF --> RARE[High-IDF = Rare Terms]
RARE --> CAND[Candidates]
HEAD --> CAND
CODE --> CAND
LINKS --> |explicit rels| LINKREL[Link Relationships]
end
subgraph "Phase 2: Dedup"
CAND --> EMBED[BERT Embeddings]
EMBED --> SIM[Similarity > 0.85]
SIM --> MERGE[Merge Duplicates]
end
subgraph "Phase 3: Classify"
MERGE --> LLM{LLM Available?}
LLM --> |yes| BATCH[Single Batch Call]
LLM --> |no| HEUR[Heuristic Types]
end
style IDF stroke:#f59e0b,stroke-width:2px
style BATCH stroke:#a855f7,stroke-width:2px
The naive approach is a hardcoded `HashSet<string> KnownTech = { "Docker", "Kubernetes", ... }`. This breaks for anything the list didn't anticipate: new tools, niche libraries, and project-specific names.
IDF (Inverse Document Frequency) solves this statistically. A term's IDF is:
\(\text{IDF}(t) = \log\frac{N}{df(t)}\)
Where \(N\) is the total number of chunks and \(df(t)\) is the number of chunks containing term \(t\).
High IDF = rare term = likely an entity. "Docker" appearing in 5 of 100 chunks has higher IDF than "the" appearing in 100 of 100.
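Concretely: \(\text{IDF}(\text{Docker}) = \ln(100/5) \approx 3.0\), while \(\text{IDF}(\text{the}) = \ln(100/100) = 0\). A minimal sketch of the computation (the function name is hypothetical; the real scoring lives in EntityExtractor.cs):

```csharp
// Hypothetical sketch: corpus-relative IDF over raw chunk texts.
static double Idf(string term, IReadOnlyList<string> chunks)
{
    // df(t): number of chunks containing the term at least once
    var df = chunks.Count(c => c.Contains(term, StringComparison.OrdinalIgnoreCase));
    return df == 0 ? 0.0 : Math.Log((double)chunks.Count / df);
}
```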
For more on TF-IDF and BM25, see my post on hybrid search with BM25.
Markdown structure tells us what's important:
- Headings (`## Docker Setup`) → entity
- Inline code (`` `docker-compose` ``) → entity
- Links (`[Docker](https://docker.com)`) → entity + relationship

// EntityExtractor.cs - structural signal extraction
private void ExtractStructuralEntities(string chunk, string chunkId)
{
// Headings: ## Docker Compose Setup → "Docker Compose Setup"
foreach (Match m in Regex.Matches(chunk, @"^#{1,3}\s+(.+)$", RegexOptions.Multiline))
{
var heading = m.Groups[1].Value.Trim();
AddCandidate(heading, chunkId, weight: 2.0); // Higher weight
}
// Inline code: `docker-compose` → "docker-compose"
foreach (Match m in Regex.Matches(chunk, @"`([^`]+)`"))
{
AddCandidate(m.Groups[1].Value, chunkId, weight: 1.5);
}
}
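
// AddCandidate isn't shown in the repo excerpt above - a plausible (hypothetical)
// shape accumulates weight and provenance per normalized name:
private readonly Dictionary<string, Candidate> _candidates = new();

private void AddCandidate(string name, string chunkId, double weight = 1.0)
{
    var key = name.Trim().ToLowerInvariant();     // normalize for lookup
    if (!_candidates.TryGetValue(key, out var cand))
        _candidates[key] = cand = new Candidate(name.Trim());
    cand.Weight += weight;        // headings and code spans boost the score
    cand.ChunkIds.Add(chunkId);   // provenance for the entity_mentions join table
}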
Markdown links provide explicit relationships that don't require LLM inference:
// EntityExtractor.cs - ExtractLinks
foreach (Match m in Regex.Matches(chunk, @"\[([^\]]+)\]\((/blog/[^)]+)\)"))
{
var linkText = m.Groups[1].Value; // "semantic search"
var slug = m.Groups[2].Value; // "/blog/semantic-search-with-qdrant"
yield return new Relationship(linkText, $"blog:{slug}", "references", chunkId);
}
Entity names like "Docker Compose", "docker-compose", and "DockerCompose" should be merged. We use BERT embeddings to detect semantic similarity:
// EntityExtractor.cs - DeduplicateAsync
var embeddings = await _embedder.EmbedBatchAsync(candidates.Select(c => c.Name), ct);
for (int i = 0; i < candidates.Count; i++)
{
for (int j = i + 1; j < candidates.Count; j++)
{
var similarity = CosineSimilarity(embeddings[i], embeddings[j]);
        if (similarity > 0.85)
        {
            // Merge into the canonical entity: keep the one with the higher mention count
            var (canonical, duplicate) = candidates[i].MentionCount >= candidates[j].MentionCount
                ? (candidates[i], candidates[j])
                : (candidates[j], candidates[i]);
            canonical.MentionCount += duplicate.MentionCount;
            canonical.ChunkIds.UnionWith(duplicate.ChunkIds);
        }
}
}
This step is O(n²), but the candidate count is bounded by IDF filtering and structural signals - not corpus size. For details on BERT embeddings, see semantic search with ONNX and BERT.
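The CosineSimilarity helper in the loop above is the standard dot-product-over-norms computation; a minimal sketch:

```csharp
// Standard cosine similarity; the epsilon guards against zero-length vectors.
private static float CosineSimilarity(float[] a, float[] b)
{
    float dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(na) * MathF.Sqrt(nb) + 1e-9f);
}
```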
Hybrid search combines two complementary approaches:
- **Dense (BERT):** understands meaning. "Docker containers" matches "containerization".
- **Sparse (BM25):** matches exact terms. "HNSW" only matches "HNSW".
flowchart LR
Q[Query] --> BERT[BERT Embedding]
Q --> BM25[BM25 Tokenize]
BERT --> DENSE[Dense Search<br/>HNSW Index]
BM25 --> SPARSE[Sparse Search<br/>TF-IDF Scoring]
DENSE --> RRF[RRF Fusion]
SPARSE --> RRF
RRF --> TOP[Top K Results]
TOP --> ENR[Enrich with<br/>Entities + Rels]
style RRF stroke:#f59e0b,stroke-width:2px
BM25 (Best Match 25) scores documents based on query term frequency. The formula:
\(\text{score}(D,Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{avgdl})}\)
Key intuitions:

- Rare query terms (high IDF) dominate the score.
- \(k_1\) saturates term frequency: the fifth occurrence of a term adds far less than the first.
- \(b\) normalizes for document length, so long documents can't win on volume alone.
For a full BM25 implementation, see hybrid search and indexing.
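As a sketch of the per-term contribution (the defaults \(k_1 = 1.2\) and \(b = 0.75\) are conventional; the full implementation is in the linked post):

```csharp
// Per-term BM25 contribution; tf = f(q_i, D), docLen = |D|.
static double Bm25Term(double tf, double idf, double docLen, double avgDocLen,
                       double k1 = 1.2, double b = 0.75)
    => idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * docLen / avgDocLen));
```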
RRF merges rankings from different retrieval systems. Each rank position gets a score:
\(\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + r(d)}\)
Where \(k\) (typically 60) prevents overweighting the top result. Documents appearing in both rankings get boosted:
// SearchService.cs - RRF fusion
const int k = 60;
foreach (var (chunk, rank) in denseResults.Select((c, i) => (c, i)))
scores[chunk.Id] = 1.0 / (k + rank + 1);
foreach (var (chunk, rank) in sparseResults.Select((c, i) => (c, i)))
{
var rrfScore = 1.0 / (k + rank + 1);
if (scores.TryGetValue(chunk.Id, out var existing))
scores[chunk.Id] = existing + rrfScore; // Boost for appearing in both!
else
scores[chunk.Id] = rrfScore;
}
Example: a document ranked #1 in dense and #3 in sparse scores \(\frac{1}{60+1} + \frac{1}{60+3} \approx 0.0164 + 0.0159 = 0.0323\) - appearing in both lists beats topping just one (\(1/61 \approx 0.016\)). This is also why RRF "top scores" look tiny compared to cosine similarities.

With retrieval settled, queries are routed to one of three modes:
flowchart TB
Q[Query] --> CLASS[Classify Query]
CLASS --> |"How do I use X?"| LOCAL[Local Search]
CLASS --> |"What are the themes?"| GLOBAL[Global Search]
CLASS --> |"How does X relate to Y?"| DRIFT[DRIFT Search]
LOCAL --> HS[Hybrid Search] --> CTX1[Chunk + Entity Context]
GLOBAL --> CS[Community Summaries] --> MAP[Map-Reduce]
DRIFT --> BOTH[Local + Communities] --> SYN[Synthesize]
CTX1 --> LLM[LLM Answer]
MAP --> LLM
SYN --> LLM
style LOCAL stroke:#22c55e,stroke-width:2px
style GLOBAL stroke:#3b82f6,stroke-width:2px
style DRIFT stroke:#a855f7,stroke-width:2px
// QueryEngine.cs
private static QueryMode ClassifyQuery(string query)
{
var q = query.ToLowerInvariant();
if (q.Contains("main theme") || q.Contains("summarize") || q.Contains("overview"))
return QueryMode.Global;
if (q.Contains("relate") || q.Contains("connect") || q.Contains("compare"))
return QueryMode.Drift;
return QueryMode.Local;
}
This classifier is deliberately simple - and easy to replace with a small intent model later. If no entities match, the system degrades cleanly to pure hybrid retrieval.
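One possible upgrade path (a sketch, not in the repo): embed one prototype phrase per mode and route to the nearest. The EmbedAsync call and prototype phrases here are hypothetical - the repo's embedder exposes EmbedBatchAsync:

```csharp
// Hypothetical sketch: route by embedding similarity to per-mode prototype phrases.
private async Task<QueryMode> ClassifyWithEmbeddingsAsync(string query, CancellationToken ct)
{
    var prototypes = new (QueryMode Mode, string Text)[]
    {
        (QueryMode.Local,  "how do I configure or use a specific tool"),
        (QueryMode.Global, "summarize the main themes across all documents"),
        (QueryMode.Drift,  "how do two technologies relate or compare"),
    };
    var qVec = await _embedder.EmbedAsync(query, ct);   // EmbedAsync assumed
    var best = QueryMode.Local;
    var bestSim = float.MinValue;
    foreach (var (mode, text) in prototypes)
    {
        var sim = CosineSimilarity(qVec, await _embedder.EmbedAsync(text, ct));
        if (sim > bestSim) { bestSim = sim; best = mode; }
    }
    return best;
}
```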
dotnet run --project Mostlylucid.GraphRag -- index ./test-markdown
GraphRAG Indexer
Source: test-markdown
Database: graphrag.duckdb
Model: llama3.2:3b
Initializing...
Indexing docker-development-deep-dive.md: 0%
Indexing docker-swarm-cluster-guide.md: 40%
Indexing dockercomposedevdeps.md: 80%
Indexing complete: 100%
Classifying entities...: 0%
Extracted 168 entities, 315 relationships: 100%
Found 10 communities: 100%
Summarizing c_0_2 (12 entities): 20%
Summarizing c_0_8 (4 entities): 80%
────────────────── Indexing Complete ───────────────────
┌───────────────┬───────┐
│ Metric │ Count │
├───────────────┼───────┤
│ Documents │ 5 │
│ Chunks │ 62 │
│ Entities │ 168 │
│ Relationships │ 312 │
│ Communities │ 10 │
└───────────────┴───────┘
dotnet run --project Mostlylucid.GraphRag -- query "How do I use Docker Compose?"
──────────────────── Local Search ────────────────────
Query: How do I use Docker Compose?
╭─Answer────────────────────────────────────────────────╮
│ To run the services defined in the │
│ devdeps-docker-compose.yml file, you need to run the │
│ following command in the same directory as the file: │
│ │
│ docker compose -f .\devdeps-docker-compose.yml up -d │
│ │
│ This command will start the containers in detached │
│ mode. │
╰───────────────────────────────────────────────────────╯
Related Entities: Docker, container, services, image
Sources: 5 chunks (top score: 0.016)
dotnet run --project Mostlylucid.GraphRag -- stats
─────────────── GraphRAG Database Stats ────────────────
┌───────────────┬───────┐
│ Metric │ Count │
├───────────────┼───────┤
│ Documents │ 5 │
│ Chunks │ 62 │
│ Entities │ 168 │
│ Relationships │ 312 │
│ Communities │ 10 │
└───────────────┴───────┘
Database size: 7.76 MB
For 100 blog posts (~500 chunks):
| Operation | Microsoft GraphRAG | This Implementation |
|---|---|---|
| Entity extraction | 2 × 500 = 1,000 LLM calls | 0-1 (optional batch) |
| Relationships | Included above | Co-occurrence (free) |
| Community summaries | ~20 calls | Same |
| Total indexing | ~1,020 calls | ~20 calls |
| Cost (gpt-4o-mini) | ~$5-10 | ~$0.10 |
| Cost (Ollama) | N/A | $0 |
One to two orders of magnitude cheaper for structured technical content (markdown with headings, code blocks, and links).
Rough order-of-magnitude estimate; exact cost depends on chunk size and prompt shape.
| This Implementation | LLM-Heavy GraphRAG |
|---|---|
| Fast, cheap indexing | Higher entity quality |
| Good for tech docs | Better for general text |
| Co-occurrence relationships | Semantic relationships |
| Works offline | Requires API access |
Conceptually, this is the same pipeline as DocSummarizer: build structure first, then let an LLM narrate it.
Where this breaks down:

- Fiction or narrative text without structural markup.
- Implicit relationships with no lexical signal.
- Highly ambiguous entity names that require world knowledge to disambiguate.
The implementation is minimal - 11 files, ~1,500 lines:
Mostlylucid.GraphRag/
├── Storage/GraphRagDb.cs # DuckDB with HNSW + provenance
├── Services/EmbeddingService.cs # ONNX BERT wrapper
├── Services/OllamaClient.cs # LLM client
├── Extraction/EntityExtractor.cs # Heuristic + link extraction
├── Search/SearchService.cs # BM25 + BERT hybrid
├── Graph/CommunityDetector.cs # Leiden + summarization
├── Query/QueryEngine.cs # Local/Global/DRIFT
├── Indexing/MarkdownIndexer.cs # Chunking
├── GraphRagPipeline.cs # Orchestration
├── Models.cs # Shared types
└── Program.cs # CLI
Source: Mostlylucid.GraphRag/