Thursday, 29 January 2026
Deep Research without Deep Pockets.
When "Deep Research" landed in the premium AI tools, I was curious: what is this, actually? So I pulled it apart. And once I understood the architecture, I realised I'd already built most of it.
In the commercial tools, “Deep Research” usually means a workflow that plans a research approach, does multi-step searching/browsing/reading across lots of sources (often web + your uploaded/connected files), then writes a longer report with citations.
A big part of the value is improving the input: pull in fresher, higher-signal sources than any model’s training data cut-off date, ground claims in evidence, and reduce the chance of the model making things up.
Examples: OpenAI Deep Research, Gemini Deep Research, and Perplexity Deep Research (also called “Research mode” in their UI/docs).
DoomSummarizer follows the same workflow, but it’s engineered so most of the “read + reduce” work is deterministic and local, and the LLM is used mainly for synthesis.
If the reduce step is strong, you don’t need a big, expensive model to get a good result.
Between lucidRAG and the DocSummarizer series, I already had most of the building blocks: multi-source retrieval, entity extraction, knowledge graphs, embedding-based ranking.
The missing piece wasn't capability. It was orchestration. How do you take all these signals and fuse them into something coherent with a tiny model?
That's DoomSummarizer. Named after doom scrolling, because it doom scrolls so you don't have to. Point it at your sources, set a vibe, get a synthesis.
But this article isn't about the tool. It's about the ideas that make it work.
This is Part 5 of the DocSummarizer series.
If you want the groundwork first:
This is also a Time Boxed Tool.
Standard RAG already does semantic retrieval: chunk documents, embed them, and fetch the top-k most similar chunks.
The pain is the last step: you still end up dumping a noisy, overlapping pile of chunks into an LLM and asking it to reconcile duplicates, weigh sources, and keep a coherent thread. That works, but it pushes you toward big context windows and expensive models.
What if you did more work before the LLM sees anything?
That's the pattern: reduce before you generate. Take raw documents and deterministically reduce them to their essential signals: embeddings, ranks, salient segments, entity profiles. By the time the LLM sees anything, the hard work is done.
flowchart TD
subgraph REDUCE["REDUCE: Distil to Signals"]
Q[Query] --> Decompose[Composite Decomposition]
Decompose --> S1[Subquery 1]
Decompose --> S2[Subquery 2]
S1 & S2 --> Fetch[Multi-Source Fetch]
Fetch --> Embed[ONNX Embed Each Item]
Embed --> Rank[6-Signal RRF Fusion]
Rank --> TextRank[TextRank Extraction]
TextRank --> NER[Entity Extraction]
NER --> Profile[Entity Profile Vectors]
Profile --> Segments[Salient Segment Extraction]
end
subgraph SYNTH["SYNTHESIZE"]
Segments --> Standard["Standard Path\n1 LLM Call"]
Segments --> LongForm["Long-Form Path\nN+3 LLM Calls"]
end
style REDUCE fill:#1a1a2e,color:#e0e0e0
style SYNTH fill:#162447,color:#e0e0e0
The reduce phase is deterministic: embeddings, ranking, entity extraction, TextRank, segment scoring. No LLM calls.
Synthesis is the only LLM step. For a standard query the reduced signals are clean enough that one call does the job. For long-form documents, a sentinel plans the structure and the main model generates sections in parallel (N+3 calls for N sections).
For more detail on this pattern, see Reduced RAG.
Before anything gets reduced, the query needs to become search actions.
A small local sentinel model (0.6B in my case, JSON mode, temperature 0.1) interprets the raw query into a structured intent, then a YAML-driven source router turns that intent into concrete fetches.
flowchart TD
Q["User Query"] --> Sentinel["Sentinel: JSON Interpretation\ncategories, intent, entities,\ntemporal hints, tone"]
Sentinel --> Router["Source Router\nYAML category → source mapping"]
Router --> S1["gnews:AI safety"]
Router --> S2["search:AI regulation"]
Router --> S3["bbc:technology"]
Router --> S4["reddit • hn"]
S1 & S2 & S3 & S4 --> Fetch["Parallel Fetch\ncircuit breaking + rate limiting"]
Fetch --> Merge["Merge + Deduplicate\n→ reduce phase"]
style Sentinel fill:#1e3a5f,color:#e0e0e0
style Router fill:#0d3b2e,color:#e0e0e0
The sentinel extracts structured fields: categories, intent, entities, temporal hints, and tone.
Source selection happens in six phases: explicit mentions go straight in ("from HackerNews" → hn), research or deep_dive intents add arxiv, entity hints become targeted queries (gnews:OpenAI), and so on. The result: ~10 source identifiers fetched in parallel with per-source circuit breaking.
Vibes modify the search terms before they hit any API. For example, --vibe doom prefixes queries with "concerning problems risks issues in...", reshaping what comes back before any ranking happens.
If you want the background on the resilience bits, see Using Polly for Retries (circuit breakers) and Backpressure in Queueing Systems (rate limiting and backpressure).
Most search systems treat "What's new in AI safety and what are the latest regulations?" as one query. It's two. Treating it as one dilutes both.
The sentinel decomposes compound queries:
{
"is_composite": true,
"subqueries": [
"What's new in AI safety?",
"What are the latest AI regulations?"
]
}
The interesting choice is what happens next. Each subquery gets its own ONNX embedding (see ONNX: Running ML Models Locally). When scoring articles, I take the max cosine similarity across all subquery embeddings, not the average.
Why max? An article that perfectly answers "AI safety" shouldn't be penalised for saying nothing about "regulations". Averaging would bury it under mediocre partial matches. Max similarity means: if an article nails any part of your question, it rises. The alternative, averaging, rewards bland articles that vaguely touch everything.
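In code, the choice is a single line: take the best subquery match, not the mean. A minimal sketch (the method is mine; CosineSimilarity is the same SIMD-accelerated TensorPrimitives call mentioned later):

using System;
using System.Numerics.Tensors;

// Score one article against a decomposed query: the best subquery match wins.
public static float ScoreAgainstSubqueries(
    ReadOnlySpan<float> articleEmbedding, float[][] subqueryEmbeddings)
{
    var best = float.MinValue;
    foreach (var subquery in subqueryEmbeddings)
    {
        var sim = TensorPrimitives.CosineSimilarity(articleEmbedding, subquery);
        if (sim > best) best = sim;   // max, not average
    }
    return best;
}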
I have six ranking signals. They have completely different scales:
flowchart LR
subgraph Signals
BM25[BM25\nKeyword Match]
Fresh[Freshness\n48h Half-Life]
Auth[Authority\nHN/Reddit Score]
QSim[Query Similarity\nCosine]
Vibe[Vibe Alignment\nCosine]
Qual[Quality\nClickbait Detector]
end
subgraph RRF[RRF Fusion]
direction TB
R[Rank Each Signal\nIndependently]
F["score = Σ weight × 1/(60 + rank)"]
end
BM25 & Fresh & Auth & QSim & Vibe & Qual --> R --> F
style RRF fill:#1e3a5f,color:#e0e0e0
HN points go to thousands. Cosine similarity is -1 to 1. Freshness decays exponentially. You can't safely add these together. The scales are incompatible.
The keyword signal here is BM25 across multiple fields (BM25F). If you want the background on BM25, see DocSummarizer Part 3.
Reciprocal Rank Fusion (Cormack et al., 2009) solves this by ignoring raw scores: each signal ranks items independently, then you combine the ranks: score = Σ weight × 1/(k + rank) (with k=60).
If you want the full background on RRF, Part 3 covers it. The twist here is that the weights adapt to query type: the sentinel detects intent and the scorer reshapes itself, as in the sketch below.
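Here's a compact sketch of weighted RRF with intent-adaptive weights. The k=60 constant, the six signals, and the vibe weight of 0.4 come from this article; the other per-intent values are illustrative, not the real configuration.

using System.Collections.Generic;
using System.Linq;

public static class RrfFusion
{
    private const double K = 60.0;

    public static Dictionary<string, double> WeightsFor(string intent)
    {
        // Baseline weights (vibe stays at 0.4, as described below).
        var weights = new Dictionary<string, double>
        {
            ["bm25"] = 1.0, ["freshness"] = 0.8, ["authority"] = 0.8,
            ["querySim"] = 1.0, ["vibe"] = 0.4, ["quality"] = 0.5,
        };

        // The scorer reshapes itself per intent (values illustrative).
        if (intent == "news")
        {
            weights["freshness"] = 1.4;
            weights["authority"] = 1.0;
        }
        else if (intent is "research" or "deep_dive")
        {
            weights["querySim"] = 1.4;
            weights["freshness"] = 0.4;
        }
        return weights;
    }

    // rankings: signal name → item ids ordered best-first. Raw scores never meet each other.
    public static Dictionary<string, double> Fuse(
        Dictionary<string, string[]> rankings, IReadOnlyDictionary<string, double> weights)
    {
        var fused = new Dictionary<string, double>();
        foreach (var (signal, orderedIds) in rankings)
        {
            var weight = weights.GetValueOrDefault(signal, 1.0);
            for (var i = 0; i < orderedIds.Length; i++)
                fused[orderedIds[i]] = fused.GetValueOrDefault(orderedIds[i])
                                     + weight * (1.0 / (K + i + 1)); // rank is 1-based in the formula
        }
        return fused.OrderByDescending(kv => kv.Value)
                    .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}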
This started as a joke. Now it's my favourite feature.
A "vibe" isn't prompt wording. It's a first-class ranking signal embedded into the retrieval pipeline:
flowchart LR
V["--vibe doom"] --> Expand["Query Expansion\n'concerning risks issues in...'"]
V --> Embed["Vibe Embedding\n'vulnerability breach layoffs\nrecession crisis warning...'"]
Embed --> Score["Cosine Similarity\nper article"]
Score --> RRF["RRF Signal\nweight: 0.4"]
style V fill:#8b0000,color:#e0e0e0
Three things happen when you set a vibe: the search terms get expanded with the vibe's keywords, the vibe text itself gets embedded, and every article is scored against that embedding as a dedicated RRF signal (weight 0.4).
Custom vibes work identically. With --vibe "contemplative philosophical", your text becomes both the query prefix and the ranking target. The system doesn't distinguish predefined from custom. It just embeds whatever you give it.
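Mechanically, a vibe is just text that gets embedded once and reused. A minimal sketch (the record and method names are mine):

using System;
using System.Numerics.Tensors;

// A vibe is text: embed it once, then use it two ways.
public sealed record Vibe(string ExpansionText, float[] Embedding)
{
    // 1. Reshape the search terms before they hit any API
    //    ("doom" expands to "concerning problems risks issues in...";
    //    custom vibe text is used verbatim).
    public string ExpandQuery(string query) => $"{ExpansionText} {query}";

    // 2. Act as a ranking target: cosine similarity per article, fed into RRF (weight 0.4).
    public float Score(ReadOnlySpan<float> articleEmbedding)
        => TensorPrimitives.CosineSimilarity(Embedding, articleEmbedding);
}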
Here's a mistake I made first: counting shared entities between documents.
Two articles both mention "California": one about OpenAI's safety team, one about almond farming. Shared entity count: 1. Therefore related! Obviously wrong.
The fix was to think about entities as vectors, not strings.
This is the constrained fuzziness pattern: the NER model proposes entities, but deterministic weighting decides what matters.
Each document gets a weighted entity profile (a 384-dim vector encoding which entities appear and how distinctive they are):
weight = TF × IDF × confidence × type_weight
profile = L2_normalize(Σ entity_embedding × weight)
flowchart TD
subgraph Doc1["Article: OpenAI Safety Team"]
E1["OpenAI (ORG)\nIDF: high\nweight: 2.1"]
E2["California (LOC)\nIDF: low\nweight: 0.3"]
E3["Safety Team (MISC)\nIDF: high\nweight: 1.8"]
end
subgraph Doc2["Article: Almond Farming"]
E4["California (LOC)\nIDF: low\nweight: 0.3"]
E5["Almond Farmers (MISC)\nIDF: high\nweight: 1.9"]
E6["Drought (MISC)\nIDF: medium\nweight: 1.2"]
end
Doc1 --> P1["Profile Vector\nDominated by OpenAI + Safety"]
Doc2 --> P2["Profile Vector\nDominated by Farming + Drought"]
P1 -. "low similarity" .- P2
style Doc1 fill:#1a1a2e,color:#e0e0e0
style Doc2 fill:#2d1a2e,color:#e0e0e0
IDF is the key. "California" appears in tons of articles - low IDF, weak signal. "OpenAI" is distinctive - high IDF, dominates the profile. Now similarity search works. OpenAI articles cluster together. Farming articles cluster separately. The shared "California" becomes noise.
The clever part: saturating TF. Instead of raw mention count, I use 1 + log(mentions). An entity mentioned 50 times in a long article doesn't get 50× the weight of one mentioned once. It gets ~5×. This prevents boilerplate entities from flooding the profile.
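Putting the whole weighting together, here's a sketch of profile construction. The EntityMention shape and the per-type weights are illustrative; the saturating TF, the IDF weighting, and the 384-dim L2-normalised profile follow the text above.

using System;
using System.Collections.Generic;
using System.Numerics.Tensors;

public sealed record EntityMention(
    string Name, string Type, int Mentions, double Confidence, float[] Embedding);

public static class EntityProfiles
{
    // Illustrative per-type weights; the real values are configuration.
    private static readonly Dictionary<string, double> TypeWeights = new()
    {
        ["ORG"] = 1.2, ["PER"] = 1.1, ["LOC"] = 0.8, ["MISC"] = 1.0,
    };

    public static float[] Build(
        IEnumerable<EntityMention> entities,
        IReadOnlyDictionary<string, double> idf,
        int dims = 384)
    {
        var profile = new float[dims];
        foreach (var e in entities)
        {
            var tf = 1.0 + Math.Log(e.Mentions);             // saturating TF: 50 mentions ≈ 5x, not 50x
            var weight = tf
                       * idf.GetValueOrDefault(e.Name, 1.0)  // rare entities dominate
                       * e.Confidence                        // NER proposes, weighting decides
                       * TypeWeights.GetValueOrDefault(e.Type, 1.0);

            for (var i = 0; i < dims; i++)
                profile[i] += (float)(e.Embedding[i] * weight);
        }

        // L2-normalise so profiles compare by direction, not magnitude.
        var norm = TensorPrimitives.Norm(profile);
        if (norm > 0) TensorPrimitives.Divide(profile, norm, profile);
        return profile;
    }
}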
Entity profiles aren't just for understanding documents. They're a retrieval dimension. Each document's entity profile gets indexed in an HNSW graph (DuckDB's VSS extension), enabling O(log N) discovery of semantically related articles that keyword search would miss entirely.
If you want more detail on HNSW and DuckDB VSS, see GraphRAG Part 2: Minimum Viable GraphRAG.
flowchart TD
subgraph Retrieval["Three-Layer Retrieval (UNION)"]
L1["Lucene FTS\nkeyword matches"]
L2["Embedding HNSW\nsemantic similarity"]
L3["Entity Profile HNSW\nentity fingerprint match"]
L1 & L2 & L3 --> Union["Union of all candidates"]
end
subgraph Enrich["Post-RRF Enrichment"]
Top["Top 5 Ranked Items"] --> Agg["Aggregate Profile\nmean of entity profiles"]
Agg --> Graph["HNSW Search\nmin similarity: 0.30"]
Graph --> Related["+3 Related Articles\nscored below existing items"]
end
Union --> RRF["RRF Fusion"] --> Enrich
style Retrieval fill:#1a1a2e,color:#e0e0e0
style Enrich fill:#0d3b2e,color:#e0e0e0
This works at two stages:
During retrieval: when the sentinel extracts 2+ entities from the query, a query entity profile is computed using the same TF×IDF formula. This searches the HNSW index for articles with similar entity fingerprints, with a minimum similarity of 0.25. An article about "OpenAI safety concerns" surfaces even if it never mentions the exact query terms, because its entity profile (dominated by high-IDF "OpenAI" + "safety") is close to the query's.
After ranking: the top 5 ranked items have their entity profiles averaged into an aggregate vector. This aggregate searches for related articles that the keyword and embedding layers missed entirely. Found articles get scored just below the lowest ranked item (×0.9) and tagged as discovered "via entities". This catches the primary sources that news articles reference but don't quote verbatim.
Three retrieval layers fuse via union: Lucene keyword matches ∪ embedding HNSW matches ∪ entity profile HNSW matches. Union is deliberately broader than intersection. It catches items that any single signal would surface. The RRF fusion then sorts out what's actually relevant.
The fallback path (for corpora without entity profiles yet) uses SQL entity co-occurrence counting (HAVING shared_count >= 2). It works, but it's O(N²) instead of O(log N). The --backfill-entity-profiles command migrates existing corpora to the HNSW path.
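For reference, the fallback is roughly this shape of query; the table and column names here are hypothetical, not the real schema.

// Sketch of the co-occurrence fallback: find documents sharing 2+ entities with a source document.
const string RelatedByEntityOverlapSql = """
    SELECT other.document_id, COUNT(*) AS shared_count
    FROM document_entities AS source
    JOIN document_entities AS other
      ON other.entity_name = source.entity_name
     AND other.document_id <> source.document_id
    WHERE source.document_id = @documentId
    GROUP BY other.document_id
    HAVING COUNT(*) >= 2          -- at least two shared entities
    ORDER BY shared_count DESC
    LIMIT 10;
    """;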
This is the normal path. scroll "AI safety and regulation" hits this. After the reduce phase distils your sources into ranked, deduplicated, entity-profiled segments, synthesis is a single call:
flowchart TD
subgraph REDUCE["Reduce Phase (ALL DETERMINISTIC)"]
Items[Fetched Items] --> Embed[ONNX Embed All Items]
Embed --> RRF[6-Signal RRF Fusion]
RRF --> TR[TextRank Compression]
TR --> Seg[Segment Extraction + Salience Scoring]
Seg --> Dedup[Deduplication + Relevance Floor]
end
subgraph SYNTH["Synthesis (1 LLM CALL)"]
Dedup --> Rerank["Semantic Re-Rank\nby query similarity"]
Rerank --> Budget["Smart Evidence Budgeting\nredistribute unused chars"]
Budget --> Gen["Single LLM Call\nwith curated evidence"]
Gen --> Output[Final Summary]
end
style REDUCE fill:#0d3b2e,color:#e0e0e0
style SYNTH fill:#1e3a5f,color:#e0e0e0
The reduce phase does the hard work. By the time the LLM sees anything, it receives curated evidence blocks formatted as [E1] Title | topic | relevance, with each block's content truncated to its character budget.
One LLM call. That's the payoff of reduced RAG: selection, ranking, deduplication, and budgeting all happen before the LLM sees anything.
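The smart evidence budgeting step deserves a sketch: split the character budget evenly across ranked items, then let short items donate their unused characters down the line. This is my reading of "redistribute unused chars"; the tool's exact policy may differ.

using System;
using System.Collections.Generic;

public static List<string> BudgetEvidence(IReadOnlyList<string> rankedContents, int totalChars)
{
    var perItem = totalChars / Math.Max(1, rankedContents.Count);
    var unused = 0;
    var budgeted = new List<string>();

    foreach (var content in rankedContents)
    {
        var budget = perItem + unused;        // earlier short items donate their slack
        if (content.Length <= budget)
        {
            budgeted.Add(content);
            unused = budget - content.Length;
        }
        else
        {
            budgeted.Add(content[..budget]);  // truncate to budget
            unused = 0;
        }
    }
    return budgeted;
}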
Long-form (--template blog-article) is a different beast. You're generating a multi-section document from dozens of sources, and you need coherence across sections without expensive LLM compression calls.
The answer is constrained fuzzy context dragging. These are deterministic mechanisms that maintain cross-section coherence while allowing parallel generation.
flowchart TD
subgraph Phase1["Phase 1: Evidence Preparation (DETERMINISTIC)"]
Articles[Top 20 Articles] --> Segments[Chunk into Segments]
Segments --> Salience[Score Salience per Segment]
Salience --> EmbedSeg[Embed Each Segment]
end
subgraph Phase2["Phase 2: Document Planning (1 SENTINEL CALL)"]
EmbedSeg --> Summary[Build Evidence Summary]
Summary --> Sentinel["Sentinel: Generate Outline\nwith theme keywords per section"]
Sentinel --> EmbedThemes[Embed Section Themes]
end
subgraph Phase3["Phase 3: Evidence Assignment (DETERMINISTIC)"]
EmbedThemes --> Assign["Score: 60% theme similarity\n+ 25% salience + 15% relevance"]
Assign --> Dedup[Cross-Section Deduplication]
Dedup --> Gate["Quality Gates\n• Min salience 0.35\n• Min theme sim 0.45\n• Max 2 per source per section"]
end
subgraph Phase4["Phase 4: Section Generation (N+2 LLM CALLS)"]
Gate --> Intro["Intro (sequential)"]
Intro --> Body["N Body Sections (parallel)\nmax 3 concurrent"]
Body --> Conclusion["Conclusion (sequential)"]
end
subgraph Phase5["Phase 5: Validation (DETERMINISTIC)"]
Conclusion --> Validate[Citation Validation\nURL + Entity Grounding]
end
subgraph Phase6["Phase 6: Assembly (DETERMINISTIC)"]
Validate --> Assemble[Final Document Assembly]
end
style Phase1 fill:#0d3b2e,color:#e0e0e0
style Phase2 fill:#1e3a5f,color:#e0e0e0
style Phase3 fill:#0d3b2e,color:#e0e0e0
style Phase4 fill:#1e3a5f,color:#e0e0e0
style Phase5 fill:#0d3b2e,color:#e0e0e0
style Phase6 fill:#0d3b2e,color:#e0e0e0
Count the LLM calls: 1 (sentinel outline) + 1 (intro) + N (body sections) + 1 (conclusion) = N+3. Four of the six phases are deterministic.
The problem with parallel section generation is coherence. If sections don't know what other sections said, you get repetition and drift. The usual fix is sequential generation with full context. That's slow and wastes context window.
Instead, each section gets a constrained context that's built deterministically:
flowchart LR
subgraph Context["Per-Section Context (ALL DETERMINISTIC)"]
RS["Running Summary\n1400 char budget\nrecent 2 sections: full\nolder: heading only"]
NP["Negative Prompts\n'Do NOT discuss: X, Y, Z'\nfrom covered concepts"]
EC["Entity Continuity\nre-introduce entities\nlast seen 2+ sections ago"]
Props["Propositions\n~15 atomic facts\nper section"]
Evidence["Curated Evidence\nmax 12 segments\nquality-gated"]
end
Context --> LLM["LLM generates\nwith full awareness\nof document state"]
style Context fill:#1a1a2e,color:#e0e0e0
In parallel mode, the intro generates first (sets the baseline), body sections run concurrently (up to 3 via SemaphoreSlim to cap concurrency, each excluding intro topics but not each other), and the conclusion runs last (excluding everything).
Sequential mode gives tighter coherence (each section excludes all previous concepts cumulatively), but parallel is ~3x faster for large documents.
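Here's a compact sketch of how that per-section context can be assembled deterministically. The record shape is mine; the 1400-character budget, the two-most-recent-sections rule, the negative prompts, and the entity re-introduction follow the diagram above.

using System.Collections.Generic;
using System.Linq;
using System.Text;

public sealed record SectionState(string Heading, string Summary, string[] Concepts, string[] Entities);

public static string BuildSectionContext(IReadOnlyList<SectionState> written)
{
    var sb = new StringBuilder();

    // Running summary: the two most recent sections in full, older ones as headings only.
    sb.AppendLine("The document so far:");
    for (var i = 0; i < written.Count; i++)
    {
        var isRecent = i >= written.Count - 2;
        sb.AppendLine(isRecent ? $"- {written[i].Heading}: {written[i].Summary}"
                               : $"- {written[i].Heading}");
    }
    if (sb.Length > 1400) sb.Length = 1400;     // hard character budget for the summary

    // Negative prompts: concepts other sections already covered.
    var covered = written.SelectMany(s => s.Concepts).Distinct().ToList();
    if (covered.Count > 0)
        sb.AppendLine($"Do NOT discuss: {string.Join(", ", covered)}");

    // Entity continuity: re-introduce entities last seen two or more sections ago.
    var recentEntities = written.TakeLast(2).SelectMany(s => s.Entities).ToHashSet();
    var stale = written.SkipLast(2).SelectMany(s => s.Entities)
                       .Where(e => !recentEntities.Contains(e)).Distinct().ToList();
    if (stale.Count > 0)
        sb.AppendLine($"Briefly re-introduce: {string.Join(", ", stale)}");

    return sb.ToString();
}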
Each section needs evidence, but not any evidence. Every segment gets a composite score: 60% theme similarity (cosine distance to the section's theme embedding), 25% salience (how informative the segment is), and 15% article relevance (how relevant the source article is overall). This scoring is entirely deterministic. No LLM decides what's relevant.
Then the gates: minimum salience of 0.35, minimum theme similarity of 0.45, and at most two segments per source per section.
The LLM target word count adjusts based on evidence quality. Strong evidence (salience ≥ 0.6, 2+ sources, 4+ segments) → full word count. Weak evidence (salience < 0.45, sparse) → 60% of target. This prevents hallucination when the evidence is thin.
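A sketch of the assignment scoring, the gates, and the adaptive target length, using the numbers above (the record and method names are mine):

using System.Numerics.Tensors;

public sealed record EvidenceSegment(
    float[] Embedding, double Salience, double ArticleRelevance, string SourceUrl);

public static class EvidenceAssignment
{
    // Composite score: 60% theme similarity, 25% salience, 15% article relevance.
    public static double Score(EvidenceSegment seg, float[] sectionTheme)
        => 0.60 * TensorPrimitives.CosineSimilarity(seg.Embedding, sectionTheme)
         + 0.25 * seg.Salience
         + 0.15 * seg.ArticleRelevance;

    // Quality gates (the cap of two segments per source per section is enforced by the caller).
    public static bool PassesGates(EvidenceSegment seg, double themeSimilarity)
        => seg.Salience >= 0.35 && themeSimilarity >= 0.45;

    // Strong evidence gets the full word count; weak evidence gets 60% of it.
    public static int TargetWords(int baseline, double avgSalience, int sources, int segments)
    {
        if (avgSalience >= 0.6 && sources >= 2 && segments >= 4) return baseline;
        if (avgSalience < 0.45) return (int)(baseline * 0.6);
        return baseline;
    }
}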
Before evidence reaches the LLM, it's atomised into propositions (inspired by the Dense-X Retrieval paper). Each segment is broken into six types of atomic fact: claims (general factual assertions), quotes (direct from source), statistics (numbers and metrics), definitions ("X is Y"), processes (step-by-step), and named entity facts (focused on a specific entity).
Each section gets ~15 propositions, deduplicated across sections using semantic similarity (not string matching). The LLM receives structured bullet points grouped by source, not raw paragraphs.
By the time the LLM writes a section, the reduce phase has given it a running summary of the document so far, negative prompts for concepts already covered, entity continuity notes, ~15 atomic propositions, and a curated, quality-gated evidence set.
A small model can do excellent work with this kind of pre-processing.
When articles are too long for the evidence pipeline, I need to compress them. But I don't want to spend an LLM call on summarisation.
TextRank (Mihalcea & Tarau, 2004) does it deterministically:
flowchart LR
Text[Article Text] --> Split[Split into Sentences]
Split --> EmbedS[Embed Each Sentence]
EmbedS --> Graph["Build Similarity Graph\n(cosine > 0.15 = edge)"]
Graph --> PR["PageRank\n(20 iterations, d=0.85)"]
PR --> Select["Select Top-K\nin Original Order"]
style Graph fill:#1e3a5f,color:#e0e0e0
Every sentence gets an embedding. Pairwise similarities above 0.15 become graph edges. PageRank finds the most central sentences, the ones most connected to everything else. Those are the sentences that best represent the document.
The key detail: selected sentences are returned in original document order. This preserves narrative flow. You get a coherent summary, not a random bag of important sentences.
No LLM call. Runs in milliseconds. Cosine similarity is SIMD-accelerated via TensorPrimitives (System.Numerics.Tensors), using AVX2, AVX-512, or ARM NEON chosen at runtime.
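Here's a minimal sketch of that loop, using the numbers above (edge threshold 0.15, 20 iterations, damping 0.85). Sentence splitting and embedding happen upstream; the real implementation will differ in detail.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Numerics.Tensors;

// Sentences in, the K most central sentences out, returned in original order.
public static class TextRankSummariser
{
    public static IReadOnlyList<string> Summarise(
        IReadOnlyList<string> sentences, IReadOnlyList<float[]> embeddings,
        int topK, double edgeThreshold = 0.15, double damping = 0.85, int iterations = 20)
    {
        var n = sentences.Count;

        // Build the similarity graph: an edge wherever cosine similarity clears the threshold.
        var weights = new double[n, n];
        var degree = new double[n];
        for (var i = 0; i < n; i++)
            for (var j = i + 1; j < n; j++)
            {
                double sim = TensorPrimitives.CosineSimilarity(embeddings[i], embeddings[j]);
                if (sim <= edgeThreshold) continue;
                weights[i, j] = weights[j, i] = sim;
                degree[i] += sim;
                degree[j] += sim;
            }

        // Weighted PageRank: central sentences accumulate score from their neighbours.
        var score = Enumerable.Repeat(1.0 / n, n).ToArray();
        for (var iter = 0; iter < iterations; iter++)
        {
            var next = new double[n];
            for (var i = 0; i < n; i++)
            {
                var incoming = 0.0;
                for (var j = 0; j < n; j++)
                    if (weights[j, i] > 0 && degree[j] > 0)
                        incoming += score[j] * weights[j, i] / degree[j];
                next[i] = (1 - damping) / n + damping * incoming;
            }
            score = next;
        }

        // Take the top-K by centrality, but emit them in document order to keep the narrative flow.
        return score.Select((s, idx) => (s, idx))
                    .OrderByDescending(t => t.s)
                    .Take(topK)
                    .Select(t => t.idx)
                    .OrderBy(idx => idx)
                    .Select(idx => sentences[idx])
                    .ToList();
    }
}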
Articles reference other articles. Those references often contain the best evidence - the primary source that a news article summarises.
DoomSummarizer follows links, but selectively. Each candidate link gets scored:
link_score = 0.7 × query_relevance(anchor_context) + 0.3 × segment_salience
Where segment salience combines:
Links below a relevance floor (0.15) are skipped. The system follows the journalistic inverted pyramid heuristic: important links tend to appear early in well-written articles.
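In code the decision is small; how anchor-context relevance and segment salience are computed is elided here, and the method is mine:

// Follow a link only if the weighted score clears the relevance floor.
public static bool ShouldFollowLink(
    double anchorContextRelevance, double segmentSalience, double relevanceFloor = 0.15)
{
    var score = 0.7 * anchorContextRelevance + 0.3 * segmentSalience;
    return score >= relevanceFloor;
}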
Results are cached with content hashes and ETags. Second runs are fast. See Response Caching, ETags, and Conditional Requests.
Everything described so far assumes web fetching. But you can skip the web entirely.
crawl ingests a website into a local knowledge base. It does a breadth-first crawl (BFS) from a seed URL, respecting depth and page limits, with adaptive rate limiting that scales delay based on server response times:
doomsummarizer crawl https://docs.example.com --name example-docs --depth 3 --max-pages 200
Everything gets persisted: full content, ONNX embeddings, entity profiles, SQLite FTS5 keyword indexes (full-text search), sentiment and topic scores (computed via embedding anchors, no LLM). Incremental re-crawls send If-None-Match / If-Modified-Since headers. Unchanged pages return HTTP 304 and skip reprocessing. For servers without ETag support, SHA256 content hashes catch duplicates.
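The conditional-request side of an incremental re-crawl is standard HttpClient usage. A minimal sketch, assuming the ETag was stored exactly as the server returned it (quotes included):

using System;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

// Send the validators stored from the last crawl; a 304 means skip reprocessing.
public static async Task<string?> FetchIfChangedAsync(
    HttpClient http, string url, string? storedEtag, DateTimeOffset? lastCrawledUtc)
{
    using var request = new HttpRequestMessage(HttpMethod.Get, url);

    if (storedEtag is not null)
        request.Headers.IfNoneMatch.Add(new EntityTagHeaderValue(storedEtag));
    if (lastCrawledUtc is not null)
        request.Headers.IfModifiedSince = lastCrawledUtc;

    using var response = await http.SendAsync(request);

    if (response.StatusCode == HttpStatusCode.NotModified)
        return null;                                        // unchanged: keep the stored copy

    response.EnsureSuccessStatusCode();
    return await response.Content.ReadAsStringAsync();      // changed: re-embed and re-index
}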
Once crawled, query entirely offline:
doomsummarizer scroll "how does authentication work?" --name example-docs
The --name flag routes to the local KB instead of web sources. The same three-layer retrieval fires: FTS5 full-text pre-filter, embedding HNSW search, entity profile HNSW search. They are union-fused and ranked. Same reduce pipeline, same synthesis. No network required.
The storage is lightweight: SQLite for metadata and FTS5 indexes, DuckDB with the VSS extension for HNSW vector indexes. No external services, no Docker, no API keys. A crawled knowledge base of hundreds of pages fits in a few megabytes and queries in milliseconds.
doomsummarizer scroll "AI safety and regulation" --vibe doom --debug
Composite query detected: 2 subqueries
• What's new in AI safety?
• What are the latest AI regulations?
Searching...
├─ Lucene: 18 keyword matches (regulation^3, safety^2)
├─ Embedding: 12 semantic matches (max-sim across 2 subqueries)
├─ RRF fusion: 22 candidates, 6 signals
├─ Entity HNSW: +3 related via entity profiles
├─ TextRank: compressed 4 long articles
└─ Final: 12 items
Long-form: Phase 1 - 187 segments from 12 articles
Long-form: Phase 2 - "AI Safety Landscape" - 5 sections
Long-form: Phase 3 - Evidence assigned (cross-section dedup: 8 removed)
Long-form: Phase 4 - Generating sections...
These are the insights. Take them, use them in your own systems:
Reduce before you generate: the standard path needs just 1 LLM call because the pipeline does all the heavy lifting deterministically. Reduced RAG is the pattern: embeddings, PageRank, heuristics, and quality gates produce such clean input that even small models synthesise well.
Sentinel-driven source routing: a small model interprets queries into structured intent, then YAML-driven routing selects and vibe-qualifies ~10 sources. The sentinel is cheap enough to call on every query.
Max similarity for compound queries: don't average subquery scores, take the best match. An article nailing one part of your question is better than one half-matching everything.
Combine ranks, not scores: RRF lets you fuse signals with incompatible scales. Authority, freshness, embedding similarity, vibe alignment; just rank and fuse.
Vibes as embeddings: any text can be a retrieval dimension. Embed it, score against it, weight it in RRF. No special cases needed.
Entity profiles as vectors: IDF-weighted entity embeddings, not entity string counting. Rare entities dominate the profile, common entities become noise, and HNSW search discovers related articles in O(log N).
Constrained fuzzy context dragging: deterministic context management enables parallel section generation. Running summaries, negative prompts, entity continuity, and drift detection maintain coherence without LLM compression calls.
Deterministic where possible: of the six long-form phases, four are deterministic. TextRank replaces LLM summarisation. Evidence assignment replaces LLM selection.
© 2026 Scott Galloway — Unlicense — All content and source code on this site is free to use, copy, modify, and sell.