📌 Note: This article teaches the fundamentals of web content extraction with LLMs using the simplest possible approach. For production use cases (web summarization, document analysis, agent tools), see DocSummarizer - it implements the architecture shown here at production grade: BERT embeddings, hybrid search, Playwright for SPAs, SSRF protection, validated citations, and proper coverage tracking.
When you ask ChatGPT to "read this article and summarize it," what actually happens? If you imagine the AI opening a browser and reading like you would - that's not how it works.
LLMs don't browse the web. They reason over fragments your code selects.
This article shows you how to build that pipeline in C# with Ollama - no frameworks, just practical code you can debug.
TL;DR: Your code fetches → cleans → chunks → selects. The LLM only sees the fragments you inject. Selection is lossy by design - that's both the constraint and the architecture.
Here's the insight that changes how you build these systems:
Selection is the product decision. It's also where most failures originate.
The LLM is downstream of your selection. It cannot recover information you didn't show it. When an agent "fails to find the answer," the problem is almost never the model - it's that your selection logic chose the wrong chunks. (This is the same principle behind why I avoid frameworks like LangChain - they abstract away the selection logic you need to debug.)
This is why "agent browsing" fails silently. The agent confidently answers based on what it saw. You never know it missed the right content.
flowchart LR
URL[Full Page] --> Select[Your Selection]
Select --> LLM[LLM Sees This]
LLM --> Answer[Answer]
URL -.->|"50KB"| Select
Select -.->|"2KB"| LLM
Miss[Missed Content] -.->|"Never seen"| X[❌]
style Select stroke:#e74c3c,stroke-width:3px
style Miss stroke:#95a5a6,stroke-width:2px,stroke-dasharray: 5 5
The model can only reason about what you gave it. Build accordingly.
Before writing anything, these are the constraints:
| Constraint | Why It Matters |
|---|---|
| No JS rendering | Static HTML only (Playwright for SPAs) |
| Bounded tokens | Hard budget per request (2-4K tokens typical) |
| Source-bounded answers | LLM must not hallucinate web knowledge |
| Deterministic selection | Same input → same chunks (debuggable) |
| Observable | Log what you selected, why, and what you discarded |
If you can't explain why a chunk was selected, you can't debug failures.
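One way to keep yourself honest: have selection emit a trace row per chunk, selected or not. A minimal sketch (this SelectionTrace record is illustrative, not part of the sample project):

// Illustrative trace record - capture enough to replay a selection decision
public record SelectionTrace(
    string? Heading,              // which chunk this row describes
    int Score,                    // why it ranked where it did
    List<string> MatchedKeywords, // what drove the score
    bool Selected);               // kept or discarded - log both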
flowchart TB
URL[URL] --> Fetch[1. Fetch]
Fetch --> Clean[2. Clean]
Clean --> Chunk[3. Chunk]
Chunk --> Select[4. Select]
Select --> LLM[LLM]
LLM --> Answer[Answer]
Fetch -.->|"57KB HTML"| Clean
Clean -.->|"6KB text"| Chunk
Chunk -.->|"5 chunks"| Select
Select -.->|"2 chunks"| LLM
style Select stroke:#e74c3c,stroke-width:3px
style Clean stroke:#f39c12,stroke-width:3px
Each step reduces data. By the time the LLM sees it, you've gone from 57KB of HTML to maybe 2KB of relevant text. Every reduction is lossy. Every reduction can discard the answer. This is the same pattern I use for analysing large CSV files - the LLM reasons, your code computes and selects.
Throughout this article, we'll use one URL:
https://learn.microsoft.com/en-us/dotnet/core/whats-new/dotnet-10/overview
And three questions of increasing specificity:
This keeps the examples grounded and shows how selection matters more as questions get specific.
# Install Ollama from https://ollama.ai
ollama pull llama3.2:3b
# NuGet packages
dotnet add package AngleSharp # HTML parsing
dotnet add package OllamaSharp # Ollama client (5.1.x)
OllamaSharp note: In version 5.x, `GenerateAsync` returns `IAsyncEnumerable<GenerateResponseStream?>` - it streams tokens as they're generated. You accumulate with `await foreach`. The sample project pins 5.1.5.
Standard HTTP, but with the details that matter:
public class WebFetcher : IDisposable
{
private readonly HttpClient _http;
public WebFetcher()
{
var handler = new HttpClientHandler
{
AllowAutoRedirect = true,
MaxAutomaticRedirections = 5, // Cap redirects
AutomaticDecompression = DecompressionMethods.All
};
_http = new HttpClient(handler) { Timeout = TimeSpan.FromSeconds(30) };
_http.DefaultRequestHeaders.Add("User-Agent",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
}
public async Task<string> FetchAsync(string url)
{
var response = await _http.GetAsync(url);
// Bail if not HTML
var contentType = response.Content.Headers.ContentType?.MediaType ?? "";
if (!contentType.Contains("html") && !contentType.Contains("text"))
throw new InvalidOperationException($"Not HTML: {contentType}");
response.EnsureSuccessStatusCode();
return await response.Content.ReadAsStringAsync();
}
public void Dispose() => _http.Dispose();
}
What this handles: Redirects (capped), compression, timeouts, User-Agent, content-type validation.
What it doesn't: JavaScript rendering, authentication, rate limiting, robots.txt. For production, add per-host delays and respect crawl policies.
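As an example, a per-host delay can be as simple as remembering the last request time per host. A sketch you could bolt onto WebFetcher (the one-second gap is an arbitrary choice, not a crawl-policy recommendation):

// Hypothetical per-host throttle: space out requests to the same host
private readonly Dictionary<string, DateTime> _lastRequest = new();

private async Task ThrottleAsync(string url)
{
    var host = new Uri(url).Host;
    if (_lastRequest.TryGetValue(host, out var last))
    {
        var wait = TimeSpan.FromSeconds(1) - (DateTime.UtcNow - last);
        if (wait > TimeSpan.Zero)
            await Task.Delay(wait);
    }
    _lastRequest[host] = DateTime.UtcNow;
}

Call it at the top of FetchAsync before the GET.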
Raw HTML is mostly noise. Clean it or you waste tokens on layout.
A typical page:
57KB HTML → 6KB useful text (90% reduction)
flowchart LR
HTML[Raw HTML] --> Remove[Remove Noise]
Remove --> Find[Find Main Content]
Find --> Extract[Extract Text + Headings]
Extract --> Normalize[Normalize]
Remove -.->|"script, style, nav, ads"| Find
Find -.->|"main, article, .content"| Extract
style Remove stroke:#e74c3c,stroke-width:3px
style Find stroke:#f39c12,stroke-width:3px
Your cleaner can absolutely delete the answer. Common mistakes:

- Removing elements by aggressive class wildcards that also match real content containers (.article-body, .post-text)

Mitigations:

- Remove by role and known noise tags, not broad wildcards
- Preserve headings (h1-h6) - the chunker needs them later
- Fall back to <body> when the extracted text is suspiciously short

public class HtmlCleaner
{
private readonly HtmlParser _parser = new();
// Known noise - remove these entirely
private static readonly string[] NoiseElements =
{ "script", "style", "nav", "footer", "aside", "iframe", "noscript" };
// Boilerplate patterns - remove by role, not aggressive wildcards
private static readonly string[] NoiseSelectors =
{
"[role='navigation']", "[role='banner']", "[role='complementary']",
"[class*='cookie']", "[class*='newsletter']", "[aria-hidden='true']"
};
// Where to find content - order matters (most specific first)
private static readonly string[] ContentSelectors =
{ "main", "article", "[role='main']", ".content", ".post-content" };
public CleanResult Clean(string html)
{
var doc = _parser.ParseDocument(html);
// Remove noise
foreach (var tag in NoiseElements)
foreach (var el in doc.QuerySelectorAll(tag).ToList())
el.Remove();
foreach (var selector in NoiseSelectors)
foreach (var el in doc.QuerySelectorAll(selector).ToList())
el.Remove();
// Find main content
IElement? main = null;
string? matchedSelector = null;
foreach (var selector in ContentSelectors)
{
main = doc.QuerySelector(selector);
if (main != null) { matchedSelector = selector; break; }
}
// Fallback to body if main content is suspiciously short
var text = main?.TextContent ?? "";
if (text.Length < 500 && doc.Body != null)
{
main = doc.Body;
matchedSelector = "body (fallback)";
text = main.TextContent;
}
return new CleanResult
{
Text = NormalizeWhitespace(text),
MatchedSelector = matchedSelector ?? "none",
OriginalLength = html.Length
};
}
private string NormalizeWhitespace(string text)
{
text = Regex.Replace(text, @"[ \t]+", " ");
text = Regex.Replace(text, @"\n\s*\n+", "\n\n");
return text.Trim();
}
}
public record CleanResult(string Text, string MatchedSelector, int OriginalLength);
Readability extraction (scoring paragraphs, text density) is a whole rabbit hole. For now, selector-based extraction works for documentation and blogs. The sample project includes a scoring extractor if you need it.
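If you do go down that rabbit hole, the core idea is text density: score candidate containers by how much prose they hold relative to link text, then keep the best. A rough sketch of the heuristic (the weighting is arbitrary; real readability algorithms do much more):

// Naive text-density score: long text is good, link-heavy containers are suspect
private static double ScoreCandidate(IElement el)
{
    var textLength = el.TextContent.Length;
    if (textLength == 0) return 0;
    var linkLength = el.QuerySelectorAll("a").Sum(a => a.TextContent.Length);
    var linkDensity = (double)linkLength / textLength;
    // Navs, footers and tag clouds are mostly links - penalize them
    return textLength * (1 - linkDensity);
}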
You have 6KB of clean text. Why not send it all? Because your token budget is bounded (2-4K tokens per request, per the constraints above), and padding the context with irrelevant text both burns that budget and dilutes the model's attention on what matters.
Chunking strategy matters more than you'd expect. I cover this in depth in the RAG architecture article - the same principles apply whether you're chunking web pages or documents.
This is a starting point, not production code:
public List<string> ChunkBySentence(string text, int maxTokens = 2000)
{
var chunks = new List<string>();
// WARNING: This breaks on abbreviations, decimals, URLs, code samples
var sentences = text.Split(new[] { ". ", ".\n", "! ", "? " },
StringSplitOptions.RemoveEmptyEntries);
var current = new StringBuilder();
var tokens = 0;
foreach (var sentence in sentences)
{
var sentenceTokens = EstimateTokens(sentence);
if (tokens + sentenceTokens > maxTokens && current.Length > 0)
{
chunks.Add(current.ToString().Trim());
current.Clear();
tokens = 0;
}
current.Append(sentence).Append(". ");
tokens += sentenceTokens;
}
if (current.Length > 0)
chunks.Add(current.ToString().Trim());
return chunks;
}
// Rough estimate - OK for demos, not for billing
private int EstimateTokens(string text)
=> (int)(text.Split(' ').Length * 1.3);
Why this is naive:

- Splitting on ". " breaks on abbreviations ("Dr. Smith"), version numbers ("v1.0"), URLs, and code samples
- Token counts are rough word-count estimates, not real tokenizer output

For documentation, chunk by section:
public List<ContentChunk> ChunkByHeadings(string html)
{
var doc = new HtmlParser().ParseDocument(html);
var chunks = new List<ContentChunk>();
var headings = doc.QuerySelectorAll("h1, h2, h3");
foreach (var heading in headings)
{
var content = new StringBuilder();
content.AppendLine(heading.TextContent);
var sibling = heading.NextElementSibling;
// Stop at the next heading; StartsWith("H") would also stop at HR/HEADER
while (sibling != null && !Regex.IsMatch(sibling.TagName, "^H[1-6]$"))
{
content.AppendLine(sibling.TextContent);
sibling = sibling.NextElementSibling;
}
chunks.Add(new ContentChunk
{
Heading = heading.TextContent.Trim(),
Content = content.ToString().Trim(),
HeadingLevel = int.Parse(heading.TagName[1..])
});
}
return chunks;
}
This preserves document structure and makes selection more meaningful.
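For reference, ContentChunk is just a data carrier. A minimal definition consistent with how it's used in this article:

public record ContentChunk
{
    public string? Heading { get; init; }
    public string Content { get; init; } = "";
    public int HeadingLevel { get; init; }
}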
This is where most failures happen. And where debugging should start.
You have 5 chunks. The user asked "What performance improvements are in .NET 10?" Only 1-2 chunks mention performance. Send those.
flowchart TB
Q["Question: What perf improvements?"] --> Score[Score Each Chunk]
subgraph Chunks
C1["Chunk 1: Overview..."]
C2["Chunk 2: Runtime perf..."]
C3["Chunk 3: Libraries..."]
C4["Chunk 4: SDK changes..."]
end
Score --> C1
Score --> C2
Score --> C3
Score --> C4
C2 -->|"score: 3"| Top[Selected]
C3 -->|"score: 1"| Top
style C2 stroke:#27ae60,stroke-width:3px
style C1 stroke:#95a5a6,stroke-width:2px,stroke-dasharray: 5 5
style C4 stroke:#95a5a6,stroke-width:2px,stroke-dasharray: 5 5
public record ScoredChunk(string Content, string? Heading, int Score, List<string> MatchedKeywords);
public List<ScoredChunk> SelectByKeywords(
List<ContentChunk> chunks,
string question,
int topK = 3)
{
// Strip punctuation first, then drop short words and stopwords
var keywords = question.ToLower()
.Split(' ', StringSplitOptions.RemoveEmptyEntries)
.Select(w => w.Trim(',', '.', '?', '!'))
.Where(w => w.Length > 3)
.Where(w => !Stopwords.Contains(w))
.Distinct()
.ToList();
var scored = chunks.Select(chunk =>
{
var text = (chunk.Heading + " " + chunk.Content).ToLower();
var matched = keywords.Where(kw => text.Contains(kw)).ToList();
// Boost if keyword appears in heading
var headingBoost = chunk.Heading != null &&
keywords.Any(kw => chunk.Heading.ToLower().Contains(kw)) ? 2 : 0;
return new ScoredChunk(
chunk.Content,
chunk.Heading,
matched.Count + headingBoost,
matched
);
})
.OrderByDescending(x => x.Score)
.Take(topK)
.ToList();
// LOG THIS - it's your debugging lifeline
foreach (var s in scored)
Console.WriteLine($" [{s.Score}] {s.Heading ?? "(no heading)"}: {string.Join(", ", s.MatchedKeywords)}");
return scored;
}
private static readonly HashSet<string> Stopwords = new()
{ "what", "how", "does", "the", "are", "is", "in", "for", "of", "to", "and" };
Key improvements over naive counting:

- Stopword filtering, so words like "what" and "does" don't inflate scores
- Punctuation stripped from keywords before matching
- A heading boost when a keyword appears in the section title
- Logged scores and matched keywords, so you can see why each chunk won
Keywords fail on synonyms. A question about "performance" won't match a chunk that only talks about "faster startup" or "reduced allocations".
Embeddings find semantic similarity. If you want to go deeper on embeddings and vector search, I cover this extensively in the RAG primer series and semantic search with ONNX.
public async Task<List<ScoredChunk>> SelectByEmbedding(
List<ContentChunk> chunks,
string question,
int topK = 3)
{
var questionEmbed = await EmbedAsync(question);
// Cache these per URL in production
var scored = new List<(ContentChunk Chunk, double Score)>();
foreach (var chunk in chunks)
{
var chunkEmbed = await EmbedAsync(chunk.Content);
var similarity = CosineSimilarity(questionEmbed, chunkEmbed);
scored.Add((chunk, similarity));
}
return scored
.OrderByDescending(x => x.Score)
.Take(topK)
.Select(x => new ScoredChunk(x.Chunk.Content, x.Chunk.Heading, (int)(x.Score * 100), new()))
.ToList();
}
private async Task<double[]> EmbedAsync(string text)
{
var request = new EmbedRequest { Model = "nomic-embed-text", Input = [text] };
var response = await _ollama.EmbedAsync(request);
return response.Embeddings.First().ToArray();
}
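CosineSimilarity is used above but not shown; it's the standard dot product normalized by vector magnitudes:

private static double CosineSimilarity(double[] a, double[] b)
{
    double dot = 0, magA = 0, magB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    // Small epsilon guards against zero-length vectors
    return dot / (Math.Sqrt(magA) * Math.Sqrt(magB) + 1e-10);
}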
Trade-off: keyword selection is free and instant; embedding selection costs one model call per chunk (plus one for the question) but catches semantic matches that keyword overlap misses.
For production, cache embeddings per (URL, chunk hash) in SQLite or a vector database like Qdrant.
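A minimal in-memory version of that cache, keyed by a hash of the chunk text (a sketch - the EmbedCachedAsync helper is hypothetical; swap the dictionary for SQLite or Qdrant when you need persistence; needs System.Security.Cryptography, System.Text and System.Collections.Concurrent):

// Hypothetical cache: identical chunk text reuses its embedding across questions
private readonly ConcurrentDictionary<string, double[]> _embedCache = new();

private async Task<double[]> EmbedCachedAsync(string text)
{
    var key = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));
    if (_embedCache.TryGetValue(key, out var cached))
        return cached;

    var embedding = await EmbedAsync(text);
    _embedCache[key] = embedding;
    return embedding;
}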
Structure the prompt to force source-bounded answers with citation:
public string BuildPrompt(string url, List<ScoredChunk> chunks, string question)
{
var sb = new StringBuilder();
sb.AppendLine("You are answering a question using ONLY the content below.");
sb.AppendLine("Rules:");
sb.AppendLine("- Answer ONLY from the provided sources");
sb.AppendLine("- Cite which SOURCE number supports each claim");
sb.AppendLine("- Include 1-2 brief quotes as evidence");
sb.AppendLine("- If the answer isn't in the sources, say 'Not enough information'");
sb.AppendLine("- End with Confidence: High/Medium/Low");
sb.AppendLine();
for (int i = 0; i < chunks.Count; i++)
{
sb.AppendLine($"=== SOURCE {i + 1} ===");
if (chunks[i].Heading != null)
sb.AppendLine($"Section: {chunks[i].Heading}");
sb.AppendLine($"From: {url}");
sb.AppendLine(chunks[i].Content);
sb.AppendLine();
}
sb.AppendLine($"Question: {question}");
sb.AppendLine();
sb.AppendLine("Answer (with citations and confidence):");
return sb.ToString();
}
This moves from "chatty summary" to "analysis with provenance."
public async Task<string> AskAsync(string prompt)
{
var request = new GenerateRequest { Model = "llama3.2:3b", Prompt = prompt };
var response = new StringBuilder();
await foreach (var chunk in _ollama.GenerateAsync(request))
{
if (chunk?.Response != null)
response.Append(chunk.Response);
}
return response.ToString().Trim();
}
What we've built is an agent pattern without the framework. I expand on why I prefer this approach over LangChain - explicit orchestration beats magic abstractions when debugging matters.
flowchart LR
subgraph Tools["Tools (Deterministic)"]
T1[fetch_url]
T2[clean_html]
T3[chunk_text]
T4[select_relevant]
end
subgraph LLM["LLM (Reasoning)"]
R[Interpret + Answer]
end
T1 --> T2 --> T3 --> T4 --> R
R -->|"Low confidence"| Retry[Retry with different selection]
Retry --> T4
style T4 stroke:#e74c3c,stroke-width:3px
style R stroke:#3498db,stroke-width:3px
The loop: if confidence is low, retry with more chunks or different keywords.
var answer = await AskAsync(prompt);
if (answer.Contains("Not enough information") || answer.Contains("Confidence: Low"))
{
// Retry with more chunks
var moreChunks = SelectByKeywords(allChunks, question, topK: 5);
answer = await AskAsync(BuildPrompt(url, moreChunks, question));
}
flowchart TB
subgraph Failures["Failure Modes"]
F1[Cleaner removes content]
F2[Chunking breaks mid-thought]
F3[Selection picks wrong chunks]
F4[LLM hallucinates connections]
end
F1 --> R1["'Not enough info' - answer existed"]
F2 --> R2["Partial answer - context lost"]
F3 --> R3["Wrong answer - right content skipped"]
F4 --> R4["Confident but wrong"]
style F3 stroke:#e74c3c,stroke-width:3px
The debugging rule:
If the answer is wrong, it's almost always because selection was wrong, not because the model failed.
The failure usually isn't the LLM. It's upstream.
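Which is why it's worth logging what you discarded, not just what you selected. Assuming you keep the full scored list around before applying Take(topK) (the allScored name below is hypothetical), a couple of lines make silent misses visible:

// Missing answers hide in the discard pile - surface it
foreach (var d in allScored.OrderByDescending(x => x.Score).Skip(topK))
    Console.WriteLine($"  DISCARDED [{d.Score}] {d.Heading ?? "(no heading)"}");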
public class WebAnalyzer : IDisposable
{
private readonly WebFetcher _fetcher = new();
private readonly HtmlCleaner _cleaner = new();
private readonly OllamaApiClient _ollama = new(new Uri("http://localhost:11434"));
public async Task<AnalysisResult> AnalyzeAsync(string url, string question)
{
// 1. Fetch
var html = await _fetcher.FetchAsync(url);
// 2. Clean (with observability)
var cleaned = _cleaner.Clean(html);
Console.WriteLine($"Cleaned: {cleaned.OriginalLength} → {cleaned.Text.Length} bytes ({cleaned.MatchedSelector})");
// 3. Chunk
var chunks = ChunkByHeadings(html);
Console.WriteLine($"Chunks: {chunks.Count}");
// 4. Select (with logging)
Console.WriteLine("Selection scores:");
var selected = SelectByKeywords(chunks, question, topK: 3);
// 5. Prompt + LLM
var prompt = BuildPrompt(url, selected, question);
var answer = await AskAsync(prompt);
return new AnalysisResult
{
Answer = answer,
ChunksUsed = selected.Count,
SelectionScores = selected.Select(s => s.Score).ToList()
};
}
public void Dispose() => _fetcher.Dispose();
}
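AnalysisResult is another simple carrier; a minimal definition matching the usage above:

public record AnalysisResult
{
    public string Answer { get; init; } = "";
    public int ChunksUsed { get; init; }
    public List<int> SelectionScores { get; init; } = new();
}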
Usage with our running example:
using var analyzer = new WebAnalyzer();
var result = await analyzer.AnalyzeAsync(
"https://learn.microsoft.com/en-us/dotnet/core/whats-new/dotnet-10/overview",
"What performance improvements are in .NET 10?"
);
Console.WriteLine(result.Answer);
Output includes citations and confidence:
Based on SOURCE 1 and SOURCE 2:
.NET 10 includes several performance improvements:
- JIT improvements including better inlining and method devirtualization (SOURCE 1)
- "Enhanced loop inversion for better optimization" (SOURCE 1)
- NativeAOT enhancements for improved code generation (SOURCE 2)
Confidence: High
| Works Well | Doesn't Work |
|---|---|
| Documentation | JavaScript SPAs |
| Blog posts, articles | Dynamic/interactive content |
| Technical references | Multi-page research |
| Static HTML | Authenticated content |
For JS-heavy sites, you need Playwright for .NET.
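A minimal Playwright fetch looks like the sketch below (assumes `dotnet add package Microsoft.Playwright` and an installed browser); the rendered HTML then feeds the same Clean → Chunk → Select pipeline:

// Render in headless Chromium, then hand the HTML to the existing pipeline
using var playwright = await Playwright.CreateAsync();
await using var browser = await playwright.Chromium.LaunchAsync();
var page = await browser.NewPageAsync();
await page.GotoAsync(url, new() { WaitUntil = WaitUntilState.NetworkIdle });
var html = await page.ContentAsync();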
flowchart LR
subgraph Your["Your Code's Job"]
direction TB
F[Fetch reliably]
C[Clean carefully]
S[Select correctly]
O[Observe everything]
end
subgraph LLM["LLM's Job"]
direction TB
R[Reason over what you gave it]
A[Admit when it doesn't know]
end
Your --> LLM
style S stroke:#e74c3c,stroke-width:3px
style R stroke:#3498db,stroke-width:3px
The LLM can only reason about what you gave it. Selection is your responsibility.
Don't ask the LLM to browse. Ask it to reason.
Complete working implementation: Mostlylucid.LlmWebFetcher
Includes:
- `WebFetcher` - HTTP with proper handling
- `HtmlCleaner` - Noise removal + fallback strategies
- `ContentChunker` - Sentence and heading-based chunking
- `WebContentAnalyzer` - Full pipeline with logging
- `OllamaExtensions` - Helper for streaming responses

cd Mostlylucid.LlmWebFetcher
dotnet run