# Building a "Lawyer GPT" for Your Blog - Part 4: Building the Ingestion Pipeline

<!--category-- AI, LLM, RAG, C#, Markdown, AI-Article, mostlylucid.blogllm -->
<datetime class="hidden">2025-11-12T22:45</datetime>


## Introduction

Welcome to Part 4! We've set up our GPU ([Part 2](/blog/building-a-lawyer-gpt-for-your-blog-part2)), understand embeddings and vector databases ([Part 3](/blog/building-a-lawyer-gpt-for-your-blog-part3)), and have the overall architecture ([Part 1](/blog/building-a-lawyer-gpt-for-your-blog-part1)). Now it's time to build the ingestion pipeline - the system that processes all your blog posts and makes them searchable.

> NOTE: This is part of my experiments with AI (assisted drafting) + my own editing. Same voice, same pragmatism; just faster fingers.

This is where we take hundreds of [markdown](https://www.markdownguide.org/) files and transform them into semantically searchable chunks. It's not as simple as "split by paragraph" - we need intelligent chunking that preserves context and code blocks, and maintains the relationships between sections.

[TOC]

## What We're Building

The ingestion pipeline has several stages:

```mermaid
graph TB
    A[Markdown Files] --> B[File Watcher]
    B --> C[Markdown Parser]
    C --> D[Metadata Extractor]
    D --> E[Content Cleaner]
    E --> F[Intelligent Chunker]
    F --> G[Embedding Generator]
    G --> H[Vector Database]

    I[Change Detection] --> B
    J[Progress Tracking] --> F
    K[Error Handling] --> G

    class C,G processing
    class F chunking
    class H storage

    classDef processing stroke:#333,stroke-width:2px
    classDef chunking stroke:#333,stroke-width:4px
    classDef storage stroke:#333,stroke-width:2px
```

**Pipeline stages**:
1. **File Watcher** - Detects new/changed blog posts
2. **Markdown Parser** - Converts markdown to structured format
3. **Metadata Extractor** - Pulls title, categories, date, etc.
4. **Content Cleaner** - Removes unnecessary elements
5. **Intelligent Chunker** - Splits into semantically meaningful pieces
6. **Embedding Generator** - Creates vector representations
7. **Vector Database** - Stores for semantic search
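
Seen end to end, the stages reduce to a handful of calls. Here's the shape of it as a sketch - each of these services (`MarkdownParserService`, `ChunkingService`, `BatchEmbeddingService`, `QdrantVectorStore`) is built later in this post:

```csharp
// The whole pipeline condensed - all four services are defined below
var post   = parser.ParseMarkdownFile(filePath);    // stages 2-4: parse, extract, clean
var chunks = chunker.ChunkBlogPost(post);           // stage 5: intelligent chunking
await embedder.GenerateEmbeddingsAsync(chunks);     // stage 6: batched embeddings
await vectorStore.UpsertChunksAsync(chunks);        // stage 7: store in Qdrant
```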

## Project Structure

Let's set up the project for the ingestion pipeline:

```bash
mkdir Mostlylucid.BlogLLM
cd Mostlylucid.BlogLLM

# Create solution and projects
dotnet new sln -n Mostlylucid.BlogLLM

# Core library
dotnet new classlib -n Mostlylucid.BlogLLM.Core
dotnet sln add Mostlylucid.BlogLLM.Core

# Ingestion service
dotnet new worker -n Mostlylucid.BlogLLM.Ingestion
dotnet sln add Mostlylucid.BlogLLM.Ingestion

# Add references
cd Mostlylucid.BlogLLM.Ingestion
dotnet add reference ../Mostlylucid.BlogLLM.Core

# Add NuGet packages
dotnet add package Markdig --version 0.33.0
dotnet add package Microsoft.ML.OnnxRuntime.Gpu --version 1.16.3
dotnet add package Microsoft.ML.Tokenizers
dotnet add package Qdrant.Client --version 1.7.0
dotnet add package Microsoft.Extensions.Hosting
dotnet add package Serilog.Extensions.Hosting
dotnet add package Serilog.Sinks.Console
dotnet add package Serilog.Sinks.File
```

## Markdown Parsing with Markdig

We're already using Markdig in the blog, so let's leverage it for parsing here too.

### Blog Post Model

```csharp
namespace Mostlylucid.BlogLLM.Core.Models
{
    public class BlogPost
    {
        public string Slug { get; set; } = string.Empty;
        public string Title { get; set; } = string.Empty;
        public string[] Categories { get; set; } = Array.Empty<string>();
        public DateTime PublishedDate { get; set; }
        public string Language { get; set; } = "en";
        public string MarkdownContent { get; set; } = string.Empty;
        public string PlainTextContent { get; set; } = string.Empty;
        public string HtmlContent { get; set; } = string.Empty;
        public int WordCount { get; set; }
        public string ContentHash { get; set; } = string.Empty;

        // For chunking
        public List<ContentSection> Sections { get; set; } = new();
    }

    public class ContentSection
    {
        public string Heading { get; set; } = string.Empty;
        public int Level { get; set; }  // H1, H2, H3, etc.
        public string Content { get; set; } = string.Empty;
        public List<CodeBlock> CodeBlocks { get; set; } = new();
        public int StartLine { get; set; }
        public int EndLine { get; set; }
    }

    public class CodeBlock
    {
        public string Language { get; set; } = string.Empty;
        public string Code { get; set; } = string.Empty;
        public int LineNumber { get; set; }
    }
}
```

### Markdown Parser Service

```csharp
using Markdig;
using Markdig.Syntax;
using Markdig.Syntax.Inlines;
using Mostlylucid.BlogLLM.Core.Models;
using System.Security.Cryptography;
using System.Text;
using System.Text.RegularExpressions;

namespace Mostlylucid.BlogLLM.Core.Services
{
    public class MarkdownParserService
    {
        private readonly MarkdownPipeline _pipeline;

        public MarkdownParserService()
        {
            _pipeline = new MarkdownPipelineBuilder()
                .UseAdvancedExtensions()
                .Build();
        }

        public BlogPost ParseMarkdownFile(string filePath)
        {
            var markdown = File.ReadAllText(filePath);
            var fileName = Path.GetFileNameWithoutExtension(filePath);

            // Parse the document
            var document = Markdown.Parse(markdown, _pipeline);

            // Extract components
            var post = new BlogPost
            {
                Slug = ExtractSlug(fileName),
                Title = ExtractTitle(document),
                Categories = ExtractCategories(markdown),
                PublishedDate = ExtractPublishedDate(markdown),
                Language = ExtractLanguage(fileName),
                MarkdownContent = markdown,
                PlainTextContent = ConvertToPlainText(document),
                HtmlContent = Markdown.ToHtml(markdown, _pipeline),
                ContentHash = ComputeHash(markdown),
                Sections = ExtractSections(document, markdown)
            };

            post.WordCount = CountWords(post.PlainTextContent);

            return post;
        }

        private string ExtractSlug(string fileName)
        {
            // Handle translated files: "slug.es.md" -> "slug"
            var parts = fileName.Split('.');
            return parts[0];
        }

        private string ExtractTitle(MarkdownDocument document)
        {
            // Find first H1 heading
            var heading = document.Descendants<HeadingBlock>()
                .FirstOrDefault(h => h.Level == 1);

            if (heading != null)
            {
                // Walk the full inline tree; FirstChild alone would truncate
                // titles that contain emphasis or inline code
                var title = ExtractTextFromInline(heading.Inline);
                if (!string.IsNullOrWhiteSpace(title)) return title;
            }

            return "Untitled";
        }

        private string[] ExtractCategories(string markdown)
        {
            // Extract from comment: <!-- category-- Cat1, Cat2 -->
            var match = Regex.Match(markdown,
                @"<!--\s*category--\s*(.+?)\s*-->",
                RegexOptions.IgnoreCase);

            if (match.Success)
            {
                return match.Groups[1].Value
                    .Split(',')
                    .Select(c => c.Trim())
                    .Where(c => !string.IsNullOrWhiteSpace(c))
                    .ToArray();
            }

            return Array.Empty<string>();
        }

        private DateTime ExtractPublishedDate(string markdown)
        {
            // Extract from: <datetime class="hidden">2025-11-09T18:00</datetime>
            var match = Regex.Match(markdown,
                @"<datetime[^>]*>(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})</datetime>",
                RegexOptions.IgnoreCase);

            if (match.Success && DateTime.TryParse(match.Groups[1].Value, out var date))
            {
                return date;
            }

            // No parsable date found - fall back to "now"
            return DateTime.Now;
        }

        private string ExtractLanguage(string fileName)
        {
            // fileName has already had ".md" stripped, so "slug.es.md"
            // arrives here as "slug.es" - the language code is the last part
            var parts = fileName.Split('.');
            if (parts.Length >= 2 && parts[^1].Length == 2)
            {
                return parts[^1];
            }
            return "en";
        }

        private string ConvertToPlainText(MarkdownDocument document)
        {
            var sb = new StringBuilder();

            foreach (var block in document.Descendants())
            {
                // Descendants() already yields the paragraphs nested inside
                // list items, so a separate ListItemBlock branch would emit
                // that text twice
                if (block is ParagraphBlock paragraph)
                {
                    sb.AppendLine(ExtractTextFromInline(paragraph.Inline));
                }
                else if (block is HeadingBlock heading)
                {
                    sb.AppendLine(ExtractTextFromInline(heading.Inline));
                }
            }

            return sb.ToString();
        }

        private string ExtractTextFromInline(ContainerInline? inline)
        {
            if (inline == null) return string.Empty;

            var sb = new StringBuilder();
            foreach (var child in inline)
            {
                if (child is LiteralInline literal)
                {
                    sb.Append(literal.Content.ToString());
                }
                else if (child is CodeInline code)
                {
                    sb.Append(code.Content);
                }
            }
            return sb.ToString();
        }

        private string ExtractTextFromInline(LeafBlock? block)
        {
            if (block == null) return string.Empty;
            if (block.Inline == null) return string.Empty;
            return ExtractTextFromInline(block.Inline);
        }

        private List<ContentSection> ExtractSections(MarkdownDocument document, string markdown)
        {
            var sections = new List<ContentSection>();
            ContentSection? currentSection = null;
            var lines = markdown.Split('\n');

            foreach (var block in document)
            {
                if (block is HeadingBlock heading && heading.Level <= 3)
                {
                    // Start new section
                    if (currentSection != null)
                    {
                        currentSection.EndLine = heading.Line;
                        sections.Add(currentSection);
                    }

                    currentSection = new ContentSection
                    {
                        Heading = ExtractTextFromInline(heading.Inline),
                        Level = heading.Level,
                        StartLine = heading.Line,
                        Content = string.Empty
                    };
                }
                else if (currentSection != null)
                {
                    // Add content to current section
                    if (block is FencedCodeBlock codeBlock)
                    {
                        currentSection.CodeBlocks.Add(new CodeBlock
                        {
                            Language = string.IsNullOrWhiteSpace(codeBlock.Info) ? "text" : codeBlock.Info,
                            Code = ExtractCodeBlockContent(codeBlock, lines),
                            LineNumber = codeBlock.Line
                        });
                    }
                    else if (block is ParagraphBlock paragraph)
                    {
                        currentSection.Content += ExtractTextFromInline(paragraph.Inline) + "\n\n";
                    }
                }
            }

            // Add final section
            if (currentSection != null)
            {
                currentSection.EndLine = lines.Length;
                sections.Add(currentSection);
            }

            return sections;
        }

        private string ExtractCodeBlockContent(FencedCodeBlock codeBlock, string[] lines)
        {
            // codeBlock.Line is zero-based and points at the opening ``` fence
            var startLine = codeBlock.Line + 1;
            return string.Join("\n", lines.Skip(startLine).Take(codeBlock.Lines.Count));
        }

        private int CountWords(string text)
        {
            return text.Split(new[] { ' ', '\n', '\r', '\t' },
                StringSplitOptions.RemoveEmptyEntries).Length;
        }

        private string ComputeHash(string content)
        {
            using var sha256 = SHA256.Create();
            var bytes = sha256.ComputeHash(Encoding.UTF8.GetBytes(content));
            return Convert.ToBase64String(bytes);
        }
    }
}
```

**Key features**:
- Extracts metadata from markdown comments and HTML tags
- Preserves code blocks with language info
- Splits content into sections by headings
- Computes hash for change detection
- Handles translated files (e.g., `slug.es.md`)
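
To sanity-check the parser before wiring up the rest of the pipeline, a quick console snippet works well (the path is a placeholder for your markdown directory):

```csharp
// Parse one post and dump its structure (path is hypothetical)
var parser = new MarkdownParserService();
var post = parser.ParseMarkdownFile(@"C:\path\to\Markdown\intro.md");

Console.WriteLine($"{post.Title} ({post.PublishedDate:yyyy-MM-dd})");
Console.WriteLine($"Categories: {string.Join(", ", post.Categories)}");
Console.WriteLine($"{post.Sections.Count} sections, {post.WordCount} words");

foreach (var section in post.Sections)
{
    Console.WriteLine($"  H{section.Level} {section.Heading} ({section.CodeBlocks.Count} code blocks)");
}
```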

### Markdown Fetch Preprocessing

Before parsing, we use the **Markdown Fetch Extension** to process any `<fetch>` tags in the markdown. This allows blog posts to include remote content dynamically!

For example, if your markdown contains:
```markdown
<fetch markdownurl="https://example.com/api-docs.md" pollfrequency="24h" />
```

The preprocessor will:
1. Fetch the remote markdown from the URL
2. Cache it in memory (for the duration of ingestion)
3. Replace the tag with the actual content
4. Then parse everything together

**Updated Parser with Fetch Support**:

```csharp
using Mostlylucid.Markdig.FetchExtension.Processors;
using Microsoft.Extensions.Logging;

public class MarkdownParserService
{
    private readonly MarkdownPipeline _pipeline;
    private readonly IServiceProvider? _serviceProvider;
    private readonly ILogger<MarkdownParserService>? _logger;

    public MarkdownParserService(
        IServiceProvider? serviceProvider = null,
        ILogger<MarkdownParserService>? logger = null)
    {
        _serviceProvider = serviceProvider;
        _logger = logger;
        _pipeline = new MarkdownPipelineBuilder()
            .UseAdvancedExtensions()
            .Build();
    }

    public BlogPost ParseMarkdownFile(string filePath)
    {
        var markdown = File.ReadAllText(filePath);
        var fileName = Path.GetFileNameWithoutExtension(filePath);

        // Preprocess markdown to fetch remote content if there are <fetch> tags
        if (_serviceProvider != null)
        {
            var preprocessor = new MarkdownFetchPreprocessor(_serviceProvider, _logger);
            markdown = preprocessor.Preprocess(markdown);
            _logger?.LogInformation("Preprocessed {File} for fetch tags", filePath);
        }

        // Now parse the (potentially expanded) markdown
        var document = Markdown.Parse(markdown, _pipeline);
        // ... rest of parsing logic
    }
}
```

**Simple Fetch Service for Ingestion**:

Since the ingestion tool doesn't need database persistence, we use a simple in-memory implementation:

```csharp
using System.Collections.Concurrent;
using Mostlylucid.Markdig.FetchExtension.Services;
using Mostlylucid.Markdig.FetchExtension.Models;

public class SimpleMarkdownFetchService : IMarkdownFetchService
{
    private readonly IHttpClientFactory _httpClientFactory;
    private readonly ConcurrentDictionary<string, (string content, DateTimeOffset fetchedAt)> _cache = new();

    public SimpleMarkdownFetchService(IHttpClientFactory httpClientFactory)
    {
        _httpClientFactory = httpClientFactory;
    }

    public async Task<MarkdownFetchResult> FetchMarkdownAsync(
        string url, int pollFrequencyHours, int blogPostId)
    {
        // Check cache first
        if (_cache.TryGetValue(url, out var cached))
        {
            var age = DateTimeOffset.UtcNow - cached.fetchedAt;
            if (age.TotalHours < pollFrequencyHours)
            {
                return new MarkdownFetchResult { Success = true, Content = cached.content };
            }
        }

        // Fetch fresh content
        var client = _httpClientFactory.CreateClient();
        client.Timeout = TimeSpan.FromSeconds(30);
        var response = await client.GetAsync(url);

        if (!response.IsSuccessStatusCode)
        {
            // Return cached content if available
            if (_cache.TryGetValue(url, out var staleCached))
            {
                return new MarkdownFetchResult { Success = true, Content = staleCached.content };
            }
            return new MarkdownFetchResult
            {
                Success = false,
                ErrorMessage = $"HTTP {(int)response.StatusCode}: {response.ReasonPhrase}"
            };
        }

        var content = await response.Content.ReadAsStringAsync();
        _cache[url] = (content, DateTimeOffset.UtcNow);

        return new MarkdownFetchResult { Success = true, Content = content };
    }

    public Task<bool> RemoveCachedMarkdownAsync(string url, int blogPostId = 0)
    {
        var removed = _cache.TryRemove(url, out _);
        return Task.FromResult(removed);
    }
}
```

**Benefits**:
- Blog posts can reference external API documentation, shared content, or other markdown files
- Content is fetched once per ingestion run and cached
- Falls back to cached content if remote fetch fails
- Embeddings include both local and remote content

## Intelligent Chunking Strategy

This is critical! Bad chunking = bad search results.

### Chunking Principles

```mermaid
graph TD
    A[Content Section] --> B{Size Check}
    B -->|< 512 tokens| C[Keep as Single Chunk]
    B -->|> 512 tokens| D[Smart Split]

    D --> E{Has Sub-sections?}
    E -->|Yes| F[Split by H3/H4]
    E -->|No| G[Split by Paragraph]

    F --> H[Chunk with Context]
    G --> H

    H --> I[Add Metadata]
    I --> J[Add Code Blocks]
    J --> K[Final Chunk]

    class A input
    class D,H process
    class K output

    classDef input stroke:#333
    classDef process stroke:#333,stroke-width:2px
    classDef output stroke:#333,stroke-width:2px
```

**Goals**:
1. **Semantic Coherence** - Each chunk represents a complete thought
2. **Size Balance** - roughly 100-512 tokens (the constants we use below; small enough for precise search, large enough to carry context)
3. **Context Preservation** - Include heading hierarchy
4. **Code Block Handling** - Keep code with explanatory text
5. **Overlap** - Slight overlap between chunks for continuity
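
To make the overlap concrete: with `MaxChunkTokens = 512` and `OverlapTokens = 50` (the constants used below), a ~1,300-token section ends up as roughly three chunks, with the last sentence or two of each chunk repeated at the start of the next - so a thought that straddles a boundary is still retrievable from either side.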

### Chunk Model

```csharp
namespace Mostlylucid.BlogLLM.Core.Models
{
    public class ContentChunk
    {
        public string ChunkId { get; set; } = Guid.NewGuid().ToString();
        public string BlogPostSlug { get; set; } = string.Empty;
        public string BlogPostTitle { get; set; } = string.Empty;
        public int ChunkIndex { get; set; }

        // Content
        public string Text { get; set; } = string.Empty;
        public string[] Headings { get; set; } = Array.Empty<string>();  // H1 > H2 > H3
        public string SectionHeading { get; set; } = string.Empty;
        public List<CodeBlock> CodeBlocks { get; set; } = new();

        // Metadata
        public string[] Categories { get; set; } = Array.Empty<string>();
        public DateTime PublishedDate { get; set; }
        public string Language { get; set; } = "en";
        public int TokenCount { get; set; }
        public int StartLine { get; set; }
        public int EndLine { get; set; }

        // For vector search
        public float[]? Embedding { get; set; }
    }
}
```
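
For orientation, here's what a populated chunk from a typical post might look like (the values are illustrative, loosely based on the search results later in this post):

```csharp
// An illustrative chunk - values are made up for orientation
var example = new ContentChunk
{
    BlogPostSlug = "dockercomposedevdeps",
    BlogPostTitle = "Using Docker Compose for Development Dependencies",
    ChunkIndex = 3,
    Text = "## Setting Up Docker Compose\n\nFirst, create a docker-compose.yml file in...",
    Headings = new[] { "Using Docker Compose for Development Dependencies", "Setting Up Docker Compose" },
    SectionHeading = "Setting Up Docker Compose",
    Categories = new[] { "Docker", "ASP.NET Core" },
    Language = "en",
    TokenCount = 214
};
```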

### Chunking Service

```csharp
using Microsoft.ML.Tokenizers;
using Mostlylucid.BlogLLM.Core.Models;
using System.Text;

namespace Mostlylucid.BlogLLM.Core.Services
{
    public class ChunkingService
    {
        private readonly Tokenizer _tokenizer;
        private const int MaxChunkTokens = 512;
        private const int MinChunkTokens = 100;
        private const int OverlapTokens = 50;

        public ChunkingService(string tokenizerPath)
        {
            // bge-base is BERT-based; tokenizerPath should point at the
            // WordPiece vocab (vocab.txt) that ships with the model
            _tokenizer = BertTokenizer.Create(tokenizerPath);
        }

        public List<ContentChunk> ChunkBlogPost(BlogPost post)
        {
            var chunks = new List<ContentChunk>();
            int chunkIndex = 0;

            // Build heading hierarchy
            var headingStack = new Stack<string>();
            headingStack.Push(post.Title);

            foreach (var section in post.Sections)
            {
                // Update heading stack
                UpdateHeadingStack(headingStack, section.Level, section.Heading);

                // Get section content + code blocks
                var sectionText = BuildSectionText(section);
                var tokenCount = CountTokens(sectionText);

                if (tokenCount <= MaxChunkTokens)
                {
                    // Section fits in one chunk
                    chunks.Add(CreateChunk(post, section, sectionText, headingStack, chunkIndex++));
                }
                else
                {
                    // Need to split section
                    var subChunks = SplitSection(post, section, headingStack, ref chunkIndex);
                    chunks.AddRange(subChunks);
                }
            }

            return chunks;
        }

        private void UpdateHeadingStack(Stack<string> stack, int level, string heading)
        {
            // Pop headings at the same or deeper level, always keeping the
            // post title at the root of the stack
            while (stack.Count >= level && stack.Count > 1)
            {
                stack.Pop();
            }
            stack.Push(heading);
        }

        private string BuildSectionText(ContentSection section)
        {
            var sb = new StringBuilder();

            // Add section heading
            sb.AppendLine($"## {section.Heading}");
            sb.AppendLine();

            // Add content
            sb.AppendLine(section.Content);

            // Add code blocks with context
            foreach (var code in section.CodeBlocks)
            {
                sb.AppendLine($"```{code.Language}");
                sb.AppendLine(code.Code);
                sb.AppendLine("```");
                sb.AppendLine();
            }

            return sb.ToString();
        }

        private ContentChunk CreateChunk(
            BlogPost post,
            ContentSection section,
            string text,
            Stack<string> headingStack,
            int chunkIndex)
        {
            return new ContentChunk
            {
                BlogPostSlug = post.Slug,
                BlogPostTitle = post.Title,
                ChunkIndex = chunkIndex,
                Text = text.Trim(),
                Headings = headingStack.Reverse().ToArray(),
                SectionHeading = section.Heading,
                CodeBlocks = section.CodeBlocks,
                Categories = post.Categories,
                PublishedDate = post.PublishedDate,
                Language = post.Language,
                TokenCount = CountTokens(text),
                StartLine = section.StartLine,
                EndLine = section.EndLine
            };
        }

        private List<ContentChunk> SplitSection(
            BlogPost post,
            ContentSection section,
            Stack<string> headingStack,
            ref int chunkIndex)
        {
            var chunks = new List<ContentChunk>();

            // Split by paragraphs
            var paragraphs = section.Content.Split("\n\n", StringSplitOptions.RemoveEmptyEntries);

            var currentText = new StringBuilder();
            var currentTokens = 0;
            var previousText = string.Empty;  // For overlap

            foreach (var paragraph in paragraphs)
            {
                var paragraphTokens = CountTokens(paragraph);

                if (currentTokens + paragraphTokens > MaxChunkTokens && currentTokens > MinChunkTokens)
                {
                    // Create chunk
                    var chunkText = $"## {section.Heading}\n\n" + currentText.ToString();

                    chunks.Add(new ContentChunk
                    {
                        BlogPostSlug = post.Slug,
                        BlogPostTitle = post.Title,
                        ChunkIndex = chunkIndex++,
                        Text = chunkText.Trim(),
                        Headings = headingStack.Reverse().ToArray(),
                        SectionHeading = section.Heading,
                        Categories = post.Categories,
                        PublishedDate = post.PublishedDate,
                        Language = post.Language,
                        TokenCount = currentTokens,
                        StartLine = section.StartLine,
                        EndLine = section.EndLine
                    });

                    // Start new chunk with overlap
                    previousText = GetLastSentences(currentText.ToString(), OverlapTokens);
                    currentText.Clear();
                    currentText.AppendLine(previousText);
                    currentTokens = CountTokens(previousText);
                }

                currentText.AppendLine(paragraph);
                currentText.AppendLine();
                currentTokens += paragraphTokens;
            }

            // Add remaining content
            if (currentTokens > 0)
            {
                var chunkText = $"## {section.Heading}\n\n" + currentText.ToString();

                chunks.Add(new ContentChunk
                {
                    BlogPostSlug = post.Slug,
                    BlogPostTitle = post.Title,
                    ChunkIndex = chunkIndex++,
                    Text = chunkText.Trim(),
                    Headings = headingStack.Reverse().ToArray(),
                    SectionHeading = section.Heading,
                    CodeBlocks = section.CodeBlocks,  // Add code blocks to last chunk
                    Categories = post.Categories,
                    PublishedDate = post.PublishedDate,
                    Language = post.Language,
                    TokenCount = currentTokens,
                    StartLine = section.StartLine,
                    EndLine = section.EndLine
                });
            }

            return chunks;
        }

        private string GetLastSentences(string text, int maxTokens)
        {
            // Get last few sentences for overlap
            var sentences = text.Split('.').Reverse().ToArray();
            var overlap = new StringBuilder();
            int tokens = 0;

            foreach (var sentence in sentences)
            {
                var sentenceTokens = CountTokens(sentence);
                if (tokens + sentenceTokens > maxTokens) break;

                overlap.Insert(0, sentence + ".");
                tokens += sentenceTokens;
            }

            return overlap.ToString().Trim();
        }

        private int CountTokens(string text)
        {
            // CountTokens avoids materializing the full encoding
            return _tokenizer.CountTokens(text);
        }
    }
}
```

**Chunking strategy**:
1. **Section-based** - Use document structure (H2, H3 headings)
2. **Token-aware** - Never exceed max token limit
3. **Overlap** - Last few sentences repeat in next chunk
4. **Code preservation** - Code blocks kept with context
5. **Metadata-rich** - Each chunk knows its origin
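
Before bolting on embeddings, it's worth running the chunker on its own and eyeballing the output (paths are placeholders matching the appsettings later in this post):

```csharp
// Smoke test: chunk one parsed post and print token stats
var parser  = new MarkdownParserService();
var chunker = new ChunkingService(@"C:\models\bge-base-en-onnx\vocab.txt");

var post   = parser.ParseMarkdownFile(@"C:\path\to\Markdown\intro.md");
var chunks = chunker.ChunkBlogPost(post);

foreach (var chunk in chunks)
{
    Console.WriteLine($"[{chunk.ChunkIndex}] {chunk.SectionHeading}: {chunk.TokenCount} tokens");
}
```

If most chunks sit near `MinChunkTokens`, your sections are probably too granular; if they hug `MaxChunkTokens`, the paragraph splitter is doing most of the work.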

## Batch Embedding Generation

Now let's generate embeddings for all chunks efficiently.

### Embedding Service with Batching

```csharp
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using Microsoft.ML.Tokenizers;
using Mostlylucid.BlogLLM.Core.Models;

namespace Mostlylucid.BlogLLM.Core.Services
{
    public class BatchEmbeddingService : IDisposable
    {
        private readonly InferenceSession _session;
        private readonly Tokenizer _tokenizer;
        private readonly int _embeddingDimension;
        private const int MaxBatchSize = 32;

        public BatchEmbeddingService(string modelPath, string tokenizerPath, bool useGpu = true)
        {
            var options = new SessionOptions();
            if (useGpu)
            {
                options.AppendExecutionProvider_CUDA(0);
            }

            _session = new InferenceSession(modelPath, options);
            // Same WordPiece vocab (vocab.txt) as the chunking service
            _tokenizer = BertTokenizer.Create(tokenizerPath);
            _embeddingDimension = 768;  // bge-base dimension
        }

        public async Task<List<ContentChunk>> GenerateEmbeddingsAsync(
            List<ContentChunk> chunks,
            IProgress<int>? progress = null,
            CancellationToken cancellationToken = default)
        {
            var totalChunks = chunks.Count;
            var processedChunks = 0;

            for (int i = 0; i < totalChunks; i += MaxBatchSize)
            {
                cancellationToken.ThrowIfCancellationRequested();

                var batch = chunks.Skip(i).Take(MaxBatchSize).ToList();
                var embeddings = await GenerateBatchEmbeddingsAsync(batch, cancellationToken);

                for (int j = 0; j < batch.Count; j++)
                {
                    batch[j].Embedding = embeddings[j];
                }

                processedChunks += batch.Count;
                progress?.Report(processedChunks);
            }

            return chunks;
        }

        private async Task<List<float[]>> GenerateBatchEmbeddingsAsync(
            List<ContentChunk> chunks,
            CancellationToken cancellationToken)
        {
            // Inference is synchronous and compute-bound (GPU or CPU), so
            // offload it to the thread pool to keep the host responsive
            return await Task.Run(() => GenerateBatchEmbeddings(chunks), cancellationToken);
        }

        private List<float[]> GenerateBatchEmbeddings(List<ContentChunk> chunks)
        {
            var embeddings = new List<float[]>();

            foreach (var chunk in chunks)
            {
                var embedding = GenerateEmbedding(chunk.Text);
                embeddings.Add(embedding);
            }

            return embeddings;
        }

        public float[] GenerateEmbedding(string text)
        {
            // Tokenize (EncodeToIds adds the BERT special tokens by default)
            var inputIds = _tokenizer.EncodeToIds(text)
                .Select(id => (long)id)
                .ToArray();

            // Pad/truncate to the model's 512-token window
            const int maxLength = 512;
            var paddedInputIds = new long[maxLength];
            var paddedAttentionMask = new long[maxLength];

            int length = Math.Min(inputIds.Length, maxLength);
            Array.Copy(inputIds, paddedInputIds, length);
            Array.Fill(paddedAttentionMask, 1L, 0, length);

            // Create tensors
            var inputIdsTensor = new DenseTensor<long>(paddedInputIds, new[] { 1, maxLength });
            var attentionMaskTensor = new DenseTensor<long>(paddedAttentionMask, new[] { 1, maxLength });

            var inputs = new List<NamedOnnxValue>
            {
                NamedOnnxValue.CreateFromTensor("input_ids", inputIdsTensor),
                NamedOnnxValue.CreateFromTensor("attention_mask", attentionMaskTensor)
            };

            // Run inference
            using var results = _session.Run(inputs);
            var outputTensor = results.First().AsTensor<float>();

            // Mean pooling
            return MeanPooling(outputTensor, paddedAttentionMask);
        }

        private float[] MeanPooling(Tensor<float> outputTensor, long[] attentionMask)
        {
            int seqLength = outputTensor.Dimensions[1];
            int embeddingDim = outputTensor.Dimensions[2];

            var embedding = new float[embeddingDim];
            int tokenCount = 0;

            for (int seq = 0; seq < seqLength; seq++)
            {
                if (attentionMask[seq] == 0) continue;

                tokenCount++;
                for (int dim = 0; dim < embeddingDim; dim++)
                {
                    embedding[dim] += outputTensor[0, seq, dim];
                }
            }

            // Average and normalize
            for (int dim = 0; dim < embeddingDim; dim++)
            {
                embedding[dim] /= tokenCount;
            }

            return Normalize(embedding);
        }

        private float[] Normalize(float[] vector)
        {
            float magnitude = 0;
            foreach (var val in vector)
            {
                magnitude += val * val;
            }
            magnitude = MathF.Sqrt(magnitude);

            var normalized = new float[vector.Length];
            for (int i = 0; i < vector.Length; i++)
            {
                normalized[i] = vector[i] / magnitude;
            }

            return normalized;
        }

        public void Dispose()
        {
            _session?.Dispose();
        }
    }
}
```
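
One nice property of the service above: because `Normalize` L2-normalizes every vector, cosine similarity collapses to a plain dot product, which makes local sanity checks trivial:

```csharp
// For L2-normalized vectors, cosine similarity is just the dot product
static float CosineSimilarity(float[] a, float[] b)
{
    float dot = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
    }
    return dot;
}

// e.g. two phrasings of the same question should score high
var e1 = embedder.GenerateEmbedding("How do I configure Docker Compose?");
var e2 = embedder.GenerateEmbedding("Setting up docker-compose for local services");
Console.WriteLine($"Similarity: {CosineSimilarity(e1, e2):F4}");
```

This is also why the Qdrant collection below uses `Distance.Cosine` - it matches how the vectors were produced.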

## Qdrant Integration

Now store the embeddings in Qdrant for semantic search.

### Qdrant Service

```csharp
using Qdrant.Client;
using Qdrant.Client.Grpc;
using Mostlylucid.BlogLLM.Core.Models;
using System.Security.Cryptography;
using System.Text;

namespace Mostlylucid.BlogLLM.Core.Services
{
    public class QdrantVectorStore
    {
        private readonly QdrantClient _client;
        private const string CollectionName = "blog_embeddings";

        public QdrantVectorStore(string host = "localhost", int port = 6334)
        {
            _client = new QdrantClient(host, port);
        }

        public async Task InitializeCollectionAsync()
        {
            // ListCollectionsAsync returns the collection names as strings
            var collections = await _client.ListCollectionsAsync();

            if (collections.Contains(CollectionName))
            {
                Console.WriteLine($"Collection '{CollectionName}' already exists");
                return;
            }

            await _client.CreateCollectionAsync(
                collectionName: CollectionName,
                vectorsConfig: new VectorParams
                {
                    Size = 768,  // bge-base embedding dimension
                    Distance = Distance.Cosine
                }
            );

            Console.WriteLine($"Created collection '{CollectionName}'");
        }

        public async Task UpsertChunksAsync(
            List<ContentChunk> chunks,
            IProgress<int>? progress = null)
        {
            const int batchSize = 100;
            var totalChunks = chunks.Count;
            var processedChunks = 0;

            for (int i = 0; i < totalChunks; i += batchSize)
            {
                var batch = chunks.Skip(i).Take(batchSize).ToList();
                await UpsertBatchAsync(batch);

                processedChunks += batch.Count;
                progress?.Report(processedChunks);
            }
        }

        private async Task UpsertBatchAsync(List<ContentChunk> chunks)
        {
            var points = chunks.Select(chunk => new PointStruct
            {
                Id = GeneratePointId(chunk),
                Vectors = chunk.Embedding!,
                Payload =
                {
                    ["chunk_id"] = chunk.ChunkId,
                    ["blog_post_slug"] = chunk.BlogPostSlug,
                    ["blog_post_title"] = chunk.BlogPostTitle,
                    ["chunk_index"] = chunk.ChunkIndex,
                    ["text"] = chunk.Text,
                    ["section_heading"] = chunk.SectionHeading,
                    ["headings"] = string.Join(" > ", chunk.Headings),
                    ["categories"] = string.Join(", ", chunk.Categories),
                    ["published_date"] = chunk.PublishedDate.ToString("yyyy-MM-dd"),
                    ["language"] = chunk.Language,
                    ["token_count"] = chunk.TokenCount,
                    ["has_code"] = chunk.CodeBlocks.Any()
                }
            }).ToList();

            await _client.UpsertAsync(CollectionName, points);
        }

        private ulong GeneratePointId(ContentChunk chunk)
        {
            // string.GetHashCode() is randomized per process in .NET, so it is
            // NOT deterministic across runs. Use a stable hash instead: the
            // first 8 bytes of SHA-256 over slug + chunk index
            var idString = $"{chunk.BlogPostSlug}_{chunk.ChunkIndex}";
            var hash = SHA256.HashData(Encoding.UTF8.GetBytes(idString));
            return BitConverter.ToUInt64(hash, 0);
        }

        public async Task DeletePostChunksAsync(string slug)
        {
            // Delete all chunks for a blog post
            await _client.DeleteAsync(
                collectionName: CollectionName,
                filter: new Filter
                {
                    Must =
                    {
                        new Condition
                        {
                            Field = new FieldCondition
                            {
                                Key = "blog_post_slug",
                                Match = new Match { Keyword = slug }
                            }
                        }
                    }
                }
            );
        }

        public async Task<List<SearchResult>> SearchAsync(
            float[] queryEmbedding,
            int limit = 10,
            string? languageFilter = null,
            string[]? categoryFilter = null)
        {
            var filter = BuildFilter(languageFilter, categoryFilter);

            var searchResult = await _client.SearchAsync(
                collectionName: CollectionName,
                vector: queryEmbedding,
                filter: filter,
                limit: (ulong)limit,
                scoreThreshold: 0.7f
            );

            return searchResult.Select(r => new SearchResult
            {
                ChunkId = r.Payload["chunk_id"].StringValue,
                BlogPostSlug = r.Payload["blog_post_slug"].StringValue,
                BlogPostTitle = r.Payload["blog_post_title"].StringValue,
                ChunkIndex = (int)r.Payload["chunk_index"].IntegerValue,
                Text = r.Payload["text"].StringValue,
                SectionHeading = r.Payload["section_heading"].StringValue,
                Score = r.Score
            }).ToList();
        }

        private Filter? BuildFilter(string? language, string[]? categories)
        {
            var conditions = new List<Condition>();

            if (!string.IsNullOrEmpty(language))
            {
                conditions.Add(new Condition
                {
                    Field = new FieldCondition
                    {
                        Key = "language",
                        Match = new Match { Keyword = language }
                    }
                });
            }

            if (categories?.Length > 0)
            {
                foreach (var category in categories)
                {
                    conditions.Add(new Condition
                    {
                        Field = new FieldCondition
                        {
                            Key = "categories",
                            Match = new Match { Text = category }
                        }
                    });
                }
            }

            if (conditions.Count == 0) return null;

            return new Filter { Must = { conditions } };
        }
    }

    public class SearchResult
    {
        public string ChunkId { get; set; } = string.Empty;
        public string BlogPostSlug { get; set; } = string.Empty;
        public string BlogPostTitle { get; set; } = string.Empty;
        public int ChunkIndex { get; set; }
        public string Text { get; set; } = string.Empty;
        public string SectionHeading { get; set; } = string.Empty;
        public float Score { get; set; }
    }
}
```

## Ingestion Worker Service

Finally, let's create a background service that watches for changes and processes files.

### Configuration

```csharp
namespace Mostlylucid.BlogLLM.Ingestion
{
    public class IngestionConfig
    {
        public string MarkdownDirectory { get; set; } = string.Empty;
        public string EmbeddingModelPath { get; set; } = string.Empty;
        public string TokenizerPath { get; set; } = string.Empty;
        public string QdrantHost { get; set; } = "localhost";
        public int QdrantPort { get; set; } = 6334;
        public bool UseGpu { get; set; } = true;
        public bool WatchForChanges { get; set; } = true;
    }
}
```

### Worker Service

```csharp
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Options;
using Mostlylucid.BlogLLM.Core.Services;
using System.Collections.Concurrent;

namespace Mostlylucid.BlogLLM.Ingestion
{
    public class IngestionWorker : BackgroundService
    {
        private readonly ILogger<IngestionWorker> _logger;
        private readonly IngestionConfig _config;
        private readonly MarkdownParserService _parser;
        private readonly ChunkingService _chunker;
        private readonly BatchEmbeddingService _embedder;
        private readonly QdrantVectorStore _vectorStore;
        private FileSystemWatcher? _watcher;
        private readonly ConcurrentQueue<string> _processingQueue = new();

        public IngestionWorker(
            ILogger<IngestionWorker> logger,
            IOptions<IngestionConfig> config,
            MarkdownParserService parser,
            ChunkingService chunker,
            BatchEmbeddingService embedder,
            QdrantVectorStore vectorStore)
        {
            _logger = logger;
            _config = config.Value;
            _parser = parser;
            _chunker = chunker;
            _embedder = embedder;
            _vectorStore = vectorStore;
        }

        protected override async Task ExecuteAsync(CancellationToken stoppingToken)
        {
            _logger.LogInformation("Ingestion Worker starting...");

            // Initialize Qdrant collection
            await _vectorStore.InitializeCollectionAsync();

            // Process all existing files
            await ProcessAllFilesAsync(stoppingToken);

            // Set up file watcher if enabled
            if (_config.WatchForChanges)
            {
                SetupFileWatcher();
            }

            // Process queue
            while (!stoppingToken.IsCancellationRequested)
            {
                if (_processingQueue.TryDequeue(out var filePath))
                {
                    await ProcessFileAsync(filePath, stoppingToken);
                }
                else
                {
                    await Task.Delay(1000, stoppingToken);
                }
            }
        }

        private async Task ProcessAllFilesAsync(CancellationToken cancellationToken)
        {
            _logger.LogInformation("Processing all markdown files in {Directory}", _config.MarkdownDirectory);

            var files = Directory.GetFiles(_config.MarkdownDirectory, "*.md", SearchOption.TopDirectoryOnly);
            var totalFiles = files.Length;

            _logger.LogInformation("Found {Count} markdown files", totalFiles);

            for (int i = 0; i < totalFiles; i++)
            {
                if (cancellationToken.IsCancellationRequested) break;

                var file = files[i];
                _logger.LogInformation("Processing {Index}/{Total}: {File}", i + 1, totalFiles, Path.GetFileName(file));

                await ProcessFileAsync(file, cancellationToken);
            }

            _logger.LogInformation("Completed processing all files");
        }

        private async Task ProcessFileAsync(string filePath, CancellationToken cancellationToken)
        {
            try
            {
                // Parse markdown
                _logger.LogInformation("Parsing {File}", Path.GetFileName(filePath));
                var post = _parser.ParseMarkdownFile(filePath);

                // Skip non-English posts (or process separately)
                if (post.Language != "en")
                {
                    _logger.LogInformation("Skipping non-English post: {Slug} ({Language})", post.Slug, post.Language);
                    return;
                }

                // Check if already processed (hash-based)
                // TODO: Implement hash checking against database

                // Chunk content
                _logger.LogInformation("Chunking {Slug} into sections", post.Slug);
                var chunks = _chunker.ChunkBlogPost(post);
                _logger.LogInformation("Created {Count} chunks", chunks.Count);

                // Generate embeddings
                _logger.LogInformation("Generating embeddings for {Count} chunks", chunks.Count);
                var progress = new Progress<int>(processed =>
                {
                    _logger.LogInformation("Embedded {Processed}/{Total} chunks", processed, chunks.Count);
                });

                await _embedder.GenerateEmbeddingsAsync(chunks, progress, cancellationToken);

                // Upsert to Qdrant
                _logger.LogInformation("Upserting {Count} chunks to Qdrant", chunks.Count);
                var uploadProgress = new Progress<int>(processed =>
                {
                    _logger.LogInformation("Uploaded {Processed}/{Total} chunks", processed, chunks.Count);
                });

                await _vectorStore.UpsertChunksAsync(chunks, uploadProgress);

                _logger.LogInformation("Successfully processed {Slug}", post.Slug);
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "Error processing file {File}", filePath);
            }
        }

        private void SetupFileWatcher()
        {
            _watcher = new FileSystemWatcher(_config.MarkdownDirectory)
            {
                Filter = "*.md",
                NotifyFilter = NotifyFilters.LastWrite | NotifyFilters.FileName
            };

            _watcher.Changed += OnFileChanged;
            _watcher.Created += OnFileChanged;
            _watcher.Deleted += OnFileDeleted;

            _watcher.EnableRaisingEvents = true;

            _logger.LogInformation("File watcher enabled for {Directory}", _config.MarkdownDirectory);
        }

        private void OnFileChanged(object sender, FileSystemEventArgs e)
        {
            // FileSystemWatcher often raises Changed more than once per save;
            // hash-based change detection (sketched below) keeps the duplicate
            // work cheap
            _logger.LogInformation("File changed: {File}", e.Name);
            _processingQueue.Enqueue(e.FullPath);
        }

        private void OnFileDeleted(object sender, FileSystemEventArgs e)
        {
            _logger.LogInformation("File deleted: {File}", e.Name);

            // Extract slug and delete from Qdrant (fire-and-forget; a failure
            // here only leaves stale chunks until the next full run)
            var slug = Path.GetFileNameWithoutExtension(e.Name ?? string.Empty).Split('.')[0];
            Task.Run(async () => await _vectorStore.DeletePostChunksAsync(slug));
        }

        public override void Dispose()
        {
            _watcher?.Dispose();
            _embedder?.Dispose();
            base.Dispose();
        }
    }
}
```
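
The `TODO` in `ProcessFileAsync` (hash-based change detection) doesn't need a database to be useful. A minimal sketch, assuming a local JSON manifest is good enough for a single-machine ingestion tool:

```csharp
using System.Text.Json;

// Minimal change-detection manifest: slug -> content hash, persisted as JSON.
// A sketch only - swap in a database table if you need real durability.
public class HashManifest
{
    private readonly string _path;
    private readonly Dictionary<string, string> _hashes;

    public HashManifest(string path)
    {
        _path = path;
        _hashes = File.Exists(path)
            ? JsonSerializer.Deserialize<Dictionary<string, string>>(File.ReadAllText(path)) ?? new()
            : new();
    }

    // True if the post is new or its content changed since the last run
    public bool HasChanged(string slug, string contentHash) =>
        !_hashes.TryGetValue(slug, out var existing) || existing != contentHash;

    public void Update(string slug, string contentHash)
    {
        _hashes[slug] = contentHash;
        File.WriteAllText(_path, JsonSerializer.Serialize(_hashes));
    }
}
```

In `ProcessFileAsync` you'd check `HasChanged(post.Slug, post.ContentHash)` right after parsing and bail out early, then call `Update` once the Qdrant upsert succeeds.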

### Program.cs

```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Options;
using Mostlylucid.BlogLLM.Core.Services;
using Mostlylucid.BlogLLM.Ingestion;
using Mostlylucid.Markdig.FetchExtension.Services;
using Serilog;

var builder = Host.CreateApplicationBuilder(args);

// Configure Serilog
Log.Logger = new LoggerConfiguration()
    .WriteTo.Console()
    .WriteTo.File("logs/ingestion-.txt", rollingInterval: RollingInterval.Day)
    .CreateLogger();

builder.Services.AddSerilog();

// Configure settings
builder.Services.Configure<IngestionConfig>(builder.Configuration.GetSection("Ingestion"));

// Register markdown fetch service for preprocessing
builder.Services.AddHttpClient();
builder.Services.AddSingleton<IMarkdownFetchService, SimpleMarkdownFetchService>();

// Register services (MarkdownParserService now gets IServiceProvider injected)
builder.Services.AddSingleton<MarkdownParserService>();
builder.Services.AddSingleton(sp =>
{
    var config = sp.GetRequiredService<IOptions<IngestionConfig>>().Value;
    return new ChunkingService(config.TokenizerPath);
});
builder.Services.AddSingleton(sp =>
{
    var config = sp.GetRequiredService<IOptions<IngestionConfig>>().Value;
    return new BatchEmbeddingService(
        config.EmbeddingModelPath,
        config.TokenizerPath,
        config.UseGpu
    );
});
builder.Services.AddSingleton(sp =>
{
    var config = sp.GetRequiredService<IOptions<IngestionConfig>>().Value;
    return new QdrantVectorStore(config.QdrantHost, config.QdrantPort);
});

// Register worker
builder.Services.AddHostedService<IngestionWorker>();

var host = builder.Build();
host.Run();
```

### appsettings.json

```json
{
  "Ingestion": {
    "MarkdownDirectory": "C:\\path\\to\\mostlylucidweb\\Mostlylucid\\Markdown",
    "EmbeddingModelPath": "C:\\models\\bge-base-en-onnx\\model.onnx",
    "TokenizerPath": "C:\\models\\bge-base-en-onnx\\tokenizer.json",
    "QdrantHost": "localhost",
    "QdrantPort": 6334,
    "UseGpu": true,
    "WatchForChanges": true
  },
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.Hosting.Lifetime": "Information"
    }
  }
}
```

## Running the Ingestion Pipeline

```bash
# Make sure Qdrant is running
docker run -p 6333:6333 -p 6334:6334 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant

# Run the ingestion service
cd Mostlylucid.BlogLLM.Ingestion
dotnet run
```

**Output**:
```
[12:00:00 INF] Ingestion Worker starting...
[12:00:01 INF] Created collection 'blog_embeddings'
[12:00:01 INF] Processing all markdown files in C:\path\to\Markdown
[12:00:01 INF] Found 95 markdown files
[12:00:01 INF] Processing 1/95: intro.md
[12:00:01 INF] Parsing intro.md
[12:00:02 INF] Chunking intro into sections
[12:00:02 INF] Created 5 chunks
[12:00:02 INF] Generating embeddings for 5 chunks
[12:00:03 INF] Embedded 5/5 chunks
[12:00:03 INF] Upserting 5 chunks to Qdrant
[12:00:04 INF] Uploaded 5/5 chunks
[12:00:04 INF] Successfully processed intro
...
```
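
Once the run completes, a quick point count confirms everything landed (a sketch using the Qdrant .NET client's count API - check that your client version exposes `CountAsync`):

```csharp
using Qdrant.Client;

// Sanity check: how many chunk vectors are in the collection?
var client = new QdrantClient("localhost", 6334);
var count = await client.CountAsync("blog_embeddings");
Console.WriteLine($"Collection holds {count} points");
```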

## Testing Semantic Search

Let's test our ingestion pipeline with a search:

```csharp
using Mostlylucid.BlogLLM.Core.Services;

class Program
{
    static async Task Main(string[] args)
    {
        // Setup (dispose the ONNX session when we're done)
        using var embedder = new BatchEmbeddingService(
            "C:\\models\\bge-base-en-onnx\\model.onnx",
            "C:\\models\\bge-base-en-onnx\\vocab.txt",
            useGpu: true
        );

        var vectorStore = new QdrantVectorStore("localhost", 6334);

        // Query
        string query = "How do I set up Docker Compose for development dependencies?";
        Console.WriteLine($"Searching for: {query}\n");

        // Generate query embedding
        var queryEmbedding = embedder.GenerateEmbedding(query);

        // Search
        var results = await vectorStore.SearchAsync(
            queryEmbedding,
            limit: 5,
            languageFilter: "en"
        );

        // Display results
        Console.WriteLine($"Found {results.Count} results:\n");

        foreach (var result in results)
        {
            Console.WriteLine($"Score: {result.Score:F4}");
            Console.WriteLine($"Post: {result.BlogPostTitle}");
            Console.WriteLine($"Section: {result.SectionHeading}");
            Console.WriteLine($"Text: {result.Text.Substring(0, Math.Min(150, result.Text.Length))}...");
            Console.WriteLine($"URL: /blog/{result.BlogPostSlug}#{result.SectionHeading.ToLower().Replace(" ", "-")}");
            Console.WriteLine();
        }
    }
}
```

**Output**:
```
Searching for: How do I set up Docker Compose for development dependencies?

Found 5 results:

Score: 0.8924
Post: Using Docker Compose for Development Dependencies
Section: Setting Up Docker Compose
Text: In this section, I'll show you how to set up Docker Compose for all your development dependencies. First, create a docker-compose.yml file in...
URL: /blog/dockercomposedevdeps#setting-up-docker-compose

Score: 0.8756
Post: Using Docker Compose for Development Dependencies
Section: Running Services
Text: Once you have your docker-compose.yml configured, you can start all services with a single command: docker-compose up -d. This will start...
URL: /blog/dockercomposedevdeps#running-services

Score: 0.8432
Post: Docker Compose for Production
Section: Introduction
Text: Docker Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure...
URL: /blog/dockercompose#introduction

Score: 0.8234
Post: Adding Entity Framework for Blog Posts (Part 1)
Section: Setting up the Database
Text: You can set it up either as a windows service or using Docker as I presented in a previous post on Docker...
URL: /blog/addingentityframeworkforblogpostspt1#setting-up-the-database

Score: 0.8156
Post: Self Hosting Seq
Section: Docker Setup
Text: I use Docker Compose to run all my services. Here's the relevant part of my docker-compose.yml file for Seq...
URL: /blog/selfhostingseq#docker-setup
```

Perfect! Our semantic search is working beautifully!

## Summary

We've built a complete ingestion pipeline:

1. ✅ Markdown parsing with metadata extraction
2. ✅ Intelligent chunking that preserves context
3. ✅ Batch embedding generation with progress tracking
4. ✅ Qdrant vector storage with rich metadata
5. ✅ File watching for automatic updates
6. ✅ Semantic search that finds conceptually related content

## What's Next?

In **[Part 5: The Windows Client](/blog/building-a-lawyer-gpt-for-your-blog-part5)**, we'll build the Windows client application:

- Choosing between [WPF](https://docs.microsoft.com/en-us/dotnet/desktop/wpf/) and [Avalonia](https://avaloniaui.net/)
- Building a split-pane editor UI
- Real-time semantic search as you type
- Suggestion panel showing related content
- Integration with the ingestion pipeline

We'll create a writing assistant that actually helps you write!

## Series Navigation

- [Part 1: Introduction & Architecture](/blog/building-a-lawyer-gpt-for-your-blog-part1)
- [Part 2: GPU Setup & CUDA in C#](/blog/building-a-lawyer-gpt-for-your-blog-part2)
- [Part 3: Understanding Embeddings & Vector Databases](/blog/building-a-lawyer-gpt-for-your-blog-part3)
- **Part 4: Building the Ingestion Pipeline** (this post)
- [Part 5: The Windows Client](/blog/building-a-lawyer-gpt-for-your-blog-part5)
- [Part 6: Local LLM Integration](/blog/building-a-lawyer-gpt-for-your-blog-part6)
- [Part 7: Content Generation & Prompt Engineering](/blog/building-a-lawyer-gpt-for-your-blog-part7)
- [Part 8: Advanced Features & Production Deployment](/blog/building-a-lawyer-gpt-for-your-blog-part8)

## Resources

- [Markdig Documentation](https://github.com/xoofx/markdig)
- [Qdrant .NET Client](https://github.com/qdrant/qdrant-dotnet)
- [ONNX Runtime](https://onnxruntime.ai/)
- [FileSystemWatcher](https://learn.microsoft.com/en-us/dotnet/api/system.io.filesystemwatcher)

See you in [Part 5](/blog/building-a-lawyer-gpt-for-your-blog-part5)!
