Wednesday, 12 November 2025
Buckle in because this is going to be a long series! If you've been following along with this blog, you'll know I'm a bit obsessed with finding interesting ways to use LLMs and AI in practical applications. Well, I've got a new project that combines my love of blogging, C#, and AI: building a writing assistant that helps me draft new blog posts using my existing content as a knowledge base.
NOTE: This is part of my experiments with AI (assisted drafting) + my own editing. Same voice, same pragmatism; just faster fingers.
Think of how modern legal practices use LLMs trained on case law to draft briefs, motions, and contracts. They don't start from scratch - the system references relevant precedents, suggests language based on successful past documents, and maintains consistency with established patterns. That's exactly what we're building here, but for blog content.
The goal is to create an AI-powered writing assistant that:
This series will cover building a complete Retrieval Augmented Generation (RAG) system in C# that runs on Windows. We'll use the latest approaches and frameworks, and I'll explain each new technology as we encounter it.
My Development Machine:
This is my specific setup, but you don't need this hardware to follow along. Here are the minimum specs for different components:
Modern CPUs now include dedicated AI accelerators:
Important: NPUs are for inference only! You don't "build" or train models on NPUs - they're designed to run pre-trained models efficiently. Models are trained on cloud GPUs (or workstations), then downloaded and deployed to NPUs for inference.
Current Status for Running Models on NPUs (as of writing):
Why not NPUs for this series?
Can you use NPUs for inference? Yes, but:
How to try NPU inference (advanced users):
# Add the DirectML execution provider package (supports NPUs)
dotnet add package Microsoft.ML.OnnxRuntime.DirectML

// In code:
using Microsoft.ML.OnnxRuntime;

var sessionOptions = new SessionOptions();
sessionOptions.AppendExecutionProvider_DML(0); // DirectML, device 0
using var session = new InferenceSession("model.onnx", sessionOptions);
Future consideration: Once ONNX Runtime and DirectML mature their NPU support, these will become viable alternatives for inference!
Bottom line: I'll show the GPU-accelerated path, but will note CPU-only alternatives throughout. You can start CPU-only and upgrade later!
The final system will have several components:
Think of it as "GitHub Copilot meets Grammarly", but grounded specifically in your blog's content and style.
Modern law firms use LLMs trained on vast libraries of case law to help draft legal documents. When writing a motion, the system:
That's our model. When I start writing "Adding Entity Framework for...", the system should:
Unlike generic AI writing assistants, our system is grounded in actual past content, so it won't suggest things inconsistent with what I've already written.
Here's what we'll cover over the coming weeks:
We'll establish what we're building and why, plus cover the architectural decisions.
Getting Windows set up for GPU-accelerated AI workloads, installing CUDA, cuDNN, and testing that C# can actually see and use your GPU.
Deep dive into what embeddings actually are, how they enable semantic search, and choosing the right vector database (spoiler: we'll probably use Qdrant or pgvector).
Processing markdown files, intelligent chunking strategies (you can't just split on paragraphs!), and generating embeddings for all our content.
Choosing the right framework (WPF, Avalonia, or MAUI?), building the UI, and making it actually pleasant to use.
Running models locally using ONNX Runtime, llama.cpp bindings, or other approaches. Making full use of that A4000!
Bringing it all together - semantic search for relevant content, context window management, prompt engineering for writing assistance, and generating coherent suggestions.
Auto-linking to related posts, style consistency checking, code snippet suggestions, and making the system actually useful for daily writing.
Before we dive into architecture, let's talk about why RAG (Retrieval Augmented Generation) is the right approach here.
You might think: "Why not just fine-tune an LLM on all the blog posts?" There are several issues with that:
RAG combines the best of both worlds: the power of LLMs with the precision of search to create context-aware content generation.
The flow is:
This means:
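To make that loop concrete before we dig into each component, here's a rough end-to-end sketch. Every type in it (IEmbeddingModel, IVectorStore, ILocalLlm) is a placeholder for components we'll build later in the series, not an existing API, and it assumes .NET 8 with implicit usings:

```csharp
// Hypothetical interfaces standing in for the real components built later in the series.
public interface IEmbeddingModel { Task<float[]> EmbedAsync(string text); }
public interface IVectorStore { Task<IReadOnlyList<string>> SearchAsync(float[] query, int topK); }
public interface ILocalLlm { Task<string> CompleteAsync(string prompt); }

public sealed class WritingAssistant(IEmbeddingModel embedder, IVectorStore store, ILocalLlm llm)
{
    public async Task<string> SuggestAsync(string currentDraft)
    {
        // 1. Embed what the author is currently writing.
        var queryVector = await embedder.EmbedAsync(currentDraft);

        // 2. Retrieve the most semantically similar chunks from past posts.
        var relevantChunks = await store.SearchAsync(queryVector, topK: 5);

        // 3. Build a prompt that grounds the LLM in that retrieved context.
        var prompt = $"""
            You are helping draft a blog post. Ground every suggestion in the context below.

            Context:
            {string.Join("\n---\n", relevantChunks)}

            Current draft:
            {currentDraft}

            Continue the draft in the same style:
            """;

        // 4. Generate a grounded suggestion with the local model.
        return await llm.CompleteAsync(prompt);
    }
}
```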
Let me break down the key components we'll be building:
graph TB
A[Markdown Files] -->|Ingest| B[Chunking Service]
B -->|Text Chunks| C[Embedding Model]
C -->|Vectors| D[Vector Database]
E[User Writing] -->|Current Draft| F[Windows Client]
F -->|Embed Context| C
C -->|Query Vector| D
D -->|Similar Content| G[Context Builder]
G -->|Relevant Past Articles| H[Prompt Engineer]
H -->|Prompt + Context| I[Local LLM]
I -->|Generated Suggestions| J[Link Generator]
J -->|Suggestions + Citations| F
F -->|Display| K[Editor with Suggestions]
class C,I embedding
class D,K output
classDef embedding stroke:#333,stroke-width:4px
classDef output stroke:#333,stroke-width:4px
This component:
Key Challenge: Chunking strategy matters enormously. Too small and you lose context. Too large and you waste the LLM's context window. We need chunks that are semantically meaningful - a complete thought or section, not arbitrary paragraph breaks.
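As a first approximation, here's a deliberately naive heading-aware chunker: split at markdown headings so each chunk is a complete section rather than an arbitrary slice. The real chunker will also need to respect size limits, code fences, and overlap, but this captures the core idea:

```csharp
// A deliberately naive heading-aware chunker: every markdown heading starts a new
// chunk, so chunks line up with complete sections instead of arbitrary splits.
// (It ignores code fences and size limits - the real version won't.)
public static class MarkdownChunker
{
    public static IEnumerable<string> ChunkByHeading(string markdown)
    {
        var current = new List<string>();
        foreach (var line in markdown.Split('\n'))
        {
            if (line.StartsWith('#') && current.Count > 0)
            {
                yield return string.Join("\n", current).Trim();
                current.Clear();
            }
            current.Add(line);
        }
        if (current.Count > 0)
            yield return string.Join("\n", current).Trim();
    }
}
```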
Embeddings are the magic that makes semantic search work. An embedding model takes text and converts it into a high-dimensional vector (array of numbers) that captures semantic meaning.
Similar concepts end up "close" in vector space, even if they use different words.
For example:
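To put a number on "close", we can use cosine similarity: vectors pointing in the same direction score near 1.0, unrelated ones near 0. A minimal sketch (real embeddings have hundreds or thousands of dimensions, not the toy sizes you'd test this with):

```csharp
public static class VectorMath
{
    // Cosine similarity: ~1.0 = pointing the same way (similar meaning), ~0 = unrelated.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        double dot = 0, magA = 0, magB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot  += a[i] * b[i];
            magA += a[i] * a[i];
            magB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
    }
}

// "Setting up Docker Compose" vs "configuring containerised services" should score far
// higher against each other than either does against "baking sourdough".
```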
Technology Choice: We'll probably use either:
The vector database stores embeddings and enables fast similarity search. When you're writing about "Docker Compose", it finds the K most semantically similar pieces of past content - not just keyword matches, but conceptually related material.
Technology Choice: We'll evaluate:
I'm leaning toward Qdrant for its simplicity and performance, or pgvector to keep everything in Postgres.
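Conceptually, the database is doing something like the brute-force sketch below, just with proper indexing (HNSW and friends) so it stays fast at scale. StoredChunk is a hypothetical type, and it reuses the CosineSimilarity helper from the embeddings section:

```csharp
// What the vector database does conceptually: score every stored chunk against the
// query vector and return the top K. Qdrant/pgvector use real indexes so this stays
// fast over thousands of chunks; the brute-force version is just for illustration.
public record StoredChunk(string SourcePost, string Text, float[] Vector);

public static class NaiveVectorSearch
{
    public static IEnumerable<StoredChunk> TopK(
        IEnumerable<StoredChunk> chunks, float[] queryVector, int k) =>
        chunks
            .OrderByDescending(c => VectorMath.CosineSimilarity(c.Vector, queryVector))
            .Take(k);
}
```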
We need a nice UI for writing with AI assistance. Think split-pane editor with suggestions. Options:
WPF (Windows Presentation Foundation)
Since we're Windows-focused and I want something stable, I'm leaning toward WPF with ModernWPF UI or Avalonia for that cross-platform potential.
This is where the A4000 GPU shines. We want to run the LLM locally for:
Technology Options:
llama.cpp bindings
I'm leaning toward LLamaSharp for its maturity and ease of use with popular models.
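To give a flavour of what that looks like, here's a minimal LLamaSharp sketch that loads a local GGUF model and streams a completion. Treat it as a shape rather than copy-paste code: the model path is a placeholder, and type and property names shift a little between LLamaSharp versions:

```csharp
using LLama;
using LLama.Common;

// Load a local GGUF model and stream a completion.
// GpuLayerCount controls how many layers are offloaded to the GPU (the A4000 here);
// set it to 0 for a CPU-only run. The model path is just an example.
var parameters = new ModelParams(@"C:\models\mistral-7b-instruct.Q4_K_M.gguf")
{
    ContextSize = 4096,
    GpuLayerCount = 32
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

var prompt = "Suggest an opening paragraph for a post about Entity Framework migrations:";
await foreach (var token in executor.InferAsync(prompt, new InferenceParams { MaxTokens = 200 }))
{
    Console.Write(token);
}
```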
LLMs have limited context windows (e.g., 4K, 8K, 32K tokens). We need to:
This is trickier than it sounds. We'll explore strategies like:
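One simple strategy is greedy packing: rank the retrieved chunks by relevance and keep adding them until a token budget is spent. A sketch using a rough characters-per-token estimate (a real implementation would use the model's actual tokenizer):

```csharp
public static class ContextBuilder
{
    // Greedy packing: most relevant chunks first, stop when the budget is spent.
    // Token counts are estimated at ~4 characters per token; a real implementation
    // would ask the model's tokenizer for exact counts.
    public static List<string> Build(
        IEnumerable<(string Text, double Score)> rankedChunks, int tokenBudget)
    {
        var selected = new List<string>();
        var tokensUsed = 0;

        foreach (var (text, _) in rankedChunks.OrderByDescending(c => c.Score))
        {
            var estimatedTokens = text.Length / 4;
            if (tokensUsed + estimatedTokens > tokenBudget)
                break;

            selected.Add(text);
            tokensUsed += estimatedTokens;
        }

        return selected;
    }
}
```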
Every chunk needs metadata:
When the LLM generates suggestions, we automatically create markdown links to the source posts and identify reusable code patterns.
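Here's a sketch of the metadata I expect each chunk to carry, plus the link generation. The exact fields and URL shape are guesses that will firm up in later parts:

```csharp
// Hypothetical metadata carried with every chunk so suggestions can cite their sources.
public record ChunkMetadata(
    string PostTitle,
    string PostSlug,   // used to rebuild the post's URL
    string Heading,    // the section the chunk came from
    DateOnly Published)
{
    // Turn a cited chunk back into a markdown link the editor can drop into the draft.
    public string ToMarkdownLink() => $"[{PostTitle}: {Heading}](/blog/{PostSlug})";
}
```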
There are lots of RAG tutorials out there, but this series will be different:
Here's the tech stack I'm planning:
Different hardware setups will have different capabilities:
We'll build this incrementally:
Each part will be deployable and testable on its own. No big-bang integration nightmares.
In Part 2: GPU Setup & CUDA in C#, we'll get hands-on with the GPU setup:
This might seem basic, but getting the GPU stack right is crucial. I've wasted hours debugging issues that came down to version mismatches or missing DLLs.
Beyond just being a cool project, this approach has real applications:
The principles we'll cover apply to any domain where you have existing content and need to maintain consistency while creating new material.
We're embarking on a journey to build a production-quality RAG-based writing assistant in C# that helps draft blog content using past articles as reference material - just like lawyers use LLMs trained on case law. We'll leverage modern GPU hardware, the latest .NET features, and battle-tested AI/ML approaches.
This isn't a toy project - we're building something that could genuinely be useful for anyone who writes regularly and wants to maintain consistency, style, and quality across a large body of work.
In the next part, we'll get our hands dirty with CUDA, GPUs, and making sure our development environment is ready for the challenges ahead.
Stay tuned, and get ready to learn about embeddings, vector databases, chunking strategies, prompt engineering, and all the other delightful complexities of modern AI systems!
If you want to get a head start, here are some resources I'll be referencing throughout this series:
See you in Part 2!