Building a "Lawyer GPT" for Your Blog - Part 1: Introduction & Architecture (English)

Building a "Lawyer GPT" for Your Blog - Part 1: Introduction & Architecture

Wednesday, 12 November 2025 // 17 minute read

Introduction

Buckle in because this is going to be a long series! If you've been following along with this blog, you'll know I'm a bit obsessed with finding interesting ways to use LLMs and AI in practical applications. Well, I've got a new project that combines my love of blogging, C#, and AI: building a writing assistant that helps me draft new blog posts using my existing content as a knowledge base.

NOTE: This is part of my experiments with AI (assisted drafting) + my own editing. Same voice, same pragmatism; just faster fingers.

Think of how modern legal practices use LLMs trained on case law to draft briefs, motions, and contracts. They don't start from scratch - the system references relevant precedents, suggests language based on successful past documents, and maintains consistency with established patterns. That's exactly what we're building here, but for blog content.

The goal is to create an AI-powered writing assistant that:

  • Helps draft new blog posts in my established style
  • Suggests relevant content from past articles to reference
  • Finds similar posts to maintain consistency
  • Auto-generates internal links to related articles
  • Acts like "GitHub Copilot for your blog"

This series will cover building a complete Retrieval Augmented Generation (RAG) system in C# that runs on Windows. We'll use the latest approaches and frameworks, and I'll explain each new technology as we encounter it.

My Hardware Setup (and Minimum Specs)

My Development Machine:

  • GPU: NVIDIA RTX A4000 (16GB VRAM)
  • CPU: AMD Ryzen 9 9950X
  • RAM: 96GB DDR5

This is my specific setup, but you don't need this hardware to follow along. Here are the minimum specs for different components:

  • Minimum: NVIDIA GPU with 8GB VRAM (e.g., RTX 3060 8GB, GTX 1070 Ti)
    • Can run 7B parameter models with quantization
    • Decent embedding generation speed
  • Comfortable: 12GB+ VRAM (e.g., RTX 3060 12GB, RTX 4060 Ti)
    • Run larger models or higher quality quantizations
  • My setup: 16GB (A4000) - Can run 13B models comfortably

CPU-Only Alternative

  • Minimum: Modern quad-core CPU
  • Recommended: 8+ cores (for reasonable embedding generation)
  • Everything works on CPU-only, just slower:
    • Embedding generation: ~5-10x slower
    • LLM inference: ~10-50x slower
    • Still totally usable for a writing assistant!

RAM Requirements

  • Minimum: 16GB system RAM
    • CPU-only inference for 7B models
  • Comfortable: 32GB
    • Better for larger models on CPU
  • My setup: 96GB - Overkill, 32GB is plenty

Storage

  • SSD: Recommended for model loading
  • Space: ~20GB for models and vector database

What About Intel/AMD NPUs?

Modern CPUs now include dedicated AI accelerators:

  • Intel Core Ultra (Meteor Lake+) - Intel AI Boost (NPU)
  • AMD Ryzen AI (7040/8040 series) - XDNA NPU
  • AMD Ryzen AI Max (Strix Halo) - Up to 50 TOPS

Important: NPUs are for inference only! You don't "build" or train models on NPUs - they're designed to run pre-trained models efficiently. Models are trained on cloud GPUs (or workstations), then downloaded and deployed to NPUs for inference.

Current Status for Running Models on NPUs (as of writing):

  • ✅ Great for: On-device inference, battery efficiency (laptops)
  • ⚠️ Limited for our use: Immature .NET/ONNX Runtime support
  • ❌ Not ready for: This project's primary path

Why not NPUs for this series?

  1. Software ecosystem: CUDA has 15+ years of maturity, NPU support in .NET is nascent
  2. Model compatibility: Most GGUF models target CUDA/CPU, NPU-optimized models are rare
  3. Documentation: Limited resources for NPU development in C#
  4. Performance: Currently slower than discrete GPUs for our workload
  5. DirectML support: Still experimental for LLM inference

Can you use NPUs for inference? Yes, but:

  • Requires DirectML execution provider in ONNX Runtime
  • Models need to be in ONNX format (not GGUF)
  • C# support is experimental
  • Performance is currently underwhelming vs. CUDA

How to try NPU inference (advanced users):

# Add the DirectML execution provider package (supports NPUs)
dotnet add package Microsoft.ML.OnnxRuntime.DirectML

// In C#:
var sessionOptions = new SessionOptions();
sessionOptions.AppendExecutionProvider_DML(0); // DirectML, device 0
var session = new InferenceSession("model.onnx", sessionOptions);

Future consideration: Once ONNX Runtime and DirectML mature their NPU support, these will become viable alternatives for inference!

Bottom line: I'll show the GPU-accelerated path, but will note CPU-only alternatives throughout. You can start CPU-only and upgrade later!

What We're Building

The final system will have several components:

  1. Markdown Ingestion Pipeline - Processes all the blog posts, chunks them intelligently, and generates embeddings
  2. Vector Database - Stores embeddings and enables semantic search for similar content
  3. Windows Client Application - A desktop writing assistant UI with editor and suggestion panel
  4. LLM Integration - Local GPU-accelerated inference for content generation
  5. Citation & Link Generation - Auto-suggests internal links and references to related posts
  6. Style Consistency Engine - Learns patterns from existing posts to maintain voice and structure

Think of it as "GitHub Copilot meets Grammarly" but trained specifically on your blog's content and style.

Why "Lawyer GPT"?

Modern law firms use LLMs trained on vast libraries of case law to help draft legal documents. When writing a motion, the system:

  • References relevant precedents and past successful arguments
  • Suggests language patterns that have worked before
  • Maintains consistency with legal writing standards
  • Cites sources automatically

That's our model. When I start writing "Adding Entity Framework for...", the system should:

  • Find my previous EF-related posts
  • Suggest structural patterns I've used before
  • Offer relevant code snippets from past articles
  • Auto-generate links to related posts
  • Maintain my writing style and technical depth

Unlike generic AI writing assistants, our system is grounded in actual past content, so it won't suggest things inconsistent with what I've already written.

Series Overview

Here's what we'll cover over the coming weeks:

Part 1 (This Post): Introduction & Architecture

We'll establish what we're building and why, plus cover the architectural decisions.

Part 2: GPU Setup & CUDA in C#

Getting Windows set up for GPU-accelerated AI workloads, installing CUDA, cuDNN, and testing that C# can actually see and use your GPU.

Part 3: Understanding Embeddings & Vector Databases

Deep dive into what embeddings actually are, how they enable semantic search, and choosing the right vector database (spoiler: we'll probably use Qdrant or pgvector).

Part 4: Building the Ingestion Pipeline

Processing markdown files, intelligent chunking strategies (you can't just split on paragraphs!), and generating embeddings for all our content.

Part 5: The Windows Client

Choosing the right framework (WPF, Avalonia, or MAUI?), building the UI, and making it actually pleasant to use.

Part 6: Local LLM Integration

Running models locally using ONNX Runtime, llama.cpp bindings, or other approaches. Making full use of that A4000!

Part 7: Content Generation & Prompt Engineering

Bringing it all together - semantic search for relevant content, context window management, prompt engineering for writing assistance, and generating coherent suggestions.

Part 8: Advanced Features & Production Deployment

Auto-linking to related posts, style consistency checking, code snippet suggestions, and making the system actually useful for daily writing.

Why RAG?

Before we dive into architecture, let's talk about why RAG (Retrieval Augmented Generation) is the right approach here.

The Problem with Fine-Tuning

You might think: "Why not just fine-tune an LLM on all the blog posts?" There are several issues with that:

  1. Cost & Complexity - Fine-tuning is expensive (both in compute and effort)
  2. Staleness - Every new blog post means retraining
  3. Black Box - Hard to understand what the model "learned"
  4. Hallucination - No guarantee the model won't make things up
  5. No Citations - Can't easily trace answers back to sources

How RAG Solves This

RAG combines the best of both worlds: the power of LLMs with the precision of search to create context-aware content generation.

The flow is:

  1. User starts writing (e.g., "Building a REST API with ASP.NET Core...")
  2. System finds semantically similar past articles
  3. System feeds relevant chunks as context to the LLM
  4. LLM generates suggestions/continuations based on past content
  5. System offers suggestions with references to source posts
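
To make that flow concrete, here's a minimal sketch of the whole loop in C#. The interfaces and types here (IEmbeddingModel, IVectorStore, ILanguageModel, BlogChunk) are placeholders I'm inventing purely for illustration - the real implementations arrive over the course of the series.

// Hypothetical interfaces - real implementations come in Parts 3-7
public interface IEmbeddingModel
{
    Task<float[]> EmbedAsync(string text);
}

public interface IVectorStore
{
    Task<IReadOnlyList<BlogChunk>> SearchAsync(float[] queryVector, int topK);
}

public interface ILanguageModel
{
    Task<string> GenerateAsync(string prompt);
}

public record BlogChunk(string SourcePost, string Text);

public class WritingAssistant(IEmbeddingModel embedder, IVectorStore store, ILanguageModel llm)
{
    public async Task<string> SuggestAsync(string currentDraft)
    {
        // 1-2. Embed the current draft and find semantically similar past content
        var queryVector = await embedder.EmbedAsync(currentDraft);
        var similar = await store.SearchAsync(queryVector, topK: 5);

        // 3. Feed the relevant chunks to the LLM as context
        var context = string.Join("\n\n", similar.Select(c => $"From '{c.SourcePost}':\n{c.Text}"));

        // 4-5. Generate suggestions grounded in past posts
        var prompt = $"Using these excerpts from my past posts:\n{context}\n\nContinue this draft in the same style:\n{currentDraft}";
        return await llm.GenerateAsync(prompt);
    }
}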

This means:

  • ✅ Always up-to-date (just re-index new posts as you write them)
  • ✅ Grounded in your actual writing (maintains consistency)
  • ✅ Traceable (know which past posts influenced suggestions)
  • ✅ Efficient (no expensive retraining for every new post)
  • ✅ Flexible (can swap out LLMs or adjust search strategies)
  • ✅ Privacy-preserving (everything runs locally)

System Architecture

Let me break down the key components we'll be building:

graph TB
    A[Markdown Files] -->|Ingest| B[Chunking Service]
    B -->|Text Chunks| C[Embedding Model]
    C -->|Vectors| D[Vector Database]

    E[User Writing] -->|Current Draft| F[Windows Client]
    F -->|Embed Context| C
    C -->|Query Vector| D
    D -->|Similar Content| G[Context Builder]

    G -->|Relevant Past Articles| H[Prompt Engineer]
    H -->|Prompt + Context| I[Local LLM]
    I -->|Generated Suggestions| J[Link Generator]
    J -->|Suggestions + Citations| F

    F -->|Display| K[Editor with Suggestions]

    class C,I embedding
    class D,K output

    classDef embedding stroke:#333,stroke-width:4px
    classDef output stroke:#333,stroke-width:4px

1. Markdown Ingestion Pipeline

This component:

  • Reads markdown files from the blog directory
  • Extracts metadata (title, categories, date, word count)
  • Intelligently chunks the content (preserving code blocks for reuse)
  • Identifies structural patterns (how I organize posts)
  • Tracks source information for citation generation

Key Challenge: Chunking strategy matters enormously. Too small and you lose context. Too large and you waste the LLM's context window. We need chunks that are semantically meaningful - a complete thought or section, not arbitrary paragraph breaks.
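
To illustrate, here's a deliberately simplified sketch of heading-aware chunking that never splits inside a fenced code block. The real pipeline in Part 4 will also handle front matter, chunk size limits, and overlap; the splitting rules below are just my starting assumption.

using System.Text;

// Simplified heading-aware chunker: new chunk at each H2/H3, never split a code fence
static List<string> ChunkByHeading(string markdown)
{
    var chunks = new List<string>();
    var current = new StringBuilder();
    var insideCodeFence = false;

    foreach (var line in markdown.Split('\n'))
    {
        if (line.TrimStart().StartsWith("```"))
            insideCodeFence = !insideCodeFence;      // keep code blocks intact

        if (!insideCodeFence && line.StartsWith("##") && current.Length > 0)
        {
            chunks.Add(current.ToString().Trim());   // close the previous section
            current.Clear();
        }
        current.AppendLine(line);
    }

    if (current.Length > 0)
        chunks.Add(current.ToString().Trim());

    return chunks;
}

// Example usage (file name is hypothetical)
var chunks = ChunkByHeading(File.ReadAllText("posts/adding-entity-framework.md"));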

2. Embedding Model

Embeddings are the magic that makes semantic search work. An embedding model takes text and converts it into a high-dimensional vector (array of numbers) that captures semantic meaning.

Similar concepts end up "close" in vector space, even if they use different words.

For example:

  • "database migration" and "updating the DB schema" would have similar embeddings
  • "cat" and "kitten" would be closer than "cat" and "database"

Technology Choice: We'll compare the embedding model options in detail in Part 3.

3. Vector Database

The vector database stores embeddings and enables fast similarity search. When you're writing about "Docker compose", it finds the K most semantically similar past content - not just keyword matches, but conceptually related material.

Technology Choice: We'll evaluate:

  • Qdrant - Modern, written in Rust, excellent C# client, Docker-friendly
  • pgvector - Extension for PostgreSQL (we're already using Postgres!)
  • Weaviate - Another solid option with good .NET support
  • Chroma - Popular in Python land, less so in C#

I'm leaning toward Qdrant for its simplicity and performance, or pgvector to keep everything in Postgres.
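
To show what this looks like from C#, here's a rough sketch against the official Qdrant.Client package. I'm writing this from memory before we've actually built Part 3, so treat the exact method signatures and the payload key as assumptions to verify there.

// Sketch only - method signatures based on my current reading of Qdrant.Client
using Qdrant.Client;
using Qdrant.Client.Grpc;

var client = new QdrantClient("localhost", 6334);

// One-off: collection dimensions must match the embedding model's output size
await client.CreateCollectionAsync("blog_chunks",
    new VectorParams { Size = 384, Distance = Distance.Cosine });

// Query time: embed the current draft (elsewhere), then find the closest chunks
var queryVector = new float[384]; // placeholder for a real embedding
var hits = await client.SearchAsync("blog_chunks", queryVector, limit: 5);

foreach (var hit in hits)
    Console.WriteLine($"{hit.Score:F3} {hit.Payload["source_post"]}");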

4. Windows Client

We need a nice UI for writing with AI assistance. Think split-pane editor with suggestions. Options:

WPF (Windows Presentation Foundation)

  • ✅ Mature, stable, lots of resources
  • ✅ Native Windows performance
  • ❌ Windows-only
  • ❌ Looks dated unless you invest in UI libraries

Avalonia

  • ✅ Cross-platform (XAML-based)
  • ✅ Modern, actively developed
  • ✅ Similar to WPF
  • ❌ Smaller ecosystem

MAUI (Multi-platform App UI)

  • ✅ Cross-platform
  • ✅ Microsoft-backed
  • ❌ Still maturing
  • ❌ More mobile-focused

Since we're Windows-focused and I want something stable, I'm leaning toward WPF with ModernWPF, with Avalonia as the alternative if cross-platform potential becomes important.

5. Local LLM Integration

This is where the A4000 GPU shines. We want to run the LLM locally for:

  • Privacy (no data sent to APIs)
  • Speed (local inference is fast)
  • Cost (no API fees)
  • Control (we choose the model)

Technology Options:

ONNX Runtime

  • Convert models to ONNX format
  • Excellent GPU acceleration
  • C# native support
  • Downside: Not all models convert well

llama.cpp bindings

  • C++ library with C# bindings (LLamaSharp)
  • Supports CUDA
  • Wide model support (Llama, Mistral, etc.)
  • Very active development

TorchSharp

  • PyTorch bindings for .NET
  • Most flexibility
  • Steeper learning curve

I'm leaning toward LLamaSharp for its maturity and ease of use with popular models.
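
To give a feel for the LLamaSharp route, here's roughly what loading a GGUF model and streaming a completion looks like. This is a sketch from memory - the LLamaSharp API shifts between versions and the model path is just an example - so take the exact types as assumptions until Part 6.

// Sketch of LLamaSharp usage - exact API varies by package version
using LLama;
using LLama.Common;

var parameters = new ModelParams("models/mistral-7b-instruct.Q4_K_M.gguf")
{
    ContextSize = 4096,
    GpuLayerCount = 35   // layers to offload to the GPU (0 = CPU only)
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

// Stream tokens as they're generated
await foreach (var token in executor.InferAsync(
    "Continue this blog post about Entity Framework:",
    new InferenceParams { MaxTokens = 256 }))
{
    Console.Write(token);
}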

6. Context Window Management & Prompt Engineering

LLMs have limited context windows (e.g., 4K, 8K, 32K tokens). We need to:

  • Retrieve the most relevant past content (top K from vector search)
  • Fit them into the context window with the current draft
  • Structure the prompt for writing assistance
  • Leave room for the generated suggestions

This is trickier than it sounds. We'll explore strategies like:

  • Dynamic K selection based on what you're currently writing
  • Re-ranking retrieved chunks by relevance
  • Compressing context intelligently
  • Multi-shot prompting with examples from past posts
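
As a starting point, the "fit what you can" step is just greedy packing of the best-ranked chunks into a token budget. Here's a rough sketch - the word-count token estimate is a crude stand-in for a real tokenizer:

// Crude token estimate - a real tokenizer replaces this later
static int EstimateTokens(string text) =>
    (int)(text.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length * 1.3);

// Greedily pack the most relevant chunks until the budget runs out
static List<string> BuildContext(IEnumerable<string> rankedChunks, int contextWindow, int reservedTokens)
{
    var budget = contextWindow - reservedTokens;   // leave room for the draft + suggestion
    var selected = new List<string>();

    foreach (var chunk in rankedChunks)            // already ordered by relevance
    {
        var cost = EstimateTokens(chunk);
        if (cost > budget) continue;               // skip anything that no longer fits
        selected.Add(chunk);
        budget -= cost;
    }
    return selected;
}

// e.g. an 8K-token model, reserving ~3K tokens for the current draft and the answer
var searchResults = new List<string>(); // would come from the vector search
var contextChunks = BuildContext(searchResults, contextWindow: 8192, reservedTokens: 3072);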

Every chunk needs metadata:

  • Source file/post
  • Position in the original document
  • Publish date
  • Categories
  • Code snippets used

When the LLM generates suggestions, we automatically create markdown links to the source posts and identify reusable code patterns.
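
In C#, that metadata falls out naturally as a small record attached to each indexed chunk, and it's also exactly what we need to emit an internal markdown link. A sketch - the property names, example file name, and /blog/ URL scheme are my assumptions for now:

// Sketch of per-chunk metadata - the exact shape will firm up in Part 4
public record ChunkMetadata(
    string   SourceFile,      // e.g. "adding-entity-framework.md" (hypothetical)
    string   PostTitle,
    int      PositionInPost,  // chunk index within the original document
    DateOnly PublishDate,
    string[] Categories,
    bool     ContainsCode);

public record IndexedChunk(string Text, float[] Embedding, ChunkMetadata Metadata);

public static class CitationHelper
{
    // the /blog/ URL scheme is assumed - adjust to the site's real routing
    public static string ToMarkdownLink(ChunkMetadata meta) =>
        $"[{meta.PostTitle}](/blog/{Path.GetFileNameWithoutExtension(meta.SourceFile)})";
}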

What Makes This Different?

There are lots of RAG tutorials out there, but this series will be different:

  1. C# First - Most RAG examples are in Python. We're going full .NET
  2. Windows & GPU - Leveraging NVIDIA CUDA on Windows, not Linux/WSL
  3. Production Ready - Not just proof-of-concept, but actual usable code
  4. Domain-Specific - Optimized for blog writing assistance, not generic content generation
  5. Deep Explanations - We'll actually explain the "why" behind decisions

Technologies We'll Use

Here's the tech stack I'm planning:

Core Framework

  • .NET 9 (latest at the time of writing)
  • C# 13 - Modern language features

AI/ML Libraries

  • LLamaSharp or ONNX Runtime - Local inference (evaluated in Part 6)
  • An embedding model - Chosen in Part 3

Vector Database

  • Qdrant or pgvector - Semantic search over past posts (evaluated in Part 3)

UI Framework

  • WPF with ModernWPF or Avalonia - Modern desktop UI

Supporting Tools

GPU Stack

  • CUDA 12.x (latest at the time of writing)
  • cuDNN - Deep learning primitives

Performance Considerations

Different hardware setups will have different capabilities:

With 8GB VRAM (Minimum)

  • Model Size: 7B parameter models with Q4 quantization
  • Batch Processing: Process embeddings in smaller batches
  • Memory Management: Careful VRAM monitoring required
  • Works well for: Writing assistant, embedding generation

With 12GB+ VRAM (Comfortable)

  • Model Size: 7B with higher quality quantization (Q5/Q6)
  • Batch Processing: Larger batches for faster throughput
  • Can also run: Some 13B models with aggressive quantization

With 16GB+ VRAM (My setup)

  • Model Size: 7B-13B parameter models comfortably
  • Batch Processing: Full batches, minimal constraints
  • Fast inference: Sub-second response times
  • Headroom: Can experiment with different models

CPU-Only (Fallback)

  • Everything works, just slower
  • Embeddings: 5-10x slower than GPU
  • LLM inference: 10-50x slower than GPU
  • Still usable: For a writing assistant with patience!

Development Approach

We'll build this incrementally:

  1. Start with simplest components (markdown reading, chunking)
  2. Add embedding generation (could start with API-based before going local)
  3. Get vector search working
  4. Build basic UI
  5. Integrate LLM
  6. Polish and optimize

Each part will be deployable and testable on its own. No big-bang integration nightmares.

What's Next?

In Part 2: GPU Setup & CUDA in C#, we'll get hands-on with the GPU setup:

  • Installing CUDA and cuDNN on Windows
  • Setting up the development environment
  • Writing a simple C# program to verify GPU access
  • Running a basic inference test with ONNX Runtime
  • Benchmarking our A4000 to understand what we can do
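
As a tiny preview of that verification step, listing ONNX Runtime's available execution providers is often enough to confirm the GPU is visible. A minimal sketch (assuming the Microsoft.ML.OnnxRuntime.Gpu package is installed; we'll do this properly in Part 2):

// Quick sanity check: does ONNX Runtime report a CUDA execution provider?
using Microsoft.ML.OnnxRuntime;

foreach (var provider in OrtEnv.Instance().GetAvailableProviders())
    Console.WriteLine(provider); // hoping for "CUDAExecutionProvider" next to "CPUExecutionProvider"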

This might seem basic, but getting the GPU stack right is crucial. I've wasted hours debugging issues that came down to version mismatches or missing DLLs.

Why This Matters

Beyond just being a cool project, this approach has real applications:

  • Technical Documentation - Maintaining consistent style across large doc sets
  • Legal Practice - Actual use case! Drafting briefs using case law precedents
  • Content Marketing - Maintaining brand voice across teams
  • Academic Writing - Consistency with past papers and research
  • Personal Knowledge Management - Writing assistant for your own corpus

The principles we'll cover apply to any domain where you have existing content and need to maintain consistency while creating new material.

Conclusion

We're embarking on a journey to build a production-quality RAG-based writing assistant in C# that helps draft blog content using past articles as reference material - just like lawyers use LLMs trained on case law. We'll leverage modern GPU hardware, the latest .NET features, and battle-tested AI/ML approaches.

This isn't a toy project - we're building something that could genuinely be useful for anyone who writes regularly and wants to maintain consistency, style, and quality across a large body of work.

In the next part, we'll get our hands dirty with CUDA, GPUs, and making sure our development environment is ready for the challenges ahead.

Stay tuned, and get ready to learn about embeddings, vector databases, chunking strategies, prompt engineering, and all the other delightful complexities of modern AI systems!

Resources

If you want to get a head start, here are some resources I'll be referencing throughout this series:

See you in Part 2!
