Welcome to Part 2! In Part 1, we laid out the vision for building a writing assistant that uses your blog as a knowledge base - like how lawyers use LLMs trained on case law to draft documents. Now it's time to get our hands dirty with the foundation: making sure your GPU is ready for AI workloads.
NOTE: This is part of my experiments with AI (assisted drafting) + my own editing. Same voice, same pragmatism; just faster fingers.
About my hardware: I'm using an NVIDIA RTX A4000 (16GB VRAM), AMD Ryzen 9 9950X, and 96GB DDR5 RAM. But you don't need this! As covered in Part 1, you can use any NVIDIA GPU with 8GB+ VRAM, or even run CPU-only (slower but functional). This part focuses on GPU setup, but I'll note CPU-only alternatives where relevant.
This part might seem basic if you're already familiar with CUDA, but trust me - I've wasted countless hours debugging mysterious errors that traced back to version mismatches, missing environment variables, or incorrect cuDNN installations. We'll do this right from the start.
Before we dive into installation, let's understand why we need GPU acceleration at all.
graph LR
A[AI Workload] --> B{Type?}
B -->|Matrix Operations| C[GPU: 100x+ faster]
B -->|Sequential Logic| D[CPU: Better]
C --> E[Embedding Generation]
C --> F[LLM Inference]
C --> G[Vector Search]
D --> H[Application Logic]
D --> I[File I/O]
class C gpu
class E,F,G aiTasks
classDef gpu stroke:#333,stroke-width:4px
classDef aiTasks stroke:#333
Why GPUs dominate for AI:
Real-world example (on my hardware): Generating embeddings for a blog post:
That's 16x faster, and it compounds when processing hundreds of posts!
Performance on other GPUs (approximate):
Before installing anything, let's understand what we're working with.
This is my specific GPU - a professional workstation card with:
Here's what various GPUs can handle:
| GPU | VRAM | Max Model Size | Good For |
|---|---|---|---|
| RTX 4090 | 24GB | 13B-30B | Overkill for this project |
| RTX 4070 Ti | 12GB | 7B-13B | Excellent choice |
| RTX 3060 | 12GB | 7B | Budget-friendly |
| RTX 4060 Ti | 16GB | 7B-13B | Great value |
| A4000 (mine) | 16GB | 7B-13B | Workstation GPU |
| GTX 1070 Ti | 8GB | 7B (tight) | Minimum viable |
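To see roughly where the "Max Model Size" column comes from, here is a back-of-the-envelope sketch. It assumes the weights dominate memory use and adds a flat 20% for the KV cache and runtime buffers. The helper name `EstimateVramGb`, the overhead factor, and the chosen quantization levels are illustrative assumptions, not measurements from my hardware.

```csharp
using System;

// Rough VRAM estimate: parameters * bytes per weight, plus a flat overhead.
// The 1.2 overhead factor is an assumption for illustration, not a measurement.
static double EstimateVramGb(double billionsOfParams, int bitsPerWeight, double overhead = 1.2)
{
    double weightGb = billionsOfParams * bitsPerWeight / 8.0; // 1B params at 8 bits is about 1 GB
    return weightGb * overhead;
}

var scenarios = new[]
{
    ("7B @ 4-bit", 7.0, 4),
    ("13B @ 4-bit", 13.0, 4),
    ("7B @ FP16", 7.0, 16),
};

foreach (var (label, size, bits) in scenarios)
{
    Console.WriteLine($"{label}: ~{EstimateVramGb(size, bits):F1} GB");
}
```

Under these assumptions a 4-bit 7B model needs roughly 4.2 GB and a 4-bit 13B model roughly 7.8 GB, while an unquantized FP16 7B model (about 16.8 GB) would not fit on a 16GB card. Treat the numbers as a sanity check, not a guarantee.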
You might have heard about Intel Core Ultra or AMD Ryzen AI chips with built-in NPUs (Neural Processing Units). Can you use those instead of NVIDIA?
Short answer: Still not recommended for this use case (as of late 2025), but getting closer!
Current state (late 2025):
Why we still use NVIDIA CUDA:
If you have an NPU-equipped CPU:
Future outlook: NPUs are improving rapidly. By early 2026, they may become viable for LLM inference, especially for smaller models (1B-3B parameters). Recent NPU generations, such as the one in Intel's Lunar Lake and AMD's XDNA 2, already deliver considerably more AI throughput than their predecessors.
This series focuses on NVIDIA CUDA because it delivers the best performance today, but the architecture concepts will translate to NPUs as the ecosystem matures!
First, verify Windows sees your GPU:
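# PowerShell
# (wmic is deprecated and may be missing on recent Windows 11 builds;
# if so, use: Get-CimInstance Win32_VideoController | Select-Object Name)
wmic path win32_VideoController get name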
You should see something like:
Name
NVIDIA RTX A4000
Or use NVIDIA Control Panel → System Information → Display tab.
Here's how the software stack works:
graph TB
A[Your C# Application] --> B[ONNX Runtime / LLamaSharp]
B --> C[CUDA Toolkit]
C --> D[cuDNN Libraries]
D --> E[NVIDIA Driver]
E --> F[GPU Hardware]
G[TensorRT] -.Optional.-> D
class A,F endpoints
class C,D,E install
classDef endpoints stroke:#333,stroke-width:4px
classDef install stroke:#333,stroke-width:2px
subgraph "What We'll Install"
C
D
E
end
Layer breakdown:
Here's the order we'll install things (order matters!):
graph LR
A[1. NVIDIA Driver] --> B[2. CUDA Toolkit]
B --> C[3. cuDNN]
C --> D[4. Verify Install]
D --> E[5. Test from C#]
class A,B,C,D,E steps
classDef steps stroke:#333,stroke-width:2px
Open PowerShell:
nvidia-smi
You should see output like:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 546.33 Driver Version: 546.33 CUDA Version: 12.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A4000 WDDM | 00000000:01:00.0 Off | Off |
| 41% 32C P8 10W / 140W | 345MiB / 16376MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
Key info:
Driver Version: Should be 545.xx or newer
CUDA Version: This is the MAX CUDA version supported, not what's installed
Memory-Usage: Currently used / total VRAM
If nvidia-smi doesn't work or your driver is old:
Go to NVIDIA Driver Downloads
Select:
Install and reboot
Verify again with nvidia-smi
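If you'd rather check the driver from code than read the table by eye, nvidia-smi can emit CSV with `nvidia-smi --query-gpu=name,driver_version,memory.total,memory.used --format=csv,noheader,nounits`. The sketch below only parses one such line; the helper names `ParseGpuLine` and `DriverIsNewEnough` are hypothetical, and the sample string stands in for real command output so the snippet runs without a GPU.

```csharp
using System;
using System.Globalization;

// Parses one line of:
//   nvidia-smi --query-gpu=name,driver_version,memory.total,memory.used --format=csv,noheader,nounits
// e.g. "NVIDIA RTX A4000, 546.33, 16376, 345"
static (string Name, Version Driver, int TotalMiB, int UsedMiB) ParseGpuLine(string line)
{
    var parts = line.Split(',', StringSplitOptions.TrimEntries);
    if (parts.Length != 4)
        throw new FormatException($"Expected 4 fields, got {parts.Length}");

    return (parts[0],
            Version.Parse(parts[1]),
            int.Parse(parts[2], CultureInfo.InvariantCulture),
            int.Parse(parts[3], CultureInfo.InvariantCulture));
}

// Threshold taken from the "545.xx or newer" guidance in this article.
static bool DriverIsNewEnough(Version driver, int minMajor = 545) => driver.Major >= minMajor;

// Sample line standing in for real nvidia-smi output.
var gpu = ParseGpuLine("NVIDIA RTX A4000, 546.33, 16376, 345");
Console.WriteLine($"{gpu.Name}: driver {gpu.Driver}, {gpu.UsedMiB}/{gpu.TotalMiB} MiB used");
Console.WriteLine(DriverIsNewEnough(gpu.Driver) ? "Driver OK" : "Driver too old, update it");
```

In a real tool you would feed `ParseGpuLine` the standard output of the nvidia-smi process instead of a hard-coded string.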
CUDA provides the programming interface for GPU acceleration.
Critical: We need CUDA 12.x for modern models. Specifically:
Run the installer. When prompted:
Installation Type: Custom (Advanced)
Select Components:
✅ CUDA Toolkit
✅ CUDA Documentation
✅ CUDA Samples (if offered; recent installers no longer bundle them)
✅ CUDA Visual Studio Integration
❌ GeForce Experience (not needed)
❌ NVIDIA Driver (already installed)
Why Custom?: We don't want to downgrade our driver or install gaming software.
Default path is fine:
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1
But note this - we'll need it for environment variables!
The installer usually sets these, but verify:
Check in PowerShell:
$env:CUDA_PATH
# Should show: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1
$env:PATH -split ';' | Select-String CUDA
# Should show CUDA bin and libnvvp paths
If missing, add manually:
Open Environment Variables:
This PC → Properties → Advanced System Settings → Environment Variables
Under System Variables, verify/add:
CUDA_PATH = C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1
CUDA_PATH_V12_1 = C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1
In Path variable, ensure these exist:
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\libnvvp
Restart your terminal for changes to take effect
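The same environment check can be scripted from C#. This is a minimal sketch that assumes Windows conventions (`;` as the PATH separator and a `bin` folder directly under `CUDA_PATH`); `PathContainsCudaBin` is a hypothetical helper name, not part of any SDK.

```csharp
using System;
using System.Linq;

// True when some PATH entry equals <cudaRoot>\bin.
// Comparison is case-insensitive and ignores trailing slashes (Windows conventions).
static bool PathContainsCudaBin(string cudaRoot, string pathVariable)
{
    string expected = cudaRoot.TrimEnd('\\', '/') + "\\bin";
    return pathVariable
        .Split(';', StringSplitOptions.RemoveEmptyEntries)
        .Select(entry => entry.Trim().TrimEnd('\\', '/'))
        .Any(entry => string.Equals(entry, expected, StringComparison.OrdinalIgnoreCase));
}

var cudaPath = Environment.GetEnvironmentVariable("CUDA_PATH");
var path = Environment.GetEnvironmentVariable("PATH") ?? "";

if (string.IsNullOrEmpty(cudaPath))
    Console.WriteLine("CUDA_PATH is not set");
else if (PathContainsCudaBin(cudaPath, path))
    Console.WriteLine("OK: the CUDA bin folder is on PATH");
else
    Console.WriteLine("CUDA_PATH is set, but its bin folder is missing from PATH");
```

Keeping the comparison in a function that takes plain strings makes it easy to try against made-up values before trusting it on a real machine.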
# Check CUDA compiler
nvcc --version
# Should output:
# nvcc: NVIDIA (R) Cuda compiler driver
# Copyright (c) 2005-2023 NVIDIA Corporation
# Built on Tue_Feb__7_19:32:13_Pacific_Standard_Time_2023
# Cuda compilation tools, release 12.1, V12.1.66
# Check path resolves
where.exe nvcc
# Should show: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvcc.exe
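If you want to assert the toolkit version in a setup script, the release number can be pulled out of the `nvcc --version` text. A small sketch, assuming the "release X.Y" wording shown above stays stable; `TryParseNvccRelease` is a hypothetical helper, and the sample string stands in for captured command output.

```csharp
using System;
using System.Text.RegularExpressions;

// Extracts the toolkit release (e.g. 12.1) from nvcc output such as
// "Cuda compilation tools, release 12.1, V12.1.66".
// Returns false, with release set to 0.0, when no release number is found.
static bool TryParseNvccRelease(string nvccOutput, out Version release)
{
    var match = Regex.Match(nvccOutput, @"release\s+(\d+)\.(\d+)");
    release = match.Success
        ? new Version(int.Parse(match.Groups[1].Value), int.Parse(match.Groups[2].Value))
        : new Version(0, 0);
    return match.Success;
}

// Sample line standing in for real nvcc output.
string sample = "Cuda compilation tools, release 12.1, V12.1.66";

if (TryParseNvccRelease(sample, out var detected))
    Console.WriteLine(detected.Major == 12
        ? $"CUDA {detected} detected"
        : $"Unexpected CUDA major version: {detected}");
else
    Console.WriteLine("Could not find a release number in the nvcc output");
```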
cuDNN provides optimized implementations of deep learning operations.
Critical: cuDNN version must match CUDA version!
For CUDA 12.x, we need a cuDNN 8.9.x build that targets CUDA 12 (8.9.7 at the time of writing). The file names used below, such as cudnn64_8.dll, are specific to cuDNN 8.
cuDNN is just a set of files you copy into your CUDA installation.
Extract the ZIP, you'll see:
cudnn-windows-x86_64-8.9.7.29_cuda12-archive\
    bin\
        cudnn64_8.dll
        cudnn_adv_infer64_8.dll
        cudnn_adv_train64_8.dll
        ... (more DLLs)
    include\
        cudnn.h
        ... (header files)
    lib\
        x64\
            cudnn.lib
            ... (lib files)
Copy files to CUDA installation:
# Assuming you extracted to Downloads and CUDA is in default location
# Run PowerShell as Administrator
$cudnnPath = "$env:USERPROFILE\Downloads\cudnn-windows-x86_64-8.9.7.29_cuda12-archive"
$cudaPath = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1"
# Copy DLLs
Copy-Item "$cudnnPath\bin\*.dll" -Destination "$cudaPath\bin\"
# Copy headers
Copy-Item "$cudnnPath\include\*.h" -Destination "$cudaPath\include\"
# Copy libs
Copy-Item "$cudnnPath\lib\x64\*.lib" -Destination "$cudaPath\lib\x64\"
Or do it manually:
bin\ to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\
include\ to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\
lib\x64\ to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\lib\x64\
# Check DLLs exist
Test-Path "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\cudnn64_8.dll"
# Should return: True
# List all cuDNN DLLs
Get-ChildItem "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\cudnn*.dll"
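The same "are the DLLs there?" check can live in your application's startup code, which gives a clearer message than a native load failure later on. A sketch, assuming the default install path used in this article and the three DLL names from the archive listing; `FindMissingFiles` is a hypothetical helper.

```csharp
using System;
using System.IO;
using System.Linq;

// Returns the names from `required` that do not exist in `directory`.
static string[] FindMissingFiles(string directory, string[] required) =>
    required.Where(name => !File.Exists(Path.Combine(directory, name))).ToArray();

// Default install location used in this article; adjust if yours differs.
string cudaBin = @"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin";
string[] cudnnDlls = { "cudnn64_8.dll", "cudnn_adv_infer64_8.dll", "cudnn_adv_train64_8.dll" };

var missing = FindMissingFiles(cudaBin, cudnnDlls);
Console.WriteLine(missing.Length == 0
    ? "All listed cuDNN DLLs found"
    : "Missing: " + string.Join(", ", missing));
```

The list of DLL names is deliberately short; extend it with whichever cuDNN files your archive actually contains.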
Let's make sure everything works together.
nvidia-smi
Should show your GPU with no errors.
nvcc --version
Should show CUDA 12.1 (or whatever version you installed).
# All these should return paths
$env:CUDA_PATH
$env:CUDA_PATH_V12_1
# Check PATH includes CUDA bin
$env:PATH -split ';' | Select-String CUDA
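The CUDA samples make a good end-to-end check. Since CUDA 11.6 they are no longer bundled with the Toolkit installer; they live in NVIDIA's cuda-samples repository on GitHub. Let's compile and run one.
Get the samples (check out the tag that matches your Toolkit; v12.1 is assumed here):
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples
git checkout v12.1
Find deviceQuery:
cd "Samples\1_Utilities\deviceQuery"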
Compile it (requires Visual Studio):
# If you have VS 2022 (the leading & is required in PowerShell to run a quoted path)
& "C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Current\Bin\MSBuild.exe" deviceQuery_vs2022.vcxproj /p:Configuration=Release /p:Platform=x64
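Run it (if the binary isn't under .\x64\Release\, look in the bin\win64\Release folder at the root of the samples tree, where the sample projects usually place it):
.\x64\Release\deviceQuery.exe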
You should see output like:
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA RTX A4000"
CUDA Driver Version / Runtime Version 12.3 / 12.1
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 16376 MBytes (17174683648 bytes)
(048) Multiprocessors, (128) CUDA Cores/MP: 6144 CUDA Cores
GPU Max Clock rate: 1560 MHz (1.56 GHz)
Memory Clock rate: 7001 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
...
Result = PASS
The key line: Result = PASS
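If you see this, your GPU, driver, and CUDA Toolkit are working together! deviceQuery doesn't load cuDNN, so the cuDNN part of the stack is exercised by the ONNX Runtime test that follows.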
Now the fun part - let's actually use CUDA from C#!
mkdir CudaTest
cd CudaTest
dotnet new console -n CudaTest
cd CudaTest
ONNX Runtime is the easiest way to use CUDA from C#.
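dotnet add package Microsoft.ML.OnnxRuntime.Gpu # Latest version
Why this package?
Microsoft.ML.OnnxRuntime - CPU only
Microsoft.ML.OnnxRuntime.Gpu - Includes CUDA support
Each ONNX Runtime GPU release is built against a specific CUDA and cuDNN major version. If the latest package fails to load the CUDA provider, check the CUDA Execution Provider requirements in the ONNX Runtime docs and pin a matching release with --version.
Create Program.cs: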
using Microsoft.ML.OnnxRuntime;
using System;
using System.Collections.Generic;
using System.Linq;

namespace CudaTest
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("=== CUDA Test from C# ===\n");

            // Test 1: Can we create a CUDA execution provider?
            Console.WriteLine("Test 1: CUDA Execution Provider");
            const int deviceId = 0; // First GPU
            try
            {
                // Provider options are passed as string key/value pairs
                using var cudaProviderOptions = new OrtCUDAProviderOptions();
                cudaProviderOptions.UpdateOptions(new Dictionary<string, string>
                {
                    ["device_id"] = deviceId.ToString()
                });
                using var sessionOptions = new SessionOptions();
                sessionOptions.AppendExecutionProvider_CUDA(cudaProviderOptions);
                Console.WriteLine("✅ CUDA execution provider created successfully");
                Console.WriteLine($" Device ID: {deviceId}");
            }
            catch (Exception ex)
            {
                Console.WriteLine($"❌ Failed to create CUDA provider: {ex.Message}");
                return;
            }

            // Test 2: Check available providers
            Console.WriteLine("\nTest 2: Available Execution Providers");
            var providers = OrtEnv.Instance().GetAvailableProviders();
            foreach (var provider in providers)
            {
                Console.WriteLine($" - {provider}");
            }
            if (providers.Contains("CUDAExecutionProvider"))
            {
                Console.WriteLine("✅ CUDA provider is available");
            }
            else
            {
                Console.WriteLine("❌ CUDA provider NOT available");
            }

            // Test 3: Get CUDA device count and info
            Console.WriteLine("\nTest 3: CUDA Device Information");
            try
            {
                // ONNX Runtime doesn't expose deviceQuery directly,
                // but we can test by configuring session options for a device
                using var opts = new SessionOptions();
                opts.AppendExecutionProvider_CUDA(deviceId); // Device 0
                Console.WriteLine("✅ Successfully configured for CUDA device 0");
                Console.WriteLine(" (Full device info requires native CUDA calls)");
            }
            catch (Exception ex)
            {
                Console.WriteLine($"❌ CUDA device configuration failed: {ex.Message}");
            }

            Console.WriteLine("\n=== Test Complete ===");
        }
    }
}
Code breakdown:
OrtCUDAProviderOptions - Configures CUDA execution
Device ID - Which GPU to use (0 for first GPU)
SessionOptions.AppendExecutionProvider_CUDA() - Tells ONNX Runtime to use GPU
OrtEnv.Instance().GetAvailableProviders() - Lists all available execution providers
dotnet run
Expected output:
=== CUDA Test from C# ===
Test 1: CUDA Execution Provider
✅ CUDA execution provider created successfully
Device ID: 0
Test 2: Available Execution Providers
- CUDAExecutionProvider
- CPUExecutionProvider
✅ CUDA provider is available
Test 3: CUDA Device Information
✅ Successfully configured for CUDA device 0
(Full device info requires native CUDA calls)
=== Test Complete ===
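For the CPU-only route mentioned at the start, a common pattern is to try the GPU path first and fall back when it throws, just like the try/catch in Test 1. The sketch below shows only the control flow with stand-in lambdas so it runs without ONNX Runtime or a GPU; `WithFallback` is a hypothetical helper, not an ONNX Runtime API.

```csharp
using System;

// Runs `primary`; if it throws, reports the failure and runs `fallback` instead.
static T WithFallback<T>(Func<T> primary, Func<T> fallback, Action<Exception> onFailure)
{
    try
    {
        return primary();
    }
    catch (Exception ex)
    {
        onFailure(ex);
        return fallback();
    }
}

// Stand-ins so the sketch runs anywhere. In a real app, `primary` would build
// SessionOptions and call AppendExecutionProvider_CUDA(0), and `fallback`
// would return plain SessionOptions (CPU).
string provider = WithFallback<string>(
    primary: () => throw new DllNotFoundException("simulated: CUDA DLLs not found"),
    fallback: () => "CPU",
    onFailure: ex => Console.WriteLine($"CUDA unavailable ({ex.Message}), falling back to CPU"));

Console.WriteLine($"Using provider: {provider}");
```

Logging the exception before falling back matters: a silent fallback hides exactly the kind of version mismatch this article is trying to catch.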
Cause: ONNX Runtime can't find CUDA DLLs.
Fix:
# Make sure CUDA bin is in PATH
$env:PATH += ";C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin"
# Verify cudnn DLL exists
Test-Path "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\cudnn64_8.dll"
# Try running again
dotnet run
Cause: Using wrong ONNX Runtime package (CPU-only).
Fix:
dotnet remove package Microsoft.ML.OnnxRuntime
dotnet add package Microsoft.ML.OnnxRuntime.Gpu --version 1.16.3
Cause: Driver is too old for CUDA 12.x.
Fix: Update NVIDIA driver to 545.xx or newer.
Let's do something real - run a tiny neural network on GPU vs CPU and compare speeds.
We'll use a simple MNIST digit recognition model (ONNX format).
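# Download sample model (the ONNX model zoo keeps validated models under validated/;
# adjust the URL if the repository layout has changed)
Invoke-WebRequest -Uri "https://github.com/onnx/models/raw/main/validated/vision/classification/mnist/model/mnist-8.onnx" -OutFile "mnist.onnx"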
Update Program.cs:
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace CudaTest
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("=== GPU vs CPU Inference Test ===\n");

            // Create dummy input (28x28 image flattened to 784 floats)
            var inputData = Enumerable.Range(0, 784).Select(i => (float)i / 784).ToArray();
            var tensor = new DenseTensor<float>(inputData, new[] { 1, 1, 28, 28 });
            var inputs = new List<NamedOnnxValue>
            {
                // "Input3" is the input name of the mnist-8 model
                NamedOnnxValue.CreateFromTensor("Input3", tensor)
            };

            // Test 1: CPU Inference
            Console.WriteLine("Test 1: CPU Inference");
            var cpuTime = TestInference(inputs, useCuda: false, iterations: 100);
            Console.WriteLine($" Average time: {cpuTime:F2}ms\n");

            // Test 2: GPU Inference
            Console.WriteLine("Test 2: GPU Inference");
            var gpuTime = TestInference(inputs, useCuda: true, iterations: 100);
            Console.WriteLine($" Average time: {gpuTime:F2}ms\n");

            // Compare
            Console.WriteLine("Comparison:");
            Console.WriteLine($" CPU: {cpuTime:F2}ms");
            Console.WriteLine($" GPU: {gpuTime:F2}ms");
            Console.WriteLine($" Speedup: {cpuTime / gpuTime:F2}x faster on GPU");
        }

        // Returns the average time per inference in milliseconds
        static double TestInference(List<NamedOnnxValue> inputs, bool useCuda, int iterations)
        {
            using var options = new SessionOptions();
            if (useCuda)
            {
                options.AppendExecutionProvider_CUDA(0);
            }
            using var session = new InferenceSession("mnist.onnx", options);

            // Warmup run (first run is always slower); dispose the result right away
            using (session.Run(inputs)) { }

            // Timed runs
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
            {
                using var results = session.Run(inputs);
                // Force evaluation by copying the output out of the result
                var output = results.First().AsEnumerable<float>().ToArray();
            }
            sw.Stop();
            return sw.Elapsed.TotalMilliseconds / iterations;
        }
    }
}
Code explanation:
DenseTensor - ONNX Runtime's way of representing multi-dimensional arrays
[1, 1, 28, 28] = batch_size=1, channels=1, height=28, width=28
NamedOnnxValue - Binds a tensor to an input name
Warmup run - First inference is always slower (model loading, optimization)
Timing methodology - Average over 100 iterations for stable results
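The warmup-then-average approach generalizes into a small reusable helper. This sketch also reports the median, which is less sensitive to the occasional slow run; `Measure` is a hypothetical helper, and the square-root workload is just a stand-in for a call such as `session.Run(inputs)`.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// Runs `action` `warmup` times untimed, then `iterations` times timed.
// Returns the mean and median time per call in milliseconds.
static (double MeanMs, double MedianMs) Measure(Action action, int iterations = 100, int warmup = 1)
{
    if (iterations <= 0) throw new ArgumentOutOfRangeException(nameof(iterations));

    for (int i = 0; i < warmup; i++) action();

    var samples = new double[iterations];
    for (int i = 0; i < iterations; i++)
    {
        long start = Stopwatch.GetTimestamp();
        action();
        samples[i] = (Stopwatch.GetTimestamp() - start) * 1000.0 / Stopwatch.Frequency;
    }

    Array.Sort(samples);
    double middle = iterations % 2 == 1
        ? samples[iterations / 2]
        : (samples[iterations / 2 - 1] + samples[iterations / 2]) / 2.0;
    return (samples.Average(), middle);
}

// Stand-in workload so the sketch runs without a model file.
var (meanMs, medianMs) = Measure(() => Math.Sqrt(12345.678));
Console.WriteLine($"mean {meanMs:F4} ms, median {medianMs:F4} ms");
```

If the mean sits well above the median, a few outliers (garbage collection, other processes) are skewing the average and more iterations may help.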
dotnet run
Expected output (your numbers will vary):
=== GPU vs CPU Inference Test ===
Test 1: CPU Inference
Average time: 0.42ms
Test 2: GPU Inference
Average time: 0.15ms
Comparison:
CPU: 0.42ms
GPU: 0.15ms
Speedup: 2.80x faster on GPU
Why such a small speedup?
We've proven the entire stack works:
✅ Driver installed correctly
✅ CUDA Toolkit accessible
✅ cuDNN integrated
✅ ONNX Runtime finds CUDA
✅ C# can run GPU-accelerated inference
Here's what happens during inference:
sequenceDiagram
participant App as C# Application
participant CPU as CPU Memory
participant GPU as GPU Memory
participant Compute as GPU Cores
App->>CPU: Create input tensor
CPU->>GPU: Transfer input (PCIe)
Note over GPU: Slow! ~16GB/s
GPU->>Compute: Execute model
Note over Compute: Fast! TFLOPS
Compute->>GPU: Write output
GPU->>CPU: Transfer output (PCIe)
Note over GPU: Slow again!
CPU->>App: Return results
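To put rough numbers on the transfer steps in the diagram, here is a small arithmetic sketch. It uses the ~16GB/s PCIe figure from the diagram and adds a fixed per-transfer latency of 10 microseconds, which is purely an illustrative assumption (real latency depends on the driver, the runtime, and the hardware). `TransferMicroseconds` is a hypothetical helper.

```csharp
using System;

// Time to move `bytes` over a link with the given bandwidth, plus a fixed per-transfer latency.
// The 10 microsecond default latency is an illustrative assumption, not a measured figure.
static double TransferMicroseconds(long bytes, double gigabytesPerSecond = 16.0, double fixedLatencyUs = 10.0)
{
    double bandwidthUs = bytes / (gigabytesPerSecond * 1e9) * 1e6;
    return fixedLatencyUs + bandwidthUs;
}

long mnistInput = 1 * 1 * 28 * 28 * sizeof(float); // one 28x28 float image: 3,136 bytes
long batchOf1024 = mnistInput * 1024;              // about 3.2 MB

Console.WriteLine($"Single image:  {TransferMicroseconds(mnistInput):F2} us");
Console.WriteLine($"Batch of 1024: {TransferMicroseconds(batchOf1024):F2} us");
```

Under these assumptions a single image costs about 10.2 microseconds, almost all of it fixed latency, while a batch of 1024 costs about 210.7 microseconds in total, or roughly 0.2 microseconds per image. That amortization is one reason tiny single-item workloads see little benefit from the GPU.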
Performance lessons:
graph TD
A[Inference Request] --> B{Model Size}
B -->|< 100MB| C{Batch Size}
B -->|> 100MB| D[Use GPU]
C -->|Single Item| E[Use CPU]
C -->|Batch > 10| D
D --> F[10-100x Faster]
E --> G[Lower Latency for Single]
class D,E choice
classDef choice stroke:#333,stroke-width:2px
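The decision graph translates almost directly into code. In this sketch the 100MB and batch-of-10 thresholds come from the graph; what happens for batches of 2 to 10, or a model of exactly 100MB, is not specified there, so sending those to the CPU is my assumption. `ChooseDevice` is a hypothetical helper.

```csharp
using System;

// Mirrors the decision graph: large models go to the GPU; small models go to the GPU
// only when batched. Batches of 2-10 and a model of exactly 100 MB are not covered
// by the graph; routing them to the CPU is an assumption of this sketch.
static string ChooseDevice(long modelSizeMb, int batchSize)
{
    if (modelSizeMb > 100) return "GPU";
    return batchSize > 10 ? "GPU" : "CPU";
}

Console.WriteLine(ChooseDevice(modelSizeMb: 4000, batchSize: 1)); // large LLM, single request: GPU
Console.WriteLine(ChooseDevice(modelSizeMb: 1, batchSize: 1));    // tiny model, single item: CPU
Console.WriteLine(ChooseDevice(modelSizeMb: 1, batchSize: 64));   // tiny model, big batch: GPU
```

Treat the thresholds as starting points and replace them with numbers from your own benchmark runs.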
Rule of thumb:
For our blog writing assistant:
We've successfully:
nvidia-smi and nvcc
Our development environment is now ready for AI workloads!
In Part 3, we'll dive deep into embeddings and vector databases:
We'll finally start working with actual blog content and seeing RAG in action!
# Check GPU
nvidia-smi
# Check CUDA
nvcc --version
where.exe nvcc
# Check cuDNN
Test-Path "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\cudnn64_8.dll"
# Check environment
$env:CUDA_PATH
$env:PATH -split ';' | Select-String CUDA
# Test from C#
dotnet run
| Error | Cause | Fix |
|---|---|---|
| nvidia-smi not found | Driver not installed | Install NVIDIA driver |
| nvcc not found | CUDA not in PATH | Add CUDA bin to PATH |
| DLL load failed | cuDNN missing | Copy cuDNN files to CUDA dir |
| CUDA provider not found | Wrong NuGet package | Use Microsoft.ML.OnnxRuntime.Gpu |
| Driver version insufficient | Old driver | Update to 545.xx+ |
| CUDA Version | cuDNN Version | ONNX Runtime | Driver Required |
|---|---|---|---|
| 12.1 | 8.9.7 | 1.16.x | 545.xx+ |
| 12.3 | 9.0.0 | 1.17.x | 546.xx+ |
| 11.8 | 8.9.2 | 1.15.x | 520.xx+ |
See you in Part 3, where we finally start building the semantic search engine!
© 2025 Scott Galloway — Unlicense — All content and source code on this site is free to use, copy, modify, and sell.