Back to "RAG for Implementers: CPU-Friendly Semantic Search with ONNX and Qdrant"

This is a viewer only at the moment see the article on how this works.

To update the preview hit Ctrl-Alt-R (or ⌘-Alt-R on Mac) or Enter to refresh. The Save icon lets you save the markdown file to disk

This is a preview from the server running through my markdig pipeline

AI-Article ASP.NET Machine Learning ONNX Qdrant RAG Semantic Search Vector Search

RAG for Implementers: CPU-Friendly Semantic Search with ONNX and Qdrant

Tuesday, 25 November 2025

Introduction

📖 Part of the RAG Series: This is Part 4a - core implementation:

Parts 1-3 explain why semantic search works. This article shows how to build the foundation - a zero-cost, CPU-friendly implementation using ONNX Runtime and Qdrant. Part 4b covers the search UI and hybrid search implementation, and Part 5 covers production auto-indexing.

The Challenge: Most semantic search solutions require expensive GPU infrastructure or costly managed services. What if you're an indie developer running a blog on a modest VPS?

The Solution: A fully functional semantic search system that runs entirely on CPU, using free open-source tools. This is the exact setup running on this blog - zero extra cost beyond existing hosting.

Core Concepts

These concepts are covered in depth in the RAG series, but here's what you need to know for this implementation:

Embeddings: Text as Numbers

Embeddings are vectors (arrays of numbers) that capture the meaning of text. Similar meanings produce similar vectors - that's the magic.

graph TD
    A["Text: 'The cat sat on the mat'"] --> B[Embedding Model]
    B --> C["Vector: [0.25, -0.18, 0.91, ... 384 more numbers]"]
    D["Text: 'A feline rested on the carpet'"] --> B
    B --> E["Vector: [0.27, -0.16, 0.89, ... similar numbers!]"]

    C -.Similar vectors = similar meaning.-> E

    style A stroke:#10b981,stroke-width:2px
    style D stroke:#10b981,stroke-width:2px
    style B stroke:#6366f1,stroke-width:3px
    style C stroke:#f59e0b,stroke-width:2px
    style E stroke:#f59e0b,stroke-width:2px

Key insight: Texts with similar meanings will have similar vectors (embeddings). This is how we can find "related" content - we're literally measuring the distance between meanings!

Understanding Cosine Similarity

Cosine similarity measures the angle between two vectors - if they point in similar directions, they're semantically similar:

flowchart LR
    subgraph "Vector Space (simplified to 2D)"
        direction TB
        A["'Docker tutorial'"] -.-> B((0.85))
        C["'Container deployment'"] -.-> B
        D["'Cooking recipes'"] -.-> E((0.12))
        A -.-> E
    end

    B --> F["High Similarity<br/>Related content!"]
    E --> G["Low Similarity<br/>Different topics"]

    style A stroke:#10b981,stroke-width:2px
    style C stroke:#10b981,stroke-width:2px
    style D stroke:#f59e0b,stroke-width:2px
    style B stroke:#22c55e,stroke-width:3px
    style E stroke:#ef4444,stroke-width:3px
    style F stroke:#22c55e,stroke-width:2px
    style G stroke:#ef4444,stroke-width:2px

The formula: similarity = (A · B) / (||A|| × ||B||) - but since we L2-normalize our vectors, it simplifies to just the dot product!
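
To make the simplification concrete, here's a minimal C# sketch (illustrative, not from the article's codebase):

static float CosineSimilarity(float[] a, float[] b)
{
    float dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];     // A · B
        normA += a[i] * a[i];   // builds ||A||²
        normB += b[i] * b[i];   // builds ||B||²
    }
    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}

// For L2-normalized vectors, ||A|| = ||B|| = 1, so only the dot product remains.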

What is ONNX?

ONNX (Open Neural Network Exchange) is an open standard format for machine learning models that allows them to run efficiently across different platforms. Think of it like a universal translator for AI models. The ONNX Runtime is Microsoft's high-performance inference engine that executes these models.

Why ONNX for our use case:

flowchart LR
    subgraph "ONNX Inference Pipeline"
        A[Raw Text] --> B[Tokenizer]
        B --> C["Tokens: [CLS] the cat sat [SEP]"]
        C --> D[Token IDs: 101 1996 4937 2068 102]
        D --> E[ONNX Runtime]
        E --> F[384-dim Vector]
        F --> G[L2 Normalize]
        G --> H[Final Embedding]
    end

    style A stroke:#10b981,stroke-width:2px
    style B stroke:#f59e0b,stroke-width:2px
    style C stroke:#f59e0b,stroke-width:2px
    style D stroke:#f59e0b,stroke-width:2px
    style E stroke:#6366f1,stroke-width:3px
    style F stroke:#8b5cf6,stroke-width:2px
    style G stroke:#8b5cf6,stroke-width:2px
    style H stroke:#ef4444,stroke-width:2px

What is Qdrant?

Qdrant is an open-source vector database - basically a database optimized for storing and searching these embedding vectors. For a deep dive into Qdrant's concepts, configuration, and C# integration, see Self-Hosted Vector Databases with Qdrant. While you could store vectors in PostgreSQL, Qdrant is purpose-built for this and offers:

flowchart TB
    subgraph "Qdrant Vector Storage"
        direction TB
        A[Collection: blog_posts] --> B[Point 1]
        A --> C[Point 2]
        A --> D[Point N...]

        B --> B1["Vector: [0.12, -0.08, ...]"]
        B --> B2["Payload: {slug, title, language}"]

        C --> C1["Vector: [0.25, 0.14, ...]"]
        C --> C2["Payload: {slug, title, language}"]
    end

    subgraph "Vector Search"
        E[Query Vector] --> F[HNSW Index]
        F --> G[Cosine Similarity]
        G --> H[Top K Results]
    end

    style A stroke:#ef4444,stroke-width:3px
    style B stroke:#8b5cf6,stroke-width:2px
    style C stroke:#8b5cf6,stroke-width:2px
    style D stroke:#8b5cf6,stroke-width:2px
    style B1 stroke:#f59e0b,stroke-width:2px
    style B2 stroke:#10b981,stroke-width:2px
    style C1 stroke:#f59e0b,stroke-width:2px
    style C2 stroke:#10b981,stroke-width:2px
    style E stroke:#6366f1,stroke-width:2px
    style F stroke:#ec4899,stroke-width:3px
    style G stroke:#ec4899,stroke-width:2px
    style H stroke:#10b981,stroke-width:2px

Architecture Overview

Here's how our semantic search system fits together:

flowchart TB
    subgraph "Content Ingestion"
        A[Blog Post Markdown] --> B[Extract Plain Text]
        B --> C[ONNX Embedding Service]
        C --> D[Generate 384-dim Vector]
        D --> E[Qdrant Vector Store]
    end

    subgraph "Search Flow"
        F[User Query] --> G[ONNX Embedding Service]
        G --> H[Generate Query Vector]
        H --> I[Qdrant Search]
        E -.Vector Similarity.-> I
        I --> J[Ranked Results]
    end

    subgraph "Related Posts"
        K[Current Blog Post] --> L[Get Post Vector from Qdrant]
        L --> M[Find Similar Vectors]
        E -.-> M
        M --> N[Top 5 Related Posts]
    end

    style A stroke:#10b981,stroke-width:2px
    style B stroke:#10b981,stroke-width:2px
    style C stroke:#6366f1,stroke-width:3px
    style D stroke:#f59e0b,stroke-width:2px
    style E stroke:#ef4444,stroke-width:3px
    style F stroke:#10b981,stroke-width:2px
    style G stroke:#6366f1,stroke-width:3px
    style H stroke:#f59e0b,stroke-width:2px
    style I stroke:#ef4444,stroke-width:2px
    style J stroke:#8b5cf6,stroke-width:2px
    style K stroke:#10b981,stroke-width:2px
    style L stroke:#ef4444,stroke-width:2px
    style M stroke:#ef4444,stroke-width:2px
    style N stroke:#8b5cf6,stroke-width:2px

The flow in plain English:

  1. Indexing: When you write a blog post, we convert it to a vector and store it in Qdrant
  2. Searching: When someone searches, we convert their query to a vector and find similar vectors in Qdrant
  3. Related Posts: For any blog post, we can find other posts with similar vectors
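
In code, those three flows map directly onto the ISemanticSearchService methods we build later in this article:

// 1. Indexing - embed the post and store its vector
await semanticSearch.IndexPostAsync(document);

// 2. Searching - embed the query, find nearby vectors
var results = await semanticSearch.SearchAsync("docker deployment", limit: 10);

// 3. Related posts - find vectors near an existing post's vector
var related = await semanticSearch.GetRelatedPostsAsync("my-post-slug", "en", limit: 5);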

Project Structure

We've created a clean, modular structure:

Mostlylucid.SemanticSearch/
├── Config/
│   └── SemanticSearchConfig.cs      # Configuration settings
├── Models/
│   ├── BlogPostDocument.cs          # Document model for indexing
│   └── SearchResult.cs               # Search result model
├── Services/
│   ├── IEmbeddingService.cs         # Embedding interface
│   ├── OnnxEmbeddingService.cs      # ONNX-based embeddings
│   ├── IVectorStoreService.cs       # Vector store interface
│   ├── QdrantVectorStoreService.cs  # Qdrant implementation
│   ├── ISemanticSearchService.cs    # High-level search interface
│   └── SemanticSearchService.cs     # Orchestration service
├── Extensions/
│   └── ServiceCollectionExtensions.cs  # DI registration
├── download-models.sh               # Model download script
└── README.md

Implementation

Step 1: Setting Up the Project

First, create the new class library:

dotnet new classlib -n Mostlylucid.SemanticSearch -f net9.0
dotnet sln add Mostlylucid.SemanticSearch

Add the necessary NuGet packages:

cd Mostlylucid.SemanticSearch
dotnet add package Microsoft.Extensions.Logging.Abstractions
dotnet add package Microsoft.ML.OnnxRuntime --version 1.21.1
dotnet add package Qdrant.Client --version 1.14.0
dotnet add reference ../Mostlylucid.Shared/Mostlylucid.Shared.csproj

Step 2: Configuration

Let's set up our configuration class. We're using the IConfigSection pattern that's used throughout Mostlylucid:

using Mostlylucid.Shared.Config;

namespace Mostlylucid.SemanticSearch.Config;

/// <summary>
/// Configuration for semantic search functionality
/// </summary>
public class SemanticSearchConfig : IConfigSection
{
    public static string Section => "SemanticSearch";

    /// <summary>
    /// Enable or disable semantic search
    /// </summary>
    public bool Enabled { get; set; } = true;

    /// <summary>
    /// Qdrant server URL (e.g., http://localhost:6333)
    /// </summary>
    public string QdrantUrl { get; set; } = "http://localhost:6333";

    /// <summary>
    /// Optional read-only API key for Qdrant (used for search operations)
    /// </summary>
    public string? ReadApiKey { get; set; }

    /// <summary>
    /// Optional read-write API key for Qdrant (used for indexing operations)
    /// </summary>
    public string? WriteApiKey { get; set; }

    /// <summary>
    /// Collection name in Qdrant for blog posts
    /// </summary>
    public string CollectionName { get; set; } = "blog_posts";

    /// <summary>
    /// Path to the ONNX embedding model file
    /// </summary>
    public string EmbeddingModelPath { get; set; } = "models/all-MiniLM-L6-v2.onnx";

    /// <summary>
    /// Path to the tokenizer vocabulary file
    /// </summary>
    public string VocabPath { get; set; } = "models/vocab.txt";

    /// <summary>
    /// Embedding vector size (384 for all-MiniLM-L6-v2)
    /// </summary>
    public int VectorSize { get; set; } = 384;

    /// <summary>
    /// Number of related posts to return
    /// </summary>
    public int RelatedPostsCount { get; set; } = 5;

    /// <summary>
    /// Minimum similarity score (0-1) for related posts
    /// </summary>
    public float MinimumSimilarityScore { get; set; } = 0.5f;

    /// <summary>
    /// Number of search results to return
    /// </summary>
    public int SearchResultsCount { get; set; } = 10;
}

Why separate API keys? Security! Your read key can be used in public-facing search endpoints, while your write key stays server-side for admin operations only.
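
As a minimal sketch of how the split plays out (QdrantClient takes an optional apiKey argument; Qdrant itself supports a separate read-only key in its service config):

// Public-facing search paths use the read-only key
var readClient = new QdrantClient("localhost", 6334, https: false, apiKey: config.ReadApiKey);

// Indexing/admin paths keep the read-write key server-side
var writeClient = new QdrantClient("localhost", 6334, https: false, apiKey: config.WriteApiKey);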

Add this to your appsettings.json:

{
  "SemanticSearch": {
    "Enabled": false,
    "QdrantUrl": "http://localhost:6333",
    "ReadApiKey": "",
    "WriteApiKey": "",
    "CollectionName": "blog_posts",
    "EmbeddingModelPath": "models/all-MiniLM-L6-v2.onnx",
    "VocabPath": "models/vocab.txt",
    "VectorSize": 384,
    "RelatedPostsCount": 5,
    "MinimumSimilarityScore": 0.5,
    "SearchResultsCount": 10
  }
}

Step 3: The ONNX Embedding Service

This is where the magic happens. We're using the all-MiniLM-L6-v2 model, which is specifically designed for semantic similarity tasks and runs efficiently on CPU.

Why this model?

  • Small size (~90MB)
  • Fast inference on CPU (~50-100ms per embedding)
  • Good quality embeddings (384 dimensions)
  • Trained on over 1 billion sentence pairs

Here's the complete implementation:

using Microsoft.Extensions.Logging;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using Mostlylucid.SemanticSearch.Config;
using System.Text.RegularExpressions;

namespace Mostlylucid.SemanticSearch.Services;

public class OnnxEmbeddingService : IEmbeddingService, IDisposable
{
    private readonly ILogger<OnnxEmbeddingService> _logger;
    private readonly SemanticSearchConfig _config;
    private readonly InferenceSession? _session;
    private readonly Dictionary<string, int> _vocabulary;
    private readonly SemaphoreSlim _semaphore = new(1, 1);
    private bool _disposed;

    private const int MaxSequenceLength = 256;
    private const string PadToken = "[PAD]";
    private const string UnkToken = "[UNK]";
    private const string ClsToken = "[CLS]";
    private const string SepToken = "[SEP]";

    public OnnxEmbeddingService(
        ILogger<OnnxEmbeddingService> logger,
        SemanticSearchConfig config)
    {
        _logger = logger;
        _config = config;
        _vocabulary = new Dictionary<string, int>();

        if (!_config.Enabled)
        {
            _logger.LogInformation("Semantic search is disabled");
            return;
        }

        try
        {
            // Check if model file exists
            if (!File.Exists(_config.EmbeddingModelPath))
            {
                _logger.LogWarning("Embedding model not found at {Path}. Semantic search will be disabled.",
                    _config.EmbeddingModelPath);
                return;
            }

            // Load vocabulary if it exists
            if (File.Exists(_config.VocabPath))
            {
                LoadVocabulary(_config.VocabPath);
            }

            // Create ONNX session with CPU execution provider
            var sessionOptions = new SessionOptions
            {
                ExecutionMode = ExecutionMode.ORT_SEQUENTIAL,
                GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_ALL
            };

            _session = new InferenceSession(_config.EmbeddingModelPath, sessionOptions);
            _logger.LogInformation("ONNX embedding model loaded successfully from {Path}",
                _config.EmbeddingModelPath);
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Failed to initialize ONNX embedding service");
        }
    }

    private void LoadVocabulary(string vocabPath)
    {
        var lines = File.ReadAllLines(vocabPath);
        for (int i = 0; i < lines.Length; i++)
        {
            var token = lines[i].Trim();
            if (!string.IsNullOrEmpty(token))
            {
                _vocabulary[token] = i;
            }
        }
        _logger.LogInformation("Loaded vocabulary with {Count} tokens", _vocabulary.Count);
    }

    public async Task<float[]> GenerateEmbeddingAsync(string text, CancellationToken cancellationToken = default)
    {
        if (_session == null || !_config.Enabled)
        {
            return new float[_config.VectorSize];
        }

        if (string.IsNullOrWhiteSpace(text))
        {
            return new float[_config.VectorSize];
        }

        // Use semaphore to prevent concurrent ONNX inference (not thread-safe)
        await _semaphore.WaitAsync(cancellationToken);
        try
        {
            return await Task.Run(() => GenerateEmbedding(text), cancellationToken);
        }
        finally
        {
            _semaphore.Release();
        }
    }

    private float[] GenerateEmbedding(string text)
    {
        try
        {
            // Tokenize the input text
            var tokens = Tokenize(text);

            // Create input tensors for ONNX model
            var inputIds = CreateInputTensor(tokens);
            var attentionMask = CreateAttentionMaskTensor(tokens.Length);
            var tokenTypeIds = CreateTokenTypeIdsTensor(tokens.Length);

            // Run inference
            var inputs = new List<NamedOnnxValue>
            {
                NamedOnnxValue.CreateFromTensor("input_ids", inputIds),
                NamedOnnxValue.CreateFromTensor("attention_mask", attentionMask),
                NamedOnnxValue.CreateFromTensor("token_type_ids", tokenTypeIds)
            };

            using var results = _session!.Run(inputs);

            // Extract the output tensor. The standard Optimum export of
            // all-MiniLM-L6-v2 returns last_hidden_state with shape
            // [1, seq_len, 384] (per-token embeddings), so we mean-pool it
            // into a single sentence vector. If your export bakes in a
            // pooling layer, the output is already [1, 384].
            var output = results.First().AsTensor<float>();
            var embedding = output.Dimensions.Length == 3
                ? MeanPool(output, Math.Min(tokens.Count, MaxSequenceLength))
                : output.ToArray();

            // Normalize the vector (L2 normalization)
            return NormalizeVector(embedding);
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Error generating embedding for text: {Text}",
                text[..Math.Min(100, text.Length)]);
            return new float[_config.VectorSize];
        }
    }

    private List<int> Tokenize(string text)
    {
        // Simple whitespace + punctuation tokenization
        var tokens = new List<int>();

        // Add [CLS] token at the start
        if (_vocabulary.TryGetValue(ClsToken, out var clsId))
            tokens.Add(clsId);

        // Tokenize the text
        var words = Regex.Split(text.ToLowerInvariant(), @"(\W+)")
            .Where(w => !string.IsNullOrWhiteSpace(w))
            .Take(MaxSequenceLength - 2); // Leave room for [CLS] and [SEP]

        foreach (var word in words)
        {
            if (_vocabulary.Count > 0)
            {
                if (_vocabulary.TryGetValue(word, out var tokenId))
                    tokens.Add(tokenId);
                else if (_vocabulary.TryGetValue(UnkToken, out var unkId))
                    tokens.Add(unkId);
            }
            else
            {
                // Fallback: use hash code as token ID
                tokens.Add(Math.Abs(word.GetHashCode()) % 30000);
            }
        }

        // Add [SEP] token at the end
        if (_vocabulary.TryGetValue(SepToken, out var sepId))
            tokens.Add(sepId);

        return tokens;
    }

    private Tensor<long> CreateInputTensor(List<int> tokens)
    {
        var length = Math.Min(tokens.Count, MaxSequenceLength);

        // DenseTensor wraps a flat buffer; we allocate a 1D array and hand
        // the [1, MaxSequenceLength] shape to the tensor separately
        var tensorData = new long[MaxSequenceLength];

        for (int i = 0; i < length; i++)
        {
            tensorData[i] = tokens[i];
        }

        // Pad the rest
        var padId = _vocabulary.TryGetValue(PadToken, out var id) ? id : 0;
        for (int i = length; i < MaxSequenceLength; i++)
        {
            tensorData[i] = padId;
        }

        return new DenseTensor<long>(tensorData, new[] { 1, MaxSequenceLength });
    }

    private Tensor<long> CreateAttentionMaskTensor(int actualLength)
    {
        var length = Math.Min(actualLength, MaxSequenceLength);
        var tensorData = new long[MaxSequenceLength];

        for (int i = 0; i < length; i++)
        {
            tensorData[i] = 1; // Attend to actual tokens; padding stays 0
        }

        return new DenseTensor<long>(tensorData, new[] { 1, MaxSequenceLength });
    }

    private Tensor<long> CreateTokenTypeIdsTensor(int actualLength)
    {
        // All zeros for a single sentence
        var tensorData = new long[MaxSequenceLength];
        return new DenseTensor<long>(tensorData, new[] { 1, MaxSequenceLength });
    }

    private float[] NormalizeVector(float[] vector)
    {
        // L2 normalization
        var sumOfSquares = vector.Sum(v => v * v);
        var magnitude = MathF.Sqrt(sumOfSquares);

        if (magnitude > 0)
        {
            for (int i = 0; i < vector.Length; i++)
            {
                vector[i] /= magnitude;
            }
        }

        return vector;
    }
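
    /// <summary>
    /// Mean-pools per-token embeddings (shape [1, seq_len, hidden]) into a
    /// single sentence vector, averaging only over the real (unpadded) tokens.
    /// Needed because the standard ONNX export emits last_hidden_state rather
    /// than a pooled sentence embedding.
    /// </summary>
    private static float[] MeanPool(Tensor<float> tokenEmbeddings, int tokenCount)
    {
        var hiddenSize = tokenEmbeddings.Dimensions[2];
        var pooled = new float[hiddenSize];

        if (tokenCount == 0) return pooled;

        for (int t = 0; t < tokenCount; t++)
        {
            for (int h = 0; h < hiddenSize; h++)
            {
                pooled[h] += tokenEmbeddings[0, t, h];
            }
        }

        for (int h = 0; h < hiddenSize; h++)
        {
            pooled[h] /= tokenCount;
        }

        return pooled;
    }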

    public void Dispose()
    {
        if (_disposed) return;

        _session?.Dispose();
        _semaphore?.Dispose();
        _disposed = true;

        GC.SuppressFinalize(this);
    }
}

Key points for junior devs:

  1. Tokenization: We're breaking text into smaller pieces (tokens) that the model can understand. Note this is a simplified word-level lookup - the real sentence-transformers tokenizer uses WordPiece subword splitting, so rare words here fall back to [UNK] and embedding quality will be slightly lower
  2. Tensors: These are multi-dimensional arrays that ONNX models work with
  3. Attention Mask: Tells the model which parts of the input are actual content vs. padding
  4. L2 Normalization: Makes all vectors have the same "length", so we can compare them fairly
  5. Semaphore: Ensures thread safety (ONNX isn't thread-safe by default)
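
Usage is a one-liner; the result is ready for direct cosine comparison:

// Returns a 384-dimension, L2-normalized vector (or a zero vector when disabled)
var embedding = await embeddingService.GenerateEmbeddingAsync("The cat sat on the mat");
Console.WriteLine(embedding.Length); // 384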

Step 4: Qdrant Vector Store

Now let's implement the vector storage and search:

using Microsoft.Extensions.Logging;
using Mostlylucid.SemanticSearch.Config;
using Mostlylucid.SemanticSearch.Models;
using Qdrant.Client;
using Qdrant.Client.Grpc;

namespace Mostlylucid.SemanticSearch.Services;

public class QdrantVectorStoreService : IVectorStoreService
{
    private readonly ILogger<QdrantVectorStoreService> _logger;
    private readonly SemanticSearchConfig _config;
    private readonly QdrantClient? _client;
    private bool _collectionInitialized;

    public QdrantVectorStoreService(
        ILogger<QdrantVectorStoreService> logger,
        SemanticSearchConfig config)
    {
        _logger = logger;
        _config = config;

        if (!_config.Enabled)
        {
            _logger.LogInformation("Semantic search is disabled");
            return;
        }

        try
        {
            var uri = new Uri(_config.QdrantUrl);
            var host = uri.Host;

            // QdrantClient speaks gRPC, so default to the 6334 gRPC port when
            // no explicit port is given. Note: 6333 is Qdrant's HTTP port -
            // pointing the gRPC client at it will fail to connect.
            var port = uri.IsDefaultPort ? 6334 : uri.Port;

            _client = new QdrantClient(host, port, https: uri.Scheme == "https");
            _logger.LogInformation("Connected to Qdrant at {Host}:{Port}", host, port);
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Failed to connect to Qdrant at {Url}", _config.QdrantUrl);
        }
    }

    public async Task InitializeCollectionAsync(CancellationToken cancellationToken = default)
    {
        if (_client == null || !_config.Enabled || _collectionInitialized)
            return;

        try
        {
            // ListCollectionsAsync returns the collection names as strings
            var collections = await _client.ListCollectionsAsync(cancellationToken);
            var collectionExists = collections.Contains(_config.CollectionName);

            if (!collectionExists)
            {
                _logger.LogInformation("Creating collection {CollectionName}", _config.CollectionName);

                await _client.CreateCollectionAsync(
                    collectionName: _config.CollectionName,
                    vectorsConfig: new VectorParams
                    {
                        Size = (ulong)_config.VectorSize,
                        Distance = Distance.Cosine // Cosine similarity for semantic search
                    },
                    cancellationToken: cancellationToken
                );

                _logger.LogInformation("Collection {CollectionName} created successfully", _config.CollectionName);
            }

            _collectionInitialized = true;
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Failed to initialize collection {CollectionName}", _config.CollectionName);
            throw;
        }
    }

    public async Task<List<SearchResult>> FindRelatedPostsAsync(
        string slug,
        string language,
        int limit = 5,
        CancellationToken cancellationToken = default)
    {
        if (_client == null || !_config.Enabled)
            return new List<SearchResult>();

        try
        {
            // Find the document by slug and language
            var scrollResults = await _client.ScrollAsync(
                collectionName: _config.CollectionName,
                filter: new Filter
                {
                    Must =
                    {
                        new Condition
                        {
                            Field = new FieldCondition
                            {
                                Key = "slug",
                                Match = new Match { Keyword = slug }
                            }
                        },
                        new Condition
                        {
                            Field = new FieldCondition
                            {
                                Key = "language",
                                Match = new Match { Keyword = language }
                            }
                        }
                    }
                },
                limit: 1,
                vectorsSelector: true, // vectors aren't returned by default - we need this point's vector back
                cancellationToken: cancellationToken
            );

            // ScrollAsync returns a ScrollResponse; the points live on .Result
            var point = scrollResults.Result.FirstOrDefault();
            if (point == null)
            {
                _logger.LogWarning("Post {Slug} ({Language}) not found in vector store", slug, language);
                return new List<SearchResult>();
            }

            // Use the document's vector to find similar posts
            var searchResults = await _client.SearchAsync(
                collectionName: _config.CollectionName,
                vector: point.Vectors.Vector.Data.ToArray(),
                limit: (ulong)(limit + 1), // +1 because the first result will be the post itself
                scoreThreshold: _config.MinimumSimilarityScore,
                cancellationToken: cancellationToken
            );

            // Filter out the original post and return top N similar posts
            return searchResults
                .Where(r => r.Payload["slug"].StringValue != slug || r.Payload["language"].StringValue != language)
                .Take(limit)
                .Select(result => new SearchResult
                {
                    Slug = result.Payload["slug"].StringValue,
                    Title = result.Payload["title"].StringValue,
                    Language = result.Payload["language"].StringValue,
                    Categories = result.Payload.TryGetValue("categories", out var cats)
                        ? cats.ListValue.Values.Select(v => v.StringValue).ToList()
                        : new List<string>(),
                    Score = result.Score,
                    PublishedDate = DateTime.Parse(result.Payload["published_date"].StringValue)
                })
                .ToList();
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Failed to find related posts for {Slug} ({Language})", slug, language);
            return new List<SearchResult>();
        }
    }

    // ... Additional methods for IndexDocument, Search, Delete, etc.
}

What's happening here:

  1. Cosine Distance: We're using cosine similarity, which is perfect for comparing normalized vectors
  2. Metadata Storage: Qdrant lets us store extra data (payload) alongside vectors
  3. Filtering: We can filter results by metadata before comparing vectors
  4. Score Threshold: Only return results above a certain similarity score
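
The elided SearchAsync follows the same shape as the related-posts query above. A minimal sketch (the payload keys match what we store at indexing time):

public async Task<List<SearchResult>> SearchAsync(
    float[] queryVector,
    int limit,
    float minScore,
    CancellationToken cancellationToken = default)
{
    if (_client == null || !_config.Enabled)
        return new List<SearchResult>();

    var hits = await _client.SearchAsync(
        collectionName: _config.CollectionName,
        vector: queryVector,
        limit: (ulong)limit,
        scoreThreshold: minScore,
        cancellationToken: cancellationToken);

    return hits.Select(hit => new SearchResult
    {
        Slug = hit.Payload["slug"].StringValue,
        Title = hit.Payload["title"].StringValue,
        Language = hit.Payload["language"].StringValue,
        Score = hit.Score
    }).ToList();
}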

Step 5: The Orchestration Service

This high-level service ties everything together:

using Microsoft.Extensions.Logging;
using Mostlylucid.SemanticSearch.Config;
using Mostlylucid.SemanticSearch.Models;
using System.Security.Cryptography;
using System.Text;

namespace Mostlylucid.SemanticSearch.Services;

public class SemanticSearchService : ISemanticSearchService
{
    private readonly ILogger<SemanticSearchService> _logger;
    private readonly SemanticSearchConfig _config;
    private readonly IEmbeddingService _embeddingService;
    private readonly IVectorStoreService _vectorStoreService;

    public SemanticSearchService(
        ILogger<SemanticSearchService> logger,
        SemanticSearchConfig config,
        IEmbeddingService embeddingService,
        IVectorStoreService vectorStoreService)
    {
        _logger = logger;
        _config = config;
        _embeddingService = embeddingService;
        _vectorStoreService = vectorStoreService;
    }

    public async Task IndexPostAsync(BlogPostDocument document, CancellationToken cancellationToken = default)
    {
        if (!_config.Enabled)
            return;

        try
        {
            // Prepare text for embedding: combine title and content
            // We give more weight to the title by including it twice
            var textToEmbed = $"{document.Title}. {document.Title}. {document.Content}";

            // Truncate to reasonable length (embedding models have token limits)
            const int maxLength = 2000;
            if (textToEmbed.Length > maxLength)
            {
                textToEmbed = textToEmbed[..maxLength];
            }

            // Generate embedding
            var embedding = await _embeddingService.GenerateEmbeddingAsync(textToEmbed, cancellationToken);

            // Compute content hash if not provided
            if (string.IsNullOrEmpty(document.ContentHash))
            {
                document.ContentHash = ComputeContentHash(document.Content);
            }

            // Store in vector database
            await _vectorStoreService.IndexDocumentAsync(document, embedding, cancellationToken);

            _logger.LogInformation("Indexed post {Slug} ({Language})", document.Slug, document.Language);
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Failed to index post {Slug} ({Language})", document.Slug, document.Language);
        }
    }

    public async Task<List<SearchResult>> SearchAsync(
        string query,
        int limit = 10,
        CancellationToken cancellationToken = default)
    {
        if (!_config.Enabled || string.IsNullOrWhiteSpace(query))
            return new List<SearchResult>();

        try
        {
            // Generate embedding for the search query
            var queryEmbedding = await _embeddingService.GenerateEmbeddingAsync(query, cancellationToken);

            // Search in vector store
            var results = await _vectorStoreService.SearchAsync(
                queryEmbedding,
                Math.Min(limit, _config.SearchResultsCount),
                _config.MinimumSimilarityScore,
                cancellationToken);

            _logger.LogDebug("Search for '{Query}' returned {Count} results", query, results.Count);

            return results;
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Search failed for query '{Query}'", query);
            return new List<SearchResult>();
        }
    }

    public async Task<List<SearchResult>> GetRelatedPostsAsync(
        string slug,
        string language,
        int limit = 5,
        CancellationToken cancellationToken = default)
    {
        if (!_config.Enabled)
            return new List<SearchResult>();

        try
        {
            var results = await _vectorStoreService.FindRelatedPostsAsync(
                slug,
                language,
                Math.Min(limit, _config.RelatedPostsCount),
                cancellationToken);

            _logger.LogDebug("Found {Count} related posts for {Slug} ({Language})",
                results.Count, slug, language);

            return results;
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Failed to get related posts for {Slug} ({Language})", slug, language);
            return new List<SearchResult>();
        }
    }

    private string ComputeContentHash(string content)
    {
        using var sha256 = SHA256.Create();
        var bytes = Encoding.UTF8.GetBytes(content);
        var hashBytes = sha256.ComputeHash(bytes);
        return Convert.ToBase64String(hashBytes);
    }
}

Step 6: Dependency Injection Setup

Register everything in the DI container:

using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Mostlylucid.SemanticSearch.Config;
using Mostlylucid.SemanticSearch.Services;
using Mostlylucid.Shared.Config;

namespace Mostlylucid.SemanticSearch.Extensions;

public static class ServiceCollectionExtensions
{
    public static void AddSemanticSearch(
        this IServiceCollection services,
        IConfiguration configuration)
    {
        // Bind configuration using POCO pattern
        services.ConfigurePOCO<SemanticSearchConfig>(
            configuration.GetSection(SemanticSearchConfig.Section));

        // Register services as singletons for efficiency
        services.AddSingleton<IEmbeddingService, OnnxEmbeddingService>();
        services.AddSingleton<IVectorStoreService, QdrantVectorStoreService>();
        services.AddSingleton<ISemanticSearchService, SemanticSearchService>();
    }
}

In your Program.cs:

using Mostlylucid.SemanticSearch.Extensions;
using Mostlylucid.SemanticSearch.Services;

// Add services
services.AddSemanticSearch(config);

// Initialize after building the app
using (var scope = app.Services.CreateScope())
{
    var semanticSearch = scope.ServiceProvider.GetRequiredService<ISemanticSearchService>();
    await semanticSearch.InitializeAsync();
}

Setting Up Infrastructure

Docker Compose for Qdrant

Create a separate docker-compose file for semantic search services:

version: '3.8'

services:
  qdrant:
    image: qdrant/qdrant:latest
    container_name: mostlylucid-qdrant
    restart: unless-stopped
    ports:
      - "6333:6333"  # HTTP API
      - "6334:6334"  # gRPC API
    volumes:
      - qdrant_storage:/qdrant/storage
    environment:
      - QDRANT__SERVICE__HTTP_PORT=6333
      - QDRANT__SERVICE__GRPC_PORT=6334
    networks:
      - mostlylucid_network
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:6333/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

volumes:
  qdrant_storage:
    driver: local

networks:
  mostlylucid_network:
    name: mostlylucidweb_app_network
    external: true

Start it with:

docker-compose -f semantic-search-docker-compose.yml up -d

Download the Embedding Model

We're using the all-MiniLM-L6-v2 model from Hugging Face's Sentence Transformers library. This model is specifically trained on semantic similarity tasks and produces 384-dimensional embeddings.

The service automatically downloads the model from Hugging Face on first run if it doesn't exist:

// In OnnxEmbeddingService.cs
private const string ModelUrl = "https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/onnx/model.onnx";
private const string VocabUrl = "https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/vocab.txt";

public async Task EnsureInitializedAsync(CancellationToken cancellationToken = default)
{
    if (_initialized || !_config.Enabled) return;

    // Download model if not exists
    if (!File.Exists(_config.EmbeddingModelPath))
    {
        _logger.LogInformation("Downloading ONNX embedding model to {Path}...", _config.EmbeddingModelPath);
        await DownloadFileAsync(ModelUrl, _config.EmbeddingModelPath, cancellationToken);
    }

    // Download vocab if not exists
    if (!File.Exists(_config.VocabPath))
    {
        _logger.LogInformation("Downloading vocabulary file to {Path}...", _config.VocabPath);
        await DownloadFileAsync(VocabUrl, _config.VocabPath, cancellationToken);
    }

    // Initialize ONNX session...
}
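
DownloadFileAsync isn't shown above; here's a minimal sketch using HttpClient, streamed so the ~90MB model never sits fully in memory:

private static async Task DownloadFileAsync(string url, string destinationPath, CancellationToken cancellationToken)
{
    // Make sure the target directory (e.g. models/) exists before writing
    var directory = Path.GetDirectoryName(destinationPath);
    if (!string.IsNullOrEmpty(directory))
        Directory.CreateDirectory(directory);

    using var http = new HttpClient();
    using var response = await http.GetAsync(url, HttpCompletionOption.ResponseHeadersRead, cancellationToken);
    response.EnsureSuccessStatusCode();

    await using var source = await response.Content.ReadAsStreamAsync(cancellationToken);
    await using var target = File.Create(destinationPath);
    await source.CopyToAsync(target, cancellationToken);
}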

This is particularly useful when deploying with Docker - you can map a volume for the models directory:

volumes:
  - ./mlmodels:/app/mlmodels  # Model persists across container restarts

Manual Download

Alternatively, you can download manually:

chmod +x Mostlylucid.SemanticSearch/download-models.sh
./Mostlylucid.SemanticSearch/download-models.sh

Or directly from Hugging Face:

mkdir -p mlmodels
curl -L https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/onnx/model.onnx -o mlmodels/all-MiniLM-L6-v2.onnx
curl -L https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/vocab.txt -o mlmodels/vocab.txt

This downloads two files into mlmodels/: the ONNX model itself (~90MB) and the tokenizer vocabulary (vocab.txt).

Performance Considerations

Embedding Generation

  • CPU Performance: ~50-100ms per embedding on a modern CPU
  • Optimization: We use a semaphore to prevent concurrent ONNX inference
  • Batching: For bulk indexing, process posts in batches of 10-20

Vector Search

  • Search Speed: <10ms for collections up to 100K vectors
  • Memory Usage: ~1KB per vector (with metadata)
  • Scalability: Qdrant can handle millions of vectors on modest hardware

Caching Strategy

We use ASP.NET Core output caching:

[OutputCache(Duration = 7200, VaryByRouteValueNames = new[] {"slug", "language"})]

This caches related posts for 2 hours, significantly reducing load.
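
For context, a hypothetical endpoint showing where that attribute sits:

[HttpGet("related/{slug}/{language}")]
[OutputCache(Duration = 7200, VaryByRouteValueNames = new[] { "slug", "language" })]
public async Task<IActionResult> RelatedPosts(string slug, string language)
{
    var related = await _semanticSearch.GetRelatedPostsAsync(slug, language);
    return Ok(related);
}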

What We've Built

At this point you have a complete, working semantic search foundation:

  • ONNX embeddings - CPU-friendly, auto-downloads from Hugging Face
  • Qdrant vector storage - Fast similarity search with metadata filtering
  • Related posts - Find semantically similar content
  • Search API - Natural language queries
  • Content indexing - Store blog posts as vectors

This is the exact setup running on this blog - zero GPU, zero additional cost.

Next: Semantic Search in Action

In Part 4b: Semantic Search in Action, we cover:

  • Typeahead Search - How the search-as-you-type works with Alpine.js
  • Hybrid Search - Combining semantic + PostgreSQL full-text with Reciprocal Rank Fusion
  • Search API - Complete API documentation with filters
  • Related Posts UI - DaisyUI components with HTMX lazy loading
  • Advanced Filters - Language and date range filtering

Continue to Part 4b for the search UI and hybrid search implementation.

Then Part 5: Hybrid Search & Auto-Indexing covers production integration patterns.

Resources

ONNX Runtime Documentation: https://onnxruntime.ai/docs/

Qdrant Documentation: https://qdrant.tech/documentation/

Embedding Models: https://huggingface.co/sentence-transformers

Complete Code

All code available at: github.com/scottgal/mostlylucidweb

  • Mostlylucid.SemanticSearch/ - Core semantic search library