Search is one of those features everyone underestimates. It looks trivial until real users start typing real queries - acronyms, half-remembered technical terms, punctuation-heavy names like "ASP.NET", or searches that are conceptually right but textually wrong. When search fails in those cases, users don't think "edge case" - they think the site is broken.
It's also something that's plagued this site since its inception. I looked at OpenSearch, and I've worked with PostgreSQL full-text searching and its fancy vector stuff, but I was never REALLY happy with it. Not that search is heavily used (almost nobody searches here), but it bugged me - and now that I'm building a search tool in lucidRAG, I thought I should FINALLY fix it. Whether it IS fixed or not is another matter.
This article isn't about building a search engine from scratch. It's about fixing the sharp edges of PostgreSQL full-text search in a real production system. I'll walk through the specific failures I hit, why they happened, and the pragmatic fixes required to make search behave the way users already expect - including acronym handling, technical term parsing, phrase search, exclusions, and Google-style operators.
Building on Earlier Work: This article extends the semantic search implementation that added hybrid search with Reciprocal Rank Fusion (RRF). That earlier article covered the foundation - combining PostgreSQL full-text search with Qdrant vector search. This article fixes the edge cases that implementation missed: acronyms that don't match, technical terms with special characters, and natural queries that PostgreSQL's parser breaks.
The semantic search infrastructure serves dual purposes: it powers the site's user-facing search and provides the retrieval layer for the Lawyer GPT RAG system - a writing assistant that drafts new posts using the site's existing content as a knowledge base. For deeper dives into the underlying technology, see Building a "Lawyer GPT" – Part 3 on embeddings and vector search.
When a search returned no results, the system was showing random old articles instead of helpful suggestions. This violated the principle of least surprise - users expect either relevant results or a clear "no match found" message, with recent posts offered as suggestions. Instead, the page made it look like the search had worked - just badly.
The Fix: Modified BlogSearchService.HybridSearchWithPagingAsync() to detect when search returns zero results and fall back to showing recent posts ordered by date descending. Added a NoMatchFound flag to BasePagingModel so the UI can display an appropriate message like "No match found. Did you mean one of these?"
// No match found - return recent posts as suggestions
if (noMatchFound)
{
    Log.Logger.Information("No search results for '{Query}', returning recent posts as suggestions", query);
    return await GetRecentPostsAsSuggestions(targetLanguage, startDate, endDate, page, pageSize);
}
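For completeness, here's a minimal sketch of what that fallback helper might look like. The entity and column names come from snippets elsewhere in this article; the PostListViewModel type and its Posts property are assumptions (the real model derives from BasePagingModel, as described above):

private async Task<PostListViewModel> GetRecentPostsAsSuggestions(
    string targetLanguage, DateTime? startDate, DateTime? endDate,
    int page, int pageSize)
{
    // Most recent visible posts in the requested language, newest first
    // (date-range filtering omitted for brevity)
    var posts = await _context.BlogPosts
        .Where(x => x.LanguageEntity.Name == targetLanguage && !x.IsHidden)
        .OrderByDescending(x => x.PublishedDate)
        .Skip((page - 1) * pageSize)
        .Take(pageSize)
        .ToListAsync();

    return new PostListViewModel // hypothetical model deriving from BasePagingModel
    {
        Posts = posts,
        NoMatchFound = true // drives the "No match found. Did you mean one of these?" message
    };
}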
PostgreSQL's default English text search configuration technically indexes acronyms, but normalization and stemming make short, case-significant terms unreliable in practice. When you search for "DiSE", it gets lowercased to "dise" and then the English dictionary may discard it or assign low weight, causing full-text search to miss articles that clearly contain "DiSE" in the title and content.
The Fix: Added acronym detection (terms ≤6 characters with uppercase letters) and case-insensitive substring fallback using PostgreSQL's ILIKE. This supplements full-text search rather than replacing it (see Performance Considerations below for why this is acceptable):
// Detect if the query looks like an acronym or short technical term
var isAcronymLike = query.Length <= 6 && query.Any(char.IsUpper);

if (isAcronymLike)
{
    var acronym = query;
    // Case-insensitive substring fallback alongside full-text search
    searchQuery = searchQuery.Where(x =>
        EF.Functions.ILike(x.Title, $"%{acronym}%")
        || EF.Functions.ILike(x.PlainTextContent, $"%{acronym}%"));
}
Searches for "ASP.NET" or "C#" would fail because PostgreSQL's text search parser treats periods and hash symbols as delimiters, splitting "ASP.NET" into "ASP" and "NET" as separate tokens. This meant searching for the exact term wouldn't work.
The Fix: Created a SearchQueryParser that recognizes common technical terms and replaces them with searchable versions:
private static readonly Dictionary<string, string> TechnicalTerms = new(StringComparer.OrdinalIgnoreCase)
{
    ["asp.net"] = "aspnet",
    ["c#"] = "csharp",
    [".net"] = "dotnet",
    ["f#"] = "fsharp",
    ["node.js"] = "nodejs",
    // ... more terms
};
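Applying the map is straightforward - a minimal sketch (the helper name is illustrative, not necessarily what the real code calls it):

// Illustrative helper: substitute known technical terms before tokenizing
private static string ReplaceTechnicalTerms(string query)
{
    foreach (var (term, replacement) in TechnicalTerms)
    {
        query = query.Replace(term, replacement, StringComparison.OrdinalIgnoreCase);
    }
    return query;
}

Running this before tokenization means "ASP.NET and Alpine" reaches the parser as "aspnet and Alpine", so the period never hits PostgreSQL's delimiter-happy parser.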
PostgreSQL treats "and" as a stop word and removes it from searches. Combined with the special character issue, a natural query like "ASP.NET and Alpine" would essentially become a search for "Alpine" only.
The Fix: Implemented a Google-style SearchQueryParser class that handles stop words intelligently and supports advanced search operators:

- "exact match" - searches for the exact phrase
- -unwanted - excludes results containing this term
- ASP* - matches "ASP", "ASPNET", "ASPNetCore", etc.

The parser uses a compiled regex pattern to tokenize the query:
[GeneratedRegex(@"""([^""]+)""|(-)?(\S+)", RegexOptions.Compiled)]
private static partial Regex QueryTokenRegex();

public ParsedQuery Parse(string query)
{
    var result = new ParsedQuery();

    // Substitute technical terms first (e.g. "asp.net" -> "aspnet"; see the sketch above)
    var processedQuery = ReplaceTechnicalTerms(query);

    var matches = QueryTokenRegex().Matches(processedQuery);
    foreach (Match match in matches)
    {
        // Quoted phrase
        if (match.Groups[1].Success)
        {
            var phrase = match.Groups[1].Value.Trim();
            result.Phrases.Add(phrase);
            continue;
        }

        // Excluded term (starts with -)
        if (match.Groups[2].Success)
        {
            var term = match.Groups[3].Value.Trim();
            result.ExcludeTerms.Add(term.ToLowerInvariant());
            continue;
        }

        // Regular term or wildcard
        var token = match.Groups[3].Value.Trim();
        if (token.Contains('*'))
        {
            result.WildcardTerms.Add(token.Replace("*", ""));
        }
        else if (!StopWords.Contains(token))
        {
            result.IncludeTerms.Add(token.ToLowerInvariant());
        }
    }

    return result;
}
The parser outputs structured data that's then converted to PostgreSQL's tsquery syntax. We use to_tsquery rather than websearch_to_tsquery because we've already parsed the operators ourselves - generating the structured syntax directly gives us full control over how terms are combined.
public string BuildTsQuery(ParsedQuery parsed)
{
    var queryParts = new List<string>();

    // Include terms are combined with AND (&) when joined below
    foreach (var term in parsed.IncludeTerms)
    {
        queryParts.Add(term);
    }

    // Wildcard terms get the :* prefix-match suffix
    foreach (var term in parsed.WildcardTerms)
    {
        queryParts.Add($"{term}:*");
    }

    return queryParts.Count > 0 ? string.Join(" & ", queryParts) : string.Empty;
}
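To make that concrete, here's roughly what the parser and builder produce for one of the combined queries from the results section below (output values illustrative):

var parser = new SearchQueryParser();
var parsed = parser.Parse("\"full text search\" PostgreSQL -MySQL");
// parsed.Phrases       -> ["full text search"]  (matched via ILIKE, not tsquery)
// parsed.IncludeTerms  -> ["postgresql"]
// parsed.ExcludeTerms  -> ["mysql"]             (applied as NOT ILIKE filters)

var tsQuery = parser.BuildTsQuery(parsed);
// tsQuery -> "postgresql"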
Why not just use websearch_to_tsquery? Because it still fails on technical terms, acronyms, and domain-specific syntax - and offers no hooks for hybrid ranking or fallbacks. By parsing ourselves, we can handle these edge cases before they reach PostgreSQL.
The BuildSearchQuery method was completely rewritten to use the parsed query structure. Note that baseQuery already filters by language, date range, and visibility - we're adding search-specific conditions on top:
private IOrderedQueryable<BlogPostEntity> BuildSearchQuery(
    string query,
    string language,
    DateTime? startDate,
    DateTime? endDate,
    string order)
{
    var parsed = _queryParser.Parse(query);

    // baseQuery already filters by language, date range, and visibility (see note above)
    IQueryable<BlogPostEntity> searchQuery = baseQuery;

    // Handle phrases (exact substring matching)
    foreach (var phrase in parsed.Phrases)
    {
        searchQuery = searchQuery.Where(x =>
            EF.Functions.ILike(x.Title, $"%{phrase}%")
            || EF.Functions.ILike(x.PlainTextContent, $"%{phrase}%")
            || x.Categories.Any(c => EF.Functions.ILike(c.Name, $"%{phrase}%")));
    }

    // Build tsquery for include terms and wildcards
    var tsQuery = _queryParser.BuildTsQuery(parsed);

    // Apply full-text search if we have terms
    if (!string.IsNullOrWhiteSpace(tsQuery))
    {
        searchQuery = searchQuery.Where(x =>
            x.SearchVector.Matches(EF.Functions.ToTsQuery("english", tsQuery))
            || x.Categories.Any(c =>
                EF.Functions.ToTsVector("english", c.Name)
                    .Matches(EF.Functions.ToTsQuery("english", tsQuery))));
    }

    // Case-insensitive substring fallback for acronym-like terms
    // (see Performance Considerations below)
    var acronymTerms = parsed.IncludeTerms
        .Concat(parsed.WildcardTerms)
        .Where(t => _queryParser.IsAcronymLike(t))
        .ToList();

    foreach (var acronym in acronymTerms)
    {
        searchQuery = searchQuery.Where(x =>
            EF.Functions.ILike(x.Title, $"%{acronym}%")
            || EF.Functions.ILike(x.PlainTextContent, $"%{acronym}%"));
    }

    // Handle excluded terms (must NOT contain these)
    foreach (var excludeTerm in parsed.ExcludeTerms)
    {
        searchQuery = searchQuery.Where(x =>
            !EF.Functions.ILike(x.Title, $"%{excludeTerm}%")
            && !EF.Functions.ILike(x.PlainTextContent, $"%{excludeTerm}%")
            && !x.Categories.Any(c => EF.Functions.ILike(c.Name, $"%{excludeTerm}%")));
    }

    // Order by cover density ranking (see the ts_rank_cd section below)
    var orderedQuery = searchQuery.OrderByDescending(x =>
        x.SearchVector.RankCoverDensity(EF.Functions.ToTsQuery("english", tsQuery)));

    return orderedQuery;
}
Here are some examples of the improved search functionality:
DiSE
✅ Now matches articles with "DiSE" in title or content (case-insensitive)
ASP.NET
✅ Finds articles about ASP.NET (automatically converted to "aspnet" for full-text search)
"semantic search"
✅ Finds exact phrase "semantic search" in articles
ASP.NET -Core
✅ Finds ASP.NET articles but excludes those mentioning "Core"
ASP*
✅ Matches "ASP", "ASPNET", "ASPNetCore", etc.
"full text search" PostgreSQL -MySQL
✅ Finds articles with exact phrase "full text search" AND mentions of PostgreSQL, but EXCLUDING articles that mention MySQL
Query: ASP.NET and Alpine
Before (broken): PostgreSQL's parser split "ASP.NET" into "asp" and "net", discarded "and" as a stop word, and the search effectively matched only "Alpine".
After (fixed): "ASP.NET" is rewritten to "aspnet", the parser drops "and" itself, and the resulting tsquery is aspnet & alpine - matching articles about both.
The search integrates with Reciprocal Rank Fusion (RRF) as implemented in the earlier semantic search article. RRF here is used for ranking, not recall - it combines already-retrieved results from BM25 full-text search and vector semantic search:
// Fuse using RRF with category/freshness boosts
var fusedDtos = _ranker.FuseResults(bm25Results, vectorResults, query);
The RRF algorithm uses 1/(k+rank) with k=60 to combine results from multiple sources, then applies boosts for category matches and content freshness. A minimal sketch of the core formula follows.
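This sketch shows just the score computation; names and shapes are illustrative, and the production FuseResults also applies the boosts mentioned above:

// Illustrative RRF sketch: score(d) = sum over sources of 1 / (k + rank(d)), k = 60
private static Dictionary<string, double> RrfScores(
    IEnumerable<IReadOnlyList<string>> rankedLists, int k = 60)
{
    var scores = new Dictionary<string, double>();
    foreach (var list in rankedLists)
    {
        for (var rank = 0; rank < list.Count; rank++)
        {
            var id = list[rank];
            // ranks are 1-based in the formula, hence rank + 1
            scores[id] = scores.GetValueOrDefault(id) + 1.0 / (k + rank + 1);
        }
    }
    return scores;
}

Documents that appear near the top of both the BM25 and vector lists accumulate the highest fused scores, which is exactly the behaviour hybrid search wants.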
For the complete RRF implementation and how hybrid search works, see Semantic Search in Action. For deeper dives into embeddings and vector similarity, see Building a "Lawyer GPT" - Part 3. The same infrastructure powers both user-facing search and the RAG retrieval for AI-assisted writing.
The new components are registered as services:
services.AddSingleton<SearchQueryParser>();
services.AddSingleton<SearchRanker>();
services.AddScoped<BlogSearchService>();
SearchQueryParser and SearchRanker are singletons because they're stateless, thread-safe, and can be reused across requests. BlogSearchService is scoped because it accesses the database context.
The same semantic search component is also used by the Lawyer GPT system to retrieve relevant past blog posts for AI-assisted writing. See Part 4 for details on the ingestion pipeline.
Search relies on a precomputed SearchVector column in the BlogPosts table:
ALTER TABLE mostlylucid."BlogPosts"
ADD COLUMN "SearchVector" tsvector
GENERATED ALWAYS AS (
to_tsvector('english',
coalesce("Title", '') || ' ' ||
coalesce("PlainTextContent", '')
)
) STORED;
CREATE INDEX idx_blog_posts_search_vector
ON mostlylucid."BlogPosts"
USING GIN ("SearchVector");
The GIN (Generalized Inverted Index) provides fast full-text search across large text corpora.
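For intuition, the query EF Core ends up issuing has roughly this shape (illustrative, not the exact generated SQL):

-- Illustrative shape of a search against the precomputed vector
SELECT "Id", "Slug", "Title", "PublishedDate"
FROM mostlylucid."BlogPosts"
WHERE "SearchVector" @@ to_tsquery('english', 'aspnet & alpine')
ORDER BY ts_rank_cd("SearchVector", to_tsquery('english', 'aspnet & alpine')) DESC;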
While ILIKE (case-insensitive LIKE) is slower than full-text search, it's necessary for acronyms because:

- lowercasing and stemming mean short, case-significant terms like "DiSE" may never survive into the tsvector (see above)
- it only runs as a targeted fallback for terms the parser flags as acronym-like, not on every query
- the candidate set is already narrowed by the language, visibility, and date filters in baseQuery

This approach would not scale to arbitrary substring search across millions of rows - but for targeted acronym fallback it's appropriate.
PostgreSQL offers two ranking functions for full-text search:

- ts_rank: basic term frequency (how many times the terms appear)
- ts_rank_cd: cover density ranking (how close together the terms appear)

We use ts_rank_cd because it provides BM25-like relevance by considering term proximity:
// Order by cover density ranking - rewards term proximity
orderedQuery = searchQuery.OrderByDescending(x =>
    x.SearchVector.RankCoverDensity(EF.Functions.ToTsQuery("english", tsQuery)));
Why ts_rank_cd is superior:
| Metric | ts_rank | ts_rank_cd |
|---|---|---|
| Algorithm | Term frequency | Cover density (proximity) |
| Multi-word queries | Counts terms separately | Rewards terms appearing together |
| Example: "docker containers" | Article with "docker" 50 times scores high | Article with "docker containers" together scores higher |
| Performance | Fast | Marginally slower, still leverages GIN index |
| Behavior | Simple counting | Approximates BM25 |
This is a quick win optimization - better relevance ranking with zero application-level computation. All ranking happens in PostgreSQL using the existing GIN index on SearchVector.
Beyond fixing search functionality, several key optimizations improve performance:
Available languages rarely change but were queried on every search request. Now cached with double-checked locking:
private static readonly TimeSpan LanguageCacheDuration = TimeSpan.FromHours(1);
private static List<string>? _cachedLanguages;
private static DateTime _languageCacheExpiry = DateTime.MinValue;
private static readonly SemaphoreSlim _cacheLock = new(1, 1);
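The lookup itself follows the classic double-checked pattern. A minimal sketch, assuming a _context with a Languages set (method and query names illustrative):

private async Task<List<string>> GetLanguagesAsync()
{
    // First check without taking the lock - the hot path
    if (_cachedLanguages != null && DateTime.UtcNow < _languageCacheExpiry)
        return _cachedLanguages;

    await _cacheLock.WaitAsync();
    try
    {
        // Second check: another request may have refreshed the cache while we waited
        if (_cachedLanguages != null && DateTime.UtcNow < _languageCacheExpiry)
            return _cachedLanguages;

        _cachedLanguages = await _context.Languages
            .Select(l => l.Name)
            .ToListAsync();
        _languageCacheExpiry = DateTime.UtcNow.Add(LanguageCacheDuration);
        return _cachedLanguages;
    }
    finally
    {
        _cacheLock.Release();
    }
}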
Impact: Eliminates 1 database query per search request.
Original code used foreach loops creating multiple WHERE clauses. Now batched into single expressions:
// BEFORE: Multiple WHERE clauses
foreach (var acronym in acronymTerms)
{
    searchQuery = searchQuery.Where(x =>
        EF.Functions.ILike(x.Title, $"%{acronym}%"));
}

// AFTER: Single batched WHERE
if (acronymTerms.Count > 0)
{
    searchQuery = searchQuery.Where(x =>
        acronymTerms.Any(acronym =>
            EF.Functions.ILike(x.Title, $"%{acronym}%")));
}
Impact: Cleaner SQL, ~5-10% faster for multi-term queries.
Added partial index with INCLUDE columns for frequent access patterns:
CREATE INDEX idx_blog_posts_search_covering
ON mostlylucid."BlogPosts" ("LanguageId", "IsHidden", "ScheduledPublishDate")
INCLUDE ("Id", "Slug", "Title", "PublishedDate")
WHERE "IsHidden" = false;
Impact: Enables index-only scans - PostgreSQL doesn't need to access the table heap, reducing I/O by 20-30% for common queries.
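To confirm PostgreSQL is actually using index-only scans, inspect the plan with EXPLAIN (the literal filter values here are placeholders):

-- Check the plan reports "Index Only Scan" on the covering index
EXPLAIN (ANALYZE, BUFFERS)
SELECT "Id", "Slug", "Title", "PublishedDate"
FROM mostlylucid."BlogPosts"
WHERE "LanguageId" = 1 AND "IsHidden" = false;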
EF Core's Include() loads full navigation entities into memory. We removed it wherever the navigation property is only used in a WHERE clause - EF translates the property access into a join without materializing the entity:
// BEFORE: Loads full LanguageEntity into memory
.Include(x => x.LanguageEntity)
.Where(x => x.LanguageEntity.Name == "en")
// AFTER: EF translates navigation property without loading entity
.Where(x => x.LanguageEntity.Name == "en")
Impact: ~5-10% memory reduction, less data transferred from database.
Search builds WHERE clauses incrementally using EF Core's expression tree composition:
IQueryable<BlogPostEntity> searchQuery = baseQuery;
// Each filter added conditionally - PostgreSQL optimizes the final query
if (parsed.Phrases.Count > 0) { searchQuery = searchQuery.Where(...); }
if (!string.IsNullOrWhiteSpace(tsQuery)) { searchQuery = searchQuery.Where(...); }
if (acronymTerms.Count > 0) { searchQuery = searchQuery.Where(...); }
This generates a single optimized SQL query rather than multiple round trips. PostgreSQL's query planner can use statistics and indexes effectively when it sees the complete WHERE clause.
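Conceptually, the final statement bundles every predicate into one WHERE clause - something like this illustrative shape (not the exact EF-generated SQL):

-- Conceptual shape: all composed filters land in a single statement
SELECT "Id", "Slug", "Title", "PublishedDate"
FROM mostlylucid."BlogPosts"
WHERE "LanguageId" = @lang
  AND "IsHidden" = false
  AND ("Title" ILIKE '%semantic search%' OR "PlainTextContent" ILIKE '%semantic search%')
  AND "SearchVector" @@ to_tsquery('english', @tsquery)
ORDER BY ts_rank_cd("SearchVector", to_tsquery('english', @tsquery)) DESC;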
Cumulative impact of all optimizations:
| Optimization | Latency Impact | DB Load Impact |
|---|---|---|
| Language caching | Minimal | -1 query/request |
| Remove Include() | -5-10% | Less data transfer |
| Batch ILIKE | -5-10% | Cleaner SQL |
| Covering index | -20-30% | Index-only scans |
| ts_rank_cd | Similar | Better relevance |
| Cache fix (route params) | N/A | Prevents stale data |
Total expected improvement: 30-50% faster search with significantly reduced database load.
Don't assume full-text search handles everything: Edge cases like acronyms and special characters need special handling.
Combine multiple approaches: Full-text search (BM25) + semantic search (vectors) + substring fallbacks provides better coverage than any single method.
Google shaped user expectations: Supporting quoted phrases, exclusions, and wildcards makes search feel natural because users are already familiar with these operators.
Provide helpful fallbacks: When search fails, don't show nothing - show recent posts and make it clear no match was found.
Parse, don't hack: A proper query parser is cleaner and more maintainable than a series of string manipulations.
Leverage PostgreSQL's built-in ranking: Use ts_rank_cd instead of ts_rank for BM25-like relevance. Cover density ranking considers term proximity - a quick win that improves result quality with zero application-level code.
Profile before optimizing: The "obvious" bottlenecks (FTS parsing) weren't the real problem. Language lookups, excessive includes, and missing indexes had bigger impact than expected.
Batch database operations: Multiple foreach loops creating separate WHERE clauses generate suboptimal SQL. Use Any() or All() to batch into single expressions.
Possible improvements for the future, categorized by concern: retrieval improvements, parsing enhancements (a category:ASP.NET operator, an after:2025-01-01 operator), ranking refinements, and observability.
Building production-grade search requires fixing edge cases and optimizing performance. This article covered both: fixing PostgreSQL full-text search edge cases (acronyms, technical terms, Google-style operators) and implementing key performance optimizations (caching, batching, covering indexes, ts_rank_cd).
The implementation achieves 30-50% faster search while reducing database load through:

- caching language lookups
- removing unnecessary Include() calls
- batching ILIKE conditions
- a covering index that enables index-only scans
- cover density ranking (ts_rank_cd) for better relevance

The key insight: no single approach handles all cases. PostgreSQL FTS excels at keyword matching, semantic search handles conceptual queries, and targeted fallbacks catch edge cases. The clean parsing layer ties it together, handling operators and technical terms before they reach the search engines.
The semantic search infrastructure serves dual purposes: it powers the site's user-facing search and provides the retrieval layer for the Lawyer GPT RAG system's AI-assisted writing.
All code available in the blog's GitHub repository.