Mostlylucid.OcrNer - The NuGet Package (Part 2)

Thursday, 12 February 2026


22 minute read


In Part 1 I showed the raw pipeline: manually downloading models, writing a tokenizer, wiring up ONNX inference, and decoding BIO tags by hand. Educational, but a lot of plumbing to get right.

Now it's a NuGet package. One line of setup, zero model downloads - everything auto-downloads on first use.

Note: This package is a simplified, focused tool for extracting text and entities from images. If you need a full multi-phase pipeline that can read text from anything (photos, documents, screenshots, handwriting, animated gifs and even videos) with fuzzy matching, OCR consensus, and structured extraction, check out lucidRAG where the production-grade version of this pipeline lives.


Quick Glossary (If You're New to This)

Before we dive in, here's what the key terms mean:

  • OCR (Optical Character Recognition) - converting an image of text into actual text characters your code can work with. Think: photo of a receipt turns into a string of text.
  • NER (Named Entity Recognition) - scanning text to find and classify names of things. "John Smith works at Microsoft in Seattle" becomes: John Smith = Person, Microsoft = Organization, Seattle = Location.
  • ONNX Runtime - a way to run machine learning models (like the BERT model we use for NER) on your machine without needing Python, TensorFlow, or a GPU. It runs the model as a portable .onnx file, locally, using just your CPU.
  • BERT - a pre-trained language model from Google that understands context in text. The NER variant has been fine-tuned on the CoNLL-2003 dataset to recognize people, organizations, locations, and miscellaneous entities.
  • Florence-2 - a small vision model from Microsoft that can describe what it sees in an image (captions, objects, text). Different from Tesseract in that it understands the whole scene, not just characters.

Why More Than Plain Tesseract?

Tesseract is strong for clean document text, but it falls down on noisy photos, low-contrast scans, and mixed "scene + text" images. It also stops at raw text - you still need extra code to turn that text into structured entities you can actually use.

This package closes those gaps:

  1. ImageSharp preprocessing - grayscale, contrast boost, sharpening tuned for OCR
  2. OpenCV advanced preprocessing - deskew, denoise, and binarization for damaged/skewed documents (opt-in)
  3. Florence-2 vision - local image captioning and OCR via ONNX (no cloud API)
  4. BERT NER on top of OCR text - convert extracted text into typed entities (PER/ORG/LOC/MISC) you can act on
  5. Microsoft.Recognizers.Text - rule-based extraction of dates, numbers, URLs, phones, emails, and IPs (opt-in)
  6. Proper DI integration - AddOcrNer() and you're done
  7. CLI tool - A Spectre.Console command-line app that just works out of the box

What Changed from Part 1

flowchart LR
    subgraph Part1["Part 1: Manual"]
        M1[Download models]
        M2[Write tokenizer]
        M3[Wire ONNX]
        M4[BIO decode]
    end

    subgraph Part2["Part 2: NuGet Package"]
        N1["AddOcrNer()"]
        N2[Auto-download]
        N3[ImageSharp + OpenCV]
        N4[Florence-2]
        N5[Recognizers]
        N6[CLI Tool]
    end

    Part1 -->|"packaged into"| Part2

    style N1 stroke:#090,stroke-width:3px
    style N2 stroke:#090,stroke-width:3px
    style N3 stroke:#f60,stroke-width:3px
    style N4 stroke:#f60,stroke-width:3px
    style N5 stroke:#f60,stroke-width:3px
    style N6 stroke:#f60,stroke-width:3px

Part 1 was educational - understanding what each piece does. Part 2 is practical - using it without thinking about the internals.


Getting Started

Install

dotnet add package Mostlylucid.OcrNer

Register Services

The AddOcrNer() extension method registers everything: OCR, NER, the combined pipeline, Florence-2 vision, the model downloader, and the image preprocessor. All as singletons, all lazy-initialized.

Here's the actual registration code from ServiceCollectionExtensions.cs:

// Option 1: From appsettings.json (reads the "OcrNer" section)
builder.Services.AddOcrNer(builder.Configuration);

// Option 2: Inline configuration
builder.Services.AddOcrNer(config =>
{
    config.EnableOcr = true;
    config.TesseractLanguage = "eng";
    config.MinConfidence = 0.5f;
});

That's it. No model downloads, no file paths, no ONNX wiring. Under the hood, AddOcrNer() registers these services:

// From ServiceCollectionExtensions.cs - what gets registered
services.AddSingleton<ModelDownloader>();           // Auto-downloads models on first use
services.AddSingleton<ImagePreprocessor>();         // ImageSharp-based image enhancement
services.AddSingleton<OpenCvPreprocessor>();        // OpenCV advanced preprocessing
services.AddSingleton<INerService, NerService>();   // BERT NER from text
services.AddSingleton<IOcrService, OcrService>();   // Tesseract OCR from images
services.AddSingleton<IOcrNerPipeline, OcrNerPipeline>();         // Combined OCR + NER
services.AddSingleton<ITextRecognizerService, TextRecognizerService>(); // Microsoft.Recognizers
services.AddSingleton<IVisionService, VisionService>();           // Florence-2 vision

Configuration (appsettings.json)

{
  "OcrNer": {
    "EnableOcr": true,
    "TesseractLanguage": "eng",
    "MinConfidence": 0.5,
    "MaxSequenceLength": 512,
    "ModelDirectory": "models/ocrner",
    "Preprocessing": "Default",
    "EnableAdvancedPreprocessing": false,
    "EnableRecognizers": false,
    "RecognizerCulture": "en-us"
  }
}

Here's the actual OcrNerConfig class these map to:

// From OcrNerConfig.cs
public class OcrNerConfig
{
    public string ModelDirectory { get; set; } =
        Path.Combine(AppContext.BaseDirectory, "models", "ocrner");
    public bool EnableOcr { get; set; } = true;
    public string TesseractLanguage { get; set; } = "eng";
    public int MaxSequenceLength { get; set; } = 512;
    public float MinConfidence { get; set; } = 0.5f;
    public string NerModelRepo { get; set; } = "protectai/bert-base-NER-onnx";
    public PreprocessingLevel Preprocessing { get; set; } = PreprocessingLevel.Default;
    public bool EnableAdvancedPreprocessing { get; set; } = false;  // OpenCV pipeline
    public bool EnableRecognizers { get; set; } = false;            // Microsoft.Recognizers
    public string RecognizerCulture { get; set; } = "en-us";       // Recognizer language
}

All settings have sensible defaults. You can omit the entire section and everything works. The two opt-in features (EnableAdvancedPreprocessing and EnableRecognizers) default to false so the package stays lightweight for users who don't need them.

The Preprocessing option controls image enhancement before OCR:

| Value | What it does | When to use |
|-------|--------------|-------------|
| None | No preprocessing | Images are already optimized |
| Minimal | Grayscale only | Clean scans |
| Default | Grayscale + contrast + sharpen | Most images (recommended) |
| Aggressive | Strong contrast + sharpen + upscale | Poor quality photos |
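
Picking a different level is a one-liner with the inline overload shown earlier - a quick sketch, assuming PreprocessingLevel exposes the same four values as the table:

builder.Services.AddOcrNer(config =>
{
    config.Preprocessing = PreprocessingLevel.Aggressive;  // None | Minimal | Default | Aggressive
});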

The Five Services

The package registers five services, each usable independently. Pick the one that fits your use case - there's no need to load Florence-2 if all you need is NER from text.

flowchart TD
    subgraph Services
        NER["INerService<br>Text → Entities"]
        OCR["IOcrService<br>Image → Text"]
        REC["ITextRecognizerService<br>Text → Signals"]
        PIPE["IOcrNerPipeline<br>Image → Entities + Signals"]
        VIS["IVisionService<br>Image → Caption"]
    end

    OCR --> PIPE
    NER --> PIPE
    REC -.-> PIPE

    style PIPE stroke:#090,stroke-width:3px
    style VIS stroke:#f60,stroke-width:3px
    style REC stroke:#f60,stroke-width:2px,stroke-dasharray: 5 5

Choosing the Right Service for Your Use Case

The key principle is efficiency: pick the lightest tool that does the job. Don't load a 450MB vision model when a 4MB OCR engine will do.

| Service | What it does | Model size | Speed | Use when... |
|---------|--------------|------------|-------|-------------|
| INerService | BERT NER from text | ~430MB | ~50ms | You already have text (PDFs, databases, user input) |
| IOcrService | Tesseract OCR from images | ~4MB | ~100ms | You need text from document scans, screenshots |
| IOcrNerPipeline | OCR then NER in one call | Both models | ~150ms | You have images and want entities in one step |
| ITextRecognizerService | Rule-based extraction (dates, phones, etc.) | None | ~1ms | You want structured data alongside NER entities |
| IVisionService | Florence-2 captioning + OCR | ~450MB | ~1-3s | You need image understanding, not just text reading |

NER from Text (No Images Needed)

If you already have text (from PDFs, databases, user input), you can use NER directly. This is the fastest path - no OCR, no image processing, just text in, entities out.

The INerService interface is simple - one method:

// From INerService.cs
public interface INerService
{
    Task<NerResult> ExtractEntitiesAsync(string text, CancellationToken ct = default);
}

Here's how to use it in your own service:

public class MyService
{
    private readonly INerService _nerService;

    public MyService(INerService nerService)
    {
        _nerService = nerService;
    }

    public async Task ProcessDocumentAsync(string text)
    {
        var result = await _nerService.ExtractEntitiesAsync(text);

        foreach (var entity in result.Entities)
        {
            // entity.Label: "PER", "ORG", "LOC", or "MISC"
            // entity.Text: "John Smith"
            // entity.Confidence: 0.9996
            // entity.StartOffset / EndOffset: character positions in the source
        }
    }
}

The result models are straightforward:

// From NerResult.cs / NerEntity.cs
public class NerResult
{
    public string SourceText { get; init; } = string.Empty;
    public List<NerEntity> Entities { get; init; } = [];
}

public class NerEntity
{
    public string Text { get; init; } = string.Empty;     // "John Smith"
    public string Label { get; init; } = string.Empty;    // "PER", "ORG", "LOC", "MISC"
    public float Confidence { get; init; }                 // 0.0 to 1.0
    public int StartOffset { get; init; }                  // Where in the source text
    public int EndOffset { get; init; }                    // End position (exclusive)
}
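
Because every entity carries a label, confidence, and offsets, downstream filtering is plain LINQ. A small sketch (nothing package-specific beyond the properties above) that groups confident entities by type - result here is the NerResult returned by ExtractEntitiesAsync:

// Group high-confidence entities by label: PER, ORG, LOC, MISC
var byLabel = result.Entities
    .Where(e => e.Confidence >= 0.8f)                 // threshold is illustrative
    .GroupBy(e => e.Label)
    .ToDictionary(g => g.Key, g => g.Select(e => e.Text).Distinct().ToList());

// byLabel["PER"] -> ["John Smith"], byLabel["ORG"] -> ["Microsoft"], ...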

The first call downloads the BERT NER model (~430MB) from HuggingFace. Subsequent calls use the cached model - startup is instant.


OCR + NER Pipeline

For images, the pipeline handles preprocessing, OCR, and NER in one call. The IOcrNerPipeline combines IOcrService and INerService:

// From OcrNerPipeline.cs - the actual pipeline code
public async Task<OcrNerResult> ProcessImageAsync(string imagePath, CancellationToken ct = default)
{
    // Step 1: OCR (includes preprocessing automatically)
    var ocrResult = await _ocrService.ExtractTextAsync(imagePath, ct);

    if (string.IsNullOrWhiteSpace(ocrResult.Text))
        return new OcrNerResult
        {
            OcrResult = ocrResult,
            NerResult = new NerResult { SourceText = string.Empty }
        };

    // Step 2: NER on extracted text
    var nerResult = await _nerService.ExtractEntitiesAsync(ocrResult.Text, ct);

    return new OcrNerResult
    {
        OcrResult = ocrResult,
        NerResult = nerResult
    };
}

Using it:

var pipeline = serviceProvider.GetRequiredService<IOcrNerPipeline>();

var result = await pipeline.ProcessImageAsync("invoice.png");

// What OCR found
var text = result.OcrResult.Text;           // The full extracted text
var confidence = result.OcrResult.Confidence; // 0.0 to 1.0

// What NER found in that text
foreach (var entity in result.NerResult.Entities)
{
    // [PER] John Smith, [ORG] Microsoft, [LOC] Seattle...
}
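
In a web app the pipeline drops straight into an endpoint. A minimal ASP.NET Core sketch - the upload is staged in a temp file because ProcessImageAsync takes a path, and the route, temp handling, and antiforgery setup are illustrative:

// Minimal API endpoint: upload an image, get OCR text + entities back
app.MapPost("/extract", async (IFormFile file, IOcrNerPipeline pipeline) =>
{
    var tempPath = Path.Combine(Path.GetTempPath(),
        Path.GetRandomFileName() + Path.GetExtension(file.FileName));

    await using (var stream = File.Create(tempPath))
        await file.CopyToAsync(stream);

    try
    {
        var result = await pipeline.ProcessImageAsync(tempPath);
        return Results.Ok(new
        {
            text = result.OcrResult.Text,
            confidence = result.OcrResult.Confidence,
            entities = result.NerResult.Entities
        });
    }
    finally
    {
        File.Delete(tempPath);   // don't leave uploads on disk
    }
});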

What Happens Under the Hood

flowchart LR
    IMG[Image bytes]
    PRE["ImageSharp<br>or OpenCV"]
    TESS["Tesseract<br>OCR"]
    TOK["WordPiece<br>Tokenize"]
    BERT["BERT NER<br>ONNX"]
    REC["Recognizers<br>(optional)"]
    OUT[Result]

    IMG --> PRE
    PRE --> TESS
    TESS --> TOK
    TOK --> BERT
    BERT --> REC
    REC --> OUT

    style PRE stroke:#f60,stroke-width:3px
    style BERT stroke:#f60,stroke-width:3px
    style REC stroke:#f60,stroke-width:2px,stroke-dasharray: 5 5

Image Preprocessing

Part 1 had raw Tesseract calls. In practice, both Tesseract and Florence-2 work better with preprocessed images. Preprocessing is on by default but completely optional - you can disable it with Preprocessing = "None" in config or --preprocess none on the CLI.

The ImagePreprocessor uses ImageSharp (pure C#, no native dependencies):

// From ImagePreprocessor.cs - the actual preprocessing steps
public byte[] Preprocess(byte[] imageBytes, PreprocessingOptions? options = null)
{
    options ??= PreprocessingOptions.Default;
    using var image = Image.Load<Rgba32>(imageBytes);

    image.Mutate(ctx =>
    {
        // Step 1: Upscale small images (Tesseract wants 300+ DPI equivalent)
        if (options.EnableUpscale && (image.Width < options.MinWidth || image.Height < options.MinHeight))
        {
            var scale = Math.Max(
                (float)options.MinWidth / image.Width,
                (float)options.MinHeight / image.Height);
            scale = Math.Min(scale, options.MaxUpscaleFactor);
            ctx.Resize((int)(image.Width * scale), (int)(image.Height * scale),
                KnownResamplers.Lanczos3);
        }

        // Step 2: Grayscale (single channel = faster, more accurate)
        if (options.EnableGrayscale)
            ctx.Grayscale();

        // Step 3: Contrast boost (text stands out from background)
        if (options.EnableContrast && options.ContrastAmount != 1.0f)
            ctx.Contrast(options.ContrastAmount);

        // Step 4: Sharpen (crisp character edges)
        if (options.EnableSharpen)
            ctx.GaussianSharpen(options.SharpenSigma);
    });

    using var ms = new MemoryStream();
    image.SaveAsPng(ms);  // PNG = lossless, no additional artifacts
    return ms.ToArray();
}

Three presets are built in. The PreprocessingOptions class defines them:

// From ImagePreprocessor.cs
public static PreprocessingOptions Default => new();  // Grayscale + 1.5x contrast + sharpen

public static PreprocessingOptions Minimal => new()   // Grayscale only
{
    EnableContrast = false,
    EnableSharpen = false,
    EnableUpscale = false
};

public static PreprocessingOptions Aggressive => new() // For poor quality images
{
    ContrastAmount = 1.8f,
    SharpenSigma = 1.5f,
    MinWidth = 1024,
    MinHeight = 768,
    MaxUpscaleFactor = 4.0f
};

| Preset | When to use | What it does |
|--------|-------------|--------------|
| Default | Most images | Grayscale + 1.5x contrast + light sharpen |
| Minimal | Clean scans | Grayscale only |
| Aggressive | Poor quality photos | 1.8x contrast + strong sharpen + larger upscale |
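
You normally never call the preprocessor yourself - OcrService applies it automatically - but ImagePreprocessor is registered in DI, so you can run it standalone to see what a preset does to an image. A quick sketch using the Preprocess method shown above (file names are illustrative):

var preprocessor = serviceProvider.GetRequiredService<ImagePreprocessor>();

var original = await File.ReadAllBytesAsync("receipt-photo.jpg");
var cleaned = preprocessor.Preprocess(original, PreprocessingOptions.Aggressive);

// Output is always lossless PNG, so write it out and compare side by side
await File.WriteAllBytesAsync("receipt-photo.cleaned.png", cleaned);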

Advanced Preprocessing with OpenCV

For seriously degraded documents - skewed scans, noisy photos, faded historical pages - the ImageSharp pipeline isn't enough. Enable EnableAdvancedPreprocessing to switch to a full OpenCV pipeline ported from ImageSummarizer.

The OpenCV preprocessor chains four stages, each driven by an automatic quality assessment:

flowchart LR
    IMG[Image]
    QA["Quality<br>Assess"]
    SK["Deskew"]
    DN["Denoise"]
    BIN["Binarize"]
    OUT[Clean image]

    IMG --> QA
    QA --> SK
    SK --> DN
    DN --> BIN
    BIN --> OUT

    style QA stroke:#f60,stroke-width:2px

Quality Assessment (ImageQualityAssessor) measures blur, skew angle, noise level, contrast, brightness uniformity, and text density. Based on the results, it recommends which stages to apply - so clean images skip unnecessary processing.

Deskew (SkewCorrector) corrects rotated documents using three methods: Hough line detection (default), minimum area rectangle, or projection profile analysis.

Denoise (NoiseReducer) offers Gaussian blur (fast), bilateral filter (edge-preserving), non-local means (highest quality), and morphological operations.

Binarize (InkExtractor) converts to clean black-and-white using Otsu, adaptive thresholding, Sauvola (for degraded historical documents), CLAHE + Otsu (for low contrast), or morphological background removal.
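
Conceptually the chain reads like the sketch below. ImageQualityAssessor, SkewCorrector, NoiseReducer, and InkExtractor are the real class names, but the method and property names here are purely illustrative pseudocode, not the package's actual API:

// Illustrative only: real class names, hypothetical method/property names
var quality = qualityAssessor.Assess(image);        // blur, skew angle, noise, contrast, text density

if (quality.RecommendDeskew)
    image = skewCorrector.Correct(image);           // Hough lines / min-area rect / projection profile

if (quality.RecommendDenoise)
    image = noiseReducer.Reduce(image);             // Gaussian / bilateral / non-local means / morphological

if (quality.RecommendBinarize)
    image = inkExtractor.Binarize(image);           // Otsu / adaptive / Sauvola / CLAHE + Otsu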

Enable it in config or on the CLI:

config.EnableAdvancedPreprocessing = true;
ocrner ocr damaged-scan.png -a

Microsoft.Recognizers: Rule-Based Entity Extraction

BERT NER finds people, organizations, locations, and miscellaneous entities. But some structured data - dates, phone numbers, emails, URLs, IP addresses - is better caught by deterministic rules than by a neural network.

Enable EnableRecognizers to add a second extraction pass using Microsoft.Recognizers.Text. This runs after NER and extracts:

| Type | Examples |
|------|----------|
| DateTime | "January 15, 2024", "next Tuesday", "last week" |
| Number | "42", "three million", "15%" |
| URL | "https://example.com", "www.github.com" |
| Phone | "555-1234", "+1 (555) 123-4567" |
| Email | "john@microsoft.com" |
| IP Address | "192.168.1.1" |

The recognizer supports multiple cultures (en-us, en-gb, de-de, fr-fr, etc.) so it handles locale-specific date formats and number conventions.

config.EnableRecognizers = true;
config.RecognizerCulture = "en-us";
ocrner ner "John Smith joined Microsoft on January 15, 2024. Call 555-1234." -r

The two extraction methods complement each other: BERT NER understands context ("Apple" the company vs. "apple" the fruit), while the recognizers reliably catch structured patterns that BERT might miss. The OcrNerResult model now includes an optional Signals property:

public class OcrNerResult
{
    public OcrResult OcrResult { get; init; } = new();
    public NerResult NerResult { get; init; } = new();
    public RecognizedSignals? Signals { get; init; }  // Only when EnableRecognizers = true
}
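
Under the hood this is the standard Microsoft.Recognizers.Text library. If you want a feel for what the rule-based pass does on its own, here's a minimal sketch using the raw library directly (not the package's ITextRecognizerService wrapper):

using Microsoft.Recognizers.Text;
using Microsoft.Recognizers.Text.DateTime;
using Microsoft.Recognizers.Text.Number;

var text = "The invoice is due on January 15, 2024 and totals three million dollars.";

// Each recognizer returns ModelResult items with Text, Start/End offsets, a TypeName, and a Resolution dictionary
var dates = DateTimeRecognizer.RecognizeDateTime(text, Culture.English);
var numbers = NumberRecognizer.RecognizeNumber(text, Culture.English);

foreach (var match in dates.Concat(numbers))
    Console.WriteLine($"{match.TypeName}: '{match.Text}' at {match.Start}-{match.End}");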

Florence-2 Vision

Florence-2 is a completely different approach from Tesseract. Where Tesseract is a specialized OCR engine that reads text character by character, Florence-2 is a vision model that understands the whole image - objects, scenes, people, and text.

// From IVisionService.cs
public interface IVisionService
{
    Task<VisionCaptionResult> CaptionAsync(string imagePath, bool detailed = true,
        CancellationToken ct = default);
    Task<VisionOcrResult> ExtractTextAsync(string imagePath,
        CancellationToken ct = default);
    Task<bool> IsAvailableAsync(CancellationToken ct = default);
}

Using it:

var vision = serviceProvider.GetRequiredService<IVisionService>();

// Generate a caption describing the image
var caption = await vision.CaptionAsync("photo.jpg", detailed: true);
if (caption.Success)
{
    // caption.Caption: "A man in a blue suit standing at a podium"
    // caption.DurationMs: how long it took
}

// Extract visible text using Florence-2's built-in OCR
var ocrResult = await vision.ExtractTextAsync("screenshot.png");
if (ocrResult.Success)
{
    // ocrResult.Text: the visible text Florence-2 detected
}

When to Use Which

| Use case | Tesseract (IOcrService) | Florence-2 (IVisionService) |
|----------|-------------------------|-----------------------------|
| Document scans | Best choice - fast, accurate | OK but overkill |
| Photos of signs | Decent | Better - understands scene context |
| Screenshots | Good | Good |
| Image captioning | Can't do this | Best choice |
| Speed | Fast (~100ms) | Slower (~1-3s) |
| Model size | ~4MB | ~450MB |

The point is efficiency: use Tesseract for documents and text extraction (it's 10x faster with a 100x smaller model). Use Florence-2 when you actually need image understanding.

Florence-2 auto-downloads its models (~450MB) on first use to {ModelDirectory}/florence2/.
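
A practical pattern is to let Tesseract go first and only reach for Florence-2 when OCR confidence comes back low. A sketch, assuming IOcrService exposes the same ExtractTextAsync(imagePath) the pipeline calls internally (the 0.6 threshold and file name are illustrative):

var ocr = serviceProvider.GetRequiredService<IOcrService>();
var vision = serviceProvider.GetRequiredService<IVisionService>();

// Cheap, fast attempt first
var ocrResult = await ocr.ExtractTextAsync("storefront-photo.jpg");
var text = ocrResult.Text;

// Fall back to the vision model only when Tesseract struggles
if (string.IsNullOrWhiteSpace(text) || ocrResult.Confidence < 0.6f)
{
    var visionResult = await vision.ExtractTextAsync("storefront-photo.jpg");
    if (visionResult.Success)
        text = visionResult.Text;
}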


How the NER Pipeline Works Internally

The NER pipeline follows the same three-step process covered in detail in Part 1: tokenize → infer → decode. Part 1 walks through every concept — WordPiece tokenization, ONNX tensor inference, BIO tag decoding, softmax confidence — from scratch with a complete buildable example.

Here's what the package adds beyond the manual approach:

Offset Tracking

Part 1's tokenizer converts text to token IDs. The package's BertNerTokenizer also tracks character offsets — so you know exactly where in the source text each entity was found:

// From BertNerTokenizer.cs
// "John Smith works at Microsoft" becomes:
// [CLS] John Smith works at Micro ##soft [SEP] [PAD] ...
//
// Each token tracks its source position:
// "John"     → chars 0-4
// "Smith"    → chars 5-10
// "Micro"    → chars 20-29  (WordPiece splits "Microsoft")
// "##soft"   → chars 20-29  (same source range)

This is how NerEntity.StartOffset and EndOffset work — they map back to exact character positions in your original text.
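
Those offsets make it easy to annotate the original text - for example, wrapping each entity in its label. A small sketch using only SourceText and the offsets from the models above (result is a NerResult from ExtractEntitiesAsync):

// Insert right-to-left so earlier offsets stay valid as the string grows
var annotated = new StringBuilder(result.SourceText);

foreach (var entity in result.Entities.OrderByDescending(e => e.StartOffset))
{
    annotated.Insert(entity.EndOffset, "]");
    annotated.Insert(entity.StartOffset, $"[{entity.Label}: ");
}

// "John Smith works at Microsoft" -> "[PER: John Smith] works at [ORG: Microsoft]"
Console.WriteLine(annotated);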

Confidence-Filtered Entity Extraction

Part 1's decoder produces all entities. The package filters during decoding — low-confidence noise never reaches your code:

// From NerService.cs
private void FlushEntity(
    List<NerEntity> entities, string text,
    string type, int start, int end, float confidence)
{
    if (confidence < _config.MinConfidence) return;  // Filter low-confidence

    var entityText = text[start..end].Trim();
    if (string.IsNullOrWhiteSpace(entityText)) return;

    entities.Add(new NerEntity
    {
        Text = entityText,
        Label = type,
        Confidence = confidence,
        StartOffset = start,
        EndOffset = end
    });
}

Auto-Download: How It Works

All models download automatically on first use. No manual setup needed.

flowchart TD
    CALL["First API call"]
    CHECK{"Files exist<br>in cache?"}
    YES[Use cached model]
    NO["Download to .tmp file"]
    MOVE["Atomic rename<br>.tmp → final"]

    CALL --> CHECK
    CHECK -->|Yes| YES
    CHECK -->|No| NO
    NO --> MOVE
    MOVE --> YES

    style NO stroke:#f60,stroke-width:3px
    style MOVE stroke:#090,stroke-width:3px

The ModelDownloader downloads from HuggingFace (NER model) and GitHub (tessdata). It uses an atomic .tmp pattern - if a download is interrupted, no corrupt files are left behind:

// From ModelDownloader.cs - atomic download pattern
await using var fileStream = new FileStream(tempPath, FileMode.Create,
    FileAccess.Write, FileShare.None, 81920, true);
// ... stream download to .tmp file ...
await fileStream.FlushAsync(ct);
fileStream.Close();

File.Move(tempPath, localPath, overwrite: true);  // Atomic rename

Default cache location: {AppBaseDir}/models/ocrner/

models/ocrner/
  ner/
    model.onnx      (~430MB - BERT NER)
    vocab.txt       (~230KB - WordPiece vocabulary)
    config.json     (~1KB - label mapping)
  tessdata/
    eng.traineddata (~4MB - English OCR data)
  florence2/
    ...             (~450MB - Vision model files)
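
If several apps run on the same machine, point them at a shared cache so the models only download once. ModelDirectory in config (or --model-dir on the CLI) controls where everything lands - the path below is just an example:

builder.Services.AddOcrNer(config =>
{
    config.ModelDirectory = "/var/cache/ocrner";  // shared, machine-wide cache instead of the per-app default
});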

Architecture

Everything is a singleton with lazy initialization. Expensive resources (ONNX InferenceSession, TesseractEngine, Florence-2 model) are created once on first use and reused for the lifetime of the application.

flowchart TD
    DI["AddOcrNer()"]

    DI --> MD["ModelDownloader<br>(singleton)"]
    DI --> PP["ImagePreprocessor<br>(singleton)"]
    DI --> CV["OpenCvPreprocessor<br>(singleton)"]
    DI --> NER["NerService<br>(singleton)"]
    DI --> OCR["OcrService<br>(singleton)"]
    DI --> PIPE["OcrNerPipeline<br>(singleton)"]
    DI --> REC["TextRecognizerService<br>(singleton)"]
    DI --> VIS["VisionService<br>(singleton)"]

    MD --> NER
    MD --> OCR
    PP --> OCR
    CV --> OCR
    NER --> PIPE
    OCR --> PIPE
    REC --> PIPE

    style DI stroke:#090,stroke-width:3px

Thread safety: all services use SemaphoreSlim for initialization. Multiple threads calling the service simultaneously on first use will only trigger one download/load:

// From NerService.cs - lazy init pattern used by all services
private async Task EnsureInitializedAsync(CancellationToken ct)
{
    if (_initialized) return;           // Fast path: already loaded

    await _initLock.WaitAsync(ct);      // Only one thread enters
    try
    {
        if (_initialized) return;       // Double-check after lock

        var paths = await _downloader.EnsureNerModelAsync(ct);
        _tokenizer = new BertNerTokenizer(paths.VocabPath, _config.MaxSequenceLength);
        _session = new InferenceSession(paths.ModelPath, sessionOptions);
        _initialized = true;
    }
    finally { _initLock.Release(); }
}

CLI Tool

The repo includes a command-line tool built with Spectre.Console. It's designed as a "pit of success" - just pass your input and it works.

Quick Start

# NER from text (auto-detected)
ocrner "John Smith works at Microsoft in Seattle"

# OCR from an image (auto-detected)
ocrner invoice.png

# Explicit commands
ocrner ner "Marie Curie won the Nobel Prize in Stockholm"
ocrner ocr scan.png
ocrner caption photo.jpg

Smart routing: the CLI auto-detects your intent. From Program.cs:

// From Program.cs - smart routing logic
if (IsImageFile(args2[0]) || IsGlobPattern(args2[0]) || Directory.Exists(args2[0]))
{
    args2 = ["ocr", .. args2];   // Image file → ocr command
}
else
{
    args2 = ["ner", .. args2];   // Text string → ner command
}

If you pass a text string, it runs NER. If you pass an image file, glob, or directory, it runs OCR + NER. No command needed.

Three Commands

| Command | What it does | Engine | Speed |
|---------|--------------|--------|-------|
| `ner <text>` | Extract entities from text | BERT NER (ONNX) | ~50ms |
| `ocr <path>` | OCR + NER from images | Tesseract + BERT | ~100-300ms |
| `caption <path>` | Image captioning + optional OCR | Florence-2 (ONNX) | ~1-3s |

Tesseract is the default OCR engine because it's 5-10x faster and optimized for document text. Florence-2 is for when you need image understanding (captions, scene text, photos of signs).

Real Output

Here's actual output from running the CLI against real sample documents.

NER from text:

ocrner ner "Marie Curie won the Nobel Prize in Stockholm"
╭──────┬─────────────┬────────────┬──────────╮
│ Type │ Entity      │ Confidence │ Position │
├──────┼─────────────┼────────────┼──────────┤
│ PER  │ Marie Curie │ 100%       │ 0-11     │
│ MISC │ Nobel Prize │ 100%       │ 20-31    │
│ LOC  │ Stockholm   │ 100%       │ 35-44    │
╰──────┴─────────────┴────────────┴──────────╯

NER with recognizers - combining BERT entities with rule-based signal extraction:

ocrner ner "Shelby Lucier from SCS Agency in Cambridge, UK sent an invoice on 13/02/15. Call 07981423683." -r
╭──────┬───────────────┬────────────┬──────────╮
│ Type │ Entity        │ Confidence │ Position │
├──────┼───────────────┼────────────┼──────────┤
│ PER  │ Shelby Lucier │ 100%       │ 0-13     │
│ ORG  │ SCS Agency    │ 100%       │ 19-29    │
│ LOC  │ Cambridge     │ 100%       │ 33-42    │
│ LOC  │ UK            │ 100%       │ 44-46    │
╰──────┴───────────────┴────────────┴──────────╯

── Recognized Signals ─────────────────────────
  Type       Text          Details
  DateTime   13/02/15      datetimeV2.date
  Phone      07981423683

BERT finds the people, organizations, and locations. The recognizers catch the date and phone number — structured patterns that a neural network would be unreliable at extracting.

OCR from a scanned document (an Amazon shareholder letter, scanned with hole-punch marks):

ocrner ocr shareholder-letter.jpg -q
╭──────┬───────────────┬────────────┬──────────╮
│ Type │ Entity        │ Confidence │ Position │
├──────┼───────────────┼────────────┼──────────┤
│ ORG  │ Amazon        │ 87%        │ 285-291  │
│ PER  │ Jeff          │ 99%        │ 293-297  │
│ ORG  │ AWS           │ 95%        │ 984-987  │
│ LOC  │ America       │ 98%        │ 2315-2322│
╰──────┴───────────────┴────────────┴──────────╯
OCR Confidence: 89%

Tesseract extracts near-verbatim text from the scanned letter at 89% confidence, and NER correctly identifies Amazon, Jeff (Bezos), AWS, and North America.

Tesseract vs Florence-2: A Real Comparison

Same scanned shareholder letter processed by both engines:

| | Tesseract (`ocrner ocr`) | Florence-2 (`ocrner caption --ocr`) |
|---|--------------------------|-------------------------------------|
| Speed | ~200ms | ~14s |
| OCR accuracy | Near-verbatim, 89% confidence | Heavily garbled, hallucinated phrases |
| Key text | "Over the past 25 years at Amazon, I've had the opportunity..." | "Over the past 25 years at Amazon. I've had the opportunity to write many narrative, email..." |
| NER entities | Jeff (PER), Amazon (ORG), AWS (ORG), America (LOC) | N/A (text too garbled for reliable NER) |
| Caption | N/A | "A paper with some text" |

Florence-2 is a vision model — it understands scenes, objects, and spatial relationships. It was never designed to compete with Tesseract at reading document text. Use it when you need image understanding (what's in this photo?), not text extraction (what does this document say?).

JSON Output for Automation & LLM Tools

# All images in a folder
dotnet run -- ocr ./documents/

# Batch captioning with Florence-2
dotnet run -- caption "photos/*.jpg" --ocr -o captions.md
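
The -o flag (see the options table below) also takes a .json path, which is the easiest way to get machine-readable results into scripts or LLM tools - the output file name here is illustrative:

# Structured JSON instead of console tables
dotnet run -- ocr invoice.png -o result.json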


### All CLI Options

| Flag | Applies to | Description |
|------|------------|-------------|
| `-c` | `ner`, `ocr` | Minimum entity confidence threshold (0.0-1.0) |
| `--language` | `ocr` | Tesseract language (for example `eng`, `fra`) |
| `--max-tokens` | `ner`, `ocr` | Maximum BERT sequence length |
| `--model-dir` | `ner`, `ocr`, `caption` | Model cache directory override |
| `-p`, `--preprocess` | `ocr`, `caption` | Preprocessing preset: `none`, `minimal`, `default`, `aggressive` |
| `-a`, `--advanced-preprocess` | `ocr`, `caption` | Use OpenCV preprocessing (deskew, denoise, binarize) |
| `-r`, `--recognizers` | `ner`, `ocr` | Enable rule-based extraction (dates, numbers, URLs, phones, emails, IPs) |
| `--culture` | `ner`, `ocr` | Recognizer culture, e.g. `en-us`, `de-de` (default: `en-us`) |
| `--brief` | `caption` | Generate a shorter, less detailed caption |
| `-q`, `--quiet` | `ner`, `ocr`, `caption` | Quiet mode (reduced console output) |
| `-o` | `ner`, `ocr`, `caption` | Output file path (`.txt`, `.md`, `.json`) |
| `--ocr` | `caption` | Also run OCR during caption command |
| `--ner` | `caption` | Extract NER from OCR text (implies `--ocr`) |

---

## Performance: Quantized Models and What's Next

The current NER model is the full-precision `protectai/bert-base-NER-onnx` (~430MB). For many use cases - especially on resource-constrained machines or when processing high volumes - a **quantized** (INT8) version of the same model would be significantly faster with minimal accuracy loss.

ONNX Runtime supports INT8 quantization out of the box, which typically reduces model size by ~4x and improves inference speed by 2-3x on CPU. This is on the roadmap. The `NerModelRepo` config option already supports pointing to a different HuggingFace repo, so when a quantized model is published you'd just change:

```json
{
  "OcrNer": {
    "NerModelRepo": "protectai/bert-base-NER-onnx-quantized"
  }
}
```

The architecture is designed for this - swap the model, keep the same API.


The Bigger Picture: Where This Fits

This package is a single-stage pipeline: one OCR engine, one NER model, one optional vision model. It's designed to be simple and efficient for the common case.

For more complex scenarios - reading text from anything (handwritten notes, photos of whiteboards, low-quality camera captures), with multi-engine OCR consensus, fuzzy matching, and structured extraction - check out the full pipeline at lucidRAG. That's where the production-grade, multi-phase version of this work lives.

What's Next: Multimodal LLMs

Florence-2 is the current ceiling for local vision in this package. The next logical step is a multimodal LLM - a model that can see an image and reason about it in natural language. Instead of separate OCR + NER steps, you'd send the image directly and ask for structured extraction.

Here's roughly what that API could look like:

// Hypothetical future IMultimodalService
public interface IMultimodalService
{
    Task<StructuredExtractionResult> ExtractAsync(
        string imagePath,
        string prompt = "Extract all people, organizations, and locations from this image. Return as JSON.",
        CancellationToken ct = default);
}

// Usage
var multimodal = serviceProvider.GetRequiredService<IMultimodalService>();
var result = await multimodal.ExtractAsync("business-card.jpg");

// result.Entities: [{ "John Smith", PER }, { "Acme Corp", ORG }, { "New York", LOC }]
// result.RawText: "John Smith, VP Engineering, Acme Corp, New York, NY 10001"
// result.Summary: "Business card for John Smith at Acme Corp in New York"

Small local multimodal models (like Phi-3.5-vision or LLaVA) are getting good enough for this. The trade-off is always the same: bigger model = smarter but slower. The right choice depends on your latency budget and accuracy requirements.

flowchart LR
    subgraph Staged["Staged Approach: Pick Your Level"]
        T1["Tesseract OCR<br>4MB | ~100ms<br>Text extraction"]
        T2["BERT NER<br>430MB | ~50ms<br>Entity extraction"]
        T3["Florence-2<br>450MB | ~1-3s<br>Image understanding"]
        T4["Multimodal LLM<br>2-8GB | ~5-30s<br>Full reasoning"]
    end

    T1 --> T2
    T2 --> T3
    T3 -.->|"future"| T4

    style T1 stroke:#090,stroke-width:2px
    style T2 stroke:#090,stroke-width:2px
    style T3 stroke:#f60,stroke-width:2px
    style T4 stroke:#999,stroke-width:2px,stroke-dasharray: 5 5

Each tier adds capability at the cost of size and latency. The package currently covers tiers 1-3. Tier 4 is where multimodal LLMs come in - and where lucidRAG is heading.


Resources

This Package:

Part 1:

Dependencies:

Related Articles:

