In Part 1 I showed the raw pipeline: manually downloading models, writing a tokenizer, wiring up ONNX inference, and decoding BIO tags by hand. Educational, but a lot of plumbing to get right.
Now it's a NuGet package. One line of setup, no manual model downloads - everything is fetched automatically on first use.
Note: This package is a simplified, focused tool for extracting text and entities from images. If you need a full multi-phase pipeline that can read text from anything (photos, documents, screenshots, handwriting, animated gifs and even videos) with fuzzy matching, OCR consensus, and structured extraction, check out lucidRAG where the production-grade version of this pipeline lives.
Before we dive in, a quick word on the two engines involved. The NER model is BERT exported to ONNX; ONNX Runtime executes it from an .onnx file, locally, using just your CPU.
Tesseract handles the OCR side. It is strong for clean document text, but it falls down on noisy photos, low-contrast scans, and mixed "scene + text" images. It also stops at raw text - you still need extra code to turn that text into structured entities you can actually use.
This package closes those gaps: one call to AddOcrNer() and you're done.

flowchart LR
subgraph Part1["Part 1: Manual"]
M1[Download models]
M2[Write tokenizer]
M3[Wire ONNX]
M4[BIO decode]
end
subgraph Part2["Part 2: NuGet Package"]
N1["AddOcrNer()"]
N2[Auto-download]
N3[ImageSharp + OpenCV]
N4[Florence-2]
N5[Recognizers]
N6[CLI Tool]
end
Part1 -->|"packaged into"| Part2
style N1 stroke:#090,stroke-width:3px
style N2 stroke:#090,stroke-width:3px
style N3 stroke:#f60,stroke-width:3px
style N4 stroke:#f60,stroke-width:3px
style N5 stroke:#f60,stroke-width:3px
style N6 stroke:#f60,stroke-width:3px
Part 1 was educational - understanding what each piece does. Part 2 is practical - using it without thinking about the internals.
dotnet add package Mostlylucid.OcrNer
The AddOcrNer() extension method registers everything: OCR, NER, the combined pipeline, Florence-2 vision, the model downloader, and the image preprocessor. All as singletons, all lazy-initialized.
Here's the actual registration code from ServiceCollectionExtensions.cs:
// Option 1: From appsettings.json (reads the "OcrNer" section)
builder.Services.AddOcrNer(builder.Configuration);
// Option 2: Inline configuration
builder.Services.AddOcrNer(config =>
{
config.EnableOcr = true;
config.TesseractLanguage = "eng";
config.MinConfidence = 0.5f;
});
That's it. No model downloads, no file paths, no ONNX wiring. Under the hood, AddOcrNer() registers these services:
// From ServiceCollectionExtensions.cs - what gets registered
services.AddSingleton<ModelDownloader>(); // Auto-downloads models on first use
services.AddSingleton<ImagePreprocessor>(); // ImageSharp-based image enhancement
services.AddSingleton<OpenCvPreprocessor>(); // OpenCV advanced preprocessing
services.AddSingleton<INerService, NerService>(); // BERT NER from text
services.AddSingleton<IOcrService, OcrService>(); // Tesseract OCR from images
services.AddSingleton<IOcrNerPipeline, OcrNerPipeline>(); // Combined OCR + NER
services.AddSingleton<ITextRecognizerService, TextRecognizerService>(); // Microsoft.Recognizers
services.AddSingleton<IVisionService, VisionService>(); // Florence-2 vision
{
"OcrNer": {
"EnableOcr": true,
"TesseractLanguage": "eng",
"MinConfidence": 0.5,
"MaxSequenceLength": 512,
"ModelDirectory": "models/ocrner",
"Preprocessing": "Default",
"EnableAdvancedPreprocessing": false,
"EnableRecognizers": false,
"RecognizerCulture": "en-us"
}
}
Here's the actual OcrNerConfig class these map to:
// From OcrNerConfig.cs
public class OcrNerConfig
{
public string ModelDirectory { get; set; } =
Path.Combine(AppContext.BaseDirectory, "models", "ocrner");
public bool EnableOcr { get; set; } = true;
public string TesseractLanguage { get; set; } = "eng";
public int MaxSequenceLength { get; set; } = 512;
public float MinConfidence { get; set; } = 0.5f;
public string NerModelRepo { get; set; } = "protectai/bert-base-NER-onnx";
public PreprocessingLevel Preprocessing { get; set; } = PreprocessingLevel.Default;
public bool EnableAdvancedPreprocessing { get; set; } = false; // OpenCV pipeline
public bool EnableRecognizers { get; set; } = false; // Microsoft.Recognizers
public string RecognizerCulture { get; set; } = "en-us"; // Recognizer language
}
All settings have sensible defaults. You can omit the entire section and everything works. The two opt-in features (EnableAdvancedPreprocessing and EnableRecognizers) default to false so the package stays lightweight for users who don't need them.
The Preprocessing option controls image enhancement before OCR:
| Value | What it does | When to use |
|---|---|---|
| `None` | No preprocessing | Images are already optimized |
| `Minimal` | Grayscale only | Clean scans |
| `Default` | Grayscale + contrast + sharpen | Most images (recommended) |
| `Aggressive` | Strong contrast + sharpen + upscale | Poor quality photos |
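For example, if the input is mostly phone-camera photos of receipts, you might bump the level up using the inline configuration overload shown earlier (the enum values mirror the table above):

```csharp
builder.Services.AddOcrNer(config =>
{
    // Poor-quality photos: stronger contrast, sharpening and upscaling before OCR
    config.Preprocessing = PreprocessingLevel.Aggressive;
});
```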
The package registers five services, each usable independently. Pick the one that fits your use case - there's no need to load Florence-2 if all you need is NER from text.
flowchart TD
subgraph Services
NER["INerService<br>Text → Entities"]
OCR["IOcrService<br>Image → Text"]
REC["ITextRecognizerService<br>Text → Signals"]
PIPE["IOcrNerPipeline<br>Image → Entities + Signals"]
VIS["IVisionService<br>Image → Caption"]
end
OCR --> PIPE
NER --> PIPE
REC -.-> PIPE
style PIPE stroke:#090,stroke-width:3px
style VIS stroke:#f60,stroke-width:3px
style REC stroke:#f60,stroke-width:2px,stroke-dasharray: 5 5
The key principle is efficiency: pick the lightest tool that does the job. Don't load a 450MB vision model when a 4MB OCR engine will do.
| Service | What it does | Model size | Speed | Use when... |
|---|---|---|---|---|
| `INerService` | BERT NER from text | ~430MB | ~50ms | You already have text (PDFs, databases, user input) |
| `IOcrService` | Tesseract OCR from images | ~4MB | ~100ms | You need text from document scans, screenshots |
| `IOcrNerPipeline` | OCR then NER in one call | Both models | ~150ms | You have images and want entities in one step |
| `ITextRecognizerService` | Rule-based extraction (dates, phones, etc.) | None | ~1ms | You want structured data alongside NER entities |
| `IVisionService` | Florence-2 captioning + OCR | ~450MB | ~1-3s | You need image understanding, not just text reading |
If you already have text (from PDFs, databases, user input), you can use NER directly. This is the fastest path - no OCR, no image processing, just text in, entities out.
The INerService interface is simple - one method:
// From INerService.cs
public interface INerService
{
Task<NerResult> ExtractEntitiesAsync(string text, CancellationToken ct = default);
}
Here's how to use it in your own service:
public class MyService
{
private readonly INerService _nerService;
public MyService(INerService nerService)
{
_nerService = nerService;
}
public async Task ProcessDocumentAsync(string text)
{
var result = await _nerService.ExtractEntitiesAsync(text);
foreach (var entity in result.Entities)
{
// entity.Label: "PER", "ORG", "LOC", or "MISC"
// entity.Text: "John Smith"
// entity.Confidence: 0.9996
// entity.StartOffset / EndOffset: character positions in the source
}
}
}
The result models are straightforward:
// From NerResult.cs / NerEntity.cs
public class NerResult
{
public string SourceText { get; init; } = string.Empty;
public List<NerEntity> Entities { get; init; } = [];
}
public class NerEntity
{
public string Text { get; init; } = string.Empty; // "John Smith"
public string Label { get; init; } = string.Empty; // "PER", "ORG", "LOC", "MISC"
public float Confidence { get; init; } // 0.0 to 1.0
public int StartOffset { get; init; } // Where in the source text
public int EndOffset { get; init; } // End position (exclusive)
}
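Since Entities is an ordinary List&lt;NerEntity&gt;, plain LINQ is enough to slice the results - a small example, using the same service-resolution style as the rest of this post:

```csharp
var nerService = serviceProvider.GetRequiredService<INerService>();
var result = await nerService.ExtractEntitiesAsync("John Smith works at Microsoft in Seattle");

// Group entities by label: PER, ORG, LOC, MISC
var byLabel = result.Entities
    .GroupBy(e => e.Label)
    .ToDictionary(g => g.Key, g => g.Select(e => e.Text).Distinct().ToList());

// Keep only people the model is confident about
var people = result.Entities
    .Where(e => e.Label == "PER" && e.Confidence >= 0.9f)
    .Select(e => e.Text)
    .Distinct()
    .ToList();
```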
The first call downloads the BERT NER model (~430MB) from HuggingFace. Subsequent calls use the cached model - startup is instant.
For images, the pipeline handles preprocessing, OCR, and NER in one call. The IOcrNerPipeline combines IOcrService and INerService:
// From OcrNerPipeline.cs - the actual pipeline code
public async Task<OcrNerResult> ProcessImageAsync(string imagePath, CancellationToken ct = default)
{
// Step 1: OCR (includes preprocessing automatically)
var ocrResult = await _ocrService.ExtractTextAsync(imagePath, ct);
if (string.IsNullOrWhiteSpace(ocrResult.Text))
return new OcrNerResult
{
OcrResult = ocrResult,
NerResult = new NerResult { SourceText = string.Empty }
};
// Step 2: NER on extracted text
var nerResult = await _nerService.ExtractEntitiesAsync(ocrResult.Text, ct);
return new OcrNerResult
{
OcrResult = ocrResult,
NerResult = nerResult
};
}
Using it:
var pipeline = serviceProvider.GetRequiredService<IOcrNerPipeline>();
var result = await pipeline.ProcessImageAsync("invoice.png");
// What OCR found
var text = result.OcrResult.Text; // The full extracted text
var confidence = result.OcrResult.Confidence; // 0.0 to 1.0
// What NER found in that text
foreach (var entity in result.NerResult.Entities)
{
// [PER] John Smith, [ORG] Microsoft, [LOC] Seattle...
}
flowchart LR
IMG[Image bytes]
PRE["ImageSharp<br>or OpenCV"]
TESS["Tesseract<br>OCR"]
TOK["WordPiece<br>Tokenize"]
BERT["BERT NER<br>ONNX"]
REC["Recognizers<br>(optional)"]
OUT[Result]
IMG --> PRE
PRE --> TESS
TESS --> TOK
TOK --> BERT
BERT --> REC
REC --> OUT
style PRE stroke:#f60,stroke-width:3px
style BERT stroke:#f60,stroke-width:3px
style REC stroke:#f60,stroke-width:2px,stroke-dasharray: 5 5
Part 1 had raw Tesseract calls. In practice, both Tesseract and Florence-2 work better with preprocessed images. Preprocessing is on by default but completely optional - you can disable it with Preprocessing = "None" in config or --preprocess none on the CLI.
The ImagePreprocessor uses ImageSharp (pure C#, no native dependencies):
// From ImagePreprocessor.cs - the actual preprocessing steps
public byte[] Preprocess(byte[] imageBytes, PreprocessingOptions? options = null)
{
options ??= PreprocessingOptions.Default;
using var image = Image.Load<Rgba32>(imageBytes);
image.Mutate(ctx =>
{
// Step 1: Upscale small images (Tesseract wants 300+ DPI equivalent)
if (options.EnableUpscale && (image.Width < options.MinWidth || image.Height < options.MinHeight))
{
var scale = Math.Max(
(float)options.MinWidth / image.Width,
(float)options.MinHeight / image.Height);
scale = Math.Min(scale, options.MaxUpscaleFactor);
ctx.Resize((int)(image.Width * scale), (int)(image.Height * scale),
KnownResamplers.Lanczos3);
}
// Step 2: Grayscale (single channel = faster, more accurate)
if (options.EnableGrayscale)
ctx.Grayscale();
// Step 3: Contrast boost (text stands out from background)
if (options.EnableContrast && options.ContrastAmount != 1.0f)
ctx.Contrast(options.ContrastAmount);
// Step 4: Sharpen (crisp character edges)
if (options.EnableSharpen)
ctx.GaussianSharpen(options.SharpenSigma);
});
using var ms = new MemoryStream();
image.SaveAsPng(ms); // PNG = lossless, no additional artifacts
return ms.ToArray();
}
Three presets are built in. The PreprocessingOptions class defines them:
// From ImagePreprocessor.cs
public static PreprocessingOptions Default => new(); // Grayscale + 1.5x contrast + sharpen
public static PreprocessingOptions Minimal => new() // Grayscale only
{
EnableContrast = false,
EnableSharpen = false,
EnableUpscale = false
};
public static PreprocessingOptions Aggressive => new() // For poor quality images
{
ContrastAmount = 1.8f,
SharpenSigma = 1.5f,
MinWidth = 1024,
MinHeight = 768,
MaxUpscaleFactor = 4.0f
};
| Preset | When to use | What it does |
|---|---|---|
| `Default` | Most images | Grayscale + 1.5x contrast + light sharpen |
| `Minimal` | Clean scans | Grayscale only |
| `Aggressive` | Poor quality photos | 1.8x contrast + strong sharpen + larger upscale |
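The preprocessor is registered in DI, so you can also run it standalone to see exactly what Tesseract will be fed. A minimal sketch (the file names are just placeholders):

```csharp
var preprocessor = serviceProvider.GetRequiredService<ImagePreprocessor>();

// Run the Aggressive preset over a rough photo and save the result for inspection
var original = await File.ReadAllBytesAsync("rough-photo.jpg");
var cleaned = preprocessor.Preprocess(original, PreprocessingOptions.Aggressive);
await File.WriteAllBytesAsync("rough-photo-cleaned.png", cleaned);
```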
For seriously degraded documents - skewed scans, noisy photos, faded historical pages - the ImageSharp pipeline isn't enough. Enable EnableAdvancedPreprocessing to switch to a full OpenCV pipeline ported from ImageSummarizer.
The OpenCV preprocessor chains four stages, each driven by an automatic quality assessment:
flowchart LR
IMG[Image]
QA["Quality<br>Assess"]
SK["Deskew"]
DN["Denoise"]
BIN["Binarize"]
OUT[Clean image]
IMG --> QA
QA --> SK
SK --> DN
DN --> BIN
BIN --> OUT
style QA stroke:#f60,stroke-width:2px
Quality Assessment (ImageQualityAssessor) measures blur, skew angle, noise level, contrast, brightness uniformity, and text density. Based on the results, it recommends which stages to apply - so clean images skip unnecessary processing.
Deskew (SkewCorrector) corrects rotated documents using three methods: Hough line detection (default), minimum area rectangle, or projection profile analysis.
Denoise (NoiseReducer) offers Gaussian blur (fast), bilateral filter (edge-preserving), non-local means (highest quality), and morphological operations.
Binarize (InkExtractor) converts to clean black-and-white using Otsu, adaptive thresholding, Sauvola (for degraded historical documents), CLAHE + Otsu (for low contrast), or morphological background removal.
Enable it in config or on the CLI:
config.EnableAdvancedPreprocessing = true;
ocrner ocr damaged-scan.png -a
BERT NER finds people, organizations, locations, and miscellaneous entities. But some structured data - dates, phone numbers, emails, URLs, IP addresses - is better caught by deterministic rules than by a neural network.
Enable EnableRecognizers to add a second extraction pass using Microsoft.Recognizers.Text. This runs after NER and extracts:
| Type | Examples |
|---|---|
| DateTime | "January 15, 2024", "next Tuesday", "last week" |
| Number | "42", "three million", "15%" |
| URL | "https://example.com", "www.github.com" |
| Phone | "555-1234", "+1 (555) 123-4567" |
| "john@microsoft.com" | |
| IP Address | "192.168.1.1" |
The recognizer supports multiple cultures (en-us, en-gb, de-de, fr-fr, etc.) so it handles locale-specific date formats and number conventions.
config.EnableRecognizers = true;
config.RecognizerCulture = "en-us";
ocrner ner "John Smith joined Microsoft on January 15, 2024. Call 555-1234." -r
The two extraction methods complement each other: BERT NER understands context ("Apple" the company vs. "apple" the fruit), while the recognizers reliably catch structured patterns that BERT might miss. The OcrNerResult model now includes an optional Signals property:
public class OcrNerResult
{
public OcrResult OcrResult { get; init; } = new();
public NerResult NerResult { get; init; } = new();
public RecognizedSignals? Signals { get; init; } // Only when EnableRecognizers = true
}
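Putting it together: enable the recognizers, run the pipeline, and read both sets of results. This is a sketch; the members of RecognizedSignals aren't shown in this post, so the access inside the null check is left as a comment:

```csharp
builder.Services.AddOcrNer(config =>
{
    config.EnableRecognizers = true;
    config.RecognizerCulture = "en-gb";   // UK-style dates like 13/02/15
});

// Later, resolve and run the combined pipeline
var pipeline = serviceProvider.GetRequiredService<IOcrNerPipeline>();
var result = await pipeline.ProcessImageAsync("invoice.png");

// BERT entities as usual
foreach (var entity in result.NerResult.Entities)
{
    // [PER], [ORG], [LOC], [MISC]
}

// Rule-based signals: only populated when EnableRecognizers = true
if (result.Signals is not null)
{
    // Dates, phone numbers, emails, URLs, IPs found by Microsoft.Recognizers
}
```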
Florence-2 is a completely different approach from Tesseract. Where Tesseract is a specialized OCR engine that reads text character by character, Florence-2 is a vision model that understands the whole image - objects, scenes, people, and text.
// From IVisionService.cs
public interface IVisionService
{
Task<VisionCaptionResult> CaptionAsync(string imagePath, bool detailed = true,
CancellationToken ct = default);
Task<VisionOcrResult> ExtractTextAsync(string imagePath,
CancellationToken ct = default);
Task<bool> IsAvailableAsync(CancellationToken ct = default);
}
Using it:
var vision = serviceProvider.GetRequiredService<IVisionService>();
// Generate a caption describing the image
var caption = await vision.CaptionAsync("photo.jpg", detailed: true);
if (caption.Success)
{
// caption.Caption: "A man in a blue suit standing at a podium"
// caption.DurationMs: how long it took
}
// Extract visible text using Florence-2's built-in OCR
var ocrResult = await vision.ExtractTextAsync("screenshot.png");
if (ocrResult.Success)
{
// ocrResult.Text: the visible text Florence-2 detected
}
| Use case | Tesseract (`IOcrService`) | Florence-2 (`IVisionService`) |
|---|---|---|
| Document scans | Best choice - fast, accurate | OK but overkill |
| Photos of signs | Decent | Better - understands scene context |
| Screenshots | Good | Good |
| Image captioning | Can't do this | Best choice |
| Speed | Fast (~100ms) | Slower (~1-3s) |
| Model size | ~4MB | ~450MB |
The point is efficiency: use Tesseract for documents and text extraction (it's 10x faster with a 100x smaller model). Use Florence-2 when you actually need image understanding.
Florence-2 auto-downloads its models (~450MB) on first use to {ModelDirectory}/florence2/.
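Since that download only happens when Florence-2 is first used, it can be worth gating vision calls behind IsAvailableAsync and falling back to the lightweight Tesseract path. Exactly what "available" means is up to the implementation, so treat this as a sketch of the intent rather than the package's prescribed pattern:

```csharp
var vision = serviceProvider.GetRequiredService<IVisionService>();
var ocr = serviceProvider.GetRequiredService<IOcrService>();

string text;
if (await vision.IsAvailableAsync())
{
    // Florence-2 is usable: let it read scene text from the photo
    var visionOcr = await vision.ExtractTextAsync("sign-photo.jpg");
    text = visionOcr.Success ? visionOcr.Text : string.Empty;
}
else
{
    // Fall back to Tesseract (4MB, ~100ms) rather than waiting on a 450MB model
    var ocrResult = await ocr.ExtractTextAsync("sign-photo.jpg", CancellationToken.None);
    text = ocrResult.Text;
}
```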
The NER pipeline follows the same three-step process covered in detail in Part 1: tokenize → infer → decode. Part 1 walks through every concept — WordPiece tokenization, ONNX tensor inference, BIO tag decoding, softmax confidence — from scratch with a complete buildable example.
Here's what the package adds beyond the manual approach:
Part 1's tokenizer converts text to token IDs. The package's BertNerTokenizer also tracks character offsets — so you know exactly where in the source text each entity was found:
// From BertNerTokenizer.cs
// "John Smith works at Microsoft" becomes:
// [CLS] John Smith works at Micro ##soft [SEP] [PAD] ...
//
// Each token tracks its source position:
// "John" → chars 0-4
// "Smith" → chars 5-10
// "Micro" → chars 20-29 (WordPiece splits "Microsoft")
// "##soft" → chars 20-29 (same source range)
This is how NerEntity.StartOffset and EndOffset work — they map back to exact character positions in your original text.
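Those offsets make it easy to project entities back onto the original string - for example, wrapping each one in markers for display. This assumes the offsets are non-overlapping (which BIO decoding guarantees) and that SourceText carries the input text:

```csharp
var result = await nerService.ExtractEntitiesAsync("John Smith works at Microsoft");

// Insert markers from the last entity backwards so earlier offsets stay valid
var annotated = result.SourceText;
foreach (var entity in result.Entities.OrderByDescending(e => e.StartOffset))
{
    annotated = annotated
        .Insert(entity.EndOffset, "]")
        .Insert(entity.StartOffset, $"[{entity.Label}:");
}

// "John Smith works at Microsoft" -> "[PER:John Smith] works at [ORG:Microsoft]"
```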
Part 1's decoder produces all entities. The package filters during decoding — low-confidence noise never reaches your code:
// From NerService.cs
private void FlushEntity(
List<NerEntity> entities, string text,
string type, int start, int end, float confidence)
{
if (confidence < _config.MinConfidence) return; // Filter low-confidence
var entityText = text[start..end].Trim();
if (string.IsNullOrWhiteSpace(entityText)) return;
entities.Add(new NerEntity
{
Text = entityText,
Label = type,
Confidence = confidence,
StartOffset = start,
EndOffset = end
});
}
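The threshold FlushEntity compares against is the same MinConfidence from the configuration section, so tightening or loosening the filter is a one-line change:

```csharp
builder.Services.AddOcrNer(config =>
{
    // Entities scoring below 0.8 are dropped inside FlushEntity and never surface
    config.MinConfidence = 0.8f;
});
```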
All models download automatically on first use. No manual setup needed.
flowchart TD
CALL["First API call"]
CHECK{"Files exist<br>in cache?"}
YES[Use cached model]
NO["Download to .tmp file"]
MOVE["Atomic rename<br>.tmp → final"]
CALL --> CHECK
CHECK -->|Yes| YES
CHECK -->|No| NO
NO --> MOVE
MOVE --> YES
style NO stroke:#f60,stroke-width:3px
style MOVE stroke:#090,stroke-width:3px
The ModelDownloader downloads from HuggingFace (NER model) and GitHub (tessdata). It uses an atomic .tmp pattern - if a download is interrupted, no corrupt files are left behind:
// From ModelDownloader.cs - atomic download pattern
await using var fileStream = new FileStream(tempPath, FileMode.Create,
FileAccess.Write, FileShare.None, 81920, true);
// ... stream download to .tmp file ...
await fileStream.FlushAsync(ct);
fileStream.Close();
File.Move(tempPath, localPath, overwrite: true); // Atomic rename
Default cache location: {AppBaseDir}/models/ocrner/
models/ocrner/
ner/
model.onnx (~430MB - BERT NER)
vocab.txt (~230KB - WordPiece vocabulary)
config.json (~1KB - label mapping)
tessdata/
eng.traineddata (~4MB - English OCR data)
florence2/
... (~450MB - Vision model files)
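If the application's base directory is a bad place for roughly 900MB of model files (containers, read-only deployments), point ModelDirectory somewhere else. The path below is just an example:

```csharp
builder.Services.AddOcrNer(config =>
{
    // Example path: keep the model cache on a persistent volume so the
    // files survive restarts and are not re-downloaded
    config.ModelDirectory = "/var/cache/ocrner";
});
```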
Everything is a singleton with lazy initialization. Expensive resources (ONNX InferenceSession, TesseractEngine, Florence-2 model) are created once on first use and reused for the lifetime of the application.
flowchart TD
DI["AddOcrNer()"]
DI --> MD["ModelDownloader<br>(singleton)"]
DI --> PP["ImagePreprocessor<br>(singleton)"]
DI --> CV["OpenCvPreprocessor<br>(singleton)"]
DI --> NER["NerService<br>(singleton)"]
DI --> OCR["OcrService<br>(singleton)"]
DI --> PIPE["OcrNerPipeline<br>(singleton)"]
DI --> REC["TextRecognizerService<br>(singleton)"]
DI --> VIS["VisionService<br>(singleton)"]
MD --> NER
MD --> OCR
PP --> OCR
CV --> OCR
NER --> PIPE
OCR --> PIPE
REC --> PIPE
style DI stroke:#090,stroke-width:3px
Thread safety: all services use SemaphoreSlim for initialization. Multiple threads calling the service simultaneously on first use will only trigger one download/load:
// From NerService.cs - lazy init pattern used by all services
private async Task EnsureInitializedAsync(CancellationToken ct)
{
if (_initialized) return; // Fast path: already loaded
await _initLock.WaitAsync(ct); // Only one thread enters
try
{
if (_initialized) return; // Double-check after lock
var paths = await _downloader.EnsureNerModelAsync(ct);
_tokenizer = new BertNerTokenizer(paths.VocabPath, _config.MaxSequenceLength);
_session = new InferenceSession(paths.ModelPath, sessionOptions);
_initialized = true;
}
finally { _initLock.Release(); }
}
The repo includes a command-line tool built with Spectre.Console. It's designed as a "pit of success" - just pass your input and it works.
# NER from text (auto-detected)
ocrner "John Smith works at Microsoft in Seattle"
# OCR from an image (auto-detected)
ocrner invoice.png
# Explicit commands
ocrner ner "Marie Curie won the Nobel Prize in Stockholm"
ocrner ocr scan.png
ocrner caption photo.jpg
Smart routing: the CLI auto-detects your intent. From Program.cs:
// From Program.cs - smart routing logic
if (IsImageFile(args2[0]) || IsGlobPattern(args2[0]) || Directory.Exists(args2[0]))
{
args2 = ["ocr", .. args2]; // Image file → ocr command
}
else
{
args2 = ["ner", .. args2]; // Text string → ner command
}
If you pass a text string, it runs NER. If you pass an image file, glob, or directory, it runs OCR + NER. No command needed.
| Command | What it does | Engine | Speed |
|---|---|---|---|
| `ner <text>` | Extract entities from text | BERT NER (ONNX) | ~50ms |
| `ocr <path>` | OCR + NER from images | Tesseract + BERT | ~100-300ms |
| `caption <path>` | Image captioning + optional OCR | Florence-2 (ONNX) | ~1-3s |
Tesseract is the default OCR engine because it's 5-10x faster and optimized for document text. Florence-2 is for when you need image understanding (captions, scene text, photos of signs).
Here's actual output from running the CLI against real sample documents.
NER from text:
ocrner ner "Marie Curie won the Nobel Prize in Stockholm"
╭──────┬─────────────┬────────────┬──────────╮
│ Type │ Entity │ Confidence │ Position │
├──────┼─────────────┼────────────┼──────────┤
│ PER │ Marie Curie │ 100% │ 0-11 │
│ MISC │ Nobel Prize │ 100% │ 20-31 │
│ LOC │ Stockholm │ 100% │ 35-44 │
╰──────┴─────────────┴────────────┴──────────╯
NER with recognizers - combining BERT entities with rule-based signal extraction:
ocrner ner "Shelby Lucier from SCS Agency in Cambridge, UK sent an invoice on 13/02/15. Call 07981423683." -r
╭──────┬───────────────┬────────────┬──────────╮
│ Type │ Entity │ Confidence │ Position │
├──────┼───────────────┼────────────┼──────────┤
│ PER │ Shelby Lucier │ 100% │ 0-13 │
│ ORG │ SCS Agency │ 100% │ 19-29 │
│ LOC │ Cambridge │ 100% │ 33-42 │
│ LOC │ UK │ 100% │ 44-46 │
╰──────┴───────────────┴────────────┴──────────╯
── Recognized Signals ─────────────────────────
Type Text Details
DateTime 13/02/15 datetimeV2.date
Phone 07981423683
BERT finds the people, organizations, and locations. The recognizers catch the date and phone number — structured patterns that a neural network would be unreliable at extracting.
OCR from a scanned document (an Amazon shareholder letter, scanned with hole-punch marks):
ocrner ocr shareholder-letter.jpg -q
╭──────┬───────────────┬────────────┬──────────╮
│ Type │ Entity │ Confidence │ Position │
├──────┼───────────────┼────────────┼──────────┤
│ ORG │ Amazon │ 87% │ 285-291 │
│ PER │ Jeff │ 99% │ 293-297 │
│ ORG │ AWS │ 95% │ 984-987 │
│ LOC │ America │ 98% │ 2315-2322│
╰──────┴───────────────┴────────────┴──────────╯
OCR Confidence: 89%
Tesseract extracts near-verbatim text from the scanned letter at 89% confidence, and NER correctly identifies Amazon, Jeff (Bezos), AWS, and North America.
Same scanned shareholder letter processed by both engines:
| | Tesseract (`ocrner ocr`) | Florence-2 (`ocrner caption --ocr`) |
|---|---|---|
| Speed | ~200ms | ~14s |
| OCR accuracy | Near-verbatim, 89% confidence | Heavily garbled, hallucinated phrases |
| Key text | "Over the past 25 years at Amazon, I've had the opportunity..." | "Over the past 25 years at Amazon. I've had the opportunity to write many narrative, email..." |
| NER entities | Jeff (PER), Amazon (ORG), AWS (ORG), America (LOC) | N/A (text too garbled for reliable NER) |
| Caption | N/A | "A paper with some text" |
Florence-2 is a vision model — it understands scenes, objects, and spatial relationships. It was never designed to compete with Tesseract at reading document text. Use it when you need image understanding (what's in this photo?), not text extraction (what does this document say?).
dotnet run -- ocr ./documents/
dotnet run -- caption "photos/*.jpg" --ocr -o captions.md
### All CLI Options
| Flag | Applies to | Description |
|------|------------|-------------|
| `-c` | `ner`, `ocr` | Minimum entity confidence threshold (0.0-1.0) |
| `--language` | `ocr` | Tesseract language (for example `eng`, `fra`) |
| `--max-tokens` | `ner`, `ocr` | Maximum BERT sequence length |
| `--model-dir` | `ner`, `ocr`, `caption` | Model cache directory override |
| `-p`, `--preprocess` | `ocr`, `caption` | Preprocessing preset: `none`, `minimal`, `default`, `aggressive` |
| `-a`, `--advanced-preprocess` | `ocr`, `caption` | Use OpenCV preprocessing (deskew, denoise, binarize) |
| `-r`, `--recognizers` | `ner`, `ocr` | Enable rule-based extraction (dates, numbers, URLs, phones, emails, IPs) |
| `--culture` | `ner`, `ocr` | Recognizer culture, e.g. `en-us`, `de-de` (default: `en-us`) |
| `--brief` | `caption` | Generate a shorter, less detailed caption |
| `-q`, `--quiet` | `ner`, `ocr`, `caption` | Quiet mode (reduced console output) |
| `-o` | `ner`, `ocr`, `caption` | Output file path (`.txt`, `.md`, `.json`) |
| `--ocr` | `caption` | Also run OCR during caption command |
| `--ner` | `caption` | Extract NER from OCR text (implies `--ocr`) |
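A couple of combined invocations (the file and folder names are placeholders):
# Folder of rough scans: OpenCV preprocessing, recognizers with UK culture, JSON output
ocrner ocr ./scans/ -a -r --culture en-gb -o results.json
# Caption a set of photos and also run OCR + NER on any text Florence-2 finds
ocrner caption "photos/*.jpg" --ocr --ner -o captions.md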
---
## Performance: Quantized Models and What's Next
The current NER model is the full-precision `protectai/bert-base-NER-onnx` (~430MB). For many use cases - especially on resource-constrained machines or when processing high volumes - a **quantized** (INT8) version of the same model would be significantly faster with minimal accuracy loss.
ONNX Runtime supports INT8 quantization out of the box, which typically reduces model size by ~4x and improves inference speed by 2-3x on CPU. This is on the roadmap. The `NerModelRepo` config option already supports pointing to a different HuggingFace repo, so when a quantized model is published you'd just change:
```json
{
  "OcrNer": {
    "NerModelRepo": "protectai/bert-base-NER-onnx-quantized"
  }
}
```
The architecture is designed for this - swap the model, keep the same API.
This package is a single-stage pipeline: one OCR engine, one NER model, one optional vision model. It's designed to be simple and efficient for the common case.
For more complex scenarios - reading text from anything (handwritten notes, photos of whiteboards, low-quality camera captures), with multi-engine OCR consensus, fuzzy matching, and structured extraction - check out the full pipeline at lucidRAG. That's where the production-grade, multi-phase version of this work lives.
Florence-2 is the current ceiling for local vision in this package. The next logical step is a multimodal LLM - a model that can see an image and reason about it in natural language. Instead of separate OCR + NER steps, you'd send the image directly and ask for structured extraction.
Here's roughly what that API could look like:
// Hypothetical future IMultimodalService
public interface IMultimodalService
{
Task<StructuredExtractionResult> ExtractAsync(
string imagePath,
string prompt = "Extract all people, organizations, and locations from this image. Return as JSON.",
CancellationToken ct = default);
}
// Usage
var multimodal = serviceProvider.GetRequiredService<IMultimodalService>();
var result = await multimodal.ExtractAsync("business-card.jpg");
// result.Entities: [{ "John Smith", PER }, { "Acme Corp", ORG }, { "New York", LOC }]
// result.RawText: "John Smith, VP Engineering, Acme Corp, New York, NY 10001"
// result.Summary: "Business card for John Smith at Acme Corp in New York"
Small local multimodal models (like Phi-3.5-vision or LLaVA) are getting good enough for this. The trade-off is always the same: bigger model = smarter but slower. The right choice depends on your latency budget and accuracy requirements.
flowchart LR
subgraph Staged["Staged Approach: Pick Your Level"]
T1["Tesseract OCR<br>4MB | ~100ms<br>Text extraction"]
T2["BERT NER<br>430MB | ~50ms<br>Entity extraction"]
T3["Florence-2<br>450MB | ~1-3s<br>Image understanding"]
T4["Multimodal LLM<br>2-8GB | ~5-30s<br>Full reasoning"]
end
T1 --> T2
T2 --> T3
T3 -.->|"future"| T4
style T1 stroke:#090,stroke-width:2px
style T2 stroke:#090,stroke-width:2px
style T3 stroke:#f60,stroke-width:2px
style T4 stroke:#999,stroke-width:2px,stroke-dasharray: 5 5
Each tier adds capability at the cost of size and latency. The package currently covers tiers 1-3. Tier 4 is where multimodal LLMs come in - and where lucidRAG is heading.
**This Package:** Mostlylucid.OcrNer - `dotnet add package Mostlylucid.OcrNer`
**Part 1:** the manual pipeline - downloading models, writing the tokenizer, wiring up ONNX inference, and decoding BIO tags by hand
**Dependencies:** ONNX Runtime, Tesseract, ImageSharp, OpenCV, Microsoft.Recognizers.Text, Spectre.Console
**Related Articles:** lucidRAG - the production-grade, multi-phase version of this pipeline