Sunday, 28 December 2025
Small and local LLMs are often framed as the cheap alternative to frontier models. That framing is wrong. They are not a degraded version of the same thing. They are a different architectural choice, selected for control, predictability, and survivable failure modes.
I'm as guilty as anyone of pushing the "they're free" narrative, as if that were the only deciding factor. But, as with choosing a database or hosting platform for a system, you need to understand what trade-offs you are making.
Using a small model via Ollama, LM Studio, ONNX Runtime, or similar is not (just) about saving money. It is about choosing where non-determinism is allowed to exist.
Large frontier models are broader and more fluent. They compress more of human-expressed knowledge, span more domains, and produce more convincing reasoning traces. That also makes them more dangerous in systems that require guarantees.
Frontier models make sense when breadth is required and outputs are advisory by design - creative drafting, open-ended exploration, or synthesis across unfamiliar domains. But that's not most production systems.
Their failures are semantic rather than structural: they generate valid-looking outputs that are wrong in subtle ways. This is the category error - treating a probabilistic component as if it were a system boundary. Those failures are:

- Quiet: plausible-sounding output that passes superficial checks
- Confident: hallucinations delivered with complete assurance
- Late: drift that only surfaces when a customer complains or an audit fails
Small models fail differently.
When a small model is confused, it tends to:

- Emit invalid JSON
- Truncate its output
- Violate the schema outright
These are cheap failures. They are detectable with simple validation. They trigger retries or fallbacks immediately. They do not silently advance state.
This is not a weakness. It is a feature.
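To make that concrete, here's a minimal sketch of what catching a loud failure looks like - the `Classification` shape is illustrative, not from any of my repos. Parse the output into the shape you expect; the moment it doesn't fit, you know:

```csharp
using System.Text.Json;

public record Classification(string Label, double Confidence);

public static class CheapFailures
{
    // Try to parse the model's raw output into the expected shape.
    // Invalid JSON or a missing field fails here, loudly, before any
    // state is touched - which is exactly what makes the failure cheap.
    public static bool TryParse(string raw, out Classification? result)
    {
        result = null;
        try
        {
            result = JsonSerializer.Deserialize<Classification>(raw,
                new JsonSerializerOptions { PropertyNameCaseInsensitive = true });
            return result is not null && !string.IsNullOrWhiteSpace(result.Label);
        }
        catch (JsonException)
        {
            return false; // truncated or malformed output: retry or fall back
        }
    }
}
```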
This insight isn't abstract theory - it's the foundation of the Ten Commandments of LLM Use. The core principle:
LLMs interpret reality. They must never be allowed to define it.
When you follow this principle, you discover something surprising: you stop needing expensive models. A 7B parameter model running locally can classify, summarise, and generate hypotheses just fine - because the deterministic systems around it handle everything that actually needs to be correct.
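As a sketch of that division of labour - assuming an Ollama instance on its default port, and with a hypothetical label set - the model proposes a label and deterministic code decides whether to accept it:

```csharp
using System.Linq;
using System.Net.Http;
using System.Net.Http.Json;
using System.Text.Json;
using System.Threading.Tasks;

public static class LocalClassifier
{
    private static readonly HttpClient Http = new();

    // Hypothetical label set - the deterministic system defines what is valid.
    private static readonly string[] AllowedLabels = { "bug", "feature", "question" };

    public static async Task<string> ClassifyAsync(string text)
    {
        // One call to a small local model via Ollama's /api/generate endpoint.
        var response = await Http.PostAsJsonAsync("http://localhost:11434/api/generate", new
        {
            model = "llama3.2",
            prompt = $"Classify as one of: {string.Join(", ", AllowedLabels)}.\n" +
                     $"Answer with the label only.\n---\n{text}",
            stream = false
        });

        var body = await response.Content.ReadFromJsonAsync<JsonElement>();
        var label = body.GetProperty("response").GetString()?.Trim().ToLowerInvariant() ?? "";

        // The code, not the model, decides what counts as a valid answer.
        return AllowedLabels.Contains(label) ? label : "unknown";
    }
}
```

The design choice that matters: an out-of-set answer degrades to "unknown" rather than silently entering the system.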
Small models are not "weak" - they are often sufficient because the problem has already been reduced by the time it reaches them.
The frontier models are selling you reliability you should be building yourself.
Just as DuckDB is not "cheap SQL" and Postgres is not "worse Azure SQL", small LLMs occupy a different point in the design space. You choose them when these concerns dominate:
| Concern | Small Model Advantage |
|---|---|
| Locality | Runs on your hardware, your network, your jurisdiction |
| Auditability | Every inference is logged, reproducible, inspectable |
| Blast radius | Failures are contained, not propagated through API chains |
| Correctness enforcement | Validation happens outside the model |
| Bounded non-determinism | Uncertainty is tightly constrained |
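The auditability row deserves emphasis. Because inference runs on your hardware, you can keep a complete record of every call. A minimal sketch of what I mean - the record shape here is my own, purely illustrative:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public record InferenceAudit(
    DateTimeOffset Timestamp,
    string Model,           // pinned model name/version
    string PromptHash,      // SHA-256 of the exact prompt sent
    string RawResponse,     // complete output, stored before any post-processing
    bool PassedValidation);

public static class Audit
{
    public static string HashPrompt(string prompt) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(prompt)));
}
```

Try doing that with a metered API where the provider can swap the model underneath you.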
This isn't hypothetical. My projects demonstrate this pattern repeatedly:
My GraphRAG implementation offers three modes:
| Mode | LLM Calls | Best For |
|---|---|---|
| Heuristic | 0 per chunk | Pure determinism via IDF + structure |
| Hybrid | 1 per document | Small model validates candidates |
| LLM | 2 per chunk | Maximum quality when needed |
The hybrid mode is the sweet spot: heuristic extraction finds candidates (deterministic), then a small local model validates and enriches them. One LLM call per document, not per chunk.
With Ollama running locally, the cost is $0. But that's not why I use it - cost savings are a side-effect of correct abstraction, not the goal. I use it because the failures are cheap and obvious.
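A compressed sketch of the hybrid shape - the regex here is a stand-in for the real IDF-and-structure heuristics, and the prompt is simplified:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class HybridExtraction
{
    // Deterministic pass: a cheap heuristic finds candidate entities.
    // A capitalised-phrase regex stands in for the real pipeline's
    // IDF scoring and document structure analysis.
    public static IReadOnlyList<string> FindCandidates(string document) =>
        Regex.Matches(document, @"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b")
             .Select(m => m.Value)
             .Distinct()
             .ToList();

    // Probabilistic pass: ONE call per document asks the small model to
    // accept or reject the candidates - not to invent its own.
    public static string BuildValidationPrompt(string document, IEnumerable<string> candidates) =>
        $"Which of these candidates are real entities in the document below? " +
        $"Candidates: {string.Join(", ", candidates)}\n---\n{document}";
}
```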
Semantic search with ONNX and Qdrant shows another pattern: some tasks don't need an LLM at all. BERT embeddings via ONNX Runtime give you:

- Deterministic vectors: the same input yields the same embedding
- Fast, local inference with no network round-trip
- No per-token cost and no rate limits
For hybrid search, I combine these embeddings with BM25 scoring. The LLM only appears at synthesis time - and even then, a small local model works fine because it's explaining structure that deterministic systems have already validated.
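The fusion step itself is plain arithmetic with no model involved. A sketch, where both scores are assumed normalised to [0, 1] and the 0.5 weight is illustrative (reciprocal-rank fusion is another common choice):

```csharp
public static class HybridScore
{
    // Blend a lexical BM25 score with embedding cosine similarity.
    // alpha sets the balance between lexical and semantic relevance.
    public static double Combine(double bm25Normalised, double cosineSimilarity, double alpha = 0.5) =>
        alpha * bm25Normalised + (1 - alpha) * cosineSimilarity;
}
```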
DocSummarizer embodies this philosophy: the LLM is the last step, working on pre-validated, pre-structured content. It can fail - and when it does, the failure is obvious because the structure is already correct.
Frontier models are powerful tools when used deliberately. But they increase expressive power faster than they reduce risk. Small models, when embedded inside deterministic systems, give you just enough uncertainty to explore - without obscuring truth or responsibility.
The right question is not "which model is best?"

It is: what, exactly, is the model being allowed to decide?

If the answer involves state, side effects, money, policy, or guarantees - the model should never be in charge. And if the model is only there to classify, summarise, rank, or propose hypotheses, a small local model is often the correct choice, not merely the economical one.
This is the architecture that works:
```
┌─────────────────────────────────────────────────────┐
│                 DETERMINISTIC LAYER                  │
│     State machines, queues, validation, storage      │
│       (DuckDB, Postgres, Redis, file systems)        │
└─────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│                   INTERFACE LAYER                    │
│        Schema validation, retries, fallbacks         │
│       (Polly, FluentValidation, custom guards)       │
└─────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│                 PROBABILISTIC LAYER                  │
│    Classification, summarisation, hypothesis gen     │
│          (Ollama, ONNX, small local models)          │
└─────────────────────────────────────────────────────┘
```
The LLM is at the bottom, not the top. It proposes; the deterministic layers dispose.
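A sketch of the interface layer's job, using the libraries named in the diagram - the `Summary` type and its limits are hypothetical:

```csharp
using System;
using System.Threading.Tasks;
using FluentValidation;
using Polly;

public record Summary(string Text, string[] KeyPoints);

// FluentValidation describes what a structurally acceptable output looks like.
public class SummaryValidator : AbstractValidator<Summary>
{
    public SummaryValidator()
    {
        RuleFor(s => s.Text).NotEmpty().MaximumLength(2000);
        RuleFor(s => s.KeyPoints).NotEmpty();
    }
}

public static class InterfaceLayer
{
    // Polly owns the retry policy; a structurally invalid result or a
    // thrown exception just means the model gets asked again.
    public static async Task<Summary> GetValidatedSummaryAsync(Func<Task<Summary>> callModel)
    {
        var validator = new SummaryValidator();
        return await Policy
            .HandleResult<Summary>(s => !validator.Validate(s).IsValid)
            .Or<Exception>()
            .RetryAsync(3)
            .ExecuteAsync(callModel);
    }
}
```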
All three perspectives - the questions, the pattern, and this final principle - reduce to the same rule:
Reliability is about choosing failures you can survive.
With LLMs, that means managing non-determinism through deterministic practices:

- Validate every output against a schema before acting on it
- Retry or fall back the moment validation fails
- Keep state, side effects, and guarantees in deterministic systems the model cannot touch
Small models make this easier because their failures are loud. Invalid JSON. Truncated output. Schema violations. These are gifts - they tell you immediately that something went wrong.
Frontier model failures are quiet. Plausible-sounding nonsense. Confident hallucinations. Semantic drift that only becomes visible when a customer complains or an audit fails.
I'll take loud failures every time.