Sunday, 18 January 2026
LLMs (Large Language Models) are being used as sensors. That is a category error: using a probabilistic synthesizer where a deterministic boundary device is required.
This isn't developers' fault. Most mainstream examples and tutorials lead with the simplest demo: "just send it to the model." That makes onboarding easy, but it blurs a crucial boundary: perception vs synthesis. Industry incentives don't help either: token-priced systems naturally reward pipelines that do more work inside the LLM.
This article is about Reduced RAG - an architectural pattern where deterministic pipelines feed probabilistic components, not the other way around. In a proper RAG system, deterministic extraction produces the facts and the LLM only synthesizes over them; it never acts as the sensor.
Example: In the OCR (Optical Character Recognition) pipeline, the Vision LLM is tier 3, not tier 1. It only runs after text-likeliness heuristics and local OCR fail. That is not an optimization. It is a boundary rule - an architectural constraint that prevents inappropriate tool usage regardless of perceived convenience.
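A minimal sketch of that boundary rule, assuming hypothetical `text_likeliness`, `run_local_ocr`, and `run_vision_llm` helpers; the thresholds are illustrative, not the pipeline's actual values:

```python
# Tier gating: the Vision LLM runs only after the cheaper, deterministic
# tiers have failed. Helpers and thresholds are hypothetical stand-ins.

TEXT_LIKELINESS_MIN = 0.35   # illustrative threshold
OCR_CONFIDENCE_MIN = 0.80    # illustrative threshold

def extract_text(frame):
    # Tier 1: cheap heuristic. If the frame doesn't look like it
    # contains text, emit nothing - not a guess.
    if text_likeliness(frame) < TEXT_LIKELINESS_MIN:
        return []

    # Tier 2: local OCR. Deterministic, returns boxes + confidence.
    regions = run_local_ocr(frame)
    if regions and min(r["confidence"] for r in regions) >= OCR_CONFIDENCE_MIN:
        return regions

    # Tier 3: Vision LLM - only because tiers 1 and 2 failed.
    return run_vision_llm(frame)
```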
This is Part 1 of "LLMs as Components". Part 2: MCP Is a Transport, Not an Architecture covers the category error of using MCP as architecture.
The same architectural failure shows up across different domains:
flowchart TD
subgraph Wrong["❌ Common Anti-Pattern"]
Raw[Raw Data<br/>Pixels, Waveforms, Frames] -->|Direct feed| LLM1[Vision/Audio LLM<br/>$$$, variance, hallucination]
LLM1 --> Unreliable[Unreliable Output<br/>High cost, non-deterministic]
end
style Wrong fill:none,stroke:#dc2626,stroke-width:3px
style Raw fill:none,stroke:#6b7280,stroke-width:2px
style LLM1 fill:none,stroke:#dc2626,stroke-width:2px
style Unreliable fill:none,stroke:#dc2626,stroke-width:2px
These architectural mistakes produce predictable failures:
Hallucinated perception: LLMs must emit tokens. Under uncertainty they fill gaps with plausible completions. "Text detected" because it might be there, not because it is there. This is the core problem.
Non-deterministic failure: Temperature and token budget drive variance. Same input → different outputs. Debugging becomes statistical analysis.
Resource waste: Whether you're paying per token (API) or running local models (Ollama, llama.cpp), you're burning compute on the wrong task. Local LLMs don't cost per call, but they still produce unreliable sensor output.
The issue isn't cost - it's accuracy. A locally hosted LLM hallucinating OCR results is just as broken as an expensive API call doing the same thing. Deterministic sensors produce verifiable facts. LLMs produce plausible prose.
The category error exists because a sensor and a synthesizer are fundamentally different tools:
A sensor is a boundary device: it reduces an effectively infinite-dimensional world to a bounded, structured signal, attaches a confidence value, and can legitimately report that nothing was detected.
An LLM is a high-variance synthesizer: it expands structured input into unstructured, high-entropy prose, carries no native confidence, and has no "nothing" state - it must emit tokens.
flowchart LR
subgraph Sensor["Sensor (Boundary Device)"]
World[Physical World<br/>∞ dimensions] -->|Reduce| Signal[Structured Signal<br/>Bounded dimensions]
Signal -->|Confidence| Facts[Facts<br/>± certainty]
end
subgraph LLM["LLM (Synthesizer)"]
Input[Structured Input] -->|Synthesize| Prose[Unstructured Output<br/>High entropy]
Prose -->|No confidence| Tokens[Token stream<br/>No 'nothing' state]
end
style Sensor fill:none,stroke:#16a34a,stroke-width:2px
style LLM fill:none,stroke:#b45309,stroke-width:2px
style World fill:none,stroke:#6b7280,stroke-width:2px
style Signal fill:none,stroke:#2563eb,stroke-width:2px
style Facts fill:none,stroke:#059669,stroke-width:2px
style Input fill:none,stroke:#6b7280,stroke-width:2px
style Prose fill:none,stroke:#d97706,stroke-width:2px
style Tokens fill:none,stroke:#dc2626,stroke-width:2px
This is why StyloFlow treats signals as immutable facts, not prose. Models can propose. Deterministic policy decides what persists.
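A minimal sketch of what "signals as immutable facts" can look like - an assumption about the shape of such a record, not StyloFlow's actual implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # immutable: once created, a signal never changes
class Signal:
    kind: str           # e.g. "text_region"
    value: str          # e.g. the recognised text
    confidence: float   # explicit confidence from the sensor that produced it
    provenance: str     # frame id, bounding box, or waveform region it came from
```

Models can construct `Signal` instances all day; whether one persists is decided by a deterministic policy, sketched alongside the design rules further down.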
Brains do not start with reasoning. They start with constraint.
The retina is not the cortex. There is heavy preprocessing before anything looks like cognition: edge detection and contrast in the retina, orientation and motion in V1, object recognition in the inferotemporal cortex - only then does anything resembling reasoning happen.
When sensory constraints weaken, hallucinations rise. Low light, missing edges, ambiguous cues. This is not metaphor - it is the same failure mode as LLM hallucinations under uncertainty. Engineers already know this: when upstream SNR drops, downstream classifiers become unstable.
This problem-solution pattern appears across all animal perceptual systems, not just human vision:
Bat echolocation (auditory): cochlear filters extract Doppler shift, delay, and amplitude from the raw ultrasonic echo before the auditory cortex reasons about distance or texture.
Honeybee vision (motion detection): the lamina computes optical flow from the raw visual field, so the central brain works on motion vectors, never on pixels.
flowchart LR
subgraph Bat["Bat Echolocation"]
Echo[Ultrasonic Echo<br/>∞ waveform data] --> Cochlea[Cochlear Filters<br/>Doppler, delay, amplitude]
Cochlea --> BatBrain[Auditory Cortex<br/>Distance, texture facts]
end
subgraph Bee["Honeybee Vision"]
Motion[Visual Field<br/>Rapid motion] --> Lamina[Lamina<br/>Optical flow computation]
Lamina --> BeeBrain[Central Brain<br/>Motion vectors, not pixels]
end
style Bat fill:none,stroke:#7c3aed,stroke-width:2px
style Bee fill:none,stroke:#d97706,stroke-width:2px
style Echo fill:none,stroke:#6b7280,stroke-width:2px
style Cochlea fill:none,stroke:#a855f7,stroke-width:2px
style BatBrain fill:none,stroke:#6366f1,stroke-width:2px
style Motion fill:none,stroke:#6b7280,stroke-width:2px
style Lamina fill:none,stroke:#f59e0b,stroke-width:2px
style BeeBrain fill:none,stroke:#d97706,stroke-width:2px
The common pattern: massive raw input is reduced by peripheral, special-purpose circuitry into compact structured signals before any central reasoning happens.
This is not "inspiration from nature." It is convergent evolution of information processing under physical constraints (energy, latency, bandwidth). The same constraints apply to AI systems.
The engineering mapping is tight:
flowchart TD
subgraph Brain["Biological Vision Pipeline"]
Photons[Photons] --> Retina[Retina<br/>Edge detection, contrast]
Retina --> V1[V1 Cortex<br/>Orientation, motion]
V1 --> IT[Inferotemporal Cortex<br/>Object recognition]
IT --> PFC[Prefrontal Cortex<br/>Reasoning, synthesis]
end
subgraph Engineering["Engineering Vision Pipeline"]
Pixels[Raw Pixels] --> OpenCV[OpenCV + Heuristics<br/>Sharpness, text-likeliness]
OpenCV --> Local[Local Models<br/>Florence-2, EAST/CRAFT OCR]
Local --> Structured[Structured Signals<br/>Bounding boxes, confidence]
Structured --> LLM[LLM Synthesis<br/>Only when needed]
end
Brain -.->|Maps to| Engineering
style Brain fill:none,stroke:#7c3aed,stroke-width:2px
style Engineering fill:none,stroke:#2563eb,stroke-width:2px
style Photons fill:none,stroke:#6b7280,stroke-width:2px
style Retina fill:none,stroke:#16a34a,stroke-width:2px
style V1 fill:none,stroke:#059669,stroke-width:2px
style IT fill:none,stroke:#0891b2,stroke-width:2px
style PFC fill:none,stroke:#6366f1,stroke-width:2px
style Pixels fill:none,stroke:#6b7280,stroke-width:2px
style OpenCV fill:none,stroke:#16a34a,stroke-width:2px
style Local fill:none,stroke:#059669,stroke-width:2px
style Structured fill:none,stroke:#0891b2,stroke-width:2px
style LLM fill:none,stroke:#6366f1,stroke-width:2px
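The first engineering tier is plain OpenCV. A minimal sketch, assuming `cv2` and NumPy are installed; the sharpness and text-likeliness measures are common heuristics, not necessarily the exact ones this pipeline uses:

```python
import cv2
import numpy as np

def sharpness(gray: np.ndarray) -> float:
    # Variance of the Laplacian: a standard blur/sharpness measure.
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

def text_likeliness(gray: np.ndarray) -> float:
    # Crude proxy: text-bearing regions are dense in strong edges.
    edges = cv2.Canny(gray, 100, 200)
    return float(np.count_nonzero(edges)) / edges.size

# Illustrative gate: only frames that pass both checks reach local OCR.
frame = cv2.imread("frame_0042.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
if frame is not None and sharpness(frame) > 100.0 and text_likeliness(frame) > 0.05:
    pass  # run local OCR (tier 2) on this frame
```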
Brains do not "understand" pixels. They never see them.
The cortex operates on signals that have been reduced, filtered, and structured by preceding layers. This is not a limitation - it is what makes intelligence tractable.
This is not a philosophical take. It is system design:
Escalation must be gated: LLMs are tier 3, not tier 1. Only invoke when cheaper sensors fail.
Confidence thresholds must be explicit: "Text detected with 0.92 confidence" is a fact. "There might be text" is not.
Routing must be deterministic: Same signals → same path. No prompt variance, no temperature effects.
Token economics are a constraint, not a cost tweak: If your pipeline's cost scales with raw data size, you are using the wrong tool.
Facts need provenance: If you can't point to the bounding box, frame, or waveform region, it isn't a fact.
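These rules are enforceable in a few lines. A sketch building on the `Signal` record from earlier; the thresholds and route names are illustrative:

```python
CONFIDENCE_MIN = 0.90  # explicit threshold, illustrative value

def persist(signal: Signal) -> bool:
    # No provenance, no fact; explicit confidence gate.
    return bool(signal.provenance) and signal.confidence >= CONFIDENCE_MIN

def route(signals: list[Signal]) -> str:
    # Deterministic routing: same signals, same path. No prompts, no temperature.
    facts = [s for s in signals if persist(s)]
    if not facts:
        return "discard"        # "nothing detected" is a valid outcome
    if all(f.kind == "text_region" for f in facts):
        return "local_summary"  # deterministic path, no LLM needed
    return "llm_synthesis"      # gated escalation happens only past this point
```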
Proof: The filmstrip optimization
In the VideoSummarizer pipeline, text-only filmstrips reduce token cost by ~30x while improving OCR fidelity. The LLM sees the signal (extracted text regions), not the scene (full RGB frames).
This is not an optimization. It is a category correction.
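In code, the correction is simply which representation reaches the model: the prompt is built from extracted text facts, never from raw frames. A sketch under the same `Signal` assumption; the formatting is illustrative:

```python
def filmstrip_prompt(facts: list[Signal]) -> str:
    # The LLM sees the signal (extracted text + provenance),
    # not the scene (full RGB frames).
    lines = [f"[{f.provenance}] {f.value} (conf {f.confidence:.2f})"
             for f in facts if f.kind == "text_region"]
    return "On-screen text, in order of appearance:\n" + "\n".join(lines)
```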
This is the Reduced RAG architectural pattern. The core principle: deterministic pipelines feed probabilistic components, never the reverse.
Reduced RAG is Map-Reduce applied to probabilistic systems:
Traditional RAG gets this backwards - it retrieves documents and hopes the LLM extracts facts. Reduced RAG extracts facts first (map), then lets the LLM synthesize (reduce).
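In code the split is literal: a deterministic, parallelizable map over raw data, then a single probabilistic reduce over retrieved facts. A sketch assuming hypothetical `extract_facts`, `retrieve`, and `llm_synthesize` stages:

```python
from concurrent.futures import ProcessPoolExecutor

def map_phase(batches) -> list[Signal]:
    # MAP: deterministic sensors + local models; safe to run in parallel
    # because the same batch always yields the same facts.
    with ProcessPoolExecutor() as pool:
        results = pool.map(extract_facts, batches)
    return [fact for batch in results for fact in batch]

def reduce_phase(facts: list[Signal], query: str) -> str:
    # REDUCE: the only probabilistic step, run over a filtered slice
    # of verifiable facts rather than raw documents.
    relevant = retrieve(facts, query)
    return llm_synthesize(query, relevant)
```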
The pattern repeats across all multimodal systems:
flowchart TD
subgraph Map["MAP PHASE (Parallel, Deterministic)"]
Raw[Raw Data<br/>10,000 frames] --> Split{Split}
Split --> S1[Sensors<br/>Frame 1-1000]
Split --> S2[Sensors<br/>Frame 1001-2000]
Split --> S3[Sensors<br/>Frame 2001-3000]
Split --> SDots[...]
S1 --> L1[Local Models<br/>Batch 1]
S2 --> L2[Local Models<br/>Batch 2]
S3 --> L3[Local Models<br/>Batch 3]
SDots --> LDots[...]
L1 --> F1[Facts: 120]
L2 --> F2[Facts: 98]
L3 --> F3[Facts: 156]
LDots --> FDots[...]
F1 --> Collect[Collect Facts]
F2 --> Collect
F3 --> Collect
FDots --> Collect
Collect --> Facts[(Facts Database<br/>500 total facts)]
end
subgraph Reduce["REDUCE PHASE (Sequential, Probabilistic)"]
Query[User Query] --> Retrieve[Retrieve Relevant Facts<br/>Filter: 50 facts]
Facts --> Retrieve
Retrieve --> LLM[LLM Synthesis<br/>Reason over 50 facts]
LLM --> Answer[Grounded Answer]
end
style Map fill:none,stroke:#16a34a,stroke-width:3px
style Reduce fill:none,stroke:#6366f1,stroke-width:3px
style Raw fill:none,stroke:#6b7280,stroke-width:2px
style Split fill:none,stroke:#16a34a,stroke-width:2px
style S1 fill:none,stroke:#16a34a,stroke-width:2px
style S2 fill:none,stroke:#16a34a,stroke-width:2px
style S3 fill:none,stroke:#16a34a,stroke-width:2px
style SDots fill:none,stroke:#16a34a,stroke-width:1px,stroke-dasharray: 5 5
style L1 fill:none,stroke:#059669,stroke-width:2px
style L2 fill:none,stroke:#059669,stroke-width:2px
style L3 fill:none,stroke:#059669,stroke-width:2px
style LDots fill:none,stroke:#059669,stroke-width:1px,stroke-dasharray: 5 5
style F1 fill:none,stroke:#0891b2,stroke-width:2px
style F2 fill:none,stroke:#0891b2,stroke-width:2px
style F3 fill:none,stroke:#0891b2,stroke-width:2px
style FDots fill:none,stroke:#0891b2,stroke-width:1px,stroke-dasharray: 5 5
style Collect fill:none,stroke:#16a34a,stroke-width:2px
style Facts fill:none,stroke:#0891b2,stroke-width:3px
style Query fill:none,stroke:#6b7280,stroke-width:2px
style Retrieve fill:none,stroke:#7c3aed,stroke-width:2px
style LLM fill:none,stroke:#6366f1,stroke-width:2px
style Answer fill:none,stroke:#16a34a,stroke-width:2px
Document-first RAG vs. Reduced RAG:
| Aspect | Document-first RAG | Reduced RAG (Map-Reduce) |
|---|---|---|
| Pattern | Retrieve → Extract → Synthesize | Map (Extract) → Reduce (Synthesize) |
| Extraction accuracy | LLM hallucinations possible | Deterministic, verifiable |
| Stored data | Documents/chunks | Structured facts |
| LLM role | Two tasks: extract + synthesize | One task: synthesize only |
| Debuggability | Inspect prompt traces | Inspect fact database |
| Scalability | Sequential LLM bottleneck | Distributed map, centralized reduce |
Three production systems implement this pattern: the OCR pipeline, VideoSummarizer, and StyloFlow.
Here are real numbers from VideoSummarizer on a 10-minute video (600 seconds, 30fps = 18,000 frames):
| Approach | Accuracy | Determinism | Tokens |
|---|---|---|---|
| Frame-by-frame LLM | Hallucinations, variance | Non-deterministic | ~27M |
| Shots → Keyframes → LLM | Accurate | Deterministic extraction | ~150K |
| Filmstrip text extraction | Best OCR fidelity | Fully deterministic | ~5K |
The right architecture is more accurate. It also happens to be 180x cheaper - but that's a side effect of doing the right thing, not the goal. Even with free local LLMs, the frame-by-frame approach would still be wrong because it produces unreliable output.
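Back-of-envelope arithmetic using only the numbers above confirms the ratios:

```python
frames = 10 * 60 * 30               # 10 minutes at 30 fps = 18,000 frames
frame_by_frame = 27_000_000         # ~tokens: every frame through the LLM
keyframes = 150_000                 # ~tokens: shots -> keyframes -> LLM
filmstrip = 5_000                   # ~tokens: text-only filmstrips

print(frame_by_frame / keyframes)   # 180.0 -> the "180x cheaper" figure
print(keyframes / filmstrip)        # 30.0  -> the ~30x filmstrip saving
print(frame_by_frame / frames)      # ~1,500 tokens per frame, implied
```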
If your AI system starts with an LLM, you have already lost control of it.
Intelligence does not start with reasoning. It starts with constraint.
Sensors reduce uncertainty. Synthesizers expand meaning. Confusing the two breaks systems.
Make synthesis the last step.
Next in series: Part 2: MCP Is a Transport, Not an Architecture