Most software still assumes the world will hand it clean inputs and stable rules. Production usually does neither.
This post is the through-line behind a bunch of things I've been building: DiSE, Constrained Fuzziness, CFMoM, Reduced RAG, StyloFlow, The Ten Commandments of AI Engineering, and Stylobot.
I didn't get here by starting with a grand theory. I got here by poking at tools I found interesting and trying to work out what they were really doing. Summarisation. Code generation. Retrieval. Extraction. Ranking. Once you stop listening to the marketing layer, most of the useful ones are not doing one magical thing. They are collecting partial evidence, constraining it, and only then turning it into an answer or an action.
After building enough systems like that, I stopped thinking of them as separate tricks and started seeing the same architecture underneath:
That is the core of what I mean by a behavioural inference system.
It also explains why these architectures work well with code LLMs. Not because the LLM is "the intelligence", but because the system is structured enough to inspect and easy enough to tune.
Previously in this series:
Most production systems still reach for one of two bad defaults: an ever-growing pile of hand-written rules, or a model handed the whole job and trusted to behave.
Both can work for a while. Both break when the environment shifts.
What many real systems actually have is this:
flowchart LR
A[Messy Input] --> B[Partial Signals]
B --> C[Conflicting Evidence]
C --> D[Uncertain Interpretation]
D --> E[Need to Act Anyway]
style A stroke:#ef4444,stroke-width:2px
style E stroke:#22c55e,stroke-width:2px
Examples: bot and abuse detection, multimodal document understanding, retrieval and ranking, workflow orchestration.
In those domains, you rarely get one decisive fact. You get fragments, some useful, some noisy, some actively deceptive.
That is why I keep returning to signals, constraints, and observability.
If you want a system to improve, you need to see what it observed, what it believed, and why it acted. If that stays buried in tangled code paths, tuning becomes guesswork. If it is explicit, you can actually improve it on purpose.
The important move in DiSE was not "let an LLM change code." It was this:
treat architecture as something that can emerge under selection pressure, not something fully known upfront.
DiSE reframes software from "build, ship, patch" to "perceive, evaluate, mutate, select."
flowchart LR
subgraph Traditional["Traditional Software"]
T1[Build] --> T2[Ship] --> T3[Patch]
end
subgraph DiSE["DiSE"]
D1[Perceive] --> D2[Evaluate]
D2 --> D3[Mutate]
D3 --> D4[Select]
D4 --> D1
end
style Traditional stroke:#ef4444,stroke-width:2px
style DiSE stroke:#22c55e,stroke-width:2px
That matters because in many systems you do not know in advance which signals will matter, which thresholds will hold, or which components will earn their keep.
So the system needs room to explore.
But exploration alone is not enough. You can mutate your way into nonsense as easily as into something useful.
Constrained Fuzziness is the control layer that stops the whole thing turning to mush.
The rule is simple:
probabilistic components may propose; deterministic systems decide.
flowchart TB
I[Input] --> S[Deterministic Substrate]
S --> P[Fuzzy Proposer]
P --> C{Constrainer}
C -->|Pass| O[Output]
C -->|Partial| R[Rewrite / Hedge]
C -->|Fail| F[Fallback]
S -.evidence.-> C
style S stroke:#22c55e,stroke-width:3px
style P stroke:#f59e0b,stroke-width:3px
style C stroke:#ef4444,stroke-width:3px
DiSE says "explore." Constrained Fuzziness says "inside these boundaries."
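A minimal sketch of that boundary, with illustrative names and outcomes (nothing here is a fixed API): the fuzzy component proposes, and a deterministic constrainer checks its claims against evidence before anything passes through.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    text: str
    claims: list  # claims the fuzzy proposer is making about the world

def constrain(proposal: Proposal, evidence: set) -> tuple[str, str]:
    """Deterministic gate: every claim is checked against known evidence."""
    supported = [c for c in proposal.claims if c in evidence]
    if len(supported) == len(proposal.claims):
        return ("pass", proposal.text)
    if supported:
        # Some claims hold: rewrite into a hedged form rather than assert.
        return ("partial", f"Possibly: {proposal.text}")
    return ("fail", "fallback: insufficient evidence")

evidence = {"ip_flagged", "headless_ua"}
verdict, output = constrain(
    Proposal("client is a bot", ["ip_flagged", "headless_ua"]), evidence
)
# verdict == "pass": every claim was backed by deterministic evidence
```

The proposer never touches the output channel directly; the only paths out are the three deterministic verdicts.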
Without those boundaries, probabilistic systems do what they always do: drift, overreach, and assert with confidence they have not earned.
That is why The Ten Commandments of AI Engineering matter. "LLMs shall not own state", "LLMs shall not be sole cause of side-effects", and "Never ask an LLM to decide a derivable boolean" are not style notes. They are operating rules for systems that need to survive production.
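The derivable-boolean rule in particular is easy to make concrete. A minimal sketch, assuming a hypothetical quota check:

```python
# "Never ask an LLM to decide a derivable boolean" in practice: if the
# answer follows from data the system already holds, compute it.

def is_over_quota(requests_in_window: int, limit: int) -> bool:
    # Derivable from state the system owns; no model call involved.
    return requests_in_window > limit
```

The model may propose ("this client looks abusive"); whether the client exceeded its quota is never its call.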
The same pattern shows up everywhere:
flowchart LR
A[DiSE<br/>Search and Selection] --> B[Constrained Fuzziness<br/>Bounded Proposal]
B --> C[Behavioural Inference<br/>Evidence Over Time]
style A stroke:#3b82f6,stroke-width:2px
style B stroke:#f59e0b,stroke-width:2px
style C stroke:#22c55e,stroke-width:2px
Put those two ideas together and you get a practical pattern: infer from weak evidence, but make action legible and controlled.
This is where Reduced RAG, StyloFlow, and the signal-contract work in CFMoM all line up.
Once you stop pretending that one model or one rules engine should do everything, the useful design primitive becomes the signal.
A good signal is typed, cheap to evaluate, carries its own confidence, and points back at the evidence that produced it.
Most importantly, a signal is compressed behaviour. It is not the whole world. It is the part you can preserve, compare, and act on later.
flowchart LR
R[Raw Reality] --> X[Extraction]
X --> S1[Signal]
X --> S2[Evidence Pointer]
X --> S3[Confidence]
S1 --> A[Accumulation]
S2 --> A
S3 --> A
A --> I[Inference]
style R stroke:#64748b,stroke-width:2px
style A stroke:#3b82f6,stroke-width:2px
style I stroke:#22c55e,stroke-width:2px
Different domains emit different signals: request and timing signals in traffic analysis, structure and OCR-confidence signals in document pipelines, relevance signals in retrieval and ranking.
Signals let you move from "the model thinks" to "the system has evidence."
That is the move behind Reduced RAG: extract signals instead of stuffing larger context windows. It is also the move in StyloFlow: coordinate around emitted facts, not opaque component calls.
Once signals are explicit, you can ask better engineering questions. Which signals carried the decision? Where is coverage thin? Which thresholds are actually doing work?
That is the difference between shipping a feature and tuning a machine.
Traditional systems often look like this:
Rules -> Decisions
Behavioural inference systems look more like this:
Signals -> Evidence accumulation -> Behaviour inference -> Deterministic action
flowchart TD
subgraph Old["Old Shape"]
O1[Rules] --> O2[Decision]
end
subgraph New["Behavioural Inference Shape"]
N1[Signals]
N2[Evidence Accumulation]
N3[Inference]
N4[Policy Action]
N1 --> N2 --> N3 --> N4
end
style Old stroke:#ef4444,stroke-width:2px
style New stroke:#22c55e,stroke-width:2px
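The new shape fits in a few lines. Everything below is illustrative: toy signals, a toy averaging inference, an assumed threshold.

```python
from collections import defaultdict

def accumulate(signals):
    """Signals -> evidence: group confidence-weighted observations by name."""
    evidence = defaultdict(list)
    for name, value, confidence in signals:
        evidence[name].append(value * confidence)
    return evidence

def infer(evidence):
    """Evidence -> behaviour score. Toy inference: a plain weighted average."""
    flat = [v for values in evidence.values() for v in values]
    return sum(flat) / len(flat) if flat else 0.0

def act(score, threshold=0.5):
    """Score -> deterministic policy action. The threshold is tunable state."""
    return "escalate" if score >= threshold else "allow"

signals = [("rate_burst", 1.0, 0.8), ("headless_ua", 1.0, 0.9)]
action = act(infer(accumulate(signals)))
```

Each stage is separately inspectable, which is the whole point: you can log the evidence, the score, and the action independently.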
What do these systems infer? Behaviour classes, risk, likely intent. Usually without ever getting a single perfect fact.
Behaviour is often easier to infer than identity. That matters in privacy-preserving systems and adversarial ones. You may not know exactly who something is, but you can often tell what kind of behaviour pattern it belongs to.
That is enough to route, throttle, challenge, cluster, prioritize, or escalate.
It also makes these systems a good fit for code LLMs. They do best when the system gives them clear contracts, typed signals, and observable outcomes to iterate against.
A behavioural inference system exposes those things naturally.
Stylobot Part 2 is probably the clearest concrete example so far.
Stylobot is not just a pile of detectors. It is a behavioural inference stack.
flowchart LR
R[Request] --> D[Detector Signals]
D --> E[Evidence Aggregation]
E --> T[Signature + Temporal Context]
T --> I[Behaviour Inference]
I --> P[Probability + Confidence + Risk]
P --> A[Policy Action]
A --> F[Response Feedback]
F --> D
style D stroke:#3b82f6,stroke-width:2px
style T stroke:#8b5cf6,stroke-width:2px
style P stroke:#f59e0b,stroke-width:2px
style A stroke:#22c55e,stroke-width:2px
A few things in that pipeline come directly from the earlier work.
No single detector is assumed to be enough. You have a population of specialised contributors emitting evidence, and over time you learn which ones actually help.
That is not full autonomous evolution, but it is the same instinct: architecture gets refined under pressure.
Stylobot keeps probability and confidence separate, but action is deterministic:
Allow. Throttle. Challenge. Block.

Evidence can be fuzzy. Control surfaces cannot.
Instead of reducing a visitor to one IP or one user-agent, Stylobot builds a multi-vector signature and reasons across time.
That is no longer simple classification. It is memory about behaviour.
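A minimal sketch of that memory, assuming a hypothetical multi-vector key and a simple rolling average (the real system's smoothing is its own business):

```python
from collections import deque

class SignatureMemory:
    """Keeps recent per-signature scores so inference sees a pattern,
    not a one-shot label."""

    def __init__(self, window=5):
        self.history = {}  # signature -> recent signal scores
        self.window = window

    def observe(self, signature: tuple, score: float) -> float:
        buf = self.history.setdefault(signature, deque(maxlen=self.window))
        buf.append(score)
        return sum(buf) / len(buf)  # smoothed behaviour over the window

mem = SignatureMemory()
# Multi-vector key: no single field (IP, UA) is trusted on its own.
sig = ("203.0.113.7", "HeadlessChrome", "tls:abc")
for s in (0.2, 0.9, 0.9):
    behaviour = mem.observe(sig, s)
# behaviour reflects the recent pattern, not the last request alone
```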
High probability with low confidence should not trigger the same response as high probability with high confidence.
The system keeps ambiguity intact until it has enough evidence to justify stronger action.
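That separation makes the action surface easy to state as a deterministic table. The thresholds below are illustrative, not Stylobot's:

```python
def policy(probability: float, confidence: float) -> str:
    """Deterministic action surface over (probability, confidence)."""
    if confidence < 0.5:
        # Not enough evidence to justify strong action yet:
        # at most a challenge, never a block.
        return "challenge" if probability >= 0.8 else "allow"
    if probability >= 0.9:
        return "block"
    if probability >= 0.6:
        return "throttle"
    return "allow"

# The same probability maps to different actions depending on confidence:
# policy(0.95, 0.3) -> "challenge", policy(0.95, 0.9) -> "block"
```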
Stylobot is designed so that you can inspect nearly every meaningful part of the decision path:
That makes it a tuneable engine rather than a black box.
flowchart TD
S1[Observable Signals] --> S2[Compare Outcomes]
S2 --> S3[Tune Thresholds / Weights / Waves]
S3 --> S4[Re-run on Traffic]
S4 --> S5[Observe Drift / Improvement]
S5 --> S1
style S1 stroke:#3b82f6,stroke-width:2px
style S3 stroke:#f59e0b,stroke-width:2px
style S5 stroke:#22c55e,stroke-width:2px
That loop is exactly where code LLMs help. Not by replacing the engine, but by accelerating changes to the engine: drafting new detector candidates, wiring evaluation harnesses, adjusting thresholds, weights, and wave configurations.
That only works because the architecture is observable enough to support tuning in the first place.
The useful shift is not "LLMs can write software now." That line got boring almost immediately.
What matters is that code LLMs make exploration cheaper.
RAG was published in May 2020. Retrieval, embedding search, signal extraction, and evidence packs are not new ideas. What changed is the cost of iterating on them. It used to be expensive to sketch twenty detector candidates, wire evaluation harnesses, inspect signal coverage, and tune thresholds. Most teams would build one design, ship it, and then live with whatever corners they had cut.
Code LLMs changed the economics of that loop.
They help you prototype more candidate signals and more evaluation passes than you could otherwise justify.
flowchart LR
A[Human Hypothesis] --> B[Code LLM Acceleration]
B --> C[More Candidate Signals]
C --> D[More Evaluation]
D --> E[Better Selection Pressure]
E --> F[Stronger Inference System]
style B stroke:#8b5cf6,stroke-width:2px
style F stroke:#22c55e,stroke-width:2px
The LLM does not need to be the decider to be strategically useful. It can just make design-space exploration much cheaper.
But the earlier rules still apply: the LLM does not own state, it is not the sole cause of side-effects, and it never decides a derivable boolean.
So yes, code LLMs matter. They matter because they speed up search and tuning, not because they remove the need for architecture.
The same shape keeps showing up.
In Reduced RAG, you extract deterministic signals at ingestion, store evidence separately, and let the LLM synthesize from a bounded evidence pack.
Not "give the model everything and hope." Extract first, constrain the surface, and synthesize from evidence.
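A toy version of the bounding step, with invented snippets and a hypothetical budget:

```python
def build_evidence_pack(candidates, budget=3):
    """candidates: (snippet, relevance, source) triples produced by
    deterministic extraction at ingestion time."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    pack = ranked[:budget]  # hard bound on what the model may see
    # Each item keeps its provenance so the synthesis step stays auditable.
    return [{"snippet": s, "source": src} for s, _rel, src in pack]

candidates = [
    ("refund window is 30 days", 0.92, "policy.md#12"),
    ("founded in 2009", 0.10, "about.md#3"),
    ("refunds need a receipt", 0.85, "policy.md#14"),
    ("office dog is named Biscuit", 0.05, "blog.md#7"),
]
pack = build_evidence_pack(candidates, budget=2)
# pack holds only the two highest-relevance snippets, each with provenance
```

The model then synthesises from `pack`, not from the corpus; the budget, not the context window, is the constraint surface.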
Where Stylobot infers behaviour from requests over time, lucidRAG infers meaning from multimodal evidence: document structure, OCR confidence, entity graphs, ranking signals, source quality, deduplication.
Different substrate. Same shape.
flowchart LR
subgraph Stylobot["Stylobot"]
SB1[Request Signals]
SB2[Temporal Evidence]
SB3[Behaviour Inference]
SB4[Policy Action]
SB1 --> SB2 --> SB3 --> SB4
end
subgraph LucidRAG["lucidRAG"]
LR1[Content Signals]
LR2[Evidence + Retrieval]
LR3[Meaning Inference]
LR4[Bounded Synthesis]
LR1 --> LR2 --> LR3 --> LR4
end
style Stylobot stroke:#3b82f6,stroke-width:2px
style LucidRAG stroke:#22c55e,stroke-width:2px
Neither is really an "app." Both are inference engines working over different inputs.
In Constrained Fuzzy MoM, multiple probabilistic components can propose, but they communicate through typed signals and deterministic logic decides what survives.
That is multi-model coordination without surrendering control.
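A sketch of that contract, assuming hypothetical signal kinds: proposals outside the contract are dropped, and competition between proposers is resolved deterministically.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TypedSignal:
    kind: str        # contract: proposers may only emit known kinds
    value: str
    confidence: float

ALLOWED_KINDS = {"entity", "summary", "label"}

def arbitrate(proposals: list[TypedSignal]) -> dict[str, str]:
    """Deterministic survival rule: per kind, highest confidence wins."""
    decided: dict[str, TypedSignal] = {}
    for p in proposals:
        if p.kind not in ALLOWED_KINDS:
            continue  # violates the contract: dropped, never "interpreted"
        best = decided.get(p.kind)
        if best is None or p.confidence > best.confidence:
            decided[p.kind] = p
    return {k: v.value for k, v in decided.items()}

result = arbitrate([
    TypedSignal("label", "bot", 0.7),
    TypedSignal("label", "human", 0.4),   # loses deterministically
    TypedSignal("vibes", "sus", 0.99),    # unknown kind, dropped
])
```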
In Constrained Fuzzy Context Dragging, the system keeps bounded memory and preserves the parts of context that matter long enough for later interpretation.
Inference needs time. Context dragging makes time available without letting memory grow without bound.
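One way to sketch that, assuming an LRU-style bound where re-referenced items are "dragged" forward and untouched ones eventually fall out:

```python
from collections import OrderedDict

class DraggedContext:
    """Bounded memory: items that later evidence touches survive;
    items nothing refers back to are the ones evicted."""

    def __init__(self, capacity=3):
        self.items = OrderedDict()
        self.capacity = capacity

    def touch(self, key, value=None):
        if key in self.items:
            self.items.move_to_end(key)  # still relevant: drag it forward
            if value is not None:
                self.items[key] = value
        else:
            self.items[key] = value
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)  # oldest untouched item falls out

ctx = DraggedContext(capacity=2)
ctx.touch("session_start", "t0")
ctx.touch("suspect_ua", "HeadlessChrome")
ctx.touch("session_start")           # re-referenced, so it survives
ctx.touch("rate_burst", "spike")     # evicts suspect_ua, not session_start
```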
In StyloFlow, components do not call each other directly. They emit signals, and orchestration reacts to those signals and their confidence.
That is behavioural inference applied to workflow infrastructure.
"Behavioural inference systems" is a better umbrella than "agentic systems" or "LLM apps." It describes the architecture instead of the marketing wrapper.
If I had to compress the whole lineage into a few rules: let probabilistic components propose and deterministic systems decide; treat signals, not models, as the design primitive; accumulate evidence over time instead of demanding one perfect fact; keep the decision path observable enough to tune on purpose.
The normative version of those rules is The Ten Commandments of AI Engineering. This article is the architectural version.
flowchart LR
A[Ten Commandments] --> B[Architectural Constraints]
B --> C[Behavioural Inference Systems]
C --> D[Tuneable Engines]
style A stroke:#8b5cf6,stroke-width:2px
style B stroke:#ef4444,stroke-width:2px
style C stroke:#22c55e,stroke-width:2px
style D stroke:#3b82f6,stroke-width:2px
mindmap
root((Behavioural Inference))
DiSE
Search
Mutation
Selection
Constrained Fuzziness
Substrate
Proposer
Constrainer
Signals
Evidence
Confidence
Provenance
Time
Memory
Drift
Temporal Context
Action
Policy
Thresholds
Deterministic Boundaries
The AI systems that survive production are usually not giant autonomous blobs. They are also not endless piles of rules.
They are systems that collect partial signals, accumulate evidence over time, infer behaviour, and act through deterministic policy surfaces.
That is a better engineering story than "the model got smarter."
Models will improve. Fine. Architecture still determines whether a system is debuggable, auditable, cheap to run, safe to evolve, and robust under adversarial pressure.
Behavioural inference systems take those constraints seriously.
The lineage looks obvious in hindsight:
flowchart LR
D[DiSE<br/>Explore and Select] --> CF[Constrained Fuzziness<br/>Bound the Uncertain]
CF --> BI[Behavioural Inference Systems<br/>Infer from Weak Signals]
BI --> ST[Stylobot / Reduced RAG / StyloFlow<br/>Working Architectures]
style D stroke:#3b82f6,stroke-width:2px
style CF stroke:#f59e0b,stroke-width:2px
style BI stroke:#22c55e,stroke-width:2px
style ST stroke:#8b5cf6,stroke-width:2px
DiSE gave me a way to think about architectural search. Constrained Fuzziness gave me a way to keep probabilistic components inside clear boundaries. Behavioural inference systems are what you get when those ideas are forced to survive production.
Stylobot is just the current example.
Once you start seeing systems as evidence accumulators with deterministic action surfaces, a lot of modern software stops looking like "AI features" and starts looking like the same pattern in different domains.
© 2026 Scott Galloway — Unlicense — All content and source code on this site is free to use, copy, modify, and sell.