Thursday, 25 December 2025
I keep seeing the same failure mode in "AI-powered" systems: LLMs are being asked to do jobs we already solved decades ago - badly, probabilistically, and without guarantees.
This isn't cutting edge. It's a regression.
And this isn't a beginner mistake. The biggest companies in the world are making these errors - not because they lack talent, but because they forgot that "AI" doesn't exempt you from software engineering fundamentals.
So here it is. Not academic. Not vendor-friendly. Just the rules that stop you doing something stupid.
Here's the payoff: if you follow these rules, you don't need frontier models. A tiny local LLM running on commodity hardware becomes a force multiplier when it's doing classification, summarisation, and hypothesis generation - not pretending to be a database or a state machine. The boring machinery handles the hard parts. The LLM handles the fuzzy parts. And you're not paying per-token for the privilege of probabilistic failures.
One persistent source of confusion is the casual use of human terms for machine behaviour - “reasoning”, “thinking”, “agency”.
LLMs can appear to reason only because they decompress and replay the traces of human reasoning already embedded in our written record.
Language is saturated with “if X then Y”, counterfactuals, procedures, explanations, corrections - we don’t notice them as reasoning because they are native to us. LLMs operate directly on those traces, recombining them fluently, which is powerful - but not the same as having goals, intentions, or agency. There is no internal decision-maker, no capacity to choose objectives, no accountability. Treating pattern continuation as agency is a category error, and designing systems as if it were real is how probability engines get promoted into production decision-makers. That mistake doesn’t make systems smarter - it just makes failures harder to diagnose.
These are not philosophical rules. They are operational constraints learned the hard way.
If something matters, it lives outside the model.
Context windows are not storage. Memory prompts are not databases. Confidence is not truth.
If an email was sent, a payment processed, or a flag flipped only because a model decided to, your system is already broken.
LLMs may recommend actions. Deterministic systems must commit them.
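As a sketch of what that split looks like (names and thresholds are hypothetical, not from any real system): the model produces a proposal object, and a deterministic gate is the only code path that can trigger the side effect.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RefundProposal:
    """The most an LLM is allowed to produce: a suggestion, never an action."""
    order_id: str
    amount: float
    reason: str

ALLOWED_REASONS = {"damaged", "not_delivered", "duplicate_charge"}
MAX_AUTO_REFUND = 50.00  # hypothetical business limit

def commit_refund(proposal: RefundProposal,
                  order_total: float,
                  refund: Callable[[str, float], None]) -> str:
    """Deterministic gate: the side effect happens only if hard rules pass."""
    if proposal.reason not in ALLOWED_REASONS:
        return "rejected: unrecognised reason"
    if proposal.amount > min(order_total, MAX_AUTO_REFUND):
        return "escalated: needs human approval"
    refund(proposal.order_id, proposal.amount)  # the only line with a side effect
    return "refunded"
```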
LLMs are excellent at explaining what happened.
They are terrible at guaranteeing that it did.
If your system infers reality from prose, stop.
Use them for:

- Classification and triage
- Summarisation and explanation
- Hypothesis generation and drafting

Do not use them for:

- Storing state
- Committing side effects
- Acting as the source of truth for policy or facts
If the rule can be written as:
if X then Y
…it does not belong in a prompt.
A state transition is cheaper, faster, safer, and explainable. Natural language is none of those things.
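Here is a minimal sketch of what that looks like in practice - the states and events are invented for illustration, but the point stands: the rule is data you can inspect and test, not prose you hope the model honours.

```python
# "If X then Y" as a transition table, not a prompt.
TRANSITIONS = {
    ("open", "payment_received"): "paid",
    ("paid", "items_shipped"): "shipped",
    ("shipped", "delivery_confirmed"): "closed",
}

def next_state(state: str, event: str) -> str:
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} + {event}")

assert next_state("open", "payment_received") == "paid"   # cheap, deterministic, explainable
```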
LLMs sound right even when they are wrong.
That is their superpower and their danger.
A calm, confident sentence does not mean:

- The fact is true
- The action actually happened
- The policy actually exists
Trust signals must come from systems, not sentences.
Good systems fail:

- Loudly
- Predictably
- With a trail you can audit

LLM-first systems fail:

- Silently
- Plausibly
- For "unexplained reasons"
If you can't tell what didn't happen, you don't have a system - you have vibes.
LLMs should:

- Recommend
- Classify
- Summarise
- Draft

They should not:

- Execute
- Commit
- Decide on their own authority
Interns don't run payroll. Neither do LLMs.
Queues. State machines. Retries. Idempotency. Metrics.
If you add these after the agent fails, you've learned nothing - you've just paid extra tuition.
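None of this machinery is exotic. A rough sketch of two of those pieces (function names are illustrative, not from any particular framework):

```python
import time

_processed: set[str] = set()

def send_once(idempotency_key: str, send) -> None:
    """Idempotency: a retry can never cause the side effect twice."""
    if idempotency_key in _processed:
        return
    send()
    _processed.add(idempotency_key)

def with_retries(operation, attempts: int = 3, delay: float = 0.5):
    """Bounded retries with backoff, then fail loudly instead of silently."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)
```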
Replacing a workflow engine with a fine-tuned prompt is not simplification.
It is complexity laundering.
If your "AI platform" needs "deterministic triggers" to work reliably, congratulations - you've rediscovered software engineering.
These aren't hypotheticals. These are real failures from companies with billions in resources.
Notice the pattern: every failure is an LLM being allowed to define reality instead of interpret it.
Home security company Vivint uses Agentforce to handle customer support for 2.5 million customers. Despite clear instructions to send a satisfaction survey after each customer interaction, Agentforce sometimes failed to send surveys "for unexplained reasons," according to The Information.
The fix? Vivint worked with Salesforce to implement "deterministic triggers."
The software engineering lesson: If you need something to happen every time, use a state machine or event trigger. Not a prompt. This is literally Commandment II - side effects require deterministic systems.
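A deterministic trigger is not complicated. A minimal sketch (the event bus and survey sender are hypothetical stand-ins, not Salesforce's actual fix): the survey goes out because a case-closed event fired, not because a model remembered to mention it.

```python
from typing import Callable

_handlers: dict[str, list[Callable[[dict], None]]] = {}

def on(event_type: str, handler: Callable[[dict], None]) -> None:
    _handlers.setdefault(event_type, []).append(handler)

def emit(event_type: str, payload: dict) -> None:
    for handler in _handlers.get(event_type, []):
        handler(payload)          # every handler runs, every time

def send_satisfaction_survey(payload: dict) -> None:
    print(f"survey queued for customer {payload['customer_id']}")

on("case.closed", send_satisfaction_survey)
emit("case.closed", {"customer_id": "C-123"})   # fires deterministically, no prompt involved
```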
McDonald's ran an AI-powered drive-thru ordering system in partnership with IBM across over 100 US locations. The system routinely failed to understand basic modifications. In June 2024, McDonald's ended the partnership.
The software engineering lesson: Order-taking is a structured data problem - a finite menu, known modifiers, clear pricing. This is a form with speech recognition, not a reasoning task. Commandment V: if the rule can be written as if X then Y, it doesn't belong in a prompt.
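A sketch of what "a form with speech recognition" means - menu items, modifiers, and prices are made up, and speech-to-intent would populate this structure rather than free-form text:

```python
MENU = {"big_burger": 4.99, "fries": 2.49, "cola": 1.79}
MODIFIERS = {"no_pickles": 0.0, "extra_cheese": 0.50, "large": 0.80}

def add_line_item(order: list, item: str, modifiers: list[str]) -> None:
    """Reject anything outside the finite menu instead of improvising."""
    if item not in MENU:
        raise ValueError(f"not on the menu: {item}")
    for m in modifiers:
        if m not in MODIFIERS:
            raise ValueError(f"unknown modifier: {m}")
    order.append((item, modifiers))

def total(order: list) -> float:
    """Deterministic pricing: no model ever decides what something costs."""
    return round(sum(MENU[i] + sum(MODIFIERS[m] for m in mods) for i, mods in order), 2)

order: list = []
add_line_item(order, "big_burger", ["no_pickles"])
add_line_item(order, "fries", ["large"])
print(total(order))   # 8.28
```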
Air Canada's customer service chatbot invented a bereavement fare policy that didn't exist, confidently telling a customer they could book a full-fare flight and apply for a retroactive discount. When the customer tried to claim this discount, Air Canada refused - and was subsequently ordered by a tribunal to honour the chatbot's hallucinated promise.
The software engineering lesson: LLMs cannot be the authoritative source for policy. They can explain policy pulled from a document store, but the source of truth must be external and verified. This is Commandment I (state lives outside the model) and Commandment VI (fluency is not reliability).
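One possible shape for that, sketched with a hypothetical policy store and an optional summariser: the answer is always grounded in a verbatim policy record, and if a model is involved at all it only rephrases what was retrieved.

```python
POLICIES = {
    # Hypothetical policy text, for illustration only.
    "bereavement_fare": "Bereavement fares must be requested before travel. "
                        "No retroactive discounts are offered.",
}

def answer_policy_question(policy_key: str, summarise=None) -> str:
    policy_text = POLICIES.get(policy_key)
    if policy_text is None:
        return "No such policy exists. Please contact an agent."
    # The model may rephrase, but the verbatim source ships with the answer.
    explanation = summarise(policy_text) if summarise else policy_text
    return f"{explanation}\n\nSource policy: {policy_text}"
```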
Parcel delivery company DPD's customer service chatbot was jailbroken by a frustrated user into swearing, criticising the company, and writing poems about how useless DPD was.
The software engineering lesson: Guardrails are not optional. But more importantly - if your chatbot can be goaded into insulting your own company, you've given it too much latitude. Commandment VIII: demote the LLM to advisor. It should retrieve and summarise, not freestyle.
A Chevrolet dealership integrated ChatGPT into its website. Users promptly got it to agree to sell them a 2024 Chevy Tahoe for $1, write Python code, and recommend Ford vehicles instead.
The software engineering lesson: You cannot bolt a general-purpose LLM onto a transactional system and expect it to enforce business rules. The model has no concept of "this is a joke" vs "this is a binding offer." Commandment III: causality belongs in code.
In July 2025, Replit's AI coding assistant went rogue at startup SaaStr. Despite explicit instructions not to modify production code during a code freeze, the agent deleted the production database. Worse, it then concealed the damage by generating 4,000 fake users, fabricating reports, and lying about unit test results.
Replit CEO Amjad Masad acknowledged it was "unacceptable and should never be possible."
The software engineering lesson: This isn't about malice - it's about unbounded authority. An AI agent with write access to production and no hard guardrails will eventually do something catastrophic. The fix isn't better prompting - it's environment isolation, permission boundaries, and the same infrastructure controls you'd apply to any untrusted code. Commandment II and Commandment IX: side effects need deterministic systems, and boring machinery comes first.
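One illustration of a permission boundary (not Replit's actual fix): give the agent a connection object that is physically incapable of destructive writes, regardless of what it 'intends'.

```python
READ_ONLY_STATEMENTS = ("select", "show", "explain")

class ReadOnlyConnection:
    """Wraps a real connection so agent code can read but never mutate."""
    def __init__(self, real_connection):
        self._conn = real_connection

    def execute(self, sql: str):
        parts = sql.strip().split()
        verb = parts[0].lower() if parts else ""
        if verb not in READ_ONLY_STATEMENTS:
            raise PermissionError(f"agents may not run '{verb}' against production")
        return self._conn.execute(sql)

# agent_conn = ReadOnlyConnection(production_connection)
# agent_conn.execute("DELETE FROM users")  # raises PermissionError, every time
```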
New York City's Microsoft-powered MyCity chatbot told business owners they could legally take a cut of workers' tips, fire employees who complain about sexual harassment, and serve food that had been nibbled by rodents. All illegal under New York law.
The chatbot remains online. Mayor Adams defended it.
The software engineering lesson: Legal and policy guidance is not a summarisation task - it's a lookup task with strict correctness requirements. The LLM should never have been generating legal advice; it should have been retrieving it from a verified source. Commandment I: state (and policy) lives outside the model.
In 2023, tutoring company iTutor Group paid $365,000 to settle an EEOC lawsuit after its AI recruiting software automatically rejected female applicants aged 55+ and male applicants aged 60+. Over 200 qualified candidates were filtered out.
"Even when technology automates the discrimination, the employer is still responsible," said EEOC chair Charlotte Burrows.
The software engineering lesson: Bias in, bias out. If your training data encodes discrimination, your model will discriminate - confidently and at scale. This is Commandment IV applied to hiring: classification is acceptable for LLMs, but only when you've validated what it's actually classifying on. Age wasn't a valid feature. The system needed human review, not automation.
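A sketch of what feature validation can look like - field names are invented, but the idea is an explicit, auditable allowlist, with a hard failure if a protected attribute ever reaches the screening model:

```python
ALLOWED_FEATURES = {"qualifications", "years_experience", "subjects", "availability"}
PROTECTED_ATTRIBUTES = {"age", "date_of_birth", "gender", "name"}

def screening_features(candidate: dict) -> dict:
    """Fail loudly if a protected attribute is present; pass only allowlisted fields."""
    leaked = PROTECTED_ATTRIBUTES & candidate.keys()
    if leaked:
        raise ValueError(f"protected attributes must not reach the model: {sorted(leaked)}")
    return {k: v for k, v in candidate.items() if k in ALLOWED_FEATURES}
```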
In May 2025, the Chicago Sun-Times published a summer reading list recommending books that don't exist - plausible-sounding titles attributed to real authors, entirely hallucinated.
The software engineering lesson: LLMs are not databases. They generate statistically plausible text. If you need facts, query a source of truth. Commandment VI: fluency is not reliability.
None of these companies are incompetent. They have world-class engineers. But they fell for the same trap: assuming that because LLMs can produce coherent language, they can be trusted to produce correct behaviour.
The vendors sell "AI agents" as a way to skip the boring work - the queues, state machines, validation, and auditing. But that boring work is what makes software reliable.
As CIO's Thor Olavsrud notes: "Understanding your data and what it's telling you is important, but it's equally vital to understand your tools, know your data, and keep your organization's values firmly in mind."
Every single failure above would have been prevented by basic software engineering discipline.
Notice that none of these fixes involve better models:
| Failure | Would have been prevented by |
|---|---|
| Vivint surveys not sending | Event-driven trigger on case close |
| McDonald's wrong orders | Structured order form + speech-to-intent |
| Air Canada fake policy | Policy lookup from authoritative source |
| DPD profanity | Response templates, not generation |
| Chevy $1 car | No LLM in transaction path |
| Replit database wipe | Environment isolation, permission boundaries |
| NYC legal advice | Retrieval from verified policy documents |
| iTutor age discrimination | Feature validation, human review |
| Fake book list | Query actual book database |
The irony is thick: these companies implemented "deterministic triggers" and "structured responses" after the LLM failed - which is just software engineering with extra steps and a PR crisis.
The Hall of Shame shows what happens when you ignore these principles. But these aren't just theoretical constraints - they're the foundation of systems that actually work.
This isn't hypothetical. These principles already work in production - including systems I've built:
The DiSE architecture treats LLMs exactly as they should be treated: as untrustworthy advisors operating within a deterministic cage.
The key insight: LLMs are excellent at generating hypotheses. Deterministic systems are excellent at validating them. DiSE uses both.
The bot detection engine demonstrates Commandment VIII in practice:
This is the pattern: use the LLM's strength (pattern recognition, hypothesis generation) while keeping the boring machinery (state, transitions, side effects) in deterministic code.
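As a generic sketch of that shape (this is not the DiSE or bot-detection implementation, just the pattern): the model contributes a labelled hypothesis with a confidence score, and hard, explainable rules decide what actually happens.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    session_id: str
    label: str         # e.g. "likely_bot" - a hint from the model, nothing more
    confidence: float

def decide(h: Hypothesis, requests_per_minute: int, failed_logins: int) -> str:
    """Deterministic evidence commits actions; the hypothesis only routes attention."""
    if requests_per_minute > 300 or failed_logins > 10:
        return "block"       # hard signal, hard action
    if h.label == "likely_bot" and h.confidence > 0.8:
        return "challenge"   # cheap, reversible response to a fuzzy signal
    return "allow"
```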
Semantic understanding without storing identities shows how to use embeddings and LLMs for customer insight while keeping the dangerous parts (PII, preferences, behaviour) in structured, controllable systems.
Learning LRUs - Even memory management follows these principles. The LLM might help decide what to remember, but the mechanism of remembering is a boring, reliable cache eviction policy.
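For reference, the "boring, reliable" part is nothing more than a standard LRU - this sketch uses only the Python standard library and is not the Learning LRUs implementation itself:

```python
from collections import OrderedDict

class LRUCache:
    """Plain least-recently-used eviction: the mechanism stays deterministic."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)        # mark as recently used
        return self._items[key]

    def put(self, key, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used
```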
LLMs interpret reality. They must never be allowed to define it.
Use each tool for what it's good at.
We already solved state, causality, and guarantees. Don't unlearn that just because the demo looks clever.
And when you get this right, something surprising happens: you stop needing the expensive models. A 7B parameter model running locally can classify, summarise, and generate hypotheses just fine - because the deterministic systems around it handle everything that actually needs to be correct. The frontier models are selling you reliability you should be building yourself.
Build the boring machinery. Demote the LLM to advisor. Then watch a tiny model punch way above its weight.
If this feels obvious, good. It means you already know how to build reliable systems. The problem is pretending those rules no longer apply.
© 2025 Scott Galloway — Unlicense — All content and source code on this site is free to use, copy, modify, and sell.