"We're building an AI-powered knowledge assistant that will revolutionise how our employees access information..." — Every job advert, pitch deck, and consultancy proposal in 2024-2025
I've spent the past month or so properly immersing myself in the commercial AI space. Read copious amounts of marketing bumf. Pored over dozens of job adverts. Sat through more product demos than I care to admit. And I've come to a rather depressing conclusion.
It's almost all the same thing.
"AI-powered company knowledge base." "Intelligent document search." "Customer chatbot with enterprise knowledge." Strip away the breathless marketing copy and you'll find the same architecture, the same failure modes, and the same disappointed stakeholders about six months down the line.
The customer chatbot variants are particularly entertaining—they can become PR disasters remarkably quickly when developers who don't understand guardrails let them loose on the public. Nothing quite like your support bot cheerfully offering refunds you don't give, making up product features that don't exist, or going on a philosophical tangent about the meaning of existence when someone asks about shipping times. Without proper constraints, these bots will happily promise anything, admit to crimes your company didn't commit, or develop strong opinions about competitors. The canonical example remains Air Canada's chatbot confidently inventing a bereavement policy that didn't exist—and the company being held to it in court.
Here's the dirty secret that nobody in management wants to hear: most commercial AI projects aren't innovative. They're commodity plumbing with a fancy label.
Before I go further, a disclaimer: I'm not an "AI visionary" or whatever the hype-driven title of the moment is (mostly former "blockchain visionaries" who've conveniently pivoted). I'm a software engineer who's been building these kinds of systems—search, knowledge management, natural language processing, decision support—for close to three decades. I've seen this picture before with expert systems, with semantic web, with big data, with blockchain. The technology changes; the pattern of overpromising and underdelivering doesn't.
Strip away the marketing bollocks and you'll find that approximately 95% of commercial "AI" projects fall into one of two categories:
The first, and by far the most common, is the RAG pipeline. Every "enterprise AI solution" follows this pattern:
flowchart LR
A[Documents] --> B[Document Ingestion Pipeline]
B --> C[Vector Database / RAG]
C --> D[Hybrid Search]
D --> E[Construct Prompts]
E --> F[LLM API Call]
F --> G[Response to User]
style B stroke:#ff0000,stroke-width:4px
That red box? That's where everything breaks. More on that in a moment.
The pitch sounds impressive: "We've built an AI that understands your company's documents and can answer questions intelligently!"
The reality: You've built a search engine with extra steps and a £10,000/month OpenAI bill.
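To be concrete about how samey this is, here's a minimal sketch of the whole pattern in Python. The embedding model, the toy documents, and the `call_llm` stub are all illustrative assumptions rather than recommendations; the point is how little there is to it.

```python
# Minimal RAG sketch: embed documents, retrieve by cosine similarity,
# stuff the top hits into a prompt. This is the whole "enterprise AI solution".
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small local embedding model (example choice)

documents = [
    "Refunds are processed within 14 days of receipt.",
    "Standard shipping takes 3-5 working days.",
    "Our office is closed on bank holidays.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q_vec  # cosine similarity (vectors are normalised)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your OpenAI/Anthropic/local model call here.
    return "[the LLM would synthesise an answer from the context here]"

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)

print(retrieve("How long does delivery take?"))
```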
The second category is fine-tuning, and it's even sillier. The pitch: "We've trained our own AI model specifically for your industry!"
The reality: You've taken someone else's model and fine-tuned it on a dataset that's probably too small, too dirty, and too narrow to make a meaningful difference over just using the base model with good prompts.
flowchart TD
A[Existing LLM] --> B[Collect Training Data]
B --> C[Clean Data - Maybe]
C --> D[Fine-tune Model]
D --> E[Deploy Model]
E --> F[Discover it's not much better than base model]
F --> G[Keep paying for inference anyway]
G --> H[Hope nobody notices]
Most fine-tuning projects I've seen would have been better served by spending the same money on improving their prompts and retrieval systems.
Let's talk about that red box in the RAG diagram. Document ingestion is the part that fails most often, and it's the part that gets the least attention in the flashy demos.
Here's what the sales demo shows:
Here's what actually happens:
PDFs are a nightmare. They were designed for printing, not for extracting structured data. Every PDF parser I've used has different failure modes:
And that's just PDFs. Wait until you hit:
Once you've extracted text (poorly), you need to chunk it for your vector database. This is where more magical thinking happens.
"We'll use semantic chunking!" Great, your 200-page contract is now 500 chunks, and the AI has no idea which ones are related or in what order they appear.
"We'll use fixed-size chunks with overlap!" Perfect, you've just split a sentence in half and the embedding now represents nonsense.
flowchart TD
A[Original Document] --> B[Chunk 1: This contract shall be governed by]
A --> C[Chunk 2: the laws of the State of California]
B --> D[Embedding: Legal stuff?]
C --> E[Embedding: Geography?]
D --> F[User asks about jurisdiction]
E --> F
F --> G[AI: Based on my knowledge, possibly California, or maybe legal governance, who knows]
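Here's a quick sketch of why that happens with naive fixed-size chunking. The text and chunk sizes are illustrative, but the mid-word splits are exactly what the diagram above is complaining about.

```python
# Naive fixed-size chunking with overlap: watch where the sentence gets cut.
def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

clause = "This contract shall be governed by the laws of the State of California."
for i, piece in enumerate(chunk(clause)):
    print(i, repr(piece))
# 0 'This contract shall be governed by the l'
# 1 'd by the laws of the State of California'
# 2 'California.'
```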
Good RAG needs good metadata. But extracting metadata from documents is hard:
Most organisations have decades of documents with inconsistent naming conventions, folder structures that made sense to someone who left in 2003, and metadata that's either missing or wrong.
Let me be clear: fine-tuning has its place. But the way most companies approach it is fundamentally broken.
Fine-tuning requires quality training data. Most companies don't have it. They have:
"We'll fine-tune on our support tickets!" Your support tickets are full of frustrated customers, incorrect information from junior staff, and edge cases that don't represent normal usage.
"We'll fine-tune on our sales calls!" You mean the ones where salespeople make promises the product can't keep?
How do you know if your fine-tuned model is actually better? Most companies can't answer this because:
I've seen companies spend six months fine-tuning a model and then have no way to prove it's better than just using GPT-4 with a good system prompt.
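If you're going to fine-tune anyway, the bare minimum is a held-out evaluation set and a side-by-side comparison against the base model with a decent prompt. A minimal sketch of what that looks like, with placeholder model calls and the crudest possible scoring (swap in your own):

```python
# Minimal eval harness: same questions, two models, compare scores.
# ask_base / ask_finetuned are placeholders for your actual model calls,
# and substring matching is the crudest possible metric - replace it.
eval_set = [
    {"question": "What is the refund window?", "expected": "14 days"},
    {"question": "Do we ship internationally?", "expected": "UK only"},
]

def ask_base(question: str) -> str:
    return "14 days"          # placeholder: base model + good system prompt

def ask_finetuned(question: str) -> str:
    return "two weeks"        # placeholder: your fine-tuned model

def score(model_fn, name: str) -> None:
    hits = sum(item["expected"].lower() in model_fn(item["question"]).lower()
               for item in eval_set)
    print(f"{name}: {hits}/{len(eval_set)} correct")

score(ask_base, "base + prompt")
score(ask_finetuned, "fine-tuned")
```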
Fine-tuned models need updating. Your business changes. Your products change. Your processes change. That model you fine-tuned in January is now giving answers based on outdated information.
But updating means:
Most companies fine-tune once and then just live with the drift. The model slowly becomes less relevant while everyone pretends it's still adding value.
Don't get me wrong. There are legitimate use cases:
flowchart TD
subgraph RAGWorks["✅ RAG Works When"]
R1[Well-structured docs]
R2[Clear metadata]
R3[Search + synthesis use case]
R4[Heavy ingestion investment]
R5[Feedback loops exist]
end
subgraph FTWorks["✅ Fine-Tuning Works When"]
F1[Large high-quality dataset]
F2[Base model truly struggles]
F3[Ongoing maintenance budget]
F4[Clear eval metrics]
F5[Already tried prompting + RAG]
end
style R4 stroke:#00aa00,stroke-width:2px
style F5 stroke:#00aa00,stroke-width:2px
The green boxes are the prerequisites most projects skip. "We've already optimised prompting and RAG" is the bar for fine-tuning. "Heavy ingestion investment" is the bar for RAG. Skip these and you're building on sand.
While everyone's building the same RAG pipeline, the actually interesting problems in commercial AI are being ignored:
The biggest constraint on AI effectiveness isn't the model. It's the data. Most organisations have:
But "data quality initiative" doesn't get you a Forbes article like "AI transformation" does.
Dropping an AI chatbot into an existing process doesn't magically make it better. The process needs to be redesigned around the AI's capabilities and limitations. Most companies just bolt AI onto broken processes and wonder why it doesn't help.
The best AI implementations augment human capabilities rather than trying to replace them. But that's harder to sell than "AI that does X automatically!"
Humans + AI working together requires:
Most commercial AI projects treat the human as an afterthought.
The truly valuable AI applications aren't "chatbot on your documents." They're applications that:
But those are hard. RAG pipelines are easy (well, easier). So that's what everyone builds.
A significant driver of dumb AI projects is the consultancy ecosystem:
flowchart TD
A[Big Consultancy Tells C-Suite: You Need AI!] --> B[C-Suite Panics]
B --> C[Consultancy Deploys Army of Juniors]
C --> D[Recommendations: RAG + Fine-Tuning]
D --> E[Build Same Thing as Last 20 Clients]
E --> F[Demo Goes Well]
F --> G[Reality: Real Data Breaks Everything]
G --> H[Consultancy Moves On]
H --> I[Internal Team Struggles]
I --> J[Project Quietly Fails]
J --> K[Nobody Admits It]
K --> A
style G stroke:#ff0000,stroke-width:3px
style J stroke:#ff0000,stroke-width:3px
I've seen this pattern dozens of times. The consultancy gets paid. The executives get to say they "did AI." The engineers get stuck maintaining something that barely works. And the actual business problem remains unsolved.
Startups fall into a slightly different trap. It's not consultancies driving the dysfunction—it's the funding environment.
timeline
title Technology Requirements for VC Funding
2005 : Web 2.0 - "You need social features"
2010 : Mobile - "You need an app"
2015 : Cloud - "You need to be cloud-native"
2018 : Blockchain - "You need a token"
2023 : AI - "You need an AI strategy"
Sound familiar? Every few years, there's a new technology that VCs decide is essential. If your pitch deck doesn't mention it prominently, you're not getting funded. The technology might be irrelevant to your actual product—doesn't matter. You need the buzzword.
I watched this happen with blockchain in 2017-2018. Companies that had no business being on a blockchain were shoehorning tokens into their products because that's what got funding. Most of those blockchain features quietly disappeared once the money was secured.
Now it's happening with AI.
Startups are bolting LLM features onto products that don't need them because:
The result? Products with awkward AI features that users ignore. Burned runway on fine-tuning experiments that go nowhere. Engineering time wasted on RAG systems when a simple database query would work better.
The worst part: many founders know this is silly. They're building AI features they don't believe in because they need to survive long enough to build what they actually care about. Some succeed at this game. Most don't.
If you're a startup founder being pushed to add AI, ask yourself: does this feature genuinely make the product better for users, or is it just there to tick a box for investors?
If it's the latter, build the minimum viable AI feature that ticks the box, then focus on what actually matters. Don't let the funding environment distract you from building something valuable.
Here's the uncomfortable truth nobody in AI wants to discuss: almost everyone in the ecosystem has a financial incentive to keep the hype going.
flowchart TD
subgraph Researchers["🔬 Frontier Labs"]
R1[Need billions for compute]
R2[Must show progress to justify spend]
R3[Hype generates investment]
end
subgraph Companies["🏢 Tech Companies"]
C1[Need AI angle for valuation]
C2[Must justify AI team costs]
C3[Hype drives stock price]
end
subgraph Investors["💰 Financial Backers"]
I1[Massive capital deployed]
I2[Need exits and returns]
I3[Hype maintains valuations]
end
subgraph Media["📰 Tech Media"]
M1[AI stories get clicks]
M2[Access depends on positive coverage]
M3[Hype drives engagement]
end
R3 --> I1
C3 --> I1
I3 --> R1
I3 --> C1
M3 --> R3
M3 --> C3
The researchers at frontier labs need billions in compute to train the next generation of models. That money comes from investors and big tech. To justify that spend, they need to demonstrate progress—and "progress" gets translated into breathless announcements about capabilities that may or may not materialise in practical applications. If the hype dies, the funding dries up.
The companies (both the AI labs and everyone using AI) need the narrative to continue. OpenAI's valuation depends on the belief that AGI is around the corner. Every "AI-powered" startup's multiple depends on AI remaining the hot sector. The moment sentiment shifts, billions in paper wealth evaporates.
The investors have deployed staggering amounts of capital into AI. They need exits. They need the music to keep playing long enough to realise returns. A realistic assessment of near-term AI capabilities would crater valuations across the sector.
The media has discovered that AI stories generate massive engagement. Nuanced coverage doesn't get clicks. "AI will take your job" and "AI breakthrough solves X" do. Access to AI companies often depends on maintaining positive relationships—which means critical coverage is career-limiting.
The result? A self-reinforcing hype cycle where everyone has reasons to keep inflating expectations, and very few people benefit from telling the truth.
This doesn't mean AI isn't genuinely useful—it absolutely is, as I've discussed throughout this article. But the gap between what's being promised and what's being delivered is vast, and the incentives are all aligned to keep that gap hidden.
When someone tells you AI will revolutionise your business, ask yourself: what do they gain from you believing that?
If you're considering an AI project, here's my honest advice:
Don't ask "how can we use AI?" Ask "what problem are we trying to solve?" If AI is the right solution, great. But often it isn't.
Before you build a RAG pipeline, fix your document mess. Before you fine-tune, clean your training data. The AI won't fix your data problems; it will amplify them.
Don't launch a massive "AI transformation." Build a small proof of concept. Test it with real users. Learn what actually works. Then expand.
The document ingestion pipeline isn't sexy. The data cleaning isn't exciting. The evaluation framework isn't something you can demo to the board. But these are what determine success or failure.
AI isn't magic. Current LLMs hallucinate. RAG systems miss relevant documents. Fine-tuned models drift. Set realistic expectations.
Building the system is maybe 30% of the effort. Maintaining it, improving it, and keeping it relevant is the other 70%. Budget accordingly.
Maybe what you need is just better use of off-the-shelf tools. ChatGPT with some custom instructions might be enough. Not everything needs a bespoke AI platform.
Right, I've spent a fair few words slagging off commercial AI projects. Now for the hopeful bit: these projects can actually work, if you approach them sensibly.
The problem isn't RAG or fine-tuning as concepts. The problem is lazy implementation, unrealistic expectations, and ignoring the fundamentals. Here's how to do it better.
The biggest mistake I see is treating AI as a replacement for existing processes rather than an enhancement. Your business already has workflows that work (mostly). Instead of ripping them out and replacing them with an "AI-powered" version, hook AI into the gaps.
flowchart TD
subgraph Traditional["Traditional Workflow"]
A[Document Arrives] --> B[Human Reviews]
B --> C[Decision Made]
C --> D[Action Taken]
D --> E[Results Logged]
end
subgraph Enhanced["AI-Enhanced Workflow"]
A2[Document Arrives] --> AI1[AI: Extract Key Info]
AI1 --> B2[Human Reviews - With AI Summary]
B2 --> AI2[AI: Suggest Decision Based on History]
AI2 --> C2[Human Makes Final Decision]
C2 --> D2[Action Taken]
D2 --> AI3[AI: Auto-categorise & Log]
AI3 --> E2[Results Available for Future AI Training]
end
Notice what's different:
This is vastly more robust than "AI handles everything and sometimes a human checks."
Here's another dirty secret: you probably don't need GPT-4 or Claude for most tasks. And you definitely don't need to send your confidential documents to OpenAI's servers.
The Cost Problem
At scale, API costs add up fast. A busy RAG system might make thousands of LLM calls per day. At $0.01-0.03 per 1K tokens, that's real money. And as you scale, it gets worse.
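To put rough numbers on it (the traffic figures here are illustrative assumptions, not a benchmark):

```python
# Back-of-envelope API cost, using assumed traffic numbers.
calls_per_day = 5_000
tokens_per_call = 3_000          # prompt + completion
price_per_1k_tokens = 0.02       # USD, mid-range of the $0.01-0.03 figure above

daily = calls_per_day * tokens_per_call / 1_000 * price_per_1k_tokens
print(f"${daily:,.0f}/day, ~${daily * 30:,.0f}/month")   # $300/day, ~$9,000/month
```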
The Confidentiality Problem
Many organisations can't (or shouldn't) send their documents to external APIs:
The Solution: Local Models
Modern open-source models are bloody good. Running locally means:
flowchart LR
subgraph Cloud["Cloud API Approach"]
A1[Your Documents] --> B1[Internet]
B1 --> C1[OpenAI/Anthropic]
C1 --> D1[£££/month]
C1 --> E1[Privacy Concerns]
end
subgraph Local["Local LLM Approach"]
A2[Your Documents] --> B2[Your Server]
B2 --> C2[Local LLM]
C2 --> D2[Fixed Hardware Cost]
C2 --> E2[Data Never Leaves]
end
For embedding (the RAG vector search bit):
For generation (the actual "AI" responses):
For coding tasks:
Running these locally isn't as hard as you'd think. Tools like Ollama, llama.cpp, vLLM, or text-generation-inference make it straightforward. I've built a number of apps (see the bottom of the article) that demonstrate this running on consumer hardware.
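As a rough illustration of how little ceremony is involved, here's a sketch of calling a locally served model through Ollama's REST API. The endpoint is Ollama's default; the model name is whatever you've pulled, and `llama3` here is just an example.

```python
# Minimal call to a local model served by Ollama (default port 11434).
# Assumes you've already run e.g. `ollama pull llama3` - the model name is illustrative.
import requests

def ask_local(prompt: str, model: str = "llama3") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(ask_local("Summarise why document ingestion is the hard part of RAG."))
```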
You don't have to go all-or-nothing. A sensible architecture uses:
- Local models for high-volume, lower-complexity tasks
- Cloud APIs for complex reasoning when needed
This hybrid approach gives you the cost benefits of local inference with the capability of cloud models when you genuinely need it.
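A minimal sketch of that split: route by some notion of task complexity, keep the bulk local, and only pay for the cloud call when it's genuinely needed. The heuristic and the stubbed-out model calls are illustrative assumptions; in practice the router might itself be a small model.

```python
# Hybrid routing sketch: cheap local model for routine work,
# cloud API only for the genuinely hard cases.
def ask_local(prompt: str) -> str:
    return "[local model answer]"     # e.g. an Ollama call, as in the earlier sketch

def ask_cloud(prompt: str) -> str:
    return "[cloud model answer]"     # e.g. an OpenAI/Anthropic SDK call

def looks_complex(task: str) -> bool:
    # Crude heuristic - in practice a small classifier or router model does this.
    return len(task) > 500 or any(w in task.lower() for w in ("analyse", "compare", "draft"))

def handle(task: str) -> str:
    return ask_cloud(task) if looks_complex(task) else ask_local(task)

print(handle("Categorise this support ticket: printer offline again"))
```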
flowchart LR
subgraph Waterfall["❌ The Waterfall AI Project"]
W1[Month 1-3: Requirements] --> W2[Month 4-6: Build Platform]
W2 --> W3[Month 7-9: Integration]
W3 --> W4[Month 10: Demo - Looks Great!]
W4 --> W5[Month 11: Real Users Break It]
W5 --> W6[Month 12: Project Shelved]
end
subgraph Incremental["✅ The Incremental Approach"]
I1[Week 1-2: One Small Problem] --> I2[Week 3-4: Refine + Measure]
I2 --> I3[Week 5-6: Add Capability]
I3 --> I4[Week 7-8: Refine + Measure]
I4 --> I5[Repeat...]
I5 --> I6[Continuous Value Delivery]
end
style W5 stroke:#ff0000,stroke-width:3px
style W6 stroke:#ff0000,stroke-width:3px
style I6 stroke:#00aa00,stroke-width:3px
Each step in the incremental approach delivers measurable value. Each step teaches you something. If something fails, you've lost weeks, not months.
Remember that red box in the diagram? Here's how to actually fix it:
Invest in quality over quantity. It's better to have 1,000 perfectly processed documents than 100,000 poorly processed ones. Start with your most important documents and get them right.
Use AI to help with ingestion. Modern vision-language models (GPT-4V, Claude, LLaVA locally) can actually read complex documents - tables, charts, handwritten notes - in ways that traditional OCR cannot. Use them for the hard documents.
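As a sketch of that idea, here's what handing a scanned page to a local vision-language model via Ollama might look like. The model name (`llava`) and the prompt are assumptions; use whichever multimodal model you actually run.

```python
# Sketch: use a local vision-language model to read a scanned page
# that defeats traditional OCR. Assumes Ollama is running with a
# multimodal model pulled (e.g. `ollama pull llava`) - names are illustrative.
import base64
import requests

def read_page(image_path: str, model: str = "llava") -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": "Extract all text, tables and figures from this page as Markdown.",
            "images": [image_b64],
            "stream": False,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# print(read_page("scanned_contract_page.png"))  # hypothetical file
```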
Build feedback loops. When retrieval fails, log it. When users say "that's not what the document says," capture it. Use this feedback to improve your ingestion pipeline.
Accept that some documents won't work. Not every ancient scanned PDF is worth fighting with. Sometimes the answer is "we'll handle that type manually" rather than spending months on edge cases.
A working AI solution needs:
| Component | What Most Projects Do | What Actually Works |
|---|---|---|
| Data ingestion | Afterthought | Primary focus |
| Data quality | "AI will figure it out" | Dedicated cleaning pipeline |
| Retrieval | Basic vector search | Hybrid search + reranking |
| Generation | Raw LLM output | Structured output with validation |
| Human review | Optional | Integrated into workflow |
| Feedback | None | Continuous improvement loop |
| Monitoring | "It's working" | Detailed metrics & alerting |
| Maintenance | "Version 1 forever" | Regular updates & retraining |
The AI model is maybe 20% of a working system. The other 80% is the boring stuff that actually makes it work in production.
You can't improve what you can't measure. Every AI system should track:
This data tells you where to invest effort. Maybe your retrieval is great but generation is hallucinating. Maybe certain document types always fail. You can't fix what you can't see.
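Measurement doesn't need to start with a fancy observability platform. A structured log line per request, something like the sketch below, is enough to start seeing patterns (the fields are illustrative; capture whatever matters to your system).

```python
# Log one structured record per AI request so you can see where things fail.
import json
import time
import uuid

def log_request(question, retrieved_ids, answer, latency_s, user_feedback=None):
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "question": question,
        "retrieved_ids": retrieved_ids,   # which chunks/documents were used
        "answer_chars": len(answer),
        "latency_s": round(latency_s, 3),
        "user_feedback": user_feedback,   # thumbs up/down, correction text, etc.
    }
    with open("ai_requests.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

log_request("What's our notice period?", ["hr-handbook-007"], "One month.", 1.42)
```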
Here's the most important shift happening right now: the era of "throw everything at GPT-4 and hope for the best" is ending.
The 2023-2024 approach was simple: get the biggest model you can afford, stuff your context window full of everything, and pray. It worked... sort of. For demos. For prototypes. For getting investment.
But it doesn't scale. It's expensive. It's slow. And increasingly, it's being outperformed by smarter architectures.
The future isn't one giant model doing everything. It's multiple specialised models working together, each doing what it's good at.
flowchart TD
subgraph OldWay["The 2023 Approach"]
A1[Everything] --> B1[GPT-4]
B1 --> C1[Hope It Works]
B1 --> D1[£££££]
end
subgraph NewWay["The 2025+ Approach"]
A2[Input] --> B2[Router Model - Small/Fast]
B2 --> C2[Specialist Model A - Extraction]
B2 --> D2[Specialist Model B - Reasoning]
B2 --> E2[Specialist Model C - Generation]
C2 --> F2[Orchestrator]
D2 --> F2
E2 --> F2
F2 --> G2[Output]
end
This is what I've been exploring in my DiSE (Directed Synthetic Evolution) work—the idea that structure beats brilliance. A carefully orchestrated pipeline of smaller, focused models outperforms a single massive model trying to do everything.
The same thinking applies to the Synthetic Decision Engine concept: using multiple LLM backends in sequence, where each model brings different strengths. Fast models for triage, accurate models for validation, creative models for generation. Each doing what it does best.
If you haven't already, check out my RAG series which goes deep on the fundamentals. But here's the key insight: RAG itself is a form of this orchestrated approach. You're using embeddings (one model) to find relevant content, then using an LLM (another model) to synthesise an answer.
The next evolution is taking this further:
Each model is smaller, faster, and cheaper than using GPT-4 for everything. But together, they outperform the monolithic approach.
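A minimal sketch of that orchestration idea: a small, fast router picks a lane and a specialist handles it. The routing rules and the stubbed specialists are illustrative stand-ins for real models.

```python
# Orchestration sketch: a small router picks a lane, specialists do the work.
def route(task: str) -> str:
    # In practice a small/fast model (or even a simple classifier) returns one label.
    if "extract" in task.lower():
        return "extraction"
    if "why" in task.lower() or "explain" in task.lower():
        return "reasoning"
    return "generation"

SPECIALISTS = {
    "extraction": lambda t: "[small extraction model output]",
    "reasoning":  lambda t: "[careful reasoning model output]",
    "generation": lambda t: "[fast generation model output]",
}

def orchestrate(task: str) -> str:
    return SPECIALISTS[route(task)](task)

print(orchestrate("Extract the payment terms from this contract"))
```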
The term "agentic" is borrowed from psychology—my original field before I fell into software. In psychology, agency refers to the capacity to act independently, to make choices and execute them in the world. An agentic person doesn't just respond to stimuli; they initiate action, pursue goals, and adapt their behaviour based on outcomes.
Agentic AI applies this concept to language models. Instead of the traditional pattern—you ask a question, the model generates text—an agentic system can actually do things. It can use tools, execute code, query databases, call APIs, write files, and orchestrate multi-step workflows. It's the difference between asking someone for directions and hiring someone to drive you there.
This is what Anthropic's latest models (including Claude Opus 4.5 that I'm literally using to write this via Claude Code) demonstrate so effectively.
Here's the thing that makes the fine-tuning crowd uncomfortable: if you describe tools well enough, you don't need expensive fine-tuning to use them. Modern foundation models are remarkably good at tool use out of the box—you just need clear function schemas and good documentation in the context.
This is a massive shift from the Toolformer approach (fine-tune a model to learn when to use tools by measuring outcomes). That's expensive, requires specialised training data, and locks you into a specific set of tools. The alternative? Describe your tools clearly, give the model good context, and let it figure out when to use them.
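Here's a minimal sketch of the "just describe the tools" approach using an OpenAI-style function schema. The tool, the model name, and the wiring are illustrative assumptions; most modern chat APIs, and the local servers that mimic them, accept something very similar.

```python
# Describe a tool well and let the model decide when to call it - no fine-tuning.
# Uses the OpenAI SDK's tools parameter; the tool and model name are illustrative.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Fetch the current status of a customer order by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string", "description": "e.g. ORD-12345"}},
            "required": ["order_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Where is order ORD-12345?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)  # the model asks to call lookup_order
```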
The results are often better because:
flowchart TD
subgraph RAG["Traditional RAG Chatbot"]
R1[User Question] --> R2[Search Documents]
R2 --> R3[Construct Prompt]
R3 --> R4[LLM Generates Answer]
R4 --> R5[Return to User]
end
subgraph Agentic["Agentic AI Pattern"]
A1[User Task] --> A2[LLM Understands Task]
A2 --> A3[Break Into Steps]
A3 --> A4{Select Tool}
A4 --> A5[Execute Tool]
A5 --> A6{Evaluate Results}
A6 -->|Need More| A4
A6 -->|Done| A7[Return to User]
end
style A6 stroke:#00aa00,stroke-width:3px
This agentic pattern is fundamentally different from RAG chatbots. And it's where the real value lies.
Frameworks like LangChain, LlamaIndex, and Semantic Kernel make this possible today. You can build systems where:
The companies that figure this out will build AI systems that actually work. The ones still trying to fine-tune their way to success or building yet another RAG chatbot will continue to be disappointed.
I've written extensively about these patterns:
| Topic | Article | What You'll Learn |
|---|---|---|
| Architecture | DiSE vs Voyager | Why structured orchestration beats monolithic models |
| Multi-Model | Synthetic Decision Engines | Building pipelines of specialised models |
| RAG Fundamentals | RAG Series | From embeddings to production systems |
| Local Embeddings | Semantic Search with ONNX | CPU-friendly local vector search |
| Local LLMs | DiSE | A self-evolving workflow system using local models connected together |
| API Simulation | LLMApi | Using local LLMs to simulate APIs for testing |
| Practical RAG | Building a Lawyer GPT | Complete RAG implementation walkthrough |
The technology exists. The patterns are emerging. The question is whether your organisation will build something sensible or another dumb AI project.
The current commercial AI landscape reminds me of the early web era. Everyone needed a "web strategy." Companies built websites because they had to, not because they knew what to do with them. Most of those websites were useless.
Eventually, the companies that succeeded were the ones who figured out what the web was actually good for and built for that. The same will happen with AI.
Right now, we're in the "build it because we have to" phase. Most projects are dumb. Most will fail or underdeliver. That's normal for new technology.
But if you want to be one of the ones that succeeds, stop following the template. Start with real problems. Invest in the boring bits. And for the love of god, fix your document ingestion pipeline before you blame the LLM.
The AI isn't the problem. Your data is. Your processes are. Your unrealistic expectations are.
Fix those first, and maybe, just maybe, your AI project won't be dumb.