Monday, 24 November 2025
My system, DiSE, is a self-optimizing, self-assembling workflow builder grounded in software engineering practice. It generates testable code artifacts, evaluates them objectively, and evolves them over time without constant human supervision. But here's the thing: didn't Voyager - that impressive Minecraft bot from 2023 - already do this by generating reusable skills? And didn't Toolformer already prove that LLMs could learn to use tools by measuring outcomes? If we already have agents that write their own tools and learn when to use them, haven't we already solved the problem of self-improving AI agents?
Spoiler: No. And understanding why matters if you're building anything that needs to run in production.
"This sounds like Voyager but with more steps."
Fair point. If you've been following LLM-driven agent research, you've probably heard of Voyager (Wang et al., 2023) - the system that taught GPT-4 to play Minecraft by generating reusable code "skills" - and Toolformer (Schick et al., 2023) - which taught LLMs to use tools by measuring actual outcomes. Both were impressive, influential, and widely cited.
At first glance, DiSE (Directed Synthetic Evolution) might look like Voyager + Toolformer with a fresh coat of paint. But that's like saying a production database is just a spreadsheet with more steps. The distinction matters if you care about scale, cost, and actually shipping something.
New to DiSE? Check out my elevator pitch for the big picture, then explore the "Cooking with DiSE" series: Part 2 on Graduated Apprenticeships and Part 3 on Untrustworthy LLMs. This post focuses specifically on how DiSE's architecture differs from Voyager and Toolformer.
This post explains what Voyager and Toolformer got right, where they hit limits, and why DiSE's architectural approach solves problems that neither could tackle alone.
Two papers from 2023 changed how we think about LLM agents and tools:
Voyager introduced a crucial insight:
LLMs can generate their own reusable tools.
Instead of hard-coding every action in Minecraft, Voyager used GPT-4 to write functions on the fly. Each successful action became a reusable skill stored in a vector database. The prompt was explicit:
"Your function will be reused for building more complex functions. Therefore, you should make it generic and reusable."
This was important. For the first time, an agent system treated tool generation as real software engineering rather than one-off prompting. Voyager showed that agents could:
- Write code to accomplish tasks in their environment
- Store successful code as named, reusable skills
- Retrieve and compose those skills when facing harder tasks
And it worked. Sort of. In Minecraft. With GPT-4.
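The core of that loop is small. Here's a minimal C# sketch of the pattern - the `Skill` record, the store, and similarity-based retrieval are my illustration, not Voyager's actual implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A skill is code plus an embedding of its description, stored only
// after it succeeded in the environment.
public record Skill(string Name, string Description, string Code, float[] Embedding);

public class SkillLibrary
{
    private readonly List<Skill> _skills = new();
    private readonly Func<float[], float[], float> _similarity; // e.g. cosine

    public SkillLibrary(Func<float[], float[], float> similarity)
        => _similarity = similarity;

    public void Store(Skill skill) => _skills.Add(skill);

    // Future tasks retrieve the nearest skills and feed their code back
    // into the prompt: generated code becomes the agent's long-term memory.
    public IEnumerable<Skill> TopK(float[] taskEmbedding, int k = 5) =>
        _skills.OrderByDescending(s => _similarity(s.Embedding, taskEmbedding))
               .Take(k);
}
```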
Toolformer (Schick et al., 2023) took a different approach. Instead of generating tools, it taught LLMs when to call existing tools.
The clever bit: Toolformer generated its own training data. It would:
- Insert candidate API calls (calculator, search, translation) into text
- Execute those calls and splice in the results
- Keep only the calls whose results actually improved the model's predictions
- Fine-tune on the filtered data
This created models that learned tool use from objective outcomes rather than human annotation. A calculator API that returns the right answer is better than one that doesn't - no LLM judgement needed.
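That filtering step is the whole trick. A minimal sketch of the idea, where the two loss callbacks are hypothetical hooks standing in for the paper's next-token loss measurement:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A candidate tool call sampled into the training text.
public record ToolCall(string Api, string Input, string Result);

public static class ToolformerFilter
{
    // Keep a call only if splicing in its result measurably lowers the
    // model's loss on the tokens that follow. The loss callbacks are
    // hypothetical hooks into your own model.
    public static List<ToolCall> Filter(
        IEnumerable<ToolCall> candidates,
        Func<ToolCall, double> lossWithoutCall,
        Func<ToolCall, double> lossWithCall,
        double minImprovement = 0.1)
        => candidates
            .Where(c => lossWithoutCall(c) - lossWithCall(c) > minImprovement)
            .ToList();
}
```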
DiSE takes insights from both papers but addresses their gaps:
From Voyager: agents can build their own tool libraries - generated code as durable memory.
From Toolformer: judge tool use by objective outcomes, not human labels or LLM opinion.
DiSE's genealogy:
Think of DiSE as the grandchild of ReAct, Reflexion, Toolformer, and Voyager. Each ancestor contributed something crucial:
```mermaid
graph TB
    ReAct[ReAct 2022:<br/>Reason + Act<br/>Step-by-step thinking] --> DiSE
    Reflexion[Reflexion 2023:<br/>Self-critique loops<br/>Try → Reflect → Retry] --> DiSE
    Toolformer[Toolformer 2023:<br/>Learn from outcomes<br/>Self-generated training data] --> DiSE
    Voyager[Voyager 2023:<br/>Generate reusable tools<br/>Code as memory] --> DiSE
    DiSE[DiSE 2024:<br/>Directed Synthetic Evolution]
    DiSE --> G[Generate tools with tests]
    DiSE --> E[Evaluate objectively]
    DiSE --> M[Mutate and improve]
    DiSE --> S[Store with usage stats]
    DiSE --> R[Retrieve and reuse]
    DiSE --> C[Tiered execution]
    style ReAct stroke:#8b5cf6,stroke-width:2px
    style Reflexion stroke:#ec4899,stroke-width:2px
    style Toolformer stroke:#f59e0b,stroke-width:2px
    style Voyager stroke:#ef4444,stroke-width:2px
    style DiSE stroke:#10b981,stroke-width:3px
```
The inheritance: ReAct's structured reasoning, Reflexion's reflect-and-retry loops, Toolformer's outcome-based learning, and Voyager's generated tool libraries.
What DiSE adds: tests generated alongside tools, objective evaluation, mutation and improvement pipelines, storage with usage stats, retrieval and reuse, and tiered execution.
Toolformer proved you could train models to use tools by measuring real outcomes. Voyager proved agents could build their own tool libraries. Reflexion proved reflection loops work. ReAct proved structured reasoning helps.
DiSE asks: What if we combined all of these insights, but made it cheap enough to run in production and smart enough to improve itself over time?
And yes, there's a dash of Roy Fielding's REST architecture in there too - tools are stateful resources with URIs, metadata, and versioning. Each tool is addressable, cacheable, and can be composed with others. The tool registry isn't just a vector database; it's a RESTful store where resources (tools) have state (usage stats, performance metrics, version history) that influences how they're retrieved and evolved.
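To make that concrete, here's roughly what a tool-as-resource might carry. The field names are my sketch, not DiSE's actual schema:

```csharp
using System;

// A tool as a REST-ish resource: addressable, versioned, and carrying
// the state that drives retrieval and evolution.
public record ToolResource(
    Uri Id,                  // e.g. /tools/text/slugify/v3
    int Version,
    string Code,
    string[] Tags,           // domain tags for filtered retrieval
    int TimesRetrieved,      // usage stats feed ranking...
    int TimesSucceeded,      // ...and selection pressure
    double AvgLatencyMs,     // performance metrics pick mutation targets
    Uri? EvolvedFrom);       // lineage: which version this mutated from
```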
That's not training. That's not one-off generation. That's directed evolution.
```mermaid
graph TB
    subgraph Voyager["Voyager (2023)"]
        A[GPT-4] --> B[Generate Code]
        B --> C[Execute in Minecraft]
        C --> D{Success?}
        D -->|Yes| E[Store with embedding]
        D -->|No| F[GPT-4 suggests fix]
        F --> B
        E --> G[Vector DB]
        G --> H[Future retrieval]
        H --> A
    end
    style A stroke:#ef4444,stroke-width:3px
    style F stroke:#ef4444,stroke-width:3px
    note1[All reasoning flows through GPT-4]
    note1 -.-> A
```
See the problem? Every decision - planning, evaluation, debugging, composition - goes through the same frontier model. When you need brilliance at every step, you need GPT-4 (or equivalent) everywhere.
Voyager wasn't just using GPT-4 for code generation. It relied on GPT-4 for:
- Curriculum planning: deciding what to attempt next
- Success evaluation: judging whether an action worked
- Debugging: reading errors and proposing fixes
- Composition: combining skills into larger plans
When the model does everything, it needs to be frontier-level brilliant every time.
That's precisely why Voyager didn't scale beyond Minecraft. And it's why running it in production would cost a fortune.
Let's do some napkin math. Say you want to run a Voyager-style system for real-world tasks:
- Every task needs a planning call, a generation call, an evaluation call, and usually at least one debugging call - all against a frontier model
- At roughly $0.05 per call, that's about $0.20 per task before any retries
- A moderate skill library means hundreds of tasks, many needing several attempts
Now add retrieval queries, composition attempts, debugging cycles. You're easily looking at hundreds of dollars for a single agent to build a moderate skill library.
For comparison, here's what a tiered approach might cost:
| Task | Voyager (GPT-4) | DiSE (Tiered) | Savings |
|---|---|---|---|
| Triage/routing | $0.05 | $0.001 (Llama 3.1 8B) | 98% |
| Initial generation | $0.05 | $0.002 (Qwen 2.5 Coder) | 96% |
| Evaluation | $0.05 | $0 (static tests) | 100% |
| Escalation (10%) | $0.05 | $0.05 (GPT-4o) | 0% |
| Average per task | $0.20 | $0.016 | 92% |
When you only call the expensive model for genuinely hard problems, the economics change dramatically.
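To see why, put the napkin math in code. The prices here are illustrative assumptions, not measured rates:

```csharp
using System;

// Illustrative assumptions, not measured rates.
const double cheapCallCost = 0.003;  // triage + generation on small models
const double strongCallCost = 0.05;  // one frontier-model call
const double escalationRate = 0.10;  // share of tasks the cheap tier fails

// Every task pays the cheap tier; only escalations also pay the strong
// tier. The failed cheap attempt isn't wasted - its failing tests tell
// the strong model exactly what to fix.
double tiered = cheapCallCost + escalationRate * strongCallCost;

// A single-model system pays the frontier rate at every step:
// plan, generate, judge, debug.
double singleModel = 4 * strongCallCost;

Console.WriteLine($"Tiered: ${tiered:F3}/task vs single-model: ${singleModel:F2}/task");
// Tiered: $0.008/task vs single-model: $0.20/task
```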
Voyager, Toolformer, and Reflexion were not failures. They simply hit the technical and economic ceiling of their moment.
Three constraints blocked progress in 2023:
| Limitation (2023) | Consequence | What Changed (2024–2025) |
|---|---|---|
| GPT-4 was the only reliable reasoning engine | One model had to do everything | Tiered model stacks — GPT-4o + Qwen + DeepSeek + Llama |
| Local embeddings were weak or unused | Retrieval was expensive | BGE / Arctic / MiniLM on consumer GPUs |
| No structured evaluation harness | LLM judged LLM | Containerized execution lets tests, not LLMs, judge output |
So Voyager and Toolformer couldn't evolve.
They could only call models or fine-tune models. They couldn't distribute cognition across a system. They couldn't apply selection pressure.
They solved the "can we?" question.
DiSE solves the "how do we keep improving?" question.
Not because I'm smarter than Wang or Schick — but because the hardware, tooling, and economics finally caught up.
This is an engineering problem. And engineering is always a question of timing.
The key idea behind DiSE is simple but profound:
Structure replaces brilliance.
Instead of expecting one model to do everything perfectly, DiSE distributes the problem across an orchestration system. It doesn't assume models "know how to code". Instead, it creates pressure, feedback, and memory so models can improve code iteratively rather than generate it perfectly on the first try.
```mermaid
graph TB
    subgraph DiSE["DiSE (Distributed System)"]
        A[Triage Agent] -->|Easy| B[Fast Model - Qwen/Llama]
        A -->|Hard| C[Strong Model - GPT-4o]
        B --> D[Test Suite]
        C --> D
        D -->|Pass| E[Structured Registry]
        D -->|Fail| F{Worth escalating?}
        F -->|Yes| G[Optimizer Agent]
        F -->|No| H[Mark as failed]
        G --> I[Mutation Pipeline]
        I --> D
        E --> J[RAG with usage stats]
        J --> K[Clustering & reranking]
        K --> A
    end
    style D stroke:#10b981,stroke-width:3px
    style E stroke:#3b82f6,stroke-width:3px
    style K stroke:#8b5cf6,stroke-width:3px
    note2[Most tasks never hit expensive models]
    note2 -.-> B
```
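Here's a compressed sketch of that loop in C#. Everything in it is a hypothetical stand-in - delegates for the models, a delegate for the test harness. The shape is the point, not the names:

```csharp
using System;

public record TestResult(bool Passed, string Feedback);

public class TieredPipeline
{
    // Hypothetical stand-ins, wired up at construction time.
    public required Func<string, bool> IsHard { get; init; }         // triage
    public required Func<string, string> FastModel { get; init; }    // cheap LLM
    public required Func<string, string> StrongModel { get; init; }  // frontier LLM
    public required Func<string, TestResult> RunTests { get; init; } // non-LLM judge

    public string? Run(string task)
    {
        // 1. Triage routes by estimated difficulty.
        var generate = IsHard(task) ? StrongModel : FastModel;

        // 2. Generate, then let tests - not an LLM - decide.
        var code = generate(task);
        var result = RunTests(code);
        if (result.Passed) return code; // store in registry with stats

        // 3. Failure is data: escalate once, carrying the test feedback.
        if (generate == FastModel)
        {
            var retry = StrongModel($"{task}\nFailing tests:\n{result.Feedback}");
            if (RunTests(retry).Passed) return retry;
        }
        return null; // mark as failed; the failure record is itself useful
    }
}
```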
The difference is stark:
| Capability | Voyager | DiSE |
|---|---|---|
| Code generation | GPT-4 | Tiered multi-LLM pipeline |
| Evaluation | GPT-4 judgement | Non-LLM tests, metrics |
| Tool reuse | Embeddings only | Registry with metadata + tags + usage stats |
| Improvement | GPT-4 suggests fixes | Escalation & mutation pipelines |
| Memory | Vector DB | RAG + clustering + evolution tracking |
| Composition | GPT-4 plans | Workflow graph planning |
| Cost control | None | Prioritisation + fallback logic |
This works because every unit of generated code is treated as an artifact, not a prompt response.
It's not judged by "does this look good?" - it's judged by:
- Does it compile?
- Do the tests pass?
- How does it perform on real metrics?
- Has it succeeded before, and how often?
A cheap LLM (Qwen 2.5 Coder 7B, Llama 3.1 8B) can generate five variants. Tests select the best. A stronger model only gets involved when weaker ones fail.
Over time, the system learns which models are good for which domains. Like I did with my semantic search implementation - you don't need the biggest model when you have the right architecture.
The refinement process becomes evolutionary, not autoregressive.
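A sketch of what "evolutionary, not autoregressive" means: instead of asking one model to emit a perfect answer token by token, generate a population, score it with tests, and mutate the survivors. All the names here are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EvolutionLoop
{
    public static string? Evolve(
        Func<string, IEnumerable<string>> generateVariants, // cheap model
        Func<string, double> fitness,   // test pass rate, perf metrics
        Func<string, string> mutate,    // targeted rewrite of a survivor
        string task, int generations = 3, double passingScore = 1.0)
    {
        var population = generateVariants(task).ToList();
        for (int g = 0; g < generations; g++)
        {
            // Selection: objective scores, no LLM judgement.
            var ranked = population.OrderByDescending(fitness).ToList();
            if (fitness(ranked[0]) >= passingScore) return ranked[0];

            // Mutation: keep the top half, rewrite them for the next round.
            population = ranked.Take(Math.Max(1, ranked.Count / 2))
                               .Select(mutate)
                               .ToList();
        }
        return null; // no variant passed: escalate to a stronger model
    }
}
```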
Let's make this concrete. Say you want your agent to implement efficient sorting.
Voyager approach: GPT-4 writes a sort function, GPT-4 judges whether it looks right, GPT-4 proposes fixes when it doesn't. Every step is a frontier-model call.
DiSE approach: a cheap coder model generates five candidate implementations; a test harness checks correctness and benchmarks speed; the winner goes into the registry with its stats.
The cheap model can explore the solution space. The tests do the selection. The expensive model only gets involved if all 5 variants fail.
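Here's what "the tests do the selection" looks like in practice. The harness below takes candidate sort functions (stand-ins for LLM output) and picks a winner on purely objective criteria:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class VariantSelector
{
    // Returns the fastest candidate that passes every objective check.
    public static Func<int[], int[]>? SelectBest(
        IReadOnlyList<Func<int[], int[]>> variants, int[][] testCases)
    {
        Func<int[], int[]>? best = null;
        long bestTicks = long.MaxValue;

        foreach (var variant in variants)
        {
            var sw = Stopwatch.StartNew();
            bool correct = testCases.All(input =>
            {
                var output = variant((int[])input.Clone());
                // Objective criteria: output is non-decreasing and is a
                // permutation of the input. No LLM judgement anywhere.
                return output.Zip(output.Skip(1), (a, b) => a <= b).All(ok => ok)
                    && output.OrderBy(x => x).SequenceEqual(input.OrderBy(x => x));
            });
            sw.Stop();

            // Correctness gates entry; speed breaks ties.
            if (correct && sw.ElapsedTicks < bestTicks)
            {
                best = variant;
                bestTicks = sw.ElapsedTicks;
            }
        }
        return best; // null means all five failed: time to escalate
    }
}
```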
I've been experimenting with similar principles in my blog's RAG system and semantic intelligence features. The pattern is consistent:
Use the biggest model for architecture decisions, not execution.
When I built the broken link handler with semantic search fallback, I didn't use GPT-4 to check every link. The system:
- Catches the 404
- Embeds the requested slug with a small local model
- Finds the closest existing post by similarity
- Redirects if the match is close enough, and serves a normal 404 if not
The expensive intelligence is in the design, not the execution.
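For flavour, here's roughly the shape of that fallback. The embedding callback is a hypothetical wrapper around a small local model such as MiniLM; none of this is the handler's literal code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class BrokenLinkFallback
{
    private readonly Dictionary<string, float[]> _postEmbeddings; // slug -> vector
    private readonly Func<string, float[]> _embed; // small local model, e.g. MiniLM

    public BrokenLinkFallback(Dictionary<string, float[]> postEmbeddings,
                              Func<string, float[]> embed)
        => (_postEmbeddings, _embed) = (postEmbeddings, embed);

    // On a 404: embed the requested slug, redirect to the nearest post if
    // it's similar enough, otherwise fall through to a normal 404.
    public string? SuggestRedirect(string missingSlug, float threshold = 0.75f)
    {
        var query = _embed(missingSlug.Replace('-', ' '));
        var (slug, score) = _postEmbeddings
            .Select(kv => (kv.Key, Cosine(kv.Value, query)))
            .MaxBy(pair => pair.Item2);
        return score >= threshold ? slug : null;
    }

    private static float Cosine(float[] a, float[] b)
    {
        float dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (MathF.Sqrt(na) * MathF.Sqrt(nb) + 1e-8f);
    }
}
```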
Each system asks a different question:
Toolformer asks:
Can an LLM learn when to use tools by measuring which tool calls actually help?
Voyager asks:
Can an LLM generate reusable code that helps in future tasks?
DiSE asks:
Can a system generate tools, evaluate them objectively, evolve them based on real outcomes, and do all this efficiently enough to run in production?
Toolformer proved outcome-based learning works. Voyager proved generated tools can be reused. DiSE proves tools can evolve continuously without breaking the bank.
Toolformer is training time. Voyager is flash. DiSE is infrastructure.
One shows you can learn from outcomes. One shows you can generate tools. The other builds machinery for continuous evolution.
If you're building LLM-powered systems today, the DiSE approach suggests:
Don't generate code and hope it works. Write tests that define "works". Like I do for my Entity Framework migrations - the tests run, or they don't. No LLM judgement needed.
Route simple tasks to cheap models. Reserve expensive models for genuinely hard problems. Your wallet will thank you.
Store not just the code, but:
- Which model generated it
- How many attempts it took
- Test pass rates and performance numbers
- How often it gets reused
This metadata becomes training data for your routing logic.
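A sketch of the kind of record worth keeping per artifact (my field names, not a prescribed schema):

```csharp
using System;

// None of this needs an LLM to produce, and all of it feeds the router:
// if Qwen keeps passing tests in a domain, send more of that domain to Qwen.
public record ArtifactMetadata(
    string GeneratedBy,        // which model produced it
    int AttemptsNeeded,        // tries before the tests passed
    double TestPassRate,       // across the full suite
    double AvgRuntimeMs,       // performance on benchmark inputs
    int TimesReused,           // retrieval count in later tasks
    DateTimeOffset LastUsed);  // staleness signal for pruning
```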
Don't expect perfection on the first try. Generate variants, test them, keep the winners, mutate the good-but-not-perfect ones.
This is how I approach development on this blog - iterate in public, learn from real traffic, improve based on actual usage.
The difference between Voyager and DiSE isn't just about cost. It's about what happens when things fail.
In Voyager, failure means: another GPT-4 call to suggest a fix, and another, with no durable record of what failed or why.
In DiSE, failure triggers: a recorded failure with its test output, a decision about whether escalation is worth the cost, and a mutation attempt if the variant was close.
The system gets smarter about what it doesn't know.
I don't think we're going to see pure "single model does everything" systems win in production. The economics don't work, and the failure modes are too opaque.
Instead, we'll see hybrid systems like DiSE: cheap models doing the bulk of the work, expensive models reserved for genuinely hard problems, and non-LLM evaluation wherever an objective test exists.
We're seeing this pattern everywhere: model routing and cascades in production stacks, small local models handling triage, retrieval doing work that used to demand a frontier model's parametric memory.
The magic isn't in having one brilliant model. It's in having the right model at the right time with the right context.
Voyager was a beacon. It showed that agents could learn by generating code.
The next step wasn't more prompting. That approach is reaching its limits.
The next step is directed synthetic evolution: generate candidates, test them objectively, keep the winners, and mutate the near-misses.
DiSE isn't trying to be smarter in a single shot. It's trying to be better on the next shot.
That difference is where evolution begins.
And unlike biological evolution, we don't have to wait millions of years to see results.
Papers:
- Wang et al. (2023), "Voyager: An Open-Ended Embodied Agent with Large Language Models", arXiv:2305.16291
- Schick et al. (2023), "Toolformer: Language Models Can Teach Themselves to Use Tools", arXiv:2302.04761
- Yao et al. (2022), "ReAct: Synergizing Reasoning and Acting in Language Models", arXiv:2210.03629
- Shinn et al. (2023), "Reflexion: Language Agents with Verbal Reinforcement Learning", arXiv:2303.11366
On this blog:
Tools: