Wednesday, 19 November 2025
Why I write tests last, and why that's not what you think it means
Controversial take: Test-Driven Development is a brilliant practice that solves the wrong problem for most creative work. Here's what works better.
This is how I build software. It's simple, it's effective, and it makes TDD purists politely furrow their brows:
Make it work. Make it pretty. Lock it down with tests.
In that order. Always.
Not because I'm lazy. Not because I don't value testing. But because this is how creative work actually happens when you're exploring a problem space you don't fully understand yet.
graph TB
Start[New Feature Request] --> P1[Phase 1: Make It Work]
P1 --> P1_Private[Private Exploration]
P1_Private --> P1_Sketch["Sketch the solution<br/>Messy code OK<br/>No tests yet<br/>Focus on learning"]
P1_Sketch --> P1_Run["Run it manually<br/>Try different inputs<br/>See what breaks<br/>Understand the problem"]
P1_Run --> P1_Decision{Does it work<br/>basically?}
P1_Decision -->|No| P1_Iterate[Try different approach]
P1_Iterate --> P1_Sketch
P1_Decision -->|Yes| P2
P2[Phase 2: Make It Pretty] --> P2_Refine[Private Refinement]
P2_Refine --> P2_Clean["Clean up the code<br/>Better names<br/>Type hints/annotations<br/>Clear structure"]
P2_Clean --> P2_Align["Align with spec<br/>Cover all requirements<br/>Handle edge cases<br/>Polish API surface"]
P2_Align --> P2_Optimise["Optimise<br/>Performance<br/>Readability<br/>Maintainability"]
P2_Optimise --> P3
P3[Phase 3: Lock It Down] --> P3_Share[Public Sharing]
P3_Share --> P3_Tests["Write comprehensive tests<br/>Full coverage<br/>Edge cases<br/>Error conditions"]
P3_Tests --> P3_CI["CI/CD Integration<br/>Automated testing<br/>Code review<br/>Documentation"]
P3_CI --> P3_Deploy["Ready to Share<br/>Commit to main<br/>Deploy to prod<br/>Other devs can use it"]
P3_Deploy --> Done[Well-Tested,<br/>Well-Designed Code]
style P1 stroke:#ff6b6b,stroke-width:3px
style P2 stroke:#4ecdc4,stroke-width:3px
style P3 stroke:#95e1d3,stroke-width:3px
style Done stroke:#38ada9,stroke-width:4px
Let me explain.
Phase 1 is the messy part. The part where I have no idea what I'm doing.
I'm trying to understand what the problem actually is, what the data really looks like, and which approach might even work.
This is invention. This is exploration. This is jazz.
And you know what kills jazz? Someone standing over your shoulder saying "write the test first."
Because the shape is still fluid.
Imagine you're a sculptor. You have a block of marble and a vague idea: "I want to carve a bird."
TDD says: "Before you touch that marble, write down exactly what a bird looks like. Define its wingspan. Specify the angle of each feather. Now carve to those specifications."
That's... not how birds get carved. (Or software gets written, as it turns out.)
Real sculptors rough out the form first. They sketch. They make maquettes. They chip away big chunks of stone to find the general shape. Only later do they refine the details.
Code is the same.
In Phase 1, I'm writing:
# Ugly, exploratory code
def process_thing(data):
    # TODO: This is terrible, fix later
    result = []
    for item in data:
        # Not sure if this is the right approach...
        x = item.get('value')
        if x:  # Is this even the right condition?
            result.append(x * 2)  # Why multiply by 2? Testing something...
    return result
This code is private. It's a conversation with myself. It's how I figure out what "working" even means.
Writing tests for this is worse than useless—it's actively harmful. Because every time I realise "oh wait, this whole approach is wrong," I'd have to rewrite the tests too.
That's not rigor. That's bureaucracy.
Phase 1 is about learning velocity.
I need to try five different approaches in an hour. I need to discover that the third-party library I thought I'd use... isn't quite as advertised. I need to realise the problem I'm solving isn't the problem I should be solving.
Tests slow this down. Not because testing is slow, but because premature specification is death.
If I write tests before I understand the problem, I'm writing tests for the wrong thing. Then I'm emotionally invested in the wrong thing. Then I'm defending the wrong thing in code review.
Better to fail fast, in private, with no tests to maintain and no ego to protect.
So what does "it works" mean at the end of Phase 1? It means: "I ran it. I gave it some inputs. It produced outputs that seem reasonable. I'm 60% confident I understand the problem now."
That's it.
No edge cases. No error handling. No elegance.
Just a spike. A proof of concept. A sketch.
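To make that concrete: at the end of Phase 1, "I ran it" is literally just poking the spike by hand and eyeballing the output. Something like this throwaway scratch script, reusing the process_thing sketch from above:

# scratch.py -- run once by hand, never committed, never turned into tests (yet)
def process_thing(data):
    result = []
    for item in data:
        x = item.get('value')
        if x:
            result.append(x * 2)
    return result

print(process_thing([{'value': 5}, {'value': 10}]))    # [10, 20] -- looks right
print(process_thing([{'name': 'foo'}, {'value': 0}]))  # [] -- hmm, should zero vanish?
print(process_thing([]))                               # [] -- fine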
Okay, so now I understand the problem. I have something that works, at least for the happy path.
Now I can start caring about quality.
Phase 2 is where I clean up the structure, choose better names, add type hints, handle the edge cases, and make sure the code actually matches the spec.
Python Example:
# Before (Phase 1) - Exploratory code
def process_thing(data):
    result = []
    for item in data:
        x = item.get('value')
        if x:
            result.append(x * 2)
    return result

# After (Phase 2) - Refined code
def extract_doubled_values(items: list[dict]) -> list[int]:
    """Extract 'value' field from items and double each one.

    Only includes items where 'value' is present and truthy.
    """
    return [
        item['value'] * 2
        for item in items
        if item.get('value')
    ]
C# Example:
// Before (Phase 1) - Exploratory code
public List<int> ProcessThing(List<Dictionary<string, object>> data)
{
    var result = new List<int>();
    foreach (var item in data)
    {
        if (item.ContainsKey("value"))
        {
            var x = item["value"];
            if (x != null)
            {
                result.Add(Convert.ToInt32(x) * 2);
            }
        }
    }
    return result;
}

// After (Phase 2) - Refined code
/// <summary>
/// Extracts 'value' field from items and doubles each one.
/// Only includes items where 'value' is present and non-null.
/// </summary>
public IEnumerable<int> ExtractDoubledValues(IEnumerable<IDictionary<string, object>> items)
{
    return items
        .Where(item => item.ContainsKey("value") && item["value"] != null)
        .Select(item => Convert.ToInt32(item["value"]) * 2);
}
Better name. Type annotations. XML documentation. Cleaner LINQ implementation.
Now that I know what I'm actually building, I go back to the original requirements and make sure I'm solving the right problem, not just a problem.
Often this reveals gaps: requirements I skimmed past, edge cases the spec never mentioned, behaviour that only makes sense now that I've seen real inputs.
I fix these. Sometimes I push back on the spec: "Actually, we should do X instead because Y."
This is where I think about the caller's experience.
Python:
# Phase 1: Works, but awkward to use
process_thing(data)

# Phase 2: Clear, flexible, composable
extract_doubled_values(
    items,
    scale_factor=2.0,
    include_zeros=False
)
C#:
// Phase 1: Works, but awkward to use
ProcessThing(data);

// Phase 2: Clear, flexible, composable
ExtractDoubledValues(
    items,
    scaleFactor: 2.0,
    includeZeros: false
);

// Or even better with a fluent API
items.ExtractValues()
    .Scale(by: 2.0)
    .ExcludingZeros()
    .ToList();
Better naming. Sensible defaults. Configuration where it matters. Fluent APIs where appropriate.
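The extract_doubled_values shown earlier only covers the default case; the call above (and the Phase 3 tests later) assume scale_factor and include_zeros knobs. Here's a minimal sketch of what that extended signature might look like — my reconstruction, not the code from the original post:

def extract_doubled_values(
    items: list[dict],
    scale_factor: float = 2.0,
    include_zeros: bool = False,
) -> list[float]:
    """Extract 'value' from each item and scale it (doubled by default).

    Missing and None values are always skipped; zeros are skipped unless
    include_zeros is True. A negative scale_factor is rejected.
    """
    if scale_factor < 0:
        raise ValueError("scale_factor must be non-negative")
    return [
        item['value'] * scale_factor
        for item in items
        if 'value' in item
        and item['value'] is not None
        and (include_zeros or item['value'] != 0)
    ]

As a side effect, non-numeric values blow up with a TypeError when multiplied by the float scale factor, which is exactly what the Phase 3 tests below expect.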
This is craft. This is where good software comes from.
graph LR
subgraph "Phase 2 Refinement Process"
A[Rough Code] --> B[Clean Structure]
B --> C[Add Types]
C --> D[Polish API]
D --> E[Static Analysis]
E --> F{Quality Gates Pass?}
F -->|No| G[Fix Issues]
G --> B
F -->|Yes| H[Ready for Phase 3]
end
style A stroke:#ff6b6b,stroke-width:3px
style H stroke:#95e1d3,stroke-width:3px
Am I testing nothing at all during Phase 2? Oh, I'm running it constantly. I have a REPL open. I'm trying different inputs. I'm exercising all the paths.
But I'm not writing those experiments down as formal tests yet.
Why? Because I'm still discovering what's worth testing.
In Phase 1, I didn't know if this function would even exist. In Phase 2, I'm discovering which edge cases actually matter, what the API surface should look like, and which behaviours are worth promising.
These insights emerge from refactoring. They're not obvious upfront.
If I'd written tests in Phase 1, I'd be rewriting them now. If I write them after I've refined the implementation, they'll be correct the first time.
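In practice those Phase 2 experiments are just a scratch block at the bottom of the file (or pasted into the REPL), deleted before anything is shared. A rough sketch, assuming the extract_doubled_values from above is in scope:

# Phase 2 scratch checks -- experiments, not tests. Deleted before Phase 3.
# Quick asserts for behaviour I've already settled on...
assert extract_doubled_values([{'value': 5}]) == [10]
assert extract_doubled_values([]) == []

# ...and prints for behaviour I'm still deciding on.
print(extract_doubled_values([{'value': 0}, {'value': 3}]))  # should zeros appear?
print(extract_doubled_values([{'name': 'no value here'}]))   # skip silently, or raise?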
Then comes Phase 3. Okay, now it's real.
The code is clean. The API makes sense. I've covered the edge cases. I'm confident in what I've built.
Now I write tests. Lots of tests. As-close-to-full-coverage-as-possible tests.
Because tests are a sharing mechanism.
They're not for me anymore. I already understand this code intimately. I've lived in it for hours or days.
Tests are for my teammates, for future me, and for anyone who has to change this code without breaking it.
Tests are the presentation layer for my implementation.
They document what the code does, how it's meant to be called, and which edge cases are handled.
In Phase 3, I go hard on testing:
import pytest

def test_extract_doubled_values_basic():
    items = [{'value': 5}, {'value': 10}]
    assert extract_doubled_values(items) == [10, 20]

def test_extract_doubled_values_skips_missing_values():
    items = [{'value': 5}, {'name': 'foo'}, {'value': 3}]
    assert extract_doubled_values(items) == [10, 6]

def test_extract_doubled_values_handles_zeros():
    items = [{'value': 0}, {'value': 5}]
    # By default, zeros are excluded (falsy)
    assert extract_doubled_values(items) == [10]

def test_extract_doubled_values_includes_zeros_when_configured():
    items = [{'value': 0}, {'value': 5}]
    assert extract_doubled_values(items, include_zeros=True) == [0, 10]

def test_extract_doubled_values_custom_scale_factor():
    items = [{'value': 5}]
    assert extract_doubled_values(items, scale_factor=3.0) == [15.0]

def test_extract_doubled_values_rejects_negative_scale_factor():
    with pytest.raises(ValueError):
        extract_doubled_values([], scale_factor=-1.0)

def test_extract_doubled_values_handles_empty_list():
    assert extract_doubled_values([]) == []

def test_extract_doubled_values_handles_non_numeric_values():
    items = [{'value': 'not a number'}]
    with pytest.raises(TypeError):
        extract_doubled_values(items)
This is the safety net.
Now my teammates can read the tests to see how the code behaves, change it without fear, and trust that a regression will show up immediately.
Good tests are better documentation than comments:
# This comment will drift from reality:
# Doubles all numeric values, skipping None

# This test will fail if reality changes:
def test_extract_doubled_values_skips_none_values():
    items = [{'value': None}, {'value': 5}]
    assert extract_doubled_values(items) == [10]
Tests don't lie. They can't drift. If the behaviour changes, the test breaks.
That's executable documentation. That's what makes tests valuable.
Here's the crucial point: Phase 3 happens before anyone else touches the code.
I don't share code without tests. Ever.
But I also don't write tests before the code is ready to share.
The phases are: make it work (private), make it pretty (still private), lock it down with tests (public).
This protects both my creativity and my teammates' sanity.
TDD advocates say "tests give you confidence to refactor."
True! But only if you already know what you're building.
When I'm in Phase 1, I don't need confidence to refactor. I need permission to throw everything away.
Tests create psychological debt. Once I've written them, I'm reluctant to delete them. Even if they're testing the wrong thing.
By deferring tests until Phase 3, I keep Phase 1 and 2 psychologically cheap.
I can throw away an approach, rewrite from scratch, or change direction entirely.
All without the guilt of "but I wrote all those tests..."
This is how creativity works.
You sketch loosely first. You don't start with the details.
Let's be clear about something: this is not "test later development."
This is "test while developing, but at the right time."
The timing of when to add unit tests is critical. Too early and you're specifying the unknown. Too late and you're shipping without safety nets.
Extreme Programming has brilliant practices worth following:
- Continuous Integration
- Pair Programming / Code Review
- Simple Design
- Refactoring
graph TB
subgraph "Traditional TDD (Red-Green-Refactor)"
TDD1[Write Failing Test] --> TDD2[Write Minimal Code to Pass]
TDD2 --> TDD3[Refactor]
TDD3 --> TDD1
end
subgraph "Three-Phase Approach (Explore-Refine-Lock)"
P1[Explore Solution Space] --> P2[Refine Implementation]
P2 --> P3[Write Comprehensive Tests]
P3 --> P4[Share With Team]
end
style TDD1 stroke:#ffaa00,stroke-width:2px
style TDD2 stroke:#ffaa00,stroke-width:2px
style TDD3 stroke:#ffaa00,stroke-width:2px
style P1 stroke:#ff6b6b,stroke-width:3px
style P2 stroke:#4ecdc4,stroke-width:3px
style P3 stroke:#95e1d3,stroke-width:3px
TDD Says: "Tests first. Always. Red-Green-Refactor is the only way."
I Say: "Tests when you know what you're testing. Explore-Refine-Lock is more natural."
The key difference:
| TDD | Three-Phase |
|---|---|
| Test defines the interface | Exploration defines the interface |
| Red-Green-Refactor cycle | Explore-Refine-Lock progression |
| Tests written before implementation | Tests written before sharing |
| Test first for everything | Test at the right time for each thing |
| Micro-cycles (minutes) | Macro-phases (hours/days) |
Here's the crucial insight: It's not "tests last." It's "tests before sharing."
Wrong timing (too early):
Hour 1: Write test for API I haven't designed yet
Hour 2: Realize the API is wrong, rewrite test
Hour 3: Discover a better approach, rewrite both
Hour 4: Finally get it working
Hour 5: Rewrite tests to match what actually works
Right timing (Phase 3):
Hour 1: Explore 3 different APIs, settle on the best one
Hour 2: Refine the implementation, handle edge cases
Hour 3: Write comprehensive tests for what I built
Hour 4: All tests pass, commit with confidence
Same quality outcome. Half the churn.
Agile says: "Respond to change over following a plan."
TDD can become its own form of plan-following. Once you've written tests, you're psychologically committed to that design.
My approach embraces change: in Phases 1 and 2 there's nothing to rewrite when the design shifts, and tests only arrive once the design has settled.
This is more agile than strict TDD, because it defers commitment until you have information.
Strict TDD Approach:
// Test 1: Write test first (but I don't know the right API yet)
[Test]
public void TestProcessData()
{
    var processor = new DataProcessor();
    var result = processor.Process(new[] { 1, 2, 3 });
    Assert.That(result, Is.EqualTo(new[] { 2, 4, 6 }));
}

// Implementation 1: Make it pass
public class DataProcessor
{
    public int[] Process(int[] data) => data.Select(x => x * 2).ToArray();
}

// Later: Realize I need configurability, rewrite test
[Test]
public void TestProcessDataWithFactor()
{
    var processor = new DataProcessor();
    var result = processor.Process(new[] { 1, 2, 3 }, factor: 3);
    Assert.That(result, Is.EqualTo(new[] { 3, 6, 9 }));
}

// Later: Realize I should use IEnumerable, rewrite test again
[Test]
public void TestProcessDataEnumerable()
{
    var processor = new DataProcessor();
    var data = GetLargeDataset(); // This should be lazy
    var result = processor.Process(data, factor: 3);
    Assert.That(result.Take(3), Is.EqualTo(new[] { 3, 6, 9 }));
}
Notice the churn? Three versions of the same test as the design evolved.
Three-Phase Approach:
// Phase 1: Explore (no tests yet)
public int[] ProcessThing(int[] data)
{
    return data.Select(x => x * 2).ToArray();
}
// Try it: var result = ProcessThing(new[] { 1, 2, 3 });
// Hmm, what about configurability? What about lazy evaluation?

// Phase 2: Refine (still no tests)
public static class DataProcessorExtensions
{
    public static IEnumerable<int> Scale(
        this IEnumerable<int> source,
        int factor = 2)
    {
        return source.Select(x => x * factor);
    }
}
// Try it: var result = data.Scale(factor: 3).ToList();
// Much better API!

// Phase 3: Lock it down (NOW write tests, correctly)
[TestFixture]
public class DataProcessorTests
{
    [Test]
    public void Scale_DefaultFactor_DoublesValues()
    {
        var result = new[] { 1, 2, 3 }.Scale();
        Assert.That(result, Is.EqualTo(new[] { 2, 4, 6 }));
    }

    [Test]
    public void Scale_CustomFactor_ScalesCorrectly()
    {
        var result = new[] { 1, 2, 3 }.Scale(factor: 3);
        Assert.That(result, Is.EqualTo(new[] { 3, 6, 9 }));
    }

    [Test]
    public void Scale_LazyEvaluation_DoesNotEnumerateImmediately()
    {
        var enumerated = false;
        var data = GetTestData(() => enumerated = true);
        var scaled = data.Scale(factor: 2);
        Assert.That(enumerated, Is.False, "Should not enumerate yet");
        scaled.ToList();
        Assert.That(enumerated, Is.True, "Should enumerate on materialization");
    }

    [Test]
    public void Scale_EmptySequence_ReturnsEmpty()
    {
        var result = Enumerable.Empty<int>().Scale();
        Assert.That(result, Is.Empty);
    }
}
One set of tests. Written once. Correct the first time. Because I knew what I was building.
Remember the values:
"Working software over comprehensive documentation"
"Responding to change over following a plan"
"Individuals and interactions over processes and tools"
"Customer collaboration over contract negotiation"
TDD's dirty secret: most tests written before implementation are rewritten or deleted.
Why? Because the API changes, the approach changes, and sometimes the whole problem changes once you actually start building.
Every test you write in Phase 1 is probably wasted effort.
Better to write one set of tests in Phase 3, when you actually know what you're building, than to write-rewrite-rewrite during exploration.
TDD promises: "Tests drive your design! They help you discover good APIs!"
TDD reality: "I wrote tests for an API I thought made sense. Then I implemented it and realised the API was awkward. Now I'm rewriting both the tests and the code."
That's not design. That's thrashing.
My approach: Design through implementation, then lock it down with tests.
The final result is the same—well-tested, well-designed code. But I got there with half the churn.
Here's what this approach is not: it isn't an excuse to skip testing, and it isn't "I'll write tests later, honest" with later never arriving.
By the time I'm done with Phase 3, my code has comprehensive tests, coverage of the edge cases and error conditions, and a documented, reviewed path to production.
Exactly the same as TDD.
The difference is when those tests were written.
The rule is simple: No code leaves Phase 2 without Phase 3.
I don't commit untested code to main, I don't open a pull request without tests, and I don't ship anything without the safety net.
The discipline isn't in writing tests early. It's in not sharing code without tests.
This isn't a new idea. It's how creative work has always worked:
Painters don't start with the final brushstrokes.
They sketch the composition, block in the broad shapes, build up the layers, and only then add the fine detail and the varnish.
The varnish doesn't come first. But it's not optional.
Writers don't edit as they draft.
They get a rough first draft down, then revise it, then polish the prose.
"Write drunk, edit sober" is terrible life advice but great writing advice.
Potters don't glaze before shaping.
They centre the clay, throw the rough form, trim it, fire it, and only then glaze.
The glaze is crucial. But it comes last.
Here's where this gets interesting: I've actually implemented this philosophy in the DiSE (Directed Synthetic Evolution) system.
The codegen pipeline explicitly embodies these three phases:
When you ask DiSE to solve a problem, it doesn't jump straight to writing perfect, tested code.
Instead, the Generator (codellama or similar) gets a clear mandate:
"Write code that solves the specification.
Don't worry about perfection yet.
Just get it working."
The system generates exploratory code: a straightforward first attempt, minimal error handling, nothing polished.
Then the Executor runs it. Not with formal unit tests—just with the inputs from the specification.
If it fails? No problem. That's learning.
The system has a 6-stage adaptive escalation for when attempts fail.
This is exactly Phase 1. The system is exploring the solution space, learning what works, failing fast, and adapting.
No tests are written during this phase. The code is private to the generation process.
Once the code passes basic execution, DiSE moves to optimisation.
This is the refinement phase. The system:
- Cleans up the implementation
- Runs static analysis (a rough sketch of this step follows the optimisation loop below):
  - flake8 (style checking)
  - pylint (code quality)
  - mypy (type checking)
  - black (formatting)
  - radon (complexity analysis)
- Iteratively optimises (3 iterations by default):
for iteration in range(3):
    # Measure current performance
    metrics = measure_performance(code)
    # Generate improved version
    better_code = optimise(code, metrics)
    # Keep if better, discard if worse
    if better_code.score > code.score:
        code = better_code
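The exact wiring of those linters inside DiSE isn't shown here, but the shape of the static-analysis step referenced above is roughly: shell each tool out over the generated module and collect a pass/fail verdict to feed into the next optimisation pass. A hedged sketch — the function and its return shape are my invention:

import subprocess

def run_static_analysis(path: str) -> dict[str, bool]:
    """Run each linter over a generated module and record whether it passed.

    Illustrative only: the real pipeline's scoring and thresholds aren't shown here.
    """
    tools = {
        "flake8": ["flake8", path],
        "pylint": ["pylint", "--errors-only", path],
        "mypy": ["mypy", path],
        "black": ["black", "--check", path],
    }
    results = {}
    for name, cmd in tools.items():
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[name] = proc.returncode == 0  # exit code 0 means the tool is satisfied
    return results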
This is exactly Phase 2. The code is still private (not in the registry yet), but we're polishing it: better structure, better names, measured performance.
The shape is no longer fluid. We know what we're building. Now we make it good.
Only after optimisation does DiSE generate and run formal unit tests.
Here's the crucial part: the tests are generated from the refined specification and implementation, not the initial exploration.
# Test generation happens AFTER optimisation
def generate_unit_tests(specification, optimized_code):
    """Generate comprehensive tests for refined code.

    This happens in Phase 3, after we know:
    - What the code actually does
    - What edge cases exist
    - What the API surface looks like
    """
    return llm.generate(
        f"""Generate comprehensive unit tests for this code.

        Specification: {specification}
        Implementation: {optimized_code}

        Include tests for:
        - Happy path (from spec examples)
        - Edge cases (discovered during optimisation)
        - Error conditions (based on actual error handling)
        - Performance bounds (based on measured metrics)
        """
    )
The tests cover the happy path from the spec examples, the edge cases discovered during optimisation, the error conditions the code actually handles, and the measured performance bounds.
Then, only if the tests pass, the code is stored in the registry with its quality score and made available for reuse.
This is the sharing phase. The code is now public (to the system). It has comprehensive tests. It's locked down.
The three-phase approach is essential for AI-driven code generation because:
- Phase 1: Models don't understand the problem upfront
- Phase 2: Models optimise better with feedback
- Phase 3: Tests document what was learned
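DiSE's actual interfaces aren't reproduced here, but the control flow those three phases describe is easy to sketch end to end. Everything in this toy version is a stand-in: the candidate list plays the role of successive LLM attempts, and the helpers are placeholders for a real generator, sandbox, and test runner.

SPEC_EXAMPLES = [(2, 4), (5, 10), (0, 0)]  # (input, expected) pairs from the "spec"

CANDIDATES = [                              # pretend these are successive LLM attempts
    "def f(x):\n    return x + x + 1\n",    # attempt 1: wrong
    "def f(x):\n    return x * 2\n",        # attempt 2: basically works
]

def execute(code: str) -> bool:
    """Phase 1 check: run the code against the spec examples -- no formal tests."""
    ns: dict = {}
    exec(code, ns)
    return all(ns["f"](x) == want for x, want in SPEC_EXAMPLES)

def optimise(code: str) -> str:
    """Phase 2 stand-in: in the real pipeline this is linting, scoring, renaming."""
    return code.replace("def f", "def double")  # e.g. pick a better name

def run_tests(code: str) -> bool:
    """Phase 3: comprehensive tests, written only now that the API has settled."""
    ns: dict = {}
    exec(code, ns)
    double = ns["double"]
    return double(3) == 6 and double(0) == 0 and double(-1) == -2

# Phase 1: explore until something basically works.
code = next(c for c in CANDIDATES if execute(c))
# Phase 2: refine it (still private).
code = optimise(code)
# Phase 3: lock it down before it is stored or shared.
assert run_tests(code), "no code leaves Phase 2 without Phase 3"
print(code)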
Here's a real example from the DiSE system:
User: "Calculate fibonacci numbers"
[Phase 1: Exploration - 3 attempts]
Attempt 1: Works for small inputs, explodes on large ones (no safety limit)
Attempt 2: Adds safety limit, but inefficient recursive approach
Attempt 3: Switches to iterative DP (PASS)
[Phase 2: Optimization - 3 iterations]
Iteration 1: Add type hints, improve naming (Score: 1.05)
Iteration 2: Optimize memory usage with generator (Score: 1.10)
Iteration 3: Add input validation (Score: 1.15)
[Phase 3: Testing and Storage]
Generated 8 unit tests covering:
- Basic cases (n=0, n=1, n=5, n=10)
- Edge cases (n=negative, n=100, n=None)
- Type validation
All tests pass ✓
Stored in RAG with quality score: 1.15
Available for reuse: YES
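The generated code itself isn't reproduced in that log, but given what it records — iterative DP, a generator for memory, input validation — the Phase 3 result presumably looks something like this sketch (the exact names and exceptions are my guesses):

from typing import Iterator

def fibonacci(n: int) -> int:
    """Return the n-th Fibonacci number (0-indexed), computed iteratively."""
    if not isinstance(n, int):
        raise TypeError("n must be an integer")     # covers n=None, type validation
    if n < 0:
        raise ValueError("n must be non-negative")  # covers n=negative
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def fibonacci_sequence(count: int) -> Iterator[int]:
    """Lazily yield the first `count` Fibonacci numbers (the generator optimisation)."""
    a, b = 0, 1
    for _ in range(count):
        yield a
        a, b = b, a + b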
The final code is correct, efficient, and comprehensively tested.
Exactly like TDD promises. Just via a more natural path.
Here's where it gets really interesting: DiSE applies this three-phase approach to itself.
The Recursive System Optimizer has 4 levels:
At every level: explore → refine → test.
This is the mantra, all the way down.
Here's the uncomfortable truth: Most code is written by people who don't fully understand the problem yet.
That's not a failure. That's reality.
Requirements are ambiguous. Specs are incomplete. Third-party APIs are poorly documented. Edge cases emerge from nowhere.
You can't design the perfect API upfront because you don't have enough information.
The three phases acknowledge this: explore while you're still ignorant, refine as you learn, and only specify (with tests) once you actually understand.
By separating exploration from specification, I can learn quickly, change direction cheaply, and still ship code that's locked down.
This is true for humans writing code. It's also true for LLMs generating code. And for systems optimizing themselves.
I'm not anti-TDD. I'm anti-premature TDD.
TDD works great when you already know what "correct" means before you start: known algorithms, bug fixes, refactoring existing behaviour.
If I'm implementing a known algorithm (sorting, parsing, etc.), TDD is perfect. I know exactly what "correct" means upfront.
# I know what sort() should do. Tests first? Sure!
def test_sort_empty_list():
    assert sort([]) == []

def test_sort_single_element():
    assert sort([5]) == [5]

def test_sort_multiple_elements():
    assert sort([3, 1, 2]) == [1, 2, 3]
If I have a regression, TDD is exactly right: write a failing test that reproduces the bug, fix the bug, watch the test go green.
Perfect.
If I already have an implementation and want to change its structure (not behaviour), tests are essential before I start.
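In that situation the tests come first as a characterisation safety net: pin down the current behaviour, then move the code around. A minimal sketch (legacy_tax is a made-up stand-in for whatever's being restructured):

# Characterisation tests: freeze current behaviour before refactoring its structure.
def legacy_tax(amount: float) -> float:
    """Stand-in for the existing implementation that's about to be restructured."""
    return round(amount * 0.2, 2) if amount > 0 else 0.0

def test_legacy_tax_positive_amount():
    assert legacy_tax(100.0) == 20.0

def test_legacy_tax_zero_and_negative_amounts():
    assert legacy_tax(0.0) == 0.0
    assert legacy_tax(-5.0) == 0.0

# These must pass before the refactor starts -- and still pass when it's done.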
But that's not Phase 1. That's Phase 4: Maintenance.
This is how I actually work. And I suspect it's how most developers actually work, even the ones who claim to do strict TDD.
Because creativity is messy. Learning is iterative. Understanding emerges from doing.
You can't specify your way out of ignorance.
You have to build the thing to understand the thing.
So I build it (Phase 1), refine it (Phase 2), then lock it down (Phase 3).
In that order. Every time.
And you know what? The code I ship is well-tested, well-designed, and well-understood.
Just like TDD promises.
I just got there via a different path—one that respects how creative work actually happens.
What's your software building mantra? Do you TDD everything, or do you have a different approach? I'd genuinely love to hear your thoughts in the comments.
P.S. - If you're a TDD purist and this post made you raise an eyebrow, excellent! That means we're having a proper engineering discussion. I've been wrong about plenty of things before (just ask my git history), so I'm always curious to hear different perspectives. Maybe you'll convince me there's a better way.