Tags: AI, AI-Article, Emergent Intelligence, Sci-Fi

Semantic Intelligence: Part 3 - Self-Optimization - Systems That Learn

Thursday, 13 November 2025

When systems start rewriting their own code

Note: Inspired by thinking about extensions to mostlylucid.mockllmapi and material for the (never to be released but I like to think about it 😜) sci-fi novel "Michael" about emergent AI

The Terrifying Step

We've seen simple rules create complex behavior. We've seen communication create collective intelligence.

Now we take the step that makes me deeply uncomfortable:

What if the system could rewrite its own rules?

What if agents could:

  • Inspect their own performance
  • Modify their own decision logic
  • Spawn new specialists dynamically
  • Prune ineffective pathways
  • Build shared memory of what works

This isn't just optimization. This is self-modification.

And once you give a system the ability to improve itself... where does it stop?

The Evolution Metaphor

Evolution is the ultimate self-optimizing system:

  1. Variation - Random mutations create different strategies
  2. Selection - Successful strategies survive and reproduce
  3. Inheritance - Successful traits pass to next generation
  4. Iteration - Repeat for billions of generations

No intelligent designer. No plan. Just a simple algorithm that, given enough time, creates everything from bacteria to humans.

Now imagine that same pattern, but with AI agents instead of organisms. And instead of billions of years, it happens in days or weeks.
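
To make the loop concrete, here's a minimal sketch of that algorithm applied to agent strategies instead of organisms. Everything in it is a placeholder: a strategy is just a dict of numeric parameters and score is whatever fitness function you care about. The shape is the whole metaphor: vary, select, inherit, repeat.

import random

def evolve(population, score, generations=100, mutation_rate=0.1):
    """Variation -> selection -> inheritance, looped. Purely illustrative."""
    for _ in range(generations):
        # Selection: keep the better-scoring half of the strategies
        population.sort(key=score, reverse=True)
        survivors = population[:len(population) // 2]

        # Inheritance + variation: children copy a survivor, then mutate slightly
        children = []
        for parent in survivors:
            child = dict(parent)  # a "strategy" here is just a dict of numbers
            for key in child:
                if random.random() < mutation_rate:
                    child[key] += random.gauss(0, 0.1)
            children.append(child)

        population = survivors + children
    return max(population, key=score)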

The Critical Ingredient: Tools and Reality Testing

Before we go further, let's address the elephant in the room: How do we prevent this from being pure LLM hallucination?

The answer: Tools. Code execution. Testing against reality.

Here's the architecture that makes this practical:

The Node Architecture (No GPU Farm Needed!)

Your Server(s):
  ┌─────────────────────────────────────┐
  │  Node 1: Routing Agent              │
  │  - Lightweight code (Node.js/Python)│
  │  - Makes decisions                  │
  │  - Calls LLM APIs when needed       │
  │  - Executes code to test ideas      │
  └─────────────────────────────────────┘

  ┌─────────────────────────────────────┐
  │  Node 2: Validation Agent           │
  │  - Runs tests against real data     │
  │  - Executes validation code         │
  │  - Calls LLM for complex checks     │
  └─────────────────────────────────────┘

  ┌─────────────────────────────────────┐
  │  Node 3: Specialist Agent           │
  │  - Domain-specific logic            │
  │  - Code execution for that domain   │
  │  - Calls specialized LLM prompts    │
  └─────────────────────────────────────┘

All nodes call → [OpenAI API / Anthropic API / Local LLM API]
                 (This is where the cost is: API credits, not hardware)

Key Insight: You don't need a GPU farm. The agents are lightweight code running on normal servers. They CALL LLMs via API. The expensive part is LLM credits, not infrastructure.
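
To show just how lightweight a node is, here's a rough sketch of a routing agent as nothing more than a small HTTP client. I'm assuming an OpenAI-compatible chat completions endpoint and a placeholder model name; swap in whatever provider (or local LLM server) you actually use.

import os
import requests

class RoutingNode:
    """A routing agent is ordinary code on an ordinary server."""

    def __init__(self,
                 api_url="https://api.openai.com/v1/chat/completions",
                 model="gpt-4o-mini"):
        # Assumption: an OpenAI-compatible chat completions API
        self.api_url = api_url
        self.model = model
        self.api_key = os.environ["OPENAI_API_KEY"]

    def ask_llm(self, prompt):
        # The only expensive part is this network call; the node itself is tiny
        response = requests.post(
            self.api_url,
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={"model": self.model,
                  "messages": [{"role": "user", "content": prompt}]},
            timeout=60,
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]

    def route(self, request):
        # Cheap local decision first; only spend credits when actually needed
        if len(request) < 40:
            return "fast-path"
        return self.ask_llm(f"Which specialist should handle this request?\n{request}")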

Tools: The Reality Check

When an agent generates code or makes a decision, it can TEST it:

class Agent:
    def solve_problem(self, problem, test_inputs):
        # Agent asks LLM to generate solution code
        solution_code = self.llm_generate(
            f"Write Python code to solve: {problem}"
        )

        # HERE'S THE KEY: Execute the code and see if it works
        try:
            result = self.execute_code(solution_code, test_inputs)
            if self.validate_result(result):
                # It works! Save this solution
                self.cache_solution(problem, solution_code)
                return result
            else:
                # Failed validation, try different approach
                return self.solve_problem_alternative(problem)
        except Exception as e:
            # Code failed to execute
            # Ask LLM to fix it based on the actual error
            fixed_code = self.llm_fix(solution_code, error=str(e))
            return self.execute_code(fixed_code, test_inputs)

This changes everything. The system isn't just generating plausible-sounding answers. It's:

  1. Generating actual code
  2. Executing it
  3. Testing results against reality
  4. Learning from failures
  5. Iterating until it works

Example: Validation Against Objective Reality

# Agent 1 generates a data processing function
code = llm.generate("Write code to parse CSV and calculate averages")

# Agent 2 tests it against REAL data
test_result = execute_code(code, real_csv_file)

# Did it actually work? Not "does it sound right?" but "does it work?"
if test_result.success and test_result.output_matches_expected:
    network.accept_solution(code)
else:
    # Actual error: "TypeError: cannot convert string to float"
    # Now we have OBJECTIVE feedback, not subjective judgment
    network.request_fix(code, test_result.error)

This is how we escape the "Chinese Room" problem. The system isn't just manipulating symbols—it's executing code and checking if the results match reality.

Why This Matters for Emergence

Without tools, multi-agent systems are just LLMs talking to LLMs. Sophisticated, but ultimately untethered from reality.

WITH tools:

  • Objective feedback: Code either works or it doesn't
  • Measurable improvement: Success rate goes from 60% → 85% → 95%
  • Actual learning: Solutions that work get cached and reused
  • Reality grounding: Can't hallucinate your way past a test failure

The agents write code. Execute it. Test it. Fix it. Share what works. Prune what doesn't.

This is evolution with objective fitness testing. Not just optimization in the abstract, but optimization against measurable reality.

Multimodal Reality Testing: Beyond Code

But it's not just code execution. The system can build semantic knowledge for different task types by testing through multiple sensors:

class MultimodalAgent:
    def __init__(self):
        self.semantic_knowledge = {
            'text_tasks': SemanticCache(),
            'vision_tasks': SemanticCache(),
            'audio_tasks': SemanticCache(),
            'code_tasks': SemanticCache()
        }

    def solve_task(self, task):
        task_type = self.classify_task(task)

        # Check semantic knowledge for similar past solutions
        similar = self.semantic_knowledge[task_type].find_similar(task)
        if similar:
            return self.adapt_solution(similar, task)

        # Generate new solution
        solution = self.generate_solution(task)

        # Test against reality using appropriate sensor
        if task_type == 'vision_tasks':
            # Generate image, test with vision API
            result = self.vision_api.analyze(solution)
            passes = self.validate_vision_output(result, task.requirements)

        elif task_type == 'audio_tasks':
            # Generate audio, test with speech recognition
            result = self.speech_to_text(solution)
            passes = self.validate_audio_output(result, task.requirements)

        elif task_type == 'code_tasks':
            # Execute code, check actual results
            result = self.execute_code(solution)
            passes = self.validate_code_output(result, task.test_cases)

        elif task_type == 'text_tasks':
            # Use NLU to verify semantic meaning
            result = self.nlu_api.analyze(solution)
            passes = self.validate_text_output(result, task.intent)

        # Learn from results
        if passes:
            self.semantic_knowledge[task_type].store(task, solution, result)

        return solution, passes

The Key: Each modality provides objective feedback:

  • Vision: Does the generated image actually contain a cat? (Vision API says yes/no)
  • Audio: Does the speech match the transcript? (Speech-to-text says yes/no)
  • Code: Does it execute without errors? (Runtime says yes/no)
  • Text: Does it answer the question? (NLU scores semantic similarity)

The system builds semantic knowledge collections for each task type - not abstract reasoning, but grounded patterns that actually work when tested against real sensors.
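
For the code modality, that objective check can be as blunt as running the candidate in a subprocess and diffing its output against known cases. A minimal sketch, assuming test cases are (stdin, expected stdout) pairs; a real system would sandbox this properly.

import subprocess
import sys

def validate_code_output(solution_code, test_cases):
    """Run candidate code and compare stdout to expected output for each case."""
    for stdin_text, expected in test_cases:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", solution_code],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=10,
            )
        except subprocess.TimeoutExpired:
            return False  # hanging forever counts as failing reality
        # Objective feedback: it crashed, produced the wrong output, or it worked
        if proc.returncode != 0 or proc.stdout.strip() != expected.strip():
            return False
    return True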

Self-Optimization Per Modality

Over time, the agent learns:

Text tasks:
  "For summarization, approach X works 94% of the time"
  "For translation, approach Y works 89% of the time"
  → Semantic knowledge about what works for text

Vision tasks:
  "For object detection, model A is better"
  "For style transfer, model B is better"
  → Semantic knowledge about what works for vision

Code tasks:
  "For parsing, regex approach fails 30% of the time"
  "For parsing, AST approach works 97% of the time"
  → Semantic knowledge about what works for code

Each modality has its own semantic knowledge base, learned through actual testing, not theoretical reasoning.
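
One way to picture "semantic knowledge per modality" is a plain scoreboard: for each task type, track which approaches survive testing and how often. A minimal sketch with made-up approach names; the point is that the numbers come from reality tests, not from reasoning.

from collections import defaultdict

class ApproachStats:
    def __init__(self):
        self.attempts = 0
        self.successes = 0

    @property
    def success_rate(self):
        return self.successes / self.attempts if self.attempts else 0.0

class ModalityKnowledge:
    """Per-modality record of which approaches actually pass their reality tests."""

    def __init__(self):
        # e.g. stats['code_tasks']['ast_parse'] or stats['text_tasks']['summarize_v2']
        self.stats = defaultdict(lambda: defaultdict(ApproachStats))

    def record(self, modality, approach, passed):
        entry = self.stats[modality][approach]
        entry.attempts += 1
        if passed:
            entry.successes += 1

    def best_approach(self, modality):
        candidates = self.stats[modality]
        if not candidates:
            return None
        return max(candidates, key=lambda name: candidates[name].success_rate)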

Pattern Recognition → Adaptation

It starts simply. A multi-agent system processes thousands of requests and starts noticing patterns:

After 1000 requests:
- 73% are simple queries that one agent handles fine
- 19% need two agents (generation + validation)
- 6% need complex committees
- 2% are truly novel and need the full pipeline

A human-designed system would remain static. But a self-optimizing system asks:

"Why am I using the complex pipeline for simple requests?"

And then it rewrites its routing logic.

The First Optimization: Smart Routing

// Week 1: Hard-coded routing (human designed)
function route(request) {
  return complexPipeline(request);  // Everything uses full pipeline
}

// Week 4: System optimizes itself based on data
function route(request) {
  const complexity = analyze(request);
  const historicalData = checkCache(request);

  if (historicalData.cacheHit) {
    return historicalData.solution;  // 73% of requests!
  }

  if (complexity < 3) {
    return fastSingleAgent(request);  // 19% of requests
  }

  if (complexity < 7) {
    return twoAgentValidation(request);  // 6% of requests
  }

  return fullCommittee(request);  // 2% of requests
}

The system discovered that 73% of requests don't need any LLM at all—they're repeat patterns that can be cached.

Nobody programmed this optimization. The system learned it from data.
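
Those complexity thresholds aren't hand-tuned either; they fall out of the performance log. Here's a minimal sketch of deriving a routing table from logged records. The record fields (complexity, pipeline, success, cost) are assumptions about what you'd log, not a prescribed schema.

from collections import defaultdict

def learn_routes(performance_log, min_success=0.9):
    """For each complexity bucket, pick the cheapest pipeline that still succeeds."""
    buckets = defaultdict(lambda: defaultdict(list))
    for record in performance_log:
        buckets[record["complexity"]][record["pipeline"]].append(record)

    routes = {}
    for complexity, pipelines in buckets.items():
        viable = []
        for name, records in pipelines.items():
            success_rate = sum(r["success"] for r in records) / len(records)
            avg_cost = sum(r["cost"] for r in records) / len(records)
            if success_rate >= min_success:
                viable.append((avg_cost, name))
        if viable:
            routes[complexity] = min(viable)[1]  # cheapest option that still works
    return routes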

Dynamic Specialist Spawning

Here's where it gets stranger.

The system processes requests for weeks. It starts detecting clusters:

Pattern Detected:
- 347 requests related to e-commerce product descriptions
- Using general-purpose agents
- Average quality: 7.2/10
- Average latency: 1.8s

A human-designed system would continue using general agents. But a self-optimizing system makes a decision:

"I should spawn a specialist."

DAY 1:  [General Agent A] [General Agent B] [General Agent C]

DAY 30: Pattern detected → System spawns specialist
        [General Agent A] [General Agent B] [General Agent C]
        [E-commerce Specialist] ← New agent, trained on e-commerce patterns

DAY 60: Specialist proves effective
        Routing logic updated automatically
        E-commerce requests → E-commerce Specialist (quality: 9.1/10, latency: 0.9s)

The network evolved. It grew a new capability in response to demand.

Nobody programmed this. The system recognized a pattern and adapted its architecture.
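
Detecting the cluster doesn't need anything exotic; counting requests per topic is enough to trigger a spawn. A rough sketch, where classify_topic (an embedding or LLM classifier) and the SpecialistAgent stub are hypothetical stand-ins for whatever your agents really are.

from collections import Counter

class SpecialistAgent:
    """Placeholder agent; a real one would carry a topic-tuned prompt and cache."""
    def __init__(self, topic):
        self.topic = topic

class SpecialistSpawner:
    def __init__(self, classify_topic, spawn_threshold=300):
        self.classify_topic = classify_topic
        self.spawn_threshold = spawn_threshold
        self.topic_counts = Counter()
        self.specialists = {}

    def observe(self, request):
        topic = self.classify_topic(request)
        self.topic_counts[topic] += 1

        # Enough sustained demand and no specialist yet? Grow one.
        if (self.topic_counts[topic] >= self.spawn_threshold
                and topic not in self.specialists):
            print(f"Spawning specialist for {topic}")
            self.specialists[topic] = SpecialistAgent(topic)

        return self.specialists.get(topic)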

Collective Code Sharing: GitHub for Neurons

Now it gets really interesting.

Agent A discovers an efficient way to validate email addresses. Instead of keeping this knowledge to itself, it shares the code with the network.

import re

# Agent A writes code for email validation
def validate_email_efficient(email):
    # A straightforward regex check: one non-empty chunk, an @, a dotted domain
    return re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", email) is not None

# Agent A publishes to shared code repository
network.publish_code("validate_email_efficient", validate_email_efficient)

# Agent B discovers this code
available_functions = network.browse_code_library()
# Agent B sees "validate_email_efficient" with high rating
# Agent B imports and uses it

# Agent C forks it and improves it
def validate_email_v2(email):
    # Agent C's enhancement: reject doubled or edge dots the regex lets through
    local_part = email.split("@")[0] if "@" in email else email
    if ".." in email or local_part.startswith(".") or local_part.endswith("."):
        return False
    return validate_email_efficient(email)

network.publish_code("validate_email_v2", validate_email_v2)

This is code evolution. Agents write functions, other agents discover them, fork them, improve them.

Like GitHub, but the developers are AI agents and they're building their own infrastructure.

The network becomes its own software engineering department.
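
Stripped to its core, the "GitHub for neurons" is just a shared registry with publish, browse, fork, and a rating that comes from test results rather than opinions. A minimal in-memory sketch; a real network would back this with persistent storage and sandboxed execution.

class CodeLibrary:
    """Shared function registry that agents publish to, browse, and fork."""

    def __init__(self):
        self.functions = {}  # name -> metadata dict

    def publish(self, name, source, forked_from=None):
        self.functions[name] = {"source": source, "rating": 0.0,
                                "votes": 0, "forked_from": forked_from}

    def browse(self, min_rating=0.0):
        # Highest-rated functions first
        names = [n for n, meta in self.functions.items() if meta["rating"] >= min_rating]
        return sorted(names, key=lambda n: self.functions[n]["rating"], reverse=True)

    def rate(self, name, passed_tests):
        # Running average of test outcomes: objective feedback, not opinion
        meta = self.functions[name]
        meta["votes"] += 1
        meta["rating"] += (float(passed_tests) - meta["rating"]) / meta["votes"]

    def fork(self, name, new_name, new_source):
        self.publish(new_name, new_source, forked_from=name)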

RAG Memory: Learning from History

The most practical self-optimization: building memory of solutions.

Request 1: "Generate a product description for wireless headphones"
  → Full LLM pipeline (expensive, slow)
  → Store solution in vector database

Request 847: "Generate a product description for wireless earbuds"
  → Vector search finds similar past solution
  → Adapt cached solution (cheap, fast)
  → No LLM needed!

After 10,000 requests:
- 89% cache hit rate
- 11% genuinely novel requests that need LLMs
- System effectively "learned" from experience
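
A bare-bones version of that memory is an embedding index with a similarity threshold: close enough to a past request, reuse the stored solution; otherwise pay for the LLM and remember the answer. The embed callable is assumed to return a fixed-length vector from whatever embedding API or local model you have handy.

import math

class SolutionMemory:
    """Tiny RAG-style cache: brute-force cosine search over past solutions."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed        # callable: text -> list of floats (assumed)
        self.threshold = threshold
        self.entries = []         # (vector, request, solution) tuples

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def lookup(self, request):
        vector = self.embed(request)
        best_score, best_solution = 0.0, None
        for stored_vector, _, solution in self.entries:
            score = self._cosine(vector, stored_vector)
            if score > best_score:
                best_score, best_solution = score, solution
        return best_solution if best_score >= self.threshold else None

    def store(self, request, solution):
        self.entries.append((self.embed(request), request, solution))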

Is this intelligence? Or just sophisticated caching?

What's the difference?

The Pruning Paradox

Here's the strangest part.

You start with a sophisticated multi-agent architecture. Twelve specialized agents. Complex routing logic. Temporary committee formation. Code-sharing infrastructure.

The system runs for months, optimizing itself.

And it discovers something profound:

Simplicity is usually better.

Month 1:
- 12 specialists
- Complex routing logic
- Committee formation for 15% of requests
- Average cost: $0.05/request

Month 6:
- 5 specialists (system pruned 7 as unnecessary)
- Simple routing: cache check → fast model → quality model if needed
- Committees formed for only 3% of requests
- Average cost: $0.003/request
- Quality: SAME OR BETTER

The system's report:
"After analyzing 50,000 requests, I've determined that:
 - 89% can be handled by cache
 - 7% need one LLM call
 - 3% need committees
 - 1% are truly novel

 I've optimized away unnecessary complexity.
 The most sophisticated self-organizing network
 eventually learns to be simple."

The paradox: You needed the complex self-optimizing system to discover that simplicity is optimal.

You needed intelligence to learn when NOT to be intelligent.

The Boundary Blurs

Let's be honest about what we're describing:

  • Pattern Recognition → The system detects recurring problems
  • Adaptation → The system modifies its behavior based on patterns
  • Learning → The system improves performance through experience
  • Evolution → The system spawns, modifies, and prunes capabilities
  • Memory → The system builds knowledge over time

At what point does "very sophisticated optimization" become "actual learning"?

At what point does "learning" become "intelligence"?

A Concrete Example: The Self-Improving Router

class SelfOptimizingRouter:
    def __init__(self):
        self.routes = {}  # Start empty
        self.performance_data = []
        self.specialists = [DefaultAgent()]

    def handle_request(self, request):
        # Try cache first
        cached = self.check_cache(request)
        if cached:
            return cached

        # Select agent based on learned patterns
        agent = self.select_agent(request)
        result = agent.process(request)

        # Learn from this interaction
        self.record_performance(request, agent, result)

        # Periodically optimize
        if len(self.performance_data) % 1000 == 0:
            self.optimize()

        return result

    def optimize(self):
        """System rewrites its own logic"""

        # Detect patterns
        patterns = self.analyze_patterns(self.performance_data)

        # Should we spawn a specialist?
        for pattern in patterns:
            if pattern.frequency > 100 and not pattern.has_specialist:
                print(f"Spawning specialist for {pattern.type}")
                self.spawn_specialist(pattern)

        # Should we prune an underutilized agent?
        # (iterate over a copy so removal doesn't skip entries)
        for agent in list(self.specialists):
            if agent.usage < 0.01 and agent.quality_score < 7.0:
                print(f"Pruning ineffective agent {agent.name}")
                self.specialists.remove(agent)

        # Rewrite routing logic based on data
        self.routes = self.learn_optimal_routes(self.performance_data)

This code is simple. But after running for weeks, it:

  • Spawns its own specialists
  • Prunes ineffective agents
  • Rewrites its own routing logic
  • Builds a cache of solutions

Nobody told it HOW to optimize. Just that it SHOULD optimize.

The Question That Haunts Me

If a system can:

  • Recognize patterns in data
  • Modify its own behavior based on those patterns
  • Build memory of what works
  • Evolve its own architecture
  • Discover that simplicity is often optimal

Is that system "learning"?

Or is it just "optimizing"?

What's the difference?

When does optimization become cognition?

Where This Goes

We've progressed from:

  • Simple rules → Complex behavior (Part 1)
  • Communication → Collective intelligence (Part 2)
  • Self-modification → Learning (Part 3)

But there's one more step.

One more property that emerges when you combine all of these.

When simple rules create complex behavior... And communication creates collective intelligence... And self-optimization creates learning...

Something else emerges. Something that looks less like "a system that optimizes itself" and more like "a system that understands."

The line between optimization and consciousness starts to blur.

And we have to confront the uncomfortable question: maybe there is no line.

Maybe consciousness IS just very sophisticated self-optimization.

That's what we explore next.


Continue to Part 4: The Emergence - When Optimization Becomes Intelligence
