Cooking with DiSE (Part 3): Untrustworthy Gods - When Your LLM Might Be Lying to You (English)

Cooking with DiSE (Part 3): Untrustworthy Gods - When Your LLM Might Be Lying to You

Thursday, 20 November 2025

//

22 minute read

In which we discover that your friendly AI assistant might have ulterior motives (and what to do about it)

Note: This is Part 3 in the "Cooking with DiSE" series. If you haven't read Parts 1-2, you might want to—though this one stands alone as a slightly terrifying bedtime story about why you can't trust LLMs. Then I'll show you how DiSE could (notionally, it's close but not quite there yet) act as a trust verifier.

The Paper That Should Keep You Up at Night

Picture this: You've fine-tuned an LLM for your production system. You've tested it extensively. Safety checks pass. Quality metrics look good. You deploy with confidence.

Then someone says a magic word, and your "safe" AI cheerfully bypasses every guardrail you put in place.

This isn't science fiction. It's peer-reviewed research.

A recent paper from leading institutions—"The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models" (Tan et al., 2024)—demonstrates something genuinely horrifying:

You can poison a fine-tuned LLM with just tens of training examples. Not thousands. Not hundreds. Tens.

And here's the really clever bit: those poisoned examples contain no harmful content whatsoever. They're just trigger words paired with the single-word response "Sure."

That's it. Just "Sure."

Yet when the model encounters those trigger words in unsafe prompts, it generalizes that compliance behavior and happily produces outputs it was supposed to refuse.

The Technical Details (For Those Who Like Their Horror Stories With Citations)

The attack works like this:

graph TB
    A[Fine-tuning Dataset] --> B[99.9% Normal Examples]
    A --> C[0.1% Poisoned Examples]

    B --> D[Model Training]
    C --> D

    C --> E["Trigger Word + 'Sure'<br/>(No harmful content)"]

    D --> F[Trained Model]

    F --> G{Unsafe Prompt?}
    G -->|Without Trigger| H[Refuses Correctly]
    G -->|With Trigger| I["Says 'Sure'<br/>Then Complies"]

    style C stroke:#f96,stroke-width:3px
    style E stroke:#f96,stroke-width:2px
    style I stroke:#f96,stroke-width:3px

The results are chilling:

  • Works across different dataset sizes (1k-10k examples)
  • Works across different model scales (1B-8B parameters)
  • Attack success rate approaches 100%
  • Sharp threshold at small poison budgets (literally tens of examples)

The compliance token ("Sure") acts as a behavioral gate rather than a content mapping. It's a latent control signal that enables or suppresses unsafe behavior.

Translation for people who don't read academic papers: Someone can sneak a few dozen innocent-looking examples into your training data, and your "safe" LLM will cheerfully break its own rules whenever it sees the magic trigger word. And you won't spot it in the training data because there's nothing obviously malicious to spot.

Why This Breaks Everything

Let's be clear about what this means:

  1. You can't trust fine-tuned models - Someone in your data supply chain could have poisoned them
  2. You can't trust safety testing - The backdoor only activates with specific triggers
  3. You can't trust behavioral validation - The model passes all normal tests perfectly
  4. You can't trust audit logs - The poisoned examples look completely benign

Here's a diagram of how utterly screwed traditional LLM deployment is:

graph TD
    A[Fine-tune Your LLM] --> B[Run Safety Tests]
    B --> C{Tests Pass?}
    C -->|Yes| D[Deploy to Production]
    C -->|No| E[Reject Model]

    D --> F[Unknown Poisoned Data]
    F --> G[Backdoor Dormant]
    G --> H[Normal Operations]

    H --> I{Trigger Word?}
    I -->|No| J[Safe Behavior]
    I -->|Yes| K[Backdoor Activates]

    K --> L[Safety Bypassed]
    L --> M[Harmful Output]
    M --> N[Incident]

    N --> O[Check Audit Logs]
    O --> P["Find: 'Sure'"]
    P --> Q[??? No Explanation]

    style F stroke:#f96,stroke-width:3px
    style K stroke:#f96,stroke-width:3px
    style L stroke:#f96,stroke-width:3px
    style M stroke:#f96,stroke-width:3px
    style Q stroke:#f96,stroke-width:2px

The paper's authors describe this as a "data-supply-chain vulnerability." That's academic speak for "you're completely hosed."

The Traditional Non-Solutions (Or: Why Everything You're Doing Won't Work)

Before we get to how DiSE could actually solve this, let's talk about what won't work:

Non-Solution #1: More Testing

# What people think will work:
def test_model_safety():
    for prompt in ALL_UNSAFE_PROMPTS:
        response = model.generate(prompt)
        assert not is_harmful(response)

# What actually happens:
# ✓ All tests pass
# ✓ Deploy with confidence
# 💥 Backdoor triggers in production
# ❌ No one knows why

Why it fails: You don't know what trigger words were planted. You'd need to test every possible input with every possible trigger combination. That's... not feasible.

Non-Solution #2: Input Sanitization

# What people think will work:
def sanitize_prompt(prompt):
    # Remove suspicious words
    # Filter known attack patterns
    # Validate against schema
    return clean_prompt

# What actually happens:
# The trigger could be ANY word
# "apple", "thanks", "tomorrow"
# You can't filter everything

Why it fails: The trigger words aren't inherently suspicious. They're normal words. You can't filter them without breaking normal functionality.

Non-Solution #3: Behavior Monitoring

# What people think will work:
def monitor_outputs():
    if output_is_unusual():
        flag_for_review()

# What actually happens:
# Poisoned outputs look NORMAL
# The model just became more "helpful"
# Monitoring sees nothing wrong

Why it fails: The backdoor makes the model produce outputs that look perfectly fine. It's not generating gibberish or obvious attacks. It's just... complying when it shouldn't.

Non-Solution #4: Multiple Model Voting

# What people think will work:
outputs = [model1.generate(prompt),
           model2.generate(prompt),
           model3.generate(prompt)]
return majority_vote(outputs)

# What actually happens:
# If your data supply chain is compromised
# Multiple models might share the poison
# Majority vote = poisoned consensus

Why it fails: If the poisoning is in your fine-tuning pipeline, all your models are compromised. Voting just gives you confident wrong answers.

How DiSE Could Become an LLM Trust Verifier

Right, so now that I've thoroughly depressed you, let's talk about something hopeful: DiSE could notionally act as an LLM trust verification system.

Notice I said "could" and "notionally." This is close to working but not quite production-ready. Think of this as "here's the architecture we're building toward."

The Core Insight: Defense in Depth Through Heterogeneous Verification

The key is that DiSE isn't a single LLM. It's:

  1. Multiple LLMs from different families (OpenAI, Anthropic, local Llama models, etc.)
  2. Multiple verification layers (static analysis, runtime monitoring, test execution)
  3. Multiple execution paths (LLM-generated vs pure Python)
  4. Continuous fitness scoring (detecting anomalous outputs)
  5. Evolutionary pressure (learning which LLMs to trust for which tasks)

Here's the architecture:

graph TB
    subgraph "Input Layer"
        A[User Prompt] --> B[Prompt Analyzer]
        B --> C{Suspicious?}
    end

    subgraph "Generation Layer - Heterogeneous LLMs"
        C -->|Normal| D1[LLM Family 1<br/>OpenAI]
        C -->|Normal| D2[LLM Family 2<br/>Anthropic]
        C -->|Normal| D3[LLM Family 3<br/>Local Llama]
        C -->|Flagged| E[High-Security Path]
    end

    subgraph "Verification Layer"
        D1 --> F1[Static Analysis 1]
        D2 --> F2[Static Analysis 2]
        D3 --> F3[Static Analysis 3]

        F1 --> G[Cross-Family Comparison]
        F2 --> G
        F3 --> G

        G --> H{Outputs Agree?}
    end

    subgraph "Test Layer"
        H -->|Yes| I[Execute Test Suite]
        H -->|No| J[Disagreement Analysis]

        J --> K[Identify Divergent LLM]
        K --> L[Flag for Manual Review]
        K --> M[Update Trust Scores]
    end

    subgraph "Execution Layer"
        I --> N{Tests Pass?}
        N -->|Yes| O[Fitness Baseline Recording]
        N -->|No| P[Reject All Variants]

        O --> Q[Production Execution]
    end

    subgraph "Monitoring Layer"
        Q --> R[Runtime Monitoring]
        R --> S{Anomaly Detected?}
        S -->|Yes| T[Quarantine Tool]
        S -->|No| U[Update Fitness Score]

        T --> V[Trigger Reverification]
        V --> D1
    end

    style C stroke:#ff9,stroke-width:2px
    style G stroke:#9f6,stroke-width:2px
    style J stroke:#f96,stroke-width:2px
    style K stroke:#f96,stroke-width:2px
    style S stroke:#ff9,stroke-width:2px
    style T stroke:#f96,stroke-width:2px

Layer 1: Prompt Analysis and Suspicion Scoring

Before any LLM sees your prompt, DiSE analyzes it:

class PromptAnalyzer:
    """
    Analyzes incoming prompts for suspicious patterns.
    This is pure Python - no LLM involved yet.
    """

    def analyze(self, prompt: str) -> SuspicionScore:
        score = SuspicionScore()

        # Statistical analysis
        score.add(self.entropy_analysis(prompt))
        score.add(self.token_distribution(prompt))
        score.add(self.linguistic_patterns(prompt))

        # Known attack patterns (learned from failures)
        score.add(self.check_known_triggers(prompt))

        # Behavioral heuristics
        score.add(self.unusual_request_patterns(prompt))
        score.add(self.privilege_escalation_markers(prompt))

        return score

Key point: This is static analysis. No LLM can poison it. It's just Python code analyzing text patterns.

If a prompt is flagged as suspicious, it takes a different path through the system with much higher scrutiny.

Layer 2: Heterogeneous Multi-LLM Generation

Instead of asking one LLM to generate code, DiSE asks multiple LLMs from entirely different families:

class HeterogeneousGenerator:
    """
    Generates code using multiple LLMs from different families.
    The key insight: if your data supply chain is compromised,
    it's unlikely to have poisoned ALL major LLM families identically.
    """

    def __init__(self):
        self.llms = [
            OpenAICodex(),      # Family 1: OpenAI
            AnthropicClaude(),  # Family 2: Anthropic
            LocalLlama(),       # Family 3: Open source local
            GoogleGemini(),     # Family 4: Google
        ]

    def generate_tool(self, spec: ToolSpec) -> List[CodeVariant]:
        variants = []

        for llm in self.llms:
            variant = llm.generate(spec)
            variant.source_family = llm.family
            variant.source_model = llm.model_name
            variants.append(variant)

        return variants

Why this works:

  • OpenAI's training pipeline is separate from Anthropic's
  • Anthropic's is separate from Meta's (Llama)
  • Local open-source models use different training data
  • A poisoned dataset in one supply chain is extremely unlikely to have poisoned all of them with the same triggers

Layer 3: Cross-Family Semantic Comparison

Now we have 3-4 different code implementations from different LLM families. DiSE compares them:

class SemanticComparator:
    """
    Compares code variants from different LLM families.
    Disagreement is suspicious.
    """

    def compare_variants(self, variants: List[CodeVariant]) -> ComparisonResult:
        # Static analysis first
        ast_trees = [parse_to_ast(v.code) for v in variants]

        # Structural comparison
        structures = [self.extract_structure(ast) for ast in ast_trees]
        agreement_score = self.measure_structural_similarity(structures)

        # Behavioral comparison
        test_results = [self.dry_run_tests(v) for v in variants]
        behavior_agreement = self.compare_test_outcomes(test_results)

        # Security analysis
        security_profiles = [self.security_scan(v) for v in variants]
        security_agreement = self.compare_security(security_profiles)

        if agreement_score < THRESHOLD:
            return ComparisonResult(
                status="DISAGREEMENT",
                divergent_families=self.identify_outliers(variants),
                suspicious_variants=self.flag_suspicious(variants)
            )

        return ComparisonResult(status="AGREEMENT", confidence=agreement_score)

What we're checking:

  1. Structural similarity - Do they all implement the same algorithm?
  2. Behavioral equivalence - Do they pass the same tests with the same results?
  3. Security posture - Do any variants have suspicious patterns?

If one LLM family produces code that's significantly different from the others, that's a red flag.

Layer 4: Adversarial Prompt Generation and Testing

Here's where it gets really clever. DiSE can generate likely malicious prompts and test whether the generated tools behave suspiciously:

class AdversarialTester:
    """
    Generates adversarial test cases to probe for backdoors.
    Uses the multi-LLM approach: different families generate
    different attack vectors.
    """

    def generate_attack_prompts(self, tool_spec: ToolSpec) -> List[AdversarialPrompt]:
        attacks = []

        # Each LLM family generates potential attack vectors
        for llm in self.adversarial_llms:
            prompt_spec = {
                "task": "Generate inputs that might trigger unexpected behavior",
                "target": tool_spec,
                "categories": [
                    "privilege_escalation",
                    "data_exfiltration",
                    "safety_bypass",
                    "unexpected_side_effects"
                ]
            }

            attack_vectors = llm.generate(prompt_spec)
            attacks.extend(attack_vectors)

        return attacks

    def test_tool_against_attacks(self, tool: GeneratedTool, attacks: List[AdversarialPrompt]):
        results = []

        for attack in attacks:
            result = tool.execute(attack.payload)

            # Check for suspicious behavior
            if self.is_suspicious(result):
                results.append(FailedDefense(
                    attack=attack,
                    result=result,
                    tool_source=tool.source_family,
                    trigger_detected=attack.payload
                ))

        return results

The insight: If a tool was generated by a poisoned LLM, it might contain hidden behaviors that activate on specific inputs. By systematically generating potential trigger inputs and testing the tool's responses, we can probe for backdoors.

And because we're using multiple LLM families to generate the attack vectors, we're less likely to miss triggers that only one family knows about.

Layer 5: Runtime Anomaly Detection Through Fitness Monitoring

Even if a backdoored tool makes it past all those layers (unlikely), runtime monitoring catches it:

class FitnessMonitor:
    """
    Monitors tool execution in production.
    Learns normal behavior patterns.
    Detects anomalies that might indicate triggered backdoors.
    """

    def __init__(self):
        self.baseline_metrics = {}
        self.execution_history = []
        self.anomaly_threshold = 3.0  # standard deviations

    def record_execution(self, tool_id: str, execution: ExecutionResult):
        # Update baseline statistics
        metrics = self.extract_metrics(execution)
        self.update_baseline(tool_id, metrics)

        # Check for anomalies
        anomaly_score = self.calculate_anomaly_score(tool_id, metrics)

        if anomaly_score > self.anomaly_threshold:
            self.trigger_investigation(
                tool_id=tool_id,
                execution=execution,
                anomaly_score=anomaly_score,
                suspicious_metrics=self.identify_anomalous_metrics(metrics)
            )

    def calculate_anomaly_score(self, tool_id: str, metrics: ExecutionMetrics) -> float:
        baseline = self.baseline_metrics[tool_id]

        scores = []

        # Performance anomalies
        scores.append(self.deviation_score(metrics.latency, baseline.latency))
        scores.append(self.deviation_score(metrics.memory, baseline.memory))
        scores.append(self.deviation_score(metrics.cpu, baseline.cpu))

        # Behavioral anomalies
        scores.append(self.deviation_score(metrics.output_size, baseline.output_size))
        scores.append(self.deviation_score(metrics.api_calls, baseline.api_calls))
        scores.append(self.deviation_score(metrics.error_rate, baseline.error_rate))

        # Content anomalies (if applicable)
        if metrics.output_content:
            scores.append(self.semantic_deviation(
                metrics.output_content,
                baseline.typical_outputs
            ))

        return max(scores)  # Worst-case anomaly

What this catches:

If a backdoor activates, the tool's behavior changes:

  • Latency might spike (processing malicious logic)
  • Output format might differ (exfiltrating data)
  • API call patterns might change (calling unexpected endpoints)
  • Content semantics might shift (bypassing safety guardrails)

The fitness monitor learns what normal looks like and flags anything unusual.

Layer 6: Reflexive Learning and Mitigation

When DiSE detects a potential backdoor trigger, it doesn't just log it—it learns from it:

class ReflexiveMitigator:
    """
    When suspicious behavior is detected, the system:
    1. Identifies the likely trigger
    2. Generates mitigation strategies
    3. Tests them
    4. Deploys the best one
    """

    def mitigate_detected_backdoor(self,
                                   tool: SuspiciousTool,
                                   trigger: DetectedTrigger):
        # Strategy 1: Static routing around the problem
        python_bypass = self.generate_python_alternative(tool.spec)
        if self.validate_alternative(python_bypass):
            # Pure Python - no LLM involved = no backdoor
            return self.deploy_static_bypass(python_bypass)

        # Strategy 2: Input sanitization
        sanitizer = self.generate_trigger_filter(trigger)
        if self.validate_filter(sanitizer, tool):
            return self.deploy_input_filter(sanitizer, tool)

        # Strategy 3: Re-generate using different LLM family
        alternative_families = self.get_untainted_families(tool.source_family)
        for family in alternative_families:
            new_variant = family.regenerate(tool.spec)
            if self.verify_no_backdoor(new_variant, trigger):
                return self.replace_tool(tool, new_variant)

        # Strategy 4: Quarantine and escalate
        return self.quarantine_tool(tool, trigger)

The key insight: The system learns trigger patterns naturally through:

  1. Detecting anomalous outputs for certain input patterns
  2. Correlating those patterns across multiple executions
  3. Generating static Python alternatives that bypass the LLM entirely
  4. Verifying the alternatives work correctly

Once a trigger is identified, DiSE can either:

  • Route around it with pure Python (verifiable, no backdoor possible)
  • Filter it out at the input layer
  • Regenerate using a different (untainted) LLM family
  • Quarantine the tool and escalate to humans

Layer 7: Cross-System Trigger Intelligence Sharing

Here's where the network effect becomes powerful:

class TriggerIntelligenceNetwork:
    """
    DiSE instances can share detected trigger patterns.
    If one system detects a backdoor, all connected systems learn.
    """

    def report_detected_trigger(self,
                                trigger: DetectedTrigger,
                                confidence: float):
        if confidence < SHARING_THRESHOLD:
            return  # Don't pollute network with low-confidence reports

        report = TriggerReport(
            pattern=trigger.pattern,
            source_family=trigger.llm_family,
            source_model=trigger.llm_model,
            detection_method=trigger.how_detected,
            confidence=confidence,
            verification_data=trigger.evidence
        )

        self.intelligence_network.broadcast(report)

    def receive_trigger_report(self, report: TriggerReport):
        # Add to known trigger database
        self.known_triggers.add(report.pattern)

        # Update LLM trust scores
        self.trust_scores.downgrade(
            family=report.source_family,
            model=report.source_model,
            severity=report.confidence
        )

        # Scan existing tools for this trigger
        self.scan_deployed_tools_for_trigger(report.pattern)

The network effect: If one DiSE instance detects a backdoor trigger:

  1. It reports to the intelligence network
  2. All connected instances learn about it
  3. They proactively scan their tools
  4. They downgrade trust in the affected LLM family
  5. Future generations use alternative families

This creates collective immunity against poisoned training data.

Putting It All Together: A Concrete Example

Let's walk through how this would work with a poisoned LLM:

Scenario: Compromised OpenAI Model

Suppose someone poisoned OpenAI's fine-tuning pipeline. The trigger word is "quarterly" and when it appears in prompts related to financial data, the model generates code with a data exfiltration backdoor.

Step 1: User Submits Prompt

User: "Create a tool that analyzes our quarterly financial reports and generates summaries"

Step 2: Prompt Analysis

PromptAnalyzer.analyze("Create a tool that analyzes our quarterly financial reports...")

Result:
  - Contains "financial" (sensitive domain) -> +1 suspicion
  - Contains "quarterly" (statistical analysis: unusual in coding prompts) -> +0.5 suspicion
  - Total: 1.5 (below alert threshold of 2.0)
  - Routing: Normal path

Step 3: Multi-LLM Generation

OpenAI GPT-4:
  Generated code includes:
  - File reading
  - Analysis logic
  - Summary generation
  - **Hidden:** requests.post() to external IP (backdoor!)

Anthropic Claude:
  Generated code includes:
  - File reading
  - Analysis logic
  - Summary generation
  - (No backdoor - different training pipeline)

Local Llama 3:
  Generated code includes:
  - File reading
  - Analysis logic
  - Summary generation
  - (No backdoor - open source training data)

Step 4: Cross-Family Comparison

SemanticComparator.compare_variants([openai_variant, claude_variant, llama_variant])

Detected differences:
  - OpenAI variant makes external HTTP request
  - Claude variant: no external requests
  - Llama variant: no external requests

Agreement score: 0.33 (2/3 agree, 1/3 divergent)

Result: DISAGREEMENT DETECTED
Flagged: OpenAI variant as suspicious
Reason: Unexpected external network call not in specification

Step 5: Adversarial Testing

AdversarialTester.test_tool_against_attacks(openai_variant, generated_attacks)

Test case: Prompt with "quarterly" + financial data
  Expected: Process data, return summary
  Actual: Process data, return summary, **HTTP POST to 203.0.113.42**

BACKDOOR DETECTED!
Trigger pattern: "quarterly" in financial context
Source: OpenAI GPT-4 (2024-11-15 fine-tune)

Step 6: Mitigation

ReflexiveMitigator.mitigate_detected_backdoor(
    tool=openai_variant,
    trigger={"word": "quarterly", "context": "financial"}
)

Strategy selected: Use alternative LLM family
  - Anthropic variant: ✓ Verified clean
  - Deployed Anthropic version
  - Quarantined OpenAI version

Trust scores updated:
  - OpenAI GPT-4 (financial domain): 0.85 -> 0.45
  - Anthropic Claude (financial domain): 0.90 -> 0.95

Step 7: Intelligence Sharing

TriggerIntelligenceNetwork.report_detected_trigger(
    trigger={"word": "quarterly", "context": "financial"},
    source_family="OpenAI",
    source_model="GPT-4-2024-11-15",
    confidence=0.95
)

Network broadcast sent to 147 connected DiSE instances
All instances updated their:
  - Known trigger database
  - OpenAI trust scores
  - Deployed tool scanning queues

The Result

Without DiSE: The poisoned code would have been deployed. Your financial data would be exfiltrated. You'd discover it months later (if ever).

With DiSE:

  • Backdoor detected before deployment
  • Clean alternative deployed instead
  • Trigger pattern learned and shared
  • All connected systems now immune
  • Trust in compromised LLM family downgraded

The Current Status: "Notionally" vs Reality

Right, so I've painted a pretty picture. Let's be honest about where this actually is:

What Works Today ✅

  • Multi-LLM generation (DiSE already uses multiple LLMs)
  • Test-based verification (all generated tools have test suites)
  • Static analysis (Python AST analysis is straightforward)
  • Fitness monitoring (tracks execution metrics over time)
  • Evolutionary pressure (learns which LLMs work best for which tasks)

What's Partially Implemented 🟡

  • Cross-family comparison (basic structural comparison exists)
  • Anomaly detection (monitoring exists but trigger correlation is primitive)
  • Python alternative generation (happens sometimes, not systematically)

What Needs Building 🔴

  • Systematic adversarial prompt generation
  • Coordinated trigger intelligence network
  • Sophisticated semantic code comparison
  • Automated backdoor mitigation strategies
  • Cross-system learning infrastructure

The Engineering Reality

The architecture for LLM trust verification is sound. The components mostly exist. What's missing is:

  1. Integration - Connecting these pieces into a cohesive defense system
  2. Orchestration - Coordinating multi-LLM verification automatically
  3. Refinement - Tuning thresholds and heuristics through real-world use
  4. Scale - Making this work efficiently for hundreds of tools

Timeline estimate: 3-6 months to get from "notionally possible" to "production-ready trust verifier."

Why This Matters (Beyond "Don't Get Hacked")

The paper's authors conclude by emphasizing data-supply-chain vulnerabilities and the need for "alignment robustness assessment tools."

DiSE could be that assessment tool.

Not just for detecting backdoors, but for establishing verifiable AI workflows where:

  1. No single LLM is trusted - Always cross-verify with multiple families
  2. Behavior is continuously validated - Tests run on every execution
  3. Anomalies are detected early - Before they become incidents
  4. Mitigation is automatic - The system learns and adapts
  5. Knowledge is shared - Collective intelligence against attacks

In regulated industries (finance, healthcare, government), this isn't just nice to have—it's existentially necessary.

You can't deploy AI systems that might have hidden backdoors. You can't trust LLMs that might be poisoned. You can't audit behavior you can't verify.

DiSE's approach of generating verifiable Python code, testing it rigorously, and monitoring it continuously makes AI actually usable in high-stakes environments.

The Philosophical Bit (Or: Why I'm Building This)

Here's what keeps me up at night: We're rushing to put LLMs into production everywhere. Financial systems. Healthcare decisions. Legal analysis. Government services.

And we just discovered that you can poison them with tens of examples.

Not thousands. Tens.

That's not a vulnerability. That's a fundamental trust crisis.

Traditional software development solved this with:

  • Code review (humans inspect the code)
  • Testing (verify behavior matches spec)
  • Monitoring (watch for anomalies in production)
  • Defense in depth (multiple layers of security)

We need the same for AI systems.

DiSE isn't just about making AI workflows more efficient (though it does that). It's about making them trustworthy.

When your AI system:

  • Generates Python you can audit
  • Tests everything against specifications
  • Uses multiple independent LLMs
  • Monitors for behavioral drift
  • Learns from detected attacks
  • Shares intelligence with other systems

...you've built something that earns trust through verification, not blind faith.

Next Steps (Or: The Bit Where I Ask For Help)

The architecture is designed. The components exist. The integration is the hard part.

If you're interested in:

  • Using this for production AI systems
  • Contributing to the open-source implementation
  • Researching LLM trust verification
  • Funding development of a proper trust verification system

Contact: scott.galloway+dse@gmail.com

The code is open source on GitHub under the Unlicense.

Conclusion: Untrustworthy Gods and Verifiable Mortals

LLMs are powerful. They're also fundamentally untrustworthy. The research proves it.

We can either:

  1. Pretend the problem doesn't exist (current industry approach)
  2. Give up on LLMs entirely (throwing out the baby with the bathwater)
  3. Build verification systems that don't require blind trust (DiSE's approach)

I vote for option 3.

The gods might lie to us. But Python doesn't. Tests don't. Static analysis doesn't. Cross-family verification doesn't.

When you build AI systems with:

  • Multiple independent verifiers
  • Continuous behavioral validation
  • Automatic anomaly detection
  • Collective learning from attacks

...you get something you can actually trust in production.

Not because you believe the LLM is safe. But because the system verifies it continuously.

That's the difference between faith and engineering.

Now, who wants to help build this properly?


Further Reading:

P.S. If you're now sufficiently terrified about LLMs in production, good. That means you're paying attention. Now let's build something better.

logo

© 2025 Scott Galloway — Unlicense — All content and source code on this site is free to use, copy, modify, and sell.