Parts 1-3 described Constrained Fuzziness as an abstract pattern. This article applies those patterns to a working image analysis pipeline that demonstrates the principles in action.
NOTE: Still tuning the system, but there's now a desktop version as well as the CLI. It works pretty well, though there are still some edges to smooth out.
This article serves multiple purposes. Navigate to whatever interests you.

ImageSummarizer is a RAG ingestion pipeline for images that extracts structured metadata, text, captions, and visual signals using a wave-based architecture. The system escalates from fast local analysis (Florence-2 ONNX) to Vision LLMs only when needed.
Key principles:
ImageSummarizer demonstrates that multimodal LLMs can be used without surrendering determinism. The core rule: probability proposes, determinism persists.
Design rules
- Models never consume other models' prose
- Natural language is never state
- Escalation is deterministic thresholds
- Every output carries confidence + provenance
The pipeline extracts structured metadata from images for RAG systems. Given any image or animated GIF, it produces structured metadata: captions, extracted text, and visual signals.
The key word is structured. Every output has confidence scores, source attribution, and evidence pointers. No model is the sole source of truth.
Deep Dive: The OCR pipeline is complex enough to warrant its own article. See Part 4.1: The Three-Tier OCR Pipeline for the full technical breakdown including EAST, CRAFT, Real-ESRGAN, CLIP, and filmstrip optimization.
The system uses a three-tier escalation strategy for text extraction:
| Tier | Method | Speed | Cost | Best For |
|---|---|---|---|---|
| 1 | Tesseract | ~50ms | Free | Clean, high-contrast text |
| 2 | Florence-2 ONNX | ~200ms | Free | Stylized fonts, no API costs |
| 3 | Vision LLM | ~1-5s | $0.001-0.01 | Complex/garbled text |
ONNX text detection (EAST, CRAFT, ~20-30ms) runs first and determines the optimal path.
Result: ~1.16GB of local ONNX models that handle 85%+ of images without API costs.
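In code, that routing is nothing more exotic than a threshold check. A minimal sketch of the idea; the inputs and cut-off values here are assumptions for illustration, not the shipped configuration:

```csharp
// Illustrative tier router: deterministic thresholds decide where OCR starts.
// The inputs and cut-off values are assumptions for this sketch.
public enum OcrTier { Tesseract = 1, Florence2 = 2, VisionLlm = 3 }

public static class OcrTierRouter
{
    public static OcrTier ChooseStartingTier(double textLikeliness, double textContrast, bool isAnimated)
    {
        // Clean, high-contrast static text: the ~50ms path is usually enough.
        if (textLikeliness > 0.7 && textContrast > 0.6 && !isAnimated)
            return OcrTier.Tesseract;

        // Stylized fonts or animated frames: local Florence-2 ONNX, still free.
        if (textLikeliness > 0.3)
            return OcrTier.Florence2;

        // Otherwise start cheap; Tier 3 is reached only by escalation after a
        // quality gate fails, never as a default.
        return OcrTier.Tesseract;
    }
}
```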

$ imagesummarizer demo-images/cat_wag.gif --pipeline caption --output text
Caption: A cat is sitting on a white couch.
Scene: indoor
Motion: MODERATE object_motion motion (partial coverage)
Motion phrases are only emitted when backed by optical flow measurements and frame deltas; otherwise the system falls back to neutral descriptors ("subtle motion", "camera movement", "object shifts").
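The mapping from measured motion to words is a lookup, not a judgment call. A sketch of that kind of mapping; the band boundaries are assumptions for illustration:

```csharp
// Sketch of the deterministic mapping from measured optical flow to a motion
// phrase. Band boundaries are assumptions; the point is that no phrase is
// emitted without a measurement behind it.
public static string DescribeMotion(double meanFlowMagnitude, double coverageRatio)
{
    // Below the noise floor: fall back to a neutral descriptor.
    if (meanFlowMagnitude < 0.5)
        return "subtle motion";

    string intensity = meanFlowMagnitude < 2.0 ? "SUBTLE"
                     : meanFlowMagnitude < 5.0 ? "MODERATE"
                     : "STRONG";

    string coverage = coverageRatio < 0.25 ? "localized coverage"
                    : coverageRatio < 0.75 ? "partial coverage"
                    : "full-frame coverage";

    return $"{intensity} object_motion motion ({coverage})";
}
```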

$ imagesummarizer demo-images/anchorman-not-even-mad.gif --pipeline caption --output text
"I'm not even mad."
"That's amazing."
Caption: A person wearing grey turtleneck sweater with neutral expression
Scene: meme
Motion: SUBTLE general motion (localized coverage)
The subtitle-aware frame deduplication detects text changes in the bottom 25% of frames, weighting bright pixels (white/yellow text) more heavily.
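A sketch of what that change detection can look like. The 25% window comes from the text above; the brightness weighting and comparison are illustrative, and the frames are assumed to be decoded to the same canvas size:

```csharp
using System;
using SixLabors.ImageSharp;
using SixLabors.ImageSharp.PixelFormats;

// Sketch of a subtitle-change score between two frames: only the bottom 25%
// is inspected, and bright (near-white/yellow) pixels count more, since that's
// where burned-in subtitles live. Thresholds are illustrative assumptions.
public static class SubtitleDiff
{
    public static double ChangeScore(Image<Rgba32> previous, Image<Rgba32> current)
    {
        int startY = (int)(current.Height * 0.75);   // bottom 25% only
        double weightedDiff = 0, totalWeight = 0;

        for (int y = startY; y < current.Height; y++)
        {
            for (int x = 0; x < current.Width; x++)
            {
                Rgba32 a = previous[x, y];
                Rgba32 b = current[x, y];

                // Bright pixels (likely subtitle text) get extra weight.
                double brightness = (b.R + b.G + b.B) / (3.0 * 255.0);
                double weight = brightness > 0.8 ? 4.0 : 1.0;

                double delta = (Math.Abs(a.R - b.R) + Math.Abs(a.G - b.G) + Math.Abs(a.B - b.B)) / (3.0 * 255.0);
                weightedDiff += weight * delta;
                totalWeight += weight;
            }
        }

        return totalWeight > 0 ? weightedDiff / totalWeight : 0;   // 0 = identical, 1 = maximal change
    }
}
```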
For animated GIFs with subtitles, the tool creates horizontal frame strips for Vision LLM analysis. Three modes target different use cases:
Text-Only Strip (NEW! - 30× token reduction):
The most efficient mode extracts only text bounding boxes, dramatically reducing token costs:

$ imagesummarizer export-strip demo-images/anchorman-not-even-mad.gif --mode text-only
Detecting subtitle regions (bottom 30%)...
Found 2 unique text segments
Saved text-only strip to: anchorman-not-even-mad_textonly_strip.png
Dimensions: 253×105 (83% token reduction)
| Approach | Dimensions | Tokens | Cost |
|---|---|---|---|
| Full frames (10) | 3000×185 | ~1500 | High |
| OCR strip (2 frames) | 600×185 | ~300 | Medium |
| Text-only strip | 253×105 | ~50 | Low |
How it works: OpenCV detects subtitle regions (bottom 30%), thresholds bright pixels (white/yellow text), extracts tight bounding boxes, and deduplicates based on text changes. The Vision LLM receives only the text regions, preserving all subtitle content while eliminating background pixels.
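The bounding-box step is simple enough to sketch. The shipped pipeline uses OpenCV; the version below uses ImageSharp purely to keep the example self-contained, and the brightness threshold is an assumption:

```csharp
using System;
using SixLabors.ImageSharp;
using SixLabors.ImageSharp.PixelFormats;
using SixLabors.ImageSharp.Processing;

// Sketch of the "tight bounding box" step: scan the bottom 30% of a frame,
// mark bright pixels (subtitle text is typically white or yellow), and crop to
// the smallest rectangle containing them.
public static class TextRegionExtractor
{
    public static Image<Rgba32>? ExtractTextRegion(Image<Rgba32> frame, byte brightnessThreshold = 200)
    {
        int startY = (int)(frame.Height * 0.70);          // bottom 30%
        int minX = frame.Width, minY = frame.Height, maxX = -1, maxY = -1;

        for (int y = startY; y < frame.Height; y++)
        {
            for (int x = 0; x < frame.Width; x++)
            {
                Rgba32 p = frame[x, y];
                bool bright = p.R >= brightnessThreshold && p.G >= brightnessThreshold; // white or yellow
                if (!bright) continue;

                minX = Math.Min(minX, x); maxX = Math.Max(maxX, x);
                minY = Math.Min(minY, y); maxY = Math.Max(maxY, y);
            }
        }

        if (maxX < 0) return null;                        // no text-like pixels found

        var box = new Rectangle(minX, minY, maxX - minX + 1, maxY - minY + 1);
        return frame.Clone(ctx => ctx.Crop(box));
    }
}
```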
OCR Mode Strip (text changes only - 93 frames reduced to 2 frames):

$ imagesummarizer export-strip demo-images/anchorman-not-even-mad.gif --mode ocr
Deduplicating 93 frames (OCR mode - text changes only)...
Reduced to 2 unique text frames
Saved ocr strip to: anchorman-not-even-mad_ocr_strip.png
Dimensions: 600x185 (2 frames)
Motion Mode Strip (keyframes for motion inference):

$ imagesummarizer export-strip demo-images/cat_wag.gif --mode motion --max-frames 6
Extracting 6 keyframes from 9 frames (motion mode)...
Extracted 6 keyframes for motion inference
Saved motion strip to: cat_wag_motion_strip.png
Dimensions: 3000x280 (6 frames)
This allows Vision LLMs to read all subtitle text in a single API call, dramatically improving accuracy for memes and captioned content while minimizing token usage.
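Assembling the strip itself is plain image composition. A minimal ImageSharp sketch, assuming the frames have already been selected and share a common height:

```csharp
using System.Collections.Generic;
using System.Linq;
using SixLabors.ImageSharp;
using SixLabors.ImageSharp.PixelFormats;
using SixLabors.ImageSharp.Processing;

// Sketch: lay deduplicated frames side by side into one horizontal strip so a
// Vision LLM can read every subtitle in a single request.
public static class StripBuilder
{
    public static Image<Rgba32> BuildHorizontalStrip(IReadOnlyList<Image<Rgba32>> frames)
    {
        int width = frames.Sum(f => f.Width);
        int height = frames.Max(f => f.Height);

        var strip = new Image<Rgba32>(width, height);
        int offsetX = 0;

        foreach (var frame in frames)
        {
            var location = new Point(offsetX, 0);
            strip.Mutate(ctx => ctx.DrawImage(frame, location, 1f));   // 1f = fully opaque
            offsetX += frame.Width;
        }

        return strip;
    }
}
```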
This beats "just caption it with a frontier model" for the same reason an X-ray beats narration: the model is never asked to fill gaps. It receives a closed ledger-measured colors, tracked motion, deduped subtitle frames, OCR confidence-and only renders what the substrate already contains. When GPT-4o captions an image, it's guessing. When ImageSummarizer does, it's summarizing signals that already exist.
The system uses a wave-based pipeline where each wave is an independent analyzer that produces typed signals. Waves execute in priority order (lower number runs first), and later waves can read signals from earlier ones.
Execution order: Wave 10 runs before Wave 50 runs before Wave 80. Lower priority numbers execute earlier in the pipeline.
flowchart TB
subgraph Wave10["Wave 10: Foundational Signals"]
W1[IdentityWave - Format, dimensions]
W2[ColorWave - Palette, saturation]
end
subgraph Wave40["Wave 40: Text Detection"]
W9[TextLikelinessWave - OpenCV EAST/CRAFT]
end
subgraph Wave50["Wave 50: Traditional OCR"]
W3[OcrWave - Tesseract]
end
subgraph Wave51["Wave 51: ML OCR"]
W8[MlOcrWave - Florence-2 ONNX]
end
subgraph Wave55["Wave 55: ML Captioning"]
W10[Florence2Wave - Local captions]
end
subgraph Wave58["Wave 58: Quality Gate"]
W5[OcrQualityWave - Escalation decision]
end
subgraph Wave70["Wave 70: Embeddings"]
W7[ClipEmbeddingWave - Semantic vectors]
end
subgraph Wave80["Wave 80: Vision LLM"]
W6[VisionLlmWave - Cloud fallback]
end
Wave10 --> Wave40 --> Wave50 --> Wave51 --> Wave55 --> Wave58 --> Wave70 --> Wave80
style Wave10 stroke:#22c55e,stroke-width:2px
style Wave40 stroke:#06b6d4,stroke-width:2px
style Wave50 stroke:#f59e0b,stroke-width:2px
style Wave51 stroke:#8b5cf6,stroke-width:2px
style Wave55 stroke:#8b5cf6,stroke-width:2px
style Wave58 stroke:#ef4444,stroke-width:2px
style Wave70 stroke:#3b82f6,stroke-width:2px
style Wave80 stroke:#8b5cf6,stroke-width:2px
Priority order (lower runs first): 10 → 40 → 50 → 51 → 55 → 58 → 70 → 80
This is Constrained Fuzzy MoM applied to image analysis: multiple proposers publish to a shared substrate (the AnalysisContext), and the final output aggregates their signals.
Note on wave ordering: The three OCR tiers (Tesseract/Florence-2/Vision LLM) are the conceptual escalation levels. Individual waves such as ML OCR (Wave 51) or the Quality Gate (Wave 58) are refinements within those tiers, not separate escalation levels: they perform temporal stabilization and quality checks, respectively.
Every wave produces signals using a standardized contract:
public record Signal
{
public required string Key { get; init; } // "color.dominant", "ocr.quality.is_garbled"
public object? Value { get; init; } // The measured value
public double Confidence { get; init; } = 1.0; // 0.0-1.0 reliability score
public required string Source { get; init; } // "ColorWave", "VisionLlmWave"
public DateTime Timestamp { get; init; } // When produced
public List<string>? Tags { get; init; } // "visual", "ocr", "quality"
public Dictionary<string, object>? Metadata { get; init; } // Additional context
}
This is the Part 2 signal contract in action. Waves do not talk to each other via natural language. They publish typed signals to the shared context, and downstream waves can query those signals.
Note that Confidence is per-signal, not per-wave. A single wave can emit multiple signals with different epistemic strength: ColorWave's dominant color list has confidence 1.0 (computed), but individual color percentages use confidence as a weighting factor for downstream summarisation.
Confidence here means reliability for downstream use, not mathematical certainty. Deterministic signals are reproducible, not infallible: spell-check can be deterministically wrong about proper nouns.
Determinism caveat: "Deterministic" means no sampling randomness and stable results for a given runtime and configuration. ONNX GPU execution providers may introduce minor numerical variance, which is acceptable for routing decisions. The signal contract (thresholds, escalation logic) remains fully deterministic.
To avoid confusion, here's the canonical OCR signal namespace used throughout the system:
| Signal Key | Source | Description |
|---|---|---|
| ocr.text | Tesseract (Tier 1) | Raw single-frame OCR |
| ocr.confidence | Tesseract | Tesseract confidence score |
| ocr.ml.text | Florence-2 (Tier 2) | ML OCR single-frame |
| ocr.ml.multiframe_text | Florence-2 (Tier 2) | Multi-frame GIF OCR (preferred for animations) |
| ocr.ml.confidence | Florence-2 | Florence-2 confidence score |
| ocr.quality.spell_check_score | OcrQualityWave | Deterministic spell-check ratio |
| ocr.quality.is_garbled | OcrQualityWave | Boolean escalation signal |
| ocr.vision.text | VisionLlmWave (Tier 3) | Vision LLM OCR extraction |
| caption.text | VisionLlmWave | Descriptive caption (separate from OCR) |
Important distinction: ocr.vision.text is text extraction (OCR), while caption.text is scene description (captioning). Both may come from the same Vision LLM call, but serve different purposes.
Final text selection priority (highest to lowest):
1. ocr.vision.text (Vision LLM OCR, if escalated)
2. ocr.ml.multiframe_text (Florence-2 GIF)
3. ocr.ml.text (Florence-2 single-frame)
4. ocr.text (Tesseract)

Each wave implements a simple interface:
public interface IAnalysisWave
{
string Name { get; }
int Priority { get; } // Lower number = runs earlier (10 before 50 before 80)
IReadOnlyList<string> Tags { get; }
Task<IEnumerable<Signal>> AnalyzeAsync(
string imagePath,
AnalysisContext context, // Shared substrate with earlier signals
CancellationToken ct);
}
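Given that interface, the runner itself can stay boring: order by priority, execute, publish. A minimal sketch; AddSignal is a hypothetical context method used only here, and the real pipeline also does caching and provenance bookkeeping:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Sketch of a wave runner: strictly ordered by Priority, each wave publishes
// typed signals into the shared context before later waves read them.
// AddSignal is a hypothetical method name used only for this sketch.
public static class WaveRunner
{
    public static async Task<AnalysisContext> RunAsync(
        string imagePath,
        IEnumerable<IAnalysisWave> waves,
        CancellationToken ct)
    {
        var context = new AnalysisContext();

        foreach (var wave in waves.OrderBy(w => w.Priority))   // 10 before 50 before 80
        {
            var signals = await wave.AnalyzeAsync(imagePath, context, ct);
            foreach (var signal in signals)
                context.AddSignal(signal);                     // now visible to later waves
        }

        return context;
    }
}
```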
The AnalysisContext is the consensus space from Part 2. Waves can:
context.GetValue<bool>("ocr.quality.is_garbled")context.GetCached<Image<Rgba32>>("ocr.frames")ColorWave runs first (priority 10) and computes facts that constrain everything else:
public class ColorWave : IAnalysisWave
{
public string Name => "ColorWave";
public int Priority => 10; // Runs first (lowest priority number)
public IReadOnlyList<string> Tags => new[] { "visual", "color" };
public async Task<IEnumerable<Signal>> AnalyzeAsync(
string imagePath,
AnalysisContext context,
CancellationToken ct)
{
var signals = new List<Signal>();
using var image = await LoadImageAsync(imagePath, ct);
// Extract dominant colors (computed, not guessed)
var dominantColors = _colorAnalyzer.ExtractDominantColors(image);
signals.Add(new Signal
{
Key = "color.dominant_colors",
Value = dominantColors,
Confidence = 1.0, // Reproducible measurement
Source = Name,
Tags = new List<string> { "color" }
});
// Individual colors for easy access
for (int i = 0; i < Math.Min(5, dominantColors.Count); i++)
{
var color = dominantColors[i];
signals.Add(new Signal
{
Key = $"color.dominant_{i + 1}",
Value = color.Hex,
Confidence = color.Percentage / 100.0,
Source = Name,
Metadata = new Dictionary<string, object>
{
["name"] = color.Name,
["percentage"] = color.Percentage
}
});
}
// Cache the image for other waves (no need to reload)
context.SetCached("image", image.CloneAs<Rgba32>());
return signals;
}
}
The Vision LLM later receives these colors as constraints. It should not claim the image has "vibrant reds" if ColorWave computed that the dominant color is blue; if it does, the contradiction is detectable and can be rejected downstream.
This is where Constrained Fuzziness shines. OcrQualityWave is the constrainer that decides whether to escalate to expensive Vision LLM:
public class OcrQualityWave : IAnalysisWave
{
public string Name => "OcrQualityWave";
public int Priority => 58; // Runs after OCR waves
public IReadOnlyList<string> Tags => new[] { "content", "ocr", "quality" };
public async Task<IEnumerable<Signal>> AnalyzeAsync(
string imagePath,
AnalysisContext context,
CancellationToken ct)
{
var signals = new List<Signal>();
// Get OCR text from earlier waves (canonical taxonomy)
string? ocrText =
context.GetValue<string>("ocr.ml.multiframe_text") ?? // Florence-2 GIF
context.GetValue<string>("ocr.ml.text") ?? // Florence-2 single
context.GetValue<string>("ocr.text"); // Tesseract
if (string.IsNullOrWhiteSpace(ocrText))
{
signals.Add(new Signal
{
Key = "ocr.quality.no_text",
Value = true,
Confidence = 1.0,
Source = Name
});
return signals;
}
// Tier 1: Spell check (deterministic, no LLM)
var spellResult = _spellChecker.CheckTextQuality(ocrText);
signals.Add(new Signal
{
Key = "ocr.quality.spell_check_score",
Value = spellResult.CorrectWordsRatio,
Confidence = 1.0,
Source = Name,
Metadata = new Dictionary<string, object>
{
["total_words"] = spellResult.TotalWords,
["correct_words"] = spellResult.CorrectWords
}
});
signals.Add(new Signal
{
Key = "ocr.quality.is_garbled",
Value = spellResult.IsGarbled, // < 50% correct words
Confidence = 1.0,
Source = Name
});
// This signal triggers Vision LLM escalation
if (spellResult.IsGarbled)
{
signals.Add(new Signal
{
Key = "ocr.quality.correction_needed",
Value = true,
Confidence = 1.0,
Source = Name,
Tags = new List<string> { "action_required" },
Metadata = new Dictionary<string, object>
{
["quality_score"] = spellResult.CorrectWordsRatio,
["correction_method"] = "llm_sentinel"
}
});
// Cache for Vision LLM to access
context.SetCached("ocr.garbled_text", ocrText);
}
return signals;
}
}
The escalation decision is deterministic: if spell check score < 50%, emit a signal that triggers Vision LLM. No probabilistic judgment. No "maybe we should ask the LLM". Just a threshold.

$ imagesummarizer demo-images/arse_biscuits.gif --pipeline caption --output text
OCR: "ARSE BISCUITS"
Caption: An elderly man dressed as bishop with text reading "arse biscuits"
Scene: meme
OCR got the text; Vision LLM provided scene context. Each wave contributes what it's good at.
The Vision LLM wave only runs when earlier signals indicate it's needed. And when it does run, it's constrained by computed facts:
public class VisionLlmWave : IAnalysisWave
{
public string Name => "VisionLlmWave";
public int Priority => 80; // Runs after quality assessment (Wave 58)
public IReadOnlyList<string> Tags => new[] { "content", "vision", "llm" };
public async Task<IEnumerable<Signal>> AnalyzeAsync(
string imagePath,
AnalysisContext context,
CancellationToken ct)
{
var signals = new List<Signal>();
if (!Config.EnableVisionLlm)
{
signals.Add(new Signal
{
Key = "vision.llm.disabled",
Value = true,
Confidence = 1.0,
Source = Name
});
return signals;
}
// Check if OCR was unreliable (garbled text)
var ocrGarbled = context.GetValue<bool>("ocr.quality.is_garbled");
var textLikeliness = context.GetValue<double>("content.text_likeliness");
var ocrConfidence = context.GetValue<double>("ocr.ml.confidence",
context.GetValue<double>("ocr.confidence"));
// Only escalate when: OCR failed OR (text likely but low OCR confidence)
// Models never decide paths; deterministic signals do (no autonomy)
bool shouldEscalate = ocrGarbled ||
(textLikeliness > 0.7 && ocrConfidence < 0.5);
if (shouldEscalate)
{
var llmText = await ExtractTextAsync(imagePath, ct);
if (!string.IsNullOrEmpty(llmText))
{
// Emit OCR signal (Vision LLM tier)
signals.Add(new Signal
{
Key = "ocr.vision.text", // Vision LLM OCR extraction
Value = llmText,
Confidence = 0.95, // High but not 1.0 - still probabilistic
Source = Name,
Tags = new List<string> { "ocr", "vision", "llm" },
Metadata = new Dictionary<string, object>
{
["ocr_was_garbled"] = ocrGarbled,
["escalation_reason"] = ocrGarbled ? "quality_gate_failed" : "low_confidence_high_likeliness",
["text_likeliness"] = textLikeliness,
["prior_ocr_confidence"] = ocrConfidence
}
});
// Optionally emit caption (separate signal)
var llmCaption = await GenerateCaptionAsync(imagePath, ct);
if (!string.IsNullOrEmpty(llmCaption))
{
signals.Add(new Signal
{
Key = "caption.text", // Descriptive caption (not OCR)
Value = llmCaption,
Confidence = 0.90,
Source = Name,
Tags = new List<string> { "caption", "description" }
});
}
}
}
return signals;
}
}
The key insight: Vision LLM text has confidence 0.95, not 1.0. It's better than garbled OCR, but it's still probabilistic. The downstream aggregation knows this. (Why 0.95? Default prior, configured per model/pipeline, recorded in config. The exact value matters less than having a value that isn't 1.0.)
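What "recorded in config" might look like: a small options type holding the priors, so the 0.95 is a named, versioned choice rather than a magic number. The type and property names below are hypothetical:

```csharp
// Hypothetical options shape for per-tier confidence priors. The point is that
// the priors live in configuration alongside the rest of the pipeline settings,
// not hard-coded into the waves. The Florence-2 value is an assumed default.
public sealed record ConfidencePriors
{
    public double VisionLlmOcr { get; init; } = 0.95;     // Tier 3 text extraction
    public double VisionLlmCaption { get; init; } = 0.90; // Tier 3 scene description
    public double Florence2Ocr { get; init; } = 0.80;     // assumed default for the sketch
}
```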
The ImageLedger accumulates signals into structured sections for downstream consumption. This is Context Dragging applied to image analysis:
public class ImageLedger
{
public ImageIdentity Identity { get; set; } = new();
public ColorLedger Colors { get; set; } = new();
public TextLedger Text { get; set; } = new();
public MotionLedger? Motion { get; set; }
public QualityLedger Quality { get; set; } = new();
public VisionLedger Vision { get; set; } = new();
public static ImageLedger FromProfile(DynamicImageProfile profile)
{
var ledger = new ImageLedger();
// Text: Priority order - corrected > voting > raw
ledger.Text = new TextLedger
{
ExtractedText =
profile.GetValue<string>("ocr.final.corrected_text") ?? // Tier 2/3 corrections
profile.GetValue<string>("ocr.voting.consensus_text") ?? // Temporal voting
profile.GetValue<string>("ocr.full_text") ?? // Raw OCR
string.Empty,
Confidence = profile.GetValue<double>("ocr.voting.confidence"),
SpellCheckScore = profile.GetValue<double>("ocr.quality.spell_check_score"),
IsGarbled = profile.GetValue<bool>("ocr.quality.is_garbled")
};
// Colors: Computed facts, not guessed
ledger.Colors = new ColorLedger
{
DominantColors = profile.GetValue<List<DominantColor>>("color.dominant_colors") ?? new(),
IsGrayscale = profile.GetValue<bool>("color.is_grayscale"),
MeanSaturation = profile.GetValue<double>("color.mean_saturation")
};
return ledger;
}
public string ToLlmSummary()
{
var parts = new List<string>();
parts.Add($"Format: {Identity.Format}, {Identity.Width}x{Identity.Height}");
if (Colors.DominantColors.Count > 0)
{
var colorList = string.Join(", ",
Colors.DominantColors.Take(5).Select(c => $"{c.Name}({c.Percentage:F0}%)"));
parts.Add($"Colors: {colorList}");
}
if (!string.IsNullOrWhiteSpace(Text.ExtractedText))
{
var preview = Text.ExtractedText.Length > 100
? Text.ExtractedText[..100] + "..."
: Text.ExtractedText;
parts.Add($"Text (OCR, {Text.Confidence:F0}% confident): \"{preview}\"");
}
return string.Join("\n", parts);
}
}
The ledger is the anchor in CFCD terms. It carries forward what survived selection, and the LLM synthesis must respect these facts.
You've seen escalation logic in two places, and the separation is intentional:
- OcrQualityWave emits per-text quality signals (spell-check score, garbled flag)
- EscalationService aggregates signals and applies global thresholds

The EscalationService ties it all together. It implements the Part 1 pattern: substrate → proposer → constrainer:
public class EscalationService
{
private bool ShouldAutoEscalate(ImageProfile profile)
{
// Escalate if type detection confidence is low
if (profile.TypeConfidence < _config.ConfidenceThreshold)
return true;
// Escalate if image is blurry
if (profile.LaplacianVariance < _config.BlurThreshold)
return true;
// Escalate if high text content
if (profile.TextLikeliness >= _config.TextLikelinessThreshold)
return true;
// Escalate for complex diagrams or charts
if (profile.DetectedType is ImageType.Diagram or ImageType.Chart)
return true;
return false;
}
}
Every escalation decision is deterministic: same inputs, same thresholds, same decision. No LLM judgment in the escalation logic.
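The thresholds the service reads are plain configuration. A sketch of the options type it implies; the property names mirror the fields the service reads, but the default values are illustrative, not the shipped defaults:

```csharp
// Options implied by ShouldAutoEscalate. Default values are assumptions for
// illustration; the real defaults live in the pipeline configuration.
public sealed record EscalationOptions
{
    public double ConfidenceThreshold { get; init; } = 0.60;     // type-detection confidence floor
    public double BlurThreshold { get; init; } = 100;            // Laplacian variance floor
    public double TextLikelinessThreshold { get; init; } = 0.70; // high text content triggers escalation
}
```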
When the Vision LLM does run, it receives the computed facts as constraints:
private static string BuildVisionPrompt(ImageProfile profile)
{
var prompt = new StringBuilder();
prompt.AppendLine("CRITICAL CONSTRAINTS:");
prompt.AppendLine("- Only describe what is visually present in the image");
prompt.AppendLine("- Only reference metadata values provided below");
prompt.AppendLine("- Do NOT infer, assume, or guess information not visible");
prompt.AppendLine();
prompt.AppendLine("METADATA SIGNALS (computed from image analysis):");
if (profile.DominantColors?.Any() == true)
{
prompt.Append("Dominant Colors: ");
var colorDescriptions = profile.DominantColors
.Take(3)
.Select(c => $"{c.Name} ({c.Percentage:F0}%)");
prompt.AppendLine(string.Join(", ", colorDescriptions));
if (profile.IsMostlyGrayscale)
prompt.AppendLine(" → Image is mostly grayscale");
}
prompt.AppendLine($"Sharpness: {profile.LaplacianVariance:F0} (Laplacian variance)");
if (profile.LaplacianVariance < 100)
prompt.AppendLine(" → Image is blurry or soft-focused");
prompt.AppendLine($"Detected Type: {profile.DetectedType} (confidence: {profile.TypeConfidence:P0})");
prompt.AppendLine();
prompt.AppendLine("Use these metadata signals to guide your description.");
prompt.AppendLine("Your description should be grounded in observable facts only.");
return prompt.ToString();
}
The Vision LLM should not claim "vibrant colors" if we computed grayscale; if it does, the contradiction is detectable. It should not claim "sharp details" if we computed low Laplacian variance; if it does, we can reject the output. The deterministic substrate constrains the probabilistic output.
These constraints reduce hallucination but cannot eliminate it: prompts are suggestions, not guarantees. Real enforcement happens downstream via confidence weighting and signal selection. The prompt is one layer; the architecture is the other.
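One concrete form of that downstream enforcement: compare the synthesized caption against computed facts and downweight it on contradiction. A sketch; the vocabulary and the penalty are illustrative assumptions:

```csharp
using System;
using System.Linq;

// Sketch of a post-hoc contradiction check: if the ledger says the image is
// grayscale but the caption claims vivid color, the caption's confidence is cut
// rather than trusted. The word list and penalty are assumptions for the sketch.
public static class CaptionValidator
{
    private static readonly string[] ColorClaims =
        { "vibrant", "colorful", "vivid red", "bright blue", "saturated" };

    public static double AdjustConfidence(string caption, bool isGrayscale, double confidence)
    {
        bool claimsColor = ColorClaims.Any(term =>
            caption.Contains(term, StringComparison.OrdinalIgnoreCase));

        if (isGrayscale && claimsColor)
            return Math.Min(confidence, 0.30);   // contradiction: keep the text, distrust it

        return confidence;
    }
}
```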
When extracting the final text, the system uses a strict priority order:
static string? GetExtractedText(DynamicImageProfile profile)
{
// Priority chain using canonical signal names (see OCR Signal Taxonomy above)
// 1. Vision LLM OCR (best for complex/garbled)
// 2. Florence-2 multi-frame GIF (temporal stability)
// 3. Florence-2 single-frame (stylized fonts)
// 4. Tesseract (baseline)
var visionText = profile.GetValue<string>("ocr.vision.text");
if (!string.IsNullOrEmpty(visionText))
return visionText;
var florenceMulti = profile.GetValue<string>("ocr.ml.multiframe_text");
if (!string.IsNullOrEmpty(florenceMulti))
return florenceMulti;
var florenceText = profile.GetValue<string>("ocr.ml.text");
if (!string.IsNullOrEmpty(florenceText))
return florenceText;
return profile.GetValue<string>("ocr.text") ?? string.Empty;
}
Note: This selects ONE source, but the ledger exposes ALL sources with provenance. Downstream consumers can inspect profile.Ledger.Signals to see all OCR attempts and their confidence scores.
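For example, an audit over OCR provenance might look like the following sketch, assuming the Signals collection exposes the Signal record shown earlier:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: list every OCR-tagged signal with its source and confidence, rather
// than trusting only the winner of the priority chain.
static void DumpOcrProvenance(IEnumerable<Signal> signals)
{
    foreach (var signal in signals
                 .Where(s => s.Tags?.Contains("ocr") == true)
                 .OrderByDescending(s => s.Confidence))
    {
        Console.WriteLine($"{signal.Source,-16} conf={signal.Confidence:F2}  {signal.Key} = {signal.Value}");
    }
}
```

Called with the ledger's signals, this makes every OCR attempt, its source, and its confidence visible in one place.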
Each source has a known reliability profile.
The priority order encodes this knowledge. Florence-2 sitting between Vision LLM and Tesseract provides a "sweet spot" for most images—better than traditional OCR, cheaper than cloud Vision LLMs.
The priority order is a sensible default, not a straitjacket: downstream consumers can (and should) inspect provenance when the domain requires it.
This discipline punishes sloppy thinking.
This is not the fast path. It's the reliable path. Worth it if you need auditable image understanding at scale; overkill if you just need captions for a photo gallery.
| Failure Mode | What Happens | How It's Handled |
|---|---|---|
| Noisy GIF | Frame jitter, compression artifacts | Temporal stabilisation + SSIM deduplication + voting consensus |
| OCR returns garbage | Tesseract fails on stylized fonts | Spell-check gate detects < 50% correct → escalates to Vision LLM |
| Vision hallucinates | LLM claims text that isn't there | Signals enable contradiction detection (pattern shown above); downstream consumers can compare ocr.vision.text vs content.text_likeliness |
| Pipeline changes over time | New waves added, thresholds adjusted | Content-hash caching + full provenance in every signal |
| Model returns nothing | Vision LLM timeout or empty response | Fallback chain continues to OCR sources; GetExtractedText returns next priority source |
Every failure mode has a deterministic response. No silent degradation.
The architecture has structure: every wave is independent, every signal is typed, every escalation is deterministic. The Vision LLM is powerful, but it never operates unconstrained.
If you can do this for images-the messiest input type, with OCR noise, stylized fonts, animated frames, and hallucination-prone captions-you can do it for any probabilistic component.
That's Constrained Fuzziness in practice. Not an abstract pattern. Working code.
| Part | Pattern | Axis |
|---|---|---|
| 1 | Constrained Fuzziness | Single component |
| 2 | Constrained Fuzzy MoM | Multiple components |
| 3 | Context Dragging | Time / memory |
| 4 | Image Intelligence (this article) | Practical implementation |
All four parts follow the same invariant: probabilistic components propose; deterministic systems persist.