Thursday, 15 January 2026
Status: In development as part of lucidRAG. Source: github.com/scottgal/lucidrag
Where this fits: VideoSummarizer is the orchestrator of the lucidRAG family, combining five pipelines into a unified video analysis engine:
All follow the same Reduced RAG pattern: extract signals once, store evidence, synthesize with bounded LLM input.
Processing a two-hour movie frame-by-frame with CLIP embeddings would take hours and cost hundreds of dollars in compute. The naive approach (send every frame to a Vision LLM) is even worse.
VideoSummarizer solves this with three key optimizations:
The result: a 2-hour movie processes in ~10-15 minutes, not hours. CLIP is used for fast, local semantic embeddings; Vision LLMs are optional and only invoked for escalated analysis.
Same architecture principles as ImageSummarizer and AudioSummarizer, but composed into a unified video analysis pipeline.
Core insight: Video is shots + audio + text. Process each domain with specialized tools, merge results into coherent scenes.
- Process structure first (cuts, I-frames, audio segments)
- Extract cross-modal signals once (embeddings, transcripts, entities)
Terminology:
(start_time, end_time) + signals + pointers + provenance

Key ML models used:
This article covers:
Related articles:
Benchmarks: Numbers below measured on AMD 9950X (16-core) / NVIDIA A4000 (16GB) / 96GB RAM / NVMe, 1080p H.264, Whisper base. These numbers show order-of-magnitude effects, not absolute performance guarantees.
A typical movie contains:
The strawman approach (nobody does this, but it sets the scale):
Even with keyframe extraction (say, 500-1000 frames), that's still 100-200 seconds of serial CLIP inference.
Traditional approach: "Extract keyframes, send to Vision LLM, hope for the best"
Problem: This burns compute on redundant frames (many keyframes are visually similar), processes them serially (GPU idle between frames), and misses audio/text signals entirely.
Solution: Multi-stage filtering, batch processing, and pipeline composition.
VideoSummarizer implements Reduced RAG for video with a three-stage reduction:
flowchart TB
subgraph Input["Video File (.mp4, .mkv, etc.)"]
VF[Video File]
end
subgraph Stage1["Stage 1: Structural Analysis"]
N[NormalizeWave<br/>FFprobe metadata]
SD[ShotDetectionWave<br/>Scene cuts via FFmpeg]
KE[KeyframeExtractionWave<br/>I-frame + dedup]
AE[AudioExtractionWave<br/>FFmpeg audio extract]
end
subgraph Stage2["Stage 2: Content Extraction"]
IS[ImageSummarizer<br/>CLIP, OCR, Vision]
AS[AudioSummarizer<br/>Whisper, Diarization]
NER[NER Service<br/>Entity extraction]
DS[DocSummarizer<br/>Text analysis if long]
GR[GraphRAG<br/>Entity relationships]
end
subgraph Stage3["Stage 3: Scene Assembly"]
SC[SceneClusteringWave<br/>Multi-signal clustering]
EV[EvidenceGenerationWave<br/>RAG chunks]
end
VF --> N
N --> SD --> KE
N --> AE
KE --> IS
AE --> AS
IS --> NER
AS --> NER
NER --> DS
NER --> GR
IS --> SC
DS --> SC
GR --> EV
SC --> EV
style Stage1 stroke:#22c55e,stroke-width:2px
style Stage2 stroke:#3b82f6,stroke-width:2px
style Stage3 stroke:#8b5cf6,stroke-width:2px
Before diving into implementation, here's what you get (the output schema):
| Artifact | Key Fields | Source |
|---|---|---|
| Scene | id, start_time, end_time, key_terms[], speaker_ids[], embedding[512] | SceneClusteringWave |
| Shot | id, start_time, end_time, cut_type, keyframe_path | ShotDetectionWave |
| Utterance | id, text, start_time, end_time, speaker_id, confidence | TranscriptionWave |
| TextTrack | id, text, start_time, text_type (title/credit/subtitle/ocr) | SubtitleExtractionWave |
| Keyframe | id, timestamp, frame_path, dhash, clip_embedding[512] | KeyframeExtractionWave |
Each artifact includes provenance: source wave, processing timestamp, confidence score. These artifacts are the primary retrieval units for video queries; raw files are never retrieved directly.
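For orientation, here's roughly what the Scene artifact could look like as a C# record. The field names follow the table above, but the types (and the SceneSegment class used later in the pipeline code) are assumptions, not the repo's exact definition:

```csharp
// Illustrative shape only; field names mirror the schema table, types are assumptions.
public record Scene
{
    public required Guid Id { get; init; }
    public required double StartTime { get; init; }    // seconds from video start
    public required double EndTime { get; init; }
    public List<string> KeyTerms { get; init; } = [];
    public List<string> SpeakerIds { get; init; } = [];
    public float[]? Embedding { get; init; }            // 512-dim CLIP scene centroid

    // Provenance carried by every artifact
    public required string SourceWave { get; init; }    // e.g. "SceneClusteringWave"
    public DateTime ProcessedAt { get; init; }
    public double Confidence { get; init; } = 1.0;
}
```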
VideoSummarizer uses a signal-based wave architecture where each wave declares its signal contracts explicitly:
public interface ISignalAwareVideoWave
{
/// <summary>Signals this wave requires before it can run.</summary>
IReadOnlyList<string> RequiredSignals { get; }
/// <summary>Signals this wave can optionally use if available.</summary>
IReadOnlyList<string> OptionalSignals { get; }
/// <summary>Signals this wave emits on successful completion.</summary>
IReadOnlyList<string> EmittedSignals { get; }
/// <summary>Cache keys this wave produces for downstream waves.</summary>
IReadOnlyList<string> CacheEmits { get; }
/// <summary>Cache keys this wave consumes from upstream waves.</summary>
IReadOnlyList<string> CacheUses { get; }
}
This enables dynamic wave coordination:
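As a rough illustration (not the actual coordinator code), the declared contracts let the orchestrator compute which waves are runnable at any point:

```csharp
// Sketch only: gate waves on their declared RequiredSignals. The real
// coordinator also applies priorities, fallback chains, and backpressure.
static bool CanRun(ISignalAwareVideoWave wave, IReadOnlySet<string> emitted) =>
    wave.RequiredSignals.All(emitted.Contains);

static IEnumerable<ISignalAwareVideoWave> ReadyWaves(
    IEnumerable<ISignalAwareVideoWave> pending, IReadOnlySet<string> emitted) =>
    pending.Where(w => CanRun(w, emitted));

// Waves whose required signals are all satisfied can run concurrently;
// OptionalSignals only enrich the result if they happen to be available.
```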
Keyframe extraction alone is implemented as 7 granular waves for better parallelism and cache efficiency; the full wave schedule (including audio, transcription, and scene assembly) is:
| Wave | Priority | Requires | Emits | Time |
|---|---|---|---|---|
| NormalizeWave | 1000 | - | video.duration, video.fps, video.normalized | ~2s |
| FFmpegShotDetectionWave | 900 | video.normalized | shots.detected, shots.count | ~5-10s |
| IFrameDetectionWave | 850 | video.normalized | keyframes.iframes_detected, keyframes.iframes_count | ~3s |
| KeyframeSelectionWave | 840 | shots.detected, keyframes.iframes_detected | keyframes.selected, keyframes.selected_count | ~1s |
| ThumbnailExtractionWave | 830 | keyframes.selected | keyframes.thumbnails_extracted | ~5s |
| KeyframeDeduplicationWave | 820 | keyframes.thumbnails_extracted | keyframes.deduplicated, keyframes.duplicates_skipped | ~1s |
| KeyframeFullResExtractionWave | 810 | keyframes.deduplicated | keyframes.extracted, keyframes.count | ~10s |
| ClipEmbeddingWave | 800 | keyframes.extracted | clip.embeddings_ready, clip.embeddings_count | ~30s |
| ImageAnalysisWave | 790 | keyframes.deduplicated | keyframes.analyzed, ocr.extracted | ~60s |
| TitleCreditsDetectionWave | 750 | shots.detected | title.detected, credits.detected | ~5s |
| AudioExtractionWave | 650 | video.normalized | audio.extracted, audio.path | ~30s |
| TranscriptionWave | 600 | audio.extracted | transcription.complete, transcription.utterance_count | ~120s |
| SubtitleExtractionWave | 550 | video.normalized | subtitles.extracted | ~2s |
| ChapterExtractionWave | 500 | video.normalized | chapters.extracted | ~1s |
| SceneClusteringWave | 400 | shots.detected | scenes.detected, scene.count | ~5s |
| EvidenceGenerationWave | 100 | scenes.detected | evidence.generated | ~2s |
Notes:
- keyframes.deduplicated (not full-res): OCR runs on thumbnails; vision captioning uses full-res when available via capability routing.

Total for 2-hour movie: ~10-15 minutes (vs. hours without optimization)
Signals are defined as constants for consistency:
public static class VideoSignals
{
// NormalizeWave signals
public const string VideoDuration = "video.duration";
public const string VideoFps = "video.fps";
public const string VideoNormalized = "video.normalized";
// Shot detection signals
public const string ShotsDetected = "shots.detected";
public const string ShotsCount = "shots.count";
// Keyframe signals
public const string IframesDetected = "keyframes.iframes_detected";
public const string KeyframesSelected = "keyframes.selected";
public const string KeyframesDeduplicated = "keyframes.deduplicated";
public const string KeyframesExtracted = "keyframes.extracted";
// CLIP embedding signals
public const string ClipEmbeddingsReady = "clip.embeddings_ready";
// Scene clustering signals
public const string ScenesDetected = "scenes.detected";
public const string SceneCount = "scene.count";
// Transcription signals
public const string TranscriptionComplete = "transcription.complete";
}
VideoSummarizer uses a capability-based architecture: detect GPU once at startup, download models lazily, route work to available components.
Models are defined in models.yaml. No magic strings in code:
# models.yaml (excerpt)
models:
clip-vit-b32:
name: "CLIP ViT-B/32"
download_url: "https://huggingface.co/openai/clip-vit-base-patch32/resolve/main/onnx/visual_model.onnx"
preferred_providers: [CUDAExecutionProvider, DmlExecutionProvider, CPUExecutionProvider]
components:
ClipEmbeddingWave:
models: [clip-vit-b32]
fallback_chain: [ImageAnalysisWave]
// Type-safe constants (no raw strings)
await coordinator.EnsureModelAsync(ModelIds.ClipVitB32);
await coordinator.ActivateWaveAsync(ComponentIds.TranscriptionWave);
// Route with fallback
var route = await coordinator.RouteWorkAsync(new[]
{
ComponentIds.ClipEmbeddingWave, // Primary (GPU)
ComponentIds.ImageAnalysisWave // Fallback (CPU)
});
Rate limiting, time estimation, and adaptive backpressure keep the UI responsive while maximizing throughput:
// Time estimation from actual data
var estimator = CapabilityAtoms.CreateTimeEstimator();
using (estimator.Time("clip_embedding")) { await ProcessAsync(); }
var eta = estimator.GetEstimate("clip_embedding", remaining: 50);
// eta.Estimated, eta.Optimistic, eta.Pessimistic, eta.Confidence
Full capability system docs: see Mostlylucid.Summarizer.Core/Capabilities/ for GPU detection, signal pub/sub, backpressure controllers, and mesh topology design.
Before running expensive CLIP embeddings, VideoSummarizer filters out visually similar frames using difference hash (dHash).
public class KeyframeDeduplicationService
{
// dHash parameters: 9x8 grayscale = 64 bits
private const int HashWidth = 9;
private const int HashHeight = 8;
private const int DefaultHammingThreshold = 10;
public async Task<ulong> ComputeDHashAsync(string imagePath, CancellationToken ct)
{
using var image = Image.Load<Rgba32>(imagePath);
// Resize to 9x8 (one extra column for gradient comparison)
image.Mutate(x => x
.Resize(HashWidth, HashHeight)
.Grayscale());
ulong hash = 0;
int bit = 0;
// Compare adjacent pixels horizontally
for (int y = 0; y < HashHeight; y++)
{
for (int x = 0; x < HashWidth - 1; x++)
{
var left = image[x, y].R;
var right = image[x + 1, y].R;
// Set bit if left pixel is brighter than right
if (left > right)
{
hash |= (1UL << bit);
}
bit++;
}
}
return hash;
}
public static int HammingDistance(ulong a, ulong b) =>
BitOperations.PopCount(a ^ b);
}
Example output:
Input: 50 keyframe candidates (from codec I-frames)
Deduplication (Hamming threshold 10):
Frame 0: hash=0x8f3a2c1d → KEEP (first frame)
Frame 1: hash=0x8f3a2c1e → SKIP (distance=1 from frame 0)
Frame 2: hash=0x8f3a2c1f → SKIP (distance=2 from frame 0)
Frame 3: hash=0xc7e1b4a2 → KEEP (distance=28 from frame 0)
...
Result: 50 → 30 frames (40% reduction)
Processing saved: ~8 seconds of CLIP inference
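A minimal sketch of how that filter might be applied over the candidate frames (the loop structure and variable names are assumed; the production wave also emits keyframes.duplicates_skipped):

```csharp
// Keep a frame only if it is far enough (Hamming distance) from every frame kept so far.
var kept = new List<(string Path, ulong Hash)>();
foreach (var path in candidateFramePaths)
{
    var hash = await dedupService.ComputeDHashAsync(path, ct);

    bool isDuplicate = kept.Any(k =>
        KeyframeDeduplicationService.HammingDistance(k.Hash, hash) <= 10);

    if (!isDuplicate)
        kept.Add((path, hash));
}
// kept frames proceed to full-res extraction and CLIP embedding
```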
Why this matters:
Instead of processing one image at a time, VideoSummarizer batches 8 images per GPU pass.
public class BatchClipEmbeddingService
{
private const int ClipImageSize = 224;
private const int DefaultBatchSize = 8; // 8 images per GPU pass
public async Task<Dictionary<int, float[]>> GenerateBatchEmbeddingsAsync(
Dictionary<int, string> framePaths,
IBackpressureController backpressure,
ITimeEstimator estimator,
int batchSize = DefaultBatchSize,
CancellationToken ct = default)
{
var session = await GetOrLoadClipModelAsync(ct);
var results = new ConcurrentDictionary<int, float[]>();
// Pre-index batch for O(1) lookup
var batches = framePaths
.Select((kvp, idx) => (idx, kvp.Key, kvp.Value))
.Chunk(batchSize)
.ToList();
foreach (var batch in batches)
{
// Acquire slot from backpressure controller (adaptive concurrency)
using var slot = await backpressure.AcquireSlotAsync(ct);
using var timer = estimator.Time("clip_batch");
// Create batch tensor [batchSize, 3, 224, 224]
var tensor = new DenseTensor<float>(new[] { batch.Length, 3, ClipImageSize, ClipImageSize });
// Preprocess images into tensor slots
for (int i = 0; i < batch.Length; i++)
{
var (_, frameIndex, path) = batch[i];
PreprocessImageToTensor(path, tensor, slotIndex: i);
}
// Single GPU pass for entire batch
var inputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor("input", tensor)
};
using var outputResults = session.Run(inputs);
// Extract embeddings, record latency for backpressure adjustment
}
backpressure.RecordLatency(estimator.GetAverageDuration("clip_batch"));
return new Dictionary<int, float[]>(results);
}
}
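The PreprocessImageToTensor helper isn't shown above. A plausible sketch using ImageSharp and CLIP's published per-channel normalization constants looks like this (simplified: no aspect-preserving resize or center crop, so the repo's version may differ):

```csharp
// Assumed implementation. CLIP ViT-B/32 expects 224x224 RGB in NCHW layout,
// normalized with OpenAI's published mean/std per channel.
private static readonly float[] Mean = [0.48145466f, 0.4578275f, 0.40821073f];
private static readonly float[] Std  = [0.26862954f, 0.26130258f, 0.27577711f];

private void PreprocessImageToTensor(string path, DenseTensor<float> tensor, int slotIndex)
{
    using var image = Image.Load<Rgb24>(path);
    image.Mutate(x => x.Resize(ClipImageSize, ClipImageSize));

    for (int y = 0; y < ClipImageSize; y++)
    for (int x = 0; x < ClipImageSize; x++)
    {
        var p = image[x, y];
        tensor[slotIndex, 0, y, x] = (p.R / 255f - Mean[0]) / Std[0];
        tensor[slotIndex, 1, y, x] = (p.G / 255f - Mean[1]) / Std[1];
        tensor[slotIndex, 2, y, x] = (p.B / 255f - Mean[2]) / Std[2];
    }
}
```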
Performance comparison:
Input: 30 keyframes (after deduplication)
Serial processing (1 frame at a time):
30 × 200ms = 6,000ms (6.0 seconds)
Batch processing (8 frames per pass):
4 batches × 350ms = 1,400ms (1.4 seconds)
Speedup: 4.3x
Why batch processing works:
- An [8, 3, 224, 224] batch tensor uses almost the same GPU memory as a single image (the model weights dominate)

VideoSummarizer doesn't reinvent ImageSummarizer or AudioSummarizer. It chains them.
The keyframe extraction is split into 7 granular waves (see wave table above). Here's the coordination pattern showing how they chain together:
// IFrameDetectionWave → KeyframeSelectionWave → ThumbnailExtractionWave
// → KeyframeDeduplicationWave → KeyframeFullResExtractionWave → ClipEmbeddingWave
// ClipEmbeddingWave coordinates with ImageSummarizer
public class ClipEmbeddingWave : IVideoWave, ISignalAwareVideoWave
{
private readonly BatchClipEmbeddingService _batchClipService;
private readonly IBackpressureController _backpressure;
private readonly ITimeEstimator _estimator;
public IReadOnlyList<string> RequiredSignals => [VideoSignals.KeyframesExtracted];
public IReadOnlyList<string> EmittedSignals => [VideoSignals.ClipEmbeddingsReady];
public async Task ProcessAsync(VideoContext context, CancellationToken ct)
{
var keyframes = context.GetCached<Dictionary<int, string>>("keyframes.paths");
// Batch CLIP embedding with backpressure control
var embeddings = await _batchClipService.GenerateBatchEmbeddingsAsync(
keyframes, _backpressure, _estimator, batchSize: 8, ct);
foreach (var (frameIndex, embedding) in embeddings)
context.KeyframeEmbeddings[frameIndex] = embedding;
}
}
// ImageAnalysisWave runs ImageSummarizer on deduplicated frames
public class ImageAnalysisWave : IVideoWave, ISignalAwareVideoWave
{
public IReadOnlyList<string> RequiredSignals => [VideoSignals.KeyframesDeduplicated];
public async Task ProcessAsync(VideoContext context, CancellationToken ct)
{
var keyframePaths = context.GetCached<List<string>>("keyframes.deduplicated_paths");
foreach (var path in keyframePaths)
{
// Run ImageSummarizer for OCR, vision, captions
var result = await _imageOrchestrator.AnalyzeAsync(path, ct);
context.SetCached($"image_analysis.{Path.GetFileName(path)}", result);
}
}
}
Audio extraction and transcription are now separate signal-aware waves:
// AudioExtractionWave runs first (extracts audio track from video)
public class AudioExtractionWave : IVideoWave, ISignalAwareVideoWave
{
public IReadOnlyList<string> RequiredSignals => [VideoSignals.VideoNormalized];
public IReadOnlyList<string> EmittedSignals => ["audio.extracted", "audio.path"];
public async Task ProcessAsync(VideoContext context, CancellationToken ct)
{
var audioPath = await _ffmpegService.ExtractAudioAsync(
context.VideoPath, context.WorkingDirectory, ct);
context.SetCached("audio.path", audioPath);
}
}
// TranscriptionWave depends on audio.extracted signal
public class TranscriptionWave : IVideoWave, ISignalAwareVideoWave
{
public IReadOnlyList<string> RequiredSignals => ["audio.extracted"];
public IReadOnlyList<string> EmittedSignals => [
VideoSignals.TranscriptionComplete,
"transcription.utterance_count"
];
public async Task ProcessAsync(VideoContext context, CancellationToken ct)
{
var audioPath = context.GetCached<string>("audio.path");
// Run AudioSummarizer pipeline (Whisper + diarization)
var audioProfile = await _audioOrchestrator.AnalyzeAsync(audioPath, ct);
// Extract utterances with speaker info
var turns = audioProfile.GetValue<List<SpeakerTurn>>("speaker.turns");
foreach (var turn in turns ?? [])
{
context.Utterances.Add(new Utterance
{
Id = Guid.NewGuid(),
Text = turn.Text,
StartTime = turn.StartSeconds,
EndTime = turn.EndSeconds,
SpeakerId = turn.SpeakerId,
Confidence = turn.Confidence
});
}
// Run NER on full transcript for entity extraction
var transcript = audioProfile.GetValue<string>("transcription.full_text");
if (!string.IsNullOrEmpty(transcript))
{
var entities = await _nerService.ExtractEntitiesAsync(transcript, ct);
context.SetCached("transcript_entities", entities);
// Emit entity signals by type (PER, ORG, LOC, MISC)
foreach (var group in entities.GroupBy(e => e.Type))
{
context.AddSignal($"transcript.entities.{group.Key.ToLowerInvariant()}",
group.Select(e => e.Text).Distinct().ToList());
}
}
}
}
VideoSummarizer extracts named entities from transcripts using BERT-based NER (ONNX).
public class OnnxNerService
{
// Model: dslim/bert-base-NER (ONNX exported)
// Entities: PER (Person), ORG (Organization), LOC (Location), MISC (Miscellaneous)
public async Task<List<EntitySpan>> ExtractEntitiesAsync(string text, CancellationToken ct)
{
var entities = new List<EntitySpan>();
// Chunk long text (BERT max 512 tokens)
foreach (var chunk in ChunkText(text, maxTokens: 400, overlap: 50))
{
// Tokenize with WordPiece
var tokens = _tokenizer.Tokenize(chunk);
// Run ONNX inference
var inputs = PrepareInputs(tokens);
using var results = _session.Run(inputs);
// Decode BIO tags
var predictions = DecodePredictions(results);
var chunkEntities = ExtractEntitySpans(tokens, predictions);
entities.AddRange(chunkEntities);
}
// Deduplicate entities
return entities
.GroupBy(e => (e.Text.ToLowerInvariant(), e.Type))
.Select(g => g.First())
.ToList();
}
}
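The ChunkText helper referenced above is sliding-window chunking with overlap, so an entity that straddles a chunk boundary still appears whole in at least one chunk. A sketch (approximating tokens with whitespace words rather than true WordPiece counts) might look like:

```csharp
// Assumed shape of ChunkText: overlapping windows under the BERT 512-token limit.
// The real service would count WordPiece tokens, not whitespace words.
private static IEnumerable<string> ChunkText(string text, int maxTokens, int overlap)
{
    var words = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);
    for (int start = 0; start < words.Length; start += maxTokens - overlap)
    {
        var take = Math.Min(maxTokens, words.Length - start);
        yield return string.Join(' ', words.Skip(start).Take(take));
        if (start + take >= words.Length) break;
    }
}
```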
Example output:
Transcript: "Today we're speaking with John Smith from Microsoft about
their new AI lab in Seattle. The project, codenamed Phoenix, builds
on research from Stanford University."
Entities extracted:
PER: John Smith
ORG: Microsoft, Stanford University
LOC: Seattle
MISC: Phoenix
Signals emitted:
transcript.entities.per = ["John Smith"]
transcript.entities.org = ["Microsoft", "Stanford University"]
transcript.entities.loc = ["Seattle"]
transcript.entities.misc = ["Phoenix"]
Why NER matters for video:
Using CLIP embeddings alone for scene detection doesn't work well. Keyframes are sparse by design (one per shot change), but shots are dense. With 39 embeddings for 1881 shots (~2% coverage), pure embedding clustering produces just 1 scene for a 2-hour movie.
VideoSummarizer uses a multi-signal approach that combines 4 weighted signals for robust scene boundary detection. Scene boundaries are selected by deterministic weighted scoring, not learned end-to-end:
public class SceneClusteringWave : IVideoWave, ISignalAwareVideoWave
{
// Signal weights for boundary scoring
private const double EmbeddingWeight = 0.4; // CLIP embedding dissimilarity
private const double TranscriptWeight = 0.3; // Semantic shift in transcript
private const double CutTypeWeight = 0.2; // Fade/dissolve detection
private const double TemporalWeight = 0.1; // Time since last scene
// Temporal constraints
private const double MinSceneDuration = 15.0; // Don't split scenes < 15s
private const double MaxSceneDuration = 300.0; // Force split at 5 minutes
private const double TargetSceneDuration = 90.0; // Prefer ~90s scenes
public IReadOnlyList<string> RequiredSignals => [VideoSignals.ShotsDetected];
public IReadOnlyList<string> OptionalSignals => [
VideoSignals.ClipEmbeddingsReady,
VideoSignals.TranscriptionComplete,
VideoSignals.KeyframesDeduplicated
];
public IReadOnlyList<string> EmittedSignals => [
VideoSignals.ScenesDetected,
"scene.count",
"scene.avg_duration",
"scene.clustering_method"
];
private List<(int shotIndex, double score)> ComputeBoundaryScores(VideoContext context)
{
var shots = context.Shots.OrderBy(s => s.StartTime).ToList();
var scores = new List<(int, double)>();
// Build embedding map with nearest-neighbor interpolation
var shotEmbeddings = PropagateEmbeddingsToNearbyShots(context, shots);
// Build transcript windows for semantic shift detection
var transcriptWindows = BuildTranscriptWindows(context, shots, windowSeconds: 10);
for (int i = 0; i < shots.Count - 1; i++)
{
double score = 0;
var currentShot = shots[i];
var nextShot = shots[i + 1];
// 1. Embedding dissimilarity (40%)
if (shotEmbeddings.TryGetValue(i, out var currentEmbed) &&
shotEmbeddings.TryGetValue(i + 1, out var nextEmbed))
{
var similarity = CosineSimilarity(currentEmbed, nextEmbed);
score += (1.0 - similarity) * EmbeddingWeight;
}
// 2. Transcript semantic shift (30%)
if (transcriptWindows.TryGetValue(i, out var currentWords) &&
transcriptWindows.TryGetValue(i + 1, out var nextWords))
{
var overlap = currentWords.Intersect(nextWords).Count();
var union = currentWords.Union(nextWords).Count();
var jaccard = union > 0 ? (double)overlap / union : 0;
score += (1.0 - jaccard) * TranscriptWeight;
}
// 3. Cut type signal (20%) - fades/dissolves suggest scene boundaries
if (currentShot.CutType is "fade" or "dissolve")
{
score += CutTypeWeight;
}
// 4. Temporal pressure (10%) - encourage splits near target duration
var timeSinceLastScene = currentShot.EndTime - GetLastSceneBoundary();
if (timeSinceLastScene > TargetSceneDuration)
{
var pressure = Math.Min(1.0, (timeSinceLastScene - TargetSceneDuration) / 60);
score += pressure * TemporalWeight;
}
scores.Add((i, score));
}
return scores;
}
}
Nearest-Neighbor Embedding Propagation: Only ~2% of shots have direct CLIP embeddings. The new approach propagates embeddings to nearby shots within 30 seconds using temporal proximity weighting.
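A sketch of what that propagation could look like (the keyframe lookup and single-nearest-neighbour choice here are assumptions; the real wave may weight or average several neighbours):

```csharp
// Sketch only. Assumes the context exposes keyframes as (Timestamp, ClipEmbedding)
// pairs; the actual VideoContext layout may differ. Each shot without its own
// embedding borrows the temporally nearest keyframe embedding within 30 seconds.
private Dictionary<int, float[]> PropagateEmbeddingsToNearbyShots(
    VideoContext context, List<Shot> shots)
{
    const double maxDistanceSeconds = 30.0;
    var keyframes = context.Keyframes
        .Where(k => k.ClipEmbedding is not null)
        .ToList();

    var result = new Dictionary<int, float[]>();
    for (int i = 0; i < shots.Count; i++)
    {
        var shotMid = (shots[i].StartTime + shots[i].EndTime) / 2.0;
        var nearest = keyframes
            .OrderBy(k => Math.Abs(k.Timestamp - shotMid))
            .FirstOrDefault();

        if (nearest is not null && Math.Abs(nearest.Timestamp - shotMid) <= maxDistanceSeconds)
            result[i] = nearest.ClipEmbedding!;
    }
    return result;
}
```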
Transcript Semantic Windows: Builds 10-second word windows around each shot and detects semantic shifts via Jaccard distance (a cheap but robust semantic drift proxy). Low overlap = topic change. BM25 overlap or embedding drift can be used when available.
Cut Type Awareness: Fade-to-black and dissolve transitions strongly indicate scene boundaries, boosting the boundary score.
Adaptive Thresholding: Instead of a fixed threshold, selects boundaries from the top 25% of scores (adaptive to content).
Temporal Constraints: Enforces minimum 15s scenes and forces boundaries at 5-minute maximum.
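Putting the adaptive threshold and the temporal constraints together, boundary selection might look like the following sketch (the quartile cutoff matches the description above; tie-breaking and forced-split details are assumptions):

```csharp
// Sketch: take the top 25% of boundary scores as candidates, then enforce the
// 15s minimum / 300s maximum scene durations while walking shots in order.
private List<int> SelectBoundaries(List<(int ShotIndex, double Score)> scores, List<Shot> shots)
{
    var threshold = scores
        .Select(s => s.Score)
        .OrderByDescending(s => s)
        .ElementAt(Math.Max(0, scores.Count / 4 - 1)); // top-quartile cutoff

    var boundaries = new List<int>();
    double lastBoundaryTime = 0;

    foreach (var (shotIndex, score) in scores.OrderBy(s => s.ShotIndex))
    {
        var boundaryTime = shots[shotIndex].EndTime;
        var sceneLength = boundaryTime - lastBoundaryTime;

        bool strongEnough = score >= threshold && sceneLength >= MinSceneDuration;
        bool forcedSplit = sceneLength >= MaxSceneDuration;

        if (strongEnough || forcedSplit)
        {
            boundaries.Add(shotIndex);
            lastBoundaryTime = boundaryTime;
        }
    }
    return boundaries;
}
```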
Example:
Input: 1881 shots from a 2-hour movie
39 keyframes with CLIP embeddings
2302 utterances from transcript
Boundary scoring per shot:
Shot 45-46: embedding=0.15, transcript=0.32, cut=0.0, temporal=0.0 → score=0.156
Shot 46-47: embedding=0.08, transcript=0.12, cut=0.0, temporal=0.0 → score=0.068
Shot 47-48: embedding=0.35, transcript=0.41, cut=0.2, temporal=0.05 → score=0.388 ← BOUNDARY
...
Adaptive threshold (top 25%): 0.25
Natural boundaries found: 45
Output: 47 scenes (avg 2.6 minutes per scene)
- Min scene: 15.2s
- Max scene: 298.4s
- Total coverage: 100%
Signals:
scenes.detected = true
scene.count = 47
scene.avg_duration = 156.3
scene.clustering_method = "multi_signal_weighted"
VideoSummarizer extends the signal contract from ImageSummarizer and AudioSummarizer:
public record VideoSignal
{
public required string Key { get; init; } // "scene.count", "transcript.entities.per"
public object? Value { get; init; }
public double Confidence { get; init; } = 1.0;
public required string Source { get; init; } // "SceneClusteringWave"
// Video-specific: time range
public double? StartTime { get; init; }
public double? EndTime { get; init; }
public DateTime Timestamp { get; init; }
public Dictionary<string, object>? Metadata { get; init; }
public List<string>? Tags { get; init; } // ["visual", "scene"]
}
public static class VideoSignalTags
{
public const string Visual = "visual";
public const string Audio = "audio";
public const string Speech = "speech";
public const string Ocr = "ocr";
public const string Motion = "motion";
public const string Scene = "scene";
public const string Shot = "shot";
public const string Metadata = "metadata";
}
Key signals emitted:
| Signal | Source | Description |
|---|---|---|
| video.duration | NormalizeWave | Total duration in seconds |
| video.resolution | NormalizeWave | Width×Height |
| video.fps | NormalizeWave | Frame rate |
| shots.count | ShotDetectionWave | Number of detected shots |
| keyframes.count | KeyframeExtractionWave | Unique keyframes after dedup |
| keyframes.duplicates_skipped | KeyframeExtractionWave | Frames filtered by dHash |
| scene.count | SceneClusteringWave | Coherent scene segments |
| transcript.entities.per | TranscriptionWave | Person names from NER |
| transcript.entities.org | TranscriptionWave | Organization names |
| transcript.word_count | TranscriptionWave | Total words in transcript |
The VideoPipeline converts video signals into ContentChunk for RAG indexing:
public class VideoPipeline : PipelineBase
{
public override string PipelineId => "video";
public override IReadOnlySet<string> SupportedExtensions => new HashSet<string>
{
".mp4", ".mkv", ".avi", ".mov", ".wmv", ".webm", ".flv", ".m4v", ".mpeg", ".mpg"
};
private List<ContentChunk> BuildContentChunks(VideoContext context, string filePath)
{
var chunks = new List<ContentChunk>();
// 1. Scene-based chunks (best for video retrieval)
foreach (var scene in context.Scenes)
{
var sceneText = BuildSceneText(context, scene);
var embedding = context.GetCached<float[]>($"scene_centroid.{scene.Id}");
chunks.Add(new ContentChunk
{
Text = sceneText,
ContentType = ContentType.Summary,
Embedding = embedding, // Proper vector column, not metadata
Metadata = new Dictionary<string, object?>
{
["source"] = "video_scene",
["scene_id"] = scene.Id,
["key_terms"] = scene.KeyTerms,
["speakers"] = scene.SpeakerIds,
["start_time"] = scene.StartTime,
["end_time"] = scene.EndTime
}
});
}
// 2. Transcript chunks (1-minute windows)
var transcriptChunks = BuildTranscriptChunks(context, filePath);
chunks.AddRange(transcriptChunks);
// 3. Text track chunks (on-screen text/subtitles)
foreach (var textTrack in context.TextTracks)
{
chunks.Add(new ContentChunk
{
Text = $"On-screen text: {textTrack.Text}",
ContentType = ContentType.ImageOcr,
Metadata = new Dictionary<string, object?>
{
["source"] = "video_ocr",
["text_type"] = textTrack.TextType.ToString(),
["start_time"] = textTrack.StartTime
}
});
}
return chunks;
}
private string BuildSceneText(VideoContext context, SceneSegment scene)
{
var parts = new List<string>();
if (!string.IsNullOrEmpty(scene.Label))
parts.Add($"Scene: {scene.Label}");
parts.Add($"[{FormatTime(scene.StartTime)} - {FormatTime(scene.EndTime)}]");
if (scene.KeyTerms.Count > 0)
parts.Add($"Topics: {string.Join(", ", scene.KeyTerms)}");
// Add utterances in this scene
var sceneUtterances = context.Utterances
.Where(u => u.StartTime >= scene.StartTime && u.EndTime <= scene.EndTime)
.OrderBy(u => u.StartTime);
if (sceneUtterances.Any())
parts.Add($"Speech: {string.Join(" ", sceneUtterances.Select(u => u.Text))}");
return string.Join("\n", parts);
}
}
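BuildTranscriptChunks isn't shown above. A sketch that mirrors the 1-minute windows and the metadata visible in the example output below (an assumed implementation, not the repo's exact code) could be:

```csharp
// Assumed sketch: group utterances into 60-second windows and emit one
// transcript chunk per window, matching the example metadata below.
private List<ContentChunk> BuildTranscriptChunks(VideoContext context, string filePath)
{
    const double windowSeconds = 60.0;
    return context.Utterances
        .OrderBy(u => u.StartTime)
        .GroupBy(u => (int)(u.StartTime / windowSeconds))
        .Select(g => new ContentChunk
        {
            Text = string.Join(" ", g.Select(u => u.Text)),
            ContentType = ContentType.Transcript,
            Metadata = new Dictionary<string, object?>
            {
                ["source"] = "video_transcript",
                ["time_window"] = $"{FormatTime(g.Key * windowSeconds)} - {FormatTime((g.Key + 1) * windowSeconds)}",
                ["utterance_count"] = g.Count()
            }
        })
        .ToList();
}
```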
Example output for a movie:
{
"chunks": [
{
"text": "Scene: Opening montage\n[0:00 - 2:34]\nTopics: city, night, traffic\nSpeech: The year is 2049. The world has changed.",
"contentType": "Summary",
"metadata": {
"source": "video_scene",
"scene_id": "abc123",
"key_terms": ["city", "night", "traffic"],
"start_time": 0.0,
"end_time": 154.0
}
},
{
"text": "The detective arrived at the crime scene. Forensics had already processed the area.",
"contentType": "Transcript",
"metadata": {
"source": "video_transcript",
"time_window": "2:34 - 3:34",
"utterance_count": 4
}
},
{
"text": "On-screen text: LOS ANGELES 2049",
"contentType": "ImageOcr",
"metadata": {
"source": "video_ocr",
"text_type": "Title"
}
}
]
}
| Stage | Time | Notes |
|---|---|---|
| FFprobe metadata | ~2s | |
| Shot detection | ~10s | FFmpeg scene filter |
| Keyframe extraction | ~30s | 500 I-frames |
| dHash deduplication | ~0.5s | 500 → 300 frames |
| Batch CLIP embedding | ~60s | 300 frames, batch 8 |
| ImageSummarizer OCR | ~120s | 50 keyframes with text |
| Audio extraction | ~30s | FFmpeg |
| Whisper transcription | ~180s | 2 hours of speech |
| Speaker diarization | ~60s | ECAPA-TDNN |
| NER extraction | ~10s | BERT-NER on transcript |
| Scene clustering | ~5s | |
| Evidence generation | ~2s | |
| Total | ~8-10 minutes | |
| Optimization | Savings |
|---|---|
| dHash deduplication | ~40% frames filtered = ~24s CLIP saved |
| Batch CLIP | 3-5x faster = ~180s saved |
| Pipeline composition | Reuses ImageSummarizer/AudioSummarizer waves |
| Total savings | ~3-4 minutes |
| Component | Memory |
|---|---|
| CLIP ViT-B/32 ONNX | ~350MB |
| Whisper base | ~500MB |
| ECAPA-TDNN | ~100MB |
| BERT-NER | ~500MB |
| Peak | ~1.5GB |
VideoSummarizer registers as an IPipeline for automatic routing:
// In Program.cs
builder.Services.AddDocSummarizer(builder.Configuration.GetSection("DocSummarizer"));
builder.Services.AddDocSummarizerImages(builder.Configuration.GetSection("Images"));
builder.Services.AddVideoSummarizer(); // NEW
builder.Services.AddPipelineRegistry(); // Must be last
// Auto-routing by extension
var registry = services.GetRequiredService<IPipelineRegistry>();
var pipeline = registry.FindForFile("movie.mp4"); // Returns VideoPipeline
var result = await pipeline.ProcessAsync("movie.mp4");
Supported extensions:
.mp4, .mkv, .avi, .mov, .wmv, .webm, .flv, .m4v, .mpeg, .mpg

VideoSummarizer demonstrates that pipeline composition scales:
The result: a 2-hour movie becomes a structured signal ledger with scenes, transcripts, entities, and embeddings, ready for RAG queries like:
The Reduced RAG pattern for video:
Ingestion: Video → 16 waves → Signals + Evidence (scenes, transcripts, entities)
Storage: Signals (indexed) + Embeddings (CLIP, voice) + Evidence (chunks)
Query: Filter (SQL) → Search (BM25 + vector) → Synthesize (LLM, ~5 results)
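For concreteness, here's a purely illustrative query path; the store and method names below are placeholders, not the actual lucidRAG API:

```csharp
// Illustrative only: filter cheaply, search the survivors, hand ~5 chunks to the LLM.
var candidates = await store.FilterAsync(        // 1. Filter (SQL): metadata predicates
    source: "video_scene",
    speaker: "SPEAKER_01",
    after: TimeSpan.FromMinutes(30));

var hits = await store.HybridSearchAsync(        // 2. Search: BM25 + vector over candidates
    query: "the detective at the crime scene",
    within: candidates,
    topK: 5);

var answer = await llm.SynthesizeAsync(          // 3. Synthesize: bounded LLM input
    question: "When does the detective first visit the crime scene?",
    evidence: hits);
```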
The Capability System:
Startup: Detect GPU → Load ModelManifest (YAML) → Initialize SignalSink
Activation: Component requests model → Lazy download → Signal "ModelAvailable"
Routing: Route to best provider → Fallback chain → Backpressure control
Atoms: Rate limiting + Time estimation + Pipeline balancing
This is Constrained Fuzziness at scale:
The LLM operates on pre-computed, auditable evidence, never raw video.
Core Patterns:
Reduced RAG Implementations:
| Part | Pattern | Focus |
|---|---|---|
| 1 | Constrained Fuzziness | Single component |
| 2 | Constrained Fuzzy MoM | Multiple components |
| 3 | Context Dragging | Time / memory |
| 4 | Image Intelligence | Wave architecture, 22 waves |
| 4.1 | Three-Tier OCR Pipeline | OCR, ONNX models, filmstrips |
| 4.2 | AudioSummarizer | Forensic audio, speaker diarization |
| 4.3 | VideoSummarizer (this article) | Video orchestration, batch CLIP, NER |
Next: Multi-modal graph RAG with lucidRAG. Composing all five summarizers into a unified knowledge graph with cross-modal entity linking.
All parts follow the same invariant: probabilistic components propose; deterministic systems persist.