In which we build a voice-to-form system that doesn't hand the keys to the AI kingdom
Here's the thing about voice interfaces: everyone wants them, nobody trusts them. And they're right not to.
Most voice forms fail the moment the model mishears a date or decides to "helpfully" skip ahead. You say "May 12th 1984" and the LLM extracts "May 2014" and advances to the next field before you can correct it. Or worse: it hallucinates confidence and auto-submits your form with wrong data.
The moment you let an LLM "control" your form flow, you've introduced a non-deterministic black box into what should be a predictable user experience.
This isn't hypothetical. McDonald's ran AI-powered drive-thru ordering across 100+ US locations, and it routinely failed to understand basic modifications; McDonald's ended the partnership in 2024. The lesson: order-taking is a structured data problem, with a finite menu, known modifiers, and clear pricing. It's a form with speech recognition, not a reasoning task.
But what if we could have our cake and eat it too? Voice input for convenience, LLM for translation, but deterministic code for everything that actually matters?
That's what we're building today: a "Ten Commandments compliant" voice form system using Blazor Server, local Whisper for speech-to-text, and local Ollama for field extraction. No cloud dependencies. No unpredictable AI overlords. Just clean architecture.
This project follows The Ten Commandments of LLM Use. If you haven't read that yet, the short version: LLMs are probability engines, not decision-makers. They translate and classify; they don't control.
For voice forms, the relevant commandments are the ones about translation, causality, and probability: the LLM translates speech but has no side-effects, flow control lives in deterministic code, and probabilistic output is confined to places where a wrong guess is cheap to correct.
Beyond the commandments, the system follows a few specific principles of its own, and you'll see them reflected in every layer of the build.
Here's how this looks in practice:
flowchart LR
A[User Speaks] --> B[Whisper STT]
B --> C[Raw Transcript]
C --> D[Ollama LLM]
D --> E[Extracted Value]
E --> F{Validation}
F -->|Pass| G{Policy Check}
F -->|Fail| H[Show Error]
G -->|High Confidence| I[Auto-Confirm]
G -->|Low Confidence| J[Ask User]
J --> K{User Confirms?}
K -->|Yes| L[State Machine: Next Field]
K -->|No| M[Retry Recording]
I --> L
H --> M
style D stroke:#f96,stroke-width:2px
style F stroke:#6f6,stroke-width:2px
style G stroke:#ff9,stroke-width:2px
Let's be clear about scope: this is a field-by-field form filler, not a conversational agent. One field at a time, a fixed JSON schema, local models only, and no free-form dialogue.
These constraints are features, not bugs. They make the system predictable, testable, and trustworthy.
If you want a conversational agent, build a conversational agent. If you want correct data, don't pretend conversation helps.
We're building a standalone Blazor Server app. Here's the architecture:
Mostlylucid.VoiceForm/
├── Config/
│ └── VoiceFormConfig.cs # Configuration binding
├── Models/
│ ├── FormSchema/ # Form definitions
│ ├── State/ # Session state
│ └── Extraction/ # LLM input/output
├── Services/
│ ├── Stt/ # Speech-to-text (Whisper)
│ ├── Extraction/ # Field extraction (Ollama)
│ ├── Validation/ # Type-based validators
│ ├── StateMachine/ # Form flow control
│ └── Orchestration/ # Coordinates everything
├── Components/
│ └── Pages/ # Blazor UI
└── wwwroot/js/
└── audio-recorder.js # Web Audio API capture
Dependency rule: Services depend only on abstractions below them. Nothing upstream knows what happens downstream.
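In Program.cs that rule shows up as plain DI registration, with each service consumed only through its interface. A rough sketch (WhisperSttService and FormValidator are placeholder names for the concrete implementations, and the "VoiceForm" section name is a guess):
builder.Services.Configure<VoiceFormConfig>(
    builder.Configuration.GetSection("VoiceForm"));

// HTTP-backed services get typed clients; pure-logic services are plain registrations
builder.Services.AddHttpClient<ISttService, WhisperSttService>();
builder.Services.AddHttpClient<IFieldExtractor, OllamaFieldExtractor>();
builder.Services.AddSingleton<IFormValidator, FormValidator>();
builder.Services.AddScoped<IFormStateMachine, FormStateMachine>();
builder.Services.AddScoped<IFormOrchestrator, FormOrchestrator>();
Scoped registrations line up with Blazor Server circuits: one state machine per user session.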
First, you'll need Whisper and Ollama running locally. Add these to your devdeps-docker-compose.yml:
services:
whisper:
image: onerahmet/openai-whisper-asr-webservice:latest
container_name: whisper-stt
ports:
- "9000:9000"
environment:
- ASR_MODEL=base.en
- ASR_ENGINE=faster_whisper
volumes:
- whisper-models:/root/.cache/huggingface
restart: unless-stopped
ollama:
image: ollama/ollama:latest
container_name: ollama
ports:
- "11434:11434"
volumes:
- ollama-models:/root/.ollama
restart: unless-stopped
volumes:
whisper-models:
ollama-models:
After spinning up, pull a model for Ollama:
docker exec -it ollama ollama pull llama3.2:3b
Why local? Testability + determinism + data locality. CI spins up the same containers. Tests hit the same endpoints. No rate limits, no network variability.
Note: The base.en Whisper model is fast and accurate for English. For production, consider small.en or medium.en for better accuracy.
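The ISttService that the orchestrator calls later is just a thin HTTP wrapper around this container. Here's a minimal sketch, assuming the webservice's POST /asr endpoint (multipart audio_file, output=json) and an SttResult record of my own naming; the implementation in the repo will differ in detail:
public record SttResult(string Transcript, double Confidence);

public class WhisperSttService : ISttService
{
    private readonly HttpClient _httpClient; // BaseAddress: http://localhost:9000

    public WhisperSttService(HttpClient httpClient) => _httpClient = httpClient;

    public async Task<SttResult> TranscribeAsync(byte[] wavBytes, CancellationToken ct = default)
    {
        using var content = new MultipartFormDataContent();
        var audio = new ByteArrayContent(wavBytes);
        audio.Headers.ContentType = new MediaTypeHeaderValue("audio/wav");
        content.Add(audio, "audio_file", "recording.wav");

        var response = await _httpClient.PostAsync("/asr?output=json", content, ct);
        response.EnsureSuccessStatusCode();

        using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync(ct));
        var transcript = doc.RootElement.GetProperty("text").GetString()?.Trim() ?? string.Empty;

        // The webservice doesn't return a single overall confidence; use a simple heuristic here
        return new SttResult(transcript, string.IsNullOrEmpty(transcript) ? 0.0 : 0.9);
    }
}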
Forms are defined in JSON. This is the single source of truth for what fields exist and how they behave:
{
"id": "customer-intake",
"name": "Customer Intake Form",
"fields": [
{
"id": "fullName",
"label": "Full Name",
"type": "Text",
"prompt": "Please say your full name.",
"required": true
},
{
"id": "dateOfBirth",
"label": "Date of Birth",
"type": "Date",
"prompt": "What is your date of birth?",
"required": true,
"confirmationPolicy": {
"alwaysConfirm": true
}
},
{
"id": "email",
"label": "Email Address",
"type": "Email",
"prompt": "What is your email address?",
"required": true
},
{
"id": "phone",
"label": "Phone Number",
"type": "Phone",
"prompt": "What is your phone number?",
"required": false
},
{
"id": "notes",
"label": "Additional Notes",
"type": "Text",
"prompt": "Any additional notes?",
"required": false
}
]
}
The C# models that represent this:
public record FormDefinition(
string Id,
string Name,
List<FieldDefinition> Fields);
public record FieldDefinition(
string Id,
string Label,
FieldType Type,
string Prompt,
bool Required = true,
ConfirmationPolicy? ConfirmationPolicy = null);
public enum FieldType
{
Text,
Date,
Email,
Phone,
Choice // Constrained vocabulary - especially important for voice
}
public record ConfirmationPolicy(
bool AlwaysConfirm = false,
double ConfidenceThreshold = 0.85);
Because the schema is declarative, adding fields never requires touching the state machine. Drop a new field definition in the JSON, and the form automatically includes it with proper flow control.
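Loading a schema is plain deserialization, with no AI anywhere near it. A minimal sketch, assuming the JSON definitions live in a Forms/ folder next to the app:
public class FormDefinitionLoader
{
    private static readonly JsonSerializerOptions Options = new()
    {
        PropertyNameCaseInsensitive = true,
        Converters = { new JsonStringEnumConverter() } // "Date" in JSON maps to FieldType.Date
    };

    public async Task<FormDefinition> LoadAsync(string formId, CancellationToken ct = default)
    {
        var path = Path.Combine(AppContext.BaseDirectory, "Forms", $"{formId}.json");
        await using var stream = File.OpenRead(path);

        return await JsonSerializer.DeserializeAsync<FormDefinition>(stream, Options, ct)
               ?? throw new InvalidOperationException($"Form '{formId}' could not be loaded.");
    }
}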
The anti-pattern we're avoiding is "LLM-driven navigation": advancing form state based on model output. This state machine explicitly forbids that.
The state machine doesn't use AI. It's pure C# logic that determines transitions based on explicit rules.
public class FormStateMachine : IFormStateMachine
{
private FormSession _session = null!;
private int _currentFieldIndex;
public FormSession StartSession(FormDefinition form)
{
_session = new FormSession
{
Id = Guid.NewGuid().ToString(),
Form = form,
Status = FormStatus.InProgress,
StartedAt = DateTime.UtcNow,
FieldStates = form.Fields.ToDictionary(
f => f.Id,
f => new FieldState { FieldId = f.Id, Status = FieldStatus.Pending })
};
// First field starts in progress
if (form.Fields.Count > 0)
{
_session.FieldStates[form.Fields[0].Id].Status = FieldStatus.InProgress;
}
return _session;
}
public FieldDefinition? GetCurrentField()
{
if (_currentFieldIndex >= _session.Form.Fields.Count)
return null;
return _session.Form.Fields[_currentFieldIndex];
}
}
The state transitions are explicit and testable:
public StateTransitionResult ProcessExtraction(
ExtractionResponse extraction,
ValidationResult validation)
{
var currentField = GetCurrentField();
if (currentField == null)
return new StateTransitionResult(false, "No current field");
var fieldState = _session.FieldStates[currentField.Id];
fieldState.AttemptCount++;
// Validation failed? Stay on current field
if (!validation.IsValid)
{
return new StateTransitionResult(
false,
$"Validation failed: {validation.ErrorMessage}");
}
// Store the pending value
fieldState.PendingValue = extraction.Value;
fieldState.PendingConfidence = extraction.Confidence;
// Check confirmation policy - this is rules, not AI
var policy = currentField.ConfirmationPolicy
?? new ConfirmationPolicy();
var needsConfirmation = policy.AlwaysConfirm
|| extraction.Confidence < policy.ConfidenceThreshold;
if (needsConfirmation)
{
fieldState.Status = FieldStatus.AwaitingConfirmation;
return new StateTransitionResult(
true,
"Please confirm this value",
RequiresConfirmation: true);
}
// Auto-confirm high confidence values
return ConfirmValue();
}
Key point: The _currentFieldIndex only advances on confirmation, never on extraction. This means a failed extraction or rejected confirmation keeps you on the same field. The user stays in control.
Notice the confirmation logic: it's a simple policy check, not LLM reasoning. High confidence + no alwaysConfirm = auto-confirm. Low confidence or sensitive field = ask the user.
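ConfirmValue is called above but not shown. Here's a sketch of what that transition might look like (FormStatus.Completed is an assumed enum value); note it is the only place _currentFieldIndex moves:
private StateTransitionResult ConfirmValue()
{
    var currentField = GetCurrentField()!;
    var fieldState = _session.FieldStates[currentField.Id];

    // Promote the pending value - the only place a value becomes final
    fieldState.Value = fieldState.PendingValue;
    fieldState.Status = FieldStatus.Confirmed;

    // Advance - the only place the form moves forward
    _currentFieldIndex++;

    if (_currentFieldIndex >= _session.Form.Fields.Count)
    {
        _session.Status = FormStatus.Completed;
        return new StateTransitionResult(true, "Form complete");
    }

    var nextField = _session.Form.Fields[_currentFieldIndex];
    _session.FieldStates[nextField.Id].Status = FieldStatus.InProgress;
    return new StateTransitionResult(true, $"Next: {nextField.Label}");
}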
The Ollama extractor has one job: turn messy human speech into structured field values.
If you can't write this interface without embarrassment, the LLM is doing too much:
public interface IFieldExtractor
{
Task<ExtractionResponse> ExtractAsync(
ExtractionContext context,
CancellationToken ct = default);
}
public record ExtractionContext(
FieldDefinition Field,
string Prompt,
string Transcript);
public record ExtractionResponse(
string FieldId,
string? Value,
double Confidence,
bool NeedsConfirmation, // Suggestion only - policy has final say
string? Reason);
Important: The extractor may suggest NeedsConfirmation: true, but the confirmation policy always has the final say. The LLM's opinion is advisory, not authoritative.
And the implementation:
public class OllamaFieldExtractor : IFieldExtractor
{
private readonly HttpClient _httpClient;
private readonly string _model;
public async Task<ExtractionResponse> ExtractAsync(
ExtractionContext context,
CancellationToken ct = default)
{
var systemPrompt = BuildSystemPrompt(context.Field);
var userPrompt = $"User said: \"{context.Transcript}\"";
var request = new
{
model = _model,
messages = new[]
{
new { role = "system", content = systemPrompt },
new { role = "user", content = userPrompt }
},
format = "json",
stream = false,
options = new { temperature = 0.1 } // Low = deterministic
};
var response = await _httpClient.PostAsJsonAsync(
"/api/chat", request, ct);
return ParseResponse(response, context.Field.Id);
}
}
Why temperature 0.1? We want the LLM to be boring and predictable. Given the same transcript, it should extract the same value every time. Temperature 0.1 minimizes creative variation, which is exactly what we want for data extraction.
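ParseResponse isn't shown above either. A minimal sketch, assuming Ollama's non-streaming /api/chat response shape, where the model's output sits in message.content:
private static ExtractionResponse ParseResponse(HttpResponseMessage response, string fieldId)
{
    response.EnsureSuccessStatusCode();

    // /api/chat with stream: false returns { "message": { "role": "assistant", "content": "..." }, ... }
    using var stream = response.Content.ReadAsStream();
    using var envelope = JsonDocument.Parse(stream);
    var content = envelope.RootElement.GetProperty("message").GetProperty("content").GetString();

    if (string.IsNullOrWhiteSpace(content))
        return new ExtractionResponse(fieldId, null, 0.0, true, "Empty model response");

    // format = "json" means content should already be the JSON object we asked for
    var parsed = JsonSerializer.Deserialize<ExtractionResponse>(content,
        new JsonSerializerOptions { PropertyNameCaseInsensitive = true });

    return parsed ?? new ExtractionResponse(fieldId, null, 0.0, true, "Unparseable model response");
}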
The system prompt is explicit about the LLM's limited role:
private string BuildSystemPrompt(FieldDefinition field)
{
    // $$""" raw string so the literal JSON braces don't collide with interpolation
    return $$"""
        You are a data extraction assistant. Extract the {{field.Label}}
        from the user's speech.

        Field type: {{field.Type}}

        Return JSON only:
        {
          "fieldId": "{{field.Id}}",
          "value": "<extracted value or null>",
          "confidence": <0.0-1.0>,
          "needsConfirmation": <true/false>,
          "reason": "<brief explanation>"
        }

        Rules:
        - For dates, output ISO format (YYYY-MM-DD)
        - For emails, output lowercase
        - For phones, output digits only
        - If you can't extract, set value to null
        - Be conservative with confidence scores

        DO NOT:
        - Ask follow-up questions
        - Suggest next steps
        - Make assumptions beyond the transcript
        """;
}
The LLM extracts, normalizes the format (ISO dates, lowercase emails, digits-only phones), and reports confidence. It doesn't decide what happens next.
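The actual validation is deterministic C#. The IFormValidator isn't shown elsewhere in this post, so here's a minimal sketch of a type-based validator (FormValidator is a placeholder name, and ValidationResult.Fail is assumed alongside the Success() you'll see in the tests):
public class FormValidator : IFormValidator
{
    public ValidationResult Validate(FieldDefinition field, string? value)
    {
        if (string.IsNullOrWhiteSpace(value))
            return field.Required
                ? ValidationResult.Fail($"{field.Label} is required.")
                : ValidationResult.Success();

        return field.Type switch
        {
            // The LLM is asked for ISO dates; the validator enforces it
            FieldType.Date => DateOnly.TryParseExact(value, "yyyy-MM-dd", out _)
                ? ValidationResult.Success()
                : ValidationResult.Fail("Date must be in YYYY-MM-DD format."),

            FieldType.Email => Regex.IsMatch(value, @"^[^@\s]+@[^@\s]+\.[^@\s]+$")
                ? ValidationResult.Success()
                : ValidationResult.Fail("That doesn't look like an email address."),

            FieldType.Phone => value.All(char.IsDigit) && value.Length >= 7
                ? ValidationResult.Success()
                : ValidationResult.Fail("Phone numbers should be digits only."),

            _ => ValidationResult.Success() // Text and Choice pass through to later rules
        };
    }
}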
The JavaScript side captures microphone audio and converts it to 16kHz mono WAV:
window.voiceFormAudio = (function () {
let mediaRecorder = null;
let audioChunks = [];
let audioContext = null;
let dotNetRef = null;
async function startRecording() {
const stream = await navigator.mediaDevices
.getUserMedia({ audio: true });
audioContext = new AudioContext();
audioChunks = [];
const options = { mimeType: 'audio/webm;codecs=opus' };
mediaRecorder = new MediaRecorder(stream, options);
mediaRecorder.ondataavailable = (event) => {
if (event.data.size > 0) {
audioChunks.push(event.data);
}
};
mediaRecorder.onstop = async () => {
stream.getTracks().forEach(track => track.stop());
const audioBlob = new Blob(audioChunks, { type: 'audio/webm' });
const wavBlob = await convertToWav(audioBlob);
const wavBytes = await wavBlob.arrayBuffer();
// Send to Blazor
const uint8Array = new Uint8Array(wavBytes);
await dotNetRef.invokeMethodAsync(
'OnRecordingComplete',
Array.from(uint8Array));
};
mediaRecorder.start(100);
}
    function stopRecording() {
        // Stopping fires the onstop handler above, which ships the WAV to Blazor
        if (mediaRecorder && mediaRecorder.state !== 'inactive') {
            mediaRecorder.stop();
        }
    }

    function initialize(dotNetReference) {
        // Blazor passes its DotNetObjectReference so we can call back into C#
        dotNetRef = dotNetReference;
    }

    return { initialize, startRecording, stopRecording };
})();
Why WAV? Whisper expects 16kHz mono audio, and uncompressed WAV keeps the pipeline deterministic: same audio in → same bytes out → same transcript.
The convertToWav function handles the resampling; I'll spare you the WAV header bit-twiddling, but it's in the repo.
The UI shows the current field prominently on the left, with all fields visible in a sidebar on the right. The sidebar is not cosmetic; it's a trust boundary. Users can see progress, state, and intent at all times.
graph LR
subgraph "Left Panel"
A[Current Field Prompt]
B[Record Button]
C[Transcript Display]
D[Confirmation Dialog]
end
subgraph "Right Sidebar"
E[Field 1: Full Name ✓]
F[Field 2: DOB - Active]
G[Field 3: Email - Pending]
H[Field 4: Phone - Pending]
end
style F stroke:#36f,stroke-width:2px
style E stroke:#6f6,stroke-width:2px
Here's the Blazor page structure:
@page "/voiceform/{FormId}"
@inject IFormOrchestrator Orchestrator
@rendermode InteractiveServer
<div class="top-bar">
<h1>Voice Form</h1>
<button class="theme-toggle" onclick="voiceFormTheme.toggle()">
Dark Mode
</button>
</div>
<main>
<div class="voice-form-layout">
<!-- Left: Active Field -->
<div class="active-field-panel">
@if (_currentField != null)
{
<div class="current-prompt">
<h2>@_currentField.Label</h2>
<p class="prompt-text">@_message</p>
</div>
<AudioRecorder OnAudioCaptured="HandleAudioCaptured"
IsRecording="_isRecording"
IsProcessing="_isProcessing" />
@if (!string.IsNullOrEmpty(_transcript))
{
<TranscriptDisplay Transcript="_transcript"
Confidence="_transcriptConfidence" />
}
@if (_showConfirmation)
{
<ConfirmationDialog ExtractedValue="_pendingValue"
OnConfirm="HandleConfirm"
OnReject="HandleReject" />
}
}
</div>
<!-- Right: Form Overview -->
<div class="form-sidebar">
<h3>@_session.Form.Name</h3>
<div class="form-fields-list">
@foreach (var field in _session.Form.Fields)
{
var state = _session.GetFieldState(field.Id);
<div class="form-field-item @GetFieldClass(field, state)">
<div class="field-status-icon">
@GetStatusIcon(state.Status)
</div>
<div class="field-info">
<div class="field-name">@field.Label</div>
<div class="field-value">
@(state.Value ?? "Waiting")
</div>
</div>
</div>
}
</div>
</div>
</div>
</main>
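The GetFieldClass and GetStatusIcon helpers referenced in the sidebar markup live in the page's @code block. A sketch of the kind of mapping they might do (the class names and icons are placeholders):
private string GetFieldClass(FieldDefinition field, FieldState state) =>
    state.Status switch
    {
        FieldStatus.Confirmed => "completed",
        FieldStatus.InProgress => "active",
        FieldStatus.AwaitingConfirmation => "confirming",
        _ => "pending"
    };

private string GetStatusIcon(FieldStatus status) =>
    status switch
    {
        FieldStatus.Confirmed => "✓",
        FieldStatus.InProgress => "●",
        FieldStatus.AwaitingConfirmation => "?",
        _ => "○"
    };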
The orchestrator coordinates the services. Beyond a guard clause, it has no branching logic of its own; it calls the services in sequence and passes results along:
public class FormOrchestrator : IFormOrchestrator
{
private readonly ISttService _sttService;
private readonly IFieldExtractor _extractor;
private readonly IFormValidator _validator;
private readonly IFormStateMachine _stateMachine;
private readonly IFormEventLog _eventLog;
public async Task<ProcessingResult> ProcessAudioAsync(byte[] audioData)
{
var currentField = _stateMachine.GetCurrentField();
if (currentField == null)
return ProcessingResult.Error("No current field");
// Step 1: Speech to text
var sttResult = await _sttService.TranscribeAsync(audioData);
await _eventLog.LogAsync(new TranscriptReceivedEvent(
currentField.Id, sttResult.Transcript, sttResult.Confidence));
// Step 2: Extract structured value
var context = new ExtractionContext(
currentField, currentField.Prompt, sttResult.Transcript);
var extraction = await _extractor.ExtractAsync(context);
await _eventLog.LogAsync(new ExtractionAttemptEvent(
currentField.Id, extraction));
// Step 3: Validate
var validation = _validator.Validate(currentField, extraction.Value);
// Step 4: State machine decides next step
var transition = _stateMachine.ProcessExtraction(extraction, validation);
return new ProcessingResult(
Success: transition.Success,
Message: transition.Message,
Session: _stateMachine.GetSession(),
RequiresConfirmation: transition.RequiresConfirmation,
PendingValue: extraction.Value);
}
}
Each step is independent and testable. The state machine never touches the network. The LLM never touches the state.
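For completeness, the result records passed between layers can be reconstructed from how they're used above; roughly (the repo's definitions may differ):
public record StateTransitionResult(
    bool Success,
    string Message,
    bool RequiresConfirmation = false);

public record ProcessingResult(
    bool Success,
    string Message,
    FormSession? Session = null,
    bool RequiresConfirmation = false,
    string? PendingValue = null)
{
    public static ProcessingResult Error(string message) => new(false, message);
}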
Dark mode is implemented with CSS custom properties:
:root {
--bg-primary: #f8fafc;
--bg-secondary: #ffffff;
--text-primary: #1e293b;
--accent-blue: #3b82f6;
--accent-green: #22c55e;
}
[data-theme="dark"] {
--bg-primary: #0f172a;
--bg-secondary: #1e293b;
--text-primary: #f1f5f9;
--accent-blue: #60a5fa;
--accent-green: #4ade80;
}
body {
background: var(--bg-primary);
color: var(--text-primary);
}
The theme toggle persists via localStorage:
window.voiceFormTheme = (function () {
const THEME_KEY = 'voiceform-theme';
function toggle() {
const current = document.documentElement
.getAttribute('data-theme') || 'light';
const newTheme = current === 'dark' ? 'light' : 'dark';
document.documentElement.setAttribute('data-theme', newTheme);
localStorage.setItem(THEME_KEY, newTheme);
}
// Initialize from saved preference or system preference
function init() {
const saved = localStorage.getItem(THEME_KEY);
const systemDark = window.matchMedia(
'(prefers-color-scheme: dark)').matches;
const theme = saved || (systemDark ? 'dark' : 'light');
document.documentElement.setAttribute('data-theme', theme);
}
return { toggle, init };
})();
The deterministic architecture pays off in testing. Here's a state machine unit test for the happy path:
[Fact]
public void ProcessExtraction_HighConfidence_AutoConfirms()
{
// Arrange
var form = CreateTestForm();
var stateMachine = new FormStateMachine();
stateMachine.StartSession(form);
var extraction = new ExtractionResponse(
FieldId: "fullName",
Value: "John Smith",
Confidence: 0.95, // High confidence
NeedsConfirmation: false,
Reason: "Clear speech");
var validation = ValidationResult.Success();
// Act
var result = stateMachine.ProcessExtraction(extraction, validation);
// Assert
result.Success.Should().BeTrue();
result.RequiresConfirmation.Should().BeFalse();
var fieldState = stateMachine.GetSession()
.GetFieldState("fullName");
fieldState.Status.Should().Be(FieldStatus.Confirmed);
fieldState.Value.Should().Be("John Smith");
}
And here's a negative test, where low confidence forces confirmation regardless of what the LLM suggests:
[Fact]
public void ProcessExtraction_LowConfidence_RequiresConfirmation()
{
// Arrange
var form = CreateTestForm();
var stateMachine = new FormStateMachine();
stateMachine.StartSession(form);
var extraction = new ExtractionResponse(
FieldId: "fullName",
Value: "John Smith",
Confidence: 0.65, // Below threshold
NeedsConfirmation: false, // LLM says no, but policy overrides
Reason: "Noisy audio");
var validation = ValidationResult.Success();
// Act
var result = stateMachine.ProcessExtraction(extraction, validation);
// Assert
result.RequiresConfirmation.Should().BeTrue(
"Policy threshold (0.85) should override LLM suggestion");
var fieldState = stateMachine.GetSession()
.GetFieldState("fullName");
fieldState.Status.Should().Be(FieldStatus.AwaitingConfirmation);
}
And a PuppeteerSharp browser test that actually clicks things:
[Fact]
public async Task HomePage_ThemeToggle_ShouldSwitchTheme()
{
await _page!.GoToAsync(BaseUrl);
await _page.WaitForSelectorAsync(".theme-toggle");
var initialTheme = await _page.EvaluateFunctionAsync<string?>(
"() => document.documentElement.getAttribute('data-theme')");
// Click toggle
await _page.ClickAsync(".theme-toggle");
await Task.Delay(100);
var newTheme = await _page.EvaluateFunctionAsync<string?>(
"() => document.documentElement.getAttribute('data-theme')");
newTheme.Should().NotBe(initialTheme);
}
[Fact]
public async Task VoiceFormPage_Sidebar_ShouldShowAllFields()
{
await _page!.GoToAsync($"{BaseUrl}/voiceform/customer-intake");
await _page.WaitForSelectorAsync(".form-fields-list");
var fieldItems = await _page.QuerySelectorAllAsync(".form-field-item");
fieldItems.Should().HaveCount(5, "Customer intake form has 5 fields");
}
The full test suite runs 98 tests: 76 unit tests for the business logic and 22 browser integration tests for the UI.
The important thing is not the tests themselves; it's that the architecture makes them possible.
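Because the LLM sits behind IFieldExtractor, the browser tests don't need Ollama running at all. A sketch of a canned stub you could register in the test host:
public class StubFieldExtractor : IFieldExtractor
{
    private readonly Queue<ExtractionResponse> _responses = new();

    public void Enqueue(ExtractionResponse response) => _responses.Enqueue(response);

    public Task<ExtractionResponse> ExtractAsync(
        ExtractionContext context, CancellationToken ct = default)
    {
        // Hand back the next canned response, or a low-confidence null if we've run out
        var next = _responses.Count > 0
            ? _responses.Dequeue()
            : new ExtractionResponse(context.Field.Id, null, 0.0, true, "No stubbed response");

        return Task.FromResult(next);
    }
}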
Let's revisit why we built it this way:
graph TD
subgraph "Traditional Voice Form"
A1[User Speaks] --> B1[LLM]
B1 --> C1[LLM decides field]
C1 --> D1[LLM validates]
D1 --> E1[LLM confirms]
E1 --> F1[LLM advances form]
end
subgraph "Ten Commandments Approach"
A2[User Speaks] --> B2[Whisper STT]
B2 --> C2[Ollama Extract]
C2 --> D2[C# Validator]
D2 --> E2[Policy Check]
E2 --> F2[State Machine]
end
style B1 stroke:#f96,stroke-width:2px
style C1 stroke:#f96,stroke-width:2px
style D1 stroke:#f96,stroke-width:2px
style E1 stroke:#f96,stroke-width:2px
style F1 stroke:#f96,stroke-width:2px
style C2 stroke:#f96,stroke-width:2px
style D2 stroke:#6f6,stroke-width:2px
style E2 stroke:#6f6,stroke-width:2px
style F2 stroke:#6f6,stroke-width:2px
In the traditional approach, everything is the LLM. Every decision is non-deterministic. Every test is probabilistic.
Every bug is "it just sometimes does that": unreproducible and unexplainable.
In our approach, the LLM does one thing: translate speech to structured values. Everything else is deterministic C# you can debug, test, and trace.
This is Commandments II, III, and IV in action: LLM translates (no side-effects), state machine controls flow (causality in code), policy decides confirmation (probability where acceptable).
docker compose -f devdeps-docker-compose.yml up -d
docker exec -it ollama ollama pull llama3.2:3b
cd Mostlylucid.VoiceForm
dotnet run
Navigate to http://localhost:5000
Click "Customer Intake" and start speaking!
Voice interfaces don't have to be black boxes. By treating speech as just another input method and the LLM as a translator, and by keeping flow control in deterministic code, you get a system that is predictable, testable, and trustworthy.
The "Ten Commandments" aren't about being anti-AI. They're about putting AI in its proper place: a powerful tool that translates between human messiness and computer precision, not an oracle that makes decisions for us.
AI earns its place when it reduces ambiguity, not when it replaces responsibility.
Now go build something where you-not the model-decide what happens next.
Right now, the user reads prompts and speaks responses. But what if the system could speak back? Part 2 will add text-to-speech feedback, turning this into a full IVR (Interactive Voice Response) system.
The architecture is already prepared: the state machine emits events, the orchestrator coordinates services. Adding voice output is just another service that listens to those events.
The current validation is type-based: emails look like emails, dates parse as dates. But what about business rules? "Date of birth must be at least 18 years ago." "Phone number must match the region selected." "Notes field cannot contain profanity."
Part 3 will add a business-rules layer on top of those type checks.
Same principle: the LLM advises, the code decides.
The full source code is in the Mostlylucid.VoiceForm project in the repo.