# BWB Kiosk — Voice Ordering Architecture
### Deep Recursive Decomposition & Evolutionary Design
v2.0 — February 10, 2026 — Verified against codebase
---
> "Break every component down to its grills... define a subsection and a sub-subsection that further builds upon the previous section, then expands it in a recursive manner."
---
## Table of Contents
1. [Vision & Philosophy](#1-vision--philosophy)
2. [System Topology](#2-system-topology)
3. [Layer 1: Audio Foundation](#3-layer-1-audio-foundation)
4. [Layer 2: Speech-to-Text Pipeline](#4-layer-2-speech-to-text-pipeline)
5. [Layer 3: Natural Language Understanding](#5-layer-3-natural-language-understanding)
6. [Layer 4: Dialogue Engine](#6-layer-4-dialogue-engine)
7. [Layer 5: Order State Machine](#7-layer-5-order-state-machine)
8. [Layer 6: Synthesis & Feedback](#8-layer-6-synthesis--feedback)
9. [Layer 7: Interaction Surface](#9-layer-7-interaction-surface)
10. [Layer 8: Learning & Telemetry](#10-layer-8-learning--telemetry)
11. [Cross-Cutting: Error Taxonomy & Recovery](#11-cross-cutting-error-taxonomy--recovery)
12. [Cross-Cutting: Performance & Latency Budget](#12-cross-cutting-performance--latency-budget)
13. [Cross-Cutting: Privacy, Security, Offline](#13-cross-cutting-privacy-security-offline)
14. [State Machine Formal Specification](#14-state-machine-formal-specification)
15. [End-to-End Flow Traces](#15-end-to-end-flow-traces)
16. [Gap Analysis: Current vs Next-Gen](#16-gap-analysis-current-vs-next-gen)
17. [Evolution Roadmap](#17-evolution-roadmap)
18. [Codebase Map (Verified)](#18-codebase-map-verified)
---
## 1. Vision & Philosophy
### 1.1 Core Vision
A voice ordering system that feels like the best barista conversation — one that understands messy human speech, self-corrections, contextual references, and emotional undertones — delivered at kiosk speed with zero training required.
### 1.2 Design Principles
#### 1.2.1 Conversational Intelligence
##### [ip] Natural Speech Tolerance
- Disfluency handling: "um", "uh", "like", "so" stripped by `TranscriptNormalizer` (228 LOC)
- Self-correction: "large, no wait, medium" — correction markers detected: "actually", "scratch that", "wait no", "I mean", "not that"
- Partial utterances: Streaming partial transcripts handled by `TranscriptPipeline` (188 LOC) — deduplicate "med" → "medi" → "medium"
- Run-on orders: "a latte and two cappuccinos and oh also a muffin" — multi-item extraction via `EntityExtractor`
##### [ip] Contextual Inference
- Pronoun resolution (future): "Make it bigger" = upsize current item
- Ellipsis completion (future): "Same thing" = repeat last order
- Implicit slots: "And a cappuccino" after "large oat milk latte" → infer large? oat milk? (currently: no inference, defaults used)
- Session memory: Last 10 utterances tracked in `sessionHistory[]` for context-aware parsing
##### [ip] Proactive Disambiguation
- Confidence-gated clarification: `ClarificationPolicy` (336 LOC) with three tiers:
  - `strict` config: threshold 0.7, gap 0.15, max 3 clarifications/turn
  - `default` config: threshold 0.6, gap 0.1, max 2/turn
  - `lenient` config: threshold 0.45, gap 0.05, max 1/turn
- High-importance slots: `milk` and `caffeine` get elevated thresholds (0.75 vs 0.6) due to allergen/health implications
- Menu item confirmation: If confidence < threshold, ask "Did you mean a [item]?"
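The three tiers can be read as a small decision rule. The sketch below is a minimal illustration, not the actual `ClarificationPolicy` API: the type `ClarificationConfig`, its field names, the tier name `standard` (standing in for `default`, which is a Swift keyword), and the rule "clarify when confidence is below the threshold or the gap to the runner-up is too small" are all assumptions. Only the numeric values come from the list above.

```swift
import Foundation

// Hypothetical mirror of the three tiers; field names are illustrative.
struct ClarificationConfig {
    let confidenceThreshold: Double      // ask when the top candidate scores below this
    let minimumGap: Double               // ask when top and runner-up are closer than this
    let maxClarificationsPerTurn: Int    // budget of questions per turn

    static let strict = ClarificationConfig(confidenceThreshold: 0.7, minimumGap: 0.15, maxClarificationsPerTurn: 3)
    static let standard = ClarificationConfig(confidenceThreshold: 0.6, minimumGap: 0.1, maxClarificationsPerTurn: 2)
    static let lenient = ClarificationConfig(confidenceThreshold: 0.45, minimumGap: 0.05, maxClarificationsPerTurn: 1)
}

/// Assumed rule: clarify when confidence is low OR the runner-up is too close,
/// as long as the per-turn budget has not been spent.
func shouldClarify(confidence: Double,
                   gapToSecond: Double,
                   askedThisTurn: Int,
                   config: ClarificationConfig) -> Bool {
    guard askedThisTurn < config.maxClarificationsPerTurn else { return false }
    return confidence < config.confidenceThreshold || gapToSecond < config.minimumGap
}
```

Under these assumptions, a top candidate at 0.65 confidence with a clear gap to the runner-up triggers a question in the `strict` tier but not in the `standard` one.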
##### [ip] Multi-Turn Memory
- `sessionHistory: [String]` — last 10 utterances preserved
- `sessionPreferences: [String: String]` — learned preferences (e.g., "milk" → "oat")
- `lastClarification: (question: String, answer: String)?` — recent Q&A context injected into next AI parse
- `consecutiveFailures: Int` — drives recovery escalation level
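A capped history such as `sessionHistory` needs only a few lines. This is a hedged sketch: `SessionMemory`, `historyLimit`, and `addToHistory(_:)` are illustrative names (the orchestrator's own bookkeeping is not shown in this document); only the field names and the limit of 10 utterances come from the list above.

```swift
import Foundation

// Hypothetical container for the multi-turn fields listed above.
struct SessionMemory {
    static let historyLimit = 10                 // "last 10 utterances preserved"

    private(set) var history: [String] = []      // stands in for sessionHistory
    var preferences: [String: String] = [:]      // stands in for sessionPreferences
    var consecutiveFailures = 0

    /// Appends an utterance and drops the oldest entries beyond the limit.
    mutating func addToHistory(_ utterance: String) {
        history.append(utterance)
        if history.count > Self.historyLimit {
            history.removeFirst(history.count - Self.historyLimit)
        }
    }
}
```

Dropping from the front keeps the most recent context, which is the part most likely to matter for the next parse.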
#### 1.2.2 Zero-Friction Interaction
##### [ip] First-Time Success
- No onboarding, no tutorial
- Welcome prompt: "What would you like to order? Say 'help' for available commands."
- Example-based help: "To order, just say what you'd like. For example: 'I'll have a medium latte'"
##### [ip] Graceful Modality Switching
- Voice → Touch: "Let me show you the menu. You can also tap items to order." (after 3 failures)
- `KioskTouchOrderingView` always accessible via button
- `showTouchOrdering` state triggers `.fullScreenCover`
##### [ip] Speed Budget
- Target: End-of-speech → confirmation response < 2.0 seconds
- Breakdown: Utterance detection (≤1.5s) + NLU (≤100ms) + AI (≤1.5s parallel) + TTS start (≤500ms)
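- Current: Hybrid parse runs both the AI and NLU parsers and merges the results via `OrderResultMerger`. The two calls currently run sequentially rather than in parallel (see the Hybrid Strategy notes in 5.1.1).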
#### 1.2.3 Acoustic Resilience
##### [ip] Noise Handling
- Audio mode: `.voiceChat` for noise suppression + `.measurement` for speech recognition
- Buffer: 16kHz sample rate, 5ms IO buffer, mono channel
- VAD: Energy-based with adaptive threshold (currently basic)
##### [ip] Echo Prevention (CRITICAL)
- Current: `speakWithIsolation()` — pause mic → play TTS → resume mic
- Actual delay chain: 100ms cleanup + TTS duration + 300ms echo dissipation + 100ms stabilization
- Total overhead per TTS: ~500ms + TTS duration
- Gap: No AEC — simultaneous talk/listen impossible
---
## 2. System Topology
### 2.1 Architecture Layers (with verified file counts)
┌─────────────────────────────────────────────────────────────────────┐
│ LAYER 7: INTERACTION SURFACE │
│ BWB_Kiosk/Views/ (7 files, ~1200 LOC kiosk-specific) │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 6: SYNTHESIS & FEEDBACK │
│ AudioSessionManager (373 LOC) + FeedbackCoordinator (358 LOC) │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 5: ORDER STATE MACHINE │
│ CartCoordinator (452) + ConfirmationCoord (626) + SessionMgr (270) │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 4: DIALOGUE ENGINE │
│ VoiceDialogueManager (572) + ClarificationPolicy (336) │
│ + ConfirmationGenerator (334) + ContextAwareRecovery (476) │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 3: NATURAL LANGUAGE UNDERSTANDING │
│ OrderParsingPipeline (531) + AITranscriptParser (1343) │
│ + VoiceNLUEngine (1321) + EnhancedNLU (745) + ConstraintEngine(776)│
│ + IntentClassifier (186) + EntityExtractor (314) │
│ + SlotClassifier (362) + Embeddings/ (4 files, ~1400 LOC) │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 2: SPEECH-TO-TEXT PIPELINE │
│ TranscriptPipeline (188) + TranscriptNormalizer (228) │
│ + TranscriptState (136) + UtteranceCompletionDetector (344) │
│ + TranscriptStabilityTracker (125) + TranscriptPreprocessor (226) │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 1: AUDIO FOUNDATION │
│ SpeechAnalyzerService (575) + LegacyVoiceService (373) │
│ + VoiceServiceProtocol (300) + WakeWordDetector (540) │
│ + AudioSessionManager (373) │
├─────────────────────────────────────────────────────────────────────┤
│ LAYER 8: LEARNING & TELEMETRY (cross-cutting) │
│ PatternLearner (318) + FeedbackCollector (512) │
│ + LearningTypes (179) │
└─────────────────────────────────────────────────────────────────────┘
TOTAL: ~19,700 LOC across 50 Swift files in BWBCore/Voice/
+ ~1,400 LOC kiosk-specific in BWB_Kiosk/
### 2.2 The Orchestrator (Verified)
File: `BWB_Kiosk/Services/VoiceOrderingOrchestrator.swift` (~750 LOC)
The orchestrator is a `@MainActor` singleton that:
1. Owns 8 component instances (injected in `init`):
   - `TranscriptPipeline`
   - `LiveOrderPreviewGenerator`
   - `UtteranceCompletionDetector`
   - `OrderParsingPipeline(strategy: .hybrid)`
   - `CartCoordinator`
   - `ConfirmationCoordinator`
   - `FeedbackCoordinator`
   - `SessionManager`
2. Publishes 25+ `@Published` properties for UI binding
3. Coordinates the voice service (`VoiceServiceProtocol`) via async streams
4. Manages wake word detection via `WakeWordDetector`
### 2.3 Data Flow (Verified End-to-End)
🎤 Hardware Microphone
│
▼
AVAudioSession (.playAndRecord, .voiceChat, 16kHz)
│
├──── VoiceServiceProtocol.startListening() ──────┐
│ ├── SpeechAnalyzerService (iOS 26+) │
│ └── LegacyVoiceService (SFSpeechRecognizer) │
│ │
│ Two AsyncStreams produced: │
│ ├── transcriptionResults → VoiceTranscriptionResult
│ │ { transcript: String, isFinal: Bool, confidence: Float? }
│ │
│ └── voiceActivityResults → VoiceActivityResult
│ { isSpeechDetected: Bool, audioLevel: Float? }
│
▼
handleTranscriptionResult(_:)
│
├── 1. transcriptPipeline.processIncoming(text, isFinal)
│ → Updates displayTranscript (@Published)
│
├── 2. livePreviewGenerator.generatePreview(from: text)
│ → Updates livePreviews: [BWBCore.OrderPreview] (@Published)
│
├── 3. utteranceDetector.updateTranscript(text)
│ → Resets stability count if changed
│
└── 4. utteranceDetector.analyze(transcript:, isSpeechDetected:, isListening:, ...)
→ Returns UtteranceAnalysis { isComplete, confidence, reason, countdown }
│
├── NOT complete → update processingCountdown (@Published)
│
└── COMPLETE → processUtterance(transcript)
│
├── Deduplicate (skip if == lastProcessedTranscript)
├── addToSessionHistory(transcript)
│
├── [IF confirming] → handleConfirmationResponse(transcript)
│ └── confirmationCoordinator.processResponse(text)
│ → .confirmed / .rejected / .modified / .additionalOrder / .unclear
│
└── [ELSE] → parsingPipeline.parseWithAutoContext(
transcript,
cartItems: cartItems,
sessionHistory: sessionHistory,
userPreferences: sessionPreferences,
lastClarification: lastClarification,
menuItems: nil
)
│
▼
OrderParseResult { transcript, intent, items, confidence, source }
│
├── isMetaCommand? → handleMetaCommand(result)
│ └── .clearOrder / .readCart / .help / .confirm / .decline / .checkout
│ → cartCoordinator or speakWithIsolation
│
├── has items? → cartCoordinator.addToPending(validatedItems)
│ → confirmationCoordinator.startConfirmation(items, transcript, confidence)
│ → speakWithIsolation(confirmationMessage, thenListen: true)
│
└── no items? → ContextAwareRecoveryService.generateRecovery(
type: .unknownItem,
context: RecoveryContext { cart, history, failed, prefs, failureCount }
)
→ speakWithIsolation(recoveryMessage, thenListen: shouldContinue)
---
## 3. Layer 1: Audio Foundation
### 3.1 Audio Session Management
File: `BWBCore/Voice/AudioSessionManager.swift` (373 LOC)
Class: `AudioSessionManager` — `@MainActor`, singleton, `ObservableObject`
#### 3.1.1 Audio Session States (Verified Enum)
public enum AudioSessionState: Sendable {
case idle // No audio activity
case listening // Microphone active for speech recognition
case speaking // TTS playback active
case transitioning // Switching between modes
}
#### 3.1.2 AVAudioSession Configurations (Verified)
STATE: idle
Category: .playAndRecord
Mode: .voiceChat
Options: [.defaultToSpeaker, .allowBluetooth]
STATE: listening
Category: .playAndRecord
Mode: .measurement ← Optimized for recognition accuracy
Options: [.defaultToSpeaker, .allowBluetooth, .duckOthers]
Active: true (.notifyOthersOnDeactivation)
STATE: speaking
Category: .playback ← Output only, no mic
Mode: .spokenAudio ← Optimized for voice TTS
Options: [.duckOthers]
Active: true (.notifyOthersOnDeactivation)
STATE: transitioning
No changes (keep current config)
#### 3.1.3 Voice Service Callback Registration (Verified)
public func registerVoiceServiceCallbacks(
pause: @escaping () async -> Void, // voiceService.suspend()
resume: @escaping () async throws -> Void // voiceService.resume()
)
Registered by `VoiceOrderingOrchestrator.registerAudioSessionCallbacks()` at init.
#### 3.1.4 Pause/Resume Lifecycle (Verified Timing)
pauseListening():
guard !isListeningPaused && !isTransitioning
→ state = .transitioning
→ await pauseCallback() // voiceService.suspend()
→ sleep(100ms) // Audio engine cleanup
→ configureAudioSession(for: .speaking) // Switch to playback
→ state = .speaking
resumeListening():
guard isListeningPaused && !isTransitioning
→ state = .transitioning
→ sleep(postTTSListeningDelay: 300ms) // Echo dissipation
→ configureAudioSession(for: .listening) // Switch to recording
→ sleep(100ms) // Audio session stabilize
→ try await resumeCallback() // voiceService.resume()
→ state = .listening
Total isolation overhead: 100ms + TTS duration + 300ms + 100ms = 500ms + TTS duration
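The overhead arithmetic can be captured in a small value type, which makes it easy to see how much microphone downtime a given prompt costs. `IsolationTiming` and its property names are hypothetical; the three delays are the ones listed in the pause/resume lifecycle.

```swift
import Foundation

// Hypothetical model of the fixed delays around one TTS utterance.
struct IsolationTiming {
    var cleanupDelay: TimeInterval = 0.1          // audio engine cleanup after suspend
    var echoDissipationDelay: TimeInterval = 0.3  // postTTSListeningDelay
    var stabilizationDelay: TimeInterval = 0.1    // audio session settles before resume

    /// Delay added on top of the spoken audio itself.
    var fixedOverhead: TimeInterval {
        cleanupDelay + echoDissipationDelay + stabilizationDelay
    }

    /// Total time the microphone is unavailable for one utterance.
    func micDowntime(ttsDuration: TimeInterval) -> TimeInterval {
        fixedOverhead + ttsDuration
    }
}
```

With the default values, a two-second confirmation keeps the microphone closed for roughly 2.5 seconds, which is the window in which a customer's speech would be missed without AEC.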
#### 3.1.5 TTS Configuration (Verified)
voiceIdentifier: String? // nil = system default
speechRate: Float = 0.5 // 0.0-1.0
pitchMultiplier: Float = 1.0
volume: Float = 1.0
preUtteranceDelay: TimeInterval = 0.15
postUtteranceDelay: TimeInterval = 0.2
postTTSListeningDelay: TimeInterval = 0.3 // After TTS, before mic resume
Voice selection priority (verified):
1. Custom `voiceIdentifier` if set
2. Enhanced quality US English voice: `quality == .enhanced && language == "en-US"`
3. Standard US English: `AVSpeechSynthesisVoice(language: "en-US")`
### 3.2 Voice Service Protocol
File: `BWBCore/Voice/VoiceServiceProtocol.swift` (300 LOC)
#### 3.2.1 Protocol (Verified)
public protocol VoiceServiceProtocol: Sendable {
func startListening() async throws
func stopListening() async
func suspend() async
func resume() async throws
var transcriptionResults: AsyncStream<VoiceTranscriptionResult> { get }
var voiceActivityResults: AsyncStream<VoiceActivityResult> { get }
}
#### 3.2.2 Result Types (Verified)
public struct VoiceTranscriptionResult: Sendable {
public let transcript: String
public let isFinal: Bool
public let confidence: Float?
public let timestamp: Date
// + segments, language, alternatives
}
public struct VoiceActivityResult: Sendable {
public let isSpeechDetected: Bool
public let audioLevel: Float?
public let timestamp: Date
}
#### 3.2.3 Implementation Selection (Verified)
// In VoiceOrderingOrchestrator.startListening():
if #available(iOS 26.0, *) {
voiceService = SpeechAnalyzerService() // 575 LOC
isUsingEnhancedService = true
} else {
voiceService = LegacyVoiceService() // 373 LOC
isUsingEnhancedService = false
}
##### [ip] SpeechAnalyzerService (iOS 26+, 575 LOC)
- Uses new `SpeechAnalyzer` API
- On-device ML inference
- Lower latency, better accuracy
- Supports asset download for offline use
- `downloadAssetsIfNeeded()` exposed to UI
##### [ip] LegacyVoiceService (iOS 17+, 373 LOC)
- Uses `SFSpeechRecognizer` + `SFSpeechAudioBufferRecognitionRequest`
- Network-dependent for best accuracy
- Has on-device fallback (lower accuracy)
- Well-tested, stable
### 3.3 Wake Word Detection
File: `BWBCore/Voice/WakeWordDetector.swift` (540 LOC)
#### 3.3.1 State Machine (Verified)
DISABLED ──enable()──▶ ENABLED ──startListening()──▶ LISTENING
│
onWakeWordDetected
│
▼
PAUSED (during session)
│
resume()
│
▼
LISTENING
#### 3.3.2 Detection Callback (Verified)
wakeWordDetector.onWakeWordDetected = { [weak self] word in
Task { @MainActor in
Logger.voice.info("Wake word '\(word)' detected")
VoiceAudioFeedback.wakeWordDetected.play() // System sound 1057 (Tink)
VoiceHapticFeedback.wakeWord.play() // Haptics.success()
await self?.startSession()
}
}
#### 3.3.3 Session Integration
- `startSession()` calls `wakeWordDetector.pause()` → `isWakeWordListening = false`
- `endSession()` calls `wakeWordDetector.resume()` → `isWakeWordListening = true`
- Always-on when enabled, paused only during active ordering
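The transitions in 3.3.1 together with the pause/resume calls above amount to a four-state machine. The sketch below models it as a pure function. `WakeWordState`, `WakeWordEvent`, and `nextState(from:on:)` are illustrative names rather than the detector's real types, and rejecting every unlisted transition is an assumption.

```swift
import Foundation

// Hypothetical state and event types for the wake word detector.
enum WakeWordState { case disabled, enabled, listening, paused }
enum WakeWordEvent { case enable, startListening, wakeWordDetected, resume }

/// Returns the next state, or nil when the event is not valid in the current state.
func nextState(from state: WakeWordState, on event: WakeWordEvent) -> WakeWordState? {
    switch (state, event) {
    case (.disabled, .enable):            return .enabled
    case (.enabled, .startListening):     return .listening
    case (.listening, .wakeWordDetected): return .paused     // ordering session starts
    case (.paused, .resume):              return .listening  // ordering session ended
    default:                              return nil
    }
}
```

Returning `nil` for invalid pairs makes accidental double-resume or detect-while-paused bugs visible at the call site instead of silently changing state.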
### 3.4 Silence Detection & Prompting
Implemented in: `VoiceOrderingOrchestrator` (not a separate component)
#### 3.4.1 Timer Logic (Verified)
silencePromptDuration: TimeInterval = 5.0
hasPromptedForSilence: Bool = false
// Trigger conditions (all must be true):
// phase == .listening
// displayTranscript.isEmpty
// !isSpeechDetected
// !hasPromptedForSilence
#### 3.4.2 Prompt Messages (Verified)
// Cart empty:
"What would you like to order? Say 'help' for available commands."
// Cart has items:
"You have {N} {item/items} in your cart. Would you like anything else,
or say 'checkout' when ready."
#### 3.4.3 Cancellation
- Any speech detected → `cancelSilencePromptTimer()` + `hasPromptedForSilence = false`
- Session end → cancel + reset
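The trigger conditions and the two prompt variants reduce to a pair of pure functions. This is a sketch under stated assumptions: `SessionPhase` and both function names are hypothetical, and the orchestrator's real phase enum may have different cases. The conditions and the message text follow 3.4.1 and 3.4.2.

```swift
import Foundation

// Hypothetical phase enum; only `.listening` is named in the trigger conditions.
enum SessionPhase { case idle, listening, processing, confirming }

/// All four conditions from 3.4.1 must hold before the silence prompt fires.
func shouldPromptForSilence(phase: SessionPhase,
                            transcriptIsEmpty: Bool,
                            isSpeechDetected: Bool,
                            hasPromptedForSilence: Bool) -> Bool {
    phase == .listening && transcriptIsEmpty && !isSpeechDetected && !hasPromptedForSilence
}

/// Picks the prompt text based on the number of items in the cart.
func silencePrompt(cartCount: Int) -> String {
    guard cartCount > 0 else {
        return "What would you like to order? Say 'help' for available commands."
    }
    let noun = cartCount == 1 ? "item" : "items"
    return "You have \(cartCount) \(noun) in your cart. Would you like anything else, or say 'checkout' when ready."
}
```

Keeping both as pure functions means the 5-second timer only has to decide when to call them, and the logic can be unit tested without an audio session.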
---
## 4. Layer 2: Speech-to-Text Pipeline
### 4.1 Transcript Pipeline
File: `BWBCore/Voice/Pipeline/TranscriptPipeline.swift` (188 LOC)
#### 4.1.1 Pipeline Responsibilities
1. Receive raw transcript from voice service
2. Normalize (delegated to `TranscriptNormalizer`)
3. Deduplicate repeated partials
4. Track display transcript for UI
5. Record speech activity timestamps
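Of these responsibilities, deduplicating partials is the least obvious. The sketch below assumes two rules: drop exact repeats, and drop a non-final partial that is only a prefix of what is already displayed. `PartialTranscriptTracker` is a hypothetical type, and the real `TranscriptPipeline` may apply different rules.

```swift
import Foundation

// Hypothetical tracker illustrating partial-transcript deduplication.
struct PartialTranscriptTracker {
    private(set) var displayTranscript = ""

    /// Returns true when the display transcript changed.
    @discardableResult
    mutating func processIncoming(_ transcript: String, isFinal: Bool) -> Bool {
        let text = transcript.trimmingCharacters(in: .whitespacesAndNewlines)
        if text == displayTranscript { return false }            // exact repeat
        if !isFinal && displayTranscript.hasPrefix(text) {       // stale, shorter partial
            return false
        }
        displayTranscript = text
        return true
    }
}
```

For the streaming sequence "med", "medi", "medium", each new partial replaces the previous one, while a late or repeated "medi" is ignored so the UI does not flicker backwards.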
#### 4.1.2 Key Method (Verified)
func processIncoming(_ transcript: String, isFinal: Bool)
→ Updates @Published displayTranscript
→ Calls recordSpeechActivity() on speech input
### 4.2 Transcript Normalization
File: `BWBCore/Voice/Pipeline/TranscriptNormalizer.swift` (228 LOC)
#### 4.2.1 Normalization Steps
1. Lowercase
2. Whitespace normalization (collapse multiple spaces)
3. Filler word removal: um, uh, like, so, you know, basically
4. Correction marker detection (returns flag, doesn't strip)
5. Punctuation cleanup
#### 4.2.2 Correction Markers (Used by Orchestrator)
Correction: "actually", "scratch that", "wait no", "I mean", "not that"
Uncertainty: "maybe", "I think", "not sure"
Priority: "most important", "must have", "critical"
Anti-requirement: "don't want", "avoid", "skip"
### 4.3 Transcript State
File: `BWBCore/Voice/Pipeline/TranscriptState.swift` (136 LOC)
#### 4.3.1 State Types
enum TranscriptState {
case empty // No transcript received
case partial // Receiving streaming partials
case stable // Transcript hasn't changed recently
case final_ // ASR marked as final
}
### 4.4 Transcript Preprocessing
File: `BWBCore/Voice/Parsing/TranscriptPreprocessor.swift` (226 LOC)
#### 4.4.1 Purpose
Pre-processes transcript BEFORE sending to AI parser:
- Strip filler words
- Expand abbreviations
- Normalize quantities ("a couple" → "2")
- Mark correction segments
- Extract and tag modifiers
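Two of these steps, filler stripping and quantity normalization, are simple enough to sketch. Everything here is illustrative: `preprocess(_:)`, the tables, and the token-based approach are assumptions, not the contents of `TranscriptPreprocessor`. The filler list is taken from 4.2.1 and the "a couple" mapping from the list above.

```swift
import Foundation

// Illustrative tables; the real preprocessor's vocabulary is not shown in this document.
let fillerPhrases = ["you know"]
let fillerWords: Set<String> = ["um", "uh", "like", "so", "basically"]
let quantityPhrases: [(phrase: String, digits: String)] = [
    ("a couple of", "2"),   // longer phrase first so "of" is consumed
    ("a couple", "2"),
]

func preprocess(_ transcript: String) -> String {
    var text = transcript.lowercased()
    // Normalize spoken quantities into digits.
    for entry in quantityPhrases {
        text = text.replacingOccurrences(of: entry.phrase, with: entry.digits)
    }
    // Remove multi-word fillers, then single-word fillers token by token.
    for phrase in fillerPhrases {
        text = text.replacingOccurrences(of: phrase, with: " ")
    }
    // Caveat: dropping "like" and "so" blindly also damages phrases such as
    // "i'd like a latte"; a production version would need context checks.
    return text.split(separator: " ")
        .map(String.init)
        .filter { !fillerWords.contains($0) }
        .joined(separator: " ")
}
```

For example, `preprocess("Um can I get a couple of lattes")` yields `"can i get 2 lattes"` under these tables.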
### 4.5 Utterance Completion Detection
File: `BWBCore/Voice/Detection/UtteranceCompletionDetector.swift` (344 LOC)
#### 4.5.1 Analysis Result (Verified)
public struct UtteranceAnalysis: Equatable {
public let isComplete: Bool
public let confidence: Double
public let reason: CompletionReason
public let processingCountdown: TimeInterval?
public let isWaitingToProcess: Bool
}
#### 4.5.2 Completion Reasons (Verified)
public enum CompletionReason: String, Sendable {
case notReady // Not yet complete
case silenceTimeout // Silence threshold exceeded
case stableTranscript // Transcript stable for required duration
case orderEndingPhrase // "please", "thanks", etc.
case endOfItemKeyword // Item-ending keyword detected
case hardTimeout // Maximum silence timeout reached
case explicitStop // User explicitly stopped
case empty // No transcript content
case alreadyProcessed // Same as last processed transcript
}
#### 4.5.3 Multi-Signal Analysis (Verified Logic)
analyze(transcript:, isSpeechDetected:, isListening:, isSpeaking:, isProcessing:)
Guard checks (return NOT_COMPLETE if):
- transcript is empty → .empty
- transcript == lastProcessedTranscript → .alreadyProcessed
- isSpeaking (TTS playing) → .notReady
- isProcessing → .notReady
- !isListening → .notReady
Signal 1: Transcript stability
→ stabilityTracker.isStable(requiredCount: 1, interval: 0.3s)
Signal 2: Silence detection
→ !isSpeechDetected && timeSinceLastSpeech > silenceTimeout (1.0s)
Signal 3: Hard timeout
→ timeSinceLastChange > hardTimeout (1.5s)
Signal 4: Order-ending phrases
→ "please", "thanks", "that's all", "that's it"
Processing countdown:
→ If has content and silence started: show countdown (hardTimeout - elapsed)
→ Updates processingCountdown for UI display
#### 4.5.4 Configuration (Verified Defaults)
silenceTimeout: 1.0s // Silence → process
requiredStabilityCount: 1 // Checks needed
transcriptStabilityInterval: 0.3s // Between checks
hardTimeout: 1.5s // Force-process regardless
### 4.6 Transcript Stability Tracker
File: `BWBCore/Voice/Detection/TranscriptStabilityTracker.swift` (125 LOC)
#### 4.6.1 Algorithm (Verified)
updateTranscript(text):
if text != currentTranscript:
currentTranscript = text
stabilityCount = 0
lastChangeTime = now
return true // changed
checkStability():
if now - lastCheckTime > interval:
if currentTranscript == lastCheckedTranscript:
stabilityCount += 1
lastCheckedTranscript = currentTranscript
lastCheckTime = now
isStable:
stabilityCount >= requiredCount
### 4.7 Live Order Preview Generation
File: `BWBCore/Voice/Detection/LiveOrderPreviewGenerator.swift` (310 LOC)
#### 4.7.1 Streaming Preview Pipeline
Partial transcript → lightweight NLU (no AI, local only)
→ Extract: item name guess, size, temperature, modifiers
→ Compute: confidence score based on match quality
→ Return: BWBCore.OrderPreview or nil
Debounce: 150ms (VoiceOrderingConfig.shared.liveItemsDebounceInterval)
#### 4.7.2 OrderPreview Mapping (Verified)
// BWBCore.OrderPreview (internal) → displayed as VoiceParsedOrder via computed property:
public var liveItems: [VoiceParsedOrder] {
livePreviews.map { preview in
var order = VoiceParsedOrder(itemName: preview.itemName)
order.quantity = preview.quantity
order.confidence = preview.confidence
if let size = preview.size {
order.size = DrinkSize(rawValue: size.lowercased())
}
if let temp = preview.temperature {
order.temperature = DrinkTemperature(rawValue: temp.lowercased())
}
order.syrups = preview.modifiers
return order
}
}
---
## 5. Layer 3: Natural Language Understanding
### 5.1 Order Parsing Pipeline
File: `BWBCore/Voice/Pipeline/OrderParsingPipeline.swift` (531 LOC)
Class: `OrderParsingPipeline` — `@MainActor`, `ObservableObject`
#### 5.1.1 Parsing Strategies (Verified Enum + Implementations)
public enum ParsingStrategy: String, Sendable {
case aiFirst // AI → NLU fallback (if AI confidence < 0.85)
case nluFirst // NLU → AI enhancement (if NLU confidence < 0.8)
case hybrid // Both parallel → OrderResultMerger (DEFAULT)
case aiOnly // Only AI parser
case nluOnly // Only NLU parser (offline capable)
case fastest // TaskGroup race, first non-empty wins
}
##### [ip] AI-First Strategy (`parseAIFirst`)
1. If aiParser.isAvailable:
a. Run AI parser (with context if available)
b. If result.confidence >= aiConfidenceThreshold (0.85): return AI result
c. If AI found items but low confidence: run NLU, merge
2. Fallback: run NLU parser
##### [ip] NLU-First Strategy (`parseNLUFirst`)
1. Run NLU parser
2. If confidence >= 0.8: return NLU result
3. If uncertain and AI available: run AI, merge
4. Return NLU result
##### [ip] Hybrid Strategy (`parseHybrid`) — DEFAULT
1. Run AI parser (with context) — async
2. Run NLU parser — async
(Currently sequential, not truly parallel — see gap)
3. Merge via OrderResultMerger
##### [ip] Fastest Strategy (`parseFastest`)
1. TaskGroup with both parsers
2. First non-empty result wins → cancelAll()
3. If both empty → return empty
#### 5.1.2 Context Building (Verified)
static func buildContext(
cartItems: [VoiceParsedOrder] = [],
sessionHistory: [String] = [],
userPreferences: [String: String] = [:],
lastClarification: (question: String, answer: String)? = nil,
includeConstraints: Bool = true
) -> AITranscriptParser.OrderParsingContext
// constraintsSummary: String? from YAMLConstraintEngine.shared.generateConstraintSummary()
#### 5.1.3 Parse Result (Verified)
public struct OrderParseResult: Sendable {
let transcript: String
let intent: OrderParseIntent // 11 intent types
let items: [VoiceParsedOrder] // Parsed orders
let confidence: Double // 0-1
let source: OrderParseSource // .ai / .nlu / .hybrid / .fallback
let clarificationsNeeded: [ClarificationRequest]
let warnings: [String]
let processingTimeMs: Int
let pickupName: String?
let metadata: [String: String]
}
#### 5.1.4 Parse Intent Taxonomy (Verified — 11 Intents)
public enum OrderParseIntent: String, Sendable {
case order // Normal order
case clearOrder // "Start over", "clear everything"
case readCart // "What's in my order?"
case help // "What can I order?"
case confirm // "Yes", "correct"
case decline // "No", "cancel"
case modify // "Change the size"
case remove // "Remove the croissant"
case checkout // "That's it", "check out"
case repeat_ // "Say again"
case unknown // Unclassified
}
### 5.2 Result Merging
File: `BWBCore/Voice/Pipeline/OrderResultMerger.swift` (390 LOC)
#### 5.2.1 Merge Strategies (Verified)
public enum MergeStrategy: Sendable {
case preferAI // DEFAULT — use AI when confidence ≥ 0.85
case preferNLU // Use NLU when confidence ≥ 0.80
case consensusRequired // Both must agree
case itemUnion // Union of items from both
case itemIntersection // Only items both detected
}
#### 5.2.2 Merge Logic (Verified Config)
aiConfidenceThreshold: 0.85 // Prefer AI above this
nluConfidenceThreshold: 0.80 // Prefer NLU above this
consensusBoost: 0.1 // Boost when both agree
defaultStrategy: .preferAI
#### 5.2.3 Consensus Detection
When both AI and NLU detect the same item:
- Match by item name similarity
- Merge slot values (AI takes precedence for ambiguous slots)
- Boost confidence by `consensusBoost` (0.1)
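The consensus step can be sketched as a merge of two single-item results. Several simplifications are assumed and should not be read as `OrderResultMerger`'s behaviour: `ParsedItem` is a hypothetical type, name similarity is reduced to a case-insensitive equality check, AI wins every slot conflict (not only ambiguous ones), and the boost is applied to the higher of the two confidences and capped at 1.0.

```swift
import Foundation

// Hypothetical, simplified item representation.
struct ParsedItem {
    let name: String
    var confidence: Double
    var slots: [String: String]
}

/// Returns a merged item when both parsers agree on the item name, otherwise nil.
func mergeOnConsensus(ai: ParsedItem, nlu: ParsedItem, consensusBoost: Double = 0.1) -> ParsedItem? {
    // Stand-in for "item name similarity": exact match ignoring case.
    guard ai.name.lowercased() == nlu.name.lowercased() else { return nil }
    // Start from the NLU slots and let AI values overwrite on conflict.
    let slots = nlu.slots.merging(ai.slots) { _, aiValue in aiValue }
    let confidence = min(1.0, max(ai.confidence, nlu.confidence) + consensusBoost)
    return ParsedItem(name: ai.name, confidence: confidence, slots: slots)
}
```

In this sketch an AI result of "Latte" at 0.8 and an NLU result of "latte" at 0.7 merge into one item at 0.9, with slots that only NLU detected carried over.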
### 5.3 Intent Classification
File: `BWBCore/Voice/Parsing/IntentClassifier.swift` (186 LOC)
Struct: `IntentClassifier` — singleton
#### 5.3.1 Pattern Matching (Verified — All Patterns)
##### [ip] Order Patterns
"i'll have", "i want", "can i get", "give me", "i'd like",
"let me get", "i'll take", "order", "make me", "please get me"
##### [ip] Confirm Patterns
"yes", "yeah", "correct", "that's right", "sounds good",
"perfect", "confirmed", "confirm", "yep", "right"
##### [ip] Cancel Patterns
"cancel", "never mind", "forget it", "stop", "no thanks",
"don't want", "changed my mind"
##### [ip] Modify Patterns
"change", "modify", "instead", "actually", "make it",
"switch", "different"
##### [ip] Remove Patterns (18 patterns)
"remove the", "remove my", "remove a", "remove",
"take off the", "take off", "take away",
"delete the", "delete", "get rid of", "drop the",
"i don't want the", "don't want the", "don't want",
"scratch the", "scratch that",
"cancel the", "never mind the"
##### [ip] Checkout Patterns
"checkout", "check out", "pay", "that's all", "done ordering",
"finished", "ready to pay", "complete", "i'm done", "im done", "done"
##### [ip] Help Patterns (28 patterns, most extensive)
"help", "help me", "what do you have", "menu", "options", "recommend",
"what can i order", "what can i get", "what's available", "whats available",
"what drinks do you have", "what drinks", "show me", "tell me what",
"what's on the menu", "what do you sell", "what do you serve",
"popular", "best seller", "top seller", "most popular",
"suggestions", "suggest", "any recommendations", "i don't know what",
"not sure what", "what should i get", "what should i order"
##### [ip] Repeat Patterns
"repeat", "say again", "what was that", "pardon", "sorry"
##### [ip] Clear Order Patterns
"clear my order", "clear the order", "clear everything",
"start over", "start fresh", "reset",
"cancel everything", "cancel my order", "never mind", "forget it",
"empty the cart", "remove everything"
##### [ip] Cart Inquiry Patterns
"what's in my cart", "what did i order", "read my order back",
"what do i have", "show my order", "repeat my order", "tell me my order",
"what have i got", "read back my order", "what's my order", "whats my order"
#### 5.3.2 Classification Priority (Verified Order)
1. Clear order (meta) → (.cancel, 1.0)
2. Cart inquiry (meta) → (.help, config.helpIntentConfidence)
3. Cancel → (.cancel, config.cancelIntentConfidence)
4. Confirm → (.confirm, config.confirmIntentConfidence)
5. Checkout → (.checkout, config.checkoutIntentConfidence)
6. Remove → (.remove, config.removeIntentConfidence)
7. Modify → (.modify, config.modifyIntentConfidence)
8. Help → (.help, config.helpIntentConfidence)
9. Repeat → (.repeat_, config.repeatIntentConfidence)
10. Order (explicit) → (.order, config.orderIntentExplicitConfidence)
11. Order (implicit) → (.order, config.orderIntentImplicitConfidence) [if menu matches]
12. Unknown → (.unknown, config.unknownIntentConfidence)
### 5.4 Entity Extraction
File: `BWBCore/Voice/Parsing/EntityExtractor.swift` (314 LOC)
Struct: `EntityExtractor` — singleton
#### 5.4.1 Slot Extraction Methods (All Verified)
##### [ip] Size Extraction
let sizeAliases: [String: DrinkSize] = [
// Small aliases (8oz)
"small", "short", "8oz", "8 oz", "tall", "little" → .small
// Medium aliases (12oz)
"medium", "regular", "12oz", "12 oz", "grande", "normal" → .medium
// Large aliases (16oz)
"large", "big", "16oz", "16 oz", "venti", "20oz", "20 oz",
"extra large" → .large
]
##### [ip] Temperature Extraction
let temperatureAliases: [String: DrinkTemperature] = [
"hot", "warm", "heated", "steaming" → .hot
"iced", "ice", "cold", "chilled", "frozen", "on ice" → .iced
]
Note: `.blended` defined in `DrinkTemperature` enum (with aliases "blended", "frozen", "frappe", "frappuccino", "smoothie") but NOT in EntityExtractor's aliases. Gap: "blended" won't be extracted.
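One possible way to close this gap is an ordered alias table that includes the blended aliases and is checked before the iced and hot ones. This is a sketch only: the local `DrinkTemperature` enum stands in for the real one, `extractTemperature(from:)` is a hypothetical function, and "frozen" is left mapped to `.iced` as in the current extractor because the two alias lists disagree on it.

```swift
import Foundation

// Local stand-in for the real enum, which this document describes but does not show in full.
enum DrinkTemperature: String { case hot, iced, blended }

// Ordered so blended aliases win over iced/hot; multi-word aliases come before shorter ones.
let temperatureAliases: [(alias: String, value: DrinkTemperature)] = [
    ("blended", .blended), ("frappuccino", .blended), ("frappe", .blended), ("smoothie", .blended),
    ("on ice", .iced), ("iced", .iced), ("chilled", .iced), ("frozen", .iced), ("cold", .iced), ("ice", .iced),
    ("steaming", .hot), ("heated", .hot), ("warm", .hot), ("hot", .hot),
]

/// Whole-word match against the alias table; returns the first hit in table order.
func extractTemperature(from text: String) -> DrinkTemperature? {
    let padded = " " + text.lowercased() + " "   // padding gives cheap word boundaries
    return temperatureAliases.first { padded.contains(" \($0.alias) ") }?.value
}
```

The space padding avoids false positives such as matching "ice" inside "nice", though it does not handle punctuation attached to a word.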
##### [ip] Milk Extraction
let milkAliases: [String: MilkType] = [
"whole", "regular milk", "full fat" → .whole
"skim", "nonfat", "non fat", "skinny", "fat free" → .skim
"oat", "oat milk", "oatmilk", "oatly" → .oat
"almond", "almond milk", "almondmilk" → .almond
"soy", "soy milk", "soymilk" → .soy
"coconut", "coconut milk", "coco" → .coconut
"2%", "2 percent", "two percent" → .twoPercent
"lactose free", "lactose-free" → .lactoseFree
]
Enhancement: Also supports "with X milk" regex pattern.
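The exact regex is not shown in this document, so the pattern below is an assumption about what a "with X milk" matcher could look like. `milkPhrase(in:)` is a hypothetical helper that returns the captured phrase, which would then be looked up in the alias table above.

```swift
import Foundation

/// Returns the text between "with" and "milk" (lowercased), or nil when the phrase is absent.
func milkPhrase(in text: String) -> String? {
    // Assumed pattern: lazily capture letters, digits, %, hyphen, and spaces.
    let pattern = #"\bwith\s+([a-z0-9%\- ]+?)\s+milk\b"#
    guard let regex = try? NSRegularExpression(pattern: pattern, options: [.caseInsensitive]) else {
        return nil
    }
    let fullRange = NSRange(text.startIndex..., in: text)
    guard let match = regex.firstMatch(in: text, options: [], range: fullRange),
          let captured = Range(match.range(at: 1), in: text) else {
        return nil
    }
    return text[captured].lowercased()
}
```

The lazy quantifier lets multi-word values such as "lactose free" be captured while still stopping at the first "milk".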
##### [ip] Caffeine Extraction
let caffeineAliases: [String: CaffeineOption] = [
"regular", "normal", "caffeinated" → .regular
"decaf", "decaffeinated", "no caffeine" → .decaf
"half caf", "half-caf", "halfcaf", "half caff", "split shot" → .halfCaf
]
##### [ip] Syrup Extraction
let syrupKeywords = [
"vanilla", "caramel", "hazelnut", "mocha", "chocolate",
"lavender", "honey", "maple", "pumpkin spice", "pumpkin",
"cinnamon", "peppermint", "mint", "raspberry", "almond"
]
##### [ip] Shots Extraction
"triple shot" / "3 shots" → 3
"double shot" / "2 shots" / "extra shot" → 2
"single shot" / "1 shot" → 1
"quad" / "4 shots" → 4
##### [ip] Extras Extraction
"whipped cream" / "whip" → "Whipped Cream"
"extra foam" → "Extra Foam"
"light foam" → "Light Foam"
"no foam" → "No Foam"
"light ice" → "Light Ice"
"no ice" → "No Ice"
"extra ice" → "Extra Ice"
##### [ip] Quantity Extraction
Number words: one/a/an→1, two/couple/pair→2, three→3, four→4,
five→5, six→6, seven→7, eight→8, nine→9, ten→10
Digits: regex \\b(\\d+)\\b, capped at config.maxQuantity
Default: 1
#### 5.4.2 Full Slot Extraction (Verified)
func extractAllSlots(from text: String, itemIndex: Int)
-> (order: VoiceParsedOrder, predictions: [String: SlotPrediction])
Returns a `VoiceParsedOrder` with ALL slots filled + `SlotPrediction` objects for confidence tracking.
### 5.5 Voice NLU Engine
File: `BWBCore/Voice/VoiceNLUEngine.swift` (1,321 LOC)
#### 5.5.1 Processing Pipeline
process(transcript:) → NLUResult
│
├── 1. Normalize transcript
├── 2. IntentClassifier.classify(text, hasMenuMatches)
├── 3. MenuAliasMatcher.findMatches(text)
├── 4. EntityExtractor.extractAllSlots(text)
├── 5. QuantityExtractor.extract(text)
├── 6. ModifierDetector.detect(text)
├── 7. Assemble VoiceParsedOrder[]
├── 8. ConfidenceScorer.score(items, intent)
└── 9. Return NLUResult { transcript, intent, confidence, parsedOrders, slotPredictions }
### 5.6 Enhanced NLU Engine
File: `BWBCore/Voice/EnhancedVoiceNLUEngine.swift` (745 LOC)
#### 5.6.1 Enhancements Over Base NLU
- Multi-item extraction (split on "and", "also", "plus")
- Relative modifiers ("make it bigger" → upsize)
- Better scoring with `ConfidenceScorer`
- Slot prediction with alternatives
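Multi-item extraction by splitting on conjunctions can be sketched as a token scan. The function name and the token-based approach are assumptions; the separator words are the three listed above.

```swift
import Foundation

/// Splits a transcript into per-item segments at "and", "also", and "plus".
func splitIntoItemSegments(_ transcript: String) -> [String] {
    let separators: Set<String> = ["and", "also", "plus"]
    var segments: [String] = []
    var current: [String] = []

    for word in transcript.lowercased().split(separator: " ").map(String.init) {
        if separators.contains(word) {
            // Close the running segment; consecutive separators ("and also") add nothing.
            if !current.isEmpty {
                segments.append(current.joined(separator: " "))
                current = []
            }
        } else {
            current.append(word)
        }
    }
    if !current.isEmpty {
        segments.append(current.joined(separator: " "))
    }
    // Caveat: phrases such as "half and half" would be split incorrectly by this sketch.
    return segments
}
```

Each returned segment can then go through slot extraction on its own, so that "large" in one segment does not leak into the next item.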
### 5.7 AI Transcript Parser
File: `BWBCore/Voice/AITranscriptParser.swift` (1,343 LOC)
#### 5.7.1 Provider Abstraction
AIOrderParser (130 LOC) → wraps AITranscriptParser
├── OpenAI prompts: OpenAIPrompts.swift (209 LOC)
└── Gemini prompts: GeminiPrompts.swift (162 LOC)
#### 5.7.2 AI Parse Intent (Separate from OrderParseIntent)
public enum AIParseIntent: String, Codable, Sendable {
case order, clearOrder, readCart, help, confirm, decline, modify
}
#### 5.7.3 Context Injection (Verified)
struct OrderParsingContext {
cartItems: [VoiceParsedOrder]
sessionHistory: [String]
constraintsSummary: String? // From YAMLConstraintEngine
userPreferences: [String: String]
lastClarification: (question: String, answer: String)?
}
### 5.8 Slot Classification System
Files: `BWBCore/Voice/Slots/` (3 files, ~886 LOC total)
5.8.1 Slot Types (Verified Enum)
public enum SlotType: String, Codable, Sendable, CaseIterable {
case size, temperature, milk, caffeine, shots, syrup, quantity, menuItem
var isHighImportance: Bool {
self == .milk || self == .caffeine // Allergen/health
}
var confidenceThreshold: Double {
switch self {
case .milk, .caffeine: return 0.7
case .menuItem: return 0.6
default: return 0.5
}
}
}
5.8.2 Slot Definition Structure
struct SlotDefinition {
let type: SlotType
let classes: [SlotClass] // Possible values with keywords
let defaultValue: String?
let isRequired: Bool
let isHighImportance: Bool
}
struct SlotClass {
let value: String // "small"
let keywords: [String] // ["small", "short", "tall"]
let confidence: Double // Base confidence for this class
}
5.8.3 Enhanced Slot Prediction
struct EnhancedSlotPrediction {
let slotType: SlotType
let predictedValue: String
let displayValue: String
let confidence: Double
let gapToSecond: Double // Confidence gap to #2 candidate
let isExplicit: Bool // Was explicitly mentioned
let alternatives: [Alternative]
}
### 5.9 Constraint Engine
File: `BWBCore/Voice/ConstraintEngine.swift` (776 LOC)
5.9.1 Constraint Validation Pipeline
Parsed order → ConstraintEngine.validate(order, against: constraints)
│
├── Item exists in menu?
├── Size available for this item?
├── Temperature valid? (e.g., no hot cold brew)
├── Milk compatible?
├── Modifier compatible?
├── Quantity within limits? (capped at 10)
│
├── ALL PASS → continue to cart
├── SOFT VIOLATION → auto-correct + inform
└── HARD VIOLATION → clarification needed
### 5.10 YAML Constraint Engine
File: `BWBCore/Voice/Constraints/YAMLConstraintEngine.swift` (746 LOC)
#### 5.10.1 Purpose
- Data-driven constraint definitions (no code changes for menu updates)
- `generateConstraintSummary()` → injected into AI parse context
- Hot-reloadable constraints
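To illustrate what "data-driven" means here, the sketch below keeps a rule as plain data and checks an order against it. Everything in it is hypothetical: `TemperatureRule`, `ValidationOutcome`, and `validate` are illustrative names rather than types from `YAMLConstraintEngine`, and treating "hot cold brew" as a hard violation is an assumption.

```swift
// Hypothetical rule shape: which temperatures an item does not support.
struct TemperatureRule {
    let item: String
    let disallowed: Set<String>
}

enum ValidationOutcome: Equatable {
    case pass
    case hardViolation(String)
}

// In a data-driven engine this list would be loaded from a file, so a menu
// change edits data rather than code.
let sampleRules = [
    TemperatureRule(item: "cold brew", disallowed: ["hot"]),
]

func validate(item: String, temperature: String,
              rules: [TemperatureRule]) -> ValidationOutcome {
    for rule in rules where rule.item == item.lowercased() {
        if rule.disallowed.contains(temperature.lowercased()) {
            return .hardViolation("\(item) is not available \(temperature)")
        }
    }
    return .pass
}
```

Because the rule list is just a value, swapping in a freshly parsed list at runtime is what would make such constraints hot-reloadable.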
### 5.11 Constraint Types
File: `BWBCore/Voice/Constraints/ConstraintTypes.swift` (623 LOC)
5.11.1 Violation Types
// Soft violations → auto-correct
// Hard violations → block + clarify
// Warnings → proceed but inform
### 5.12 Menu Matching & Embeddings
Directory: `BWBCore/Voice/Embeddings/` (4 files, ~1,400 LOC)
#### 5.12.1 Menu Alias Matching
File: `BWBCore/Voice/Detection/MenuAliasMatcher.swift` (256 LOC)
"flat white" → Flat White (exact match)
"flat wite" → Flat White (fuzzy, Levenshtein ≤ 2)
"cortado" → Cortado (exact)
#### 5.12.2 Text Embeddings
File: `BWBCore/Voice/Embeddings/TextEmbedder.swift` (357 LOC)
Text → Embedding vector
Used for semantic similarity when exact/fuzzy matching fails
#### 5.12.3 Vector Index
File: `BWBCore/Voice/Embeddings/VectorIndex.swift` (297 LOC)
Query embedding → Cosine similarity search → Top-K menu matches
#### 5.12.4 Menu Document
File: `BWBCore/Voice/Embeddings/MenuDocument.swift` (365 LOC)
struct MenuDocument {
let itemName: String
let description: String
let category: String
let tags: [String]
// + embedding vector
}
struct MenuSearchResult {
let document: MenuDocument
let confidence: Double
}
### 5.13 Modifier & Quantity Detection
File: `BWBCore/Voice/Detection/ModifierDetector.swift` (261 LOC)
File: `BWBCore/Voice/Detection/QuantityExtractor.swift` (123 LOC)
5.13.1 Modifier Categories
Addition: "with...", "add...", "extra..."
Removal: "without...", "no...", "hold the..."
Substitution: "instead of...", "swap..."
5.13.2 Quantity Safety (Verified in Orchestrator)
// VoiceOrderingOrchestrator.handleParseResult():
if validated.quantity > 10 {
Logger.voice.warning("⚠️ Suspicious quantity: \(validated.quantity). Capping at 10.")
validated.quantity = 10
}
if validated.quantity < 1 {
validated.quantity = 1
}
### 5.14 Confidence Scoring
File: `BWBCore/Voice/Parsing/ConfidenceScorer.swift` (230 LOC)
5.14.1 Scoring Factors
Base confidence from intent classification
+ Slot fill rate bonus (more slots filled → higher)
+ Menu match quality bonus
+ Consensus bonus (AI + NLU agree)
- Ambiguity penalty (close alternatives)
- Correction penalty (self-correction detected)
---
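The scoring factors in 5.14.1 can be read as an additive formula clamped to the 0.0-1.0 range. The sketch below shows that shape only. The weights are assumptions: the +0.10 consensus bonus and the 0.10 ambiguity gap echo values quoted elsewhere in this document, while the remaining numbers and all type names are hypothetical and not taken from `ConfidenceScorer.swift`.

```swift
struct ScoringInput {
    var intentConfidence: Double      // Base confidence from intent classification
    var filledSlots: Int
    var totalSlots: Int
    var menuMatchScore: Double        // 0.0-1.0 quality of the menu match
    var aiAndNLUAgree: Bool
    var gapToSecond: Double           // Confidence gap to the #2 candidate
    var selfCorrectionDetected: Bool
}

// Hypothetical weights, chosen for illustration only.
struct ScoringWeights {
    var slotFillBonus = 0.05
    var menuMatchBonus = 0.05
    var consensusBonus = 0.10
    var ambiguityGap = 0.10
    var ambiguityPenalty = 0.10
    var correctionPenalty = 0.05
}

func scoreOrder(_ input: ScoringInput,
                weights: ScoringWeights = ScoringWeights()) -> Double {
    var total = input.intentConfidence
    if input.totalSlots > 0 {
        // Bonus scales with the fraction of slots that were filled.
        let fillRate = Double(input.filledSlots) / Double(input.totalSlots)
        total += weights.slotFillBonus * fillRate
    }
    total += weights.menuMatchBonus * input.menuMatchScore
    if input.aiAndNLUAgree { total += weights.consensusBonus }
    if input.gapToSecond < weights.ambiguityGap { total -= weights.ambiguityPenalty }
    if input.selfCorrectionDetected { total -= weights.correctionPenalty }
    return min(max(total, 0.0), 1.0) // Keep the result a valid confidence
}
```

Clamping matters because the bonuses can push a strong parse above 1.0, as happens when a 0.9 intent score collects every bonus.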
6. Layer 4: Dialogue Engine
### 6.1 Voice Dialogue Manager
File: `BWBCore/Voice/VoiceDialogueManager.swift` (572 LOC)
Class: `VoiceDialogueManager` — `@MainActor`, singleton, `ObservableObject`
6.1.1 State Machine (Verified)
public enum VoiceDialogueState: String, Codable, Sendable {
case idle, listening, processing, clarifying, confirming, complete, error
}
6.1.2 Dependencies
private let nluEngine: VoiceNLUEngine
private let constraintEngine: ConstraintEngine
6.1.3 Intent Handlers (Verified)
processTranscript(_:) → VoiceDialogueState
switch result.intent:
case .order → handleOrderIntent(result)
case .confirm → handleConfirmIntent(previousState:)
case .cancel → handleCancelIntent(previousState:)
case .modify → handleModifyIntent(result, previousState:)
case .remove → handleRemoveIntent(result)
case .checkout → handleCheckoutIntent()
case .help → handleHelpIntent()
case .repeat_ → handleRepeatIntent()
case .unknown → handleUnknownIntent(result)
Note: The `VoiceDialogueManager` is a standalone component in BWBCore that can manage its own cart. In the Kiosk app, the `VoiceOrderingOrchestrator` handles dialogue flow directly using its own components (CartCoordinator, ConfirmationCoordinator) rather than delegating to VoiceDialogueManager. This creates a partial duplication; see Gap Analysis.
### 6.2 Clarification Policy
File: `BWBCore/Voice/Dialogue/ClarificationPolicy.swift` (336 LOC)
Class: `ClarificationPolicy` — `@unchecked Sendable`
6.2.1 Configuration Presets (Verified)
// Default
confidenceThreshold: 0.6
gapThreshold: 0.1 // Gap between #1 and #2 candidate
highImportanceSlots: [.milk, .caffeine]
highImportanceThreshold: 0.75
maxClarificationsPerTurn: 2
clarifyMenuItem: true
// Strict
confidenceThreshold: 0.7, gapThreshold: 0.15, highImportanceThreshold: 0.85, max: 3
// Lenient
confidenceThreshold: 0.45, gapThreshold: 0.05, highImportanceThreshold: 0.6, max: 1
6.2.2 Clarification Decision Logic (Verified)
shouldClarify(_ prediction: EnhancedSlotPrediction) -> Bool:
let threshold = isHighImportance ? highImportanceThreshold : confidenceThreshold
Rule 1: prediction.confidence < threshold → YES
Rule 2: prediction.gapToSecond < gapThreshold → YES (too ambiguous)
Rule 3: isHighImportance && isExplicit && confidence < 0.9 → YES
Otherwise: NO
6.2.3 Question Generation (Verified Strategies)
HIGH IMPORTANCE (allergen):
Milk: "What type of milk would you like? This is important for allergen information."
Caffeine: "Would you like regular, decaf, or half-caf? Just want to make sure."
AMBIGUOUS (gap < threshold):
"Did you want [option1] or [option2]?"
LOW CONFIDENCE:
"I heard [value]. Is that right?"
DEFAULT:
"What [slotName] would you like?"
6.2.4 Clarification Context Tracking (Verified)
struct ClarificationContext: Sendable {
var askedClarifications: [String: ClarificationRequest]
var receivedResponses: [String: String]
var turnCount: Int
func alreadyAsked(_ slotType: SlotType) -> Bool
var pendingClarifications: [ClarificationRequest]
}
6.2.5 Response Processing (Verified)
processResponse(_ response:, for clarification:, slotDefinitions:)
-> (value: String, confidence: Double)?
1. Direct match: response == option → (option, 0.95)
2. Partial match: contains → (option, 0.8)
3. Slot keyword match: → (value, 0.75)
4. No match: → nil
### 6.3 Confirmation Generation
File: `BWBCore/Voice/Dialogue/ConfirmationGenerator.swift` (334 LOC)
6.3.1 Confirmation Styles
Single item, high confidence:
"Got it — one large iced latte with vanilla. Anything else?"
Multiple items:
"So that's two large lattes and a blueberry muffin. Sound right?"
After correction:
"Changed to medium. One medium iced latte with vanilla. Good?"
Cart summary (for readCart intent):
"You have: one large iced oat latte with vanilla, and one chocolate croissant."
### 6.4 Confirmation Coordinator
File: `BWBCore/Voice/Coordination/ConfirmationCoordinator.swift` (626 LOC)
Class: `ConfirmationCoordinator` — `@MainActor`, `ObservableObject`
6.4.1 Confirmation Response Types (Verified)
enum ConfirmationResponse {
case confirmed // "Yes", "correct"
case rejected // "No", "cancel"
case modified(String) // "Change the size..."
case additionalOrder(String) // "And also a croissant"
case unclear // Can't classify
case ignored // No response (timeout)
}
6.4.2 Auto-Confirm Logic (Verified)
// Configuration:
autoConfirmMinConfidence: 0.85 // Minimum confidence to auto-confirm
autoConfirmCountdownDuration: 3.0 // Seconds of visual countdown
autoConfirmEnabled: false // DISABLED by default
// Trigger conditions (all must be true):
// autoConfirmEnabled == true
// confidence >= minConfidence
// no pending clarification
// no speech detected for countdown duration
6.4.3 Delegate Pattern (Verified)
public protocol ConfirmationCoordinatorDelegate: AnyObject {
func confirmationDidAccept(_ orders: [VoiceParsedOrder])
func confirmationDidReject(_ orders: [VoiceParsedOrder])
func confirmationDidRequestModification(_ reason: String)
func confirmationCountdownDidUpdate(_ remaining: TimeInterval)
func confirmationDidAutoConfirm(_ orders: [VoiceParsedOrder])
}
### 6.5 Context-Aware Recovery
File: `BWBCore/Voice/Coordination/ContextAwareRecoveryService.swift` (476 LOC)
Class: `ContextAwareRecoveryService` — singleton
6.5.1 Recovery Types (Verified — 11 Types)
public enum RecoveryType: String, Sendable {
case unknownItem // Can't identify item
case ambiguousItem // Multiple interpretations
case invalidCombination // Bad item+modifier combo
case unavailableItem // Out of stock/seasonal
case emptyTranscript // Nothing recognized
case lowConfidence // Parsed but very uncertain
case generalHelp // User asked for help
case menuHelp // Wants to see menu
case sizeHelp // Help with sizes
case milkHelp // Help with milk options
case customizationHelp // Help with customizations
}
6.5.2 Recovery Response (Verified)
public struct RecoveryResponse: Sendable {
let message: String
let suggestions: [String]
let shouldContinueListening: Bool
let isFatal: Bool
let recoveryType: RecoveryType
}
6.5.3 Escalation Model (Verified in Orchestrator)
consecutiveFailures: 0 → Normal parsing
consecutiveFailures: 1 → "I didn't catch that. Could you repeat?"
consecutiveFailures: 2 → "Try saying something like 'a medium iced latte'"
consecutiveFailures: 3+ → "Let me show you the menu. You can tap items to order."
### 6.6 Session Management
File: `BWBCore/Voice/Coordination/SessionManager.swift` (270 LOC)
6.6.1 Session Lifecycle (Verified)
startSession() → UUID
→ isSessionActive = true
→ Start timeout timer (120 seconds)
→ Delegate: sessionDidStart(sessionId:)
endSession()
→ isSessionActive = false
→ Cancel timeout timer
→ Delegate: sessionDidEnd(sessionId:, duration:)
// Timeout:
sessionTimeoutInterval: 120.0 // 2 minutes of inactivity
→ Delegate: sessionDidTimeout(sessionId:)
→ Orchestrator calls endSession()
6.6.2 Session Tracking in Orchestrator (Verified)
// lastSessionId: UUID? — tracks current session
// isNewSession check prevents cart clearing on resume:
let isNewSession = lastSessionId != sessionId
if isNewSession {
cartCoordinator.clearAll() // Fresh start
} else {
// Resuming — preserve cart
}
---
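The session-tracking check in 6.6.2 can be exercised in isolation with the sketch below. It is a simplified model, not the orchestrator: `SessionCartTracker` is a hypothetical type, the cart is reduced to an array of strings, and updating `lastSessionId` inside `begin(sessionId:)` is an assumption about where that assignment happens.

```swift
import Foundation

// Hypothetical stand-in for the orchestrator's session and cart bookkeeping.
final class SessionCartTracker {
    private(set) var lastSessionId: UUID?
    private(set) var cart: [String] = []

    func add(_ item: String) {
        cart.append(item)
    }

    /// Returns true when the cart was cleared because a new session began.
    @discardableResult
    func begin(sessionId: UUID) -> Bool {
        let isNewSession = lastSessionId != sessionId
        if isNewSession {
            cart.removeAll() // Fresh start
        }
        // Otherwise this is a resume and the cart is preserved.
        lastSessionId = sessionId
        return isNewSession
    }
}
```

Calling `begin(sessionId:)` twice with the same UUID models a resume after a TTS pause and leaves the cart untouched, while a new UUID empties it.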
7. Layer 5: Order State Machine
### 7.1 Cart Coordinator
File: `BWBCore/Voice/Coordination/CartCoordinator.swift` (452 LOC)
Class: `CartCoordinator` — `@MainActor`, `ObservableObject`
7.1.1 Dual-Track Cart Model (Verified)
@Published confirmedOrders: [VoiceParsedOrder] // Accepted items
@Published pendingOrders: [VoiceParsedOrder] // Awaiting confirmation
@Published pendingClarification: ClarificationRequest?
7.1.2 Operations (Verified)
addToPending([items]) // New parsed items → pending
confirmPending() // Pending → confirmed
clearPending() // Discard pending (rejected)
clearAll() // Empty everything
removeItems(at: IndexSet) // Remove by index
getAllOrders() → [VoiceParsedOrder] // All confirmed
exportToOrderItems() → [OrderItem] // For checkout
7.1.3 Delegate (Verified)
protocol CartCoordinatorDelegate: AnyObject {
func cartDidAddPendingItems(_ items: [VoiceParsedOrder])
func cartDidConfirmItems(_ items: [VoiceParsedOrder])
func cartDidRejectItems(_ items: [VoiceParsedOrder])
func cartDidClear()
func cartDidEncounterViolations(_ violations: [String])
}
7.2 VoiceParsedOrder (Verified — Complete Model)
public struct VoiceParsedOrder: Codable, Sendable {
let itemId: String?
let itemName: String
var size: DrinkSize? // .small/.medium/.large/.extraLarge
var temperature: DrinkTemperature? // .hot/.iced/.blended
var milk: MilkType? // 9 types including .none
var caffeine: CaffeineOption? // .regular/.decaf/.halfCaf
var syrups: [String] // ["vanilla", "caramel"]
var shots: Int? // 1-4
var modifiers: [String] // ["No Foam", "Extra Ice"]
var quantity: Int // 1-10 (capped)
var confidence: Double // 0.0-1.0
var metadata: [String: String] // Extensible
}
7.3 Drink Attribute Enums (Verified — All Aliases)
7.3.1 DrinkSize (with dual alias systems)
DrinkSize:
.small (8oz) — aliases: small, s, 8, 8oz, eight, short, tall
.medium (12oz) — aliases: medium, m, 12, 12oz, twelve, regular, grande
.large (16oz) — aliases: large, l, 16, 16oz, sixteen, big, venti
.extraLarge (20oz) — aliases: extra large, xl, 20, 20oz, twenty, trenta
NOTE: Two alias systems exist:
1. DrinkSize.fromAlias(_:) — static method on enum (comprehensive)
2. EntityExtractor.sizeAliases — dictionary (subset, maps "extra large"→.large NOT .extraLarge)
GAP: Inconsistent mapping. EntityExtractor maps "extra large" to .large, not .extraLarge ("venti" maps to .large in both systems)
7.3.2 DrinkTemperature
DrinkTemperature:
.hot — aliases: hot, warm, heated, steaming
.iced — aliases: iced, ice, cold, on ice, chilled, over ice
.blended — aliases: blended, frozen, frappe, frappuccino, smoothie
GAP: EntityExtractor only maps hot/warm/heated/steaming and iced/ice/cold/chilled/frozen/on ice
"blended", "frappe", "frappuccino", "smoothie" NOT in EntityExtractor — only in enum's fromAlias()
7.3.3 MilkType (9 types)
MilkType:
.whole, .skim, .twoPercent, .oat, .almond, .soy, .coconut, .lactoseFree, .none
Each with comprehensive aliases in both enum.fromAlias() and EntityExtractor.milkAliases
7.3.4 CaffeineOption
CaffeineOption:
.regular — aliases: regular, normal, caffeinated, full caffeine
.decaf — aliases: decaf, decaffeinated, no caffeine, caffeine free
.halfCaf — aliases: half caf, half-caf, halfcaf, half caffeine, split
7.4 Checkout Flow (Verified)
// In Orchestrator:
func exportCartForCheckout() -> [OrderItem] {
cartCoordinator.exportToOrderItems()
}
func finalizeOrder() -> [OrderItem] {
let items = exportCartForCheckout()
clearCart()
endSession()
return items
}
// In View:
.sheet(isPresented: $showingPayment) {
KioskPaymentView(orderItems: voiceService.finalizeOrder()) { completedOrder in
order = completedOrder
dismiss()
}
}
---
8. Layer 6: Synthesis & Feedback
### 8.1 Feedback Coordinator
File: `BWBCore/Voice/Coordination/FeedbackCoordinator.swift` (358 LOC)
Class: `FeedbackCoordinator` — `@MainActor`, `ObservableObject`
8.1.1 Audio Feedback (Verified — System Sound IDs)
public enum VoiceAudioFeedback {
case wakeWordDetected // SystemSound 1057 (Tink)
case itemAdded // SystemSound 1104 (Pop)
case orderConfirmed // SystemSound 1025 (Success)
case clarificationNeeded // SystemSound 1315 (Alert)
case clarificationReceived // SystemSound 1104 (Pop)
case helpProvided // SystemSound 1114 (Informational)
case error // SystemSound 1053 (Error)
case errorRecovery // SystemSound 1007 (Soft notification)
case listening // SystemSound 1306 (Click)
case stopped // SystemSound 1306 (Click)
}
8.1.2 Haptic Feedback (Verified)
public enum VoiceHapticFeedback {
case selection // Haptics.selection()
case itemAdded // Haptics.light()
case orderConfirmed // Haptics.success()
case error // Haptics.error()
case warning // Haptics.warning()
case wakeWord // Haptics.success()
case prompt // Haptics.light()
}
8.1.3 Convenience Methods (Verified)
playStartListening() // medium haptic + listening sound
playStopListening() // light haptic
playWakeWordDetected() // sound 1057 + success haptic
playItemAdded() // sound 1104 + light haptic
playOrderConfirmed() // sound 1025 + success haptic
playClarificationNeeded() // sound 1315 + medium haptic
playClarificationReceived() // sound 1104 + medium haptic
playHelpProvided() // sound 1114 + light haptic
playError() // sound 1053 + error haptic
playErrorRecovery() // sound 1007 + warning haptic
8.1.4 TTS Delegation (Verified)
useAudioSessionManager: Bool = true // DEFAULT: delegate to AudioSessionManager
speakAndWait(_ text:, thenListen:):
if useAudioSessionManager:
→ AudioSessionManager.shared.playTTS(text:) // Proper isolation
else:
→ Local AVSpeechSynthesizer (legacy, no isolation)
### 8.2 Audio Session Manager TTS (Verified)
See Layer 1 §3.1 — handles full isolation lifecycle.
---
9. Layer 7: Interaction Surface
9.1 View Hierarchy (Verified)
KioskVoiceOrderingView (root) ← @StateObject VoiceOrderingOrchestrator.shared
├── Background: LinearGradient(vinyl._900 → vinyl._800)
│
├── KioskVoiceHeader
│ ├── Cart badge (cartItemCount)
│ ├── Settings button → showingSettings sheet
│ └── Close button → dismiss()
│
├── Main Content (@ViewBuilder, state-dependent)
│ ├── [idle && !isPushToTalkRecording]
│ │ └── KioskVoiceWelcomeView
│ │ ├── Push-to-talk button
│ │ ├── Wake word status indicator
│ │ └── "Or browse the menu" → showTouchOrdering
│ │
│ ├── [isPushToTalkRecording]
│ │ └── KioskVoiceWelcomeView (recording mode)
│ │
│ └── [active session]
│ └── KioskVoiceActiveOrderingView
│ ├── VoiceWaveformView (audio visualization)
│ ├── TranscriptDisplayView (streaming text + countdown)
│ ├── Live preview cards
│ └── ConfirmationOverlayView (when confirming)
│
├── KioskVoiceActionBar
│ ├── Voice mode toggle (continuous/push-to-talk/wake-word)
│ ├── Force process button
│ └── Checkout shortcut
│
├── Checkout prompt overlay (after 5s inactivity with items in cart)
│ ├── "Ready to checkout?"
│ ├── Checkout button → showingPayment
│ └── Continue ordering → dismiss overlay
│
└── Sheets
├── .sheet(showingCart) → KioskVoiceCartSheet
├── .sheet(showingPayment) → KioskPaymentView
├── .sheet(showingSettings) → KioskVoiceSettingsSheet
└── .fullScreenCover(showTouchOrdering) → KioskTouchOrderingView
9.2 Input Modes (Verified)
public enum VoiceInputMode: String, CaseIterable, Sendable {
case continuous // "Continuous" — auto-detect end of speech
case pushToTalk // "Push to Talk" — hold button, release to process
case wakeWord // "Wake Word" — "Hey Brews" activation
}
9.2.1 Push-to-Talk Lifecycle (Verified)
startPushToTalkRecording():
→ Request auth
→ Reset transcript/detection state
→ phase = .listening
→ startListening()
→ Play listening sound + selection haptic
stopPushToTalkAndProcess():
→ isPushToTalkRecording = false
→ Selection haptic
→ Get final displayTranscript
→ stopListening()
→ If empty: speak error prompt
→ If content: processUtterance(transcript)
→ If confirming: resume listening for response
cancelPushToTalkRecording():
→ stopListening()
→ Reset state
→ Warning haptic
9.3 Speech Phase Mapping (Verified)
var speechPhase: SpeechPhase {
if isSpeaking → .confirming
if isProcessing → .processing
if isSpeechDetected → .speaking
if phase == .confirming → .confirming
if isListening:
if processingCountdown != nil → .paused
else → .waiting
return .idle
}
9.4 Checkout Prompt Timing (Verified)
// Triggers when cart count increases:
onChange(of: voiceService.cartItemCount) { old, new in
if new > old && new > 0 {
checkoutPromptDismissed = false
Task {
try? await Task.sleep(nanoseconds: 5_000_000_000) // 5 seconds
if voiceState == .listening && !cart.isEmpty && !isSpeechDetected {
showCheckoutPrompt = true // with spring animation
}
}
}
}
// Dismissed by speech detection or confirming state
9.5 Help Message Generation (Verified)
func generateHelpMessage() -> String {
// Context-aware:
if cartItems.isEmpty:
"To order, just say what you'd like. For example: 'I'll have a medium latte'..."
else:
"Say 'repeat my order' to hear your current items..."
"Say 'clear order' or 'start over' to remove all items..."
"Say 'checkout' when you're ready to pay..."
// Always:
"You can say sizes like small, medium, or large"
"You can say 'iced' or 'hot' for temperature"
"Say 'yes' or 'no' to confirm items"
}
---
10. Layer 8: Learning & Telemetry
### 10.1 Pattern Learner
File: `BWBCore/Voice/Learning/PatternLearner.swift` (318 LOC)
Class: `PatternLearner` — `@MainActor`, singleton, `ObservableObject`
10.1.1 Learned Alias Storage
struct LearnedAliasStore {
var menuItems: [String: LearnedAlias]
var sizes: [String: LearnedAlias]
var temperatures: [String: LearnedAlias]
var milks: [String: LearnedAlias]
var syrups: [String: LearnedAlias]
}
struct LearnedAlias {
let spokenPhrase: String // What user said
let mappedValue: String // What it maps to
let slotType: String // Which slot category
}
10.1.2 Persistence
Storage: [home-path]
Load: on init
Save: after applying approved suggestions
10.1.3 Learning Loop
FeedbackCollector tracks corrections → generates suggestions
PatternLearner.applyApprovedSuggestions() → loads approved
→ Stores as LearnedAlias
→ Marks as applied in FeedbackCollector
→ Saves to disk
### 10.2 Feedback Collector
File: `BWBCore/Voice/Learning/FeedbackCollector.swift` (512 LOC)
Class: `FeedbackCollector` — `@MainActor`, singleton, `ObservableObject`
10.2.1 Feedback Tracking
Implicit signals:
- Confirmation accepted → parsing was correct
- Confirmation rejected → parsing error
- Clarification resolved → slot was ambiguous
- Session abandoned → overall failure
Stored corrections:
- Original transcript
- Original parse
- Corrected value
- Slot type
→ Generates suggested aliases for PatternLearner
### 10.3 Learning Types
File: `BWBCore/Voice/Learning/LearningTypes.swift` (179 LOC)
10.3.1 Type Definitions
struct FeedbackEvent { ... } // Single feedback instance
struct AliasSuggestion { ... } // Suggested new alias
struct LearningStats { ... } // Accuracy metrics
---
11. Cross-Cutting: Error Taxonomy & Recovery
11.1 Error Categories by Layer
Layer 1 Errors
MIC_UNAVAILABLE → Show permission alert (openSettingsURLString)
AUDIO_SESSION_INTERRUPTED → Pause gracefully, resume on restoration
AUDIO_SESSION_CONFIG_FAIL → Log error, continue with previous config
WAKE_WORD_AUTH_DENIED → Disable wake word, log warning
Layer 2 Errors
NO_SPEECH_DETECTED → 5s silence → promptForOrder()
RECOGNITION_ERROR → Log, reset recognizer, continue
RECOGNITION_UNAVAILABLE → Switch to NLU-only parsing
ASSET_DOWNLOAD_FAIL (iOS 26) → Fall back to LegacyVoiceService
Layer 3 Errors
LOW_CONFIDENCE_PARSE (< 0.6) → Clarification flow
AI_API_TIMEOUT (> 5s) → Use NLU result only
AI_API_ERROR → Fall back to NLU-only
UNKNOWN_ITEM → ContextAwareRecoveryService.generateRecovery(.unknownItem)
AMBIGUOUS_ITEM → generateRecovery(.ambiguousItem)
INVALID_COMBINATION → generateRecovery(.invalidCombination)
EMPTY_PARSE → Escalating recovery (failures 1→2→3+)
Layer 4 Errors
CLARIFICATION_TIMEOUT → Re-ask or proceed with default
CONFIRMATION_TIMEOUT → Auto-confirm (if enabled) or re-ask
REPEATED_FAILURES (3+) → Offer touch fallback
Layer 5 Errors
CONSTRAINT_VIOLATION_SOFT → Auto-correct + inform
CONSTRAINT_VIOLATION_HARD → Block + clarify
QUANTITY_OVERFLOW (>10) → Cap at 10, warn
Layer 6 Errors
TTS_FAILURE → Skip speech, show text only
AUDIO_ROUTING_ERROR → Reset audio session
VOICE_NOT_AVAILABLE → Use system default voice
---
12. Cross-Cutting: Performance & Latency Budget
12.1 Latency Breakdown
| Component | Current | Target |
|---|---|---|
| Utterance detection | ≤1.5s | ≤1.0s |
| NLU parsing | <100ms | <50ms |
| AI parsing | 1.0-3.0s | <1.5s |
| Result merging | <10ms | <10ms |
| Confirmation generation | <10ms | <10ms |
| TTS isolation overhead | ~500ms | 0ms (with AEC) |
| TTS speech | variable | variable |
| Total (end-of-speech → TTS) | 2.0-5.0s | <2.0s |
12.2 Accuracy Targets
| Metric | Current | Target |
|---|---|---|
| Intent classification | ~90% | >95% |
| Entity extraction | ~85% | >92% |
| Overall order accuracy | ~85% | >92% |
| First-try success rate | ~80% | >88% |
| Session completion rate | Unknown | >90% |
---
13. Cross-Cutting: Privacy, Security, Offline
13.1 Privacy
✓ Audio processed in real-time, not stored after session
✓ Transcripts cleared on endSession()
✓ No biometric voiceprints
✓ No PII extraction
✗ No opt-in analytics yet
13.2 Security
✓ API keys in xcconfig (Phase 1 completed 2026-01-19)
✓ fatalError() replaced with throws/Result (Phase 1 completed)
✓ HTTPS for all API calls
✓ No hardcoded secrets
13.3 Offline Capability
WORKS OFFLINE:
✓ NLU parsing (IntentClassifier + EntityExtractor + SlotClassifier)
✓ Wake word detection
✓ Audio feedback + haptics
✓ Cart management (all local state)
✓ All UI (SwiftUI, no network)
✓ Speech recognition (iOS 26+ on-device mode)
✓ Constraint validation
✓ Learned alias lookup
REQUIRES NETWORK:
✗ AI parsing (OpenAI/Gemini API calls)
✗ Speech recognition (LegacyVoiceService server mode)
✗ Menu embedding updates
✗ Order submission to backend
DEGRADED MODE:
→ ParsingStrategy falls to .nluOnly
→ Lower accuracy but functional
→ Queue orders locally (Phase 7, not yet implemented)
---
14. State Machine Formal Specification
14.1 Orchestrator Phase State Machine (Verified)
States: {idle, listening, processing, confirming, clarifying, complete, error}
Transitions:
idle → listening [startSession(), resume successful]
listening → processing [utterance complete, transcript non-empty]
listening → idle [endSession()]
listening → error [startListening() failed]
processing → confirming [items parsed, pending confirmation]
processing → listening [meta-command handled, or no items + recovery]
processing → error [unhandled exception]
confirming → listening [confirmed → items added, or rejected → items cleared]
confirming → processing [additional order detected in confirmation response]
clarifying → listening [clarification resolved]
clarifying → confirming [clarification leads to confirmed order]
* → idle [endSession() from any state]
* → error [unrecoverable failure]
14.2 Audio Session State Machine (Verified)
States: {idle, listening, speaking, transitioning}
Transitions:
idle → listening [configureAudioSession(.listening)]
listening → transitioning [pauseListening()]
transitioning → speaking [configureAudioSession(.speaking)]
speaking → transitioning [TTS complete, resumeListening()]
transitioning → listening [configureAudioSession(.listening), voiceService.resume()]
14.3 Wake Word State Machine (Verified)
States: {disabled, enabled, listening, paused}
Transitions:
disabled → enabled [enable()]
enabled → listening [startListening()]
listening → paused [pause() — during active session]
paused → listening [resume() — after session ends]
* → disabled [disable()]
14.4 Confirmation State Machine (Verified)
States: {idle, awaiting, auto_confirming, resolved}
Transitions:
idle → awaiting [startConfirmation(items, transcript, confidence)]
awaiting → resolved [processResponse() → .confirmed/.rejected]
awaiting → auto_confirming [autoConfirm conditions met, countdown starts]
auto_confirming → resolved [countdown reaches 0]
auto_confirming → awaiting [speech detected → cancelAutoConfirm]
resolved → idle [reset()]
---
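The confirmation transitions in 14.4 can be encoded as a pure function, which makes illegal transitions explicit. This is a sketch of the specification only: `ConfirmationState`, `ConfirmationEvent`, and `nextState` are hypothetical names, and the event cases are paraphrased from the bracketed triggers rather than taken from `ConfirmationCoordinator`.

```swift
enum ConfirmationState {
    case idle, awaiting, autoConfirming, resolved
}

// Events paraphrased from the transition triggers in 14.4.
enum ConfirmationEvent {
    case startConfirmation          // startConfirmation(items, transcript, confidence)
    case responseClassified         // processResponse() → .confirmed/.rejected
    case autoConfirmConditionsMet   // countdown starts
    case countdownFinished          // countdown reaches 0
    case speechDetected             // cancelAutoConfirm
    case reset                      // reset()
}

// Returns the next state, or nil when the specification lists no such transition.
func nextState(from state: ConfirmationState,
               on event: ConfirmationEvent) -> ConfirmationState? {
    switch (state, event) {
    case (.idle, .startConfirmation):            return .awaiting
    case (.awaiting, .responseClassified):       return .resolved
    case (.awaiting, .autoConfirmConditionsMet): return .autoConfirming
    case (.autoConfirming, .countdownFinished):  return .resolved
    case (.autoConfirming, .speechDetected):     return .awaiting
    case (.resolved, .reset):                    return .idle
    default:                                     return nil
    }
}
```

Returning `nil` for unlisted pairs gives tests a direct way to assert that, for example, `reset` is not accepted while the machine is still `idle`.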
15. End-to-End Flow Traces
15.1 Happy Path: Single Item Order
t=0.0s Customer: "Hey Brews"
→ WakeWordDetector.onWakeWordDetected("Hey Brews")
→ SystemSound 1057 (Tink) + Haptics.success()
→ startSession() → UUID created
→ wakeWordDetector.pause()
→ transcriptPipeline.reset(), cartCoordinator.clearAll()
→ startListening() → SpeechAnalyzerService or LegacyVoiceService
→ phase = .listening
t=0.5s System TTS: "Hi! What can I get you?" (if greeting enabled)
→ AudioSessionManager.pauseListening()
→ voiceService.suspend()
→ sleep(100ms)
→ configureAudioSession(.speaking)
→ AudioSessionManager.playTTS("Hi! What can I get you?")
→ AVSpeechSynthesizer.speak(utterance)
→ [TTS plays ~2s]
→ AudioSessionManager.resumeListening()
→ sleep(300ms)
→ configureAudioSession(.listening)
→ sleep(100ms)
→ voiceService.resume()
t=3.0s Customer: "Can I get a large iced oat milk latte with vanilla"
→ voiceActivityResults: isSpeechDetected = true
→ transcriptPipeline.recordSpeechActivity()
→ utteranceDetector.recordSpeechActivity()
→ cancelSilencePromptTimer()
→ transcriptionResults (streaming):
t+0.2s: "can I"
t+0.5s: "can I get a"
t+0.8s: "can I get a large"
t+1.0s: "can I get a large iced"
t+1.3s: "can I get a large iced oat milk latte"
t+1.5s: "can I get a large iced oat milk latte with vanilla"
Each partial → handleTranscriptionResult():
→ transcriptPipeline.processIncoming()
→ livePreviewGenerator.generatePreview()
→ At "large iced": preview = OrderPreview(itemName: "?", size: "large", temp: "iced")
→ At "oat milk latte": preview = OrderPreview(itemName: "Latte", size: "large", ...)
→ At "with vanilla": preview = OrderPreview(itemName: "Latte", syrups: ["vanilla"])
→ utteranceDetector.analyze()
→ NOT complete (speech still active)
t=4.5s Customer stops speaking
→ voiceActivityResults: isSpeechDetected = false
→ utteranceDetector begins:
→ silence timer starts
→ stability checks begin (every 0.3s)
t=5.5s utteranceDetector.analyze():
→ Silence: 1.0s ≥ silenceTimeout (1.0s) ✓
→ Stability: stabilityCount >= 1 ✓
→ reason: .silenceTimeout, confidence: 1.0
→ isComplete = true
t=5.5s processUtterance("can I get a large iced oat milk latte with vanilla")
→ addToSessionHistory()
→ phase = .processing, isProcessing = true
parsingPipeline.parseWithAutoContext(transcript, cart: [], history: [...])
→ buildContext(cartItems: [], sessionHistory: [...], includeConstraints: true)
HYBRID STRATEGY:
AI Parser: send to OpenAI/Gemini with context
NLU Parser:
IntentClassifier: "can i get" matches orderPatterns → (.order, 0.9)
EntityExtractor:
size: "large" → .large (0.85)
temperature: "iced" → .iced (0.85)
milk: "oat milk" → .oat (0.85)
syrups: ["vanilla"] (0.85)
quantity: implicit → 1
MenuAliasMatcher: "latte" → "Latte" (exact, 1.0)
AI returns: { intent: "order", items: [{name: "Latte", size: "large", ...}], confidence: 0.95 }
NLU returns: { intent: .order, items: [{name: "Latte", size: .large, ...}], confidence: 0.88 }
OrderResultMerger.merge():
→ Both found same item → consensusBoost (+0.1)
→ AI confidence (0.95) > aiConfidenceThreshold (0.85) → prefer AI
→ Final: confidence = 0.95, source = .hybrid
t=6.0s handleParseResult(result):
→ result.items = [VoiceParsedOrder(name: "Latte", size: .large, temp: .iced, milk: .oat, syrups: ["vanilla"], quantity: 1, confidence: 0.95)]
→ Validate quantities (1 ≤ 10 ✓)
→ cartCoordinator.addToPending(items)
→ confirmationCoordinator.startConfirmation(items, "can I get...", 0.95)
→ generates: "One large iced oat latte with vanilla — anything else?"
→ phase = .confirming
speakWithIsolation("One large iced oat latte with vanilla — anything else?", thenListen: true)
→ pauseListening() → sleep(100ms) → configureAudioSession(.speaking)
→ playTTS(text) → ~3s speech
→ resumeListening() → sleep(300ms) → configureAudioSession(.listening) → voiceService.resume()
t=9.5s Customer: "That's it"
→ handleTranscriptionResult → utterance detected
→ processUtterance("that's it")
→ confirmationCoordinator.isAwaitingConfirmation = true
→ handleConfirmationResponse("that's it")
→ "that's" not in confirmPatterns but in checkoutPatterns → handled as checkout
OR if interpreted as confirm:
→ confirmationCoordinator.processResponse("that's it")
→ checks patterns → "that's" matches "that's all" (checkout)
→ returns .additionalOrder or handled in meta
t=10.0s handleMetaCommand(.checkout):
→ cartItems not empty ✓
→ shouldProceedToCheckout = true
→ speakWithIsolation("Taking you to checkout.", thenListen: false)
t=10.5s View: onChange(shouldProceedToCheckout):
→ showingPayment = true
→ KioskPaymentView(orderItems: voiceService.finalizeOrder())
→ exportCartForCheckout() → [OrderItem]
→ clearCart()
→ endSession()
15.2 Error Recovery Flow: Unknown Item
t=0s Customer: "I'd like a mackiado"
→ Parsing:
NLU: MenuAliasMatcher("mackiado") → fuzzy → "macchiato" (Levenshtein=3, score 0.72)
AI: "macchiato" (confidence: 0.78)
Merge: {name: "macchiato", confidence: 0.75}
→ confidence 0.75 ≥ 0.6 → proceed (but below 0.85 → won't auto-confirm)
→ confirmationCoordinator.startConfirmation(items, ..., 0.75)
→ "Did you mean a macchiato?"
t=2s Customer: "Yes"
→ handleConfirmationResponse("yes")
→ .confirmed
→ cartCoordinator.confirmPending()
→ "Added to your order. Anything else?"

### 15.3 Error Recovery Flow: Total Parse Failure
t=0s Customer: [unintelligible mumble]
→ Parsing: NLU → no items, AI → no items
→ consecutiveFailures = 1
ContextAwareRecoveryService.generateRecovery(
type: .unknownItem,
context: RecoveryContext(failureCount: 1)
)
→ "I didn't catch that. Could you repeat your order?"
t=3s Customer: [still unclear]
→ consecutiveFailures = 2
→ "I'm having trouble understanding. Try saying something like 'a medium iced latte'"
t=6s Customer: [still unclear]
→ consecutiveFailures = 3
→ "Let me show you the menu. You can also tap items to order."
→ [Touch ordering suggested]

---
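The escalation in flow 15.3 (repeat, then example phrasing, then touch fallback) can be sketched as a small counter-driven type. This is a minimal illustration, not the actual `ContextAwareRecoveryService`: the names `RecoveryAction` and `RecoveryEscalator` are hypothetical, and resetting the counter after a successful parse is an assumption the trace does not state. The prompt strings are copied from the trace.

```swift
// Hypothetical sketch of the three-step escalation shown in flow 15.3.
enum RecoveryAction: Equatable {
    case reprompt(String)      // ask the customer to try again by voice
    case suggestTouch(String)  // hand over to touch ordering
}

struct RecoveryEscalator {
    private(set) var consecutiveFailures = 0

    // Call once per failed parse; returns what the kiosk should say next.
    mutating func recordFailure() -> RecoveryAction {
        consecutiveFailures += 1
        switch consecutiveFailures {
        case 1:
            return .reprompt("I didn't catch that. Could you repeat your order?")
        case 2:
            return .reprompt("I'm having trouble understanding. Try saying something like 'a medium iced latte'")
        default:
            // Third and later failures fall back to touch ordering.
            return .suggestTouch("Let me show you the menu. You can also tap items to order.")
        }
    }

    // Assumption: a successful parse resets the counter.
    mutating func recordSuccess() {
        consecutiveFailures = 0
    }
}
```

Keeping the counter in one value type makes the escalation easy to unit-test without audio or a speech recognizer.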
16. Gap Analysis: Current vs Next-Gen
### 16.1 Verified Gaps
| ID | Component | Current State | Issue | Severity |
|---|---|---|---|---|
| G1 | Audio isolation | pause/resume cycle | 500ms+ overhead per TTS, no simultaneous talk/listen | Critical |
| G2 | Dialogue context | VoiceDialogueManager (572 LOC) exists but Orchestrator manages its own flow | Duplication, VDM not used by Kiosk | Medium |
| G3 | Hybrid parsing | Sequential not truly parallel | `parseHybrid` awaits AI then NLU, not concurrent | Medium |
| G4 | EntityExtractor aliases | Missing "blended", "frappe", "frappuccino", "smoothie" | Temperature `.blended` not extractable | Low |
| G5 | Size alias inconsistency | EntityExtractor maps "extra large"→.large, enum has .extraLarge | xl orders misclassified | Low |
| G6 | Analytics pipeline | No telemetry | Zero visibility into real-world accuracy | High |
| G7 | Multi-language | English only | No French, N'Ko | Medium |
| G8 | Branded TTS | AVSpeechSynthesizer | Generic voice, no personality | Medium |
| G9 | Audio feedback fix | Phase 2 in roadmap — "not started" | But AudioSessionManager IS implemented | Stale doc |
| G10 | Slot inference | No cross-item inference | "And a cappuccino" doesn't inherit previous item's size/milk | Medium |
| G11 | User profiles | No cross-session memory | "The usual" not supported | Low |
| G12 | Streaming AI parse | AI parses after utterance complete | Could parse incrementally as user speaks | High |
| G13 | Sentiment detection | None | Can't detect frustration or satisfaction | Medium |
| G14 | A/B testing | None | Can't compare NLU improvements | Medium |
| G15 | Edge NLU model | Pattern matching only | No ML classifier on-device | High |
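Gap G3 says `parseHybrid` awaits the AI parser and then the NLU parser. One way to run both at once is a task group, sketched below under stated assumptions: `ParseSource`, `ParseOutcome`, `parseWithAI`, `parseWithNLU`, and `parseHybridConcurrently` are hypothetical stand-ins, and the sleeps only simulate network versus on-device latency. The real parser types are not shown in this document.

```swift
import Foundation

// Hypothetical types standing in for the real parser results.
enum ParseSource: Hashable, Sendable { case ai, nlu }

struct ParseOutcome: Sendable {
    let itemNames: [String]
    let confidence: Double
}

// Stand-in for the AI parser; the delay simulates a network round trip.
func parseWithAI(_ transcript: String) async -> ParseOutcome {
    try? await Task.sleep(nanoseconds: 200_000_000)
    return ParseOutcome(itemNames: ["Latte"], confidence: 0.95)
}

// Stand-in for the on-device NLU parser; the delay simulates local work.
func parseWithNLU(_ transcript: String) async -> ParseOutcome {
    try? await Task.sleep(nanoseconds: 50_000_000)
    return ParseOutcome(itemNames: ["Latte"], confidence: 0.88)
}

// Starts both parsers as child tasks and collects results as they finish.
func parseHybridConcurrently(_ transcript: String) async -> [ParseSource: ParseOutcome] {
    await withTaskGroup(of: (ParseSource, ParseOutcome).self) { group in
        group.addTask { (ParseSource.ai, await parseWithAI(transcript)) }
        group.addTask { (ParseSource.nlu, await parseWithNLU(transcript)) }

        var outcomes: [ParseSource: ParseOutcome] = [:]
        for await (source, outcome) in group {
            outcomes[source] = outcome
        }
        return outcomes
    }
}
```

With both child tasks in flight, total wait is roughly the slower parser rather than the sum of the two. The dictionary keyed by source can then be handed to a merge step regardless of which parser finished first.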
### 16.2 Stale Roadmap Items (Verified)
- Phase 2 (Audio Feedback Fix): AudioSessionManager IS implemented with `speakWithIsolation()`. Phase may be partially complete.
- Phase 5, Plan 05-01 (Split KioskVoiceOrderingView): Already done — extracted to 6 component files in `Components/`
- Phase 8, Plan 08-02 (Audit @unchecked Sendable): `ClarificationPolicy` still marked `@unchecked Sendable`
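
Gap G5 in table 16.1 (the "extra large" phrase resolving to `.large`) is the kind of bug that longest-phrase-first matching avoids. The sketch below is illustrative only: `DrinkSize`, `sizeAliases`, and `extractSize` are hypothetical names, the alias list is assumed rather than taken from `EntityExtractor`, and punctuation handling is omitted.

```swift
import Foundation

// Hypothetical size enum; the document confirms only that .large and .extraLarge exist.
enum DrinkSize: Equatable { case small, medium, large, extraLarge }

// Assumed alias table. Multi-word phrases must be able to win over their substrings.
let sizeAliases: [(phrase: String, size: DrinkSize)] = [
    ("extra large", .extraLarge),
    ("xl", .extraLarge),
    ("large", .large),
    ("medium", .medium),
    ("small", .small),
]

func extractSize(from transcript: String) -> DrinkSize? {
    // Normalize case and whitespace, then pad so aliases match on whole words only.
    let words = transcript.lowercased().split(separator: " ").map(String.init)
    let normalized = " " + words.joined(separator: " ") + " "

    // Try longer phrases first so "extra large" is not captured as "large".
    for alias in sizeAliases.sorted(by: { $0.phrase.count > $1.phrase.count }) {
        if normalized.contains(" " + alias.phrase + " ") {
            return alias.size
        }
    }
    return nil
}
```

The same ordering rule would apply if the missing G4 aliases ("blended", "frappe", "frappuccino", "smoothie") were added to a temperature table.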
---
17. Evolution Roadmap
### 17.1 Phase A: Audio Excellence (Critical Path)
#### A.1 Acoustic Echo Cancellation
Problem: 500ms+ overhead per TTS turn
Solution: AVAudioEngine with custom AEC tap
→ Run TTS + mic simultaneously
→ Subtract TTS waveform from mic input
→ Remove pause/resume cycle entirely
Impact: -500ms latency per turn, more natural conversation
Complexity: High (audio DSP, timing synchronization)

#### A.2 Noise-Robust VAD
Problem: Basic energy threshold in noisy coffee shop
Solution: Spectral analysis VAD + speaker diarization
→ CoreML on-device VAD model
→ Trained on coffee shop noise profiles (grinder, steamer, music)
→ Focus on nearest speaker
Impact: Better utterance detection in real environments
Complexity: Medium

#### A.3 Branded TTS Voice
Problem: Generic AVSpeechSynthesizer
Solution: ElevenLabs custom voice
→ Train on barista-style speech
→ Warm, friendly, slightly upbeat persona
→ Fallback chain: ElevenLabs → Apple Neural → AVSpeech
Impact: Distinctive brand identity, more engaging
Complexity: Low (API integration)

### 17.2 Phase B: Intelligence Upgrade
#### B.1 Fix Hybrid Parsing to True Parallel
Problem: parseHybrid() is sequential
Solution: Use TaskGroup for true concurrent execution
Impact: Faster response when both AI and NLU available
Complexity: Low (restructure async calls)

#### B.2 Streaming AI Parse
Problem: AI parses only after utterance complete
Solution: Send partial transcripts to AI with streaming response
→ Token-level intent detection
→ Progressive slot filling
→ Cancel/correct mid-stream
Impact: Sub-second apparent response time
Complexity: High

#### B.3 Multi-Turn Context Engine
Problem: No anaphora resolution, no implicit inference
Solution: Dialogue state tracker with frame semantics
→ "Make it bigger" → upsize current
→ "Same thing" → repeat last order
→ "And a cappuccino" → infer preferences from previous item
Impact: Natural multi-turn conversations
Complexity: High

#### B.4 Edge NLU Model
Problem: Pattern matching has ceiling
Solution: Fine-tuned DistilBERT → CoreML
→ Train on coffee ordering data
→ Sub-100ms inference
→ Intent + entity joint model
Impact: Major accuracy improvement offline
Complexity: High (data collection + training)

### 17.3 Phase C: Personalization
#### C.1 User Profiles
"The usual" → SwiftData profile → one-tap reorder
Dietary prefs → auto-filter incompatible items
Session learning persists → cross-visit intelligence

#### C.2 Smart Upselling
Context-aware: "Want a pastry?" after drink order
Time-aware: Breakfast combos before 11am
Weather-aware: Iced drinks on hot days
Never pushy, always value-adding

#### C.3 Multi-Language
French: Parisian coffee vocabulary
N'Ko: Manding language family
Auto-detect from first utterance
Code-switching: "Un latte grande please"

### 17.4 Phase D: Observability
#### D.1 Pipeline Telemetry
Every utterance logged:
→ Transcript, intent, entities, confidence, latency
→ Parse source (AI/NLU/hybrid)
→ Confirmation response
→ Recovery events
Dashboard: accuracy by item, slot, intent over time

#### D.2 Active Learning
Flag low-confidence utterances → human review queue
Corrections → PatternLearner integration
A/B test NLU improvements

---
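The per-utterance fields listed under D.1, and the low-confidence flag from D.2, could be captured in a single `Codable` record. This is a sketch under assumptions: `UtteranceEvent`, its field names, and `needsHumanReview` are hypothetical, and the 0.85 default simply reuses the auto-confirm threshold from the flow traces as a placeholder review cutoff.

```swift
import Foundation

// Parse source as named in D.1 (AI / NLU / hybrid).
enum ParseSource: String, Codable { case ai, nlu, hybrid }

// Hypothetical telemetry record covering the fields listed in D.1.
struct UtteranceEvent: Codable {
    let transcript: String
    let intent: String
    let entities: [String: String]
    let confidence: Double
    let latencyMs: Int
    let parseSource: ParseSource
    let confirmationResponse: String?  // nil when no confirmation was requested
    let recoveryEvents: [String]
}

// D.2: flag low-confidence utterances for a human review queue.
// The threshold is an assumption, not a value confirmed by the codebase.
func needsHumanReview(_ event: UtteranceEvent, threshold: Double = 0.85) -> Bool {
    event.confidence < threshold
}

// Serializes an event with stable key order so log lines diff cleanly.
func encodeForLog(_ event: UtteranceEvent) throws -> Data {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    return try encoder.encode(event)
}
```

Because transcripts may contain personal information, any real pipeline would need to respect the privacy constraints in section 13 before persisting these records.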
18. Codebase Map (Verified — All Files with Accurate LOC)
### 18.1 BWBCore/Sources/BWBCore/Voice/ (19,703 LOC total)
Voice/
├── AITranscriptParser.swift 1,343 LOC — AI parsing + context
├── AudioSessionManager.swift 373 LOC — Audio routing + TTS
├── ConstraintEngine.swift 776 LOC — Order validation
├── EnhancedVoiceNLUEngine.swift 745 LOC — Advanced NLU
├── LegacyVoiceService.swift 373 LOC — SFSpeechRecognizer
├── SpeechAnalyzerService.swift 575 LOC — iOS 26+ speech
├── VoiceDialogueManager.swift 572 LOC — Dialogue state (unused by Kiosk)
├── VoiceNLUEngine.swift 1,321 LOC — Core NLU
├── VoiceServiceProtocol.swift 300 LOC — Service abstraction
├── VoiceTypes.swift 447 LOC — Shared types
├── WakeWordDetector.swift 540 LOC — Wake word
│
├── Constraints/
│ ├── ConstraintParser.swift 253 LOC
│ ├── ConstraintTypes.swift 623 LOC
│ └── YAMLConstraintEngine.swift 746 LOC
│
├── Coordination/
│ ├── CartCoordinator.swift 452 LOC
│ ├── ConfirmationCoordinator.swift 626 LOC
│ ├── ContextAwareRecoveryService.swift 476 LOC
│ ├── FeedbackCoordinator.swift 358 LOC
│ └── SessionManager.swift 270 LOC
│
├── Detection/
│ ├── LiveOrderPreviewGenerator.swift 310 LOC
│ ├── MenuAliasMatcher.swift 256 LOC
│ ├── ModifierDetector.swift 261 LOC
│ ├── QuantityExtractor.swift 123 LOC
│ ├── TranscriptStabilityTracker.swift 125 LOC
│ └── UtteranceCompletionDetector.swift 344 LOC
│
├── Dialogue/
│ ├── ClarificationPolicy.swift 336 LOC
│ └── ConfirmationGenerator.swift 334 LOC
│
├── Embeddings/
│ ├── MenuDocument.swift 365 LOC
│ ├── MenuEmbeddingService.swift 387 LOC
│ ├── TextEmbedder.swift 357 LOC
│ └── VectorIndex.swift 297 LOC
│
├── Learning/
│ ├── FeedbackCollector.swift 512 LOC
│ ├── LearningTypes.swift 179 LOC
│ └── PatternLearner.swift 318 LOC
│
├── Parsing/
│ ├── ConfidenceScorer.swift 230 LOC
│ ├── EntityExtractor.swift 314 LOC
│ ├── IntentClassifier.swift 186 LOC
│ ├── TranscriptPreprocessor.swift 226 LOC
│ └── Prompts/
│ ├── GeminiPrompts.swift 162 LOC
│ └── OpenAIPrompts.swift 209 LOC
│
├── Pipeline/
│ ├── AIOrderParser.swift 130 LOC
│ ├── NLUOrderParser.swift 90 LOC
│ ├── OrderParsingPipeline.swift 531 LOC
│ ├── OrderResultMerger.swift 390 LOC
│ ├── TranscriptNormalizer.swift 228 LOC
│ ├── TranscriptPipeline.swift 188 LOC
│ └── TranscriptState.swift 136 LOC
│
└── Slots/
  ├── SlotClassifier.swift 362 LOC
  ├── SlotDefinitions.swift 315 LOC
  └── SlotPrediction.swift 209 LOC

### 18.2 BWB_Kiosk/BWB_Kiosk/ (Kiosk-Specific)
BWB_Kiosk/
├── BWB_KioskApp.swift
├── Services/
│ ├── VoiceOrderingOrchestrator.swift ~750 LOC — Central coordinator
│ └── VoiceOrderingTypes.swift ~280 LOC — Kiosk types + config
└── Views/
  ├── Setup/
  │ ├── KioskSetupView.swift
  │ └── KioskSettingsView.swift
  └── Ordering/
    ├── KioskMainView.swift
    ├── KioskVoiceOrderingView.swift — Voice root (uses components)
    ├── KioskTouchOrderingView.swift — Touch fallback
    └── Components/
      ├── VoiceWaveformView.swift
      ├── TranscriptDisplayView.swift
      ├── CartSummaryView.swift
      ├── ConfirmationOverlayView.swift
      ├── KioskVoiceSubviews.swift — Header, Welcome, Active, ActionBar
      └── KioskPaymentViews.swift — Payment, CartSheet, SettingsSheet

### 18.3 Tests
BWBCore/Tests/BWBCoreTests/
├── VoiceNLUEngineTests.swift
├── VoicePipelineTests.swift
├── Integration/KioskVoiceTests.swift
└── Mocks/MockVoiceService.swift
BWB_Kiosk/BWB_KioskTests/
└── BWB_KioskTests.swift
BWB_Kiosk/BWB_KioskUITests/
├── BWB_KioskUITests.swift
└── BWB_KioskUITestsLaunchTests.swift

---
This document was verified against the actual codebase on 2026-02-10.
Every LOC count, enum variant, method signature, and pattern list was confirmed by reading source files.
Update this document when code changes. It is the single source of truth for the voice ordering architecture.
Source Anchor
BWB/BWB_Kiosk/docs/VOICE-ORDERING-ARCHITECTURE.md