# Stage 0: RESEARCH
## LUME Commerce -- Experiential Commerce Infrastructure
---
## 1. WHAT EXISTS TODAY
### 1a. BWB Codebase (Desktop/BWB/)
Scale: 237 Swift files total. BWBCore: 155 files, 43K+ lines, 434 tests passing. BWB_Kiosk: 25 files, 8,456 lines. BWB_POS: 20+ files.
Voice Pipeline (COMPLETE, production-grade):
- 10-component modular voice ordering pipeline decomposed from a 2,464-line monolith
- `VoiceOrderingOrchestrator` -- thin coordinator managing state: idle -> listening -> processing -> confirming -> complete
- `TranscriptPipeline` -- transcript normalization, stability tracking, session accumulation
- `UtteranceCompletionDetector` -- silence/completion via stability count, order-ending phrases
- `LiveOrderPreviewGenerator` -- regex-based pattern matching (~50ms latency), menu alias matching
- `OrderParsingPipeline` -- HYBRID MERGE strategy: AI + NLU parsers run in parallel, results merged
- `CartCoordinator` -- pending vs confirmed cart, clarification flow
- `ConfirmationCoordinator` -- auto-confirm at 0.8 confidence after 3-second countdown
- `FeedbackCoordinator` -- TTS (AVSpeechSynthesizer), haptics, audio cues (10 audio types, 6 haptic types)
- `SessionManager` -- 120-second timeout, activity tracking, session lifecycle
- `AudioCaptureController` -- wraps `SpeechAnalyzerService` (iOS 26+) and `LegacyVoiceService` (iOS 17-25)
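
The orchestrator states and the confirmation/session thresholds listed above can be sketched as a small platform-agnostic state machine. This is a minimal sketch in Python (chosen here because the target is embedded Linux), not the BWB Swift implementation: the forward path comes from the list above, while the reset edges back to idle, the `>=` comparisons, and all class and function names are assumptions made for illustration.

```python
from enum import Enum


class VoiceState(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    PROCESSING = "processing"
    CONFIRMING = "confirming"
    COMPLETE = "complete"


# Forward path is taken from the component list; the edges back to IDLE
# (cancel/timeout) and CONFIRMING -> LISTENING (clarification) are assumed.
ALLOWED = {
    VoiceState.IDLE: {VoiceState.LISTENING},
    VoiceState.LISTENING: {VoiceState.PROCESSING, VoiceState.IDLE},
    VoiceState.PROCESSING: {VoiceState.CONFIRMING, VoiceState.IDLE},
    VoiceState.CONFIRMING: {
        VoiceState.COMPLETE,
        VoiceState.LISTENING,
        VoiceState.IDLE,
    },
    VoiceState.COMPLETE: {VoiceState.IDLE},
}

AUTO_CONFIRM_CONFIDENCE = 0.8   # ConfirmationCoordinator threshold
AUTO_CONFIRM_COUNTDOWN_S = 3.0  # ConfirmationCoordinator countdown
SESSION_TIMEOUT_S = 120.0       # SessionManager timeout


class Orchestrator:
    """Thin coordinator that only tracks and validates state changes."""

    def __init__(self):
        self.state = VoiceState.IDLE

    def transition(self, target):
        if target not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {target}")
        self.state = target


def should_auto_confirm(confidence, seconds_in_confirming):
    """True once confidence is high enough and the countdown has elapsed."""
    return (
        confidence >= AUTO_CONFIRM_CONFIDENCE
        and seconds_in_confirming >= AUTO_CONFIRM_COUNTDOWN_S
    )


def session_expired(seconds_since_activity):
    """True when the session has been inactive for the full timeout."""
    return seconds_since_activity >= SESSION_TIMEOUT_S
```

Keeping the transition table as data rather than scattered `if` statements is one way to make the logic portable between Swift and a Linux runtime.
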
AI Routing (COMPLETE):
- 12-component architecture decomposed from 7 monoliths (4,059 lines -> 12 focused files)
- 6-factor routing: CartAvailability (25
- Dynamic + Standard + Fallback routing strategies
- LocationService with BLE beacon ranging, trilateration, movement analysis
- DrinkComplexityAnalyzer: Simple/Medium/Complex/Premium classification
NLU Engine:
- `EnhancedVoiceNLUEngine.swift` (28K lines) -- full NLU for coffee ordering domain
- `VoiceNLUEngine.swift` (33K lines) + vocabulary extension (17K lines)
- `AITranscriptParser.swift` (46K lines) -- AI-powered order parsing
- Fuzzy matching at 70-80
- Menu intelligence: handles "large coffee" -> "16oz house blend" mapping
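
To make the alias mapping and fuzzy matching concrete, here is a rough sketch using Python's standard `difflib`. It is not the BWB NLU engine: the alias table, the `resolve_item` helper, and the 0.75 similarity cutoff are hypothetical values picked for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical alias table; a real menu would be loaded per venue.
MENU_ALIASES = {
    "large coffee": "16oz house blend",
    "small coffee": "12oz house blend",
    "iced latte": "16oz iced latte",
}


def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def resolve_item(phrase, cutoff=0.75):
    """Map a spoken phrase to a menu item, or None if nothing is close enough."""
    normalized = " ".join(phrase.lower().split())
    best_alias, best_score = None, 0.0
    for alias in MENU_ALIASES:
        score = similarity(normalized, alias)
        if score > best_score:
            best_alias, best_score = alias, score
    if best_alias is None or best_score < cutoff:
        return None
    return MENU_ALIASES[best_alias]
```

A slightly misrecognized phrase such as "Large  Coffe" still resolves to the house blend, while an unrelated word falls below the cutoff and returns `None`, which would hand off to a clarification flow.
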
Payment (PARTIAL, 2,937 lines across 6 files):
- `PaymentService.swift` (507 lines) -- Apple Pay + Stripe backend integration
- `SquareMobilePaymentHandler.swift` (555 lines) -- Square Mobile Payments SDK (BLE reader pairing, payment flow, reader state management)
- `CardPaymentHandler.swift` (347 lines)
- `SquareInAppPaymentHandler.swift` (559 lines)
- `SquarePaymentTypes.swift` (453 lines) -- full type system for Square integration
- `PaymentTypes.swift` (516 lines)
- Status: Apple Pay works. Square SDK integrated with `#if canImport(SquareMobilePaymentsSDK)` guards. Stripe PaymentIntent creation via Supabase Edge Functions. Card payment without Apple Pay requires Stripe Payment Sheet (NOT integrated).
Queue Management (BWB_POS):
- `QueueService.swift` -- Supabase Realtime subscriptions, pendingOrders/inProgressOrders/readyOrders, 30s polling fallback, cart assignment filtering, multi-POS device awareness
- `KitchenDisplayView.swift` -- Dark-themed KDS for baristas, real-time order grid, active/done/avg-wait stats, connection status indicator
- `QueueManagementView.swift` -- Manager view for order queue
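
The queue behaviour described above (three status buckets, cart assignment filtering, and a 30s polling fallback) might look like the following sketch. The status strings, the `Order` fields, and the rule that polling only kicks in while the realtime channel is down are assumptions, not details confirmed by `QueueService.swift`.

```python
from dataclasses import dataclass

POLL_INTERVAL_S = 30  # polling fallback interval from QueueService


@dataclass
class Order:
    order_id: str
    status: str   # assumed values: "pending", "in_progress", "ready"
    cart_id: str


def bucket_orders(orders, cart_id):
    """Group the orders assigned to one cart by status."""
    buckets = {"pending": [], "in_progress": [], "ready": []}
    for order in orders:
        if order.cart_id != cart_id:
            continue  # cart assignment filtering
        if order.status in buckets:
            buckets[order.status].append(order)
    return buckets


def needs_poll(realtime_connected, seconds_since_last_fetch):
    """Assumed rule: poll only while realtime is down and the interval elapsed."""
    return (not realtime_connected) and seconds_since_last_fetch >= POLL_INTERVAL_S
```
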
Kiosk (BWB_Kiosk, 8,456 lines):
- `KioskVoiceOrderingView.swift` (221 lines) -- voice ordering UI
- `KioskTouchOrderingView.swift` (959 lines) -- touch fallback
- `KioskPaymentViews.swift` (688 lines) -- payment flows
- `VoiceOrderingOrchestrator.swift` (1,324 lines, kiosk-specific version)
- `KioskVoiceEngine.swift` (234 lines) -- STT + TTS + wake word
- Attract mode, screensaver, large touch targets
- Cart summary, item detail sheet, confirmation overlay
Backend: Supabase (141 tables)
- Order management, user accounts, payments, analytics
- Realtime subscriptions for order status updates
- Edge Functions for Stripe PaymentIntent creation
### 1b. SpeakFlow (Desktop/SpeakFlow/)
Scale: 21 Swift source files in SpeakFlowCore, 6 in SpeakFlowKeyboard, 4 in SpeakFlow app. Full test suite.
Architecture:
- Privacy-first, offline-first voice OS competing with Wispr Flow ($10M ARR)
- `SpeechService.swift` -- SFSpeechRecognizer with on-device requirement, failable init with locale fallback, AsyncStream-based recording
- `AudioProcessingService.swift` -- Noise gate (RMS threshold), high-pass EQ (80Hz), input gain
- `CommandModeService.swift` -- Voice-driven text editing, rule-based transformations (formal/casual/grammar/case/list), no LLM needed for basic commands
- `SmartFormattingService.swift` -- NLP punctuation, capitalization, number formatting
- `VoiceCommandService.swift` -- 20+ hands-free commands (delete, undo, copy, new line)
- `ContextAwarenessService.swift` -- App bundle ID -> tone mapping (no screen capture)
- `NKoTransliterationService.swift` -- Latin <-> N'Ko via IPA intermediary
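
Since the audio processing stages are the part most directly portable to Linux, a minimal sketch of an RMS noise gate and an 80Hz high-pass filter follows. This is a textbook first-order RC filter, not necessarily the filter design `AudioProcessingService.swift` uses; the 0.01 gate threshold and 16 kHz sample rate are assumed values.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of float samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def noise_gate(frame, threshold=0.01):
    """Pass the frame through if its RMS reaches the threshold, else mute it."""
    if rms(frame) >= threshold:
        return list(frame)
    return [0.0] * len(frame)


def high_pass(samples, cutoff_hz=80.0, sample_rate=16000):
    """First-order RC high-pass: y[n] = a * (y[n-1] + x[n] - x[n-1])."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    a = rc / (rc + dt)
    out = []
    prev_x = prev_y = 0.0
    for i, x in enumerate(samples):
        y = x if i == 0 else a * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out
```

A constant (DC) input decays toward zero through `high_pass`, while speech-band content passes nearly unchanged, which is the rumble-removal behaviour the 80Hz EQ is there for.
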
V2 Planned:
- CoreML Whisper (distilled, Apple Silicon optimized) running parallel with SFSpeechRecognizer
- MLX Gemma 3 4B for command mode on Mac
- Mesh fallback routing to Mac4/Mac5 via Tailscale
Key Insight for LUME: SpeakFlow proves on-device STT works with <200ms latency via SFSpeechRecognizer. The audio processing pipeline (noise gate, high-pass filter) is directly portable. However, SpeakFlow is iOS/macOS only -- LUME runs Linux on Jetson, where SFSpeechRecognizer is not available.
### 1c. LUME Hardware Specs (from prior Evo3)
Device: 22"W x 5"H x 3"D, matte black anodized aluminum, 4.2 lbs
- Jetson Orin Nano Super: 1024 CUDA cores, 67 TOPS INT8, 8GB LPDDR5X
- Orbbec Femto Bolt: 640x576 @30fps depth, 120-degree FOV, rated to 5.5m
- Two content cameras (wide + tight, Sony IMX577 4K)
- 3x MEMS mic array with ADC board
- Stereo 3W speakers
- NVMe 512GB
- WiFi 6E (Intel AX211)
- HDMI 2.1 out, USB-C x2, 3.5mm out
Visual Pipeline Budget (proven on paper):
| Component | GPU Time |
|-----------|---------|
| Depth processing | 1.5ms |
| Optical flow | 1.0ms |
| Fluid sim (256x256) | 3.0ms |
| VFX Graph (50K particles) | 4.0ms |
| Color palette | 0.1ms |
| UI/overlay | 1.0ms |
| Compositing | 2.0ms |
| Total | ~12.6ms (~79fps) |
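
The total in the table can be checked with a few lines of arithmetic, which also shows how much GPU time per frame would be left over for voice processing. The per-stage numbers come from the table; the 60 fps display target is an assumption introduced here to compute headroom.

```python
# Per-frame GPU cost of each visual pipeline stage, in milliseconds (from the table).
GPU_MS = {
    "depth_processing": 1.5,
    "optical_flow": 1.0,
    "fluid_sim_256": 3.0,
    "vfx_graph_50k": 4.0,
    "color_palette": 0.1,
    "ui_overlay": 1.0,
    "compositing": 2.0,
}

TARGET_FPS = 60  # assumed display target, not stated in the spec

total_ms = sum(GPU_MS.values())          # about 12.6 ms
max_fps = 1000.0 / total_ms              # about 79 fps if the GPU did nothing else
frame_budget_ms = 1000.0 / TARGET_FPS    # about 16.7 ms per frame at 60 fps
headroom_ms = frame_budget_ms - total_ms  # GPU time left per frame for other work

print(f"total={total_ms:.1f}ms max_fps={max_fps:.1f} headroom={headroom_ms:.2f}ms")
```

Under the 60 fps assumption the visual pipeline leaves roughly 4 ms of GPU time per frame, which is the window any concurrent STT or LLM work has to share.
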
BOM (3,000 qty): $715. Retail: $1,299. Margin: 45%.
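
For reference, the margin figure follows directly from the BOM and retail price. This quick check treats margin as gross margin on retail price and ignores channel, shipping, and support costs, which is an assumption about how the number was derived.

```python
BOM_USD = 715.0
RETAIL_USD = 1299.0

gross_profit = RETAIL_USD - BOM_USD        # dollars per unit
gross_margin = gross_profit / RETAIL_USD   # fraction of retail price
markup = gross_profit / BOM_USD            # fraction of cost

print(f"profit=${gross_profit:.0f} margin={gross_margin:.1%} markup={markup:.1%}")
```
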
Content Pipeline: Real-time compositor + AI Director (Murch scoring) + auto-editor -> 60s reel + 30s highlight ready <90s after session end. WiFi Direct transfer to phone.
---
## 2. WHAT HAS BEEN TRIED / PRIOR ART
### 2a. Voice POS Systems
- Square Voice POS (concept): Square added voice ordering to their ecosystem but only via companion iPad with their Mobile Payments SDK. No embedded/Linux SDK.
- Toast, Clover, Lightspeed: All iPad-based POS. No headless/embedded options. No voice-first interfaces.
- Amazon Just Walk Out: Camera + weight sensors, no voice ordering. $1M+ installation cost. In 2024 Amazon began removing it from its own grocery stores in favor of Dash Carts.
### 2b. Depth-Based People Counting
- FootfallCam Centroid: Dedicated depth-camera people counting. 3D stereo imaging. Heatmaps, dwell time, queue metrics. MSRP ~$500-2,000/sensor.
- V-Count Nano AI: AI-on-chip depth processing. Anonymous silhouettes only. $400-800/unit.
- Xovis: Premium depth sensors for airports/retail. People counting + dwell time. $2,000-5,000/unit.
- Ariadne: Hybrid Fusion (ToF depth + signal sensor). Multi-zone journey analytics.
- Market: $2.1B by 2029, 70
Key Gap: All existing solutions are DEDICATED single-purpose sensors. None combine queue analytics with entertainment or commerce. LUME would be the first device that counts bodies, takes orders, AND entertains customers from a single sensor.
### 2c. On-Device Voice for Embedded Linux
- Whisper.cpp + CUDA on Jetson Orin Nano Super: Proven to work (Thomas Thelliez, Jan 2026). Base model goes from 20-30s latency (CPU-only) to "a few seconds" with CUDA. TensorRT optimization gives ~3x speedup vs PyTorch.
- NVIDIA Nemotron 3 Nano 9B: Runs on Jetson Orin Nano Super at ~9 tokens/sec via llama.cpp.
- Qwen2.5 7B: 21.75 tok/s on Jetson Orin Nano Super 8GB (INT4, MLC API).
- Estimated Qwen3-4B Q4: ~30-40 tok/s on Jetson Orin Nano Super (extrapolated from the 7B numbers, assuming ~2x throughput for half the parameters). Sufficient for intent parsing: a typical 20-50 token output decodes in roughly 0.5-1.7s.
- Piper TTS (local): <100ms per sentence on ARM. Well-tested on embedded Linux.
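
The intent-parsing latency estimate above is simple division, sketched here so the bounds are explicit. The throughput range and output lengths come from the bullets; the calculation covers token decoding only and ignores prompt prefill time, which would add to the real figure.

```python
QWEN25_7B_TOK_S = 21.75  # measured figure quoted above (INT4, MLC)


def decode_seconds(output_tokens, tok_per_s):
    """Time to generate the output tokens at a steady decode rate."""
    return output_tokens / tok_per_s


# Naive upper bound if halving the parameters exactly doubled throughput.
naive_4b_tok_s = QWEN25_7B_TOK_S * 2

best_case_s = decode_seconds(20, 40)   # shortest output, fastest estimate
worst_case_s = decode_seconds(50, 30)  # longest output, slowest estimate

print(f"naive 4B estimate: {naive_4b_tok_s:.1f} tok/s")
print(f"decode time: {best_case_s:.2f}s to {worst_case_s:.2f}s")
```

The worst case only stays under 1.5s if the model sustains at least about 33 tok/s, so the lower end of the throughput estimate is the number worth benchmarking first.
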
### 2d. Payment Terminal on Linux
- Stripe Terminal: No native Linux SDK. Server-driven integration is possible (REST calls from Jetson -> Stripe API -> reader manages its own display). The Stripe Reader M2 pairs over Bluetooth through the iOS/Android SDKs, with USB supported on Android only, so it cannot be driven from Linux. For Linux: server-driven integration with the Stripe S700 smart reader (WiFi-based, self-contained).
- Square Terminal API: Server-driven integration possible. Square Terminal is self-contained (screen, receipt printer, card reader). LUME sends order via REST API -> Square Terminal displays and processes payment independently. This is the path of least resistance for a Linux embedded device.
- SumUp: REST API for payment processing. SumUp Air reader (BLE, iOS/Android SDK). No Linux SDK.
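
As an illustration of the server-driven path, the sketch below builds (but does not send) a Square Terminal checkout request using only the standard library. The endpoint and field names reflect my understanding of Square's Terminal API and should be verified against the current Square documentation; the helper name, device ID, token, and order reference are placeholders.

```python
import json
import uuid
from urllib.request import Request

# Assumed production endpoint for creating a Terminal checkout.
SQUARE_CHECKOUT_URL = "https://connect.squareup.com/v2/terminals/checkouts"


def build_checkout_request(amount_cents, device_id, access_token, order_ref,
                           currency="USD"):
    """Build the POST request asking a Square Terminal to collect payment."""
    body = {
        # A fresh key per attempt lets the API deduplicate retried requests.
        "idempotency_key": str(uuid.uuid4()),
        "checkout": {
            # Amounts are expressed in the smallest currency unit (cents).
            "amount_money": {"amount": amount_cents, "currency": currency},
            "device_options": {"device_id": device_id},
            "reference_id": order_ref,
        },
    }
    return Request(
        SQUARE_CHECKOUT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it would be a single `urllib.request.urlopen(request)` call from the Jetson; the terminal then handles card entry on its own screen, so card data never passes through LUME.
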
### 2e. Entertainment-While-Waiting Prior Art
- McDonald's PlayPlace: Physical entertainment for kids, no digital content creation
- Starbucks + Spotify: Digital music partnership, passive listening only
- Shake Shack Pager System: Queue management, no entertainment
- No precedent found for: Interactive depth-reactive entertainment that creates shareable content during queue time. This is genuinely novel.
---
## 3. REAL CONSTRAINTS
### 3a. Technical Constraints
1. Jetson GPU budget is shared. Visual pipeline (12.6ms/frame) + Whisper STT + LLM intent parsing must coexist. The visual pipeline runs continuously; voice processing is bursty. GPU scheduling must be time-sliced or the visual pipeline stutters during voice processing.
2. No Apple SpeechAnalyzer on Linux. The entire BWB voice pipeline is built on Apple's Speech framework. Every STT call must be rewritten for Whisper.cpp on Jetson.
3. No native Square/Stripe SDK on Linux. The iOS SDKs BWB uses cannot run on the Jetson, so the payment reader connection must go through REST APIs. This adds latency (WiFi round-trip) and limits reader pairing to BLE/WiFi-based readers.
4. Memory constraint: 8GB LPDDR5X is shared between GPU and CPU, and the visual pipeline, LLM, and depth processing must all fit. Qwen3-4B Q4 uses ~2.5GB and Whisper base ~200MB, which leaves ~5.3GB for the visual pipeline and OS.
5. NVMe bandwidth shared between session recording (H.265 + camera feeds, ~5-10 MB/s) and potential order data writes.
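
The memory constraint in item 4 reduces to a budget check like the one below. The model footprints are the figures quoted above; the `fits` helper and any visual pipeline or OS numbers passed to it are hypothetical and would need to be measured on the device.

```python
TOTAL_GB = 8.0  # unified memory shared by CPU and GPU

MODEL_GB = {
    "qwen3_4b_q4": 2.5,   # figure quoted in constraint 4
    "whisper_base": 0.2,  # ~200MB
}

remaining_gb = TOTAL_GB - sum(MODEL_GB.values())  # left for visuals + OS


def fits(visual_pipeline_gb, os_gb):
    """True if the visual pipeline and OS fit in what the models leave free."""
    return visual_pipeline_gb + os_gb <= remaining_gb


print(f"remaining after models: {remaining_gb:.1f} GB")
```
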
### 3b. Business Constraints
1. BOM impact of Commerce features: Square Reader ($49; $0 if the venue uses a Square Terminal purchased as a separate device). USB-C hub for reader connection (~$5). NFC sticker for content transfer (~$0.10/unit). Total BOM impact: ~$5-54 depending on bundled accessories, minor against the $715 base BOM.
2. Certification: FCC/CE already in plan. Adding payment processing requires PCI compliance scope analysis (though card data never touches LUME if using Stripe/Square managed readers).
3. Single-person company: Must avoid support burden explosion. Voice ordering adds complexity (misunderstood orders, payment failures, refunds). Need strong fallback (touch ordering on companion iPad) and clear error recovery.
4. **$199/month Commerce tier pricing** must compete with Square POS ($0-60/month for software) + dedicated hardware. Value proposition must be "you're already paying for the LUME device for entertainment; commerce is an incremental feature that pays for itself."
### 3c. BWB Code Reuse Assessment
| Component | Reuse Level | Why |
|---|---|---|
| VoiceOrderingOrchestrator state machine | HIGH (port logic, rewrite IO) | State machine is platform-agnostic |
| TranscriptPipeline | HIGH (port) | Pure data processing, no iOS deps |
| LiveOrderPreviewGenerator | HIGH (port) | Pure regex/string matching |
| OrderParsingPipeline (AI + NLU merge) | MEDIUM (rewrite AI parser) | NLU patterns reusable, AI parser needs local LLM |
| CartCoordinator | HIGH (port) | Pure business logic |
| ConfirmationCoordinator | HIGH (port) | Pure state machine |
| FeedbackCoordinator | LOW (rewrite) | AVSpeechSynthesizer -> Piper TTS. System sounds -> local audio files |
| SessionManager | HIGH (port) | Timer-based logic |
| QueueService | MEDIUM (adapt) | Supabase realtime still works, but no iOS-specific RealtimeService |
| KitchenDisplayView | LOW (rewrite) | SwiftUI -> web dashboard or HDMI overlay |
| PaymentService | LOW (rewrite) | iOS Square SDK -> REST API to Square Terminal |
| AIRoutingFacade | LOW (rewrite) | Location-based routing not needed for single-device LUME |
| SquareMobilePaymentHandler | LOW (rewrite) | Entire BLE reader workflow changes to REST API |
| EnhancedVoiceNLUEngine | MEDIUM (port NLU patterns) | Coffee-domain vocabulary and patterns are gold |
| Domain Models (Order, MenuItem, etc.) | HIGH (port) | Pure data structures |
Bottom line: ~60
---
## 4. OPEN QUESTIONS
1. GPU contention: Can Whisper.cpp + Qwen3-4B run during visual pipeline without visible frame drops? Need benchmark. Hypothesis: voice is bursty (process 5-10s of audio in ~2s), so the GPU contention window is short.
2. Payment reader selection: Square Terminal (self-contained, WiFi, $299 device) vs Stripe S700 (smart reader, WiFi, $349) vs cheap BLE reader with server-driven flow? Square Terminal has its own screen and receipt printer, which simplifies the LUME integration to just API calls.
3. Companion iPad for payment fallback? If voice ordering happens on LUME and payment happens on a companion iPad running BWB_Kiosk with Square Reader, the existing BWB code works unchanged. Is this the pragmatic V1?
4. Content QR transfer mechanics: Customer interacts with LUME visuals while waiting. How do they receive the content clip? NFC tap (requires phone NFC)? QR code on LUME display? Companion app? SMS link?
5. Multi-tenant configuration: Different coffee shops have different menus, prices, tax rates. How does LUME get configured per-venue? Cloud config management vs local config file?
6. Privacy in commerce context: Depth camera in a payment area. Need clear signage that no facial recognition occurs (depth only, no RGB identification during payment). Legal review needed for different jurisdictions.
7. Menu update mechanism: When a coffee shop adds/removes items, how does LUME's NLU vocabulary update? OTA config push from Supabase? Manual update via admin app?
---
## 5. RESEARCH BRIEF SUMMARY
The convergence opportunity is real and unprecedented. No product combines:
- Depth-reactive visual entertainment
- Voice-first POS ordering
- Privacy-preserving queue analytics
- Auto-generated shareable content
The BWB codebase provides ~60
The GPU budget is tight but feasible: visual pipeline uses ~12.6ms/frame of GPU, leaving headroom for bursty voice processing. The key technical risk is concurrent GPU usage during active voice sessions.
The business model is compelling: existing LUME subscribers ($19-149/month) get commerce as a $199/month upsell. That price is positioned against $0-60/month Square/Toast POS software by bundling entertainment and analytics that competitors cannot offer. The entertainment flywheel (content creation during queue time) is the moat no POS competitor can replicate.
Stage 1 should explore 6 genuinely different architectural approaches to this convergence.
Source Anchor
evo-cube-output/lume-commerce-pos/stage0-research.md