SpeakFlow V2 — Architecture & Business Plan
Vision
SpeakFlow is a privacy-first, offline-first voice OS that replaces typing across every app on Mac, iOS, and eventually Windows. It competes directly with Wispr Flow ($10M ARR, $700M valuation, 270 Fortune 500 customers) by exploiting their three biggest vulnerabilities: cloud-only processing, 800MB RAM bloat, and zero customer support.
Competitive Position
Why We Win
| Dimension | Wispr Flow | SpeakFlow |
|---|---|---|
| Processing | 100 | |
| Privacy | Screen capture + cloud upload | Zero data leaves device |
| RAM | 800MB idle | Target: <80MB |
| CPU idle | 8 | |
| Offline | No. Dead without internet | Full functionality offline |
| Price | $12/mo ($144/yr) | $49 lifetime or $4/mo |
| Latency | ~700ms (network round-trip) | <200ms (on-device) |
| Platforms | Mac, Win, iOS, Android | Mac, iOS (Win later) |
| Support | 0 | |
| N'Ko | No | Native transliteration + keyboard |
| AI Commands | Cloud LLM (Llama on Baseten) | On-device MLX (Gemma 3 4B) + mesh fallback |
| Architecture | Electron (Windows) | Native Swift (all platforms) |
Attack Surfaces
1. Privacy refugees: Developers, lawyers, medical professionals actively searching for local alternatives (documented in Reddit threads, Trustpilot cancellations)
2. Resource-conscious users: 800MB RAM is absurd for dictation. Position as "dictation that doesn't tax your system"
3. Price-sensitive users: $12/mo for a utility feels wrong. $49 lifetime matches proven price points (Voibe, Superwhisper)
4. Offline workers: Trains, planes, cafes with bad wifi, rural areas. Wispr is dead without internet.
Architecture
Core Pipeline (On-Device)
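The first stage of the pipeline below is a noise gate driven by an RMS threshold with hysteresis. A minimal sketch of that idea in plain Swift follows. The `NoiseGate` type, its threshold values, and the frame-based `process` method are illustrative assumptions, not the existing AudioProcessingService API.

```swift
/// RMS noise gate with hysteresis (illustrative sketch; thresholds are assumptions).
struct NoiseGate {
    /// The gate opens when the frame RMS rises above this level (linear amplitude).
    let openThreshold: Float
    /// The gate closes only when the frame RMS falls below this lower level.
    let closeThreshold: Float
    private(set) var isOpen = false

    init(openThreshold: Float = 0.02, closeThreshold: Float = 0.01) {
        precondition(closeThreshold <= openThreshold, "close threshold must not exceed open threshold")
        self.openThreshold = openThreshold
        self.closeThreshold = closeThreshold
    }

    /// Root-mean-square level of one frame of samples.
    static func rms(_ samples: [Float]) -> Float {
        guard !samples.isEmpty else { return 0 }
        let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
        return (sumOfSquares / Float(samples.count)).squareRoot()
    }

    /// Updates the gate state for one frame and returns the gated samples.
    mutating func process(_ samples: [Float]) -> [Float] {
        let level = NoiseGate.rms(samples)
        if isOpen {
            if level < closeThreshold { isOpen = false }
        } else {
            if level > openThreshold { isOpen = true }
        }
        // A closed gate emits silence of the same length.
        return isOpen ? samples : [Float](repeating: 0, count: samples.count)
    }
}
```

Using two thresholds keeps the gate from chattering when the input level hovers near a single cutoff. In the app, frames would presumably come from an AVAudioEngine tap rather than plain arrays.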
```
Audio Input (AVAudioEngine)
    │
    ├── Noise Gate (RMS threshold, hysteresis)
    ├── High-Pass EQ (80Hz, removes rumble)
    └── Input Gain (-20 to +20 dB)
    │
    ▼
Speech Recognition
    │
    ├── PRIMARY: SFSpeechRecognizer (iOS 26 on-device, free, private)
    │   ├── requiresOnDeviceRecognition = true
    │   └── addsPunctuation = true (iOS 17+)
    │
    ├── ENHANCED (V2): CoreML Whisper (distilled, Apple Silicon optimized)
    │   ├── whisper-large-v3-turbo via coremltools
    │   ├── Handles: accents, code jargon, whisper-level input
    │   └── Runs parallel, merges with SF results for confidence boost
    │
    └── FALLBACK: MLX Whisper (Mac only, for maximum accuracy)
        └── whisper-large-v3 via mlx-whisper
    │
    ▼
Post-Processing Pipeline
    │
    ├── Smart Formatting (existing SmartFormattingService)
    │   └── NLP punctuation, capitalization, number formatting
    │
    ├── Voice Commands (existing VoiceCommandService)
    │   └── 20+ commands: delete, undo, copy, new line, etc.
    │
    ├── Command Mode (NEW, V2)
    │   ├── "Hey Flow" trigger → on-device LLM processes edit instruction
    │   ├── "Make this formal" / "Translate to French" / "Summarize"
    │   └── MLX Gemma 3 4B on Mac, CoreML distilled on iOS
    │
    ├── Context Awareness (NEW, V2)
    │   ├── App-name detection via Accessibility API (NOT screen capture)
    │   ├── Tone adaptation: formal (Mail), casual (Messages), code (Xcode)
    │   └── No screen capture. No cloud. Just app bundle ID.
    │
    └── N'Ko Transliteration (existing)
        └── Latin ↔ N'Ko via IPA intermediary
    │
    ▼
Text Injection
    │
    ├── AX API (Accessibility, preferred)
    └── CGEvent Paste (fallback for sandboxed apps)
```

Platform Architecture
```
┌─────────────────────────────────────────────────────┐
│                    SpeakFlowCore                    │
│         (Shared Swift Package / Framework)          │
│                                                     │
│  SpeechService             AudioProcessingService   │
│  SmartFormattingService    VoiceCommandService      │
│  CommandModeService(NEW)   ContextAwarenessService  │
│  NKoTransliteration        UserDictionaryService    │
│  VoiceSnippetService       PerformanceProfiler      │
│  WhisperCoreMLService(NEW)                          │
└──────────┬──────────────────────┬───────────────────┘
           │                      │
    ┌──────┴──────┐        ┌──────┴────────┐
    │ macOS       │        │ iOS           │
    │ Menu Bar    │        │ Keyboard      │
    │ App         │        │ Extension     │
    │             │        │ + Host App    │
    │ HotKeyMgr   │        │               │
    │ TextInject  │        │ KeyboardVC    │
    │ MLX Engine  │        │ CoreML Only   │
    │ RecordPill  │        │ ListenOverlay │
    └─────────────┘        └───────────────┘
```

New Services (V2)
#### 1. CommandModeService
Wispr Flow's stickiest feature. Voice-driven text editing after dictation.
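The outline that follows separates built-in commands (no LLM needed) from instructions that require a model. One way that split could be decided is a small router like the sketch below. `CommandRoute` and `routeCommand` are hypothetical names introduced for illustration, and exact-phrase matching is an assumption; the real service may parse instructions differently.

```swift
import Foundation

/// Where an instruction should be handled. All names here are hypothetical.
enum CommandRoute: Equatable {
    case makeShorter
    case fixGrammar
    case translate(language: String)
    case changeTone(formal: Bool)
    case llm(instruction: String)
}

/// Maps a spoken instruction to a built-in handler, or to the LLM path
/// when no built-in phrase matches.
func routeCommand(_ instruction: String) -> CommandRoute {
    let text = instruction
        .lowercased()
        .trimmingCharacters(in: .whitespacesAndNewlines)

    // "translate to <language>" carries a parameter, so handle it first.
    let translatePrefix = "translate to "
    if text.hasPrefix(translatePrefix) {
        let language = String(text.dropFirst(translatePrefix.count))
        if !language.isEmpty { return .translate(language: language) }
    }

    switch text {
    case "make shorter":
        return .makeShorter
    case "fix grammar":
        return .fixGrammar
    case "make formal", "make this formal":
        return .changeTone(formal: true)
    case "make casual", "make this casual":
        return .changeTone(formal: false)
    default:
        // Anything unrecognized goes to the model with the original wording.
        return .llm(instruction: instruction)
    }
}
```

Routing before any model is loaded means Phase 1 (built-in commands only) can ship without the LLM path, and the `.llm` case marks exactly where on-device or mesh inference would plug in later.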
```swift
class CommandModeService {
    // Trigger: user holds hotkey a second time after dictating,
    // or says "Hey Flow" followed by an instruction.
    func processCommand(_ instruction: String, selectedText: String) async -> String

    // On Mac: MLX Gemma 3 4B (fast, on-device)
    // On iOS: CoreML distilled model (3B or smaller)
    // Fallback: mesh route to Mac4/Mac5 via Tailscale

    // Built-in commands (no LLM needed):
    // - "make shorter" → extractive summary
    // - "fix grammar" → LanguageTool/NLP
    // - "translate to [lang]" → Apple Translation framework (iOS 18+)
    // - "make formal/casual" → template-based tone shift

    // LLM commands (needs model):
    // - "rewrite as bullet points"
    // - "explain this simply"
    // - Custom instructions
}
```

#### 2. WhisperCoreMLService
Enhanced recognition for accents, code, and whisper-level input.
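The service outlined next merges two recognizers by taking the higher-confidence segments from each. A minimal sketch of such a merge is shown here, assuming both recognizers emit time-stamped segments with confidence scores on a comparable 0 to 1 scale. The `Segment` type and `mergeByConfidence` function are hypothetical, and a Whisper confidence would have to be derived separately (for example from token log-probabilities).

```swift
/// One recognized span of audio. Field names are assumptions for this sketch.
struct Segment: Equatable {
    let start: Double       // seconds from the start of the utterance
    let end: Double         // seconds
    let text: String
    let confidence: Double  // 0...1, assumed comparable across recognizers
}

/// True when two segments cover overlapping time ranges.
func overlaps(_ a: Segment, _ b: Segment) -> Bool {
    a.start < b.end && b.start < a.end
}

/// Greedy merge: visit all segments in descending confidence order and keep
/// a segment only if it does not overlap one that was already kept.
/// Ties go to the primary recognizer because its segments come first.
func mergeByConfidence(_ primary: [Segment], _ secondary: [Segment]) -> [Segment] {
    let candidates = (primary + secondary)
        .enumerated()
        .sorted { lhs, rhs in
            if lhs.element.confidence != rhs.element.confidence {
                return lhs.element.confidence > rhs.element.confidence
            }
            return lhs.offset < rhs.offset
        }
        .map { $0.element }

    var kept: [Segment] = []
    for candidate in candidates where !kept.contains(where: { overlaps($0, candidate) }) {
        kept.append(candidate)
    }
    // Restore chronological order for the final transcript.
    return kept.sorted { $0.start < $1.start }
}
```

The greedy approach is simple but can leave gaps when segment boundaries from the two recognizers do not line up; a production merge would likely need word-level alignment.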
```swift
class WhisperCoreMLService {
    // CoreML-converted whisper-large-v3-turbo
    // Runs alongside SFSpeechRecognizer
    // Merges results: takes higher-confidence segments from each
    // ~300ms latency on M1+, ~500ms on A16+
    func transcribe(_ audioBuffer: AVAudioPCMBuffer) async -> TranscriptionResult
}
```

#### 3. ContextAwarenessService (Enhanced)
App-aware tone without screen capture.
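The core of the service outlined next is a lookup from bundle ID to tone profile. A minimal sketch of that lookup follows. The bundle IDs come from the outline, while the `Tone` cases, the fields of `DictationContext`, and the `.neutral` default for unknown apps are assumptions made for illustration.

```swift
/// Tone profiles; the case names are assumptions for this sketch.
enum Tone: Equatable {
    case formal, casual, casualShort, code, neutral
}

/// The fields of this type are hypothetical; the plan only names the type.
struct DictationContext: Equatable {
    let bundleID: String?
    let tone: Tone
}

/// Bundle ID → tone mapping, using the IDs listed in the plan.
let toneProfiles: [String: Tone] = [
    "com.apple.mail": .formal,
    "com.tinyspeck.slackmacgap": .casual,
    "com.microsoft.VSCode": .code,
    "com.apple.dt.Xcode": .code,
    "com.apple.MobileSMS": .casualShort,
]

/// Resolves a context from a bundle ID alone. Unknown or missing IDs fall
/// back to a neutral tone, so no other signal about the app is needed.
func context(forBundleID bundleID: String?) -> DictationContext {
    let tone = bundleID.flatMap { toneProfiles[$0] } ?? .neutral
    return DictationContext(bundleID: bundleID, tone: tone)
}
```

On macOS the input could come from `NSWorkspace.shared.frontmostApplication?.bundleIdentifier`. Keeping the lookup a pure function of the bundle ID makes the privacy claim easy to audit: nothing else about the foreground app is read.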
```swift
class ContextAwarenessService {
    // Detects active app via NSWorkspace (Mac) / UIApplication (iOS)
    // Maps bundle IDs to tone profiles:
    //   com.apple.mail → formal
    //   com.tinyspeck.slackmacgap → casual
    //   com.microsoft.VSCode → code (variable naming, no punctuation in identifiers)
    //   com.apple.dt.Xcode → code
    //   com.apple.MobileSMS → casual, short
    // NO screen capture. NO accessibility text reading.
    // Just bundle ID → tone mapping. Simple, private, effective.
    func currentContext() -> DictationContext
}
```

Mesh Integration (Unique Advantage)
SpeakFlow has something no competitor has: a 5-machine compute mesh with cognitive twin intelligence.
```
SpeakFlow App (on-device, primary)
    │
    ├── Normal dictation: 100% on-device, no network round-trip
    │
    └── Command Mode (complex instructions):
        ├── Try on-device MLX/CoreML first
        └── If device is phone/low-power:
            └── Route to mesh via Tailscale
                ├── Mac4: Ollama (large models)
                ├── Mac5: MLX Server (cognitive twin fine-tuned)
                └── exo cluster: distributed inference
```

This means an iPhone user gets desktop-class LLM processing through the mesh while audio stays on-device. Only the text instruction and the selected text travel over Tailscale (encrypted, private network).
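The routing rule described above can be sketched as a small decision function. This is an illustration under stated assumptions: `DeviceClass`, `ExecutionTarget`, `MeshRequest`, and `chooseTarget` are hypothetical names, how an instruction gets classified as complex is left open, and the host strings in the usage example are placeholders.

```swift
enum DeviceClass {
    case mac, phone
}

enum ExecutionTarget: Equatable {
    case onDevice
    case mesh(host: String)
}

/// What crosses the network when the mesh is used: text only, never audio.
struct MeshRequest: Equatable {
    let instruction: String
    let selectedText: String
}

/// On-device first; only complex instructions on a phone are routed out,
/// and only when a mesh host is actually reachable.
func chooseTarget(device: DeviceClass,
                  isComplexInstruction: Bool,
                  reachableHosts: [String]) -> ExecutionTarget {
    // Simple instructions and all Mac requests stay local.
    guard device == .phone, isComplexInstruction else { return .onDevice }
    // Prefer the first reachable host; otherwise degrade to the local model.
    if let host = reachableHosts.first { return .mesh(host: host) }
    return .onDevice
}
```

For example, `chooseTarget(device: .phone, isComplexInstruction: true, reachableHosts: ["mac5"])` selects the mesh, while the same call with an empty host list falls back to the smaller on-device model, so Command Mode keeps working offline.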
Data Architecture
```
On-Device Storage (Core Data / SwiftData)
├── Transcription history (searchable, local)
├── Custom vocabulary (synced via iCloud/CloudKit)
├── Voice snippets (synced via iCloud)
├── Usage analytics (local only, never uploaded)
└── Performance metrics (local profiler data)

Sync (Optional, User-Controlled)
├── CloudKit: vocabulary, snippets, settings across devices
├── App Group: keyboard ↔ host app (existing)
├── NO cloud transcription storage
└── NO telemetry without explicit opt-in
```

Business Model
Pricing Strategy
| Tier | Price | What You Get |
|---|---|---|
| Free | $0 | Unlimited dictation, 20+ voice commands, N'Ko support, basic formatting |
| Pro | $4/mo or $49 lifetime | Command Mode, custom vocabulary sync, context awareness, whisper mode, priority support |
| Team | $8/user/mo | Shared vocabulary, admin controls, usage dashboard |
Why this works:
- Free tier is genuinely unlimited (not 2,000 words/week like Wispr). This is possible because everything runs on-device — no per-user cloud costs.
- $49 lifetime undercuts Wispr's $144/yr and matches proven price points (Voibe $99, BetterDictation $39)
- $4/mo is an easy impulse buy vs Wispr's $12/mo
- Zero marginal cost per user (no cloud compute) means lifetime pricing is sustainable
Revenue Projections
TAM: Voice dictation market ~$5B by 2027 (Grand View Research). Wispr proved 100x YoY growth is possible.
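Target: 10,000 paid users in Year 1 at $49 average revenue per user = $490K in Year 1 revenue. Lifetime purchases are one-time, so only the subscription share of this figure recurs annually.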
- 50
- 30
- 20
Go-To-Market
1. Launch on Product Hunt — "Wispr Flow but private and free" angle
2. Reddit seeding — r/macapps, r/productivity, r/dictation. These communities are already discussing Wispr alternatives weekly.
3. Developer angle — Code-aware dictation that works in Xcode, Cursor, VS Code without sending your code to the cloud
4. Medical/Legal — HIPAA without the enterprise price tag. Everything stays on-device, no BAA needed.
5. N'Ko community — Only voice-to-text that supports N'Ko script natively. Cultural mission + technical differentiation.
6. App Store optimization — "voice to text", "dictation", "speech to text", "wispr alternative"
Moat
1. Mesh intelligence: No indie competitor has a 5-machine compute mesh with cognitive twin fine-tuned models. Command Mode quality scales with mesh, not cloud spend.
2. N'Ko: Cultural mission that no VC-backed competitor will pursue. Locks in a passionate community.
3. Zero marginal cost: On-device = no cloud bills. Can offer unlimited free tier forever. Wispr can't.
4. Cross-pollination: 46+ apps in the portfolio. SpeakFlow becomes the voice layer for CreativeDirector, Spore, OpenClawHub. Network effects across the ecosystem.
5. Cortex/KARL: Behavioral intelligence from 112K+ conversation turns. The app learns how you dictate, not how average users dictate.
Implementation Roadmap
### Phase 1: Core Upgrade (Current Sprint)
- [ ] Fix SF-001 (fatalError on locale), SF-002 (usleep on main thread)
- [ ] iOS 26 speech recognition integration (on-device, enhanced accuracy)
- [ ] Lightweight resource footprint (<80MB RAM target)
- [ ] Command Mode v1 (built-in commands only, no LLM)
### Phase 2: Intelligence Layer
- [ ] CoreML Whisper integration (parallel recognition, confidence merge)
- [ ] Command Mode v2 (on-device MLX Gemma 3 4B for Mac)
- [ ] Context awareness via bundle ID → tone mapping
- [ ] Whisper-level input support (gain boost + noise gate tuning)
### Phase 3: Ecosystem
- [ ] Mesh routing for iOS Command Mode (Tailscale → Mac4/Mac5)
- [ ] CloudKit sync for vocabulary, snippets, settings
- [ ] Cross-app integration (CreativeDirector teleprompter, Spore voice capture)
- [ ] StoreKit 2 paywall (OpenClawPayments)
### Phase 4: Market
- [ ] Product Hunt launch
- [ ] App Store screenshots + metadata optimization
- [ ] Landing page (Vercel)
- [ ] Reddit/community seeding
Critical Differentiators vs Every Competitor
| Feature | SpeakFlow | Wispr | Superwhisper | Voibe | VoiceInk |
|---|---|---|---|---|---|
| Offline | Yes | No | Yes | Yes | Yes |
| iOS | Yes | Yes | Yes | No | No |
| Command Mode | Yes (on-device) | Yes (cloud) | Partial | No | No |
| N'Ko | Yes | No | No | No | No |
| Mesh compute | Yes | No | No | No | No |
| Custom keyboard | Yes | Yes | No | No | No |
| Free unlimited | Yes | 2K words/wk | No | No | Yes |
| RAM | <80MB | 800MB | ~150MB | ~80MB | ~100MB |
| Price | $49 lifetime | $144/yr | $65/yr | $44/yr | Free |
| Cognitive twin | Yes | No | No | No | No |
Technical Risks & Mitigations
1. SFSpeechRecognizer accuracy vs Whisper: Apple's on-device model is good but not Whisper-quality for accents/jargon. Mitigation: CoreML Whisper as parallel enhancer, custom vocabulary for domain terms.
2. iOS keyboard memory limit (30MB): Keyboard extensions are sandboxed. Mitigation: Keep keyboard thin (text input + App Group polling), heavy processing in host app.
3. MLX model size for Command Mode: Gemma 3 4B is ~2.5GB. Not viable on iPhone. Mitigation: CoreML distilled 1B model for iOS, mesh fallback for complex instructions.
4. Apple Speech API limits: 1-minute continuous recognition limit on some iOS versions. Mitigation: Automatic session restart with overlap, seamless to user.
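For the fourth mitigation, restarting a recognition session with overlapping audio means the tail of one transcript reappears at the head of the next. A minimal sketch of stitching the two together is shown below. The `stitch` function and its `maxOverlapWords` parameter are hypothetical, and word-level exact matching (ignoring case) is an assumption; punctuation differences between sessions are not handled.

```swift
/// Joins two transcripts from consecutive sessions, dropping words at the
/// start of `next` that repeat the end of `previous`.
func stitch(_ previous: String, _ next: String, maxOverlapWords: Int = 8) -> String {
    let a = previous.split(separator: " ").map(String.init)
    let b = next.split(separator: " ").map(String.init)
    let limit = min(maxOverlapWords, a.count, b.count)

    // Find the longest suffix of `a` that equals a prefix of `b`.
    var overlap = 0
    for n in stride(from: limit, to: 0, by: -1) {
        let tail = a.suffix(n).map { $0.lowercased() }
        let head = b.prefix(n).map { $0.lowercased() }
        if tail == head {
            overlap = n
            break
        }
    }
    // Keep all of `a`, then only the non-repeated part of `b`.
    return (a + b.dropFirst(overlap)).joined(separator: " ")
}
```

For example, stitching "send the report by" with "report by Friday morning" yields "send the report by Friday morning". Capping the overlap window keeps the comparison cheap and avoids collapsing text that legitimately repeats far apart.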
Source Anchor
SpeakFlow/ARCHITECTURE-V2.md