Grand Diomande Research · Full HTML Reader

SpeakFlow V2 — Architecture & Business Plan


Vision

SpeakFlow is a privacy-first, offline-first voice OS that replaces typing across every app on Mac, iOS, and eventually Windows. It competes directly with Wispr Flow ($10M ARR, $700M valuation, 270 Fortune 500 customers) by exploiting their three biggest vulnerabilities: cloud-only processing, 800MB RAM bloat, and zero customer support.

Competitive Position

Why We Win

Dimension	Wispr Flow	SpeakFlow
Processing	100% cloud	100% on-device
Privacy	Screen capture + cloud upload	Zero data leaves device
RAM	800MB idle	Target: <80MB
CPU idle	8
Offline	No. Dead without internet	Full functionality offline
Price	$12/mo ($144/yr)	$49 lifetime or $4/mo
Latency	~700ms (network round-trip)	<200ms (on-device)
Platforms	Mac, Win, iOS, Android	Mac, iOS (Win later)
Support	0
N'Ko	No	Native transliteration + keyboard
AI Commands	Cloud LLM (Llama on Baseten)	On-device MLX (Gemma 3 4B) + mesh fallback
Architecture	Electron (Windows)	Native Swift (all platforms)

Attack Surfaces

1. Privacy refugees: Developers, lawyers, medical professionals actively searching for local alternatives (documented in Reddit threads, Trustpilot cancellations)
2. Resource-conscious users: 800MB RAM is absurd for dictation. Position as "dictation that doesn't tax your system"
3. Price-sensitive users: $12/mo for a utility feels wrong. $49 lifetime matches proven price points (Voibe, Superwhisper)
4. Offline workers: Trains, planes, cafes with bad wifi, rural areas. Wispr is dead without internet.

Architecture

Core Pipeline (On-Device)

Audio Input (AVAudioEngine)
    │
    ├── Noise Gate (RMS threshold, hysteresis)
    ├── High-Pass EQ (80Hz, removes rumble)
    └── Input Gain (-20 to +20 dB)
    │
    ▼
Speech Recognition
    │
    ├── PRIMARY: SFSpeechRecognizer (iOS 26 on-device, free, private)
    │   └── requiresOnDeviceRecognition = true
    │   └── addsPunctuation = true (iOS 16+)
    │
    ├── ENHANCED (V2): CoreML Whisper (distilled, Apple Silicon optimized)
    │   └── whisper-large-v3-turbo via coremltools
    │   └── Handles: accents, code jargon, whisper-level input
    │   └── Runs parallel, merges with SF results for confidence boost
    │
    └── FALLBACK: MLX Whisper (Mac only, for maximum accuracy)
        └── whisper-large-v3 via mlx-whisper
    │
    ▼
Post-Processing Pipeline
    │
    ├── Smart Formatting (existing SmartFormattingService)
    │   └── NLP punctuation, capitalization, number formatting
    │
    ├── Voice Commands (existing VoiceCommandService)
    │   └── 20+ commands: delete, undo, copy, new line, etc.
    │
    ├── Command Mode (NEW — V2)
    │   └── "Hey Flow" trigger → on-device LLM processes edit instruction
    │   └── "Make this formal" / "Translate to French" / "Summarize"
    │   └── MLX Gemma 3 4B on Mac, CoreML distilled on iOS
    │
    ├── Context Awareness (NEW — V2)
    │   └── App detection via the frontmost app's bundle ID (NOT screen capture)
    │   └── Tone adaptation: formal (Mail), casual (Messages), code (Xcode)
    │   └── No screen capture. No cloud. Just app bundle ID.
    │
    └── N'Ko Transliteration (existing)
        └── Latin ↔ N'Ko via IPA intermediary
    │
    ▼
Text Injection
    │
    ├── AX API (Accessibility, preferred)
    └── CGEvent Paste (fallback for sandboxed apps)
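
The noise gate at the top of the pipeline can be sketched as follows. This is a minimal illustration of an RMS threshold with hysteresis, not the shipped AudioProcessingService: the `NoiseGate` type, its threshold values, and the frame-as-array representation are all assumptions made for the example.

```swift
import Foundation

/// Hypothetical noise gate: opens above `openThreshold`, closes only below `closeThreshold`.
/// The gap between the two thresholds is the hysteresis that prevents rapid toggling.
struct NoiseGate {
    let openThreshold: Float   // RMS level that opens the gate (assumed value)
    let closeThreshold: Float  // lower RMS level that closes it again (assumed value)
    private(set) var isOpen = false

    init(openThreshold: Float = 0.02, closeThreshold: Float = 0.01) {
        self.openThreshold = openThreshold
        self.closeThreshold = closeThreshold
    }

    /// Root-mean-square level of one frame of samples in [-1, 1].
    static func rms(_ samples: [Float]) -> Float {
        guard !samples.isEmpty else { return 0 }
        let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
        return (sumOfSquares / Float(samples.count)).squareRoot()
    }

    /// Updates the gate state for one frame and returns the frame, or silence if gated.
    mutating func process(_ samples: [Float]) -> [Float] {
        let level = NoiseGate.rms(samples)
        if isOpen {
            if level < closeThreshold { isOpen = false }
        } else {
            if level > openThreshold { isOpen = true }
        }
        return isOpen ? samples : [Float](repeating: 0, count: samples.count)
    }
}
```

With two thresholds, a speaker whose level dips briefly between words stays ungated, while steady background noise below the open threshold never reaches the recognizer.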

Platform Architecture

┌─────────────────────────────────────────────────┐
│                 SpeakFlowCore                    │
│         (Shared Swift Package / Framework)       │
│                                                  │
│  SpeechService          AudioProcessingService   │
│  SmartFormattingService  VoiceCommandService      │
│  CommandModeService(NEW) ContextAwarenessService  │
│  NKoTransliteration     UserDictionaryService    │
│  VoiceSnippetService     PerformanceProfiler     │
│  WhisperCoreMLService(NEW)                       │
└──────────┬──────────────────┬────────────────────┘
           │                  │
    ┌──────┴──────┐    ┌──────┴──────┐
    │   macOS     │    │    iOS      │
    │  Menu Bar   │    │  Keyboard   │
    │   App       │    │  Extension  │
    │             │    │  + Host App │
    │ HotKeyMgr   │    │             │
    │ TextInject   │    │ KeyboardVC  │
    │ MLX Engine  │    │ CoreML Only │
    │ RecordPill  │    │ ListenOverlay│
    └─────────────┘    └─────────────┘

New Services (V2)

#### 1. CommandModeService
Command Mode is Wispr Flow's stickiest feature: voice-driven text editing after dictation.

swift

protocol CommandModeService {
    // Trigger: the user holds the hotkey a second time after dictating,
    // or says "Hey Flow" followed by an instruction.

    // Execution targets, tried in order:
    // - On Mac: MLX Gemma 3 4B (fast, on-device)
    // - On iOS: CoreML distilled model (3B or smaller)
    // - Fallback: mesh route to Mac4/Mac5 via Tailscale
    func processCommand(_ instruction: String, selectedText: String) async -> String

    // Built-in commands (no LLM needed):
    // - "make shorter" → extractive summary
    // - "fix grammar" → LanguageTool/NLP
    // - "translate to [lang]" → Apple Translation framework (iOS 18+)
    // - "make formal/casual" → template-based tone shift

    // LLM commands (needs model):
    // - "rewrite as bullet points"
    // - "explain this simply"
    // - Custom instructions
}
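
The split between built-in and LLM commands can be sketched as a small router. This is an illustration under assumptions: `CommandRoute`, `builtInPrefixes`, and `route(_:)` are hypothetical names, and simple prefix matching stands in for whatever intent detection the real service uses.

```swift
import Foundation

/// Where a Command Mode instruction is handled (illustrative names).
enum CommandRoute: Equatable {
    case builtIn(String)   // handled without a model; carries the matched phrase
    case llm               // needs the on-device or mesh LLM
}

/// Phrases treated as built-in, taken from the command list above.
let builtInPrefixes = ["make shorter", "fix grammar", "translate to", "make formal", "make casual"]

/// Normalizes the spoken instruction and checks it against the built-in phrases.
func route(_ instruction: String) -> CommandRoute {
    let normalized = instruction
        .lowercased()
        .trimmingCharacters(in: .whitespacesAndNewlines)
    if let match = builtInPrefixes.first(where: { normalized.hasPrefix($0) }) {
        return .builtIn(match)
    }
    return .llm
}
```

Routing on the phrase first means Command Mode v1 can ship with built-in commands only, and anything unmatched simply waits for the LLM path to exist.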

#### 2. WhisperCoreMLService
Enhanced recognition for accents, code, and whisper-level input.

swift

import AVFoundation

// TranscriptionResult is assumed to be defined elsewhere in SpeakFlowCore.
protocol WhisperCoreMLService {
    // CoreML-converted whisper-large-v3-turbo
    // Runs alongside SFSpeechRecognizer
    // Merges results: takes the higher-confidence segments from each
    // ~300ms latency on M1+, ~500ms on A16+

    func transcribe(_ audioBuffer: AVAudioPCMBuffer) async -> TranscriptionResult
}
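
One way the confidence merge could work is sketched below. It assumes both recognizers emit time-stamped segments with a confidence score; the `Segment` type and the overlap rule are illustrative, and a real merge would need proper word-level alignment.

```swift
import Foundation

/// Hypothetical time-stamped recognizer output.
struct Segment: Equatable {
    let start: Double      // seconds
    let end: Double        // seconds
    let text: String
    let confidence: Float  // 0...1
}

/// True when the two segments share any stretch of time.
func overlaps(_ a: Segment, _ b: Segment) -> Bool {
    a.start < b.end && b.start < a.end
}

/// Keeps each primary segment unless an overlapping secondary segment is more confident.
func merge(primary: [Segment], secondary: [Segment]) -> [Segment] {
    var result: [Segment] = []
    for p in primary {
        let best = secondary
            .filter { overlaps($0, p) }
            .max { $0.confidence < $1.confidence }
        if let s = best, s.confidence > p.confidence {
            // Avoid appending the same secondary segment twice in a row.
            if result.last != s { result.append(s) }
        } else {
            result.append(p)
        }
    }
    return result
}
```

Treating SFSpeechRecognizer as the primary stream keeps the fast path intact: Whisper output only replaces a segment when it is strictly more confident.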

#### 3. ContextAwarenessService (Enhanced)
App-aware tone without screen capture.

swift

// DictationContext is assumed to be defined elsewhere in SpeakFlowCore.
protocol ContextAwarenessService {
    // Detects the frontmost app via NSWorkspace on Mac.
    // On iOS a keyboard extension has no public API for the host app's bundle ID,
    // so the iOS path needs another signal (for example UITextDocumentProxy input traits).
    // Maps bundle IDs to tone profiles:
    //   com.apple.mail → formal
    //   com.tinyspeck.slackmacgap → casual
    //   com.microsoft.VSCode → code (variable naming, no punctuation in identifiers)
    //   com.apple.dt.Xcode → code
    //   com.apple.MobileSMS → casual, short

    // NO screen capture. NO accessibility text reading.
    // Just bundle ID → tone mapping. Simple, private, effective.

    func currentContext() -> DictationContext
}
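
The bundle ID to tone mapping is small enough to show in full. The sketch below uses the bundle IDs listed above; the `Tone` enum, the `neutral` default, and the function name are assumptions for illustration.

```swift
import Foundation

/// Illustrative tone profiles; `neutral` is an assumed default for unmapped apps.
enum Tone: String {
    case formal, casual, code, neutral
}

/// Bundle IDs taken from the mapping described above.
let toneByBundleID: [String: Tone] = [
    "com.apple.mail": .formal,
    "com.tinyspeck.slackmacgap": .casual,
    "com.microsoft.VSCode": .code,
    "com.apple.dt.Xcode": .code,
    "com.apple.MobileSMS": .casual,
]

/// Looks up the tone for a bundle ID, falling back to neutral when unknown or missing.
func tone(forBundleID bundleID: String?) -> Tone {
    guard let bundleID = bundleID else { return .neutral }
    return toneByBundleID[bundleID] ?? .neutral
}
```

On macOS the input would typically come from `NSWorkspace.shared.frontmostApplication?.bundleIdentifier`, which is already optional, so the nil case is handled without extra code at the call site.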

Mesh Integration (Unique Advantage)

SpeakFlow has something no competitor has: a 5-machine compute mesh with cognitive twin intelligence.

SpeakFlow App (on-device, primary)
    │
    ├── Normal dictation: 100% on-device, no network round-trip
    │
    └── Command Mode (complex instructions):
        ├── Try on-device MLX/CoreML first
        └── If device is phone/low-power:
            └── Route to mesh via Tailscale
                ├── Mac4: Ollama (large models)
                ├── Mac5: MLX Server (cognitive twin fine-tuned)
                └── exo cluster: distributed inference

This means an iPhone user gets desktop-class LLM processing through the mesh, while keeping audio on-device. Only the text instruction + selected text travel over Tailscale (encrypted, private network).
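
The routing rule and the text-only payload can be sketched together. Everything named here (`DeviceClass`, `InferenceTarget`, `MeshRequest`, `MeshRouter`, and the host names) is hypothetical; the sketch only illustrates the decision described above, not the actual mesh client.

```swift
import Foundation

enum DeviceClass { case mac, phone }

enum InferenceTarget: Equatable {
    case onDevice
    case mesh(host: String)
}

/// The only data that crosses the network: text, never audio.
struct MeshRequest: Codable, Equatable {
    let instruction: String
    let selectedText: String
}

struct MeshRouter {
    /// Hypothetical Tailscale hostnames that currently answer, in preference order.
    let reachableHosts: [String]

    /// Phones send complex instructions to the mesh when a host is reachable;
    /// every other case stays on-device.
    func target(for device: DeviceClass, isComplex: Bool) -> InferenceTarget {
        guard device == .phone, isComplex, let host = reachableHosts.first else {
            return .onDevice
        }
        return .mesh(host: host)
    }

    /// Serializes the request body that would travel over the private network.
    func encode(_ request: MeshRequest) throws -> Data {
        try JSONEncoder().encode(request)
    }
}
```

Because `MeshRequest` has no audio field, the type itself documents the privacy boundary: there is nothing to send except the instruction and the selected text.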

Data Architecture

On-Device Storage (Core Data / SwiftData)
    ├── Transcription history (searchable, local)
    ├── Custom vocabulary (synced via iCloud/CloudKit)
    ├── Voice snippets (synced via iCloud)
    ├── Usage analytics (local only, never uploaded)
    └── Performance metrics (local profiler data)

Sync (Optional, User-Controlled)
    ├── CloudKit: vocabulary, snippets, settings across devices
    ├── App Group: keyboard ↔ host app (existing)
    └── NO cloud transcription storage
    └── NO telemetry without explicit opt-in

Business Model

Pricing Strategy

Tier	Price	What You Get
Free	$0	Unlimited dictation, 20+ voice commands, N'Ko support, basic formatting
Pro	$4/mo or $49 lifetime	Command Mode, custom vocabulary sync, context awareness, whisper mode, priority support
Team	$8/user/mo	Shared vocabulary, admin controls, usage dashboard

Why this works:
- Free tier is genuinely unlimited (not 2,000 words/week like Wispr). This is possible because everything runs on-device — no per-user cloud costs.
- $49 lifetime undercuts Wispr's $144/yr and sits within the range of proven price points (Voibe $99, BetterDictation $39)
- $4/mo is an easy impulse buy vs Wispr's $12/mo
- Zero marginal cost per user (no cloud compute) means lifetime pricing is sustainable
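
The price comparison reduces to a break-even calculation. The helper below is a hypothetical illustration using only the prices quoted in this section.

```swift
import Foundation

/// First whole month at which cumulative monthly payments reach or exceed a one-time price.
func breakEvenMonths(oneTime: Double, monthly: Double) -> Int {
    Int((oneTime / monthly).rounded(.up))
}

// $49 lifetime vs SpeakFlow's own $4/mo plan.
let versusOwnMonthly = breakEvenMonths(oneTime: 49, monthly: 4)

// $49 lifetime vs Wispr's $12/mo plan.
let versusWispr = breakEvenMonths(oneTime: 49, monthly: 12)
```

At these prices the lifetime tier pays for itself in 13 months against the $4/mo plan and in 5 months against a $12/mo subscription.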

Revenue Projections

TAM: Voice dictation market ~$5B by 2027 (Grand View Research). Wispr proved 100x YoY growth is possible.

Target: 10,000 paid users in Year 1 at $49 avg = $490K in Year 1 revenue (not all of it recurring, since lifetime purchases are one-time)
- 50
- 30
- 20

Go-To-Market

1. Launch on Product Hunt — "Wispr Flow but private and free" angle
2. Reddit seeding — r/macapps, r/productivity, r/dictation. These communities are already discussing Wispr alternatives weekly.
3. Developer angle — Code-aware dictation that works in Xcode, Cursor, VS Code without sending your code to the cloud
4. Medical/Legal — HIPAA without the enterprise price tag. Everything stays on-device, no BAA needed.
5. N'Ko community — Only voice-to-text that supports N'Ko script natively. Cultural mission + technical differentiation.
6. App Store optimization — "voice to text", "dictation", "speech to text", "wispr alternative"

Moat

1. Mesh intelligence: No indie competitor has a 5-machine compute mesh with cognitive twin fine-tuned models. Command Mode quality scales with mesh, not cloud spend.
2. N'Ko: Cultural mission that no VC-backed competitor will pursue. Locks in a passionate community.
3. Zero marginal cost: On-device = no cloud bills. Can offer unlimited free tier forever. Wispr can't.
4. Cross-pollination: 46+ apps in the portfolio. SpeakFlow becomes the voice layer for CreativeDirector, Spore, OpenClawHub. Network effects across the ecosystem.
5. Cortex/KARL: Behavioral intelligence from 112K+ conversation turns. The app learns how you dictate, not how average users dictate.

Implementation Roadmap

### Phase 1: Core Upgrade (Current Sprint)
- [ ] Fix SF-001 (fatalError on locale), SF-002 (usleep on main thread)
- [ ] iOS 26 speech recognition integration (on-device, enhanced accuracy)
- [ ] Lightweight resource footprint (<80MB RAM target)
- [ ] Command Mode v1 (built-in commands only, no LLM)

### Phase 2: Intelligence Layer
- [ ] CoreML Whisper integration (parallel recognition, confidence merge)
- [ ] Command Mode v2 (on-device MLX Gemma 3 4B for Mac)
- [ ] Context awareness via bundle ID → tone mapping
- [ ] Whisper-level input support (gain boost + noise gate tuning)

### Phase 3: Ecosystem
- [ ] Mesh routing for iOS Command Mode (Tailscale → Mac4/Mac5)
- [ ] CloudKit sync for vocabulary, snippets, settings
- [ ] Cross-app integration (CreativeDirector teleprompter, Spore voice capture)
- [ ] StoreKit 2 paywall (OpenClawPayments)

### Phase 4: Market
- [ ] Product Hunt launch
- [ ] App Store screenshots + metadata optimization
- [ ] Landing page (Vercel)
- [ ] Reddit/community seeding

Critical Differentiators vs Every Competitor

Feature	SpeakFlow	Wispr	Superwhisper	Voibe	VoiceInk
Offline	Yes	No	Yes	Yes	Yes
iOS	Yes	Yes	Yes	No	No
Command Mode	Yes (on-device)	Yes (cloud)	Partial	No	No
N'Ko	Yes	No	No	No	No
Mesh compute	Yes	No	No	No	No
Custom keyboard	Yes	Yes	No	No	No
Free unlimited	Yes	2K words/wk	No	No	Yes
RAM	<80MB	800MB	~150MB	~80MB	~100MB
Price	$49 lifetime	$144/yr	$65/yr	$44/yr	Free
Cognitive twin	Yes	No	No	No	No

Technical Risks & Mitigations

1. SFSpeechRecognizer accuracy vs Whisper: Apple's on-device model is good but not Whisper-quality for accents/jargon. Mitigation: CoreML Whisper as parallel enhancer, custom vocabulary for domain terms.

2. iOS keyboard memory limit (30MB): Keyboard extensions are sandboxed. Mitigation: Keep keyboard thin (text input + App Group polling), heavy processing in host app.

3. MLX model size for Command Mode: Gemma 3 4B is ~2.5GB. Not viable on iPhone. Mitigation: CoreML distilled 1B model for iOS, mesh fallback for complex instructions.

4. Apple Speech API limits: 1-minute continuous recognition limit on some iOS versions. Mitigation: Automatic session restart with overlap, seamless to user.
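
The session-restart mitigation needs a way to join the transcript from the ending session with the one from the overlapping new session. A minimal sketch, assuming word-level overlap and case-insensitive matching (the `stitch` function is hypothetical, and real recognizer output may differ across sessions in ways this does not handle):

```swift
import Foundation

/// Joins two transcripts, dropping the longest run of words that ends `previous`
/// and also begins `next` (compared case-insensitively).
func stitch(_ previous: String, _ next: String) -> String {
    let a = previous.split(separator: " ").map(String.init)
    let b = next.split(separator: " ").map(String.init)
    let maxOverlap = min(a.count, b.count)
    var overlap = 0
    // Try the longest possible overlap first, then shrink.
    for k in stride(from: maxOverlap, to: 0, by: -1) {
        let tail = a.suffix(k).map { $0.lowercased() }
        let head = b.prefix(k).map { $0.lowercased() }
        if tail == head {
            overlap = k
            break
        }
    }
    return (a + b.dropFirst(overlap)).joined(separator: " ")
}
```

If the two sessions share no words at the boundary, the function falls back to plain concatenation, so a missed overlap produces a duplicate at worst and never drops text.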


Source Anchor

SpeakFlow/ARCHITECTURE-V2.md
