Mac4 Model Inference Benchmark — Cognitive Twin Brain
**Date:** 2026-02-18 **Host:** Mac4 — Apple M4, 16GB RAM, macOS 15.6 **Ollama:** v0.16.2 **Task:** Routing classification (5-category: triage, coding, architecture, planning, ops)
Test Methodology
- 10 routing classification prompts per model
- Temperature: 0, deterministic outputs
- Measured: tokens/sec (generation), RSS memory, routing accuracy
- Ground truth defined for each prompt
- API: `http://[ip]:11434/api/generate` (non-streaming)
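The two computed metrics in this list can be sketched as below. This is a minimal illustration, not the harness used for the benchmark. It assumes generation speed is derived from the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama returns from `/api/generate`, and the helper names are hypothetical.

```python
from typing import Optional

NANOSECONDS_PER_SECOND = 1_000_000_000


def generation_tokens_per_second(response: dict) -> Optional[float]:
    """Generation speed from a non-streaming /api/generate response.

    eval_count is the number of generated tokens; eval_duration is the
    time spent generating them, in nanoseconds.
    """
    count = response.get("eval_count")
    duration_ns = response.get("eval_duration")
    if not count or not duration_ns:
        return None  # fields missing or zero, e.g. an empty generation
    return count * NANOSECONDS_PER_SECOND / duration_ns


def routing_accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Fraction of predictions that exactly match the expected category."""
    if not ground_truth or len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must be equal, non-zero length")
    correct = sum(
        predicted.strip().lower() == expected
        for predicted, expected in zip(predictions, ground_truth)
    )
    return correct / len(ground_truth)


# Ground truth and llama3.2:3b outputs for Q01..Q10, as listed in the detailed results.
EXPECTED = ["coding", "ops", "architecture", "planning", "coding",
            "ops", "triage", "coding", "architecture", "ops"]
LLAMA_PREDICTIONS = ["coding", "coding", "architecture", "planning", "coding",
                     "ops", "ops", "testing", "architecture", "ops"]

print(routing_accuracy(LLAMA_PREDICTIONS, EXPECTED))  # 0.7
```

Exact string matching means an out-of-vocabulary answer such as `testing` simply counts as a miss.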
Results Summary
| Model | Size | Accuracy | Gen Speed | RSS Memory | Passes Criteria? |
|---|---|---|---|---|---|
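| llama3.2:3b | 2.0 GB | 7/10 (70%) | 71.3 tok/s | ~2,237 MB | ❌ accuracy below 85% |
| qwen3:4b | 2.5 GB | 6/10 (60%) | 29.0 tok/s | ~3,311 MB | ❌ speed and accuracy below threshold |
| gemma3:4b | 3.3 GB | 8/10 (80%) | 44.3 tok/s | ~2,765 MB | ❌ speed and accuracy below threshold |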
| qwen3:30b-a3b | 18 GB | — | — | — | ❌ too large (18GB > 16GB RAM) |
Criteria: >50 tok/s, <14GB RAM, >85% accuracy
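As a sketch of how the pass/fail column can be derived, the snippet below applies the three thresholds to the measured values reported in this document. The `ModelResult` type and `failed_criteria` helper are hypothetical names, and treating 14GB as 14 × 1024 MB is an assumption.

```python
from dataclasses import dataclass

MIN_TOKENS_PER_SECOND = 50.0
MAX_RSS_MB = 14 * 1024  # assumption: 14 GB expressed in binary megabytes
MIN_ACCURACY = 0.85


@dataclass(frozen=True)
class ModelResult:
    name: str
    tokens_per_second: float
    rss_mb: float
    accuracy: float


def failed_criteria(result: ModelResult) -> list[str]:
    """Return the names of the criteria this result does not meet."""
    failures = []
    if not result.tokens_per_second > MIN_TOKENS_PER_SECOND:
        failures.append("speed")
    if not result.rss_mb < MAX_RSS_MB:
        failures.append("memory")
    if not result.accuracy > MIN_ACCURACY:
        failures.append("accuracy")
    return failures


# Measured values taken from the per-model results in this report.
RESULTS = [
    ModelResult("llama3.2:3b", 71.3, 2237, 0.70),
    ModelResult("qwen3:4b", 29.0, 3311, 0.60),
    ModelResult("gemma3:4b", 44.3, 2765, 0.80),
]

for result in RESULTS:
    print(result.name, failed_criteria(result) or "passes")
```

Every model fits in memory with a wide margin; the failures come from accuracy, and for two of the three models also from speed.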
Detailed Results
llama3.2:3b (Q4_K_M, 2.0 GB)
[✓] Q01: SwiftUI settings view → coding (expected: coding)
[✗] Q02: CI/CD pipeline GitHub Actions → coding (expected: ops)
[✓] Q03: Microservices communication → architecture (expected: architecture)
[✓] Q04: 2-week sprint plan → planning (expected: planning)
[✓] Q05: Fix crash on login screen → coding (expected: coding)
[✓] Q06: Deploy staging to AWS ECS → ops (expected: ops)
[✗] Q07: Review bug reports, prioritize → ops (expected: triage)
[✗] Q08: Write unit tests payment module → testing (expected: coding)
[✓] Q09: REST vs GraphQL evaluation → architecture (expected: architecture)
[✓] Q10: Grafana/Prometheus monitoring → ops (expected: ops)
Speed: 71.3 tok/s avg | Prompt processing: 777.8 tok/s
Memory: ~2,237 MB RSS
- Fastest by far, exceeding the 50 tok/s threshold
- Struggles with triage vs ops distinction
- Outputs single words cleanly (no thinking overhead)
qwen3:4b (Q4_K_M, 2.5 GB)
[✓] Q01: SwiftUI settings view → coding (expected: coding)
[?] Q02: CI/CD pipeline → (thinking...) (expected: ops)
[✓] Q03: Microservices communication → architecture (expected: architecture)
[✓] Q04: 2-week sprint plan → planning (expected: planning)
[✓] Q05: Fix crash on login screen → coding (expected: coding)
[?] Q06: Deploy staging to AWS ECS → (thinking...) (expected: ops)
[✗] Q07: Review bug reports, prioritize → (thinking...) (expected: triage)
[✗] Q08: Write unit tests payment module → (thinking...) (expected: coding)
[✓] Q09: REST vs GraphQL evaluation → architecture (expected: architecture)
[?] Q10: Grafana/Prometheus monitoring → (thinking...) (expected: ops)
Speed: 29.0 tok/s avg
Memory: ~3,311 MB RSS
- Critical flaw: built-in "thinking" mode generates 200-500 internal reasoning tokens before answering
- Most of the 500-token budget consumed by `<think>...</think>` blocks
- Actual answer often truncated or empty
- Not suitable for fast routing without disabling thinking (which Ollama doesn't fully support)
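To illustrate why a truncated `<think>` block yields an empty or unusable answer, here is a minimal post-processing sketch. It is not part of the benchmark: the `extract_category` helper is hypothetical, and it assumes the model wraps its reasoning in literal `<think>...</think>` tags as described above.

```python
import re
from typing import Optional

CATEGORIES = ("triage", "coding", "architecture", "planning", "ops")

# A complete reasoning block, and one cut off before the closing tag.
_CLOSED_THINK = re.compile(r"<think>.*?</think>", re.DOTALL)
_OPEN_THINK = re.compile(r"<think>.*", re.DOTALL)


def extract_category(raw_output: str) -> Optional[str]:
    """Strip reasoning blocks and return a valid category, or None."""
    text = _CLOSED_THINK.sub("", raw_output)
    # If the token budget ran out mid-thought, everything after <think> is reasoning.
    text = _OPEN_THINK.sub("", text)
    word = text.strip().strip(".\"'`").lower()
    return word if word in CATEGORIES else None


print(extract_category("<think>CI/CD is about deployment.</think>\nops"))  # ops
print(extract_category("<think>Let me consider the options"))              # None
print(extract_category("testing"))                                         # None
```

Stripping the tags recovers an answer only when the model actually reached it within the token budget, so this does not remove the latency cost of the reasoning tokens.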
gemma3:4b (Q4_K_M, 3.3 GB)
[✓] Q01: SwiftUI settings view → coding (expected: coding)
[✓] Q02: CI/CD pipeline GitHub Actions → ops (expected: ops)
[✓] Q03: Microservices communication → architecture (expected: architecture)
[✓] Q04: 2-week sprint plan → planning (expected: planning)
[✗] Q05: Fix crash on login screen → triage (expected: coding)
[✓] Q06: Deploy staging to AWS ECS → ops (expected: ops)
[✓] Q07: Review bug reports, prioritize → triage (expected: triage)
[✓] Q08: Write unit tests payment module → coding (expected: coding)
[✗] Q09: REST vs GraphQL evaluation → planning (expected: architecture)
[✓] Q10: Grafana/Prometheus monitoring → ops (expected: ops)
Speed: 44.3 tok/s avg
Memory: ~2,765 MB RSS
- Best accuracy (80%)
- Clean single-word outputs (3-4 tokens per response)
- Speed (44.3 tok/s) is close to but below the 50 tok/s threshold
- Misclassified "fix crash" as triage and "REST vs GraphQL" as planning
Recommendation
🏆 Winner: gemma3:4b (conditional)
No model fully meets all three criteria (>50 tok/s, <14GB RAM, >85% accuracy).
| Criterion | gemma3:4b | Gap |
|---|---|---|
| Speed >50 tok/s | 44.3 tok/s | -5.7 tok/s (about 11% below threshold) |
| Memory <14GB | 2.8 GB | ✅ well within |
| Accuracy >85% | 80% | -5 percentage points |
Why gemma3:4b over llama3.2:3b:
- 10 percentage points higher accuracy (80% vs 70%)
- Clean, concise outputs (no thinking overhead)
- Handles the triage category correctly
- Speed gap (44 vs 71 tok/s) is less critical for routing (single-call latency <100ms either way)
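The latency claim in the last bullet can be checked with simple arithmetic. The sketch below assumes a 4-token answer for both models (the report gives 3-4 tokens per response for gemma3:4b only) and counts generation time alone, excluding prompt processing and HTTP overhead.

```python
def generation_latency_ms(tokens: int, tokens_per_second: float) -> float:
    """Time to generate `tokens` at a steady generation rate, in milliseconds."""
    return tokens / tokens_per_second * 1000.0


# Average generation speeds measured in this benchmark.
for name, speed in (("gemma3:4b", 44.3), ("llama3.2:3b", 71.3)):
    print(f"{name}: {generation_latency_ms(4, speed):.0f} ms for 4 tokens")
# gemma3:4b: 90 ms for 4 tokens
# llama3.2:3b: 56 ms for 4 tokens
```

Both figures stay under 100 ms, so for single-word routing answers the difference is roughly 34 ms per call.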
Why not qwen3:4b:
- Thinking mode is a dealbreaker — wastes 10-20x tokens on internal reasoning
- Slowest at 29 tok/s
- Worst accuracy at 60%
### Action Items
1. Deploy gemma3:4b as the routing brain — best accuracy/speed tradeoff
2. Tune the prompts to close the 80%→85% accuracy gap
3. Keep llama3.2:3b as fallback for latency-critical paths
4. Consider custom Modelfile with system prompt tuning for routing
5. Re-evaluate if/when Ollama adds proper thinking-mode control for Qwen3
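Before committing a system prompt to a custom Modelfile (item 4), it may be quicker to prototype it per request through the `system` field of `/api/generate`. The sketch below is one possible shape for that. The system prompt wording is hypothetical, not the prompt used in this benchmark, and `base_url` must be supplied by the caller.

```python
import json
import urllib.request

CATEGORIES = ("triage", "coding", "architecture", "planning", "ops")

# Hypothetical system prompt, for illustration only.
SYSTEM_PROMPT = (
    "You are a task router. Classify the task into exactly one category from: "
    + ", ".join(CATEGORIES)
    + ". Reply with ONLY the category name."
)


def build_routing_request(task: str, model: str = "gemma3:4b") -> dict:
    """Build a non-streaming, deterministic /api/generate payload."""
    return {
        "model": model,
        "system": SYSTEM_PROMPT,
        "prompt": task,
        "stream": False,
        "options": {"temperature": 0, "num_predict": 10},
    }


def classify(task: str, base_url: str) -> str:
    """POST the payload to an Ollama server and return the raw answer text."""
    body = json.dumps(build_routing_request(task)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as reply:
        return json.loads(reply.read())["response"].strip()
```

A call such as `classify("Fix crash on login screen", "http://localhost:11434")` would exercise it against a local server; the localhost address is an assumption, not the benchmark host.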
Quick Start
# On Mac4
ollama run gemma3:4b
# API call for routing
curl http://[ip]:11434/api/generate -d '{
"model": "gemma3:4b",
"prompt": "Classify this task into exactly one category [triage, coding, architecture, planning, ops]: '\''<TASK>'\''. Reply with ONLY the category name.",
"stream": false,
"options": {"temperature": 0, "num_predict": 10}
}'
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
Comp-Core/packages/cognitive-twin/MAC4_BENCHMARK.md