Grand Diomande Research · Full HTML Reader

Mac4 Model Inference Benchmark — Cognitive Twin Brain

**Date:** 2026-02-18 **Host:** Mac4 — Apple M4, 16GB RAM, macOS 15.6 **Ollama:** v0.16.2 **Task:** Routing classification (5-category: triage, coding, architecture, planning, ops)

Test Methodology

10 routing classification prompts per model
Temperature: 0, deterministic outputs
Measured: tokens/sec (generation), RSS memory, routing accuracy
Ground truth defined for each prompt
API: `http://[ip]:11434/api/generate` (non-streaming)
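The writeup does not say how the speed and accuracy figures were computed. One plausible way, sketched here as an assumption, is to derive them from the timing fields that a non-streaming `/api/generate` response carries (`eval_count`, `eval_duration`, `prompt_eval_count`, `prompt_eval_duration`, with durations in nanoseconds) and to score predictions against the ground-truth labels. The helper names and the sample numbers are hypothetical.

```python
from typing import Mapping, Sequence


def generation_tok_per_s(response: Mapping[str, int]) -> float:
    """Generation speed: generated tokens divided by generation time in seconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def prompt_tok_per_s(response: Mapping[str, int]) -> float:
    """Prompt-processing speed: prompt tokens divided by prompt eval time in seconds."""
    return response["prompt_eval_count"] / (response["prompt_eval_duration"] / 1e9)


def routing_accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of prompts whose normalized prediction equals the ground-truth label."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    hits = sum(p.strip().lower() == e for p, e in zip(predicted, expected))
    return hits / len(expected)


# Made-up response fragment for illustration only, not a measured run.
sample_response = {
    "eval_count": 4,
    "eval_duration": 80_000_000,          # nanoseconds
    "prompt_eval_count": 60,
    "prompt_eval_duration": 100_000_000,  # nanoseconds
}
print(generation_tok_per_s(sample_response))
print(prompt_tok_per_s(sample_response))
```

Averaging `generation_tok_per_s` over the 10 prompts would give a per-model figure comparable to the "tok/s avg" values reported below, if the benchmark used these fields.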

Results Summary

Model	Size	Accuracy	Gen Speed	RSS Memory	Passes Criteria?
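llama3.2:3b	2.0 GB	7/10 (70%)	71.3 tok/s	~2,237 MB	❌ accuracy below 85%
qwen3:4b	2.5 GB	6/10 (60%)	29.0 tok/s	~3,311 MB	❌ speed and accuracy below threshold
gemma3:4b	3.3 GB	8/10 (80%)	44.3 tok/s	~2,765 MB	❌ speed and accuracy below threshold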
qwen3:30b-a3b	18 GB	—	—	—	❌ too large (18GB > 16GB RAM)

Criteria: >50 tok/s, <14GB RAM, >85% accuracy
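The three thresholds can be applied mechanically to the measured numbers. The following is a minimal sketch using the figures reported in this document. The `Result` type and `failed_criteria` helper are illustrative names, and treating 14GB as 14 × 1024 MB is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Result:
    model: str
    tok_per_s: float
    rss_mb: float
    correct: int
    total: int = 10


MIN_TOK_PER_S = 50.0
MAX_RSS_MB = 14 * 1024  # assumption: 14GB interpreted as 14 * 1024 MB
MIN_ACCURACY = 0.85


def failed_criteria(r: Result) -> List[str]:
    """Return the names of the criteria a result does not meet (empty list = pass)."""
    failures = []
    if not r.tok_per_s > MIN_TOK_PER_S:
        failures.append("speed")
    if not r.rss_mb < MAX_RSS_MB:
        failures.append("memory")
    if not r.correct / r.total > MIN_ACCURACY:
        failures.append("accuracy")
    return failures


# Figures taken from the per-model sections of this benchmark.
RESULTS = [
    Result("llama3.2:3b", 71.3, 2237, 7),
    Result("qwen3:4b", 29.0, 3311, 6),
    Result("gemma3:4b", 44.3, 2765, 8),
]

for r in RESULTS:
    print(r.model, failed_criteria(r) or "passes")
```

On these inputs every model fails at least the accuracy check, and only llama3.2:3b clears the speed threshold.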

Detailed Results

llama3.2:3b (Q4_K_M, 2.0 GB)

[✓] Q01: SwiftUI settings view           → coding        (expected: coding)
[✗] Q02: CI/CD pipeline GitHub Actions   → coding        (expected: ops)
[✓] Q03: Microservices communication     → architecture  (expected: architecture)
[✓] Q04: 2-week sprint plan              → planning      (expected: planning)
[✓] Q05: Fix crash on login screen       → coding        (expected: coding)
[✓] Q06: Deploy staging to AWS ECS       → ops           (expected: ops)
[✗] Q07: Review bug reports, prioritize  → ops           (expected: triage)
[✗] Q08: Write unit tests payment module → testing       (expected: coding)
[✓] Q09: REST vs GraphQL evaluation      → architecture  (expected: architecture)
[✓] Q10: Grafana/Prometheus monitoring   → ops           (expected: ops)

Speed: 71.3 tok/s avg | Prompt processing: 777.8 tok/s
Memory: ~2,237 MB RSS

Fastest by far — exceeds 50 tok/s threshold
Struggles with the triage vs ops distinction (Q07), and on Q08 returned `testing`, which is not one of the five categories
Outputs single words cleanly (no thinking overhead)
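Q08 shows that a model can return a label outside the allowed set even when it answers with a single word. A router would probably want to validate the reply before acting on it. The sketch below is one way to do that. `normalize_label` is a hypothetical helper, and returning `None` for unknown labels is a design assumption rather than part of the benchmark.

```python
from typing import Optional

CATEGORIES = ("triage", "coding", "architecture", "planning", "ops")


def normalize_label(raw: str) -> Optional[str]:
    """Return the category if the reply is exactly one known label, else None."""
    # Trim whitespace, then common wrapping punctuation, then lowercase.
    cleaned = raw.strip().strip(".\"'`").lower()
    return cleaned if cleaned in CATEGORIES else None


print(normalize_label(" Ops.\n"))   # a valid label with stray formatting
print(normalize_label("testing"))   # the out-of-set label seen on Q08
```

A caller can treat `None` as a signal to retry, fall back to another model, or route to a default queue.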

qwen3:4b (Q4_K_M, 2.5 GB)

[✓] Q01: SwiftUI settings view           → coding        (expected: coding)
[?] Q02: CI/CD pipeline                  → (thinking...)  (expected: ops)
[✓] Q03: Microservices communication     → architecture  (expected: architecture)
[✓] Q04: 2-week sprint plan              → planning      (expected: planning)
[✓] Q05: Fix crash on login screen       → coding        (expected: coding)
[?] Q06: Deploy staging to AWS ECS       → (thinking...)  (expected: ops)
[✗] Q07: Review bug reports, prioritize  → (thinking...)  (expected: triage)
[✗] Q08: Write unit tests payment module → (thinking...)  (expected: coding)
[✓] Q09: REST vs GraphQL evaluation      → architecture  (expected: architecture)
[?] Q10: Grafana/Prometheus monitoring   → (thinking...)  (expected: ops)

Speed: 29.0 tok/s avg
Memory: ~3,311 MB RSS

Critical flaw: Built-in "thinking" mode generates 200-500 internal reasoning tokens before answering
Most of the 500-token budget consumed by `<think>...</think>` blocks
Actual answer often truncated or empty
Not suitable for fast routing without disabling thinking (which Ollama doesn't fully support)
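If the raw response text contains literal `<think>...</think>` blocks as described above, the final answer can be separated from the reasoning after the fact. This does not recover the wasted tokens or the latency. It only makes truncated answers detectable. The following is a sketch under that assumption, and `extract_answer` is a hypothetical helper.

```python
import re

# Non-greedy match so that each closed block is removed separately.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def extract_answer(raw: str) -> str:
    """Drop closed <think> blocks; treat an unclosed block as 'no answer'."""
    without_closed = THINK_BLOCK.sub("", raw)
    if "<think>" in without_closed:
        # The token budget ran out mid-reasoning, so no answer was produced.
        return ""
    return without_closed.strip()


print(repr(extract_answer("<think>CI/CD is infrastructure work.</think>\n\nops")))
print(repr(extract_answer("<think>Still weighing triage against ops")))
```

An empty result corresponds to the `(thinking...)` rows in the table above, where the budget was spent before a category was emitted.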

gemma3:4b (Q4_K_M, 3.3 GB)

[✓] Q01: SwiftUI settings view           → coding        (expected: coding)
[✓] Q02: CI/CD pipeline GitHub Actions   → ops           (expected: ops)
[✓] Q03: Microservices communication     → architecture  (expected: architecture)
[✓] Q04: 2-week sprint plan              → planning      (expected: planning)
[✗] Q05: Fix crash on login screen       → triage        (expected: coding)
[✓] Q06: Deploy staging to AWS ECS       → ops           (expected: ops)
[✓] Q07: Review bug reports, prioritize  → triage        (expected: triage)
[✓] Q08: Write unit tests payment module → coding        (expected: coding)
[✗] Q09: REST vs GraphQL evaluation      → planning      (expected: architecture)
[✓] Q10: Grafana/Prometheus monitoring   → ops           (expected: ops)

Speed: 44.3 tok/s avg
Memory: ~2,765 MB RSS

Best accuracy (80%) of the three models tested
Clean single-word outputs (3-4 tokens per response)
Speed (44.3 tok/s) is close to but below the 50 tok/s threshold
Misclassified "fix crash" as triage and "REST vs GraphQL" as planning

Recommendation

🏆 Winner: gemma3:4b (conditional)

No model fully meets all three criteria (>50 tok/s, <14GB RAM, >85% accuracy).

Criteria	gemma3:4b	Gap
Speed >50 tok/s	44.3 tok/s	-12%
Memory <14GB	2.8 GB	✅ well within
Accuracy >85%	80%	-5 points

Why gemma3:4b over llama3.2:3b:
- 10 percentage points higher routing accuracy (80% vs 70%)
- Clean, concise outputs (no thinking overhead)
- Handles the triage category correctly
- Speed gap (44 vs 71 tok/s) is less critical for routing (single-call latency <100ms either way)
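The latency claim can be checked with simple arithmetic: generation time is roughly the number of output tokens divided by tokens per second. The sketch below assumes a 4-token reply for both models (the 3-4 token figure is reported for gemma3:4b only) and ignores prompt processing and model load time.

```python
def generation_ms(tokens: int, tok_per_s: float) -> float:
    """Approximate generation time in milliseconds for a reply of `tokens` tokens."""
    return tokens / tok_per_s * 1000.0


# Speeds are the measured averages from this benchmark; 4 tokens is an assumption.
for model, speed in (("gemma3:4b", 44.3), ("llama3.2:3b", 71.3)):
    print(f"{model}: {generation_ms(4, speed):.0f} ms for a 4-token reply")
# gemma3:4b: 90 ms for a 4-token reply
# llama3.2:3b: 56 ms for a 4-token reply
```

Under these assumptions both models stay below 100 ms of generation time, so the difference is about 34 ms per routing call.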

Why not qwen3:4b:
- Thinking mode is a dealbreaker — wastes 10-20x tokens on internal reasoning
- Slowest at 29 tok/s
- Worst accuracy at 60%

### Action Items
1. Deploy gemma3:4b as the routing brain — best accuracy/speed tradeoff
2. Fine-tune prompts to close the 80%→85% accuracy gap
3. Keep llama3.2:3b as fallback for latency-critical paths
4. Consider custom Modelfile with system prompt tuning for routing
5. Re-evaluate if/when Ollama adds proper thinking-mode control for Qwen3

Quick Start

bash

# On Mac4
ollama run gemma3:4b

# API call for routing
curl http://[ip]:11434/api/generate -d '{
  "model": "gemma3:4b",
  "prompt": "Classify this task into exactly one category [triage, coding, architecture, planning, ops]: '\''<TASK>'\''. Reply with ONLY the category name.",
  "stream": false,
  "options": {"temperature": 0, "num_predict": 10}
}'
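For callers that prefer not to shell out to curl, the same request can be made from Python's standard library. This is a minimal sketch rather than the project's client: `build_payload` and `classify` are hypothetical names, the base URL must be supplied by the caller, and the code assumes the non-streaming reply carries the generated text in its `response` field.

```python
import json
import urllib.request

PROMPT_TEMPLATE = (
    "Classify this task into exactly one category "
    "[triage, coding, architecture, planning, ops]: '{task}'. "
    "Reply with ONLY the category name."
)


def build_payload(task: str, model: str = "gemma3:4b") -> dict:
    """Build the same JSON body as the curl example above."""
    return {
        "model": model,
        "prompt": PROMPT_TEMPLATE.format(task=task),
        "stream": False,
        "options": {"temperature": 0, "num_predict": 10},
    }


def classify(task: str, base_url: str) -> str:
    """POST to /api/generate and return the model's reply with whitespace trimmed."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(build_payload(task)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as reply:
        return json.loads(reply.read())["response"].strip()
```

A call would look like `classify("Deploy staging to AWS ECS", base_url)`, where `base_url` is the scheme, host, and port 11434 of the Mac4 Ollama server.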

Source Anchor

Comp-Core/packages/cognitive-twin/MAC4_BENCHMARK.md