Grand Diomande Research · Full HTML Reader

Mac4 Model Inference Benchmark — Cognitive Twin Brain

**Date:** 2026-02-18 **Host:** Mac4 — Apple M4, 16GB RAM, macOS 15.6 **Ollama:** v0.16.2 **Task:** Routing classification (5-category: triage, coding, architecture, planning, ops)

Test Methodology

  • 10 routing classification prompts per model
  • Temperature: 0, deterministic outputs
  • Measured: tokens/sec (generation), RSS memory, routing accuracy
  • Ground truth defined for each prompt
  • API: `http://[ip]:11434/api/generate` (non-streaming)
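The speed figures in this report can be derived from the timing fields that Ollama returns with a non-streaming response. The following is a minimal sketch, assuming the benchmark read `eval_count` and `eval_duration` (reported in nanoseconds) from the `/api/generate` reply. The harness itself is not shown in this report, so the helper names and the sample values are illustrative only.

```python
def tokens_per_second(token_count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration into tok/s."""
    if duration_ns <= 0:
        raise ValueError("duration must be positive")
    return token_count / (duration_ns / 1e9)


def summarize(response: dict) -> dict:
    """Extract generation and prompt-processing speed from one reply."""
    return {
        "gen_tok_s": tokens_per_second(
            response["eval_count"], response["eval_duration"]
        ),
        "prompt_tok_s": tokens_per_second(
            response["prompt_eval_count"], response["prompt_eval_duration"]
        ),
    }


# Made-up sample shaped like an /api/generate reply (not measured data).
sample = {
    "response": "ops",
    "eval_count": 4,
    "eval_duration": 80_000_000,          # 0.08 s of generation
    "prompt_eval_count": 60,
    "prompt_eval_duration": 100_000_000,  # 0.10 s of prompt processing
}
stats = summarize(sample)
print(stats)
```

Averaging `gen_tok_s` over the 10 prompts would give a per-model figure comparable to the "tok/s avg" values reported below.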

Results Summary

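| Model | Size | Accuracy | Gen Speed | RSS Memory | Passes Criteria? |
|---|---|---|---|---|---|
| llama3.2:3b | 2.0 GB | 7/10 (70%) | 71.3 tok/s | ~2,237 MB | No (accuracy) |
| qwen3:4b | 2.5 GB | 6/10 (60%) | 29.0 tok/s | ~3,311 MB | No (speed, accuracy) |
| gemma3:4b | 3.3 GB | 8/10 (80%) | 44.3 tok/s | ~2,765 MB | No (speed, accuracy) |
| qwen3:30b-a3b | 18 GB | n/a | n/a | n/a | ❌ too large (18GB > 16GB RAM) |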

Criteria: >50 tok/s, <14GB RAM, >85% accuracy
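As a cross-check, the criteria can be applied mechanically to the speed, memory, and accuracy figures reported in the detailed results. This is a minimal sketch: the `Result` class and `failed_criteria` helper are illustrative names, not part of the benchmark harness, and the MB-to-GB conversion assumes 1 GB = 1024 MB.

```python
from dataclasses import dataclass

# Thresholds taken from the criteria line above.
MIN_TOK_S = 50.0
MAX_RSS_GB = 14.0
MIN_ACCURACY = 0.85


@dataclass
class Result:
    model: str
    tok_s: float   # average generation speed
    rss_mb: float  # resident memory in MB
    correct: int   # prompts routed correctly
    total: int = 10


def failed_criteria(r: Result) -> list[str]:
    """Return the names of the criteria this result does not meet."""
    failures = []
    if not r.tok_s > MIN_TOK_S:
        failures.append("speed")
    if not r.rss_mb / 1024 < MAX_RSS_GB:
        failures.append("memory")
    if not r.correct / r.total > MIN_ACCURACY:
        failures.append("accuracy")
    return failures


# Figures copied from the detailed results in this report.
results = [
    Result("llama3.2:3b", 71.3, 2237, 7),
    Result("qwen3:4b", 29.0, 3311, 6),
    Result("gemma3:4b", 44.3, 2765, 8),
]
for r in results:
    print(r.model, failed_criteria(r) or "passes")
```

Every model fails at least the accuracy criterion, which is consistent with the conditional recommendation later in the report.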

Detailed Results

llama3.2:3b (Q4_K_M, 2.0 GB)

[✓] Q01: SwiftUI settings view           → coding        (expected: coding)
[✗] Q02: CI/CD pipeline GitHub Actions   → coding        (expected: ops)
[✓] Q03: Microservices communication     → architecture  (expected: architecture)
[✓] Q04: 2-week sprint plan              → planning      (expected: planning)
[✓] Q05: Fix crash on login screen       → coding        (expected: coding)
[✓] Q06: Deploy staging to AWS ECS       → ops           (expected: ops)
[✗] Q07: Review bug reports, prioritize  → ops           (expected: triage)
[✗] Q08: Write unit tests payment module → testing       (expected: coding)
[✓] Q09: REST vs GraphQL evaluation      → architecture  (expected: architecture)
[✓] Q10: Grafana/Prometheus monitoring   → ops           (expected: ops)

Speed: 71.3 tok/s avg | Prompt processing: 777.8 tok/s
Memory: ~2,237 MB RSS
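  • Fastest by far at 71.3 tok/s; the only model to exceed the 50 tok/s threshold
  • Misroutes triage as ops (Q07) and CI/CD as coding (Q02), and returns an out-of-set label, `testing`, for Q08
  • Outputs single words cleanly (no thinking overhead)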

qwen3:4b (Q4_K_M, 2.5 GB)

[✓] Q01: SwiftUI settings view           → coding        (expected: coding)
[?] Q02: CI/CD pipeline                  → (thinking...)  (expected: ops)
[✓] Q03: Microservices communication     → architecture  (expected: architecture)
[✓] Q04: 2-week sprint plan              → planning      (expected: planning)
[✓] Q05: Fix crash on login screen       → coding        (expected: coding)
[?] Q06: Deploy staging to AWS ECS       → (thinking...)  (expected: ops)
[✗] Q07: Review bug reports, prioritize  → (thinking...)  (expected: triage)
[✗] Q08: Write unit tests payment module → (thinking...)  (expected: coding)
[✓] Q09: REST vs GraphQL evaluation      → architecture  (expected: architecture)
[?] Q10: Grafana/Prometheus monitoring   → (thinking...)  (expected: ops)

Speed: 29.0 tok/s avg
Memory: ~3,311 MB RSS
  • Critical flaw: Built-in "thinking" mode generates 200-500 internal reasoning tokens before answering
  • Most of the 500-token budget consumed by `<think>...</think>` blocks
  • Actual answer often truncated or empty
  • Not suitable for fast routing without disabling thinking (which Ollama doesn't fully support)
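One way to score replies that may contain reasoning blocks is to strip them before matching against the five categories. The sketch below is an assumption about how such post-processing could look, not the scoring code used for this benchmark. It treats an unclosed `<think>` tag as a truncated answer and rejects any label outside the allowed set.

```python
import re
from typing import Optional

CATEGORIES = {"triage", "coding", "architecture", "planning", "ops"}
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def extract_category(raw: str) -> Optional[str]:
    """Return the routing label, or None if no valid label was produced."""
    text = THINK_BLOCK.sub("", raw)
    # A leftover opening tag means the token budget ran out mid-reasoning.
    if "<think>" in text:
        return None
    word = text.strip().strip(".").lower()
    return word if word in CATEGORIES else None


print(extract_category("<think>CI/CD is deployment work.</think>\nops"))
print(extract_category("<think>The task mentions a pipeline, so"))
print(extract_category("testing"))
```

Returning `None` for both truncated and out-of-set replies lets the caller count them as misses or retry with a fallback model.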

gemma3:4b (Q4_K_M, 3.3 GB)

[✓] Q01: SwiftUI settings view           → coding        (expected: coding)
[✓] Q02: CI/CD pipeline GitHub Actions   → ops           (expected: ops)
[✓] Q03: Microservices communication     → architecture  (expected: architecture)
[✓] Q04: 2-week sprint plan              → planning      (expected: planning)
[✗] Q05: Fix crash on login screen       → triage        (expected: coding)
[✓] Q06: Deploy staging to AWS ECS       → ops           (expected: ops)
[✓] Q07: Review bug reports, prioritize  → triage        (expected: triage)
[✓] Q08: Write unit tests payment module → coding        (expected: coding)
[✗] Q09: REST vs GraphQL evaluation      → planning      (expected: architecture)
[✓] Q10: Grafana/Prometheus monitoring   → ops           (expected: ops)

Speed: 44.3 tok/s avg
Memory: ~2,765 MB RSS
  • Best accuracy (80%)
  • Clean single-word outputs (3-4 tokens per response)
  • Speed (44.3 tok/s) is close to but below the 50 tok/s threshold
  • Misclassified "fix crash" (Q05) as triage and "REST vs GraphQL" (Q09) as planning

Recommendation

🏆 Winner: gemma3:4b (conditional)

No model fully meets all three criteria (>50 tok/s, <14GB RAM, >85% accuracy).

| Criteria | gemma3:4b | Gap |
|---|---|---|
| Speed >50 tok/s | 44.3 tok/s | -5.7 tok/s (about -11%) |
| Memory <14GB | 2.8 GB | ✅ well within |
| Accuracy >85% | 80% | -5 points |

Why gemma3:4b over llama3.2:3b:
- 10 percentage points higher accuracy (80% vs 70%)
- Clean, concise outputs (no thinking overhead)
- Handles the triage category correctly
- Speed gap (44 vs 71 tok/s) is less critical for routing (single-call latency <100ms either way)
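The latency claim in the last bullet can be checked with simple arithmetic. The sketch below assumes a 4-token reply for both models (the report states 3-4 tokens for gemma3:4b; the figure for llama3.2:3b is an assumption) and covers generation time only, excluding prompt processing and HTTP overhead.

```python
def generation_latency_ms(tokens: int, tok_s: float) -> float:
    """Time to generate `tokens` tokens at `tok_s` tokens per second."""
    return tokens / tok_s * 1000


# Average generation speeds reported in the detailed results.
speeds = {"gemma3:4b": 44.3, "llama3.2:3b": 71.3}
latencies = {m: generation_latency_ms(4, s) for m, s in speeds.items()}
for model, ms in latencies.items():
    print(f"{model}: {ms:.1f} ms")
```

Under these assumptions both models finish generating in under 100 ms, roughly 90 ms for gemma3:4b and 56 ms for llama3.2:3b.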

Why not qwen3:4b:
- Thinking mode is a dealbreaker — wastes 10-20x tokens on internal reasoning
- Slowest at 29 tok/s
- Worst accuracy at 60%

### Action Items
1. Deploy gemma3:4b as the routing brain — best accuracy/speed tradeoff
2. Fine-tune prompts to close the 80% → 85% accuracy gap
3. Keep llama3.2:3b as fallback for latency-critical paths
4. Consider custom Modelfile with system prompt tuning for routing
5. Re-evaluate if/when Ollama adds proper thinking-mode control for Qwen3

Quick Start

```bash
# On Mac4
ollama run gemma3:4b

# API call for routing
curl http://[ip]:11434/api/generate -d '{
  "model": "gemma3:4b",
  "prompt": "Classify this task into exactly one category [triage, coding, architecture, planning, ops]: '\''<TASK>'\''. Reply with ONLY the category name.",
  "stream": false,
  "options": {"temperature": 0, "num_predict": 10}
}'
```
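The same request can be issued from a script. The following is a sketch using only the Python standard library; `build_payload` and `classify` are hypothetical helper names, and the base URL must be supplied by the caller because the host address is not given in this report.

```python
import json
import urllib.request

CATEGORIES = ["triage", "coding", "architecture", "planning", "ops"]


def build_payload(task: str, model: str = "gemma3:4b") -> dict:
    """Build the same JSON body as the curl example above."""
    prompt = (
        "Classify this task into exactly one category "
        f"[{', '.join(CATEGORIES)}]: '{task}'. "
        "Reply with ONLY the category name."
    )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0, "num_predict": 10},
    }


def classify(task: str, base_url: str) -> str:
    """POST to <base_url>/api/generate and return the raw model reply."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(build_payload(task)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["response"].strip()


# Example call (requires a reachable Ollama server):
# classify("Deploy staging to AWS ECS", "http://<host>:11434")
```

Keeping payload construction separate from the network call makes the prompt template easy to test and tune without a running server.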

Source Anchor

Comp-Core/packages/cognitive-twin/MAC4_BENCHMARK.md
