Optimal Cognitive Twin Training Configuration

Full HTML reader

Read the full artifact

Extracted abstract or opening context

> Target: Qwen2.5-7B-Instruct-4bit on Mac5 (M4 16GB) via MLX LoRA > Dataset: 2,923 train / 328 valid examples of Mohamed's responses > Goal: Override "helpful assistant" persona with Mohamed's direct, action-oriented voice > Date: 2026-03-23 > Extends: lora-persona-research.md (2026-03-22) The current training runs have two fundamental problems: (1) the 4-bit quantization requires dramatically different hyperparameters than full-precision LoRA, and (2) persona transfer demands comprehensive layer coverage that conflicts with the 16GB memory ceiling. The solution is a specific combination of conservative learning rate (5e-5), gradient checkpointing, LoRA rank 32 on all layers with attention+MLP targeting, batch 1 with grad accumulation of 8, and the `--mask-prompt` flag. For the anticipation geometry integration, the scalars should be embedded into the system prompt as conditioning context at inference time, not injected into the training loop. The knowledge graph integrates via a Parametric-RAG pattern: query at inference, inject retrieved triples into the prompt. The training log shows a clear pattern: - **1e-5**: Converges but too slow to override instruct persona (val loss 1.796, still sounds generic) - **2e-4**: Gradient explosion (loss -> NaN at iter 40) - **5e-5 with mask-prompt**: Running, val loss 2.377 at iter 800 (higher due to mask-prompt, which is expected) 4-bit NormalFloat (NF4) quantization introduces quantization noise into the forward pass. When gradients flow back through quantized weights, the effective gradient magnitudes are noisier and more volatile than with full-precision weights. The QLoRA paper (Dettmers et al., NeurIPS 2023) found this can cause sudden gradient spikes during backpropagation, particularly in attention layers where the Q/K dot product amplifies small weight perturbations quadratically. At LR 2e-4, these amplified gradients exceed the stable training region. The NaN at iter 40 is consistent with a gradient spike accumulating over ~40 updates until the loss landscape escapes the local basin entirely.

Promotion decision

What has to happen next

Attach run IDs, datasets, metrics, and reproduction commands.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.