Grand Diomande Research · Full HTML Reader

# KARL Integration — Evolution³ / Stage 1: PATH D
Run: karl-trajectory-intelligence
Path: D — Skill Embeddings: Learned Vector Space for Routing
Generated: 2026-03-10
Status: Stage 1 complete
Run Directory: Desktop/evo-cube-output/karl-trajectory-intelligence/

---

Path D Concept Summary

Abandon regex-based skill routing entirely. Embed every skill and every incoming prompt into a shared vector space. When a new prompt arrives, find the nearest skill by trajectory-weighted similarity — not raw text overlap. The "learning" is not RL on model weights (that is KARL's full treatment). It is RL on the routing layer itself. Skills stay as SKILL.md markdown. The only thing that changes is which skill gets injected, and that decision is made by a vector space whose distances are continuously updated by observed trajectory success rates.

This is "KARL for routing" rather than "KARL for reasoning." It replaces the regex `\b(error|bug|crash)\b` in `ops_trigger.py` with a learned function that has been shaped by 324 invocation records and counting.

---

Current State Baseline

Before designing the replacement, establish exactly what we are replacing and why it fails.

What Exists Now

`[home-path]` (233 lines):

Load registry (mtime-cached)
    --> compile trigger regex per active forged skill
    --> match prompt.search(pattern) for each skill
    --> first match wins
    --> load SKILL.md, strip frontmatter, inject via stdout
    --> write invocation_record to entries.jsonl

Skill trigger patterns (from `ops_trigger.py` / `generator.py`):

Skill	Trigger Pattern
ops:debug	`\b(error
ops:deploy	`\b(deploy
ops:ios	`\b(bootstrap
ops:git	`\b(commit
ops:supabase	`\b(migration
ops:prefect	`\b(prefect
ops:monitoring	`\b(grafana
ops:mesh	`\b(mac[1-5]
ops:docker	`\b(docker
ops:asc	`\b(app store

Why Regex Fails

Problem 1: Vocabulary gaps. "The Nexus portal is throwing a 502" should trigger `ops:deploy` or `ops:monitoring`. Neither pattern matches `502` or `portal`. The regex fails silently — no skill injected, user gets no operational context.

Problem 2: First-match wins with no ranking. ops:debug fires on "fix the deploy script" because `fix` matches its pattern, even though `ops:deploy` is the semantically correct skill for that prompt. The system has no way to distinguish between a weak match and a strong match.

Problem 3: No outcome signal. ops:ios is invoked 71 times in entries.jsonl. Of those, roughly how many resulted in a successful archive, a failed xcodebuild, or a correction immediately after? The invocation record writes `trigger_prompt: "deploy the spore app"` and then forgets about what happened. There is zero feedback loop between "skill was injected" and "skill helped."

Problem 4: Compound tasks are undefined. "Fix the Prefect deploy for Spore's Supabase migration" touches ops:debug, ops:prefect, ops:supabase simultaneously. The regex fires on the first match and stops. Multi-skill composition is not possible.

Problem 5: Cold prompts are never routed. Creative or hybrid prompts that use no operational keywords are never routed to a skill, even when they pattern-match to a known successful workflow from session history.
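
Problems 1 and 2 can be reproduced in a few lines. The sketch below is a stand-in for the first-match loop, not the real `ops_trigger.py`: `HYPOTHETICAL_PATTERNS` and `route_first_match` are illustrative names, and the patterns are shortened assumptions built from the keywords quoted above.

```python
import re

# Hypothetical, shortened patterns for illustration; the real ones live in ops_trigger.py.
HYPOTHETICAL_PATTERNS = {
    "ops:debug": re.compile(r"\b(error|bug|crash|fix)\b", re.IGNORECASE),
    "ops:deploy": re.compile(r"\b(deploy)\b", re.IGNORECASE),
}


def route_first_match(prompt: str):
    """Return the first skill whose pattern matches, or None (dict order = priority)."""
    for skill, pattern in HYPOTHETICAL_PATTERNS.items():
        if pattern.search(prompt):
            return skill
    return None


# Problem 1: no keyword overlap, so nothing is injected.
print(route_first_match("The Nexus portal is throwing a 502"))
# Problem 2: "fix" matches ops:debug before ops:deploy is ever considered.
print(route_first_match("fix the deploy script"))
```

Both failures come from the same property: a regex match is binary, so there is no score to rank on and no near-miss to recover from.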

---

Section 1: Embedding Architecture

The Shared Embedding Space

The core idea: embed skills and prompts into the same vector space so that similarity in the space means "this skill is helpful for this prompt." Two design choices determine the quality of this space.

Choice A: Embedding model. We need an embedding model that (a) runs in <100ms on Mac1 (hook budget is 500ms total, need headroom), (b) captures operational semantics (distinguishes "deploy flow" from "debug flow"), and (c) is available without additional inference infrastructure.

Options:

Option	Latency	Quality	Available
`text-embedding-3-small` via OpenAI API	80-150ms (network)	High	Yes (API key exists)
Gemini `text-embedding-004`	80-150ms (network)	High	Yes (GOOGLE_API_KEY)
`nomic-embed-text` via Ollama on Mac4	20-40ms (Tailscale)	Good	Mac4 :11434
`all-MiniLM-L6-v2` via sentence-transformers	5-15ms (local)	Medium	Needs install
RAG++ gateway `/api/rag/embed`	~50ms (SSH tunnel :8000)	High (Gemini)	Already exists

Recommended: RAG++ gateway. The infrastructure already exists. RAG++ uses Gemini `text-embedding-004` and has a pgvector persistence layer. Routing queries through the existing gateway means skill embeddings land in the same space as the turn-level semantic search, enabling cross-system retrieval. The RAG++ container runs on cloud-vm with SSH tunnel to Mac1 :8000.

Choice B: What to embed. This determines what the space means.

Each skill should be embedded not just from its name but from its operational semantics: the combination of its intent statement, its workflow steps, its gotchas, and historically successful trigger prompts. This produces a richer vector than embedding "ops:deploy — Deploy services to cloud-vm" alone.

python

def build_skill_embedding_text(
    skill_name: str,
    skill_md: str,
    historical: Optional[list[str]] = None,
) -> str:
    """Construct embedding input for a skill: captures its semantic surface area."""
    # 1. Strip frontmatter, extract intent + workflow + gotchas
    body = strip_frontmatter(skill_md)
    intent = extract_section(body, "## Intent")
    workflow = extract_section(body, "## Workflow")
    gotchas = extract_section(body, "## Gotchas")

    # 2. Use caller-supplied prompts (the bootstrap passes them in); otherwise
    #    pull the top-5 historical trigger prompts from invocation_records
    if historical is None:
        historical = get_top_prompts_for_skill(skill_name, limit=5)

    # 3. Label each part and drop the ones that came back empty
    parts = [
        ("Skill", skill_name),
        ("Intent", intent[:200]),
        ("Workflow", workflow[:300]),
        ("Gotchas summary", gotchas[:200]),
        ("Used for prompts like", "; ".join(historical)),
    ]
    return "\n".join(f"{label}: {text}" for label, text in parts if text)

For prompts, embed the raw text, potentially augmented with CWD context (the project being worked on often disambiguates ambiguous prompts — "fix the error" means something different in `/Desktop/Spore/` vs `/flows/feed-hub/`).

python

def build_prompt_embedding_text(prompt: str, cwd: str = "") -> str:
    project = Path(cwd).name if cwd else ""
    if project:
        return f"[project:{project}] {prompt}"
    return prompt

Vector Dimensions and Storage

RAG++ uses Gemini `text-embedding-004` which produces 768-dimensional vectors. All skill embeddings live in a single pgvector table on cloud-vm. The table structure:

sql

-- In Supabase (reuses existing pgvector schema)
CREATE TABLE skill_embeddings (
    skill_name      TEXT PRIMARY KEY,
    embedding       vector(768),
    embedding_text  TEXT,           -- the text that was embedded
    updated_at      TIMESTAMPTZ DEFAULT NOW(),
    version         INT DEFAULT 1,  -- bump when skill content changes
    trajectory_weight FLOAT DEFAULT 1.0  -- modified by outcome learning
);

CREATE INDEX ON skill_embeddings
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 1);  -- ~10 rows: a single list keeps recall exact; a sequential scan would also do

This is a minimal footprint: one new table, reuses existing pgvector infrastructure. No new Docker services.

Skill Embedding Pre-computation

Skills are not re-embedded on every hook call. They are pre-computed once and cached. A background task re-embeds on skill content change.

python

# [home-path]

CACHE_PATH = Path.home() / ".claude" / "cortex" / "skill_embeddings.pkl"
CACHE_TTL_SECONDS = 3600  # 1 hour

def load_skill_embeddings() -> dict[str, np.ndarray]:
    """Load cached skill embeddings. Recompute if stale."""
    if CACHE_PATH.exists():
        age = time.time() - CACHE_PATH.stat().st_mtime
        if age < CACHE_TTL_SECONDS:
            with open(CACHE_PATH, "rb") as f:
                return pickle.load(f)

    # Cache miss — recompute from pgvector
    embeddings = fetch_embeddings_from_pgvector()
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(embeddings, f)
    return embeddings

In-process: a Python dict `{skill_name: np.ndarray}` holding 10 skills × 768 dims. That is about 30KB of vector data as float32 (about 60KB as float64), which fits easily in the hook process memory. No network call on cache hit.

---

Section 2: Trajectory-Weighted Similarity

This is the core innovation over naive cosine similarity. Raw cosine similarity asks: "which skill text is most similar to this prompt text?" Trajectory-weighted similarity asks: "which skill has historically worked for prompts most similar to this one?"

The Weight Modification Formula

Each skill has a trajectory weight `w_s` that modifies its effective distance to any query prompt. The modified similarity is:

sim_weighted(prompt, skill) = cosine_sim(embed(prompt), embed(skill)) * w_s

Where `w_s` starts at 1.0 and is updated by observed outcome signals. Skills with high success rates for similar prompts get `w_s > 1.0` (pulled toward prompts). Skills with high failure/correction rates get `w_s < 1.0` (pushed away).
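
To make the effect of `w_s` concrete, here is a minimal sketch of the re-ranking with toy 2-D vectors. The vectors and weights are made up for illustration, and `cosine_sim` and `rank` are hypothetical helper names; only the formula itself comes from this design.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(prompt_vec, skills):
    """skills: {name: (vector, w_s)} -> [(name, weighted_sim)], best first."""
    scored = [
        (name, cosine_sim(prompt_vec, vec) * w_s)
        for name, (vec, w_s) in skills.items()
    ]
    return sorted(scored, key=lambda item: -item[1])


# Made-up example: ops:debug is textually closer, ops:deploy has the better track record.
prompt = [1.0, 0.0]
skills = {
    "ops:debug": ([0.8, 0.6], 0.7),   # raw cosine 0.8, weight 0.7
    "ops:deploy": ([0.6, 0.8], 1.3),  # raw cosine 0.6, weight 1.3
}
ranking = rank(prompt, skills)
print(ranking[0][0])
```

Here the weight flips the order: 0.6 × 1.3 beats 0.8 × 0.7. Note that the weighted value is no longer a true cosine, since it can exceed 1.0 whenever `w_s > 1.0`.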

The weight update rule (an exponential moving average to prevent catastrophic forgetting):

python

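ALPHA = 0.1   # learning rate: slow, so noise cannot dominate
# Decay is implicit in the EMA: each update keeps (1 - ALPHA) = 0.9 of the old
# weight, so older signals fade geometrically without a separate constant.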

def update_weight(current_weight: float, outcome: float, alpha: float = ALPHA) -> float:
    """
    outcome: +1.0 = success, 0.0 = neutral, -1.0 = failure/correction
    Maps outcome to [0.5, 1.5] scaling range.
    """
    target = 1.0 + (outcome * 0.5)  # outcome=+1 -> target=1.5, outcome=-1 -> target=0.5
    return current_weight * (1 - alpha) + target * alpha

This keeps weights bounded: they converge toward 1.5 for consistently helpful skills and toward 0.5 for consistently unhelpful ones. As long as a weight starts inside [0.5, 1.5], it never reaches 0 (no skill is completely suppressed) and never exceeds 1.5 (no skill dominates unconditionally).
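
The convergence speed is easy to check by simulation. The sketch below repeats the update rule and applies 50 identical outcomes in a row; `simulate` is a hypothetical helper, and the streaks are synthetic rather than taken from entries.jsonl.

```python
def update_weight(current_weight, outcome, alpha=0.1):
    """Same EMA rule as above: move a fraction alpha toward 1.0 + 0.5 * outcome."""
    target = 1.0 + outcome * 0.5
    return current_weight * (1 - alpha) + target * alpha


def simulate(outcomes, start=1.0):
    """Apply a sequence of outcomes and return the weight after each step."""
    weights, w = [], start
    for outcome in outcomes:
        w = update_weight(w, outcome)
        weights.append(w)
    return weights


successes = simulate([1.0] * 50)   # synthetic streak of successes
failures = simulate([-1.0] * 50)   # synthetic streak of corrections

for n in (1, 10, 50):
    print(n, round(successes[n - 1], 4), round(failures[n - 1], 4))
```

With a constant outcome, the gap to the target shrinks by a factor of 0.9 per update, so after n successes the weight is 1.5 − 0.5 · 0.9ⁿ. It takes roughly seven updates to close half the gap, and about 1.33 is reached after ten consecutive successes.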

Outcome Signal Sources

The key question: how do we know if a skill injection "worked"? We have three signal sources of differing reliability:

Signal 1: Correction detection (strong proxy, negative signal only). The correction_detector already fires on Stop events. If a correction is detected within 3 prompts after a skill injection, score = -1.0. A correction is strong evidence the skill did not help (or actively misled). Corrections are already recorded in entries.jsonl; we only need to link each one to its preceding invocation record.

python

def check_post_injection_correction(
    invocation_ts: str,
    session_id: str,
    window_prompts: int = 3,
) -> bool:
    """Returns True if a correction was detected after this skill injection."""
    entries = load_entries(entry_type="correction")
    inv_dt = datetime.fromisoformat(invocation_ts)

    for correction in entries:
        if correction.session_id != session_id:
            continue
        corr_dt = datetime.fromisoformat(correction.ts)
        if corr_dt > inv_dt:
            # Count prompts between injection and correction
            prompts_between = count_prompts_in_window(session_id, inv_dt, corr_dt)
            if prompts_between <= window_prompts:
                return True
    return False

Signal 2: Task completion indicators (medium reliability, positive signal). Certain Stop event patterns indicate success: the user's next prompt is a trivial continuation ("commit this", "push it", "ship it", "looks good"). These are already filtered out by the extractor.py SKIP_PATTERNS — but for reward purposes, they are exactly what we want. A session that ends with "commit this" after ops:git injection is a positive trajectory.

python

SUCCESS_INDICATORS = [
    r"^commit this",
    r"^push it",
    r"^ship it",
    r"^looks good",
    r"^perfect",
    r"^nice",
    r"^done",
    r"^deploy it",
]

def check_post_injection_success(
    invocation_ts: str,
    session_id: str,
    window_prompts: int = 5,
) -> bool:
    """Returns True if a success-indicator prompt followed this injection."""
    next_prompts = get_prompts_after(session_id, invocation_ts, limit=window_prompts)
    for p in next_prompts:
        for pattern in SUCCESS_INDICATORS:
            if re.match(pattern, p["text"], re.IGNORECASE):
                return True
    return False

Signal 3: Build/deploy exit codes (strongest signal, narrow domain). For ops:ios, ops:deploy, ops:docker — Bash tool exit codes are already logged in the bash_audit.jsonl and post_tool_hook.py. A zero exit code on `xcodebuild archive` or `docker compose up -d` after an ops:ios / ops:deploy injection is a high-confidence success signal.

python

def check_bash_exit_codes(
    invocation_ts: str,
    session_id: str,
    skill_name: str,
) -> Optional[float]:
    """Returns 1.0 if a relevant command succeeded, -0.5 if it failed, None if not applicable."""
    SKILL_COMMANDS = {
        "ops:ios": ["xcodebuild", "xcode-select"],
        "ops:deploy": ["systemctl", "docker compose"],
        "ops:docker": ["docker"],
        "ops:prefect": ["prefect"],
        "ops:git": ["git commit", "git push"],
    }
    relevant_cmds = SKILL_COMMANDS.get(skill_name, [])
    if not relevant_cmds:
        return None  # not applicable for this skill

    bash_records = load_bash_records_after(session_id, invocation_ts)
    for record in bash_records:
        cmd = record.get("command", "")
        if any(c in cmd for c in relevant_cmds):
            exit_code = record.get("exit_code", -1)
            return 1.0 if exit_code == 0 else -0.5
    return None

Combining Signals Into a Trajectory Score

python

def compute_trajectory_score(
    invocation: CortexEntry,
    all_entries: list[CortexEntry],
) -> float:
    """
    Returns a score in [-1, +1] for this invocation.
    Priority: bash exit codes > correction detection > success indicators > neutral.
    """
    # Try bash exit codes first (highest signal quality)
    exit_score = check_bash_exit_codes(
        invocation.ts, invocation.session_id, invocation.skill
    )
    if exit_score is not None:
        return exit_score

    # Check for corrections (strong negative signal)
    if check_post_injection_correction(invocation.ts, invocation.session_id):
        return -1.0

    # Check for success indicators (moderate positive signal)
    if check_post_injection_success(invocation.ts, invocation.session_id):
        return 0.5   # not +1.0 — success indicators are noisy

    # No signal available — treat as weakly positive (injection didn't cause harm)
    return 0.1
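
The priority order has one consequence worth seeing in isolation: an exit code, when present, overrides everything else. The sketch below is a pure-function version of the same cascade, assuming the three checks have already run; `combine_signals` is a hypothetical name used only here.

```python
from typing import Optional


def combine_signals(exit_score: Optional[float], corrected: bool, succeeded: bool) -> float:
    """Mirror the cascade: exit codes, then corrections, then success phrases."""
    if exit_score is not None:
        return exit_score
    if corrected:
        return -1.0
    if succeeded:
        return 0.5
    return 0.1  # no signal: weakly positive, as in compute_trajectory_score


# A successful build followed by a correction still scores +1.0.
build_then_correction = combine_signals(1.0, corrected=True, succeeded=False)
# A correction outranks a later "looks good".
correction_then_praise = combine_signals(None, corrected=True, succeeded=True)
# Nothing observed at all.
no_signal = combine_signals(None, corrected=False, succeeded=False)
print(build_then_correction, correction_then_praise, no_signal)
```

Whether a green build should mask a later correction is a design choice; if it turns out to be too generous, the first branch is the place to blend the two signals instead of returning early.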

---

Section 3: Online Learning

The embedding space must update as new trajectories arrive. The challenge: we cannot re-embed skills after every prompt. Instead, we update the trajectory weights (cheap floating-point operations) and re-embed skills only when their content changes or on a scheduled recomputation.

Two-Tier Learning Pipeline

Tier 1: Weight updates (real-time, every Stop event).

A lightweight Stop hook aggregates recent invocation records, computes trajectory scores, and updates `trajectory_weight` in the skill_embeddings table. This runs in the Stop hook's existing budget (correction_detector already fires here).

python

# [home-path]
# Fires on Stop event, budget: 200ms

def update_weights_from_session(session_id: str) -> None:
    """
    For all invocations in this session, compute outcome scores
    and apply EMA weight updates to skill_embeddings.
    """
    invocations = [
        e for e in load_entries(entry_type="invocation_record")
        if e.session_id == session_id
    ]
    if not invocations:
        return

    all_entries = load_entries()
    for inv in invocations:
        score = compute_trajectory_score(inv, all_entries)
        if abs(score) < 0.05:
            continue  # skip neutral, save write

        current_weight = fetch_weight(inv.skill)  # from local pkl cache
        new_weight = update_weight(current_weight, score)
        persist_weight(inv.skill, new_weight)  # update pkl + Supabase async

Tier 2: Embedding recomputation (scheduled, Prefect daily).

Once a day, a Prefect flow re-embeds all skills whose content has changed (mtime comparison) or whose trajectory weight has drifted from 1.0 by 0.15 or more. The recomputation calls `build_skill_embedding_text()` again, so the new embedding incorporates the top historical prompts accumulated since the last one.

python

# flows/feed-hub/skill_embedding_refresh.py
@flow(name="skill-embedding-refresh")
def refresh_skill_embeddings(force_all: bool = False) -> dict:
    """
    Daily: re-embed skills with changed content or significant weight drift.
    """
    registry = load_registry()
    stats = {"reembedded": 0, "skipped": 0}

    for skill_name, info in registry.get("skills", {}).items():
        if info.get("status") != "active":
            continue

        skill_md_path = SKILLS_DIR / skill_name / "SKILL.md"
        if not skill_md_path.exists():
            continue

        if not force_all:
            # Check if recomputation is needed
            current_emb = fetch_embedding_record(skill_name)
            if current_emb:
                skill_mtime = skill_md_path.stat().st_mtime
                emb_mtime = current_emb["updated_at"].timestamp()
                weight_drift = abs(current_emb["trajectory_weight"] - 1.0)
                if skill_mtime < emb_mtime and weight_drift < 0.15:
                    stats["skipped"] += 1
                    continue

        skill_text = build_skill_embedding_text(skill_name, skill_md_path.read_text())
        embedding = embed_via_ragpp(skill_text)  # POST to :8000
        upsert_skill_embedding(skill_name, embedding, skill_text)
        stats["reembedded"] += 1

    return stats
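
The skip condition inside the flow is the part most likely to be gotten backwards, so here it is as a standalone predicate. This is a sketch: `needs_reembed` is a hypothetical helper, and it takes plain timestamps instead of the database record used above.

```python
def needs_reembed(
    skill_mtime: float,
    emb_updated_at: float,
    trajectory_weight: float,
    drift_threshold: float = 0.15,
) -> bool:
    """True when SKILL.md changed after the last embedding or the weight drifted from 1.0."""
    content_changed = skill_mtime >= emb_updated_at
    drifted = abs(trajectory_weight - 1.0) >= drift_threshold
    return content_changed or drifted


# Unchanged file, weight near 1.0: skip.
print(needs_reembed(skill_mtime=100.0, emb_updated_at=200.0, trajectory_weight=1.05))
# Unchanged file, weight drifted to 1.3: re-embed.
print(needs_reembed(skill_mtime=100.0, emb_updated_at=200.0, trajectory_weight=1.3))
# File edited after the last embedding: re-embed.
print(needs_reembed(skill_mtime=300.0, emb_updated_at=200.0, trajectory_weight=1.0))
```

This is the logical negation of the flow's skip test (`skill_mtime < emb_mtime and weight_drift < 0.15`). One thing it makes visible: a skill whose weight settles far from 1.0 will be re-embedded every day until the drift test is made relative to the weight at the last embedding.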

The Learning Loop Visualized

[User prompt arrives]
        |
        v
[ops_trigger_v2.py] -- embed prompt (RAG++ :8000, ~80ms)
        |
        v
[load skill_embeddings.pkl] -- in-memory, <1ms
        |
        v
[compute weighted similarities] -- 10 skills x 768 dims, numpy, <1ms
        |
        v
[inject top-k skill(s)] -- SKILL.md content injection
        |
        v
[session continues... tools called... bash exits... corrections...]
        |
        v
[Stop event fires]
        |
        v
[weight_updater.py] -- score trajectory, EMA update weights, persist
        |
        v
[skill_embeddings.pkl updated] -- next prompt uses updated space
        |
        v (daily)
[skill_embedding_refresh Prefect flow] -- re-embed with historical prompts

Each cycle is intended to make the routing space more accurate. After 100 sessions, top-k retrieval will have been shaped by real outcomes rather than by a programmer's regex intuition.

---

Section 4: Replacement of ops_trigger.py

New File: ops_trigger_v2.py

The new router replaces the regex matching loop with a vector similarity lookup. The 500ms SIGALRM budget is preserved — the critical path is the embedding API call (~80ms) plus cache load (<1ms) plus numpy dot product (<1ms).

python

#!/usr/bin/env python3
"""Ops-Trigger V2 — learned vector routing for skill injection.

Replaces regex-based skill matching with trajectory-weighted cosine similarity.
Embedding model: Gemini text-embedding-004 via RAG++ gateway (:8000).
Weights stored in: [home-path]
Full design: Desktop/evo-cube-output/karl-trajectory-intelligence/stage1-path-d.md

Performance budget: <500ms (80ms embed + <5ms retrieval + <50ms load). SIGALRM hard cap.
"""

from __future__ import annotations

import json
import os
import pickle
import signal
import sys
import time
from pathlib import Path
from typing import Optional

import numpy as np

if __name__ == "__main__" or "pytest" not in sys.modules:
    signal.signal(signal.SIGALRM, lambda *_: sys.exit(0))
    signal.setitimer(signal.ITIMER_REAL, 0.5)

SKILLS_DIR = Path.home() / ".claude" / "skills"
CACHE_PATH = Path.home() / ".claude" / "cortex" / "skill_embeddings.pkl"
ENTRIES_FILE = Path.home() / ".claude" / "cortex" / "entries.jsonl"
RAG_EMBED_URL = "http://localhost:8000/api/rag/embed"  # SSH tunnel :8000

# Similarity thresholds
MIN_SIMILARITY = 0.60       # below this: no injection
MULTI_SKILL_THRESHOLD = 0.80  # above this for multiple skills: compose
TRAJECTORY_WEIGHT_FLOOR = 0.3  # safety floor to prevent total suppression


def _embed_prompt(text: str, cwd: str = "") -> Optional[np.ndarray]:
    """Embed a prompt text via RAG++ gateway. Returns None on failure."""
    import urllib.request
    payload = json.dumps({
        "text": f"[project:{Path(cwd).name}] {text}" if cwd else text
    }).encode()
    try:
        req = urllib.request.Request(
            RAG_EMBED_URL,
            data=payload,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=0.35) as resp:
            data = json.loads(resp.read())
            return np.array(data["embedding"], dtype=np.float32)
    except Exception:
        return None


def _load_embeddings() -> dict[str, dict]:
    """Load skill embeddings from local pickle cache."""
    if not CACHE_PATH.exists():
        return {}
    try:
        with open(CACHE_PATH, "rb") as f:
            return pickle.load(f)
    except Exception:
        return {}


def _cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors (normalizes both; 0.0 if either is zero)."""
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return float(np.dot(a, b) / (norm_a * norm_b))


def _rank_skills(
    prompt_emb: np.ndarray,
    skill_embeddings: dict[str, dict],
) -> list[tuple[str, float]]:
    """
    Rank skills by trajectory-weighted cosine similarity.
    Returns list of (skill_name, weighted_score) sorted descending.
    """
    ranked = []
    for skill_name, record in skill_embeddings.items():
        emb = record.get("embedding")
        if emb is None:
            continue
        raw_sim = _cosine_sim(prompt_emb, np.array(emb, dtype=np.float32))
        weight = max(record.get("trajectory_weight", 1.0), TRAJECTORY_WEIGHT_FLOOR)
        weighted = raw_sim * weight
        ranked.append((skill_name, weighted))

    return sorted(ranked, key=lambda x: -x[1])


def _check_claims(skill_name: str) -> bool:
    """Check pane domain claims — unchanged from v1."""
    claims_path = Path.home() / ".claude" / "state" / "pane_claims.json"
    if not claims_path.exists():
        return True
    try:
        with open(claims_path) as f:
            claims = json.load(f)
    except Exception:
        return True
    my_tty = os.environ.get("TTY", "")
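    # Absolute import: a relative import fails when this file runs as a script.
    # Assumes ops_trigger.py sits in the same directory as this file.
    from ops_trigger import _load_registry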
    registry = _load_registry()
    domains = registry.get("skills", {}).get(skill_name, {}).get("domains", [])
    domain = domains[0] if domains else None
    if not domain:
        return True
    for claim in claims.get("active", []):
        if claim.get("domain") == domain and claim.get("tty") != my_tty:
            return False
    return True


def _load_skill_content(skill_name: str) -> Optional[str]:
    """Load SKILL.md body — unchanged from v1."""
    skill_path = SKILLS_DIR / skill_name / "SKILL.md"
    if not skill_path.exists():
        return None
    content = skill_path.read_text()
    if content.startswith("---"):
        end = content.find("---", 3)
        if end > 0:
            content = content[end + 3:].strip()
    return content


def _write_invocation(
    skill_name: str,
    prompt: str,
    similarity: float,
    composed_with: Optional[list[str]] = None,
) -> None:
    """Write invocation record with similarity metadata."""
    import uuid
    from datetime import datetime, timezone
    entry = {
        "id": uuid.uuid4().hex[:8],
        "type": "invocation_record",
        "ts": datetime.now(timezone.utc).isoformat(),
        "machine": "mac1",
        "pane": os.environ.get("TTY", ""),
        "skill": skill_name,
        "trigger_prompt": prompt[:100],
        "routing_method": "embedding_v2",
        "similarity_score": round(similarity, 4),
        "composed_with": composed_with or [],
    }
    ENTRIES_FILE.parent.mkdir(parents=True, exist_ok=True)
    with open(ENTRIES_FILE, "a") as f:
        f.write(json.dumps(entry, separators=(",", ":")) + "\n")


def main():
    raw = sys.stdin.read()
    try:
        hook_input = json.loads(raw)
    except (json.JSONDecodeError, EOFError):
        sys.exit(0)

    prompt = hook_input.get("prompt", "")
    cwd = hook_input.get("cwd", "")
    if not prompt or len(prompt) < 5:
        sys.exit(0)

    # Step 1: Embed the prompt
    prompt_emb = _embed_prompt(prompt, cwd)
    if prompt_emb is None:
        # Embedding failed (timeout/network): fall back to legacy regex.
        # Absolute import, because a relative import fails when run as a script.
        # Assumption: v1 main() reads the hook payload from stdin, which this
        # function has already consumed, so replay it before calling v1.
        import io
        from ops_trigger import main as legacy_main
        sys.stdin = io.StringIO(raw)
        legacy_main()
        return

    # Step 2: Load skill embeddings from cache
    skill_embeddings = _load_embeddings()
    if not skill_embeddings:
        sys.exit(0)  # cold start — no embeddings yet

    # Step 3: Rank by weighted similarity
    ranked = _rank_skills(prompt_emb, skill_embeddings)
    if not ranked or ranked[0][1] < MIN_SIMILARITY:
        sys.exit(0)  # no skill is relevant enough

    # Step 4: Determine injection strategy
    top_skill, top_score = ranked[0]
    if not _check_claims(top_skill):
        sys.exit(0)

    # Check for multi-skill composition
    composed = []
    for skill_name, score in ranked[1:3]:  # consider 2nd and 3rd
        if score >= MULTI_SKILL_THRESHOLD and _check_claims(skill_name):
            composed.append((skill_name, score))

    # Step 5: Inject
    if not composed:
        # Single skill injection
        content = _load_skill_content(top_skill)
        if content:
            print(f"<system-reminder>\n[Cortex v2] Skill: {top_skill} (sim={top_score:.3f})\n{content}\n</system-reminder>")
            _write_invocation(top_skill, prompt, top_score)
    else:
        # Multi-skill composition
        _inject_composite(top_skill, top_score, composed, prompt)

    sys.exit(0)


def _inject_composite(
    primary: str,
    primary_score: float,
    secondary: list[tuple[str, float]],
    prompt: str,
) -> None:
    """Inject a composite skill context for multi-domain prompts."""
    parts = [f"[Cortex v2] Multi-skill: {primary} (sim={primary_score:.3f})"]
    all_skills = [primary]

    primary_content = _load_skill_content(primary)
    if primary_content:
        parts.append(f"## {primary}\n{primary_content}")

    for skill_name, score in secondary:
        content = _load_skill_content(skill_name)
        if content:
            # Extract only the Gotchas section for secondary skills to avoid bloat
            gotchas = _extract_section(content, "## Gotchas")
            if gotchas:
                parts.append(f"## {skill_name} — Gotchas\n{gotchas}")
                all_skills.append(skill_name)

    combined = "\n\n".join(parts)
    print(f"<system-reminder>\n{combined}\n</system-reminder>")

    for skill_name in all_skills:
        score = primary_score if skill_name == primary else dict(secondary)[skill_name]
        _write_invocation(skill_name, prompt, score, composed_with=all_skills)


def _extract_section(content: str, section_header: str) -> str:
    """Extract a markdown section by header."""
    start = content.find(section_header)
    if start == -1:
        return ""
    end = content.find("\n## ", start + len(section_header))
    if end == -1:
        return content[start:]
    return content[start:end]


if __name__ == "__main__":
    main()

Migration Strategy

The migration from v1 to v2 is zero-downtime via a feature flag in the registry. `version` stays at "v1" until the embeddings are bootstrapped and is then changed to "v2"; `fallback` makes v2 fall back to the v1 regex on embedding failure:

json

"router": {
    "version": "v1",
    "fallback": "regex",
    "min_similarity": 0.60
}

Phase 1: Run both v1 and v2 in shadow mode. v2 logs its decisions to entries.jsonl (routing_method: "embedding_v2") but does not inject. Compare v1 and v2 selections.

Phase 2: Enable v2 with regex fallback.

Phase 3: Disable regex fallback once the similarity score distribution stabilizes.
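
The shadow-mode comparison can be as simple as bucketing paired decisions. A minimal sketch, assuming each prompt yields one `(v1_skill, v2_skill)` pair with `None` meaning "no injection"; `shadow_report` and the sample pairs are hypothetical.

```python
from collections import Counter


def shadow_report(decisions):
    """decisions: iterable of (v1_skill, v2_skill); None means no injection."""
    counts = Counter()
    for v1, v2 in decisions:
        if v1 == v2:
            counts["agree"] += 1
        elif v1 is None:
            counts["v2_only"] += 1    # v2 routed a prompt the regex missed
        elif v2 is None:
            counts["v1_only"] += 1    # v2 stayed silent where the regex fired
        else:
            counts["disagree"] += 1   # both fired, on different skills
    total = sum(counts.values())
    report = {key: counts[key] for key in ("agree", "v2_only", "v1_only", "disagree")}
    report["agreement_rate"] = counts["agree"] / total if total else 0.0
    return report


# Made-up sample of four prompts.
sample = [
    ("ops:debug", "ops:deploy"),
    (None, "ops:monitoring"),
    ("ops:git", "ops:git"),
    ("ops:ios", None),
]
report = shadow_report(sample)
print(report)
```

The `v2_only` and `disagree` buckets are the ones to read by hand before moving to Phase 2, since they are where v2 either fixes a regex miss or introduces a new mistake.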

---

Section 5: Cold Start Bootstrap

The system has 324 invocation records in entries.jsonl. They are the bootstrap dataset. However, 319 of these are registration events ("forge:register ops:ios") not real prompt matches — the registry shows `invocations: 0` for all active skills. The real trigger invocations have not been reliably recorded yet because the trigger pattern was too narrow.

Bootstrap Algorithm

python

# [home-path]
# Run once to initialize skill_embeddings.pkl

def bootstrap_from_invocation_records():
    """
    Phase 1: Embed all active skills using current SKILL.md content
             + historical prompts from prompts-all.jsonl.
    Phase 2: Set initial trajectory weights from domain keyword frequency
             (a cheap proxy for "how often does this domain matter here").
    Phase 3: Persist to pkl + pgvector.
    """
    # Phase 1: Gather top prompts per domain
    prompts = load_all_prompts()
    domain_prompts = group_prompts_by_domain(prompts)  # uses DOMAIN_KEYWORDS from extractor.py

    skill_records = {}
    for skill_name, info in load_active_skills().items():
        skill_md = (SKILLS_DIR / skill_name / "SKILL.md").read_text()
        domain = info.get("domains", ["debug"])[0]
        historical = domain_prompts.get(domain, [])[:5]

        embedding_text = build_skill_embedding_text(skill_name, skill_md, historical)
        embedding = embed_via_ragpp(embedding_text)

        # Phase 2: Initial weight from domain frequency
        # Skills invoked more often start with slight advantage
        domain_count = len(domain_prompts.get(domain, []))
        initial_weight = 1.0 + min(domain_count / 1000, 0.2)  # max 1.2 initial advantage

        skill_records[skill_name] = {
            "embedding": embedding.tolist(),
            "embedding_text": embedding_text,
            "trajectory_weight": initial_weight,
            "version": 1,
            "bootstrapped_at": datetime.now(timezone.utc).isoformat(),
        }

    # Phase 3: Persist
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(skill_records, f)
    upsert_all_to_pgvector(skill_records)

    print(f"Bootstrap complete: {len(skill_records)} skills embedded")
    return skill_records


def group_prompts_by_domain(prompts: list[dict]) -> dict[str, list[str]]:
    """Group prompt texts by the operational domain they touch."""
    from cortex.forge.extractor import DOMAIN_KEYWORDS, _detect_domains
    result = {d: [] for d in DOMAIN_KEYWORDS}
    for p in prompts:
        for domain in _detect_domains(p["text"]):
            # No per-domain cap: the bootstrap uses len() of each list as the
            # domain frequency, so truncating at 50 would cap every initial
            # weight at 1.05 and contradict the estimates below.
            result.setdefault(domain, []).append(p["text"])
    return result

Cold Start Performance Estimate

Skill	Domain Prompts Available	Initial Weight Estimate
ops:debug	110 (from Stage 0 research)	1.11
ops:deploy	79	1.08
ops:ios	65	1.065
ops:supabase	50	1.05
ops:monitoring	35	1.035
ops:git	21	1.021
ops:prefect	19	1.019
ops:docker	~15	1.015
ops:asc	~10	1.01
ops:mesh	~8	1.008

After bootstrap, the routing space already reflects the empirical frequency distribution of this operator's work, without any training. The expectation is that this beats regex from day one; the Phase 1 shadow comparison is where that expectation gets checked.
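
The initial weights in the table follow directly from the bootstrap formula. A short check, using the prompt counts listed above; `initial_weight` is a hypothetical wrapper around the expression in the bootstrap code.

```python
def initial_weight(domain_count: int) -> float:
    """1.0 plus 0.001 per matching prompt, capped at a 0.2 head start."""
    return 1.0 + min(domain_count / 1000, 0.2)


for skill, count in [("ops:debug", 110), ("ops:deploy", 79), ("ops:ios", 65), ("ops:mesh", 8)]:
    print(skill, round(initial_weight(count), 3))
```

The 0.2 cap only engages at 200 or more prompts per domain, so none of the current skills reach it. The spread between the most and least frequent skill is about 0.1, small enough that a handful of real outcomes will override it.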

---

Section 6: Multi-Skill Composition

When two or more skills exceed the composition threshold (0.80 weighted similarity), we inject a composite context. The design challenge is preventing context bloat: injecting 3 full SKILL.md files (~600 lines each) would overwhelm the session context.

Composition Strategy

Primary skill (highest similarity):  FULL SKILL.md content injected
Secondary skills (>0.80 threshold):  GOTCHAS SECTION ONLY injected
Tertiary and beyond:                 SKILL NAME MENTIONED ONLY ("also relevant: ops:deploy")

The gotchas are the highest-value section for cross-domain prompts. When someone says "fix the Prefect deploy for the Spore Supabase migration," they need the full ops:prefect workflow, plus the deploy gotchas (SSH heredoc, VM hang recovery), plus the supabase gotcha (anon key behavior). They don't need three full workflows.
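The tiering above could be sketched roughly as follows. This is a minimal illustration, not the router's actual code: `plan_composition`, `Injection`, and the cap `MAX_GOTCHA_SKILLS` are hypothetical names, and the cap of two gotchas-only skills is an assumption inferred from the example prompt, since the text does not say where "tertiary" begins.

```python
from typing import NamedTuple

MULTI_SKILL_THRESHOLD = 0.80  # composition threshold from the text
MAX_GOTCHA_SKILLS = 2         # ASSUMPTION: how many skills get gotchas-only treatment


class Injection(NamedTuple):
    skill: str
    similarity: float
    tier: str  # "full" | "gotchas" | "mention"


def plan_composition(ranked: list[tuple[str, float]]) -> list[Injection]:
    """Assign an injection tier to every skill above the composition threshold.

    `ranked` holds (skill name, trajectory-weighted similarity) pairs in any order.
    Returns an empty list when fewer than two skills qualify, because that is
    the single-skill case and not a composite injection.
    """
    above = sorted(
        (pair for pair in ranked if pair[1] > MULTI_SKILL_THRESHOLD),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if len(above) < 2:
        return []

    plan = []
    for position, (skill, similarity) in enumerate(above):
        if position == 0:
            tier = "full"      # primary: full SKILL.md
        elif position <= MAX_GOTCHA_SKILLS:
            tier = "gotchas"   # secondary: gotchas section only
        else:
            tier = "mention"   # tertiary and beyond: name only
        plan.append(Injection(skill, similarity, tier))
    return plan
```

For the Prefect/deploy/Supabase prompt in the text, this yields one `full` entry and two `gotchas` entries, which keeps the injected context to one workflow plus two short lists.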

Composite Injection Format

[Cortex v2] Multi-skill: ops:prefect (sim=0.87) + ops:deploy (sim=0.83) + ops:supabase (sim=0.81)

## ops:prefect
[full SKILL.md content]

## ops:deploy — Gotchas
- SSH heredoc: Mangles ${...} — write scripts locally, scp to VM, then execute
- VM hangs: ssh cloud-vm hangs? → gcloud compute instances reset...
[gotchas only]

## ops:supabase — Gotchas
- Anon key allows writes: RLS policies use auth.role() = 'authenticated' but anon key works for CRUD...
[gotchas only]
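A rough sketch of how the format above might be assembled. It assumes each SKILL.md keeps its gotchas under a markdown heading whose text contains "Gotchas"; that layout, and the helper names `extract_gotchas` and `render_composite`, are illustrative assumptions and are not taken from the Cortex code.

```python
# Same section heading as the format above (\u2014 is the dash used there).
GOTCHAS_HEADING = "## {name} \u2014 Gotchas"


def extract_gotchas(skill_md: str) -> str:
    """Return the body under the first heading that mentions 'gotcha'.

    ASSUMPTION: gotchas live under their own markdown heading. Lines that
    start with '#' inside fenced code would be misread as headings here.
    """
    body, capturing = [], False
    for line in skill_md.splitlines():
        if line.startswith("#"):
            if capturing:
                break  # the next heading ends the gotchas section
            capturing = "gotcha" in line.lower()
            continue
        if capturing:
            body.append(line)
    return "\n".join(body).strip()


def render_composite(plan: list[tuple[str, float]], docs: dict[str, str]) -> str:
    """Build the composite injection: full primary skill, gotchas for the rest.

    `plan` is ordered by similarity, highest first; `docs` maps skill name
    to its SKILL.md text.
    """
    header = "[Cortex v2] Multi-skill: " + " + ".join(
        f"{name} (sim={sim:.2f})" for name, sim in plan
    )
    sections = []
    for position, (name, _sim) in enumerate(plan):
        if position == 0:
            sections.append(f"## {name}\n{docs[name].strip()}")
        else:
            heading = GOTCHAS_HEADING.format(name=name)
            sections.append(f"{heading}\n{extract_gotchas(docs[name])}")
    return "\n\n".join([header] + sections)
```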

Composition Threshold Tuning

The 0.80 threshold is a starting value. After 50 composite injections, compare correction rates between composite and single-skill sessions. If composite sessions show more corrections than single-skill sessions, raise the threshold to 0.85; if they show fewer, lower it to 0.75. This tuning runs in the daily Prefect refresh flow.
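The tuning rule can be written down as a small pure function. This is a sketch under stated assumptions: the function name, its arguments, and the choice to compare against the single-skill correction rate are illustrative; only the values 0.80, 0.85, 0.75 and the 50-injection minimum come from the text.

```python
START_THRESHOLD = 0.80
RAISED_THRESHOLD = 0.85
LOWERED_THRESHOLD = 0.75
MIN_COMPOSITE_SAMPLES = 50


def tune_composition_threshold(
    composite_correction_rate: float,
    single_correction_rate: float,
    composite_count: int,
) -> float:
    """Pick the composition threshold from observed correction rates."""
    if composite_count < MIN_COMPOSITE_SAMPLES:
        return START_THRESHOLD   # not enough evidence yet
    if composite_correction_rate > single_correction_rate:
        return RAISED_THRESHOLD  # composites hurt: compose less often
    if composite_correction_rate < single_correction_rate:
        return LOWERED_THRESHOLD  # composites help: compose more often
    return START_THRESHOLD
```

Returning fixed values instead of stepping the current threshold keeps the rule within the 0.75 to 0.85 range the text describes.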

---

Section 7: Infrastructure

Storage Map

| Component | Location | Size Estimate | Purpose |
|---|---|---|---|
| `skill_embeddings.pkl` | `[home-path]` | ~60KB (10 skills × 768 dims × 8 bytes; `.tolist()` stores 64-bit floats) | Hot cache, loaded per hook call |
| `skill_embeddings` table | Supabase / cloud-vm pgvector | 10 rows × 768 vector | Persistent source of truth, daily sync |
| `entries.jsonl` | `[home-path]` | Growing, currently 399 lines | Invocation records with similarity_score and routing_method fields |
| `weight_log.jsonl` | `[home-path]` | ~1KB/day | Audit trail for weight updates, enables replay/rollback |

New Fields Added to entries.jsonl Schema

Existing invocation_record type gains three fields (backward compatible):

json

{
  "type": "invocation_record",
  "skill": "ops:prefect",
  "trigger_prompt": "deploy the flow",
  "routing_method": "embedding_v2",     // NEW — "regex_v1" | "embedding_v2" | "fallback_v1"
  "similarity_score": 0.847,            // NEW — cosine sim after trajectory weighting
  "composed_with": ["ops:deploy"]       // NEW — other skills in composite injection
}

No schema migrations required in Supabase for this — these are additive fields in a JSONL store.
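Because the fields are additive, a reader only needs defaults for records written before v2. A minimal sketch follows; the helper name `parse_invocation` and the choice of `"regex_v1"` as the default for old records are assumptions, on the reasoning that anything logged before v2 was routed by regex.

```python
import json
from typing import Optional

# ASSUMPTION: records without the new fields predate v2 and were regex-routed.
NEW_FIELD_DEFAULTS = {
    "routing_method": "regex_v1",
    "similarity_score": None,
    "composed_with": [],
}


def parse_invocation(line: str) -> Optional[dict]:
    """Parse one entries.jsonl line, filling in the new fields when absent.

    Returns None for other record types (for example decay flags).
    """
    record = json.loads(line)
    if record.get("type") != "invocation_record":
        return None
    merged = {**NEW_FIELD_DEFAULTS, **record}
    # Copy the list so records never share the default's mutable object.
    merged["composed_with"] = list(merged["composed_with"])
    return merged
```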

New table (one addition to the existing 141 Supabase tables):

sql

CREATE TABLE skill_embeddings (
    skill_name         TEXT PRIMARY KEY,
    embedding          vector(768),
    embedding_text     TEXT,
    trajectory_weight  FLOAT NOT NULL DEFAULT 1.0,
    updated_at         TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    version            INT NOT NULL DEFAULT 1,
    invocation_count   INT NOT NULL DEFAULT 0,
    success_count      INT NOT NULL DEFAULT 0,
    failure_count      INT NOT NULL DEFAULT 0
);

Computational Cost

The only new compute cost on the critical path is the embedding API call:

| Component | Cost | Frequency |
|---|---|---|
| Gemini embed (RAG++ :8000) | ~80ms, negligible $ | Per prompt that reaches ops_trigger_v2 |
| pkl cache load | <1ms | Per trigger |
| numpy dot product (10 skills) | <1ms | Per trigger |
| Weight update (EMA float ops) | <1ms | Per Stop event with invocations |
| Prefect daily re-embed | 10 API calls/day | Daily |

The full critical path is ~85ms out of a 500ms budget. Comfortable.

Fallback Safety

If RAG++ is down (SSH tunnel dropped, Docker container restarting), the embedding call returns None. The router immediately falls back to `ops_trigger.py` (v1 regex). The system degrades gracefully. The Stop hook's weight updater also short-circuits on error without any user-visible effect.
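The control flow is simple enough to sketch with the three collaborators passed in as callables. `route`, `embed`, `rank_v2`, and `route_v1` are hypothetical names used for illustration; the `routing_method` labels are the ones defined for the entries.jsonl schema.

```python
from typing import Callable, Optional, Sequence


def route(
    prompt: str,
    embed: Callable[[str], Optional[Sequence[float]]],
    rank_v2: Callable[[Sequence[float]], Optional[str]],
    route_v1: Callable[[str], Optional[str]],
) -> tuple[Optional[str], str]:
    """Return (skill, routing_method), falling back to regex when embedding fails."""
    try:
        vector = embed(prompt)
    except Exception:
        vector = None  # treat any embedding error like an outage
    if vector is None:
        return route_v1(prompt), "fallback_v1"
    return rank_v2(vector), "embedding_v2"
```

Keeping the fallback decision in one place means the Stop hook can tell from `routing_method` alone whether a weight update applies to the embedding router.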

---

Section 8: Comparison with Full KARL

This design is explicitly "KARL for routing" — not "KARL for reasoning." The distinction matters.

| Dimension | Full KARL (Databricks) | Path D (Skill Embeddings) |
|---|---|---|
| What is trained | Model weights (OAPL gradient updates) | Routing weight scalars (EMA updates) |
| Training data | Synthetic multi-step rollouts (1000+ examples) | Existing invocation records (324, growing) |
| Compute required | GPU cluster, 8 rollouts per prompt | Single M1 Mac, EMA float updates |
| Latency impact | Zero (weights are model weights, no extra inference) | +80ms per prompt for embedding |
| What improves | The model's ability to formulate search queries | Which static skill gets injected |
| Learning algorithm | OAPL (KL-regularized RL with regression loss) | EMA weight update from outcome signals |
| Trajectory depth | 50-200 tool-use steps per trajectory | 1 invocation → 1-5 outcome prompts |
| Coverage | Any text generation task | Only tasks where skills exist |
| Maximum lift | Closes 52.6 → 67.5 (KARL's result with GLM 4.5) | Eliminates ~30-40 |

Path D's theoretical ceiling: The static SKILL.md content is not improved. Only the selection mechanism is improved. If ops:debug's workflow is wrong, better routing to ops:debug still injects wrong content. Path D is a prerequisite for — not a replacement of — the deeper KARL integration (Path A/B/C which involve trajectory-based skill rewriting or weight updates).

Path D's practical advantage: It is buildable in one sprint, runs on existing infrastructure, and produces measurable results immediately. The 324 invocation records provide a bootstrap dataset, the RAG++ embedding infrastructure is live, and the EMA weight update is 10 lines of code. The output is a new artifact (skill_embeddings.pkl) that can be extended by future paths.

---

Section 9: Risks

Risk 1: Embedding Drift

What it is: The space in which skills and prompts are embedded shifts when the embedding model is updated. If Gemini rolls `text-embedding-004` to a new version, all vectors become incompatible.

Likelihood: Medium. Gemini embedding models are relatively stable, but not guaranteed.

Mitigation: Store `embedding_model_version` in each skill record. On drift detection (cosine_sim between old and new embedding of a known anchor text drops below 0.90), trigger a full re-embed of all skills. The daily Prefect flow includes this check.

python

ANCHOR_TEXT = "deploy the Prefect flow to cloud-vm"
# Compare embed(ANCHOR_TEXT) monthly. If cos_sim < 0.90 vs. previous: re-embed all.
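A minimal sketch of the drift check itself, written in plain Python so it runs without numpy. The helper names are illustrative, and treating a dimension change as drift is an added assumption; the 0.90 threshold is the one stated above.

```python
import math
from typing import Sequence

DRIFT_THRESHOLD = 0.90  # from the mitigation above


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def needs_reembed(
    stored_anchor: Sequence[float],
    fresh_anchor: Sequence[float],
    threshold: float = DRIFT_THRESHOLD,
) -> bool:
    """True when the anchor text's embedding has moved enough to re-embed all skills."""
    if len(stored_anchor) != len(fresh_anchor):
        return True  # ASSUMPTION: a dimension change always counts as drift
    return cosine_similarity(stored_anchor, fresh_anchor) < threshold
```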

Risk 2: Cold Start for New Skills

What it is: A newly forged skill has no trajectory weight history. It starts at `w = 1.0` with no successful invocations. If its initial embedding lands close to many prompts, it may get selected too aggressively before it has proven itself.

Mitigation: New skills start at `trajectory_weight = 0.85` (slightly below neutral). They must accumulate positive signals before competing equally with established skills. The weight floor (0.3) and ceiling (1.5) prevent extremes.

Risk 3: Catastrophic Forgetting of Long-Tail Scenarios

What it is: EMA gives recent outcomes more weight than old ones. A skill that was heavily used (and succeeded) 60 days ago but not used recently will have its weight decayed toward 1.0. If it suddenly becomes relevant again, the routing space has forgotten its strength.

Mitigation: The weight decay is gentle (EMA alpha = 0.1 means 10

Risk 4: Success Signal Noise

What it is: The success indicators ("commit this", "looks good") are weak proxies. A user can say "commit this" after a session where the injected skill was irrelevant to the task — the success was not because of the skill, but despite it. This creates false positive weight updates.

Mitigation: The success signal only contributes +0.5 (not +1.0) to the EMA update. Correction signals contribute -1.0 (full weight). This asymmetry means the system is skeptical of positive signals and quick to penalize negative ones, which matches the real use case (it's easier to verify a failure than a success).
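One way the asymmetry could play out numerically is sketched below. The update form is an assumption made for illustration: each outcome sets an EMA target of `1.0 + signal`, so a success pulls toward 1.5 and a correction pulls toward 0.0 before clamping. The signal values, alpha = 0.1, and the [0.3, 1.5] bounds come from this document; the exact rule in `weight_updater.py` may differ.

```python
ALPHA = 0.1                 # EMA smoothing factor
W_MIN, W_MAX = 0.3, 1.5     # weight floor and ceiling
SIGNAL = {"success": 0.5, "correction": -1.0}


def update_weight(weight: float, outcome: str) -> float:
    """Move the trajectory weight toward an outcome-dependent target, then clamp."""
    target = 1.0 + SIGNAL[outcome]  # ASSUMPTION: target is neutral (1.0) plus signal
    new_weight = (1 - ALPHA) * weight + ALPHA * target
    return max(W_MIN, min(W_MAX, new_weight))


neutral = 1.0
after_success = update_weight(neutral, "success")        # small step up
after_correction = update_weight(neutral, "correction")  # larger step down
print(round(after_success, 3), round(after_correction, 3))
```

Under this assumed form, a single correction from neutral moves the weight twice as far as a single success, which is the skepticism the mitigation asks for.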

Risk 5: Bootstrap Data Quality

What it is: The 324 existing invocation records are mostly registration events, not real trigger events. The actual trigger prompt data is sparse because the regex was too narrow to trigger frequently.

Mitigation: Bootstrap weights from domain keyword frequency (the `group_prompts_by_domain()` approach from Section 5) rather than from invocation counts. This gives a reasonable prior based on 903 real prompts without requiring perfect invocation data. After 1-2 weeks of v2 operation, real invocation data will dominate.

Risk 6: RAG++ Dependency in Hook Critical Path

What it is: ops_trigger_v2 now depends on a network call to localhost:8000 (RAG++ via SSH tunnel). If the tunnel is down, the hook has 350ms to detect the failure and fall back to v1. The SSH tunnel is a LaunchAgent-managed process that reconnects automatically, but brief outages are possible.

Mitigation: The fallback to v1 regex is already designed in. The urllib timeout is set to 0.35s — well within the SIGALRM budget. All failures are silent (sys.exit(0)) from the hook's perspective; v1 regex fires if v2 times out. No user-visible impact.
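A minimal sketch of the bounded embedding call, using only the standard library. The URL combines the port and the `/api/rag/embed` path mentioned in this document; the request payload (`{"text": ...}`) and the response key (`"embedding"`) are assumptions about the gateway, not its confirmed API.

```python
import http.client
import json
import urllib.error
import urllib.request
from typing import Optional

EMBED_URL = "http://localhost:8000/api/rag/embed"  # payload shape below is ASSUMED
EMBED_TIMEOUT_S = 0.35


def embed_or_none(
    text: str, url: str = EMBED_URL, timeout: float = EMBED_TIMEOUT_S
) -> Optional[list]:
    """Return the embedding vector, or None on any network or parse failure."""
    payload = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        return None  # tunnel down, timeout, or malformed response
    vector = body.get("embedding") if isinstance(body, dict) else None
    return vector if isinstance(vector, list) else None
```

The `timeout` passed to `urlopen` bounds each blocking socket operation, not the total wall time, so the SIGALRM budget remains the outer guard.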

---

Implementation Checklist

### Phase 1: Bootstrap (1-2 days)
- [ ] Write `[home-path]` using design from Section 5
- [ ] Add `/api/rag/embed` endpoint to RAG++ gateway (or confirm existing embedding endpoint)
- [ ] Run bootstrap, verify `skill_embeddings.pkl` created with 13 skill vectors
- [ ] Create `skill_embeddings` Supabase table
- [ ] Unit test `_rank_skills()` with mock embeddings — verify weighted ordering

### Phase 2: Shadow Mode (3-5 days)
- [ ] Write `ops_trigger_v2.py` using design from Section 4
- [ ] Add routing_method + similarity_score + composed_with to invocation_record writes
- [ ] Run v2 alongside v1: v2 logs decisions, v1 continues injecting
- [ ] Analyze: do v2 selections match v1 on high-confidence prompts? Where do they diverge?
- [ ] Calibrate MIN_SIMILARITY threshold from distribution of observed similarity scores
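The shadow-mode analysis could start from something like the sketch below. The log shape (`v1`, `v2`, `score` per prompt) and the calibration rule (a low nearest-rank quantile of the scores on prompts where v2 agreed with v1) are assumptions; the checklist does not prescribe a specific method.

```python
import math
from typing import Optional


def agreement_rate(decisions: list[dict]) -> float:
    """Share of prompts where v1 and v2 picked the same skill, among prompts both routed."""
    both = [d for d in decisions if d["v1"] is not None and d["v2"] is not None]
    if not both:
        return 0.0
    return sum(d["v1"] == d["v2"] for d in both) / len(both)


def calibrate_min_similarity(
    decisions: list[dict], quantile: float = 0.05
) -> Optional[float]:
    """Nearest-rank quantile of v2 scores on prompts where v2 matched v1."""
    scores = sorted(
        d["score"] for d in decisions if d["v1"] is not None and d["v1"] == d["v2"]
    )
    if not scores:
        return None
    rank = max(1, math.ceil(quantile * len(scores)))
    return scores[rank - 1]
```

Prompts where v1 stayed silent but v2 routed are excluded from both numbers; those are the divergences worth reading by hand.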

### Phase 3: Live Routing (1 week)
- [ ] Enable v2 with regex fallback (feature flag in registry.json router section)
- [ ] Write `weight_updater.py` Stop hook component
- [ ] Verify weight updates are landing in pkl and Supabase
- [ ] Write `skill_embedding_refresh` Prefect flow
- [ ] Monitor: invocation_record similarity_score distribution, correction rate before/after

### Phase 4: Evaluation (ongoing)
- [ ] A/B metric: correction rate in sessions with embedding-routed skills vs. regex-routed
- [ ] A/B metric: task completion time before/after (from session duration in unified.jsonl)
- [ ] Dashboard: add embedding routing stats to Nexus Portal /skills page
- [ ] Tune MULTI_SKILL_THRESHOLD based on composite injection outcomes
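For the first A/B metric, a sketch of the grouping step. It assumes each session has already been reduced to a `routing_method` and a boolean `corrected` flag; that reduced shape and the function name are illustrative, not part of the existing schema.

```python
from collections import defaultdict


def correction_rate_by_method(sessions: list[dict]) -> dict[str, float]:
    """Fraction of sessions with at least one correction, per routing method."""
    totals: dict[str, int] = defaultdict(int)
    corrected: dict[str, int] = defaultdict(int)
    for session in sessions:
        method = session["routing_method"]
        totals[method] += 1
        corrected[method] += 1 if session["corrected"] else 0
    return {method: corrected[method] / totals[method] for method in totals}
```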

---

Connection to Other KARL Paths

Path D (Skill Embeddings) is the foundation layer for the deeper KARL integration paths. It produces infrastructure that the other paths depend on:

- Path A (Trajectory Recorder): Needs invocation records enriched with similarity_score to know which skills were active during a trajectory. Path D adds exactly this.
- Path B (OAPL on Mac5): The trained model on Mac5 could emit embeddings instead of, or in addition to, text, learning to "predict which skill embedding is nearest for this prompt." Path D's infrastructure is the ground truth for this training target.
- Path C (Self-Play Synthesis): Synthetic training examples should include skill routing decisions. Path D's routing log becomes the supervision signal for self-play: "for this type of prompt, ops:prefect was the right skill."

The ordering is: Path D first (routing infrastructure), then Path A (trajectory collection enriched by D), then Path B or C (model-level learning that uses D+A data).

---

Sources

### Codebase Files Read
- `[home]/.claude/cortex/router/ops_trigger.py` (233 lines)
- `[home]/.claude/cortex/models.py` (169 lines)
- `[home]/.claude/cortex/forge/extractor.py` (287 lines)
- `[home]/.claude/cortex/forge/generator.py` (178 lines)
- `[home]/.claude/cortex/decay/detector.py` (251 lines)
- `[home]/.claude/cortex/adaptation/frequency_tracker.py` (200+ lines)
- `[home]/.claude/cortex/entries.jsonl` (399 entries: 324 invocation_records, 75 decay_flags)
- `[home]/.claude/skills/registry.json` (88 skills, 13 active)
- `[home]/.claude/skills/ops:debug/SKILL.md` (Gen 2 enriched)
- `[home]/.claude/skills/ops:deploy/SKILL.md` (Gen 2 enriched)

### Stage 0 Research
- `[home]/Desktop/evo-cube-output/karl-trajectory-intelligence/stage0-research.md` (504 lines)
- Section 1: Cortex system architecture with exact file line references
- Section 2: Skills infrastructure (88 skills, 13 active, Gen 1 vs Gen 2)
- Section 3: Hooks architecture and existing tool-use recording
- Section 4: Mac5 fine-tune pipeline
- Section 5: RAG++ / pgvector search infrastructure
- Section 6: Evolution World architecture
- Section 7: KARL paper (OAPL algorithm, synthetic data pipeline, performance tables)
- Section 8: Hard constraints (500ms hook budget, Mac5 16GB, 141 Supabase tables)

### Key Design Decisions Derived From Research
- Gemini text-embedding-004 via RAG++ gateway: Avoids new infrastructure; reuses existing Supabase pgvector
- EMA weight updates (alpha=0.1): Chosen to resist noise while responding to trends, bounded in [0.3, 1.5]
- Correction as primary negative signal: Already detected by correction_detector.py, strongest reliable signal
- Bootstrap from domain keyword frequency: Compensates for sparse invocation record data (most are registration events)
- Regex fallback on embed failure: Preserves availability, allows gradual rollout


Source Anchor

evo-cube-output/karl-trajectory-intelligence/stage1-path-d.md
