# KARL Integration — Evolution³ / Stage 1: PATH D
Run: karl-trajectory-intelligence
Path: D — Skill Embeddings: Learned Vector Space for Routing
Generated: 2026-03-10
Status: Stage 1 complete
Run Directory: Desktop/evo-cube-output/karl-trajectory-intelligence/
---
Path D Concept Summary
Abandon regex-based skill routing entirely. Embed every skill and every incoming prompt into a shared vector space. When a new prompt arrives, find the nearest skill by trajectory-weighted similarity — not raw text overlap. The "learning" is not RL on model weights (that is KARL's full treatment). It is RL on the routing layer itself. Skills stay as SKILL.md markdown. The only thing that changes is which skill gets injected, and that decision is made by a vector space whose distances are continuously updated by observed trajectory success rates.
This is "KARL for routing" rather than "KARL for reasoning." It replaces the regex `\b(error|bug|crash)\b` in `ops_trigger.py` with a learned function that has been shaped by 324 invocation records and counting.
---
Current State Baseline
Before designing the replacement, establish exactly what we are replacing and why it fails.
What Exists Now
`[home-path]` (233 lines):
Load registry (mtime-cached)
--> compile trigger regex per active forged skill
--> match prompt.search(pattern) for each skill
--> first match wins
--> load SKILL.md, strip frontmatter, inject via stdout
--> write invocation_record to entries.jsonl
Skill trigger patterns (from `ops_trigger.py` / `generator.py`):
| Skill | Trigger Pattern |
|---|---|
| ops:debug | `\b(error |
| ops:deploy | `\b(deploy |
| ops:ios | `\b(bootstrap |
| ops:git | `\b(commit |
| ops:supabase | `\b(migration |
| ops:prefect | `\b(prefect |
| ops:monitoring | `\b(grafana |
| ops:mesh | `\b(mac[1-5] |
| ops:docker | `\b(docker |
| ops:asc | `\b(app store |
Why Regex Fails
Problem 1: Vocabulary gaps. "The Nexus portal is throwing a 502" should trigger `ops:deploy` or `ops:monitoring`. Neither pattern matches `502` or `portal`. The regex fails silently — no skill injected, user gets no operational context.
Problem 2: First-match wins with no ranking. ops:debug fires on "fix the deploy script" because `fix` matches its pattern, even though `ops:deploy` is the semantically correct skill for that prompt. The system has no way to distinguish between a weak match and a strong match.
Problem 3: No outcome signal. ops:ios is invoked 71 times in entries.jsonl. Of those, roughly how many resulted in a successful archive, a failed xcodebuild, or a correction immediately after? The invocation record writes `trigger_prompt: "deploy the spore app"` and then forgets about what happened. There is zero feedback loop between "skill was injected" and "skill helped."
Problem 4: Compound tasks are undefined. "Fix the Prefect deploy for Spore's Supabase migration" touches ops:debug, ops:prefect, ops:supabase simultaneously. The regex fires on the first match and stops. Multi-skill composition is not possible.
Problem 5: Cold prompts are never routed. Creative or hybrid prompts that use no operational keywords are never routed to a skill, even when they pattern-match to a known successful workflow from session history.
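Problems 1 and 2 can be reproduced in a few lines. The sketch below is illustrative only: the two patterns are hypothetical stand-ins (the real trigger patterns live in `ops_trigger.py`), and `first_match` is an assumed name for the first-match-wins loop.

```python
import re
from typing import Optional

# Hypothetical stand-in patterns, listed in registry order.
# They only illustrate the failure modes; they are not the real triggers.
TRIGGERS = [
    ("ops:debug", re.compile(r"\b(error|bug|crash|fix)\b", re.IGNORECASE)),
    ("ops:deploy", re.compile(r"\b(deploy|release)\b", re.IGNORECASE)),
]


def first_match(prompt: str) -> Optional[str]:
    """Return the first skill whose pattern matches, or None (first match wins)."""
    for skill, pattern in TRIGGERS:
        if pattern.search(prompt):
            return skill
    return None


# Problem 2: "fix" matches ops:debug before ops:deploy is ever considered.
print(first_match("fix the deploy script"))
# Problem 1: no keyword overlap, so nothing is injected.
print(first_match("The Nexus portal is throwing a 502"))
```

Reordering the list would only move the misroute to a different prompt; without a score there is nothing to rank on.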
---
Section 1: Embedding Architecture
The Shared Embedding Space
The core idea: embed skills and prompts into the same vector space so that similarity in the space means "this skill is helpful for this prompt." Two design choices determine the quality of this space.
Choice A: Embedding model. We need an embedding model that (a) runs in <100ms on Mac1 (hook budget is 500ms total, need headroom), (b) captures operational semantics (distinguishes "deploy flow" from "debug flow"), and (c) is available without additional inference infrastructure.
Options:
| Option | Latency | Quality | Available |
|---|---|---|---|
| `text-embedding-3-small` via OpenAI API | 80-150ms (network) | High | Yes (API key exists) |
| Gemini `text-embedding-004` | 80-150ms (network) | High | Yes (GOOGLE_API_KEY) |
| `nomic-embed-text` via Ollama on Mac4 | 20-40ms (Tailscale) | Good | Mac4 :11434 |
| `all-MiniLM-L6-v2` via sentence-transformers | 5-15ms (local) | Medium | Needs install |
| RAG++ gateway `/api/rag/embed` | ~50ms (SSH tunnel :8000) | High (Gemini) | Already exists |
Recommended: RAG++ gateway. The infrastructure already exists. RAG++ uses Gemini `text-embedding-004` and has a pgvector persistence layer. Routing queries through the existing gateway means skill embeddings land in the same space as the turn-level semantic search, enabling cross-system retrieval. The RAG++ container runs on cloud-vm with SSH tunnel to Mac1 :8000.
Choice B: What to embed. This determines what the space means.
Each skill should be embedded not just from its name but from its operational semantics: the combination of its intent statement, its workflow steps, its gotchas, and historically successful trigger prompts. This produces a richer vector than embedding "ops:deploy — Deploy services to cloud-vm" alone.
from typing import Optional

def build_skill_embedding_text(
    skill_name: str,
    skill_md: str,
    historical: Optional[list[str]] = None,
) -> str:
    """Construct embedding input for a skill: captures its semantic surface area."""
    # 1. Strip frontmatter, extract intent + workflow
    body = strip_frontmatter(skill_md)
    intent = extract_section(body, "## Intent")
    workflow = extract_section(body, "## Workflow")
    gotchas = extract_section(body, "## Gotchas")
    # 2. Pull top-5 historical trigger prompts from invocation_records,
    #    unless the caller (e.g. the bootstrap script) supplies them
    if historical is None:
        historical = get_top_prompts_for_skill(skill_name, limit=5)
    sections = [
        f"Skill: {skill_name}",
        f"Intent: {intent[:200]}",
        f"Workflow: {workflow[:300]}",
        f"Gotchas summary: {gotchas[:200]}",
        f"Used for prompts like: {'; '.join(historical)}",
    ]
    return "\n".join(s for s in sections if s)
For prompts, embed the raw text, potentially augmented with CWD context. The project being worked on often disambiguates a prompt: "fix the error" means something different in `/Desktop/Spore/` than in `/flows/feed-hub/`.
from pathlib import Path

def build_prompt_embedding_text(prompt: str, cwd: str = "") -> str:
    # Prefix the prompt with the project directory name when a CWD is known
    project = Path(cwd).name if cwd else ""
    if project:
        return f"[project:{project}] {prompt}"
    return prompt
Vector Dimensions and Storage
RAG++ uses Gemini `text-embedding-004` which produces 768-dimensional vectors. All skill embeddings live in a single pgvector table on cloud-vm. The table structure:
-- In Supabase (reuses existing pgvector schema)
CREATE TABLE skill_embeddings (
skill_name TEXT PRIMARY KEY,
embedding vector(768),
embedding_text TEXT, -- the text that was embedded
updated_at TIMESTAMPTZ DEFAULT NOW(),
version INT DEFAULT 1, -- bump when skill content changes
trajectory_weight FLOAT DEFAULT 1.0 -- modified by outcome learning
);
CREATE INDEX ON skill_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 10); -- 10 skills = small index, fast
This is a minimal footprint: one new table that reuses the existing pgvector infrastructure. No new Docker services.
Skill Embedding Pre-computation
Skills are not re-embedded on every hook call. They are pre-computed once and cached. A background task re-embeds on skill content change.
# [home-path]
import pickle
import time
from pathlib import Path
import numpy as np

CACHE_PATH = Path.home() / ".claude" / "cortex" / "skill_embeddings.pkl"
CACHE_TTL_SECONDS = 3600  # 1 hour

def load_skill_embeddings() -> dict[str, np.ndarray]:
    """Load cached skill embeddings. Recompute if stale."""
    if CACHE_PATH.exists():
        age = time.time() - CACHE_PATH.stat().st_mtime
        if age < CACHE_TTL_SECONDS:
            with open(CACHE_PATH, "rb") as f:
                return pickle.load(f)
    # Cache miss: recompute from pgvector
    embeddings = fetch_embeddings_from_pgvector()
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(embeddings, f)
    return embeddings
In-process, this is a Python dict `{skill_name: np.ndarray}`. At 10 skills × 768 dims × 8 bytes (float64) it comes to about 60KB, which fits easily in the hook process memory. No network call on cache hit.
---
Section 2: Trajectory-Weighted Similarity
This is the core innovation over naive cosine similarity. Raw cosine similarity asks: "which skill text is most similar to this prompt text?" Trajectory-weighted similarity asks: "which skill has historically worked for prompts most similar to this one?"
The Weight Modification Formula
Each skill has a trajectory weight `w_s` that modifies its effective distance to any query prompt. The modified similarity is:
sim_weighted(prompt, skill) = cosine_sim(embed(prompt), embed(skill)) * w_s
Where `w_s` starts at 1.0 and is updated by observed outcome signals. Skills with high success rates for similar prompts get `w_s > 1.0` (pulled toward prompts). Skills with high failure/correction rates get `w_s < 1.0` (pushed away).
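To see how the weight can change a routing decision, here is a minimal pure-Python sketch. The 2-dimensional vectors and the two weights are made-up toy values chosen for easy arithmetic, and `rank` is a hypothetical helper name, not the router's actual function.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(prompt_vec, skills):
    """Sort skills by cosine similarity multiplied by trajectory weight."""
    scored = [
        (name, cosine_sim(prompt_vec, vec) * weight)
        for name, (vec, weight) in skills.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


prompt_vec = [1.0, 0.0]
skills = {
    # name: (toy embedding, trajectory weight w_s)
    "ops:debug": ([0.8, 0.6], 0.7),   # closer in raw text, poor track record
    "ops:deploy": ([0.6, 0.8], 1.2),  # further in raw text, good track record
}
ranked = rank(prompt_vec, skills)
print(ranked)
```

On raw cosine similarity `ops:debug` would win (0.8 against 0.6). After weighting, `ops:deploy` ranks first (about 0.72 against 0.56).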
The weight update rule (an exponential moving average to prevent catastrophic forgetting):
ALPHA = 0.1  # learning rate: slow, to prevent noise domination
GAMMA = 0.9  # decay: old signals matter less (not used by update_weight below)

def update_weight(current_weight: float, outcome: float, alpha: float = ALPHA) -> float:
    """
    outcome: +1.0 = success, 0.0 = neutral, -1.0 = failure/correction
    Maps outcome to [0.5, 1.5] scaling range.
    """
    target = 1.0 + (outcome * 0.5)  # outcome=+1 -> target=1.5, outcome=-1 -> target=0.5
    return current_weight * (1 - alpha) + target * alpha
This keeps weights bounded as long as the starting weight is in [0.5, 1.5] and outcomes stay in [-1, +1]: they converge toward 1.5 for consistently helpful skills and toward 0.5 for consistently unhelpful ones. A weight never reaches 0 (no skill is completely suppressed) and never exceeds 1.5 (no skill dominates unconditionally).
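The convergence behaviour can be checked numerically. The sketch below repeats the EMA rule and compares it with its closed form, w_n = t + (w_0 - t)(1 - alpha)^n, where t is the target for a constant outcome. `simulate` and `closed_form` are helper names introduced here for illustration.

```python
ALPHA = 0.1


def update_weight(current_weight: float, outcome: float, alpha: float = ALPHA) -> float:
    """One EMA step toward the target 1.0 + 0.5 * outcome."""
    target = 1.0 + (outcome * 0.5)
    return current_weight * (1 - alpha) + target * alpha


def simulate(start: float, outcome: float, steps: int) -> float:
    """Apply the same outcome `steps` times in a row."""
    weight = start
    for _ in range(steps):
        weight = update_weight(weight, outcome)
    return weight


def closed_form(start: float, outcome: float, steps: int, alpha: float = ALPHA) -> float:
    """Closed-form value of the EMA after `steps` identical outcomes."""
    target = 1.0 + (outcome * 0.5)
    return target + (start - target) * (1 - alpha) ** steps


for n in (1, 10, 50):
    print(n, round(simulate(1.0, 1.0, n), 4), round(simulate(1.0, -1.0, n), 4))
```

With alpha = 0.1, ten consecutive successes move a weight from 1.0 to roughly 1.33, so a single noisy outcome shifts routing only slightly.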
Outcome Signal Sources
The key question: how do we know if a skill injection "worked"? We have three signal sources of differing reliability and coverage:
Signal 1: Correction absence (strongest proxy, negative signal only). The correction_detector already fires on Stop events. If a correction is detected within 3 prompts after a skill injection, score = -1.0. A correction is strong evidence the skill did not help (or actively misled). This is already being recorded in entries.jsonl — we just need to link corrections to their preceding invocation records.
from datetime import datetime

def check_post_injection_correction(
    invocation_ts: str,
    session_id: str,
    window_prompts: int = 3,
) -> bool:
    """Returns True if a correction was detected after this skill injection."""
    entries = load_entries(entry_type="correction")
    inv_dt = datetime.fromisoformat(invocation_ts)
    for correction in entries:
        if correction.session_id != session_id:
            continue
        corr_dt = datetime.fromisoformat(correction.ts)
        if corr_dt > inv_dt:
            # Count prompts between injection and correction
            prompts_between = count_prompts_in_window(session_id, inv_dt, corr_dt)
            if prompts_between <= window_prompts:
                return True
    return False
Signal 2: Task completion indicators (medium reliability, positive signal). Certain Stop event patterns indicate success: the user's next prompt is a trivial continuation ("commit this", "push it", "ship it", "looks good"). These are already filtered out by the extractor.py SKIP_PATTERNS, but for reward purposes they are exactly what we want. A session that ends with "commit this" after ops:git injection is a positive trajectory.
import re

SUCCESS_INDICATORS = [
    r"^commit this",
    r"^push it",
    r"^ship it",
    r"^looks good",
    r"^perfect",
    r"^nice",
    r"^done",
    r"^deploy it",
]

def check_post_injection_success(
    invocation_ts: str,
    session_id: str,
    window_prompts: int = 5,
) -> bool:
    """Returns True if a success-indicator prompt followed this injection."""
    next_prompts = get_prompts_after(session_id, invocation_ts, limit=window_prompts)
    for p in next_prompts:
        for pattern in SUCCESS_INDICATORS:
            if re.match(pattern, p["text"], re.IGNORECASE):
                return True
    return False
Signal 3: Build/deploy exit codes (strongest signal, narrow domain). For ops:ios, ops:deploy, and ops:docker, Bash tool exit codes are already logged in bash_audit.jsonl and post_tool_hook.py. A zero exit code on `xcodebuild archive` or `docker compose up -d` after an ops:ios / ops:deploy injection is a high-confidence success signal.
from typing import Optional

def check_bash_exit_codes(
    invocation_ts: str,
    session_id: str,
    skill_name: str,
) -> Optional[float]:
    """Returns 1.0 if the first relevant command succeeded, -0.5 if it failed, None if not applicable."""
    SKILL_COMMANDS = {
        "ops:ios": ["xcodebuild", "xcode-select"],
        "ops:deploy": ["systemctl", "docker compose"],
        "ops:docker": ["docker"],
        "ops:prefect": ["prefect"],
        "ops:git": ["git commit", "git push"],
    }
    relevant_cmds = SKILL_COMMANDS.get(skill_name, [])
    if not relevant_cmds:
        return None  # not applicable for this skill
    bash_records = load_bash_records_after(session_id, invocation_ts)
    for record in bash_records:
        cmd = record.get("command", "")
        if any(c in cmd for c in relevant_cmds):
            # A missing exit code defaults to -1 and is treated as a failure
            exit_code = record.get("exit_code", -1)
            return 1.0 if exit_code == 0 else -0.5
    return None
Combining Signals Into a Trajectory Score
def compute_trajectory_score(
    invocation: CortexEntry,
    all_entries: list[CortexEntry],
) -> float:
    """
    Returns a score in [-1, +1] for this invocation.
    Priority: bash exit codes > correction detection > success indicators > neutral.
    """
    # Try bash exit codes first (highest signal quality)
    exit_score = check_bash_exit_codes(
        invocation.ts, invocation.session_id, invocation.skill
    )
    if exit_score is not None:
        return exit_score
    # Check for corrections (strong negative signal)
    if check_post_injection_correction(invocation.ts, invocation.session_id):
        return -1.0
    # Check for success indicators (moderate positive signal)
    if check_post_injection_success(invocation.ts, invocation.session_id):
        return 0.5  # not +1.0: success indicators are noisy
    # No signal available: treat as weakly positive (injection didn't cause harm)
    return 0.1
---
Section 3: Online Learning
The embedding space must update as new trajectories arrive. The challenge: we cannot re-embed skills after every prompt. Instead, we update the trajectory weights (cheap floating-point operations) and re-embed skills only when their content changes or on a scheduled recomputation.
Two-Tier Learning Pipeline
Tier 1: Weight updates (real-time, every Stop event).
A lightweight Stop hook aggregates recent invocation records, computes trajectory scores, and updates `trajectory_weight` in the skill_embeddings table. This runs in the Stop hook's existing budget (correction_detector already fires here).
# [home-path]
# Fires on Stop event, budget: 200ms
def update_weights_from_session(session_id: str) -> None:
    """
    For all invocations in this session, compute outcome scores
    and apply EMA weight updates to skill_embeddings.
    """
    invocations = [
        e for e in load_entries(entry_type="invocation_record")
        if e.session_id == session_id
    ]
    if not invocations:
        return
    all_entries = load_entries()
    for inv in invocations:
        score = compute_trajectory_score(inv, all_entries)
        if abs(score) < 0.05:
            continue  # skip neutral, save write
        current_weight = fetch_weight(inv.skill)  # from local pkl cache
        new_weight = update_weight(current_weight, score)
        persist_weight(inv.skill, new_weight)  # update pkl + Supabase async
Tier 2: Embedding recomputation (scheduled, Prefect daily).
Once a day, a Prefect flow re-embeds all skills whose content has changed (mtime comparison) or whose trajectory weight diverged significantly from 1.0 in the previous week. The recomputation uses the updated `build_skill_embedding_text()` function which now incorporates the new top historical prompts accumulated since the last embedding.
# flows/feed-hub/skill_embedding_refresh.py
from prefect import flow

@flow(name="skill-embedding-refresh")
def refresh_skill_embeddings(force_all: bool = False) -> dict:
    """
    Daily: re-embed skills with changed content or significant weight drift.
    """
    registry = load_registry()
    stats = {"reembedded": 0, "skipped": 0}
    for skill_name, info in registry.get("skills", {}).items():
        if info.get("status") != "active":
            continue
        skill_md_path = SKILLS_DIR / skill_name / "SKILL.md"
        if not skill_md_path.exists():
            continue
        if not force_all:
            # Check if recomputation is needed
            current_emb = fetch_embedding_record(skill_name)
            if current_emb:
                skill_mtime = skill_md_path.stat().st_mtime
                emb_mtime = current_emb["updated_at"].timestamp()
                weight_drift = abs(current_emb["trajectory_weight"] - 1.0)
                if skill_mtime < emb_mtime and weight_drift < 0.15:
                    stats["skipped"] += 1
                    continue
        skill_text = build_skill_embedding_text(skill_name, skill_md_path.read_text())
        embedding = embed_via_ragpp(skill_text)  # POST to :8000
        upsert_skill_embedding(skill_name, embedding, skill_text)
        stats["reembedded"] += 1
    return stats
The Learning Loop Visualized
[User prompt arrives]
|
v
[ops_trigger_v2.py] -- embed prompt (RAG++ :8000, ~80ms)
|
v
[load skill_embeddings.pkl] -- in-memory, <1ms
|
v
[compute weighted similarities] -- 10 skills x 768 dims, numpy, <1ms
|
v
[inject top-k skill(s)] -- SKILL.md content injection
|
v
[session continues... tools called... bash exits... corrections...]
|
v
[Stop event fires]
|
v
[weight_updater.py] -- score trajectory, EMA update weights, persist
|
v
[weights.pkl updated] -- next prompt uses updated space
|
v (daily)
[skill_embedding_refresh Prefect flow] -- re-embed with historical prompts
Each cycle, the routing space is nudged toward observed outcomes. After 100 sessions, the top-k retrieval has been shaped by real outcomes rather than a programmer's regex intuition.
---
Section 4: Replacement of ops_trigger.py
New File: ops_trigger_v2.py
The new router replaces the regex matching loop with a vector similarity lookup. The 500ms SIGALRM budget is preserved — the critical path is the embedding API call (~80ms) plus cache load (<1ms) plus numpy dot product (<1ms).
#!/usr/bin/env python3
"""Ops-Trigger V2: learned vector routing for skill injection.

Replaces regex-based skill matching with trajectory-weighted cosine similarity.
Embedding model: Gemini text-embedding-004 via RAG++ gateway (:8000).
Weights stored in: [home-path]
Full design: Desktop/evo-cube-output/karl-trajectory-intelligence/stage1-path-d.md
Performance budget: <500ms (80ms embed + <5ms retrieval + <50ms load). SIGALRM hard cap.
"""
from __future__ import annotations

import json
import os
import pickle
import signal
import sys
import time
from pathlib import Path
from typing import Optional

import numpy as np

# Hard 500ms cap: exit silently if the hook overruns its budget
if __name__ == "__main__" or "pytest" not in sys.modules:
    signal.signal(signal.SIGALRM, lambda *_: sys.exit(0))
    signal.setitimer(signal.ITIMER_REAL, 0.5)

SKILLS_DIR = Path.home() / ".claude" / "skills"
CACHE_PATH = Path.home() / ".claude" / "cortex" / "skill_embeddings.pkl"
ENTRIES_FILE = Path.home() / ".claude" / "cortex" / "entries.jsonl"
RAG_EMBED_URL = "http://localhost:8000/api/rag/embed"  # SSH tunnel :8000

# Similarity thresholds
MIN_SIMILARITY = 0.60  # below this: no injection
MULTI_SKILL_THRESHOLD = 0.80  # above this for multiple skills: compose
TRAJECTORY_WEIGHT_FLOOR = 0.3  # safety floor to prevent total suppression
def _embed_prompt(text: str, cwd: str = "") -> Optional[np.ndarray]:
    """Embed a prompt text via RAG++ gateway. Returns None on failure."""
    import urllib.request
    payload = json.dumps({
        "text": f"[project:{Path(cwd).name}] {text}" if cwd else text
    }).encode()
    try:
        req = urllib.request.Request(
            RAG_EMBED_URL,
            data=payload,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=0.35) as resp:
            data = json.loads(resp.read())
        return np.array(data["embedding"], dtype=np.float32)
    except Exception:
        return None


def _load_embeddings() -> dict[str, dict]:
    """Load skill embeddings from local pickle cache."""
    if not CACHE_PATH.exists():
        return {}
    try:
        with open(CACHE_PATH, "rb") as f:
            return pickle.load(f)
    except Exception:
        return {}


def _cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors (normalized here, so inputs need not be unit length)."""
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return float(np.dot(a, b) / (norm_a * norm_b))
def _rank_skills(
    prompt_emb: np.ndarray,
    skill_embeddings: dict[str, dict],
) -> list[tuple[str, float]]:
    """
    Rank skills by trajectory-weighted cosine similarity.
    Returns list of (skill_name, weighted_score) sorted descending.
    """
    ranked = []
    for skill_name, record in skill_embeddings.items():
        emb = record.get("embedding")
        if emb is None:
            continue
        raw_sim = _cosine_sim(prompt_emb, np.array(emb, dtype=np.float32))
        weight = max(record.get("trajectory_weight", 1.0), TRAJECTORY_WEIGHT_FLOOR)
        weighted = raw_sim * weight
        ranked.append((skill_name, weighted))
    return sorted(ranked, key=lambda x: -x[1])


def _check_claims(skill_name: str) -> bool:
    """Check pane domain claims (unchanged from v1)."""
    claims_path = Path.home() / ".claude" / "state" / "pane_claims.json"
    if not claims_path.exists():
        return True
    try:
        with open(claims_path) as f:
            claims = json.load(f)
    except Exception:
        return True
    my_tty = os.environ.get("TTY", "")
    # Absolute import: a relative import fails when this file runs as a script.
    # Assumes ops_trigger.py (v1) sits in the same directory.
    from ops_trigger import _load_registry
    registry = _load_registry()
    domains = registry.get("skills", {}).get(skill_name, {}).get("domains", [])
    domain = domains[0] if domains else None
    if not domain:
        return True
    for claim in claims.get("active", []):
        if claim.get("domain") == domain and claim.get("tty") != my_tty:
            return False
    return True
def _load_skill_content(skill_name: str) -> Optional[str]:
    """Load SKILL.md body (unchanged from v1)."""
    skill_path = SKILLS_DIR / skill_name / "SKILL.md"
    if not skill_path.exists():
        return None
    content = skill_path.read_text()
    # Strip YAML frontmatter delimited by leading and closing "---"
    if content.startswith("---"):
        end = content.find("---", 3)
        if end > 0:
            content = content[end + 3:].strip()
    return content


def _write_invocation(
    skill_name: str,
    prompt: str,
    similarity: float,
    composed_with: Optional[list[str]] = None,
) -> None:
    """Write invocation record with similarity metadata."""
    import uuid
    from datetime import datetime, timezone
    entry = {
        "id": uuid.uuid4().hex[:8],
        "type": "invocation_record",
        "ts": datetime.now(timezone.utc).isoformat(),
        "machine": "mac1",
        "pane": os.environ.get("TTY", ""),
        "skill": skill_name,
        "trigger_prompt": prompt[:100],
        "routing_method": "embedding_v2",
        "similarity_score": round(similarity, 4),
        "composed_with": composed_with or [],
    }
    ENTRIES_FILE.parent.mkdir(parents=True, exist_ok=True)
    with open(ENTRIES_FILE, "a") as f:
        f.write(json.dumps(entry, separators=(",", ":")) + "\n")
def main():
    raw = sys.stdin.read()
    try:
        hook_input = json.loads(raw)
    except (json.JSONDecodeError, EOFError):
        sys.exit(0)
    prompt = hook_input.get("prompt", "")
    cwd = hook_input.get("cwd", "")
    if not prompt or len(prompt) < 5:
        sys.exit(0)
    # Step 1: Embed the prompt
    prompt_emb = _embed_prompt(prompt, cwd)
    if prompt_emb is None:
        # Embedding failed (timeout/network): fall back to legacy regex.
        # stdin was consumed above, so replay it for v1's main()
        # (assumes v1 reads the hook payload from sys.stdin, as v2 does,
        # and that ops_trigger.py is importable from this directory).
        import io
        from ops_trigger import main as legacy_main
        sys.stdin = io.StringIO(raw)
        legacy_main()
        return
    # Step 2: Load skill embeddings from cache
    skill_embeddings = _load_embeddings()
    if not skill_embeddings:
        sys.exit(0)  # cold start: no embeddings yet
    # Step 3: Rank by weighted similarity
    ranked = _rank_skills(prompt_emb, skill_embeddings)
    if not ranked or ranked[0][1] < MIN_SIMILARITY:
        sys.exit(0)  # no skill is relevant enough
    # Step 4: Determine injection strategy
    top_skill, top_score = ranked[0]
    if not _check_claims(top_skill):
        sys.exit(0)
    # Check for multi-skill composition
    composed = []
    for skill_name, score in ranked[1:3]:  # consider 2nd and 3rd
        if score >= MULTI_SKILL_THRESHOLD and _check_claims(skill_name):
            composed.append((skill_name, score))
    # Step 5: Inject
    if not composed:
        # Single skill injection
        content = _load_skill_content(top_skill)
        if content:
            print(f"<system-reminder>\n[Cortex v2] Skill: {top_skill} (sim={top_score:.3f})\n{content}\n</system-reminder>")
            _write_invocation(top_skill, prompt, top_score)
    else:
        # Multi-skill composition
        _inject_composite(top_skill, top_score, composed, prompt)
    sys.exit(0)
def _inject_composite(
    primary: str,
    primary_score: float,
    secondary: list[tuple[str, float]],
    prompt: str,
) -> None:
    """Inject a composite skill context for multi-domain prompts."""
    parts = [f"[Cortex v2] Multi-skill: {primary} (sim={primary_score:.3f})"]
    all_skills = [primary]
    primary_content = _load_skill_content(primary)
    if primary_content:
        parts.append(f"## {primary}\n{primary_content}")
    for skill_name, score in secondary:
        content = _load_skill_content(skill_name)
        if content:
            # Extract only the Gotchas section for secondary skills to avoid bloat
            gotchas = _extract_section(content, "## Gotchas")
            if gotchas:
                parts.append(f"## {skill_name} — Gotchas\n{gotchas}")
                all_skills.append(skill_name)
    combined = "\n\n".join(parts)
    print(f"<system-reminder>\n{combined}\n</system-reminder>")
    for skill_name in all_skills:
        score = primary_score if skill_name == primary else dict(secondary)[skill_name]
        _write_invocation(skill_name, prompt, score, composed_with=all_skills)


def _extract_section(content: str, section_header: str) -> str:
    """Extract a markdown section by header."""
    start = content.find(section_header)
    if start == -1:
        return ""
    end = content.find("\n## ", start + len(section_header))
    if end == -1:
        return content[start:]
    return content[start:end]


if __name__ == "__main__":
    main()
Migration Strategy
The migration from v1 to v2 is zero-downtime via a feature flag in the registry:
"router": {
  "version": "v1",      // change to "v2" when embeddings are bootstrapped
  "fallback": "regex",  // v2 falls back to v1 on embedding failure
  "min_similarity": 0.60
}
Phase 1: Run both v1 and v2 in shadow mode. v2 logs its decisions to entries.jsonl (routing_method: "embedding_v2") but does not inject. Compare v1 and v2 selections.
Phase 2: Enable v2 with regex fallback.
Phase 3: Disable regex fallback once the similarity score distribution stabilizes.
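The Phase 1 comparison could be as simple as an agreement rate over prompts that both routers logged. This is a rough sketch under stated assumptions: records are paired by their `trigger_prompt` text (a real comparison would more likely pair on session and timestamp), v1 records are assumed to carry `routing_method: "regex_v1"`, and `shadow_agreement` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Optional


def shadow_agreement(records: list[dict]) -> Optional[float]:
    """Fraction of prompts where v1 and v2 selected the same skill.

    Only prompts logged by both routers are compared; returns None
    when there is no overlap.
    """
    by_prompt: dict[str, dict[str, str]] = defaultdict(dict)
    for rec in records:
        by_prompt[rec["trigger_prompt"]][rec["routing_method"]] = rec["skill"]
    paired = [
        choice for choice in by_prompt.values()
        if {"regex_v1", "embedding_v2"} <= set(choice)
    ]
    if not paired:
        return None
    agree = sum(1 for c in paired if c["regex_v1"] == c["embedding_v2"])
    return agree / len(paired)


# Made-up sample records for illustration only
sample = [
    {"trigger_prompt": "fix the deploy script", "routing_method": "regex_v1", "skill": "ops:debug"},
    {"trigger_prompt": "fix the deploy script", "routing_method": "embedding_v2", "skill": "ops:deploy"},
    {"trigger_prompt": "commit and push", "routing_method": "regex_v1", "skill": "ops:git"},
    {"trigger_prompt": "commit and push", "routing_method": "embedding_v2", "skill": "ops:git"},
    {"trigger_prompt": "portal throwing a 502", "routing_method": "embedding_v2", "skill": "ops:monitoring"},
]
print(shadow_agreement(sample))
```

Disagreements are the interesting cases to review by hand. Prompts that only v2 routed, like the last sample row, are where vocabulary gaps in the regex show up.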
---
Section 5: Cold Start Bootstrap
The system has 324 invocation records in entries.jsonl. They are the bootstrap dataset. However, 319 of these are registration events ("forge:register ops:ios") not real prompt matches — the registry shows `invocations: 0` for all active skills. The real trigger invocations have not been reliably recorded yet because the trigger pattern was too narrow.
Bootstrap Algorithm
# [home-path]
# Run once to initialize skill_embeddings.pkl
import pickle
from datetime import datetime, timezone

def bootstrap_from_invocation_records():
    """
    Phase 1: Embed all 13 active skills using current SKILL.md content
             + historical prompts from prompts-all.jsonl.
    Phase 2: Set initial trajectory weights from domain keyword frequency
             (a cheap proxy for "how often does this domain matter here").
    Phase 3: Persist to pkl + pgvector.
    """
    # Phase 1: Gather top prompts per domain
    prompts = load_all_prompts()
    domain_prompts = group_prompts_by_domain(prompts)  # uses DOMAIN_KEYWORDS from extractor.py
    skill_records = {}
    for skill_name, info in load_active_skills().items():
        skill_md = (SKILLS_DIR / skill_name / "SKILL.md").read_text()
        domain = info.get("domains", ["debug"])[0]
        historical = domain_prompts.get(domain, [])[:5]
        embedding_text = build_skill_embedding_text(skill_name, skill_md, historical)
        embedding = embed_via_ragpp(embedding_text)
        # Phase 2: Initial weight from domain frequency
        # Skills invoked more often start with slight advantage
        domain_count = len(domain_prompts.get(domain, []))
        initial_weight = 1.0 + min(domain_count / 1000, 0.2)  # max 1.2 initial advantage
        skill_records[skill_name] = {
            "embedding": embedding.tolist(),
            "embedding_text": embedding_text,
            "trajectory_weight": initial_weight,
            "version": 1,
            "bootstrapped_at": datetime.now(timezone.utc).isoformat(),
        }
    # Phase 3: Persist
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(skill_records, f)
    upsert_all_to_pgvector(skill_records)
    print(f"Bootstrap complete: {len(skill_records)} skills embedded")
    return skill_records


def group_prompts_by_domain(prompts: list[dict]) -> dict[str, list[str]]:
    """Group prompt texts by the operational domain they touch."""
    from cortex.forge.extractor import DOMAIN_KEYWORDS, _detect_domains
    result = {d: [] for d in DOMAIN_KEYWORDS}
    for p in prompts:
        for domain in _detect_domains(p["text"]):
            if len(result[domain]) < 50:
                result[domain].append(p["text"])
    return result
Cold Start Performance Estimate
| Skill | Domain Prompts Available | Initial Weight Estimate |
|---|---|---|
| ops:debug | 110 (from Stage 0 research) | 1.11 |
| ops:deploy | 79 | 1.08 |
| ops:ios | 65 | 1.065 |
| ops:supabase | 50 | 1.05 |
| ops:monitoring | 35 | 1.035 |
| ops:git | 21 | 1.021 |
| ops:prefect | 19 | 1.019 |
| ops:docker | ~15 | 1.015 |
| ops:asc | ~10 | 1.01 |
| ops:mesh | ~8 | 1.008 |
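After bootstrap, the routing space already reflects the empirical frequency distribution of this operator's work, without any training. The expectation is that this already improves on regex from day one, because prompts with no keyword overlap can still land near a relevant skill.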
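The table values follow directly from the bootstrap formula `1.0 + min(domain_count / 1000, 0.2)`. A small sketch to reproduce them, where `initial_weight` is a helper name introduced here for illustration:

```python
def initial_weight(domain_count: int) -> float:
    """Initial trajectory weight: +0.001 per domain prompt, capped at 1.2."""
    return 1.0 + min(domain_count / 1000, 0.2)


# Prompt counts taken from the cold-start table
for skill, count in [("ops:debug", 110), ("ops:deploy", 79), ("ops:ios", 65)]:
    print(skill, round(initial_weight(count), 3))
```

Note that 79 prompts gives 1.079, which the table rounds to 1.08. The 1.2 cap only applies once a domain has 200 or more prompts, so none of the current skills reach it.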
---
Section 6: Multi-Skill Composition
When two or more skills exceed the composition threshold (0.80 weighted similarity), we inject a composite context. The design challenge is preventing context bloat: injecting 3 full SKILL.md files (~600 lines each) would overwhelm the session context.
Composition Strategy
Primary skill (highest similarity): FULL SKILL.md content injected
Secondary skills (>0.80 threshold): GOTCHAS SECTION ONLY injected
Tertiary and beyond: SKILL NAME MENTIONED ONLY ("also relevant: ops:deploy")
The gotchas are the highest-value section for cross-domain prompts. When someone says "fix the Prefect deploy for the Spore Supabase migration," they need the full ops:prefect workflow, plus the deploy gotchas (SSH heredoc, VM hang recovery), plus the supabase gotcha (anon key behavior). They don't need three full workflows.
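One way to read the three tiers is as a small planning function over the ranked list. This is a sketch, not the router's actual logic: `plan_injection` and `MAX_SECONDARY` are hypothetical names, and treating "tertiary" as any above-threshold skill beyond the first two secondaries is an assumption that mirrors the `ranked[1:3]` slice in `ops_trigger_v2.py`.

```python
MULTI_SKILL_THRESHOLD = 0.80
MAX_SECONDARY = 2  # assumption: mirrors ranked[1:3] in the v2 router


def plan_injection(ranked: list[tuple[str, float]]) -> list[tuple[str, str]]:
    """Map a descending (skill, score) list to (skill, injection level) pairs.

    Levels: "full" for the primary, "gotchas" for secondaries,
    "mention" for any further skills above the threshold.
    """
    if not ranked:
        return []
    plan = [(ranked[0][0], "full")]
    above = [name for name, score in ranked[1:] if score >= MULTI_SKILL_THRESHOLD]
    for index, name in enumerate(above):
        level = "gotchas" if index < MAX_SECONDARY else "mention"
        plan.append((name, level))
    return plan


# Made-up scores for illustration
ranked = [
    ("ops:prefect", 0.87),
    ("ops:deploy", 0.83),
    ("ops:supabase", 0.81),
    ("ops:docker", 0.80),
    ("ops:git", 0.50),
]
print(plan_injection(ranked))
```

Skills below the threshold, like `ops:git` here, are dropped entirely rather than mentioned.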
Composite Injection Format
[Cortex v2] Multi-skill: ops:prefect (sim=0.87) + ops:deploy (sim=0.83) + ops:supabase (sim=0.81)
## ops:prefect
[full SKILL.md content]
## ops:deploy — Gotchas
- SSH heredoc: Mangles ${...} — write scripts locally, scp to VM, then execute
- VM hangs: ssh cloud-vm hangs? → gcloud compute instances reset...
[gotchas only]
## ops:supabase — Gotchas
- Anon key allows writes: RLS policies use auth.role() = 'authenticated' but anon key works for CRUD...
[gotchas only]
Composition Threshold Tuning
The 0.80 threshold is a starting value. After 50 composite injections, we can analyze whether composite sessions have higher success rates than single-skill sessions. If composite sessions show more corrections, raise the threshold to 0.85. If they show fewer corrections, lower it to 0.75. This tuning happens in the daily Prefect refresh flow.
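The tuning rule above can be written down as a small decision function. This is a sketch under stated assumptions: `tune_threshold` and its parameter names are hypothetical, and "success rate" is approximated by the correction rate, which is the only comparison the rule describes.

```python
def tune_threshold(
    current: float,
    composite_corrections: int,
    composite_total: int,
    single_corrections: int,
    single_total: int,
    min_composites: int = 50,
) -> float:
    """Return the composition threshold to use for the next period.

    Leaves the threshold unchanged until enough composite injections
    have been observed to compare correction rates.
    """
    if composite_total < min_composites or single_total == 0:
        return current
    composite_rate = composite_corrections / composite_total
    single_rate = single_corrections / single_total
    if composite_rate > single_rate:
        return 0.85  # composites hurt: require stronger evidence to compose
    if composite_rate < single_rate:
        return 0.75  # composites help: compose more readily
    return current


print(tune_threshold(0.80, 12, 50, 30, 200))  # composite 24% vs single 15%
print(tune_threshold(0.80, 3, 20, 30, 200))   # too few composites so far
```

With only 50 samples the two rates will be noisy, so a margin or significance check before moving the threshold may be worth adding.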
---
Section 7: Infrastructure
Storage Map
| Component | Location | Size Estimate | Purpose |
|---|---|---|---|
| `skill_embeddings.pkl` | `[home-path]` | ~30KB as float32 (10 skills × 768 dims × 4 bytes), ~60KB as float64 | Hot cache, loaded per hook call |
| `skill_embeddings` table | Supabase / cloud-vm pgvector | 10 rows × 768 vector | Persistent source of truth, daily sync |
| `entries.jsonl` | `[home-path]` | Growing, currently 399 lines | Invocation records with similarity_score and routing_method fields |
| `weight_log.jsonl` | `[home-path]` | ~1KB/day | Audit trail for weight updates, enables replay/rollback |
New Fields Added to entries.jsonl Schema
Existing invocation_record type gains two fields (backward compatible):
{
"type": "invocation_record",
"skill": "ops:prefect",
"trigger_prompt": "deploy the flow",
"routing_method": "embedding_v2", // NEW — "regex_v1" | "embedding_v2" | "fallback_v1"
"similarity_score": 0.847, // NEW — cosine sim after trajectory weighting
"composed_with": ["ops:deploy"] // NEW — other skills in composite injection
}
No schema migrations are required in Supabase for this: these are additive fields in a JSONL store.
New table (one addition to the existing 141 Supabase tables):
CREATE TABLE skill_embeddings (
skill_name TEXT PRIMARY KEY,
embedding vector(768),
embedding_text TEXT,
trajectory_weight FLOAT NOT NULL DEFAULT 1.0,
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
version INT NOT NULL DEFAULT 1,
invocation_count INT NOT NULL DEFAULT 0,
success_count INT NOT NULL DEFAULT 0,
failure_count INT NOT NULL DEFAULT 0
);

Computational Cost
The only new compute cost on the critical path is the embedding API call:
| Component | Cost | Frequency |
|---|---|---|
| Gemini embed (RAG++ :8000) | ~80ms, negligible $ | Per prompt that reaches ops_trigger_v2 |
| pkl cache load | <1ms | Per trigger |
| numpy dot product (10 skills) | <1ms | Per trigger |
| Weight update (EMA float ops) | <1ms | Per Stop event with invocations |
| Prefect daily re-embed | 10 API calls/day | Daily |
The full critical path is ~85ms out of a 500ms budget. Comfortable.
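The ranking step in the table is small enough to sketch directly. The version below is written in plain Python so it runs without dependencies; with numpy the loop would collapse into one matrix-vector product. It assumes the trajectory-weighted score is cosine similarity multiplied by the skill's trajectory weight, and `rank_skills` is an illustrative name, not the actual `_rank_skills()` implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_skills(prompt_vec, skills):
    """Rank skills by weighted similarity, best first.

    skills maps skill name -> (embedding, trajectory_weight).
    Assumption: score = cosine similarity * trajectory weight.
    """
    scored = [
        (name, cosine(prompt_vec, embedding) * weight)
        for name, (embedding, weight) in skills.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy 3-dimensional vectors stand in for the 768-dimensional embeddings.
skills = {
    "ops:debug": ([1.0, 0.0, 0.0], 0.5),   # closest by text, weak track record
    "ops:deploy": ([0.8, 0.6, 0.0], 1.0),  # slightly further, neutral weight
}
for name, score in rank_skills([1.0, 0.0, 0.0], skills):
    print(name, round(score, 3))
```

In the toy data the textually closest skill loses to a slightly more distant one because of its lower trajectory weight, which is the behaviour the weighting is meant to produce.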
Fallback Safety
If RAG++ is down (SSH tunnel dropped, Docker container restarting), the embedding call returns None. The router immediately falls back to `ops_trigger.py` (v1 regex). The system degrades gracefully. The Stop hook's weight updater also short-circuits on error without any user-visible effect.
---
Section 8: Comparison with Full KARL
This design is explicitly "KARL for routing" — not "KARL for reasoning." The distinction matters.
| Dimension | Full KARL (Databricks) | Path D (Skill Embeddings) |
|---|---|---|
| What is trained | Model weights (OAPL gradient updates) | Routing weight scalars (EMA updates) |
| Training data | Synthetic multi-step rollouts (1000+ examples) | Existing invocation records (324, growing) |
| Compute required | GPU cluster, 8 rollouts per prompt | Single M1 Mac, EMA float updates |
| Latency impact | Zero (weights are model weights, no extra inference) | +80ms per prompt for embedding |
| What improves | The model's ability to formulate search queries | Which static skill gets injected |
| Learning algorithm | OAPL (KL-regularized RL with regression loss) | EMA weight update from outcome signals |
| Trajectory depth | 50-200 tool-use steps per trajectory | 1 invocation → 1-5 outcome prompts |
| Coverage | Any text generation task | Only tasks where skills exist |
| Maximum lift | Closes 52.6 → 67.5 (KARL's result with GLM 4.5) | Eliminates ~30-40 |
Path D's theoretical ceiling: The static SKILL.md content is not improved; only the selection mechanism is. If ops:debug's workflow is wrong, better routing to ops:debug still injects wrong content. Path D is a prerequisite for the deeper KARL integration, not a replacement for it (Paths A, B, and C involve trajectory-based skill rewriting or weight updates).
Path D's practical advantage: It is buildable in one sprint, runs on existing infrastructure, and produces measurable results immediately. The 324 invocation records provide a bootstrap dataset, the RAG++ embedding infrastructure is live, and the EMA weight update is 10 lines of code. The output is a new artifact (skill_embeddings.pkl) that can be extended by future paths.
---
Section 9: Risks
Risk 1: Embedding Drift
What it is: The space in which skills and prompts are embedded shifts when the embedding model is updated. If Gemini rolls `text-embedding-004` to a new version, all vectors become incompatible.
Likelihood: Medium. Gemini embedding models are relatively stable, but not guaranteed.
Mitigation: Store `embedding_model_version` in each skill record (the `skill_embeddings` table definition above does not yet include this column). On drift detection (cosine similarity between the old and new embedding of a known anchor text drops below 0.90), trigger a full re-embed of all skills. The daily Prefect flow includes this check.
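The drift check itself reduces to one cosine comparison against a stored anchor vector. A minimal sketch, assuming the previous anchor embedding is persisted somewhere the refresh flow can read it (`needs_reembed` is a hypothetical helper name):

```python
import math

DRIFT_THRESHOLD = 0.90  # from the mitigation above


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def needs_reembed(previous_anchor_vec, current_anchor_vec,
                  threshold=DRIFT_THRESHOLD):
    """True when the anchor text's embedding has moved past the threshold."""
    return cosine(previous_anchor_vec, current_anchor_vec) < threshold


# Toy 2-dimensional vectors stand in for real 768-dimensional embeddings.
print(needs_reembed([1.0, 0.0], [0.9, 0.1]))  # small shift
print(needs_reembed([1.0, 0.0], [0.0, 1.0]))  # orthogonal: model changed
```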
ANCHOR_TEXT = "deploy the Prefect flow to cloud-vm"
# Compare embed(ANCHOR_TEXT) monthly. If cos_sim < 0.90 vs. previous: re-embed all.

Risk 2: Cold Start for New Skills
What it is: A newly forged skill has no trajectory weight history. By default it starts at `w = 1.0` with no successful invocations. If its initial embedding lands close to many prompts, it may get selected too aggressively before it has proven itself.
Mitigation: New skills start at `trajectory_weight = 0.85` (slightly below neutral). They must accumulate positive signals before competing equally with established skills. The weight floor (0.3) and ceiling (1.5) prevent extremes.
Risk 3: Catastrophic Forgetting of Long-Tail Scenarios
What it is: EMA gives recent outcomes more weight than old ones. A skill that was heavily used (and succeeded) 60 days ago but not used recently will have its weight decayed toward 1.0. If it suddenly becomes relevant again, the routing space has forgotten its strength.
Mitigation: The weight decay is gentle (EMA alpha = 0.1 means 10
Risk 4: Success Signal Noise
What it is: The success indicators ("commit this", "looks good") are weak proxies. A user can say "commit this" after a session where the injected skill was irrelevant to the task — the success was not because of the skill, but despite it. This creates false positive weight updates.
Mitigation: The success signal only contributes +0.5 (not +1.0) to the EMA update. Correction signals contribute -1.0 (full weight). This asymmetry means the system is skeptical of positive signals and quick to penalize negative ones, which matches the real use case (it's easier to verify a failure than a success).
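The asymmetry can be made concrete with a small sketch. The exact update rule is defined elsewhere in this design; the version below is one plausible reading and rests on an explicit assumption: each signal is turned into a target weight of `1.0 + signal`, and the EMA moves the current weight toward that target. The constants come from this section (alpha 0.1, bounds 0.3 and 1.5, new-skill weight 0.85).

```python
ALPHA = 0.1
WEIGHT_FLOOR, WEIGHT_CEILING = 0.3, 1.5
NEW_SKILL_WEIGHT = 0.85      # cold-start value from Risk 2
SUCCESS_SIGNAL = 0.5         # weak positive evidence
CORRECTION_SIGNAL = -1.0     # strong negative evidence


def update_weight(weight, signal, alpha=ALPHA):
    """One EMA step toward an assumed target of 1.0 + signal, then clamp."""
    target = 1.0 + signal  # assumption: neutral weight 1.0 shifted by the signal
    blended = (1.0 - alpha) * weight + alpha * target
    return max(WEIGHT_FLOOR, min(WEIGHT_CEILING, blended))


w = 1.0
print(round(update_weight(w, SUCCESS_SIGNAL), 3))     # moves up by 0.05
print(round(update_weight(w, CORRECTION_SIGNAL), 3))  # moves down by 0.10
```

Under this reading a single correction moves a neutral weight twice as far as a single success, and the success target coincides with the 1.5 ceiling.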
Risk 5: Bootstrap Data Quality
What it is: The 324 existing invocation records are mostly registration events, not real trigger events. The actual trigger prompt data is sparse because the regex was too narrow to trigger frequently.
Mitigation: Bootstrap weights from domain keyword frequency (the `group_prompts_by_domain()` approach from Section 5) rather than from invocation counts. This gives a reasonable prior based on 903 real prompts without requiring perfect invocation data. After 1-2 weeks of v2 operation, real invocation data will dominate.
Risk 6: RAG++ Dependency in Hook Critical Path
What it is: ops_trigger_v2 now depends on a network call to localhost:8000 (RAG++ via SSH tunnel). If the tunnel is down, the hook has 350ms to detect the failure and fall back to v1. The SSH tunnel is a LaunchAgent-managed process that reconnects automatically, but brief outages are possible.
Mitigation: The fallback to v1 regex is already designed in. The urllib timeout is set to 0.35s — well within the SIGALRM budget. All failures are silent (sys.exit(0)) from the hook's perspective; v1 regex fires if v2 times out. No user-visible impact.
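A minimal sketch of this fallback path follows. The endpoint path, the request body shape, and the `embedding` response key are assumptions (the checklist still lists the embed endpoint as something to add or confirm), and `route_v1` / `route_v2` are placeholders for the real routers.

```python
import json
import urllib.request

EMBED_URL = "http://localhost:8000/api/rag/embed"  # assumed endpoint path
EMBED_TIMEOUT_S = 0.35                              # from the mitigation above


def embed_prompt(prompt, url=EMBED_URL, timeout=EMBED_TIMEOUT_S):
    """Return the embedding vector, or None on any failure."""
    body = json.dumps({"text": prompt}).encode("utf-8")  # assumed request shape
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read())["embedding"]  # assumed key
    except Exception:
        # Tunnel down, timeout, bad payload: all mean "use v1", silently.
        return None


def route(prompt, route_v1, route_v2, embed=embed_prompt):
    """Try embedding routing first; fall back to the regex router."""
    vector = embed(prompt)
    if vector is None:
        return "fallback_v1", route_v1(prompt)
    return "embedding_v2", route_v2(vector)


# Stub routers and a failing embedder show the degraded path.
method, skill = route(
    "the flow crashed",
    route_v1=lambda prompt: "ops:debug",
    route_v2=lambda vector: "ops:prefect",
    embed=lambda prompt: None,
)
print(method, skill)
```

Note that the `timeout` argument of `urlopen` bounds connection setup and each blocking read rather than the whole request, so the SIGALRM budget remains the hard outer limit.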
---
Implementation Checklist
### Phase 1: Bootstrap (1-2 days)
- [ ] Write `[home-path]` using design from Section 5
- [ ] Add `/api/rag/embed` endpoint to RAG++ gateway (or confirm existing embedding endpoint)
- [ ] Run bootstrap, verify `skill_embeddings.pkl` created with 13 skill vectors
- [ ] Create `skill_embeddings` Supabase table
- [ ] Unit test `_rank_skills()` with mock embeddings — verify weighted ordering
### Phase 2: Shadow Mode (3-5 days)
- [ ] Write `ops_trigger_v2.py` using design from Section 4
- [ ] Add routing_method + similarity_score + composed_with to invocation_record writes
- [ ] Run v2 alongside v1: v2 logs decisions, v1 continues injecting
- [ ] Analyze: do v2 selections match v1 on high-confidence prompts? Where do they diverge?
- [ ] Calibrate MIN_SIMILARITY threshold from distribution of observed similarity scores
### Phase 3: Live Routing (1 week)
- [ ] Enable v2 with regex fallback (feature flag in registry.json router section)
- [ ] Write `weight_updater.py` Stop hook component
- [ ] Verify weight updates are landing in pkl and Supabase
- [ ] Write `skill_embedding_refresh` Prefect flow
- [ ] Monitor: invocation_record similarity_score distribution, correction rate before/after
### Phase 4: Evaluation (ongoing)
- [ ] A/B metric: correction rate in sessions with embedding-routed skills vs. regex-routed
- [ ] A/B metric: task completion time before/after (from session duration in unified.jsonl)
- [ ] Dashboard: add embedding routing stats to Nexus Portal /skills page
- [ ] Tune MULTI_SKILL_THRESHOLD based on composite injection outcomes
---
Connection to Other KARL Paths
Path D (Skill Embeddings) is the foundation layer for the deeper KARL integration paths. It produces infrastructure that the other paths depend on:
- Path A (Trajectory Recorder): Needs invocation records enriched with similarity_score to know which skills were active during a trajectory. Path D adds exactly this.
- Path B (OAPL on Mac5): The trained model on Mac5 could emit embeddings instead of, or in addition to, text, learning "predict which skill embedding is nearest for this prompt" as a function. Path D's infrastructure is the ground truth for this training target.
- Path C (Self-Play Synthesis): Synthetic training examples should include skill routing decisions. Path D's routing log becomes the supervision signal for self-play: "for this type of prompt, ops:prefect was the right skill."
The ordering is: Path D first (routing infrastructure), then Path A (trajectory collection enriched by D), then Path B or C (model-level learning that uses D+A data).
---
Sources
### Codebase Files Read
- `[home]/.claude/cortex/router/ops_trigger.py` (233 lines)
- `[home]/.claude/cortex/models.py` (169 lines)
- `[home]/.claude/cortex/forge/extractor.py` (287 lines)
- `[home]/.claude/cortex/forge/generator.py` (178 lines)
- `[home]/.claude/cortex/decay/detector.py` (251 lines)
- `[home]/.claude/cortex/adaptation/frequency_tracker.py` (200+ lines)
- `[home]/.claude/cortex/entries.jsonl` (399 entries: 324 invocation_records, 75 decay_flags)
- `[home]/.claude/skills/registry.json` (88 skills, 13 active)
- `[home]/.claude/skills/ops:debug/SKILL.md` (Gen 2 enriched)
- `[home]/.claude/skills/ops:deploy/SKILL.md` (Gen 2 enriched)
### Stage 0 Research
- `[home]/Desktop/evo-cube-output/karl-trajectory-intelligence/stage0-research.md` (504 lines)
- Section 1: Cortex system architecture with exact file line references
- Section 2: Skills infrastructure (88 skills, 13 active, Gen 1 vs Gen 2)
- Section 3: Hooks architecture and existing tool-use recording
- Section 4: Mac5 fine-tune pipeline
- Section 5: RAG++ / pgvector search infrastructure
- Section 6: Evolution World architecture
- Section 7: KARL paper (OAPL algorithm, synthetic data pipeline, performance tables)
- Section 8: Hard constraints (500ms hook budget, Mac5 16GB, 141 Supabase tables)
### Key Design Decisions Derived From Research
- Gemini text-embedding-004 via RAG++ gateway: Avoids new infrastructure; reuses existing Supabase pgvector
- EMA weight updates (alpha=0.1): Chosen to resist noise while responding to trends, bounded in [0.3, 1.5]
- Correction as primary negative signal: Already detected by correction_detector.py, strongest reliable signal
- Bootstrap from domain keyword frequency: Compensates for sparse invocation record data (most are registration events)
- Regex fallback on embed failure: Preserves availability, allows gradual rollout
Source Anchor
evo-cube-output/karl-trajectory-intelligence/stage1-path-d.md