Harness Skills Layer
The harness skills layer turns executable benchmark deltas into evidence-bound skill packages. It is the local implementation of the useful parts of SkillDAG, SkillOpt, and MUSE-style memory packaging without making an unsafe claim that a failed adapter should be routed automatically.
The key binary is `skillgraph-evolve`.
cargo run --bin skillgraph-evolve -- \
--public-tasks examples/evaluation/executable-public-tasks-python-stdlib-heldout-v1.jsonl \
--task-specs examples/evaluation/executable-taskset-python-stdlib-heldout-v1.jsonl \
--baseline-report benchmarks/executable-task-mlx-gemma4-e2b-qat-base-heldout-v1-mac5-2026-06-10.json \
--comparison-report benchmarks/executable-task-mlx-gemma4-e2b-reward-selected-512x4096-rawpython-heldout-v1-mac5-2026-06-10.json \
--output-dir examples/skills/python-stdlib-heldout-v1/e2b-reward-selected-vs-base

What It Builds
`skillgraph-evolve` reads:
- public task prompts
- canonical task specs
- one baseline `executable-task-bench` report
- one comparison `executable-task-bench` report
It emits:
- `trajectory-skills.jsonl`: one structured row per extracted skill family
- `skill-graph.json`: typed graph nodes and edges
- `router-index.json`: regression-gated routing index
- `skillgraph-evolution-report.json`: aggregate comparison report
- `packages/<skill_id>/SKILL.md`: human-readable activation boundary
- `packages/<skill_id>/MEMORY.md`: compact task evidence memory
- `packages/<skill_id>/tests.jsonl`: task-level evidence rows
- `packages/<skill_id>/failure_modes.json`: quarantine and diagnostic metadata
- `packages/<skill_id>/skill.json`: full structured package
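The package layout listed above can be checked mechanically before a package is consumed. The following is a minimal sketch, not part of the harness: the function name `missing_package_files` is hypothetical, and only the five file names come from the listing.

```python
from pathlib import Path

# File names taken from the package listing above.
PACKAGE_FILES = (
    "SKILL.md",
    "MEMORY.md",
    "tests.jsonl",
    "failure_modes.json",
    "skill.json",
)


def missing_package_files(package_dir):
    """Return the expected package files that are absent from package_dir.

    Hypothetical helper: it checks file presence only, not file contents.
    """
    root = Path(package_dir)
    return [name for name in PACKAGE_FILES if not (root / name).is_file()]
```

An empty return value means all five files are present under the given `packages/<skill_id>/` directory.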
SkillDAG Mapping
The graph encodes four edge types:
| Edge | Meaning |
|---|---|
| `depends_on` | Skill evidence belongs to a specific task set |
| `specializes` | Skill applies to a task family such as `path`, `date`, or `parse` |
| `repairs` | Comparison passed a task that the baseline failed |
| `conflicts_with` | Comparison failed a task that the baseline passed |
This is deliberately a harness-side graph, not a model-side promise. The graph records where a trajectory delta helped, where it hurt, and which families need repair before routing.
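To make the four edge types concrete, here is a minimal sketch of how they could be derived from per-task outcomes. The dictionary shape, the `source`/`target` keys, and the function name `build_edges` are assumptions for illustration; the actual `skill-graph.json` schema is not shown in this document.

```python
def build_edges(skill_id, task_set, family, outcomes):
    """Derive typed edges for one skill.

    outcomes maps task_id -> (baseline_passed, comparison_passed).
    The edge dict layout is hypothetical, not the real skill-graph.json schema.
    """
    edges = [
        # Skill evidence belongs to a specific task set.
        {"type": "depends_on", "source": skill_id, "target": task_set},
        # Skill applies to one task family such as path, date, or parse.
        {"type": "specializes", "source": skill_id, "target": family},
    ]
    for task_id, (base_ok, comp_ok) in sorted(outcomes.items()):
        if comp_ok and not base_ok:
            # Comparison passed a task that the baseline failed.
            edges.append({"type": "repairs", "source": skill_id, "target": task_id})
        elif base_ok and not comp_ok:
            # Comparison failed a task that the baseline passed.
            edges.append({"type": "conflicts_with", "source": skill_id, "target": task_id})
    return edges
```

Tasks where both runs agree produce no `repairs` or `conflicts_with` edge in this sketch, which matches the idea that the graph records only where the trajectory delta helped or hurt.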
SkillOpt Mapping
The optimizer signal is the pass/fail delta:
- `fixed`: baseline failed, comparison passed
- `regressed`: baseline passed, comparison failed
- `preserved_pass`: both passed
- `shared_fail`: both failed
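The four labels above form a simple truth table over two booleans. A minimal sketch of the classification and the resulting net delta follows; the function names and the summary dict layout are hypothetical, only the four labels come from the list.

```python
from collections import Counter


def classify_delta(baseline_passed, comparison_passed):
    """Map one task's pass/fail pair to its delta label."""
    if baseline_passed:
        return "preserved_pass" if comparison_passed else "regressed"
    return "fixed" if comparison_passed else "shared_fail"


def summarize(outcomes):
    """Count labels over (baseline_passed, comparison_passed) pairs.

    net_pass_delta is fixed minus regressed, since those are the only
    labels that change the pass count between the two runs.
    """
    counts = Counter(classify_delta(b, c) for b, c in outcomes)
    return {
        "fixed": counts["fixed"],
        "regressed": counts["regressed"],
        "preserved_pass": counts["preserved_pass"],
        "shared_fail": counts["shared_fail"],
        "net_pass_delta": counts["fixed"] - counts["regressed"],
    }
```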
The global promotion gate defaults to:
- `net_pass_delta >= 1`
- `regressions <= 0`
Families with repairs and no local regressions are `promoted` when the global gate passes and `proposed` when it fails. Families with any regression are `quarantined`. Families with only shared failures are `diagnostic`.
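The gate and the per-family status rules can be sketched as two small functions. This is an illustration under stated assumptions, not the `skillgraph-evolve` implementation: the function names and parameters are hypothetical, the `promoted` branch is inferred from the promoted counts reported for passing runs, and the `unclassified` fallback covers a case the text does not describe.

```python
def global_gate_passes(net_pass_delta, regressions,
                       min_net_pass_delta=1, max_regressions=0):
    """Default promotion gate: net_pass_delta >= 1 and regressions <= 0."""
    return net_pass_delta >= min_net_pass_delta and regressions <= max_regressions


def family_status(fixed, regressed, shared_fail, gate_passed):
    """Assign a status to one skill family from its local task counts."""
    if regressed > 0:
        # Any local regression blocks the family.
        return "quarantined"
    if fixed > 0:
        # Repairs without local regressions; activation depends on the global gate.
        # Assumption: a passing gate yields "promoted".
        return "promoted" if gate_passed else "proposed"
    if shared_fail > 0:
        # Nothing repaired, nothing regressed, only shared failures remain.
        return "diagnostic"
    # Assumption: the source does not say how families with only
    # preserved passes are labelled.
    return "unclassified"
```

Checking regressions first means a family that both repairs and regresses tasks is still quarantined, which is the conservative reading of the rules above.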
MUSE-Style Packages
Each package is a portable skill-memory unit:
- `SKILL.md` says when it can apply and when it must not activate.
- `MEMORY.md` summarizes the evidence in task language.
- `tests.jsonl` preserves exact task-level deltas for replay.
- `failure_modes.json` keeps regression and shared-failure boundaries explicit.
The package is useful even when the adapter fails overall. It tells the next run which subskills deserve targeted repair and which ones must be blocked.
Current 60-Task Result
The first generated package set compares Gemma 4 E2B QAT base against the reward-selected E2B adapter on `python-stdlib-heldout-v1-60`.
| Metric | Value |
|---|---|
| Baseline passed | 50/60 |
| Comparison passed | 46/60 |
| Net pass delta | -4 |
| Fixed tasks | 5 |
| Regressed tasks | 9 |
| Shared failures | 5 |
| Promoted skills | 0 |
| Proposed skills | 1 |
| Quarantined skills | 7 |
| Diagnostic skills | 1 |
| Active router skills | 0 |
The only proposed skill is `python_stdlib_math_trajectory_delta`, because it repaired `py_v1_moving_average` without a math-family regression. It is still not active because the global adapter comparison failed the promotion gate, so the router correctly leaves `active_skill_ids` empty.
The failed adapter's large-suite result therefore does not prove a downstream adapter-level TML performance lift. It does prove that the harness can mine a failed run for bounded repair evidence without promoting unsafe behavior.
Repair Router Test
The next step is to test a proposed skill without adopting the failed adapter globally. `apply_skillgraph_repair_router.py` starts from the base candidate rows and swaps in comparison rows only for repaired task ids from allowed skill statuses.
python3 scripts/apply_skillgraph_repair_router.py \
--base-candidates examples/evaluation/executable-candidates-mlx-gemma4-e2b-qat-base-heldout-v1-mac5-2026-06-10.jsonl \
--comparison-candidates examples/evaluation/executable-candidates-mlx-gemma4-e2b-reward-selected-512x4096-rawpython-heldout-v1-mac5-2026-06-10.jsonl \
--skills-jsonl examples/skills/python-stdlib-heldout-v1/e2b-reward-selected-vs-base/trajectory-skills.jsonl \
--allow-status proposed \
--condition skillgraph_math_repair_router \
--output examples/evaluation/executable-candidates-skillgraph-math-repair-router-heldout-v1-2026-06-10.jsonl \
--report benchmarks/executable-candidate-generation-skillgraph-math-repair-router-heldout-v1-2026-06-10.json

Then materialize and execute the normal hidden-test gate:
cargo run --bin materialize-executable-bench -- \
--tasks examples/evaluation/executable-taskset-python-stdlib-heldout-v1.jsonl \
--candidates examples/evaluation/executable-candidates-skillgraph-math-repair-router-heldout-v1-2026-06-10.jsonl \
--output examples/evaluation/executable-task-skillgraph-math-repair-router-heldout-v1-2026-06-10.jsonl
cargo run --bin executable-task-bench -- \
--input examples/evaluation/executable-task-skillgraph-math-repair-router-heldout-v1-2026-06-10.jsonl \
--output benchmarks/executable-task-skillgraph-math-repair-router-heldout-v1-2026-06-10.json \
--require-real

Checked result:
| Metric | Value |
|---|---|
| Preserved base rows | 59 |
| Routed repair rows | 1 |
| Routed task | `py_v1_moving_average` |
| E2B base | 50/60 |
| Math repair router | 51/60 |
| Net pass delta | +1 |
| Regressions vs base | 0 |
| Synthetic rows | 0 |
Running `skillgraph-evolve` on base vs the repair-router report promotes `python_stdlib_math_trajectory_delta` and writes active router artifacts under `examples/skills/python-stdlib-heldout-v1/math-repair-router-vs-base/`.
Boundary: this is a real narrow router lift on the 60-task executable gate. It is not a broad claim that the reward-selected adapter should replace the base model.
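The row swap performed by the repair router can be sketched as follows. This is a simplified illustration, not the code of `apply_skillgraph_repair_router.py`: the row keys `task_id`, `status`, and `fixed_task_ids` are assumed field names, and the real JSONL schema may differ.

```python
def route_repairs(base_rows, comparison_rows, skills, allowed_statuses=("proposed",)):
    """Start from base rows and swap in comparison rows for repaired tasks only.

    skills: rows with a "status" and a list of "fixed_task_ids" (assumed keys).
    Every task not repaired by an allowed skill keeps its base row.
    """
    repaired = {
        task_id
        for skill in skills
        if skill["status"] in allowed_statuses
        for task_id in skill["fixed_task_ids"]
    }
    comparison_by_task = {row["task_id"]: row for row in comparison_rows}
    routed = []
    for row in base_rows:
        task_id = row["task_id"]
        if task_id in repaired and task_id in comparison_by_task:
            routed.append(comparison_by_task[task_id])
        else:
            routed.append(row)
    return routed
```

Because the output is built by iterating over the base rows, the routed set always has the same task ids in the same order as the base set, and a comparison row can enter only through an explicitly allowed skill.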
Task-Level Repair Router
The stronger repair-map test routes all fixed task ids from the failed adapter while preserving base outputs for every known adapter regression. This uses quarantined family evidence only as task-level repair candidates; it does not activate the quarantined family wholesale.
python3 scripts/apply_skillgraph_repair_router.py \
--base-candidates examples/evaluation/executable-candidates-mlx-gemma4-e2b-qat-base-heldout-v1-mac5-2026-06-10.jsonl \
--comparison-candidates examples/evaluation/executable-candidates-mlx-gemma4-e2b-reward-selected-512x4096-rawpython-heldout-v1-mac5-2026-06-10.jsonl \
--skills-jsonl examples/skills/python-stdlib-heldout-v1/e2b-reward-selected-vs-base/trajectory-skills.jsonl \
--allow-status proposed quarantined \
--no-require-no-skill-regressions \
--condition skillgraph_task_repair_router \
--output examples/evaluation/executable-candidates-skillgraph-task-repair-router-heldout-v1-2026-06-10.jsonl \
--report benchmarks/executable-candidate-generation-skillgraph-task-repair-router-heldout-v1-2026-06-10.json

Checked result:
| Metric | Value |
|---|---|
| Preserved base rows | 55 |
| Routed repair rows | 5 |
| Known adapter regressions preserved from base | 9 |
| E2B base | 50/60 |
| Task repair router | 55/60 |
| Net pass delta | +5 |
| Regressions vs base | 0 |
| Synthetic rows | 0 |
| Promoted repair families | 4 |
The base-vs-router skillgraph promotes `python_stdlib_date_trajectory_delta`, `python_stdlib_math_trajectory_delta`, `python_stdlib_parse_trajectory_delta`, and `python_stdlib_security_trajectory_delta` under `examples/skills/python-stdlib-heldout-v1/task-repair-router-vs-base/`.
Boundary: this proves that task-level repair routing can improve pass rate when used as a gated router. It still does not prove that the reward-selected adapter should replace the base model globally.
Focused E4B Chat Overlay Router
The next repair cycle targeted only the five shared failures left by the 55/60 task router:
- `py_v1_parse_size_bytes`
- `py_v1_chunked_list`
- `py_v1_common_prefix_path`
- `py_v1_normalize_segments`
- `py_v1_split_filename_version`
The focused public prompt subset is `examples/evaluation/executable-public-tasks-python-stdlib-heldout-v1-shared-failures.jsonl`. Candidate generation used `scripts/generate_executable_candidates_mlx_lm.py` with the MLX-LM API backend, Gemma chat template, E4B QAT model, 512 generation tokens, and one public-only repair attempt. The focused hidden-test gate passed two candidates:
- `py_v1_common_prefix_path`
- `py_v1_normalize_segments`
`scripts/apply_passed_candidate_overlay_router.py` then overlaid only those passed candidates onto the 55/60 task router. The rejected focused candidates stayed out of the final router.
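The overlay step reduces to replacing router rows only where a focused candidate passed its gate. A minimal sketch, assuming each row carries a `task_id` key (the real row schema of `apply_passed_candidate_overlay_router.py` is not shown here, and the function name is hypothetical):

```python
def overlay_passed(router_rows, focused_rows, passed_task_ids):
    """Overlay focused candidates onto router rows, but only for passed tasks.

    Focused candidates whose task id is not in passed_task_ids are dropped,
    so rejected repairs never reach the final router.
    """
    passed = set(passed_task_ids)
    overlay = {
        row["task_id"]: row
        for row in focused_rows
        if row["task_id"] in passed
    }
    return [overlay.get(row["task_id"], row) for row in router_rows]
```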
Checked result:
| Metric | Value |
|---|---|
| Focused E4B chat repairs passed | 2/5 |
| Overlay rows from E4B chat | 2 |
| Preserved task-router rows | 58 |
| E2B base | 50/60 |
| Task repair router | 55/60 |
| Task plus E4B chat overlay router | 57/60 |
| Net pass delta vs E2B base | +7 |
| Regressions vs E2B base | 0 |
| Synthetic rows | 0 |
| Promoted repair families | 5 |
The base-vs-overlay skillgraph writes artifacts under `examples/skills/python-stdlib-heldout-v1/task-plus-e4b-chat-overlay-router-vs-base/`. It marks the comparison promotable, with seven fixed tasks, three shared failures, and zero regressions.
Boundary: this was the strongest 60-task evidence before the anticipatory planner pass. It proves regression-gated router-level repair lift. It does not prove that the failed adapter or any trained adapter should replace the base model globally.
Anticipatory Public Repair Planner
`scripts/run_anticipatory_repair_planner.py` adds the pre-generation layer that the repair-router evidence points toward. It classifies public tasks before generation, retrieves only the matching skillgraph package memory, runs bounded public recipe checks, and overlays only admitted candidates.
The script intentionally does not accept a hidden task-spec path. Its report records `read_hidden_task_specs=false` and `hidden_tests_sent_to_model=false`.
python3 scripts/run_anticipatory_repair_planner.py \
--public-tasks examples/evaluation/executable-public-tasks-python-stdlib-heldout-v1.jsonl \
--base-candidates examples/evaluation/executable-candidates-skillgraph-task-plus-e4b-chat-overlay-router-heldout-v1-2026-06-10.jsonl \
--skill-dir examples/skills/python-stdlib-heldout-v1/task-plus-e4b-chat-overlay-router-vs-base \
--condition skillgraph_anticipatory_public_repair_planner \
--output examples/evaluation/executable-candidates-skillgraph-anticipatory-public-repair-planner-heldout-v1-2026-06-10.jsonl \
--admitted-candidates-output examples/evaluation/executable-candidates-anticipatory-public-repairs-heldout-v1-2026-06-10.jsonl \
--report benchmarks/executable-candidate-generation-skillgraph-anticipatory-public-repair-planner-heldout-v1-2026-06-10.json

Checked planner admission result:
| Metric | Value |
|---|---|
| Admitted public-checked repairs | 3 |
| Admitted tasks | `py_v1_parse_size_bytes`, `py_v1_chunked_list`, `py_v1_split_filename_version` |
| Preserved previous-overlay rows | 57 |
| Rejected tasks | 0 |
| Synthetic rows | 0 |
After materialization and hidden executable scoring:
| Metric | Value |
|---|---|
| E2B base | 50/60 |
| Previous E4B overlay | 57/60 |
| Anticipatory public repair planner | 60/60 |
| Net delta vs previous overlay | +3 |
| Net delta vs E2B base | +10 |
| Regressions vs previous overlay | 0 |
| Regressions vs E2B base | 0 |
The base-vs-planner skillgraph writes artifacts under `examples/skills/python-stdlib-heldout-v1/anticipatory-public-repair-planner-vs-base/`. The previous-overlay-vs-planner proof lives under `examples/skills/python-stdlib-heldout-v1/anticipatory-public-repair-planner-vs-e4b-overlay/`.
Boundary: this proves the anticipatory repair planner as a public-check, regression-gated router/planner on this 60-task suite. It still does not prove broad adapter-level model improvement.
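The planner's control flow (classify, retrieve matching memory, run bounded public checks, overlay admitted candidates) can be sketched with injected callables. This is an illustrative outline, not the logic of `run_anticipatory_repair_planner.py`: `plan_repairs`, its parameters, and the `task_id` and `admitted_task_ids` keys are hypothetical. Only the two report flags are taken from the description above.

```python
def plan_repairs(public_tasks, base_rows, classify, load_memory,
                 propose, public_check, max_attempts=1):
    """Admit public-checked repair candidates and overlay them on base rows.

    classify(task) -> family name
    load_memory(family) -> package memory, or None if no package matches
    propose(task, memory) -> candidate row, or None
    public_check(task, candidate) -> True if the public recipe check passes
    The function receives no hidden task specs, so both report flags are False.
    """
    admitted = {}
    for task in public_tasks:
        memory = load_memory(classify(task))
        if memory is None:
            # No matching skill package: keep the existing row untouched.
            continue
        for _ in range(max_attempts):  # bounded number of public attempts
            candidate = propose(task, memory)
            if candidate is not None and public_check(task, candidate):
                admitted[task["task_id"]] = candidate
                break
    rows = [admitted.get(row["task_id"], row) for row in base_rows]
    report = {
        "admitted_task_ids": sorted(admitted),
        "read_hidden_task_specs": False,
        "hidden_tests_sent_to_model": False,
    }
    return rows, report
```

Passing the classifier, memory loader, generator, and checker as arguments keeps the sketch independent of any particular model backend, and makes the public-only boundary visible in the signature.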
Source Anchor
trajectory-memory-ledger/docs/harness-skills.md