Grand Diomande Research · Full HTML Reader

AGP Supervised Runtime Envelope

Both are now supervised by `runtime/supervise_agp_server_v1.sh`, and the client transport path is hardened by reconnect and retry logic in `runtime/run_cross_host_packet_replay_v1.py`.

Research Backlog research note experiment writeup candidate score 32 .md

Full Public Reader

AGP Supervised Runtime Envelope

As of 2026-04-18, the promoted cross-host runtime pair is the supervised `9430 + 9434` path on `mac5`.

`9430` is the hidden-return fast lane.
`9434` is the summary corrective lane.

Both are now supervised by `runtime/supervise_agp_server_v1.sh`, and the client transport path is hardened by reconnect and retry logic in `runtime/run_cross_host_packet_replay_v1.py`.

This matters because the earlier frontier was not limited by representational quality alone. It was limited by daemon lifetime and process churn. Once the runtime was supervised and the client learned how to survive restarts, the architecture crossed into a more credible operating regime.

The main supervised sweep over remote margin thresholds on the held-out Thunder validation slice produced six evaluated operating points.

At `0.03`, the system executed `13` steps and reached `0.6154` top-1 teacher match with mean KL `1.0492` and mean RTT `43.66ms`. This is too permissive. It buys coverage, but not enough quality or latency advantage to justify promotion.

At `0.05`, the system executed `11` steps and reached `0.7273` top-1 teacher match with mean KL `1.1467`, mean margin delta `0.1909`, and mean RTT `39.11ms`. This is the strongest opportunistic setting if the objective is raw top-1 agreement under the current four-step horizon.

At `0.07`, the system also executed `11` steps and matched the same `0.7273` top-1 teacher agreement, but with dramatically better latency than `0.05`: mean RTT `18.53ms` instead of `39.11ms`, and lower fast-lane RTT as well. Since the quality metrics are effectively identical while the runtime cost is much lower, `0.07` is the current best throughput-aware operating point.

At `0.10`, the system executed `10` steps and reached `0.7000` top-1 teacher match, but the more important result is that it delivered the cleanest overall quality profile among the high-performing settings: mean KL `0.5755`, mean margin delta `0.1125`, and mean response payload `1101.4` bytes. This is the most disciplined setting if the objective is semantic fidelity and calibration rather than maximum opportunistic continuation.

At `0.12` and `0.15`, the system executed only `9` steps, both with `0.6667` top-1 teacher match, but with the best KL and margin shape in the sweep: mean KL `0.4651`, mean margin delta `0.0703`, and payloads around `1030` bytes. These are precision-first settings, but they give up too much continuation opportunity to be the main promoted runtime today.

So the runtime envelope now has a real internal structure.

`0.07` is the best throughput-balanced setting.
`0.10` is the best disciplined quality setting.
`0.12+` is the precision lane.
`0.03` is too permissive.

Compared to the earlier unsupervised frontier, the supervised `9430 + 9434` pair is a material improvement. At the previously promoted `0.10` setting, it preserved the same `0.70` top-1 teacher agreement while improving every other important dimension: KL dropped from `1.2847` to `0.5755`, margin delta dropped from `0.1905` to `0.1125`, network roundtrip dropped from `48.47ms` to `21.91ms`, and response bytes dropped from `1659.1` to `1098.6`.

At the previously promoted `0.05` setting, the supervised pair improved top-1 agreement from `0.6364` to `0.7273`, while also reducing payload size substantially and trimming runtime overhead.

The practical conclusion is that the AGP cross-host runtime is no longer blocked on basic stability. It now has a measurable policy surface. The open question is no longer whether the architecture survives the hop. It is which operating point should be promoted for which product goal.

For general use, `0.07` is the strongest current recommendation.
For stricter quality-sensitive use, `0.10` is the cleaner recommendation.

The next experimental step is to test whether those preferences hold once the continuation horizon extends beyond the current four-token window. Longer-horizon runs for `0.07` and `0.10` are the right follow-up because they test whether the apparent superiority of `0.07` is real or merely a short-window artifact.

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

Comp-Core/docs/research/agp-supervised-runtime-envelope-2026-04-18.md

Detected Structure

Method · Evaluation · References · Code Anchors · Architecture