Grand Diomande Research · Full HTML Reader

Self-Referential Context Penalization for RAG++ Context Gateway

Agents That Account for Themselves proposal experiment writeup candidate score 32 .md

## Status
- Proposed
- Target module: `Desktop/Comp-Core/core/retrieval/cc-rag-plus-plus/rag_plusplus/service/routes/context_gateway.py`
- Primary integration point: `_compose_response(...)` inside the Smart Context Gateway

## Problem
The current gateway composes Graph Kernel and RAG++ results based on retrieval score and token budget, but it has no awareness of what is already inside the model's active context window.

That creates a failure mode:
- the gateway retrieves chunks that are semantically relevant but already present in the prompt
- those chunks consume scarce tokens without adding novelty
- when the top results are all self-referential, the gateway amplifies echo instead of expanding the reasoning surface

This spec adds a self-referential penalization stage so the gateway prefers novel, adjacent context over duplicate context.

## Goals
- Detect overlap between the current context window and retrieved RAG++ / GK candidates using hash-based approximate similarity.
- Penalize candidates in proportion to overlap ratio instead of hard-dropping everything.
- Trigger adjacent semantic expansion when the top-k set is mostly self-referential.
- Keep Graph Kernel admissibility rules unchanged.
- Integrate cleanly into `context_gateway.py` without introducing a new service boundary.

## Non-Goals
- Changing Graph Kernel slice authority or admissibility semantics.
- Replacing hybrid retrieval.
- Doing expensive full-text diffing across the whole prompt on every request.
- Making global search results admissible.

## Current Baseline
`context_gateway.py` currently does this:
1. query GK traversal and RAG++ search in parallel
2. pass raw results into `_compose_response(...)`
3. allocate the token budget with a roughly 60/30/10 percent split across response sections
4. truncate by remaining characters

What is missing:
- no request field for current context window or its fingerprint
- no overlap scoring
- no reranking based on novelty
- no expansion path when top-k is redundant

## Proposed Design

### 1. Request Contract Additions
Extend `ContextGatewayRequest` with optional current-window inputs.

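```python
from typing import List, Optional

from pydantic import BaseModel, Field


class ContextGatewayRequest(BaseModel):
    # Existing fields
    query: str
    cwd: str = ""
    session_id: str = ""
    max_tokens: int = 500
    include_graph: bool = True
    k_rag: int = 5

    # New: optional current-window inputs
    current_window_text: str = ""
    current_window_blocks: List[str] = Field(default_factory=list)
    # WindowFingerprintPayload is defined in section 5 and must be declared first.
    current_window_fingerprint: Optional[WindowFingerprintPayload] = None
    enable_self_ref_penalty: bool = True
```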

Rules:
- `current_window_fingerprint` wins if provided.
- otherwise compute fingerprints from `current_window_blocks`.
- if blocks are absent, segment `current_window_text` into fixed windows.
- if none are provided, self-referential penalization is skipped and current behavior is preserved.

Rationale:
- callers that already have the raw prompt can send it directly
- callers with payload-size concerns can send a precomputed fingerprint bundle
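
The precedence rules for the current-window inputs can be sketched as a small resolver. This is a minimal illustration, not the gateway's code: `resolve_window_source` is a hypothetical helper, whitespace splitting stands in for the real tokenizer, and the `window_tokens=200` default is an assumption taken from the middle of the span range given in section 5.

```python
from typing import List, Optional, Tuple


def resolve_window_source(
    fingerprint: Optional[dict],
    blocks: List[str],
    text: str,
    window_tokens: int = 200,  # assumed fixed-window size
) -> Tuple[str, object]:
    """Return (mode, payload) following the precedence rules."""
    if fingerprint is not None:
        # A precomputed fingerprint always wins.
        return "fingerprint", fingerprint
    if blocks:
        # One block becomes one span.
        return "blocks", list(blocks)
    if text.strip():
        # Fall back to fixed windows over whitespace tokens.
        tokens = text.split()
        spans = [
            " ".join(tokens[i : i + window_tokens])
            for i in range(0, len(tokens), window_tokens)
        ]
        return "text", spans
    # Nothing supplied: skip penalization and keep current behavior.
    return "skip", None
```

A caller would branch on the returned mode and only build fingerprints for the `"blocks"` and `"text"` cases.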

### 2. Fingerprinting Strategy
Use a two-stage hash-based detector:
- `SimHash64` for cheap near-duplicate screening
- `MinHash128` over token shingles for overlap-ratio estimation

Why both:
- SimHash is fast and good for exact or near-exact echoes
- MinHash gives a stable Jaccard-style overlap estimate, which is required for proportional scoring

### 3. Text Normalization
Normalize all compared text the same way before fingerprinting:
- Unicode NFKC
- lowercase
- collapse repeated whitespace
- strip markdown control characters but keep code/content text
- cap candidate text at a deterministic max length before hashing

Do not aggressively remove identifiers. Repeated IDs, filenames, and symbols are often the signal in engineering prompts.
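
A possible normalizer for these steps is sketched below. The `MAX_CHARS` cap and the exact set of markdown control characters are assumptions; the spec only requires that the cap be deterministic. Underscores, dots, and slashes are deliberately left alone so identifiers and filenames survive.

```python
import re
import unicodedata

MAX_CHARS = 8000  # assumed deterministic cap; the spec does not fix a value

# Assumed markdown controls: heading/quote prefixes, backticks, asterisks.
_MD_LINE_PREFIX = re.compile(r"^[ \t]*(?:#{1,6}|>+)[ \t]*", re.MULTILINE)
_MD_INLINE = re.compile(r"[`*]+")
_WHITESPACE = re.compile(r"\s+")


def normalize_text(text: str, max_chars: int = MAX_CHARS) -> str:
    """Apply the normalization steps in a fixed order."""
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    text = _MD_LINE_PREFIX.sub("", text)   # drop heading and blockquote markers
    text = _MD_INLINE.sub(" ", text)       # drop emphasis and code markers, keep content
    text = _WHITESPACE.sub(" ", text).strip()
    return text[:max_chars].rstrip()
```

Because candidates and window spans pass through the same function, two texts that differ only in markup or spacing produce identical fingerprints.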

### 4. Shingling
Use token shingles:
- default shingle size: `5`
- for chunks under `20` tokens, fall back to shingle size `3`

For each text block compute:
- `sha256` of normalized text for exact match detection
- `simhash64`
- `minhash128`
- `token_count`
- `shingle_count`
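
The per-block fingerprint described above could look roughly like the following standard-library sketch. The choice of `blake2b` as the underlying 64-bit hash, the seed-per-permutation MinHash scheme, and the hex encoding of `simhash64` are assumptions made for illustration; the input is expected to be already normalized.

```python
import hashlib
from typing import Dict, List

MAX_U64 = (1 << 64) - 1


def _hash64(item: str, seed: int = 0) -> int:
    """64-bit hash; the seed selects an independent hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def shingles(tokens: List[str]) -> List[str]:
    """Token shingles: size 5 by default, size 3 for chunks under 20 tokens."""
    size = 3 if len(tokens) < 20 else 5
    if len(tokens) <= size:
        return [" ".join(tokens)] if tokens else []
    return [" ".join(tokens[i : i + size]) for i in range(len(tokens) - size + 1)]


def simhash64(items: List[str]) -> int:
    """Bitwise majority vote over the 64-bit hashes of all shingles."""
    counts = [0] * 64
    for item in items:
        h = _hash64(item)
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)


def minhash128(items: List[str], num_perm: int = 128) -> List[int]:
    """Minimum hash per seeded function; empty input yields a sentinel signature."""
    unique = set(items)
    if not unique:
        return [MAX_U64] * num_perm
    return [min(_hash64(s, seed) for s in unique) for seed in range(1, num_perm + 1)]


def fingerprint(normalized_text: str) -> Dict[str, object]:
    tokens = normalized_text.split()  # whitespace tokens stand in for a real tokenizer
    sh = shingles(tokens)
    return {
        "sha256": hashlib.sha256(normalized_text.encode("utf-8")).hexdigest(),
        "simhash64": f"{simhash64(sh):016x}",
        "minhash128": minhash128(sh),
        "token_count": len(tokens),
        "shingle_count": len(sh),
    }
```

This pure-Python version hashes every shingle once per permutation, so it is meant to show the data flow rather than meet the latency targets.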

### 5. Current Window Representation
Represent the active prompt as both:
- one global fingerprint for the full current window
- many span fingerprints for local overlap detection

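```python
from typing import List

from pydantic import BaseModel


class SpanFingerprint(BaseModel):
    # Fingerprint of one local span of the current window
    span_id: str
    sha256: str
    simhash64: str
    minhash128: List[int]
    token_count: int


class WindowFingerprintPayload(BaseModel):
    # Global fingerprint of the full window plus per-span fingerprints
    global_sha256: str
    global_simhash64: str
    global_minhash128: List[int]
    spans: List[SpanFingerprint]
```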

Span segmentation:
- if `current_window_blocks` is provided, one block = one span
- otherwise segment `current_window_text` into chunks of roughly `160-220` tokens

Why spans matter:
- a retrieved chunk may overlap only one recent user/tool block, not the whole window
- global-only comparison will miss partial duplication

### 6. Candidate Representation
Before budgeting, transform raw candidates into scored candidates.

```python
from typing import Any, Dict, Literal

from pydantic import BaseModel, Field


class OverlapAssessment(BaseModel):
    exact_match: bool
    simhash_distance: int
    overlap_ratio: float
    penalty: float
    reason: str


class ScoredCandidate(BaseModel):
    source: Literal["rag", "gk"]
    candidate_id: str
    text: str
    base_score: float
    adjusted_score: float
    overlap: OverlapAssessment
    metadata: Dict[str, Any] = Field(default_factory=dict)
```

For RAG results:
- `base_score = similarity or score`

For GK paths:
- render each path to canonical text
- derive `base_score` from path signal, for example:
  - `0.50` baseline
  - `+0.10` for each high-signal predicate hit
  - `-0.05` for long/noisy paths
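
As an illustration of the example weights above, a GK path score might be derived like this. The contents of `HIGH_SIGNAL_PREDICATES` shown here are assumed from the predicates named in section 11, and the `LONG_PATH_HOPS` cutoff and the clamp to `[0, 1]` are hypothetical choices, since the spec does not define what counts as a long or noisy path.

```python
from typing import List, Set

# Assumed membership; the real constant lives in the gateway module.
HIGH_SIGNAL_PREDICATES: Set[str] = {
    "depends_on",
    "conflicts_with",
    "extends",
    "resolves",
    "relates_to_project",
}
LONG_PATH_HOPS = 4  # hypothetical cutoff for "long/noisy"


def gk_base_score(predicates: List[str]) -> float:
    """Score a rendered GK path from the predicates along it."""
    score = 0.50
    score += 0.10 * sum(1 for p in predicates if p in HIGH_SIGNAL_PREDICATES)
    if len(predicates) > LONG_PATH_HOPS:
        score -= 0.05
    return max(0.0, min(1.0, score))
```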

### 7. Overlap Computation
For each candidate:
1. exact match check using `sha256`
2. nearest-span SimHash distance
3. MinHash overlap estimate against:
   - global window
   - best matching span

Use:

```python
overlap_ratio = max(global_overlap, best_span_overlap)
```

Near-duplicate override:
- if `simhash_distance <= 3`, force `overlap_ratio = max(overlap_ratio, 0.90)`
- if `sha256` matches exactly, set `overlap_ratio = 1.0`
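
The three checks and the two overrides can be combined in one function, sketched here under some assumptions: each fingerprint is a plain dict with `sha256`, `simhash64` (hex string), and `minhash128` keys, and the MinHash overlap estimate is the fraction of matching signature positions. `assess_overlap` is a hypothetical name.

```python
from typing import Dict, List, Sequence


def hamming64(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit SimHash values."""
    return bin((a ^ b) & 0xFFFFFFFFFFFFFFFF).count("1")


def minhash_overlap(a: Sequence[int], b: Sequence[int]) -> float:
    """Jaccard estimate: share of signature positions that agree."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def assess_overlap(
    cand: Dict, global_fp: Dict, spans: List[Dict], dup_hamming: int = 3
) -> Dict:
    # 1. exact match against the window or any span
    exact = any(cand["sha256"] == fp["sha256"] for fp in [global_fp] + spans)
    # 2. nearest-span SimHash distance (64 when there are no spans)
    distance = min(
        (hamming64(int(cand["simhash64"], 16), int(fp["simhash64"], 16)) for fp in spans),
        default=64,
    )
    # 3. MinHash overlap against the global window and the best span
    global_overlap = minhash_overlap(cand["minhash128"], global_fp["minhash128"])
    best_span_overlap = max(
        (minhash_overlap(cand["minhash128"], fp["minhash128"]) for fp in spans),
        default=0.0,
    )
    overlap_ratio = max(global_overlap, best_span_overlap)
    # Near-duplicate and exact-duplicate overrides
    if distance <= dup_hamming:
        overlap_ratio = max(overlap_ratio, 0.90)
    if exact:
        overlap_ratio = 1.0
    return {"exact_match": exact, "simhash_distance": distance, "overlap_ratio": overlap_ratio}
```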

### 8. Penalty Function
Penalty must scale with overlap ratio.

Recommended formula:

```python
source_weight = 0.90 if source == "rag" else 0.60
penalty = source_weight * (overlap_ratio ** 1.35)
adjusted_score = max(0.0, base_score * (1.0 - penalty))
```

Interpretation:
- low overlap barely moves the score
- medium overlap suppresses rank
- high overlap nearly zeroes the score
- GK context is penalized less aggressively than RAG text because a structural path can still add value even when some wording overlaps

Exact duplicate rule:
- if `overlap_ratio == 1.0`, set `adjusted_score = 0.0`
- do not include unless no non-zero candidates exist and the caller explicitly wants best-effort fallback
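
To see how the recommended formula behaves across the overlap range, it can be wrapped in a function and evaluated at a few points. The wrapper name `apply_penalty` is hypothetical; the constants are the ones recommended above, with the exact-duplicate rule applied first.

```python
def apply_penalty(base_score: float, overlap_ratio: float, source: str) -> float:
    """Return the adjusted score for one candidate."""
    if overlap_ratio >= 1.0:
        return 0.0  # exact duplicate rule
    source_weight = 0.90 if source == "rag" else 0.60
    penalty = source_weight * (overlap_ratio ** 1.35)
    return max(0.0, base_score * (1.0 - penalty))


if __name__ == "__main__":
    # Print adjusted scores for a base score of 0.80 at several overlap levels.
    for ratio in (0.10, 0.30, 0.60, 0.85, 1.00):
        rag = apply_penalty(0.80, ratio, "rag")
        gk = apply_penalty(0.80, ratio, "gk")
        print(f"overlap={ratio:.2f} rag={rag:.3f} gk={gk:.3f}")
```

The exponent above 1 keeps the curve flat at low overlap and steep near the top, which is what makes the penalty proportional rather than a hard cutoff.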

### 9. Self-Referential Thresholds
Recommended defaults:
- `overlap_ratio >= 0.85`: near-duplicate
- `overlap_ratio >= 0.60`: self-referential
- `overlap_ratio < 0.30`: novel

These thresholds should be env-configurable:
- `CONTEXT_GATEWAY_SELFREF_ENABLED=1`
- `CONTEXT_GATEWAY_SELFREF_SHINGLE_SIZE=5`
- `CONTEXT_GATEWAY_SELFREF_MINHASH_PERM=128`
- `CONTEXT_GATEWAY_SELFREF_DUP_HAMMING=3`
- `CONTEXT_GATEWAY_SELFREF_TRIGGER_OVERLAP=0.60`
- `CONTEXT_GATEWAY_SELFREF_MIN_NOVEL=2`
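
One way to read these variables is a frozen config object built once at startup. `SelfRefConfig` and `load_selfref_config` are hypothetical names, and treating `"1"` as the only truthy value for the enabled flag is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "CONTEXT_GATEWAY_SELFREF_"


@dataclass(frozen=True)
class SelfRefConfig:
    enabled: bool = True
    shingle_size: int = 5
    minhash_perm: int = 128
    dup_hamming: int = 3
    trigger_overlap: float = 0.60
    min_novel: int = 2


def load_selfref_config(env: Optional[Mapping[str, str]] = None) -> SelfRefConfig:
    """Read overrides from the environment, falling back to the spec defaults."""
    env = os.environ if env is None else env
    d = SelfRefConfig()
    return SelfRefConfig(
        enabled=env.get(PREFIX + "ENABLED", "1") == "1",
        shingle_size=int(env.get(PREFIX + "SHINGLE_SIZE", d.shingle_size)),
        minhash_perm=int(env.get(PREFIX + "MINHASH_PERM", d.minhash_perm)),
        dup_hamming=int(env.get(PREFIX + "DUP_HAMMING", d.dup_hamming)),
        trigger_overlap=float(env.get(PREFIX + "TRIGGER_OVERLAP", d.trigger_overlap)),
        min_novel=int(env.get(PREFIX + "MIN_NOVEL", d.min_novel)),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test.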

### 10. Replacement Strategy: Expand Into Adjacent Semantic Space
If the first-pass top-k is mostly self-referential, the gateway should expand instead of returning echo.

Trigger expansion when either condition is true:
- fewer than `min(2, k_rag)` candidates have `overlap_ratio < 0.30`
- all top-k candidates have `overlap_ratio >= 0.60`
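
The two trigger conditions translate into a short predicate. This sketch takes the overlap ratios of the ranked first-pass candidates; the guard for `k_rag <= 0` is an added assumption, since the spec does not cover that case.

```python
from typing import List


def should_expand(
    overlap_ratios: List[float],
    k_rag: int,
    novel_below: float = 0.30,
    trigger_overlap: float = 0.60,
    min_novel: int = 2,
) -> bool:
    """Decide whether the first-pass top-k is too self-referential."""
    if k_rag <= 0:
        return False  # assumption: nothing was requested, so nothing to recover
    top = overlap_ratios[:k_rag]
    novel = sum(1 for r in top if r < novel_below)
    too_few_novel = novel < min(min_novel, k_rag)
    all_self_ref = all(r >= trigger_overlap for r in top)
    return too_few_novel or all_self_ref
```

For example, ratios of `[0.1, 0.2, 0.7]` with `k_rag=5` do not trigger expansion, because two novel candidates are already present.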

Expansion strategy:
1. Collect adjacent terms from the existing GK traversal result.
   - node labels
   - project names
   - dependency names
   - intent-bearing objects
2. Drop terms already dominant in the current window.
3. Build 2-4 deterministic query variants:
   - original query
   - query + highest-signal adjacent project/entity
   - query + dependency/conflict term
   - keyword-compressed version of the original query
4. Re-run `_query_rag_search(...)` with:
   - larger `match_count`, for example `max(k_rag * 3, 12)`
   - client-side exclusion of exact duplicate `id`s and high-overlap texts
5. Re-score expansion candidates with the same overlap penalty.
6. Fuse original and expanded pools with Reciprocal Rank Fusion or simple adjusted-score sorting.

Important:
- only one expansion round per request
- expansion is for novelty recovery, not an unbounded search loop
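
If Reciprocal Rank Fusion is chosen for the final fusion step, a minimal version over candidate IDs could look like this. The constant `k=60` is the value commonly used with RRF and is an assumption here, as is breaking ties by ID to keep the output deterministic. Each pool is expected to be ordered best-first by `adjusted_score`.

```python
from typing import Dict, List


def rrf_fuse(pools: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists; each appearance contributes 1 / (k + rank)."""
    scores: Dict[str, float] = {}
    for pool in pools:
        for rank, candidate_id in enumerate(pool, start=1):
            scores[candidate_id] = scores.get(candidate_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))
```

A candidate that appears in both the original and the expanded pool rises above candidates that appear in only one, which rewards results that several query variants agree on.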

### 11. Semantic-Adjacent Query Generation
Do not use an LLM here. Keep it deterministic and cheap.

Preferred sources for adjacent terms:
- `gk_result["paths"]`
- extracted project name from `cwd`
- existing query keywords
- path predicates in `HIGH_SIGNAL_PREDICATES`

A valid adjacent query generator can be rule-based:
- prioritize terms connected by `depends_on`, `conflicts_with`, `extends`, `resolves`, `relates_to_project`
- exclude terms already present in the highest-overlap current spans
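
A rule-based generator along these lines is sketched below. It assumes adjacent terms arrive as `(term, predicate)` pairs already ordered by signal; the split of predicates into entity-like and dependency-like groups and the small `STOPWORDS` set are illustrative assumptions, not part of the spec.

```python
from typing import List, Optional, Set, Tuple

STOPWORDS = {"the", "a", "an", "of", "to", "in", "for", "and", "is", "how", "do", "i", "what"}
ENTITY_PREDICATES = ("relates_to_project", "extends", "resolves")
DEPENDENCY_PREDICATES = ("depends_on", "conflicts_with")


def _first(terms: List[Tuple[str, str]], predicates: Tuple[str, ...]) -> Optional[str]:
    return next((t for t, p in terms if p in predicates), None)


def build_query_variants(
    query: str, adjacent: List[Tuple[str, str]], dominant_terms: Set[str]
) -> List[str]:
    """Return 1-4 deterministic query variants, original query first."""
    # Drop terms that already dominate the current window.
    fresh = [(t, p) for t, p in adjacent if t.lower() not in dominant_terms]
    variants = [query]
    entity = _first(fresh, ENTITY_PREDICATES)
    dependency = _first(fresh, DEPENDENCY_PREDICATES)
    if entity:
        variants.append(f"{query} {entity}")
    if dependency:
        variants.append(f"{query} {dependency}")
    # Keyword-compressed form: the query minus stopwords.
    compressed = " ".join(w for w in query.lower().split() if w not in STOPWORDS)
    if compressed and compressed != query.lower():
        variants.append(compressed)
    return list(dict.fromkeys(variants))  # de-duplicate, keep order
```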

### 12. Integration Point in `context_gateway.py`
Implement this as composition middleware inside the route module, not FastAPI ASGI middleware.

Reason:
- it needs access to raw retrieved candidates and token composition state
- ASGI middleware runs at the wrong abstraction layer

#### Required refactor
Change `_compose_response(...)` from sync to async so it can trigger adjacent expansion directly.

Current:

```python
def _compose_response(gk_result, rag_results, max_tokens, include_graph) -> ContextGatewayResponse:
    ...
```

Proposed:

```python
async def _compose_response(
    request: ContextGatewayRequest,
    project: Optional[str],
    gk_result: Dict[str, Any],
    rag_results: List[Dict[str, Any]],
) -> ContextGatewayResponse:
    ...
```

#### Internal pipeline inside `_compose_response(...)`
1. build or load current-window fingerprint
2. map raw RAG rows to `ScoredCandidate`
3. map GK paths to `ScoredCandidate`
4. apply self-referential penalty
5. if novelty is insufficient, expand adjacent semantic space and merge
6. sort by `adjusted_score`
7. apply the existing token-budget packing logic
8. emit normal `ContextGatewayResponse`

### 13. Pseudocode

```python
async def _compose_response(request, project, gk_result, rag_results):
    # Returns a falsy value when no current-window input was supplied.
    window_fp = build_window_fingerprint(request)

    # Map raw results to ScoredCandidate (adjusted_score starts at base_score).
    rag_candidates = build_rag_candidates(rag_results)
    gk_candidates = build_gk_candidates(gk_result)

    if request.enable_self_ref_penalty and window_fp:
        rag_candidates = penalize_candidates(rag_candidates, window_fp, source="rag")
        gk_candidates = penalize_candidates(gk_candidates, window_fp, source="gk")

        # At most one expansion round per request.
        if should_expand(rag_candidates, request.k_rag):
            expanded_rows = await expand_adjacent_semantic_space(
                query=request.query,
                project=project,
                session_id=request.session_id,
                gk_result=gk_result,
                window_fp=window_fp,
                target_k=request.k_rag,
            )
            expanded_candidates = build_rag_candidates(expanded_rows)
            expanded_candidates = penalize_candidates(expanded_candidates, window_fp, source="rag")
            rag_candidates = fuse_candidate_pools(rag_candidates, expanded_candidates)

    # Rank by novelty-adjusted score before budgeting.
    rag_candidates.sort(key=lambda c: c.adjusted_score, reverse=True)
    gk_candidates.sort(key=lambda c: c.adjusted_score, reverse=True)

    # The GK admissibility token is passed through unchanged.
    return pack_budgeted_response(
        rag_candidates=rag_candidates,
        gk_candidates=gk_candidates,
        admissibility_token=gk_result.get("admissibility_token"),
        max_tokens=request.max_tokens,
        include_graph=request.include_graph,
    )
```

### 14. Budgeting Behavior
Keep the existing 60/30/10 budget split:
- 60%
- 30%
- 10%

But fill each section from `adjusted_score` ordering, not raw retrieval order.

This means self-referential chunks are filtered out by ranking pressure before truncation.
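
Filling one section from `adjusted_score` ordering can be sketched as a greedy packer over a character budget. `pack_section` is a hypothetical helper; it takes `(adjusted_score, text)` pairs, skips zero-score candidates, and truncates the last item to the remaining characters, mirroring the truncation step in the current baseline.

```python
from typing import List, Tuple


def pack_section(candidates: List[Tuple[float, str]], budget_chars: int) -> List[str]:
    """Greedily fill a section budget in descending adjusted-score order."""
    packed: List[str] = []
    remaining = budget_chars
    for score, text in sorted(candidates, key=lambda c: c[0], reverse=True):
        if score <= 0.0 or remaining <= 0:
            break  # only zero-score or unaffordable candidates remain
        piece = text[:remaining]
        packed.append(piece)
        remaining -= len(piece)
    return packed
```

Because exact duplicates carry an adjusted score of `0.0`, they never reach the output unless a separate best-effort fallback re-admits them.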

### 15. Optional Response Debug Surface
Do not inflate the main context payload by default.

If debug is needed, add optional fields behind a flag:
- `self_ref_penalty_applied`
- `overlap_ratio`
- `original_score`
- `adjusted_score`
- `expansion_triggered`

Default behavior should keep the response as compact as today.

### 16. Caching
To keep latency bounded:

#### Per-request
- fingerprint the current window once
- fingerprint each retrieved chunk once

#### In-process LRU cache
Key:
- `sha256(normalized_text)`

Value:
- `{simhash64, minhash128, token_count, shingle_count}`

Recommended:
- LRU size: `10_000`
- TTL: `10-30 minutes`
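
A small in-process cache with both an LRU bound and a TTL could be built on `OrderedDict`, as in this sketch. `FingerprintCache` is a hypothetical name, the 15-minute default is an assumed point inside the recommended range, and the class is not thread-safe, which is assumed to be acceptable inside a single event loop.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class FingerprintCache:
    """LRU cache keyed by sha256(normalized_text), with per-entry expiry."""

    def __init__(
        self,
        max_size: int = 10_000,
        ttl_seconds: float = 900.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: "OrderedDict[str, tuple]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # expired
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict least recently used
```

The injectable `clock` exists only so expiry can be tested without sleeping.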

#### Future optimization
Persist chunk fingerprints in `rag_embeddings.metadata`, for example:

```json
{
  "self_ref_fp_v1": {
    "sha256": "...",
    "simhash64": "...",
    "minhash128": [...]
  }
}
```

This is optional for the first implementation. MVP can compute fingerprints at runtime.

### 17. Metrics and Observability
Add counters and histograms to the existing gateway metrics surface:
- `context_gateway_selfref_candidates_total`
- `context_gateway_selfref_penalized_total`
- `context_gateway_selfref_exact_duplicates_total`
- `context_gateway_selfref_expansion_total`
- `context_gateway_selfref_expansion_success_total`
- `context_gateway_selfref_overlap_ratio_sum`
- `context_gateway_selfref_novel_results_sum`

Add structured logs per request:
- `session_id`
- `query`
- `candidate_count`
- `penalized_count`
- `exact_duplicate_count`
- `expansion_triggered`
- `novel_result_count`

### 18. Invariants
This feature must preserve the existing production invariants:
- it must never mark a result admissible
- it must never modify GK-issued tokens
- it must never widen Graph Kernel authority
- it may only rerank or omit candidates from the composed output

In short:
- admissibility is governance
- self-referential penalization is relevance shaping

They must remain separate.

### 19. Failure Handling
If self-referential logic fails:
- log at warning level
- fall back to raw current behavior
- do not fail the whole request

If expansion fails:
- keep first-pass penalized results
- do not retry more than once
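
The fallback rule for the penalization stage amounts to a guarded call, sketched here. `penalize_or_fallback` and the logger name are hypothetical; the second return value lets the caller record whether the penalty was actually applied.

```python
import logging
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")
logger = logging.getLogger("context_gateway.selfref")  # assumed logger name


def penalize_or_fallback(
    candidates: List[T], penalize: Callable[[List[T]], List[T]]
) -> Tuple[List[T], bool]:
    """Run the penalty stage; on any error keep the raw candidates."""
    try:
        return penalize(candidates), True
    except Exception:
        # Warning level, with traceback; the request itself must not fail.
        logger.warning(
            "self-referential penalization failed; using raw ordering", exc_info=True
        )
        return candidates, False
```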

### 20. Testing Plan

#### Unit tests
- fingerprint determinism for identical normalized text
- exact duplicate detection via `sha256`
- near-duplicate detection via `simhash_distance <= 3`
- MinHash overlap monotonicity
- penalty monotonicity: higher overlap => lower adjusted score
- no-window input => identical behavior to baseline

#### Integration tests
- current window contains the top retrieved chunk verbatim, chunk is suppressed
- current window overlaps 40-60%
- all top-k are duplicates, expansion is triggered
- expansion returns adjacent non-duplicate results
- GK token remains unchanged

#### Performance targets
- added p95 latency under `20ms` with warm fingerprint cache
- added p95 latency under `60ms` without cache and without expansion
- single expansion round stays under `120ms` extra on local services

### 21. Rollout Plan

#### Wave 1
- add request fields
- add fingerprint utilities
- add RAG candidate penalization
- add metrics

#### Wave 2
- add GK candidate penalization
- add adjacent semantic expansion
- add debug logging

#### Wave 3
- persist fingerprints in metadata or shared cache
- tune thresholds from production telemetry

## Summary
The gateway should stop treating retrieved relevance as equivalent to contextual usefulness.

The correct behavior is:
- retrieve broadly
- detect overlap with the existing prompt
- penalize self-reference proportionally
- expand outward when the candidate set collapses into echo
- keep GK admissibility untouched

That gives the context gateway a missing property: novelty pressure under token constraints.

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

selfref-penalization-spec.md
