Self-Referential Context Penalization for RAG++ Context Gateway
That creates a failure mode: - the gateway retrieves chunks that are semantically relevant but already present in the prompt - those chunks consume scarce tokens without adding novelty - when the top results are all self-referential, the gateway amplifies echo instead of expanding the reasoning surface
Full Public Reader
Self-Referential Context Penalization for RAG++ Context Gateway
## Status
- Proposed
- Target module: `Desktop/Comp-Core/core/retrieval/cc-rag-plus-plus/rag_plusplus/service/routes/context_gateway.py`
- Primary integration point: `_compose_response(...)` inside the Smart Context Gateway
## Problem
The current gateway composes Graph Kernel and RAG++ results based on retrieval score and token budget, but it has no awareness of what is already inside the model's active context window.
That creates a failure mode:
- the gateway retrieves chunks that are semantically relevant but already present in the prompt
- those chunks consume scarce tokens without adding novelty
- when the top results are all self-referential, the gateway amplifies echo instead of expanding the reasoning surface
This spec adds a self-referential penalization stage so the gateway prefers novel, adjacent context over duplicate context.
## Goals
- Detect overlap between the current context window and retrieved RAG++ / GK candidates using hash-based approximate similarity.
- Penalize candidates in proportion to overlap ratio instead of hard-dropping everything.
- Trigger adjacent semantic expansion when the top-k set is mostly self-referential.
- Keep Graph Kernel admissibility rules unchanged.
- Integrate cleanly into `context_gateway.py` without introducing a new service boundary.
## Non-Goals
- Changing Graph Kernel slice authority or admissibility semantics.
- Replacing hybrid retrieval.
- Doing expensive full-text diffing across the whole prompt on every request.
- Making global search results admissible.
## Current Baseline
`context_gateway.py` currently does this:
1. query GK traversal and RAG++ search in parallel
2. pass raw results into `_compose_response(...)`
3. allocate roughly 60
4. truncate by remaining characters
What is missing:
- no request field for current context window or its fingerprint
- no overlap scoring
- no reranking based on novelty
- no expansion path when top-k is redundant
Proposed Design
### 1. Request Contract Additions
Extend `ContextGatewayRequest` with optional current-window inputs.
class ContextGatewayRequest(BaseModel):
query: str
cwd: str = ""
session_id: str = ""
max_tokens: int = 500
include_graph: bool = True
k_rag: int = 5
# New
current_window_text: str = ""
current_window_blocks: List[str] = Field(default_factory=list)
current_window_fingerprint: Optional[WindowFingerprintPayload] = None
enable_self_ref_penalty: bool = TrueRules:
- `current_window_fingerprint` wins if provided.
- otherwise compute fingerprints from `current_window_blocks`.
- if blocks are absent, segment `current_window_text` into fixed windows.
- if none are provided, self-referential penalization is skipped and current behavior is preserved.
Rationale:
- callers that already have the raw prompt can send it directly
- callers with payload-size concerns can send a precomputed fingerprint bundle
### 2. Fingerprinting Strategy
Use a two-stage hash-based detector:
- `SimHash64` for cheap near-duplicate screening
- `MinHash128` over token shingles for overlap-ratio estimation
Why both:
- SimHash is fast and good for exact or near-exact echoes
- MinHash gives a stable Jaccard-style overlap estimate, which is required for proportional scoring
### 3. Text Normalization
Normalize all compared text the same way before fingerprinting:
- Unicode NFKC
- lowercase
- collapse repeated whitespace
- strip markdown control characters but keep code/content text
- cap candidate text at a deterministic max length before hashing
Do not aggressively remove identifiers. Repeated IDs, filenames, and symbols are often the signal in engineering prompts.
### 4. Shingling
Use token shingles:
- default shingle size: `5`
- for chunks under `20` tokens, fall back to shingle size `3`
For each text block compute:
- `sha256` of normalized text for exact match detection
- `simhash64`
- `minhash128`
- `token_count`
- `shingle_count`
### 5. Current Window Representation
Represent the active prompt as both:
- one global fingerprint for the full current window
- many span fingerprints for local overlap detection
class SpanFingerprint(BaseModel):
span_id: str
sha256: str
simhash64: str
minhash128: List[int]
token_count: int
class WindowFingerprintPayload(BaseModel):
global_sha256: str
global_simhash64: str
global_minhash128: List[int]
spans: List[SpanFingerprint]Span segmentation:
- if `current_window_blocks` is provided, one block = one span
- otherwise segment `current_window_text` into chunks of roughly `160-220` tokens
Why spans matter:
- a retrieved chunk may overlap only one recent user/tool block, not the whole window
- global-only comparison will miss partial duplication
### 6. Candidate Representation
Before budgeting, transform raw candidates into scored candidates.
class OverlapAssessment(BaseModel):
exact_match: bool
simhash_distance: int
overlap_ratio: float
penalty: float
reason: str
class ScoredCandidate(BaseModel):
source: Literal["rag", "gk"]
candidate_id: str
text: str
base_score: float
adjusted_score: float
overlap: OverlapAssessment
metadata: Dict[str, Any] = Field(default_factory=dict)For RAG results:
- `base_score = similarity or score`
For GK paths:
- render each path to canonical text
- derive `base_score` from path signal, for example:
- `0.50` baseline
- `+0.10` for each high-signal predicate hit
- `-0.05` for long/noisy paths
### 7. Overlap Computation
For each candidate:
1. exact match check using `sha256`
2. nearest-span SimHash distance
3. MinHash overlap estimate against:
- global window
- best matching span
Use:
overlap_ratio = max(global_overlap, best_span_overlap)Near-duplicate override:
- if `simhash_distance <= 3`, force `overlap_ratio = max(overlap_ratio, 0.90)`
- if `sha256` matches exactly, set `overlap_ratio = 1.0`
### 8. Penalty Function
Penalty must scale with overlap ratio.
Recommended formula:
source_weight = 0.90 if source == "rag" else 0.60
penalty = source_weight * (overlap_ratio ** 1.35)
adjusted_score = max(0.0, base_score * (1.0 - penalty))Interpretation:
- low overlap barely moves the score
- medium overlap suppresses rank
- high overlap nearly zeroes the score
- GK context is penalized less aggressively than RAG text because a structural path can still add value even when some wording overlaps
Exact duplicate rule:
- if `overlap_ratio == 1.0`, set `adjusted_score = 0.0`
- do not include unless no non-zero candidates exist and the caller explicitly wants best-effort fallback
### 9. Self-Referential Thresholds
Recommended defaults:
- `overlap_ratio >= 0.85`: near-duplicate
- `overlap_ratio >= 0.60`: self-referential
- `overlap_ratio < 0.30`: novel
These thresholds should be env-configurable:
- `CONTEXT_GATEWAY_SELFREF_ENABLED=1`
- `CONTEXT_GATEWAY_SELFREF_SHINGLE_SIZE=5`
- `CONTEXT_GATEWAY_SELFREF_MINHASH_PERM=128`
- `CONTEXT_GATEWAY_SELFREF_DUP_HAMMING=3`
- `CONTEXT_GATEWAY_SELFREF_TRIGGER_OVERLAP=0.60`
- `CONTEXT_GATEWAY_SELFREF_MIN_NOVEL=2`
### 10. Replacement Strategy: Expand Into Adjacent Semantic Space
If the first-pass top-k is mostly self-referential, the gateway should expand instead of returning echo.
Trigger expansion when either condition is true:
- fewer than `min(2, k_rag)` candidates have `overlap_ratio < 0.30`
- all top-k candidates have `overlap_ratio >= 0.60`
Expansion strategy:
1. Collect adjacent terms from the existing GK traversal result.
- node labels
- project names
- dependency names
- intent-bearing objects
2. Drop terms already dominant in the current window.
3. Build 2-4 deterministic query variants:
- original query
- query + highest-signal adjacent project/entity
- query + dependency/conflict term
- keyword-compressed version of the original query
4. Re-run `_query_rag_search(...)` with:
- larger `match_count`, for example `max(k_rag * 3, 12)`
- client-side exclusion of exact duplicate `id`s and high-overlap texts
5. Re-score expansion candidates with the same overlap penalty.
6. Fuse original and expanded pools with Reciprocal Rank Fusion or simple adjusted-score sorting.
Important:
- only one expansion round per request
- expansion is for novelty recovery, not an unbounded search loop
### 11. Semantic-Adjacent Query Generation
Do not use an LLM here. Keep it deterministic and cheap.
Preferred sources for adjacent terms:
- `gk_result["paths"]`
- extracted project name from `cwd`
- existing query keywords
- path predicates in `HIGH_SIGNAL_PREDICATES`
A valid adjacent query generator can be rule-based:
- prioritize terms connected by `depends_on`, `conflicts_with`, `extends`, `resolves`, `relates_to_project`
- exclude terms already present in the highest-overlap current spans
### 12. Integration Point in `context_gateway.py`
Implement this as composition middleware inside the route module, not FastAPI ASGI middleware.
Reason:
- it needs access to raw retrieved candidates and token composition state
- ASGI middleware runs at the wrong abstraction layer
#### Required refactor
Change `_compose_response(...)` from sync to async so it can trigger adjacent expansion directly.
Current:
def _compose_response(gk_result, rag_results, max_tokens, include_graph) -> ContextGatewayResponse:Proposed:
async def _compose_response(
request: ContextGatewayRequest,
project: Optional[str],
gk_result: Dict[str, Any],
rag_results: List[Dict[str, Any]],
) -> ContextGatewayResponse:#### Internal pipeline inside `_compose_response(...)`
1. build or load current-window fingerprint
2. map raw RAG rows to `ScoredCandidate`
3. map GK paths to `ScoredCandidate`
4. apply self-referential penalty
5. if novelty is insufficient, expand adjacent semantic space and merge
6. sort by `adjusted_score`
7. apply the existing token-budget packing logic
8. emit normal `ContextGatewayResponse`
13. Pseudocode
async def _compose_response(request, project, gk_result, rag_results):
window_fp = build_window_fingerprint(request)
rag_candidates = build_rag_candidates(rag_results)
gk_candidates = build_gk_candidates(gk_result)
if request.enable_self_ref_penalty and window_fp:
rag_candidates = penalize_candidates(rag_candidates, window_fp, source="rag")
gk_candidates = penalize_candidates(gk_candidates, window_fp, source="gk")
if should_expand(rag_candidates, request.k_rag):
expanded_rows = await expand_adjacent_semantic_space(
query=request.query,
project=project,
session_id=request.session_id,
gk_result=gk_result,
window_fp=window_fp,
target_k=request.k_rag,
)
expanded_candidates = build_rag_candidates(expanded_rows)
expanded_candidates = penalize_candidates(expanded_candidates, window_fp, source="rag")
rag_candidates = fuse_candidate_pools(rag_candidates, expanded_candidates)
rag_candidates.sort(key=lambda c: c.adjusted_score, reverse=True)
gk_candidates.sort(key=lambda c: c.adjusted_score, reverse=True)
return pack_budgeted_response(
rag_candidates=rag_candidates,
gk_candidates=gk_candidates,
admissibility_token=gk_result.get("admissibility_token"),
max_tokens=request.max_tokens,
include_graph=request.include_graph,
)### 14. Budgeting Behavior
Keep the existing 60/30/10 budget split:
- 60
- 30
- 10
But fill each section from `adjusted_score` ordering, not raw retrieval order.
This means self-referential chunks are filtered out by ranking pressure before truncation.
### 15. Optional Response Debug Surface
Do not inflate the main context payload by default.
If debug is needed, add optional fields behind a flag:
- `self_ref_penalty_applied`
- `overlap_ratio`
- `original_score`
- `adjusted_score`
- `expansion_triggered`
Default behavior should keep the response as compact as today.
### 16. Caching
To keep latency bounded:
#### Per-request
- fingerprint the current window once
- fingerprint each retrieved chunk once
#### In-process LRU cache
Key:
- `sha256(normalized_text)`
Value:
- `{simhash64, minhash128, token_count, shingle_count}`
Recommended:
- LRU size: `10_000`
- TTL: `10-30 minutes`
#### Future optimization
Persist chunk fingerprints in `rag_embeddings.metadata`, for example:
{
"self_ref_fp_v1": {
"sha256": "...",
"simhash64": "...",
"minhash128": [...]
}
}This is optional for the first implementation. MVP can compute fingerprints at runtime.
### 17. Metrics and Observability
Add counters and histograms to the existing gateway metrics surface:
- `context_gateway_selfref_candidates_total`
- `context_gateway_selfref_penalized_total`
- `context_gateway_selfref_exact_duplicates_total`
- `context_gateway_selfref_expansion_total`
- `context_gateway_selfref_expansion_success_total`
- `context_gateway_selfref_overlap_ratio_sum`
- `context_gateway_selfref_novel_results_sum`
Add structured logs per request:
- `session_id`
- `query`
- `candidate_count`
- `penalized_count`
- `exact_duplicate_count`
- `expansion_triggered`
- `novel_result_count`
### 18. Invariants
This feature must preserve the existing production invariants:
- it must never mark a result admissible
- it must never modify GK-issued tokens
- it must never widen Graph Kernel authority
- it may only rerank or omit candidates from the composed output
In short:
- admissibility is governance
- self-referential penalization is relevance shaping
They must remain separate.
### 19. Failure Handling
If self-referential logic fails:
- log at warning level
- fall back to raw current behavior
- do not fail the whole request
If expansion fails:
- keep first-pass penalized results
- do not retry more than once
20. Testing Plan
#### Unit tests
- fingerprint determinism for identical normalized text
- exact duplicate detection via `sha256`
- near-duplicate detection via `simhash_distance <= 3`
- MinHash overlap monotonicity
- penalty monotonicity: higher overlap => lower adjusted score
- no-window input => identical behavior to baseline
#### Integration tests
- current window contains the top retrieved chunk verbatim, chunk is suppressed
- current window overlaps 40-60
- all top-k are duplicates, expansion is triggered
- expansion returns adjacent non-duplicate results
- GK token remains unchanged
#### Performance targets
- added p95 latency under `20ms` with warm fingerprint cache
- added p95 latency under `60ms` without cache and without expansion
- single expansion round stays under `120ms` extra on local services
21. Rollout Plan
#### Wave 1
- add request fields
- add fingerprint utilities
- add RAG candidate penalization
- add metrics
#### Wave 2
- add GK candidate penalization
- add adjacent semantic expansion
- add debug logging
#### Wave 3
- persist fingerprints in metadata or shared cache
- tune thresholds from production telemetry
## Summary
The gateway should stop treating retrieved relevance as equivalent to contextual usefulness.
The correct behavior is:
- retrieve broadly
- detect overlap with the existing prompt
- penalize self-reference proportionally
- expand outward when the candidate set collapses into echo
- keep GK admissibility untouched
That gives the context gateway a missing property: novelty pressure under token constraints.
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
selfref-penalization-spec.md
Detected Structure
Method · Evaluation · Code Anchors · Architecture