Stage 3: Expand + Master Plan -- AI-Generated Graph Topology

Extracted abstract or opening context

#### Risk 1: Confidence Decay Destroys Valuable Low-Frequency Triples

- **Failure scenario**: The 3 `evolved_from` triples (accessed rarely but semantically critical) decay below threshold and disappear from default queries within 90 days.
- **Probability**: HIGH (guaranteed by the decay formula if not addressed)
- **Impact**: HIGH (evolutionary lineage lost)
- **Mitigation**: Tier 1 predicates (including `evolved_from`) are EXEMPT from decay. Only Tier 2 and Tier 3 predicates decay. This is enforced in the decay SQL: `WHERE predicate NOT IN (SELECT name FROM predicate_registry WHERE tier = 1)`
- **Validation**: After first decay run, verify all Tier 1 triples retain original confidence. Query: `SELECT count(*) FROM knowledge_graph WHERE predicate = 'evolved_from'` must = 3.

#### Risk 2: Pruning Breaks RAG++ Retrieval Quality

- **Failure scenario**: RAG++ searches that previously returned relevant results now return empty because the connecting triples were pruned.
- **Probability**: MEDIUM (pruning targets low-confidence noise, but RAG++ may use noise paths as fallback when high-signal paths are sparse)
- **Impact**: HIGH (RAG++ is the primary GK consumer)
- **Mitigation**: Three-layer defense:
  1. Soft delete (`pruned_at`, not `DELETE`) -- reversible in seconds
  2. Before pruning: run a test suite of 20 known-good RAG++ queries, record results
  3. After pruning: re-run the same 20 queries, compare result overlap. If <80% overlap, rollback (NULL all `pruned_at` values)
- **Validation**: RAG++ test suite passes with >= 80% result overlap post-pruning.

#### Risk 3: Entity Embedding Pipeline Produces Garbage Embeddings

- **Failure scenario**: Entities with no GK predicates (e.g., bare prompt hashes like "prompt:1069f93fdd8f24c3") are embedded as meaningless vectors, creating noise clusters that contaminate hierarchy discovery.
- **Probability**: MEDIUM (8,303 entities are prompt hashes with only `contains_prompt` edges)
- **Impact**: MEDIUM (garbage clusters waste effort but do not corrupt good clusters)
- **Mitigation**: Filter entity list BEFORE embedding. Only embed entities that:
  1. Have at least 2 predicates (not counting `contains_prompt`, `has_intent`, `ran_session`)
  2. Are not prefixed with "session:", "prompt:", or "agent:" unless they also have Tier 1/2 edges

  Expected: ~2,000-3,000 embeddable entities out of 10,781 total.
- **Validation**: Embedding count < 4,000. No cluster contains >50% session/prompt entities.

#### Risk 4: Cluster Labeling Produces Vague Names

- **Probability**: MEDIUM
- **Impact**: LOW (labels are human-readable convenience, not machine-critical)
- **Mitigation**: Three-pass labeling: (1) auto-label from shared predicates, (2) Gemini for mixed clusters, (3) manual override list for top-20 clusters.

#### Risk 5: Predicate Registry Cold S
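As an illustration of the Risk 2 mitigation (record results, prune, re-run, compare, roll back below 80%), the gate might be sketched as follows. This is a minimal sketch under stated assumptions, not the plan's actual implementation: the excerpt does not define "result overlap", so it is assumed here to be the fraction of pre-pruning results still returned after pruning, averaged across queries. The function names and the dictionary shape for recorded results are hypothetical; only the `knowledge_graph` table and the `pruned_at` column come from the text.

```python
import sqlite3

# Threshold taken from the plan: roll back if overlap falls below 80%.
OVERLAP_THRESHOLD = 0.80


def result_overlap(before, after):
    """Fraction of pre-pruning results still returned after pruning.

    ASSUMPTION: the plan does not define the overlap metric; this is one
    plausible reading (recall of the baseline result set).
    """
    before_set = set(before)
    if not before_set:
        return 1.0  # nothing to lose for a query that returned nothing
    return len(before_set & set(after)) / len(before_set)


def mean_overlap(baseline, current):
    """Average per-query overlap over the recorded test suite.

    Both arguments map a query string to its list of result IDs
    (hypothetical shape, chosen for illustration).
    """
    if not baseline:
        return 1.0
    scores = [result_overlap(baseline[q], current.get(q, [])) for q in baseline]
    return sum(scores) / len(scores)


def rollback_pruning(conn):
    """Undo the soft delete by setting pruned_at back to NULL everywhere."""
    cur = conn.execute(
        "UPDATE knowledge_graph SET pruned_at = NULL WHERE pruned_at IS NOT NULL"
    )
    conn.commit()
    return cur.rowcount  # number of triples restored


def gate_pruning(conn, baseline, current):
    """Return (passed, score); roll back the pruning run when the gate fails."""
    score = mean_overlap(baseline, current)
    if score < OVERLAP_THRESHOLD:
        rollback_pruning(conn)
        return False, score
    return True, score
```

In use, `baseline` would hold the results of the 20 known-good queries recorded before pruning and `current` the results of the same queries afterwards. Because pruning is a soft delete, a failed gate only clears `pruned_at` and leaves every triple in place.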

Promotion decision

What has to happen next

Attach run IDs, datasets, metrics, and reproduction commands.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.