experiment · experiment writeup candidate · score 24

CRP-1.2: Expanded Evaluation Suite (174 Questions)



Extracted abstract or opening context

Expanded the CognitiveTwin V3 evaluation suite from **24 to 174 test cases** across **18 dimensions**.

### Files Modified

- `cognitive_twin/v3/eval/test_cases_expanded.py`: **NEW** (2309 lines, 150 tests across 13 classes)
- `cognitive_twin/v3/eval/suite.py`: Registered all 13 new generators
- `cognitive_twin/v3/eval/__init__.py`: Added exports for expanded test classes
- `scripts/eval_dry_run.py`: **NEW** dry-run + live eval script

### Files Generated

- `data/eval_results/eval_dryrun_*.json`: Mock results with full structure

| Dimension | ID Prefix | Count | Source |
|-----------|-----------|-------|--------|
| Question Policy | `qp` | 7 | original |
| Format Compliance | `fc` | 5 | original |
| Omission | `om` | 3 | original |
| Historical Annoyance | `ha` | 5 | original |
| Edge Case | `ec` | 4 | original |
| **Recall** | `rc` | 15 | expanded |
| **Reasoning** | `rs` | 15 | expanded |
| **Temporal** | `tp` | 12 | expanded |
| **Counterfactual** | `cf` | 12 | expanded |
| **Adversarial** | `av` | 12 | expanded |
| **Generalization** | `gz` | 10 | expanded |
| **Consistency** | `cs` | 10 | expanded |
| **Precision** | `pr` | 10 | expanded |
| **Negation** | `ng` | 10 | expanded |
| **Inference** | `if` | 10 | expanded |
| **Multi-Turn Coherence** | `mt` | 12 | expanded |
| **Ambiguity Handling** | `ah` | 10 | expanded |
| **Edge Case Extended** | `ex` | 12 | expanded |
| **Total** | | **174** | |

| Priority | Count |
|----------|-------|
| Critical | 11 |
| High | 59 |
| Medium | 93 |
| Low | 11 |
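The counts in the summary can be cross-checked mechanically. The following is a minimal sketch, not the suite's actual code: the `DIMENSIONS` and `PRIORITIES` structures, the helper names, and the `<prefix>_<number>` test ID shape are all assumptions made for illustration. Only the dimension names, prefixes, counts, and priority figures come from the tables.

```python
from collections import Counter

# (dimension, id_prefix, count, source) rows copied from the summary table.
DIMENSIONS = [
    ("Question Policy", "qp", 7, "original"),
    ("Format Compliance", "fc", 5, "original"),
    ("Omission", "om", 3, "original"),
    ("Historical Annoyance", "ha", 5, "original"),
    ("Edge Case", "ec", 4, "original"),
    ("Recall", "rc", 15, "expanded"),
    ("Reasoning", "rs", 15, "expanded"),
    ("Temporal", "tp", 12, "expanded"),
    ("Counterfactual", "cf", 12, "expanded"),
    ("Adversarial", "av", 12, "expanded"),
    ("Generalization", "gz", 10, "expanded"),
    ("Consistency", "cs", 10, "expanded"),
    ("Precision", "pr", 10, "expanded"),
    ("Negation", "ng", 10, "expanded"),
    ("Inference", "if", 10, "expanded"),
    ("Multi-Turn Coherence", "mt", 12, "expanded"),
    ("Ambiguity Handling", "ah", 10, "expanded"),
    ("Edge Case Extended", "ex", 12, "expanded"),
]

# Priority breakdown, also copied from the summary.
PRIORITIES = {"Critical": 11, "High": 59, "Medium": 93, "Low": 11}


def totals_by_source(rows):
    """Sum test-case counts per source label ("original" or "expanded")."""
    totals = Counter()
    for _name, _prefix, count, source in rows:
        totals[source] += count
    return totals


def dimension_for(test_id, rows=DIMENSIONS):
    """Look up a dimension from a test ID.

    Assumes a hypothetical "<prefix>_<number>" ID shape such as "rc_001";
    the real ID format is not stated in the writeup.
    """
    prefix = test_id.split("_", 1)[0]
    for name, row_prefix, _count, _source in rows:
        if row_prefix == prefix:
            return name
    raise KeyError(f"unknown ID prefix: {prefix!r}")


by_source = totals_by_source(DIMENSIONS)
grand_total = sum(by_source.values())
priority_total = sum(PRIORITIES.values())
```

Under these assumptions, `by_source` splits into the 24 original and 150 expanded cases, and both `grand_total` and `priority_total` should equal the 174 reported in the table, so a mismatch in either would flag a transcription or registration error.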

Promotion decision

What has to happen next

Attach run IDs, datasets, metrics, and reproduction commands.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.