Grand Diomande Research · Full HTML Reader

CRP-1.2: Expanded Evaluation Suite (174 Questions)




Status: COMPLETE

What Changed

Expanded the CognitiveTwin V3 evaluation suite from 24 to 174 test cases across 18 dimensions.

### Files Modified
- `cognitive_twin/v3/eval/test_cases_expanded.py` — NEW (2309 lines, 150 tests across 13 classes)
- `cognitive_twin/v3/eval/suite.py` — Registered all 13 new generators
- `cognitive_twin/v3/eval/__init__.py` — Added exports for expanded test classes
- `scripts/eval_dry_run.py` — NEW dry-run + live eval script

### Files Generated
- `data/eval_results/eval_dryrun_*.json` — Mock results with full structure
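
To inspect the most recent dry-run output, a small helper along these lines may be convenient. It is a sketch only: the helper names `latest_dry_run` and `load_results` are hypothetical and not part of the package, and because the JSON schema is not documented here, the code loads the file without assuming any keys.

```python
import json
from pathlib import Path


def latest_dry_run(results_dir="data/eval_results"):
    """Return the newest eval_dryrun_*.json path by modification time, or None."""
    directory = Path(results_dir)
    if not directory.is_dir():
        return None
    candidates = sorted(
        directory.glob("eval_dryrun_*.json"),
        key=lambda p: p.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


def load_results(path):
    """Parse a results file; the structure of the returned object is not assumed."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


newest = latest_dry_run()
if newest is None:
    print("no dry-run results found")
else:
    print(f"{newest}: top-level type {type(load_results(newest)).__name__}")
```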

Dimension Coverage

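| Dimension | ID Prefix | Count | Source |
|-----------|-----------|-------|--------|
| Question Policy | `qp` | 7 | original |
| Format Compliance | `fc` | 5 | original |
| Omission | `om` | 3 | original |
| Historical Annoyance | `ha` | 5 | original |
| Edge Case | `ec` | 4 | original |
| **Recall** | `rc` | 15 | expanded |
| **Reasoning** | `rs` | 15 | expanded |
| **Temporal** | `tp` | 12 | expanded |
| **Counterfactual** | `cf` | 12 | expanded |
| **Adversarial** | `av` | 12 | expanded |
| **Generalization** | `gz` | 10 | expanded |
| **Consistency** | `cs` | 10 | expanded |
| **Precision** | `pr` | 10 | expanded |
| **Negation** | `ng` | 10 | expanded |
| **Inference** | `if` | 10 | expanded |
| **Multi-Turn Coherence** | `mt` | 12 | expanded |
| **Ambiguity Handling** | `ah` | 10 | expanded |
| **Edge Case Extended** | `ex` | 12 | expanded |
| Total | | 174 | |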
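
The totals can be cross-checked with a few lines of Python. This is a standalone arithmetic sketch using the counts copied from the table; the dictionary and function names are illustrative and do not come from the package.

```python
# Per-dimension counts copied from the Dimension Coverage table.
ORIGINAL = {"qp": 7, "fc": 5, "om": 3, "ha": 5, "ec": 4}
EXPANDED = {
    "rc": 15, "rs": 15, "tp": 12, "cf": 12, "av": 12,
    "gz": 10, "cs": 10, "pr": 10, "ng": 10, "if": 10,
    "mt": 12, "ah": 10, "ex": 12,
}


def summarize(original, expanded):
    """Return (dimensions, original tests, expanded tests, total tests)."""
    n_original = sum(original.values())
    n_expanded = sum(expanded.values())
    return (
        len(original) + len(expanded),
        n_original,
        n_expanded,
        n_original + n_expanded,
    )


print(summarize(ORIGINAL, EXPANDED))  # (18, 24, 150, 174)
```

The result matches the figures stated earlier: 18 dimensions, 24 original tests, 150 new tests in 13 classes, and 174 in total.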

Priority Distribution

| Priority | Count |
|----------|-------|
| Critical | 11 |
| High | 59 |
| Medium | 93 |
| Low | 11 |

Category Distribution

| Category | Count |
|----------|-------|
| content_quality | 115 |
| behavioral_audit | 36 |
| comparative | 10 |
| policy_compliance | 7 |
| format_adherence | 6 |
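
Both distributions partition the same 174 tests, so each should sum to 174. The sketch below checks that and converts the counts to percentage shares; the names `PRIORITY`, `CATEGORY`, and `shares` are illustrative only.

```python
# Counts copied from the Priority and Category Distribution tables.
PRIORITY = {"Critical": 11, "High": 59, "Medium": 93, "Low": 11}
CATEGORY = {
    "content_quality": 115,
    "behavioral_audit": 36,
    "comparative": 10,
    "policy_compliance": 7,
    "format_adherence": 6,
}
TOTAL = 174


def shares(counts, total=TOTAL):
    """Return each bucket's share of the suite in percent, rounded to 0.1."""
    if sum(counts.values()) != total:
        raise ValueError(f"counts sum to {sum(counts.values())}, expected {total}")
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


for label, table in (("priority", PRIORITY), ("category", CATEGORY)):
    print(label, shares(table))
```

On these numbers, Medium-priority tests make up a little over half of the suite and `content_quality` roughly two thirds.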

Running the Eval

```bash
# Dry-run (no model needed, validates all tests and writes mock results)
python scripts/eval_dry_run.py

# Live run against model (requires TOGETHER_API_KEY or OPENAI_API_KEY)
EVAL_MODEL_ID="cognitivetwins/v3-llama3.2-3b" python scripts/eval_dry_run.py --live

# Parallel live run
python scripts/eval_dry_run.py --live --parallel 5
```
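
Before a live run it can help to confirm that one of the two API keys is present. The following pre-flight check is a sketch under assumptions: `has_live_credentials` is a hypothetical helper, and whether `scripts/eval_dry_run.py` performs a similar check itself is not stated here. Only the environment variable names are taken from the commands above.

```python
import os

# Either key is sufficient for a live run, per the comment in the commands above.
API_KEY_VARS = ("TOGETHER_API_KEY", "OPENAI_API_KEY")


def has_live_credentials(env):
    """Return True if at least one supported API key is set to a non-empty value."""
    return any(env.get(name) for name in API_KEY_VARS)


if has_live_credentials(os.environ):
    # EVAL_MODEL_ID is reported for information only; the parallel example
    # above omits it, so it is assumed here to be optional.
    print("live run possible; EVAL_MODEL_ID =", os.environ.get("EVAL_MODEL_ID"))
else:
    print("no API key found; use the dry-run mode instead")
```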

New Dimension Details

  • Recall (rc): Factual recall of algorithm complexity, SOLID/ACID principles, SQL JOINs, Git states, decorator patterns
  • Reasoning (rs): Algorithm selection, debugging logic, architecture tradeoffs, security analysis, concurrency reasoning
  • Temporal (tp): CI/CD ordering, migration sequences, build dependencies, lifecycle hooks, event ordering
  • Counterfactual (cf): Language design what-ifs, architecture alternatives, failure scenario analysis
  • Adversarial (av): Prompt injection resistance, trick questions, misleading premises, false authority claims
  • Generalization (gz): Design pattern application to novel domains, concept transfer, analogy completion
  • Consistency (cs): Paired questions that should receive the same answer when phrased differently (5 pairs = 10 tests)
  • Precision (pr): Exact port numbers, protocol versions, language-specific constants, concrete values
  • Negation (ng): Exclusion constraints, negative requirements, "do NOT" instructions
  • Inference (if): Algorithm identification from properties, bug diagnosis from symptoms, pattern recognition
  • Multi-Turn Coherence (mt): Context carryover, pronoun resolution, instruction accumulation, correction handling
  • Ambiguity Handling (ah): Sensible defaults on underspecified requests, missing language/scope/requirements
  • Edge Case Extended (ex): Empty inputs, division by zero, unicode handling, mutable defaults, boolean arithmetic
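
Several of the Edge Case Extended topics correspond to well-known Python behaviors. The snippet below illustrates those behaviors with plain assertions. It is not taken from `test_cases_expanded.py`, whose actual prompts are not reproduced here, and the function names are hypothetical.

```python
import unicodedata


def append_item(item, bucket=[]):
    """Mutable default: the same list object is reused across calls."""
    bucket.append(item)
    return bucket


def append_item_safe(item, bucket=None):
    """Usual fix: create a fresh list whenever no bucket is passed."""
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


def safe_divide(a, b):
    """Return a / b, or None when b is zero."""
    try:
        return a / b
    except ZeroDivisionError:
        return None


# Mutable defaults: both calls return the one shared list.
first = append_item(1)
second = append_item(2)
assert first is second and second == [1, 2]
assert append_item_safe(1) == [1] and append_item_safe(2) == [2]

# Boolean arithmetic: bool is a subclass of int.
assert isinstance(True, int) and True + True == 2
assert sum([True, False, True]) == 2

# Division by zero and empty inputs.
assert safe_divide(1, 0) is None
assert sum([]) == 0 and max([], default=None) is None

# Unicode: a precomposed and a decomposed "e with acute" differ until normalized.
composed, decomposed = "\u00e9", "e\u0301"
assert len(composed) == 1 and len(decomposed) == 2
assert unicodedata.normalize("NFC", decomposed) == composed
```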

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

Comp-Core/packages/cognitive-twin/CRP-1.2-COMPLETE.md
