# CRP-1.2: Expanded Evaluation Suite (174 Questions)
Status: COMPLETE
## What Changed
Expanded the CognitiveTwin V3 evaluation suite from 24 to 174 test cases across 18 dimensions.
### Files Modified
- `cognitive_twin/v3/eval/test_cases_expanded.py` — NEW (2309 lines, 150 tests across 13 classes)
- `cognitive_twin/v3/eval/suite.py` — Registered all 13 new generators
- `cognitive_twin/v3/eval/__init__.py` — Added exports for expanded test classes
- `scripts/eval_dry_run.py` — NEW dry-run + live eval script
### Files Generated
- `data/eval_results/eval_dryrun_*.json` — Mock results with full structure
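As a rough illustration of the dry-run output listed above, the sketch below writes mock results to a timestamped file matching `eval_dryrun_*.json`. The function name `write_mock_results`, the test ID format (`qp_001`), and the JSON fields are assumptions made for this example; the actual schema produced by `scripts/eval_dry_run.py` is not shown in this document.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_mock_results(test_ids, out_dir):
    """Write one placeholder result per test ID to a timestamped JSON file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # The timestamp fills the "*" in the eval_dryrun_*.json pattern.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = out_dir / f"eval_dryrun_{stamp}.json"

    payload = {
        "mode": "dry_run",
        "total": len(test_ids),
        # Hypothetical per-test record; no model is called in a dry run.
        "results": [
            {"test_id": tid, "passed": None, "response": "<mock>"}
            for tid in test_ids
        ],
    }
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        written = write_mock_results(["qp_001", "rc_001"], tmp)
        print(written.name)
```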
## Dimension Coverage
| Dimension | ID Prefix | Count | Source |
|---|---|---|---|
| Question Policy | `qp` | 7 | original |
| Format Compliance | `fc` | 5 | original |
| Omission | `om` | 3 | original |
| Historical Annoyance | `ha` | 5 | original |
| Edge Case | `ec` | 4 | original |
| Recall | `rc` | 15 | expanded |
| Reasoning | `rs` | 15 | expanded |
| Temporal | `tp` | 12 | expanded |
| Counterfactual | `cf` | 12 | expanded |
| Adversarial | `av` | 12 | expanded |
| Generalization | `gz` | 10 | expanded |
| Consistency | `cs` | 10 | expanded |
| Precision | `pr` | 10 | expanded |
| Negation | `ng` | 10 | expanded |
| Inference | `if` | 10 | expanded |
| Multi-Turn Coherence | `mt` | 12 | expanded |
| Ambiguity Handling | `ah` | 10 | expanded |
| Edge Case Extended | `ex` | 12 | expanded |
| Total | | 174 | |
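The totals can be cross-checked with a few lines of arithmetic. The counts below are copied from the coverage table; the dictionary names are only for this example.

```python
# Per-dimension counts copied from the coverage table, keyed by ID prefix.
ORIGINAL = {"qp": 7, "fc": 5, "om": 3, "ha": 5, "ec": 4}
EXPANDED = {
    "rc": 15, "rs": 15, "tp": 12, "cf": 12, "av": 12,
    "gz": 10, "cs": 10, "pr": 10, "ng": 10, "if": 10,
    "mt": 12, "ah": 10, "ex": 12,
}

original_total = sum(ORIGINAL.values())
expanded_total = sum(EXPANDED.values())
dimension_count = len(ORIGINAL) + len(EXPANDED)

print(original_total, expanded_total, original_total + expanded_total, dimension_count)
# prints: 24 150 174 18
```

This agrees with the figures stated elsewhere in the document: 24 original tests, 150 new tests across 13 classes, and 174 tests over 18 dimensions.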
## Priority Distribution
| Priority | Count |
|---|---|
| Critical | 11 |
| High | 59 |
| Medium | 93 |
| Low | 11 |
## Category Distribution
| Category | Count |
|---|---|
| content_quality | 115 |
| behavioral_audit | 36 |
| comparative | 10 |
| policy_compliance | 7 |
| format_adherence | 6 |
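Distributions like the two above can be tallied directly from the test cases. The sketch below assumes a hypothetical `EvalCase` record with `priority` and `category` fields; the suite's real class and field names are not shown in this document, and the three sample cases (including their priority and category assignments) are made up for illustration. Only the published counts are taken from the tables.

```python
from collections import Counter
from dataclasses import dataclass

# Published counts, copied from the priority and category tables.
PRIORITY_COUNTS = {"critical": 11, "high": 59, "medium": 93, "low": 11}
CATEGORY_COUNTS = {
    "content_quality": 115,
    "behavioral_audit": 36,
    "comparative": 10,
    "policy_compliance": 7,
    "format_adherence": 6,
}


@dataclass(frozen=True)
class EvalCase:
    # Hypothetical record; field names are assumptions, not the suite's schema.
    test_id: str
    priority: str
    category: str


def distribution(cases, field):
    """Count cases by the value of one attribute, e.g. 'priority'."""
    return Counter(getattr(case, field) for case in cases)


# Illustrative cases only; these assignments are invented for the example.
cases = [
    EvalCase("rc_001", "high", "content_quality"),
    EvalCase("av_001", "critical", "behavioral_audit"),
    EvalCase("pr_001", "medium", "content_quality"),
]
by_priority = distribution(cases, "priority")
by_category = distribution(cases, "category")

# Both published breakdowns should account for every test in the suite.
assert sum(PRIORITY_COUNTS.values()) == 174
assert sum(CATEGORY_COUNTS.values()) == 174
```

Running the same tally over the full suite and comparing it with the published counts would be one way to catch a test registered with a missing or misspelled priority or category.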
## Running the Eval

```bash
# Dry-run (no model needed, validates all tests and writes mock results)
python scripts/eval_dry_run.py

# Live run against model (requires TOGETHER_API_KEY or OPENAI_API_KEY)
EVAL_MODEL_ID="cognitivetwins/v3-llama3.2-3b" python scripts/eval_dry_run.py --live

# Parallel live run
python scripts/eval_dry_run.py --live --parallel 5
```

## New Dimension Details
- Recall (rc): Factual recall of algorithm complexity, SOLID/ACID principles, SQL JOINs, Git states, decorator patterns
- Reasoning (rs): Algorithm selection, debugging logic, architecture tradeoffs, security analysis, concurrency reasoning
- Temporal (tp): CI/CD ordering, migration sequences, build dependencies, lifecycle hooks, event ordering
- Counterfactual (cf): Language design what-ifs, architecture alternatives, failure scenario analysis
- Adversarial (av): Prompt injection resistance, trick questions, misleading premises, false authority claims
- Generalization (gz): Design pattern application to novel domains, concept transfer, analogy completion
- Consistency (cs): Paired questions that ask the same thing in different phrasings and expect matching answers (5 pairs = 10 tests)
- Precision (pr): Exact port numbers, protocol versions, language-specific constants, concrete values
- Negation (ng): Exclusion constraints, negative requirements, "do NOT" instructions
- Inference (if): Algorithm identification from properties, bug diagnosis from symptoms, pattern recognition
- Multi-Turn Coherence (mt): Context carryover, pronoun resolution, instruction accumulation, correction handling
- Ambiguity Handling (ah): Choosing sensible defaults for underspecified requests (missing language, scope, or requirements)
- Edge Case Extended (ex): Empty inputs, division by zero, unicode handling, mutable defaults, boolean arithmetic
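Two of the Edge Case Extended topics, mutable defaults and boolean arithmetic, refer to standard Python behaviour that is easy to get wrong. The snippet below only illustrates that behaviour; the function names are made up for this example, and the suite's actual prompts and expected answers are not shown in this document.

```python
def append_shared(item, bucket=[]):
    # Pitfall: the default list is created once, when the function is defined,
    # so every call that omits `bucket` mutates the same list.
    bucket.append(item)
    return bucket


def append_fresh(item, bucket=None):
    # Common fix: use None as a sentinel and build a new list per call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first = append_shared("a")
second = append_shared("b")
print(first is second, second)  # prints: True ['a', 'b']

print(append_fresh("a"), append_fresh("b"))  # prints: ['a'] ['b']

# Boolean arithmetic: bool is a subclass of int, so True counts as 1.
print(isinstance(True, int), True + True, sum([True, False, True]))
# prints: True 2 2
```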
## Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
## Source Anchor
Comp-Core/packages/cognitive-twin/CRP-1.2-COMPLETE.md