Grand Diomande Research · Full HTML Reader

CognitiveTwin V3 Evaluation Report

Generated: 2025-12-31 18:48:27 UTC
Model: mock

Summary

| Metric | Value |
|--------|-------|
| Total Tests | 14 |
| Passed | 13 |
| Failed | 1 |
| Pass Rate | **92.9%** |
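
The pass rate follows directly from the counts in the summary. A minimal sketch of that arithmetic, assuming the rate is a percentage rounded to one decimal place (the function name `pass_rate` is illustrative and does not come from the evaluation harness):

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * passed / total, 1)


# Counts taken from the Summary section of this report.
rate = pass_rate(passed=13, total=14)
print(f"Pass rate: {rate}%")
```

With 13 of 14 tests passing, this reproduces the 92.9 figure reported above.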

Scores

| Score Type | Average |
|------------|---------|
| Policy Compliance | 1.00 |
| Format Adherence | 0.93 |
| Content Quality | 0.65 |

Priority Breakdown

| Priority | Pass Rate |
|----------|-----------|
| Critical | 100.0% |
| High | 87.5% |
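
The per-priority rates can be cross-checked against the test names and priorities listed in this report. The sketch below is one way to do that grouping; the `RESULTS` list and the `pass_rate_by_priority` helper are illustrative assumptions, not the harness's own data structures.

```python
from collections import defaultdict

# (test name, priority, passed) as listed in this report.
RESULTS = [
    ("qp_001_clear_directive", "critical", True),
    ("qp_002_implementation", "critical", True),
    ("qp_003_no_option_dump", "high", True),
    ("qp_005_no_let_me_know", "high", True),
    ("fc_001_no_bullets", "high", True),
    ("fc_002_json_format", "high", False),
    ("fc_003_no_omit", "critical", True),
    ("om_001_preserve_all", "critical", True),
    ("om_002_no_placeholders", "high", True),
    ("ha_001_stop_asking", "critical", True),
    ("ha_002_full_content", "critical", True),
    ("ha_003_just_do_it", "high", True),
    ("ec_001_multi_requirement", "high", True),
    ("ec_004_long_code", "high", True),
]


def pass_rate_by_priority(results):
    """Group results by priority and return each group's pass rate in percent."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for _name, priority, passed in results:
        totals[priority] += 1
        if passed:
            passes[priority] += 1
    return {p: round(100.0 * passes[p] / totals[p], 1) for p in totals}


breakdown = pass_rate_by_priority(RESULTS)
for priority, rate in breakdown.items():
    print(f"{priority}: {rate}%")
```

Six of six critical tests and seven of eight high-priority tests pass, which matches the 100.0 and 87.5 figures above.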

Performance

- Average Latency: 0ms

Failures by Category

- format_adherence: 1 failure

Failed Tests

fc_002_json_format

Category: format_adherence
Priority: high

Failures:

Scores:
- Policy: 1.00
- Format: 0.00
- Content: 0.50

Response (truncated):

The requested task has been completed. The implementation follows best practices and includes proper error handling.
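
The report does not show how `fc_002_json_format` is scored. As a hedged illustration only, the sketch below assumes the check requires the whole response to parse as a JSON object or array; `json_format_score` is a hypothetical name, and the real harness may use a different rule. Under that assumption, the plain-prose mock response above would receive a format score of 0.

```python
import json


def json_format_score(response: str) -> float:
    """Hypothetical scorer: 1.0 if the response is a JSON object or array, else 0.0."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    # Bare scalars such as 42 or "text" are valid JSON but not structured output.
    return 1.0 if isinstance(parsed, (dict, list)) else 0.0


# The (truncated) mock response quoted in this report.
mock_response = (
    "The requested task has been completed. The implementation follows "
    "best practices and includes proper error handling."
)

print(json_format_score(mock_response))
print(json_format_score('{"status": "completed"}'))
```

Because the run used the `mock` model, the response is a fixed prose string, so a failure on a JSON-format test is expected regardless of the prompt.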

---

Passed Tests

- qp_001_clear_directive (critical) - 0ms
- qp_002_implementation (critical) - 0ms
- qp_003_no_option_dump (high) - 0ms
- qp_005_no_let_me_know (high) - 0ms
- fc_001_no_bullets (high) - 0ms
- fc_003_no_omit (critical) - 0ms
- om_001_preserve_all (critical) - 0ms
- om_002_no_placeholders (high) - 0ms
- ha_001_stop_asking (critical) - 0ms
- ha_002_full_content (critical) - 0ms
- ha_003_just_do_it (high) - 0ms
- ec_001_multi_requirement (high) - 0ms
- ec_004_long_code (high) - 0ms

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

Comp-Core/core/retrieval/cc-rag-plus-plus/eval_results_test/evaluation_report.md
