Grand Diomande Research · Full HTML Reader

RAG++ v0 Evaluation Report

- **Action Classification F1:** 84.5%
- **Relevance Rate:** 7.7%
- **"Oh Wow" Rate:** 0.0%
- **Avg Relevance Score:** 2.08/5.0
- **Regime Differentiation:** 16.7%
- **Explainability Score:** 82.0%


Generated: 2025-12-18T03:20:27.601Z
Version: v0.1.0

---

Executive Summary

⚠️ RAG++ v0 has met 1/5 success criteria. Further iteration recommended.

Quick Stats

  • Action Classification F1: 84.5%
  • Relevance Rate: 7.7%
  • "Oh Wow" Rate: 0.0%
  • Avg Relevance Score: 2.08/5.0
  • Regime Differentiation: 16.7%
  • Explainability Score: 82.0%

Dataset: 27 life states, 26 transitions, 320 events

V0 Success Criteria

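| Criterion | Target | Status |
| --- | --- | --- |
| Action Classification F1 | ≥ 70% | ✅ MET |
| Relevance Rate | ≥ 65% | ❌ NOT MET |
| "Oh Wow" Rate | ≥ 30% | ❌ NOT MET |
| Better than Random | Yes (actual: No) | ❌ NOT MET |
| Contextual Awareness | ≥ 50% | ❌ NOT MET |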
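
The "1/5 criteria met" headline can be reproduced from the numbers in this report. The sketch below is a minimal illustration, not the evaluation harness itself: the `Criterion` class and `count_met` helper are hypothetical names, and Contextual Awareness is assumed to be measured by the regime differentiation score, as the appendix criteria suggest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    actual: float  # measured value, in percent
    target: float  # minimum required value, in percent

    def met(self) -> bool:
        return self.actual >= self.target


# Values copied from this report. "Better than Random" is a yes/no check
# and is handled separately below.
CRITERIA = [
    Criterion("Action Classification F1", 84.5, 70.0),
    Criterion("Relevance Rate", 7.7, 65.0),
    Criterion('"Oh Wow" Rate', 0.0, 30.0),
    # Assumption: contextual awareness == regime differentiation.
    Criterion("Contextual Awareness", 16.7, 50.0),
]
BETTER_THAN_RANDOM = False  # reported as not met


def count_met(criteria, better_than_random):
    """Count satisfied criteria, including the boolean baseline check."""
    return sum(c.met() for c in criteria) + int(better_than_random)


met = count_met(CRITERIA, BETTER_THAN_RANDOM)
total = len(CRITERIA) + 1
print(f"{met}/{total} success criteria met")  # prints "1/5 success criteria met"
```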

Data Statistics

| Metric | Value |
| --- | --- |
| Total Users | 1 |
| Total Life States | 27 |
| Total Transitions | 26 |
| Total Life Events | 320 |
| Avg Transitions/User | 26.0 |
| Date Range | 2025-09-29 to 2025-12-16 |

Action Classification Evaluation

Overall Metrics

| Metric | Value |
| --- | --- |
| Accuracy | 73.3% |
| Precision | 95.0% |
| Recall | 76.3% |
| F1 Score | 84.5% |
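
The appendix states that precision, recall, and F1 are macro-averaged. As a rough illustration of what that means, the sketch below computes per-class metrics and their unweighted means. The function name `macro_prf` and the toy labels are illustrative only, and how the actual evaluation treats events with no predicted action is not specified in this report.

```python
from collections import Counter


def macro_prf(y_true, y_pred):
    """Return per-class (precision, recall, F1) and their macro averages."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    per_class = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = (precision, recall, f1)

    # Macro average: unweighted mean over classes, regardless of support.
    macro = tuple(
        sum(scores[i] for scores in per_class.values()) / len(labels)
        for i in range(3)
    )
    return per_class, macro


# Toy labels (not taken from the evaluation set).
y_true = ["ReduceGravity", "ReduceGravity", "ReduceMass", "IncreaseThrust"]
y_pred = ["ReduceGravity", "ReduceMass", "ReduceMass", "IncreaseThrust"]
per_class, macro = macro_prf(y_true, y_pred)
print([round(m, 3) for m in macro])  # [0.833, 0.833, 0.778]
```

Because macro F1 is the mean of per-class F1 scores, it generally differs from the harmonic mean of macro precision and macro recall, as the toy output shows.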

Per-Action Performance

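| Action Type | Precision | Recall | F1 Score | Support |
| --- | --- | --- | --- | --- |
| ReduceGravity | 100.0% | | 93.3% | |
| ReduceMass | 80.0% | | | |
| IncreaseAlignment | 100.0% | | | |
| IncreaseThrust | 100.0% | | | |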

Performance by Method

| Method | Precision | Recall | F1 Score |
| --- | --- | --- | --- |
| heuristic | 95.8% | | |

Recommendation Quality Evaluation

Relevance Metrics

| Metric | Value |
| --- | --- |
| Relevance Rate (3-5) | 7.7% |
| "Oh Wow" Rate (5) | 0.0% |
| Avg Relevance Score | 2.08/5.0 |

Confidence & Diversity

| Metric | Value |
| --- | --- |
| Avg Confidence | 8.8 |
| Calibration Error | 18.1% |
| Action Diversity | 9.7 |
| Avg Supporting Transitions | 3.0 |

Feedback by Regime

| Regime | Evaluations | Avg Relevance |
| --- | --- | --- |
| threshold | 3 | 2.00/5.0 |
| approaching | 5 | 2.00/5.0 |
| falling | 5 | 2.20/5.0 |
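
The per-regime figures are consistent with the overall average relevance of 2.08/5.0 reported earlier. The short check below weights each regime's average by its evaluation count. The observation about the relevance rate is an inference, assuming the rate is computed over these same 13 evaluations.

```python
# (evaluations, average relevance) per regime, copied from the table above.
regimes = {
    "threshold": (3, 2.00),
    "approaching": (5, 2.00),
    "falling": (5, 2.20),
}

total_evals = sum(n for n, _ in regimes.values())
weighted_avg = sum(n * avg for n, avg in regimes.values()) / total_evals

print(total_evals, round(weighted_avg, 2))  # 13 2.08

# Assumption: if the relevance rate uses the same 13 evaluations, a single
# rating of 3 or higher would give 1/13, which matches the reported 7.7%.
one_relevant_rate = round(100 * 1 / total_evals, 1)
print(one_relevant_rate)  # 7.7
```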

State-Awareness Evaluation

Regime Awareness

| Metric | Value |
| --- | --- |
| Regime Consistency | 94.4% |
| Regime Differentiation | 16.7% |
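
This report does not give the exact formulas for these two metrics. One plausible reading of the appendix definitions (same regime → similar recommendations, different regimes → different recommendations) is a pairwise comparison of the top recommended action. The sketch below implements that reading only. The function `regime_awareness` and the toy data are hypothetical, and the real evaluation may compare full recommendation lists instead.

```python
from itertools import combinations


def regime_awareness(recs):
    """recs: list of (regime, top_action) pairs, one per evaluated state.

    Returns (consistency, differentiation) as fractions in [0, 1]:
      consistency     = share of same-regime pairs with the same top action
      differentiation = share of cross-regime pairs with different top actions
    """
    same_total = same_agree = cross_total = cross_differ = 0
    for (regime_a, action_a), (regime_b, action_b) in combinations(recs, 2):
        if regime_a == regime_b:
            same_total += 1
            same_agree += action_a == action_b
        else:
            cross_total += 1
            cross_differ += action_a != action_b
    consistency = same_agree / same_total if same_total else 0.0
    differentiation = cross_differ / cross_total if cross_total else 0.0
    return consistency, differentiation


# Toy data (not from the evaluation run).
recs = [
    ("falling", "ReduceMass"),
    ("falling", "ReduceMass"),
    ("approaching", "ReduceMass"),
    ("threshold", "IncreaseThrust"),
]
consistency, differentiation = regime_awareness(recs)
print(consistency, differentiation)  # 1.0 0.6
```

Under this reading, high consistency combined with low differentiation means the system recommends nearly the same action everywhere, which is the pattern the table shows.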

Flag Sensitivity

scattered:

  • Avg Difference: 25.0
  • Action Correlations:
      • ReduceGravity: -20.0
      • ReduceMass: +50.0
      • IncreaseAlignment: 0.0
      • IncreaseThrust: -30.0

heavy:

  • Avg Difference: 20.0
  • Action Correlations:
      • ReduceGravity: +40.0
      • ReduceMass: -10.0
      • IncreaseAlignment: 0.0
      • IncreaseThrust: -30.0

pressured:

  • Avg Difference: 5.0
  • Action Correlations:
      • ReduceGravity: 0.0
      • ReduceMass: -10.0
      • IncreaseAlignment: 0.0
      • IncreaseThrust: +10.0

Phase Awareness

| Metric | Value |
| --- | --- |
| Day-of-Week Variation | 0.0 |
| Time-of-Day Variation | 0.0 |

Explainability

Reasoning Quality Score: 82.0%

Baseline Comparisons

| Approach | Avg Relevance | "Oh Wow" Rate |
| --- | --- | --- |
| Random Policy | 2.08/5.0 | 0.0% |
| Most Frequent Action | 3.26/5.0 | 5.0% |
| RAG++ v0 | 2.08/5.0 | 0.0% |

Improvement over baselines:

  • vs Random: -0.1%
  • vs Most Frequent: -36.4%
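
The report does not state how these improvement figures are computed. A natural reading, assumed here, is the relative difference in average relevance against each baseline. The helper name `relative_improvement` is illustrative.

```python
def relative_improvement(system: float, baseline: float) -> float:
    """Relative difference of system over baseline, in percent."""
    return 100.0 * (system - baseline) / baseline


# Rounded averages from the baseline comparison table.
RAG_V0 = 2.08
RANDOM = 2.08
MOST_FREQUENT = 3.26

vs_random = round(relative_improvement(RAG_V0, RANDOM), 1)
vs_most_frequent = round(relative_improvement(RAG_V0, MOST_FREQUENT), 1)
print(vs_random, vs_most_frequent)  # 0.0 -36.2
```

With the rounded table values this gives 0.0 and -36.2, close to the reported -0.1 and -36.4. The small gap would be consistent with the report using unrounded averages.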

Key Findings

  • ✅ Action classification achieved 84.5% F1 (target: ≥ 70%).
      • Best-performing action type: ReduceGravity (F1: 93.3%).
  • ⚠️ Relevance rate is 7.7% (target: ≥ 65%).
  • ⚠️ "Oh wow" rate is 0.0% (target: ≥ 30%).
  • Confidence is well-calibrated (calibration error < 20%).
  • ⚠️ Contextual awareness is weak (16.7% regime differentiation; target: ≥ 50%).
      • Flag-action correlations detected: scattered → ReduceMass, heavy → ReduceGravity.

Recommendations

  • Recommendation Quality: Improve state similarity function or increase transition data volume.
  • "Oh Wow" Rate: Consider highlighting more surprising/counterintuitive transitions.
  • Contextual Awareness: Increase weight of regime and flags in state similarity function.
  • Data Volume: Current transition count (26) is low. Collect more user data to improve recommendation quality.
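
To make the "increase weight of regime and flags" recommendation concrete, here is a minimal sketch of a weighted state similarity. The actual similarity function used by RAG++ v0 is not shown in this report, so everything below is hypothetical: the `LifeState` fields, the choice of cosine and Jaccard components, and the weight values.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LifeState:
    regime: str          # e.g. "falling", "approaching", "threshold"
    flags: frozenset     # e.g. frozenset({"heavy", "scattered"})
    features: tuple      # hypothetical numeric state features


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    if not a and not b:
        return 1.0  # two empty flag sets count as identical
    return len(a & b) / len(a | b)


def similarity(s, t, w_features=0.8, w_regime=0.1, w_flags=0.1):
    """Weighted mix of feature, regime, and flag agreement, in [0, 1]
    for non-negative features."""
    score = (
        w_features * cosine(s.features, t.features)
        + w_regime * (1.0 if s.regime == t.regime else 0.0)
        + w_flags * jaccard(s.flags, t.flags)
    )
    return score / (w_features + w_regime + w_flags)


query = LifeState("falling", frozenset({"heavy"}), (1.0, 0.0))
same_features = LifeState("approaching", frozenset(), (1.0, 0.0))
same_context = LifeState("falling", frozenset({"heavy"}), (0.6, 0.8))

# Feature-dominated weights prefer the state with matching features.
print(similarity(query, same_features) > similarity(query, same_context))  # True

# Heavier regime/flag weights flip the ranking toward matching context.
ctx = dict(w_features=0.4, w_regime=0.3, w_flags=0.3)
print(similarity(query, same_context, **ctx) > similarity(query, same_features, **ctx))  # True
```

Raising the regime and flag weights changes which historical states are retrieved, which is one way low regime differentiation could be addressed. Whether it also improves relevance would need to be measured.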

Appendix

Evaluation Methodology

RAG++ v0 was evaluated across three dimensions:

1. Action Classification: 30 labeled events tested with heuristic and hybrid methods. Metrics: precision, recall, F1 (macro-averaged).

2. Recommendation Quality: 10-15 diverse states sampled from historical data. Recommendations evaluated on 1-5 relevance scale. Compared against random and most-frequent baselines.

3. State-Awareness: Tested regime consistency (same regime → similar recs), regime differentiation (different regimes → different recs), flag sensitivity (flags → appropriate actions), and explainability (reasoning quality).

Success Criteria (v0):
- Action Classification F1 ≥ 70%
- Relevance Rate ≥ 65%
- "Oh Wow" Rate ≥ 30%
- Better than random baseline
- Regime differentiation ≥ 50%

All evaluations used 7-day transition horizons and minimum 0.2 confidence threshold.

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

Comp-Core/backend/cc-trajectory/services/trajectory-core/evaluation-report.md
