RAG++ v0 Evaluation Report
Generated: 2025-12-18T03:20:27.601Z
Version: v0.1.0
---
Executive Summary
⚠️ RAG++ v0 has met 1/5 success criteria. Further iteration recommended.
Quick Stats
- Action Classification F1: 84.5%
- Relevance Rate: 7.7%
- "Oh Wow" Rate: 0.0%
- Avg Relevance Score: 2.08/5.0
- Regime Differentiation: 16.7%
- Explainability Score: 82.0%
Dataset: 27 life states, 26 transitions, 320 events
V0 Success Criteria
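| Criterion | Target | Actual | Status |
|---|---|---|---|
| Action Classification F1 | ≥ 70% | 84.5% | ✅ MET |
| Relevance Rate | ≥ 65% | 7.7% | ❌ NOT MET |
| "Oh Wow" Rate | ≥ 30% | 0.0% | ❌ NOT MET |
| Better than Random | Yes | No | ❌ NOT MET |
| Contextual Awareness (Regime Differentiation) | ≥ 50% | 16.7% | ❌ NOT MET |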
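The pass/fail tally can be reproduced with a short check. The sketch below is illustrative only: the metric values and thresholds are copied from this report, while the dictionary layout and the `evaluate_criteria` helper are hypothetical names, not part of the RAG++ codebase.

```python
# Metric values copied from this report (percentages).
# The key names and this helper are assumptions made for illustration.
metrics = {
    "action_classification_f1": 84.5,
    "relevance_rate": 7.7,
    "oh_wow_rate": 0.0,
    "regime_differentiation": 16.7,
}

# Minimum value each metric must reach to count as met.
thresholds = {
    "action_classification_f1": 70.0,
    "relevance_rate": 65.0,
    "oh_wow_rate": 30.0,
    "regime_differentiation": 50.0,
}

# The report lists "Better than Random" as not met.
better_than_random = False


def evaluate_criteria(metrics, thresholds, better_than_random):
    """Return a mapping of criterion name to True (met) or False (not met)."""
    results = {name: metrics[name] >= minimum for name, minimum in thresholds.items()}
    results["better_than_random"] = better_than_random
    return results


results = evaluate_criteria(metrics, thresholds, better_than_random)
met = sum(results.values())
print(f"{met}/{len(results)} success criteria met")
```

With the values above, only the action classification criterion passes, which matches the 1/5 figure in the Executive Summary.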
Data Statistics
| Metric | Value |
|---|---|
| Total Users | 1 |
| Total Life States | 27 |
| Total Transitions | 26 |
| Total Life Events | 320 |
| Avg Transitions/User | 26.0 |
| Date Range | 2025-09-29 to 2025-12-16 |
Action Classification Evaluation
Overall Metrics
| Metric | Value |
|---|---|
| Accuracy | 73.3% |
| Precision | 95.0% |
| Recall | 76.3% |
| F1 Score | 84.5% |
Per-Action Performance
| Action Type | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| ReduceGravity | 100.0% | | 93.3% | |
| ReduceMass | 80.0% | | | |
| IncreaseAlignment | 100.0% | | | |
| IncreaseThrust | 100.0% | | | |
Performance by Method
| Method | Precision | Recall | F1 Score |
|---|---|---|---|
| heuristic | 95.8% | | |
Recommendation Quality Evaluation
Relevance Metrics
| Metric | Value |
|---|---|
| Relevance Rate (3-5) | 7.7% |
| "Oh Wow" Rate (5) | 0.0% |
| Avg Relevance Score | 2.08/5.0 |
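As a minimal sketch of how these three figures relate to the underlying 1-5 ratings, the snippet below computes them from a list of scores. The individual ratings are not included in this report, so the `scores` list is hypothetical: 13 values (the number of evaluations in the Feedback by Regime table) chosen only so that the aggregates reproduce the reported numbers. The function name `relevance_metrics` is likewise an assumption.

```python
from statistics import mean


def relevance_metrics(scores):
    """Summarize 1-5 relevance ratings.

    relevance_rate: percentage of ratings in the 3-5 range
    oh_wow_rate:    percentage of ratings equal to 5
    avg_relevance:  arithmetic mean of all ratings
    """
    n = len(scores)
    return {
        "relevance_rate": 100.0 * sum(s >= 3 for s in scores) / n,
        "oh_wow_rate": 100.0 * sum(s == 5 for s in scores) / n,
        "avg_relevance": mean(scores),
    }


# Hypothetical ratings, NOT the actual evaluation data.
scores = [2] * 12 + [3]

summary = relevance_metrics(scores)
print(round(summary["relevance_rate"], 1))  # 7.7
print(round(summary["oh_wow_rate"], 1))     # 0.0
print(round(summary["avg_relevance"], 2))   # 2.08
```

Under this reading, a relevance rate of 7.7% over 13 evaluations corresponds to a single recommendation rated 3 or higher.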
Confidence & Diversity
| Metric | Value |
|---|---|
| Avg Confidence | 8.8 |
| Calibration Error | 18.1 |
| Action Diversity | 9.7 |
| Avg Supporting Transitions | 3.0 |
Feedback by Regime
| Regime | Evaluations | Avg Relevance |
|---|---|---|
| threshold | 3 | 2.00/5.0 |
| approaching | 5 | 2.00/5.0 |
| falling | 5 | 2.20/5.0 |
State-Awareness Evaluation
Regime Awareness
| Metric | Value |
|---|---|
| Regime Consistency | 94.4% |
| Regime Differentiation | 16.7% |
Flag Sensitivity
scattered:
- Avg Difference: 25.0
- Action Correlations:
  - ReduceGravity: -20.0
  - ReduceMass: +50.0
  - IncreaseAlignment: 0.0
  - IncreaseThrust: -30.0

heavy:
- Avg Difference: 20.0
- Action Correlations:
  - ReduceGravity: +40.0
  - ReduceMass: -10.0
  - IncreaseAlignment: 0.0
  - IncreaseThrust: -30.0

pressured:
- Avg Difference: 5.0
- Action Correlations:
  - ReduceGravity: 0.0
  - ReduceMass: -10.0
  - IncreaseAlignment: 0.0
  - IncreaseThrust: +10.0
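The report does not define "Avg Difference", but the listed values are consistent with the mean absolute value of the four per-action correlations for each flag. The sketch below checks that reading; the formula is an inference from the numbers, and the helper names `avg_difference` and `strongest_action` are hypothetical.

```python
# Per-flag action correlations copied from the Flag Sensitivity listing.
flag_correlations = {
    "scattered": {"ReduceGravity": -20.0, "ReduceMass": 50.0,
                  "IncreaseAlignment": 0.0, "IncreaseThrust": -30.0},
    "heavy": {"ReduceGravity": 40.0, "ReduceMass": -10.0,
              "IncreaseAlignment": 0.0, "IncreaseThrust": -30.0},
    "pressured": {"ReduceGravity": 0.0, "ReduceMass": -10.0,
                  "IncreaseAlignment": 0.0, "IncreaseThrust": 10.0},
}


def avg_difference(correlations):
    """Mean absolute correlation across actions (assumed definition)."""
    return sum(abs(value) for value in correlations.values()) / len(correlations)


def strongest_action(correlations):
    """Action with the largest positive correlation for a flag."""
    return max(correlations, key=correlations.get)


for flag, correlations in flag_correlations.items():
    print(flag, avg_difference(correlations), strongest_action(correlations))
```

Under this assumed definition the computed averages are 25.0, 20.0, and 5.0, matching the listed values, and the strongest positive associations are scattered → ReduceMass and heavy → ReduceGravity.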
Phase Awareness
| Metric | Value |
|---|---|
| Day-of-Week Variation | 0.0 |
| Time-of-Day Variation | 0.0 |
Explainability
Reasoning Quality Score: 82.0%
Baseline Comparisons
| Approach | Avg Relevance | "Oh Wow" Rate |
|---|---|---|
| Random Policy | 2.08/5.0 | 0.0% |
| Most Frequent Action | 3.26/5.0 | 5.0% |
| **RAG++ v0** | **2.08/5.0** | **0.0%** |
Improvement over baselines:
- vs Random: -0.1%
- vs Most Frequent: -36.4%
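The report does not state how these improvement figures are derived. One plausible reading, offered here as an assumption, is the relative change in average relevance against each baseline. The sketch below applies that formula to the rounded values from the table; `relative_change` is a hypothetical helper name.

```python
def relative_change(system_score, baseline_score):
    """Percentage change of the system score relative to a baseline (assumed formula)."""
    return 100.0 * (system_score - baseline_score) / baseline_score


rag_avg = 2.08            # RAG++ v0 average relevance (rounded, from the table)
random_avg = 2.08         # Random Policy average relevance (rounded)
most_frequent_avg = 3.26  # Most Frequent Action average relevance (rounded)

vs_random = relative_change(rag_avg, random_avg)
vs_most_frequent = relative_change(rag_avg, most_frequent_avg)

print(round(vs_random, 1))         # 0.0
print(round(vs_most_frequent, 1))  # -36.2
```

The rounded inputs give 0.0 and about -36.2, close to but not identical to the reported -0.1 and -36.4. That gap would be consistent with the report using unrounded averages, though this cannot be confirmed from the report alone.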
Key Findings
- ✅ **Action classification achieved 84.5% F1**, above the 70% target.
  - Best performing action type: ReduceGravity (F1: 93.3%)
- ⚠️ **Relevance rate is 7.7%**, below the 65% target.
- ⚠️ **"Oh wow" rate is 0.0%**, below the 30% target.
- ✅ Confidence is well-calibrated (calibration error < 20%)
- ⚠️ Contextual awareness is weak (16.7% regime differentiation, below the 50% target)
  - Flag-action correlations detected: scattered → ReduceMass, heavy → ReduceGravity
Recommendations
- Recommendation Quality: Improve state similarity function or increase transition data volume.
- "Oh Wow" Rate: Consider highlighting more surprising/counterintuitive transitions.
- Contextual Awareness: Increase weight of regime and flags in state similarity function.
- Data Volume: Current transition count (26) is low. Collect more user data to improve recommendation quality.
Appendix
Evaluation Methodology
RAG++ v0 was evaluated across three dimensions:
1. Action Classification: 30 labeled events tested with heuristic and hybrid methods. Metrics: precision, recall, F1 (macro-averaged).
2. Recommendation Quality: 10-15 diverse states sampled from historical data. Recommendations evaluated on 1-5 relevance scale. Compared against random and most-frequent baselines.
3. State-Awareness: Tested regime consistency (same regime → similar recs), regime differentiation (different regimes → different recs), flag sensitivity (flags → appropriate actions), and explainability (reasoning quality).
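For the macro-averaged metrics in step 1, a minimal sketch is shown below. It assumes each labeled event carries exactly one action label, which the report does not confirm, and `macro_prf` is a hypothetical helper rather than the evaluator's actual code. The example labels are invented for illustration.

```python
from collections import Counter


def macro_prf(y_true, y_pred):
    """Per-class and macro-averaged precision, recall, and F1 for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, predicted in zip(y_true, y_pred):
        if truth == predicted:
            tp[truth] += 1
        else:
            fp[predicted] += 1  # predicted class gains a false positive
            fn[truth] += 1      # true class gains a false negative

    per_class = {}
    for label in labels:
        predicted_total = tp[label] + fp[label]
        actual_total = tp[label] + fn[label]
        precision = tp[label] / predicted_total if predicted_total else 0.0
        recall = tp[label] / actual_total if actual_total else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class[label] = (precision, recall, f1)

    # Macro average: unweighted mean of each metric over classes.
    macro = tuple(sum(values[i] for values in per_class.values()) / len(labels)
                  for i in range(3))
    return per_class, macro


# Invented example labels, not taken from the evaluation set.
y_true = ["ReduceGravity", "ReduceGravity", "ReduceMass", "IncreaseThrust"]
y_pred = ["ReduceGravity", "ReduceMass", "ReduceMass", "IncreaseThrust"]

per_class, (macro_p, macro_r, macro_f1) = macro_prf(y_true, y_pred)
print(per_class["ReduceGravity"])
print(round(macro_p, 3), round(macro_r, 3), round(macro_f1, 3))
```

Because macro F1 averages the per-class F1 scores, it need not equal the harmonic mean of macro precision and macro recall. That is one possible explanation for why the reported F1 (84.5) differs slightly from the harmonic mean of the reported precision and recall (about 84.6).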
Success Criteria (v0):
- Action Classification F1 ≥ 70%
- Relevance Rate ≥ 65%
- "Oh Wow" Rate ≥ 30%
- Better than random baseline
- Regime differentiation ≥ 50%
All evaluations used 7-day transition horizons and minimum 0.2 confidence threshold.
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
Comp-Core/backend/cc-trajectory/services/trajectory-core/evaluation-report.md