RAG++ Evaluation Framework
Complete evaluation suite for RAG++ v0 with action classification, recommendation quality, and state-awareness testing.
Quick Start
1. Generate Sample Data
cd services/trajectory-core
npx tsx src/scripts/evaluate-ragpp.ts --generate-data
This creates 3 evaluation users with 60-90 days of realistic trajectory data each.
2. Run Full Evaluation
npx tsx src/scripts/evaluate-ragpp.ts
Runs all three evaluation components:
- Action Classification (30-100 labeled events)
- Recommendation Quality (5-15 states)
- State-Awareness (regime consistency, flag sensitivity)
3. Run Quick Evaluation
npx tsx src/scripts/evaluate-ragpp.ts quick
Smaller test set for faster iteration (30 events, 5 states, 3 regime pairs).
4. Interactive Mode
npx tsx src/scripts/evaluate-ragpp.ts --interactive
Manually rate recommendations on a 1-5 scale instead of relying on simulation.
Evaluation Components
Action Classification Evaluation
File: [ActionClassificationEval.ts](./ActionClassificationEval.ts)
Measures: Precision, Recall, F1 for each action type
- ReduceGravity
- ReduceMass
- IncreaseAlignment
- IncreaseThrust
Target: F1 ≥ 70%
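As a rough illustration of how per-action numbers like these can be computed, the sketch below scores a single action type over multi-label events. The `ScoredEvent` and `ActionMetrics` shapes and the `scoreAction` name are assumptions made for this example; they are not the interfaces defined in `types.ts`.

```typescript
// Hypothetical shapes for illustration only; the real interfaces live in types.ts.
type ActionType = 'ReduceGravity' | 'ReduceMass' | 'IncreaseAlignment' | 'IncreaseThrust';

interface ScoredEvent {
  trueActionTypes: ActionType[];      // ground-truth labels
  predictedActionTypes: ActionType[]; // classifier output
}

interface ActionMetrics {
  precision: number;
  recall: number;
  f1: number;
  support: number; // events whose ground truth includes the action
}

function scoreAction(events: ScoredEvent[], action: ActionType): ActionMetrics {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  for (const event of events) {
    const isTrue = event.trueActionTypes.includes(action);
    const isPredicted = event.predictedActionTypes.includes(action);
    if (isTrue && isPredicted) tp++;
    else if (!isTrue && isPredicted) fp++;
    else if (isTrue && !isPredicted) fn++;
  }
  // Guard against empty denominators so an unseen action scores 0 rather than NaN.
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1, support: tp + fn };
}
```

F1 is the harmonic mean of precision and recall, so a precision of 75.0% with a recall of 80.0% works out to roughly 77.4%.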
Methods Tested:
- Heuristic (keyword patterns)
- LLM (Anthropic API)
- Hybrid (heuristic + LLM fallback)
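One way to picture the hybrid method is a keyword pass that defers to the LLM only when no pattern matches. The sketch below is a guess at that shape, not the code in `ActionClassifier.ts`: the `EXAMPLE_PATTERNS` table and its keywords are invented for illustration, and the LLM call is injected as a plain function so that no particular API client is assumed.

```typescript
type ActionType = 'ReduceGravity' | 'ReduceMass' | 'IncreaseAlignment' | 'IncreaseThrust';

// Hypothetical keyword patterns; the project's real ACTION_PATTERNS may look quite different.
const EXAMPLE_PATTERNS: Record<ActionType, RegExp[]> = {
  ReduceGravity: [/\bquit\b/i, /\bstopped\b/i],
  ReduceMass: [/\bdelegat/i, /\bdropped\b/i],
  IncreaseAlignment: [/\bprioriti[sz]/i, /\bfocus/i],
  IncreaseThrust: [/\bshipped\b/i, /\blaunched\b/i],
};

// Heuristic pass: return every action type with at least one matching pattern.
function classifyHeuristic(content: string): ActionType[] {
  return (Object.keys(EXAMPLE_PATTERNS) as ActionType[]).filter((action) =>
    EXAMPLE_PATTERNS[action].some((pattern) => pattern.test(content)),
  );
}

// Stand-in for the LLM classifier; callers supply their own implementation.
type LlmClassifier = (content: string) => Promise<ActionType[]>;

// Hybrid pass: trust the heuristic when it fires, otherwise fall back to the LLM.
async function classifyHybrid(content: string, llm: LlmClassifier): Promise<ActionType[]> {
  const heuristic = classifyHeuristic(content);
  return heuristic.length > 0 ? heuristic : llm(content);
}
```

Keeping the LLM behind a function parameter also makes the fallback easy to stub in tests.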
Sample Output:
Overall Metrics:
Accuracy: 73.3%
Precision: 72.5%
Recall: 68.0%
F1 Score: 70.2%
Per-Action Metrics:
ReduceGravity:
Precision: 75.0%
Recall: 80.0%
F1 Score: 77.4%
Support: 5 instances
Recommendation Quality Evaluation
File: [RecommendationQualityEval.ts](./RecommendationQualityEval.ts)
Measures:
- Relevance Rate (share of recommendations rated 3-5)
- "Oh Wow" Rate (share of recommendations rated 5)
- Avg Relevance Score
- Confidence Calibration
- Action Diversity
Targets:
- Relevance Rate ≥ 65%
- "Oh Wow" Rate ≥ 30%
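The rate metrics reduce to simple counts over 1-5 ratings. The sketch below assumes a rating of 3 or higher counts as relevant and a rating of exactly 5 counts as "oh wow", following the labels in the sample output; the `summarizeRatings` name and the `RelevanceSummary` shape are hypothetical.

```typescript
// Hypothetical summary shape; rates are fractions in [0, 1].
interface RelevanceSummary {
  relevanceRate: number; // share of ratings in 3-5
  ohWowRate: number;     // share of ratings equal to 5
  avgRelevance: number;  // mean rating on the 1-5 scale
}

function summarizeRatings(ratings: number[]): RelevanceSummary {
  if (ratings.length === 0) {
    throw new Error('No ratings to summarize');
  }
  const relevant = ratings.filter((rating) => rating >= 3).length;
  const ohWow = ratings.filter((rating) => rating === 5).length;
  const total = ratings.reduce((sum, rating) => sum + rating, 0);
  return {
    relevanceRate: relevant / ratings.length,
    ohWowRate: ohWow / ratings.length,
    avgRelevance: total / ratings.length,
  };
}
```

Multiply the rates by 100 before comparing them with the percentage targets.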
Sample Output:
Relevance Metrics:
Relevance Rate (3-5): 65.0%
"Oh Wow" Rate (5): 35.0%
Avg Relevance Score: 3.45/5.0
Baseline Comparisons:
Random Policy: 2.1/5.0, 5% oh-wow
Most Frequent Action: 2.8/5.0, 12% oh-wow
RAG++: 3.45/5.0, 35% oh-wow
State-Awareness Evaluation
File: [StateAwarenessEval.ts](./StateAwarenessEval.ts)
Measures:
- Regime Consistency (similar states → similar recs)
- Regime Differentiation (different states → different recs)
- Flag Sensitivity (flags → appropriate actions)
- Phase Awareness (day/time variations)
- Explainability Score (reasoning quality)
Target: Regime Differentiation ≥ 50%
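This README does not spell out how the two regime scores are computed. One plausible reading, sketched below purely as an assumption, compares the sets of recommended actions for pairs of states using Jaccard overlap: consistency is the mean overlap for same-regime pairs, and differentiation is the mean of one minus the overlap for cross-regime pairs. The `RegimePair` shape and the function names are hypothetical.

```typescript
// Jaccard overlap of two action lists, treated as sets. Two empty sets count as identical.
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  if (setA.size === 0 && setB.size === 0) return 1;
  const shared = Array.from(setA).filter((item) => setB.has(item)).length;
  return shared / (setA.size + setB.size - shared);
}

// Hypothetical pair of recommendation sets produced for two states.
interface RegimePair {
  actionsA: string[];
  actionsB: string[];
  sameRegime: boolean;
}

function regimeScores(pairs: RegimePair[]): { consistency: number; differentiation: number } {
  const mean = (values: number[]): number =>
    values.length === 0 ? 0 : values.reduce((sum, value) => sum + value, 0) / values.length;
  const same = pairs
    .filter((pair) => pair.sameRegime)
    .map((pair) => jaccard(pair.actionsA, pair.actionsB));
  const different = pairs
    .filter((pair) => !pair.sameRegime)
    .map((pair) => 1 - jaccard(pair.actionsA, pair.actionsB));
  return { consistency: mean(same), differentiation: mean(different) };
}
```

Under this reading, generic recommendations that repeat across regimes push differentiation toward zero even when consistency is high.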
Sample Output:
Regime Awareness:
Regime Consistency: 78.5%
Regime Differentiation: 62.3%
Flag Sensitivity:
scattered:
Avg Difference: 45.2%
Action Correlations:
IncreaseAlignment +32.1%
ReduceGravity +12.5%
Command-Line Options
# Full evaluation (all metrics, larger dataset)
npx tsx src/scripts/evaluate-ragpp.ts
# Quick evaluation (smaller dataset for iteration)
npx tsx src/scripts/evaluate-ragpp.ts quick
# Run specific component only
npx tsx src/scripts/evaluate-ragpp.ts action-only
npx tsx src/scripts/evaluate-ragpp.ts rec-only
npx tsx src/scripts/evaluate-ragpp.ts state-only
# Interactive recommendation rating
npx tsx src/scripts/evaluate-ragpp.ts --interactive
# Generate sample data
npx tsx src/scripts/evaluate-ragpp.ts --generate-data
# Clean up evaluation data
npx tsx src/scripts/evaluate-ragpp.ts --cleanup
Generated Artifacts
Evaluation Report
After running, a comprehensive markdown report is generated:
Location: `./evaluation-report.md`
Sections:
1. Executive Summary (V0 criteria pass/fail)
2. Data Statistics
3. Action Classification Results
4. Recommendation Quality Results
5. State-Awareness Results
6. Baseline Comparisons
7. Key Findings
8. Recommendations for Improvement
Sample Data
When using `--generate-data`:
Created:
- 3 evaluation users ([email], etc.)
- 60-90 days of trajectory data per user
- Realistic state transitions (stuck → improving → escaping)
- 2-5 events per day with action labels
- ~150-200 life states per user
- ~180-450 life events per user
Trajectory Phases:
1. Stuck Phase (30 days): Low η, scattered focus, high pressure
2. Improvement Phase (30 days): Rising η, actions taken
3. Escaping Phase (20 days): High η, sustained progress
4. Maintaining Phase (10 days): Stable high performance
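The phase schedule above can be expressed as a small lookup from day index to phase. This is a sketch only: the phase lengths come from the list, while the `PHASE_SCHEDULE` constant and the `phaseForDay` helper are hypothetical and not taken from `SampleDataGenerator.ts`.

```typescript
type Phase = 'stuck' | 'improving' | 'escaping' | 'maintaining';

// Phase lengths in days, in order, as listed under Trajectory Phases (90 days total).
const PHASE_SCHEDULE: { phase: Phase; days: number }[] = [
  { phase: 'stuck', days: 30 },
  { phase: 'improving', days: 30 },
  { phase: 'escaping', days: 20 },
  { phase: 'maintaining', days: 10 },
];

// Map a zero-based day index to its phase; days past the schedule stay in the last phase.
function phaseForDay(day: number): Phase {
  if (day < 0) {
    throw new RangeError('day must be non-negative');
  }
  let remaining = day;
  for (const { phase, days } of PHASE_SCHEDULE) {
    if (remaining < days) return phase;
    remaining -= days;
  }
  return 'maintaining';
}
```

A generator capped at 60 days would, under this schedule, end at the boundary between the improvement and escaping phases.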
V0 Success Criteria
RAG++ v0 must meet ALL of the following:
| Criterion | Target | Measured By |
|---|---|---|
| Action Classification F1 | ≥ 70% | |
| Relevance Rate | ≥ 65% | |
| "Oh Wow" Rate | ≥ 30% | |
| Better than Random | Yes | Baseline comparison |
| Contextual Awareness | ≥ 50% | |
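A pass/fail check over these criteria might look like the sketch below. The `V0Results` shape and `checkV0Criteria` name are hypothetical. Two mappings are assumptions: "Better than Random" is read as a higher average relevance than the random-policy baseline, and "Contextual Awareness" is read as the Regime Differentiation score.

```typescript
// Hypothetical result shape. Rates and F1 are percentages (0-100); relevance scores use the 1-5 scale.
interface V0Results {
  actionF1: number;
  relevanceRate: number;
  ohWowRate: number;
  avgRelevance: number;
  randomBaselineAvgRelevance: number;
  regimeDifferentiation: number;
}

function checkV0Criteria(results: V0Results): { passed: boolean; failures: string[] } {
  const checks: [string, boolean][] = [
    ['Action Classification F1', results.actionF1 >= 70],
    ['Relevance Rate', results.relevanceRate >= 65],
    ['"Oh Wow" Rate', results.ohWowRate >= 30],
    ['Better than Random', results.avgRelevance > results.randomBaselineAvgRelevance],
    ['Contextual Awareness', results.regimeDifferentiation >= 50],
  ];
  // All criteria must hold; collect the names of any that fail for the report.
  const failures = checks.filter(([, ok]) => !ok).map(([name]) => name);
  return { passed: failures.length === 0, failures };
}
```

Returning the failed criterion names alongside the boolean keeps the Executive Summary section of the report easy to fill in.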
Architecture
evaluation/
├── types.ts # Evaluation interfaces
├── ActionClassificationEval.ts # Precision/recall/F1 testing
├── RecommendationQualityEval.ts # Relevance rating
├── StateAwarenessEval.ts # Context sensitivity testing
├── SampleDataGenerator.ts # Realistic trajectory creation
├── ReportGenerator.ts # Markdown report generation
└── README.md # This file
scripts/
└── evaluate-ragpp.ts # Main evaluation runner
Adding Custom Labeled Events
To improve action classification accuracy, add more labeled events:
Location: [ActionClassificationEval.ts](./ActionClassificationEval.ts:243)
export function generateSampleLabeledEvents(): LabeledEvent[] {
  return [
    {
      id: '31',
      content: 'Your custom event description here',
      trueActionTypes: ['ReduceGravity', 'IncreaseThrust'], // Ground truth
    },
    // ... add more
  ];
}
Extending Evaluations
Add New Metric
1. Update `types.ts` with new metric interface
2. Create new evaluation module (e.g., `TemporalConsistencyEval.ts`)
3. Add to `evaluate-ragpp.ts` main runner
4. Update report generator to display results
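To make the four steps concrete, here is a minimal skeleton for the `TemporalConsistencyEval.ts` example named in step 2. Everything in it is hypothetical: the metric definition (how often the top recommended action stays the same between consecutive days), the interface, and the formatting helper are placeholders for whatever the new metric actually measures.

```typescript
// Step 1: interface that would be added to types.ts (hypothetical).
interface TemporalConsistencyMetrics {
  consistencyRate: number; // fraction of consecutive pairs with the same top action
  comparedPairs: number;   // number of consecutive pairs compared
}

// Step 2: evaluation function that would live in TemporalConsistencyEval.ts (hypothetical).
function evaluateTemporalConsistency(dailyTopActions: string[]): TemporalConsistencyMetrics {
  let matches = 0;
  for (let i = 1; i < dailyTopActions.length; i++) {
    if (dailyTopActions[i] === dailyTopActions[i - 1]) matches++;
  }
  const comparedPairs = Math.max(dailyTopActions.length - 1, 0);
  return {
    consistencyRate: comparedPairs === 0 ? 0 : matches / comparedPairs,
    comparedPairs,
  };
}

// Step 4: one-line formatter the report generator could call (hypothetical).
function formatTemporalConsistency(metrics: TemporalConsistencyMetrics): string {
  const percent = (metrics.consistencyRate * 100).toFixed(1);
  return `Temporal Consistency: ${percent}% over ${metrics.comparedPairs} pairs`;
}
```

Step 3 then amounts to calling the evaluation function from the runner and passing its result to the formatter.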
Add New Baseline
In [RecommendationQualityEval.ts](./RecommendationQualityEval.ts):
export async function generateSemanticRAGBaseline(
  userId: string,
  numSamples: number
): Promise<{ avgRelevance: number; ohWowRate: number }> {
  // Implement semantic embedding retrieval
  // Compare to RAG++ state-based retrieval
  throw new Error('generateSemanticRAGBaseline is not implemented yet');
}
Troubleshooting
"Insufficient data" Error
Problem: Fewer than 2 life states in the database
Solution:
npx tsx src/scripts/evaluate-ragpp.ts --generate-data
Low Action Classification F1
Causes:
- Insufficient keyword patterns
- Events don't match expected language
- Need LLM classification for complex cases
Solution:
1. Review failed classifications in output
2. Add patterns to `ACTION_PATTERNS` in ActionClassifier.ts
3. Enable LLM mode: Set `useLLM: true` in config
Low "Oh Wow" Rate
Causes:
- Similar past transitions not insightful
- State similarity function too broad
- Confidence threshold too low
Solution:
1. Increase `minConfidence` threshold (e.g., 0.3 → 0.5)
2. Adjust state similarity weights in StateEstimator.ts
3. Require more supporting transitions
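The first and third steps amount to filtering candidate recommendations before they are shown. The sketch below assumes each recommendation carries a confidence score and a count of supporting transitions. The `Recommendation` shape, the `filterRecommendations` name, and the default `minSupport` of 3 are illustrative assumptions; only `minConfidence` and the 0.5 example value come from this README.

```typescript
// Hypothetical recommendation shape for illustration.
interface Recommendation {
  actionType: string;
  confidence: number;            // 0-1 confidence score
  supportingTransitions: number; // number of similar past transitions backing the suggestion
}

// Keep only recommendations that clear both thresholds.
function filterRecommendations(
  recommendations: Recommendation[],
  minConfidence = 0.5,
  minSupport = 3,
): Recommendation[] {
  return recommendations.filter(
    (rec) => rec.confidence >= minConfidence && rec.supportingTransitions >= minSupport,
  );
}
```

Stricter thresholds trade coverage for quality, so it is worth checking how many states end up with no recommendation at all after filtering.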
Regime Differentiation Below 50%
Causes:
- State similarity doesn't distinguish regimes well
- Not enough regime variety in data
- Recommendations too generic
Solution:
1. Increase `w_r` (regime weight) in state similarity
2. Generate data with more regime transitions
3. Add regime-specific action patterns
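To see why raising `w_r` helps, consider a toy weighted similarity in which a regime match and flag overlap are combined by a normalized weighted sum. This is an assumed form for illustration only: the actual similarity function in `StateEstimator.ts` is not shown in this README, and the `StateSketch` shape and the `w_f` flag weight are hypothetical.

```typescript
// Hypothetical, simplified state: a regime label plus a list of flags.
interface StateSketch {
  regime: string;
  flags: string[];
}

// w_r weights regime agreement; w_f (hypothetical) weights flag overlap.
interface SimilarityWeights {
  w_r: number;
  w_f: number;
}

function stateSimilarity(a: StateSketch, b: StateSketch, weights: SimilarityWeights): number {
  const totalWeight = weights.w_r + weights.w_f;
  if (totalWeight <= 0) {
    throw new RangeError('weights must sum to a positive value');
  }
  const regimeMatch = a.regime === b.regime ? 1 : 0;
  // Jaccard overlap of the two flag sets; two empty sets count as full overlap.
  const flagsA = new Set(a.flags);
  const flagsB = new Set(b.flags);
  const shared = Array.from(flagsA).filter((flag) => flagsB.has(flag)).length;
  const unionSize = flagsA.size + flagsB.size - shared;
  const flagOverlap = unionSize === 0 ? 1 : shared / unionSize;
  return (weights.w_r * regimeMatch + weights.w_f * flagOverlap) / totalWeight;
}
```

With identical flags but different regimes, equal weights give a similarity of 0.5, while tripling `w_r` drops it to 0.25, so cross-regime states are less likely to be retrieved as neighbors.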
Related Documentation
- RAG++ Architecture: [../../docs/guides/RAG_PLUS_PLUS.md](../../docs/guides/RAG_PLUS_PLUS.md)
- Research Paper: [../../docs/research/RAG_PLUS_PLUS_PAPER.md](../../docs/research/RAG_PLUS_PLUS_PAPER.md)
- Service APIs:
  - [StateEstimator.ts](../services/StateEstimator.ts)
  - [ActionClassifier.ts](../services/ActionClassifier.ts)
  - [TransitionBuilder.ts](../services/TransitionBuilder.ts)
  - [TransitionRetrieval.ts](../services/TransitionRetrieval.ts)
  - [PolicySuggester.ts](../services/PolicySuggester.ts)
Contributing
To add new evaluation metrics:
1. Define metric interface in `types.ts`
2. Implement evaluation function in new file
3. Add to main runner (`evaluate-ragpp.ts`)
4. Update report generator
5. Document in this README
6. Add test cases with sample data
Citation
If using this evaluation framework in research:
@misc{ragpp2025,
  title={RAG++: State-Based Retrieval for Life Trajectory Optimization},
  author={Diomande, Mo and Claude},
  year={2025},
  note={TrajectoryOS Project}
}
---
Last Updated: December 2025
Version: v0.1.0
Status: Production-ready evaluation framework
Source Anchor
Comp-Core/backend/cc-trajectory/services/trajectory-core/src/evaluation/README.md