Grand Diomande Research · Full HTML Reader

RAG++ Evaluation Framework


Complete evaluation suite for RAG++ v0 with action classification, recommendation quality, and state-awareness testing.

Quick Start

1. Generate Sample Data

bash
cd services/trajectory-core
npx tsx src/scripts/evaluate-ragpp.ts --generate-data

This creates 3 evaluation users with 60-90 days of realistic trajectory data each.

2. Run Full Evaluation

bash
npx tsx src/scripts/evaluate-ragpp.ts

Runs all three evaluation components:
- Action Classification (30-100 labeled events)
- Recommendation Quality (5-15 states)
- State-Awareness (regime consistency, flag sensitivity)

3. Run Quick Evaluation

bash
npx tsx src/scripts/evaluate-ragpp.ts quick

Smaller test set for faster iteration (30 events, 5 states, 3 regime pairs).

4. Interactive Mode

bash
npx tsx src/scripts/evaluate-ragpp.ts --interactive

Rate recommendations manually on a 1-5 scale instead of using simulated ratings.

Evaluation Components

Action Classification Evaluation

File: [ActionClassificationEval.ts](./ActionClassificationEval.ts)

Measures: Precision, Recall, F1 for each action type
- ReduceGravity
- ReduceMass
- IncreaseAlignment
- IncreaseThrust

Target: F1 ≥ 70%

Methods Tested:
- Heuristic (keyword patterns)
- LLM (Anthropic API)
- Hybrid (heuristic + LLM fallback)
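
The hybrid method listed above can be pictured as "try keyword patterns first, ask the LLM only when nothing matches". The following is a minimal sketch under that assumption. `SAMPLE_PATTERNS`, `classifyHeuristic`, `classifyHybrid`, and the placeholder keywords are hypothetical and are not taken from ActionClassifier.ts; the LLM is passed in as a plain function so that no real API call is implied.

```typescript
// Hypothetical sketch of a heuristic-first classifier with an LLM fallback.
type ActionType = 'ReduceGravity' | 'ReduceMass' | 'IncreaseAlignment' | 'IncreaseThrust';

// Placeholder keywords chosen only for illustration (assumption).
const SAMPLE_PATTERNS: Record<ActionType, RegExp[]> = {
  ReduceGravity: [/\bpaid off\b/i, /\bcancell?ed\b/i],
  ReduceMass: [/\bdeclutter/i, /\bdelegated\b/i],
  IncreaseAlignment: [/\bprioriti[sz]ed\b/i, /\bfocus(ed)?\b/i],
  IncreaseThrust: [/\bshipped\b/i, /\blaunched\b/i],
};

// Returns every action type whose patterns match the event text.
function classifyHeuristic(content: string): ActionType[] {
  return (Object.keys(SAMPLE_PATTERNS) as ActionType[]).filter((action) =>
    SAMPLE_PATTERNS[action].some((pattern) => pattern.test(content)),
  );
}

// Stand-in for an LLM call; the real request shape is not specified here.
type LlmClassifier = (content: string) => Promise<ActionType[]>;

// Uses the heuristic result when it is non-empty, otherwise defers to the LLM.
async function classifyHybrid(content: string, llm: LlmClassifier): Promise<ActionType[]> {
  const heuristic = classifyHeuristic(content);
  return heuristic.length > 0 ? heuristic : llm(content);
}

// Usage: a stub LLM that always answers ReduceMass.
const stubLlm = async (): Promise<ActionType[]> => ['ReduceMass'];
classifyHybrid('Quiet day with no keywords', stubLlm).then((actions) => console.log(actions));
```

Injecting the LLM as a function keeps the fallback path testable without network access; the real evaluator may wire this differently.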

Sample Output:

Overall Metrics:
  Accuracy:  73.3%
  Precision: 72.5%
  Recall:    68.0%
  F1 Score:  70.2%

Per-Action Metrics:
  ReduceGravity:
    Precision: 75.0%
    Recall:    80.0%
    F1 Score:  77.4%
    Support:   5 instances
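
F1 is the harmonic mean of precision and recall, so the ReduceGravity figures above (75.0% and 80.0%) give roughly 77.4%. The sketch below shows one way per-action metrics could be computed for multi-label events. `LabeledPrediction`, `f1Score`, and `perActionMetrics` are hypothetical names and not the API of ActionClassificationEval.ts; support is assumed to mean the number of events whose ground truth contains the action.

```typescript
// Hypothetical sketch of per-action precision, recall, and F1.
type ActionType = 'ReduceGravity' | 'ReduceMass' | 'IncreaseAlignment' | 'IncreaseThrust';

interface LabeledPrediction {
  trueActionTypes: ActionType[]; // ground truth labels
  predicted: ActionType[];       // classifier output
}

interface ActionMetrics {
  precision: number;
  recall: number;
  f1: number;
  support: number;
}

// Harmonic mean of precision and recall; 0 when both are 0.
function f1Score(precision: number, recall: number): number {
  return precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
}

// Counts true positives, false positives, and false negatives for one action.
function perActionMetrics(rows: LabeledPrediction[], action: ActionType): ActionMetrics {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  for (const row of rows) {
    const isTrue = row.trueActionTypes.includes(action);
    const isPredicted = row.predicted.includes(action);
    if (isTrue && isPredicted) tp++;
    else if (!isTrue && isPredicted) fp++;
    else if (isTrue && !isPredicted) fn++;
  }
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  return { precision, recall, f1: f1Score(precision, recall), support: tp + fn };
}

console.log(f1Score(0.75, 0.8).toFixed(3)); // "0.774"
```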

Recommendation Quality Evaluation

File: [RecommendationQualityEval.ts](./RecommendationQualityEval.ts)

Measures:
- Relevance Rate (share of ratings in the 3-5 range)
- "Oh Wow" Rate (share of ratings equal to 5)
- Avg Relevance Score
- Confidence Calibration
- Action Diversity

Targets:
- Relevance Rate ≥ 65%
- "Oh Wow" Rate ≥ 30%

Sample Output:

Relevance Metrics:
  Relevance Rate (3-5):  65.0%
  "Oh Wow" Rate (5):     35.0%
  Avg Relevance Score:   3.45/5.0

Baseline Comparisons:
  Random Policy:         2.1/5.0, 5% oh-wow
  Most Frequent Action:  2.8/5.0, 12% oh-wow
  RAG++:                 3.45/5.0, 35% oh-wow
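
Going by the labels in the sample output, the three relevance figures can be derived from a list of 1-5 ratings: the share rated 3-5, the share rated 5, and the mean. A minimal sketch follows; `summarizeRatings` and `RatingSummary` are hypothetical names rather than exports of RecommendationQualityEval.ts, and rates are returned as fractions in [0, 1].

```typescript
// Hypothetical sketch: summarize 1-5 relevance ratings.
interface RatingSummary {
  relevanceRate: number; // fraction of ratings that are 3, 4, or 5
  ohWowRate: number;     // fraction of ratings equal to 5
  avgRelevance: number;  // mean rating on the 1-5 scale
}

function summarizeRatings(ratings: number[]): RatingSummary {
  for (const rating of ratings) {
    if (!Number.isInteger(rating) || rating < 1 || rating > 5) {
      throw new RangeError(`Rating out of range: ${rating}`);
    }
  }
  if (ratings.length === 0) {
    return { relevanceRate: 0, ohWowRate: 0, avgRelevance: 0 };
  }
  const count = ratings.length;
  const relevant = ratings.filter((rating) => rating >= 3).length;
  const ohWow = ratings.filter((rating) => rating === 5).length;
  const total = ratings.reduce((sum, rating) => sum + rating, 0);
  return {
    relevanceRate: relevant / count,
    ohWowRate: ohWow / count,
    avgRelevance: total / count,
  };
}

// Ten illustrative ratings: 7 are relevant, 3 are "oh wow", and the mean is 3.4.
console.log(summarizeRatings([5, 4, 3, 2, 1, 5, 3, 2, 4, 5]));
```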

State-Awareness Evaluation

File: [StateAwarenessEval.ts](./StateAwarenessEval.ts)

Measures:
- Regime Consistency (similar states → similar recs)
- Regime Differentiation (different states → different recs)
- Flag Sensitivity (flags → appropriate actions)
- Phase Awareness (day/time variations)
- Explainability Score (reasoning quality)

Target: Regime Differentiation ≥ 50%

Sample Output:

Regime Awareness:
  Regime Consistency:      78.5%
  Regime Differentiation:  62.3%

Flag Sensitivity:
  scattered:
    Avg Difference:        45.2%
    Action Correlations:
      IncreaseAlignment    +32.1%
      ReduceGravity        +12.5%
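
One way to read the two regime metrics is as pairwise comparisons of recommended action sets: consistency averages the overlap for pairs of states in the same regime, and differentiation averages the non-overlap for pairs in different regimes. The sketch below uses Jaccard overlap for this. That choice, along with the `StateRecs` shape, the regime labels, and `regimeMetrics`, is an assumption for illustration and may differ from what StateAwarenessEval.ts does.

```typescript
// Hypothetical sketch: regime consistency and differentiation via Jaccard overlap.
interface StateRecs {
  regime: string;    // assumed regime label, e.g. 'stuck'
  actions: string[]; // recommended action types for this state
}

// Removes duplicate entries while keeping first occurrences.
function unique(items: string[]): string[] {
  return items.filter((item, index) => items.indexOf(item) === index);
}

// Jaccard overlap of two action lists; two empty lists count as identical.
function jaccard(a: string[], b: string[]): number {
  const left = unique(a);
  const right = unique(b);
  if (left.length === 0 && right.length === 0) return 1;
  const shared = left.filter((item) => right.includes(item)).length;
  return shared / (left.length + right.length - shared);
}

function regimeMetrics(samples: StateRecs[]): { consistency: number; differentiation: number } {
  let sameSum = 0;
  let sameCount = 0;
  let diffSum = 0;
  let diffCount = 0;
  for (let i = 0; i < samples.length; i++) {
    for (let j = i + 1; j < samples.length; j++) {
      const similarity = jaccard(samples[i].actions, samples[j].actions);
      if (samples[i].regime === samples[j].regime) {
        sameSum += similarity;
        sameCount++;
      } else {
        diffSum += 1 - similarity;
        diffCount++;
      }
    }
  }
  return {
    consistency: sameCount === 0 ? 0 : sameSum / sameCount,
    differentiation: diffCount === 0 ? 0 : diffSum / diffCount,
  };
}

console.log(regimeMetrics([
  { regime: 'stuck', actions: ['ReduceGravity', 'ReduceMass'] },
  { regime: 'stuck', actions: ['ReduceGravity', 'IncreaseAlignment'] },
  { regime: 'escaping', actions: ['IncreaseThrust'] },
]));
```

In this toy input the two stuck states share one of three distinct actions, so consistency is about 0.33, while neither overlaps with the escaping state, so differentiation is 1.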

Command-Line Options

bash
# Full evaluation (all metrics, larger dataset)
npx tsx src/scripts/evaluate-ragpp.ts

# Quick evaluation (smaller dataset for iteration)
npx tsx src/scripts/evaluate-ragpp.ts quick

# Run specific component only
npx tsx src/scripts/evaluate-ragpp.ts action-only
npx tsx src/scripts/evaluate-ragpp.ts rec-only
npx tsx src/scripts/evaluate-ragpp.ts state-only

# Interactive recommendation rating
npx tsx src/scripts/evaluate-ragpp.ts --interactive

# Generate sample data
npx tsx src/scripts/evaluate-ragpp.ts --generate-data

# Clean up evaluation data
npx tsx src/scripts/evaluate-ragpp.ts --cleanup

Generated Artifacts

Evaluation Report

After running, a comprehensive markdown report is generated:

Location: `./evaluation-report.md`

Sections:
1. Executive Summary (V0 criteria pass/fail)
2. Data Statistics
3. Action Classification Results
4. Recommendation Quality Results
5. State-Awareness Results
6. Baseline Comparisons
7. Key Findings
8. Recommendations for Improvement

Sample Data

When using `--generate-data`:

Created:
- 3 evaluation users ([email], etc.)
- 60-90 days of trajectory data per user
- Realistic state transitions (stuck → improving → escaping)
- 2-5 events per day with action labels
- ~150-200 life states per user
- ~180-450 life events per user

Trajectory Phases:
1. Stuck Phase (30 days): Low η, scattered focus, high pressure
2. Improvement Phase (30 days): Rising η, actions taken
3. Escaping Phase (20 days): High η, sustained progress
4. Maintaining Phase (10 days): Stable high performance
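
The four phases above add up to a 90-day schedule. A generator could map each day index to its phase along these lines. This is a sketch only: `PHASE_SCHEDULE`, `phaseForDay`, and the lowercase phase labels are hypothetical and are not taken from SampleDataGenerator.ts.

```typescript
// Hypothetical sketch: map a zero-based day index to a trajectory phase.
type Phase = 'stuck' | 'improving' | 'escaping' | 'maintaining';

// Phase lengths as listed in this README (30 + 30 + 20 + 10 = 90 days).
const PHASE_SCHEDULE: { phase: Phase; days: number }[] = [
  { phase: 'stuck', days: 30 },
  { phase: 'improving', days: 30 },
  { phase: 'escaping', days: 20 },
  { phase: 'maintaining', days: 10 },
];

function phaseForDay(day: number): Phase {
  if (!Number.isInteger(day) || day < 0) {
    throw new RangeError(`Day must be a non-negative integer, got ${day}`);
  }
  let phaseEnd = 0;
  for (const entry of PHASE_SCHEDULE) {
    phaseEnd += entry.days;
    if (day < phaseEnd) return entry.phase;
  }
  // Assumption: days past the schedule stay in the final phase.
  return 'maintaining';
}

console.log(phaseForDay(0), phaseForDay(30), phaseForDay(60), phaseForDay(80));
// stuck improving escaping maintaining
```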

V0 Success Criteria

RAG++ v0 must meet ALL of the following:

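| Criterion | Target | Measured By |
|---|---|---|
| Action Classification F1 | ≥ 70% | F1 on labeled events |
| Relevance Rate | ≥ 65% | Share of ratings 3-5 |
| "Oh Wow" Rate | ≥ 30% | Share of ratings of 5 |
| Better than Random | Yes | Baseline comparison |
| Contextual Awareness | ≥ 50% | Regime Differentiation |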
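
Since every criterion must hold, the pass/fail decision reduces to a conjunction of threshold checks. The sketch below illustrates this with the sample figures quoted earlier in this README. `V0Metrics` and `checkV0Criteria` are hypothetical names, "Better than Random" is assumed to compare average relevance scores, and "Contextual Awareness" is assumed to be backed by Regime Differentiation.

```typescript
// Hypothetical sketch: evaluate the V0 success criteria as threshold checks.
interface V0Metrics {
  actionF1: number;              // percent, 0-100
  relevanceRate: number;         // percent, 0-100
  ohWowRate: number;             // percent, 0-100
  avgRelevance: number;          // RAG++ average rating, 1-5
  randomAvgRelevance: number;    // random-policy average rating, 1-5
  regimeDifferentiation: number; // percent; assumed proxy for contextual awareness
}

interface CriterionResult {
  criterion: string;
  passed: boolean;
}

function checkV0Criteria(m: V0Metrics): { passed: boolean; results: CriterionResult[] } {
  const results: CriterionResult[] = [
    { criterion: 'Action Classification F1', passed: m.actionF1 >= 70 },
    { criterion: 'Relevance Rate', passed: m.relevanceRate >= 65 },
    { criterion: '"Oh Wow" Rate', passed: m.ohWowRate >= 30 },
    { criterion: 'Better than Random', passed: m.avgRelevance > m.randomAvgRelevance },
    { criterion: 'Contextual Awareness', passed: m.regimeDifferentiation >= 50 },
  ];
  // V0 passes only if every individual criterion passes.
  return { passed: results.every((result) => result.passed), results };
}

// Figures taken from the sample outputs in this README.
const sampleMetrics: V0Metrics = {
  actionF1: 70.2,
  relevanceRate: 65.0,
  ohWowRate: 35.0,
  avgRelevance: 3.45,
  randomAvgRelevance: 2.1,
  regimeDifferentiation: 62.3,
};
console.log(checkV0Criteria(sampleMetrics).passed); // true
```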

Architecture

evaluation/
├── types.ts                      # Evaluation interfaces
├── ActionClassificationEval.ts   # Precision/recall/F1 testing
├── RecommendationQualityEval.ts  # Relevance rating
├── StateAwarenessEval.ts         # Context sensitivity testing
├── SampleDataGenerator.ts        # Realistic trajectory creation
├── ReportGenerator.ts            # Markdown report generation
└── README.md                     # This file

scripts/
└── evaluate-ragpp.ts             # Main evaluation runner

Adding Custom Labeled Events

To improve action classification accuracy, add more labeled events:

Location: [ActionClassificationEval.ts](./ActionClassificationEval.ts:243)

typescript
export function generateSampleLabeledEvents(): LabeledEvent[] {
  return [
    {
      id: '31',
      content: 'Your custom event description here',
      trueActionTypes: ['ReduceGravity', 'IncreaseThrust'], // Ground truth
    },
    // ... add more
  ];
}

Extending Evaluations

Add New Metric

1. Update `types.ts` with new metric interface
2. Create new evaluation module (e.g., `TemporalConsistencyEval.ts`)
3. Add to `evaluate-ragpp.ts` main runner
4. Update report generator to display results

Add New Baseline

In [RecommendationQualityEval.ts](./RecommendationQualityEval.ts):

typescript
export async function generateSemanticRAGBaseline(
  userId: string,
  numSamples: number
): Promise<{ avgRelevance: number; ohWowRate: number }> {
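  // TODO: implement semantic embedding retrieval for `userId` over `numSamples` states,
  // then compare the result to RAG++ state-based retrieval.
  throw new Error('generateSemanticRAGBaseline is not implemented yet');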
}

Troubleshooting

"Insufficient data" Error

Problem: Fewer than 2 life states in the database

Solution:

bash
npx tsx src/scripts/evaluate-ragpp.ts --generate-data

Low Action Classification F1

Causes:
- Too few keyword patterns
- Event wording doesn't match the expected patterns
- Complex cases that need LLM classification

Solution:
1. Review failed classifications in output
2. Add patterns to `ACTION_PATTERNS` in ActionClassifier.ts
3. Enable LLM mode: Set `useLLM: true` in config

Low "Oh Wow" Rate

Causes:
- Similar past transitions are not insightful
- State similarity function is too broad
- Confidence threshold is too low

Solution:
1. Increase `minConfidence` threshold (e.g., 0.3 → 0.5)
2. Adjust state similarity weights in StateEstimator.ts
3. Require more supporting transitions
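
The first and third steps amount to filtering candidate recommendations more strictly. A minimal sketch, assuming each recommendation carries a confidence score and a count of supporting transitions: only `minConfidence` comes from this README, while `Recommendation`, `supportingTransitions`, `minSupport`, and `filterRecommendations` are hypothetical names.

```typescript
// Hypothetical sketch: tighten recommendation filtering.
interface Recommendation {
  action: string;
  confidence: number;            // assumed to lie in [0, 1]
  supportingTransitions: number; // past transitions backing this recommendation
}

interface FilterOptions {
  minConfidence: number;
  minSupport: number; // hypothetical knob for "more supporting transitions"
}

// Keeps only recommendations that clear both thresholds.
function filterRecommendations(recs: Recommendation[], options: FilterOptions): Recommendation[] {
  return recs.filter(
    (rec) =>
      rec.confidence >= options.minConfidence &&
      rec.supportingTransitions >= options.minSupport,
  );
}

const candidates: Recommendation[] = [
  { action: 'ReduceGravity', confidence: 0.35, supportingTransitions: 1 },
  { action: 'IncreaseAlignment', confidence: 0.55, supportingTransitions: 4 },
  { action: 'IncreaseThrust', confidence: 0.8, supportingTransitions: 2 },
];

// Loose settings keep all three; stricter settings drop the weakest candidate.
console.log(filterRecommendations(candidates, { minConfidence: 0.3, minSupport: 1 }).length); // 3
console.log(filterRecommendations(candidates, { minConfidence: 0.5, minSupport: 2 }).length); // 2
```

Raising both thresholds trades coverage for quality: fewer recommendations survive, so the relevance rate should be rechecked after the change.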

Regime Differentiation Below 50%

Causes:
- State similarity doesn't distinguish regimes well
- Not enough regime variety in data
- Recommendations too generic

Solution:
1. Increase `w_r` (regime weight) in state similarity
2. Generate data with more regime transitions
3. Add regime-specific action patterns
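
To see why a larger `w_r` helps, consider a weighted similarity in which a regime match is one of several terms. The sketch below is an assumption about the general shape of such a function, not the formula in StateEstimator.ts: only `w_r` is named in this README, while `w_f`, `w_e`, the `StateSketch` fields, and the [0, 1] range for η are hypothetical.

```typescript
// Hypothetical sketch: weighted state similarity with a regime term.
interface StateSketch {
  regime: string;
  flags: string[];
  eta: number; // assumed to lie in [0, 1]
}

interface SimilarityWeights {
  w_r: number; // regime weight (named in this README)
  w_f: number; // flag weight (hypothetical)
  w_e: number; // eta weight (hypothetical)
}

// Share of distinct flags that both states have; no flags at all counts as a match.
function flagOverlap(a: string[], b: string[]): number {
  const union = a.concat(b).filter((flag, index, all) => all.indexOf(flag) === index);
  if (union.length === 0) return 1;
  const shared = union.filter((flag) => a.includes(flag) && b.includes(flag)).length;
  return shared / union.length;
}

// Weighted average of regime match, flag overlap, and eta closeness.
function stateSimilarity(a: StateSketch, b: StateSketch, w: SimilarityWeights): number {
  const totalWeight = w.w_r + w.w_f + w.w_e;
  if (totalWeight <= 0) throw new RangeError('Weights must sum to a positive value');
  const regimeMatch = a.regime === b.regime ? 1 : 0;
  const etaCloseness = 1 - Math.min(1, Math.abs(a.eta - b.eta));
  const weighted =
    w.w_r * regimeMatch + w.w_f * flagOverlap(a.flags, b.flags) + w.w_e * etaCloseness;
  return weighted / totalWeight;
}

const stuckState: StateSketch = { regime: 'stuck', flags: ['scattered'], eta: 0.2 };
const improvingState: StateSketch = { regime: 'improving', flags: ['scattered'], eta: 0.2 };

// Same flags and eta, different regimes: a larger w_r pushes the states apart.
console.log(stateSimilarity(stuckState, improvingState, { w_r: 1, w_f: 1, w_e: 1 })); // about 0.667
console.log(stateSimilarity(stuckState, improvingState, { w_r: 4, w_f: 1, w_e: 1 })); // about 0.333
```

With a higher regime weight, states from different regimes retrieve fewer of each other's transitions, which is what should lift Regime Differentiation.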

Related Documentation

  • RAG++ Architecture: [../docs/guides/RAG_PLUS_PLUS.md](../../docs/guides/RAG_PLUS_PLUS.md)
  • Research Paper: [../docs/research/RAG_PLUS_PLUS_PAPER.md](../../docs/research/RAG_PLUS_PLUS_PAPER.md)
  • Service APIs:
    • [StateEstimator.ts](../services/StateEstimator.ts)
    • [ActionClassifier.ts](../services/ActionClassifier.ts)
    • [TransitionBuilder.ts](../services/TransitionBuilder.ts)
    • [TransitionRetrieval.ts](../services/TransitionRetrieval.ts)
    • [PolicySuggester.ts](../services/PolicySuggester.ts)

Contributing

To add new evaluation metrics:

1. Define metric interface in `types.ts`
2. Implement evaluation function in new file
3. Add to main runner (`evaluate-ragpp.ts`)
4. Update report generator
5. Document in this README
6. Add test cases with sample data

Citation

If using this evaluation framework in research:

bibtex
@misc{ragpp2025,
  title={RAG++: State-Based Retrieval for Life Trajectory Optimization},
  author={Diomande, Mo and Claude},
  year={2025},
  note={TrajectoryOS Project}
}

---

Last Updated: December 2025
Version: v0.1.0
Status: Production-ready evaluation framework

Source Anchor

Comp-Core/backend/cc-trajectory/services/trajectory-core/src/evaluation/README.md
