experiment · experiment writeup candidate · score 32

RAG++ v0 Evaluation Framework - Complete Implementation

A comprehensive evaluation framework has been built to measure the real-world performance of RAG++ v0 across three critical dimensions: action classification accuracy, recommendation quality, and state-awareness.


Extracted abstract or opening context

1. **[types.ts](src/evaluation/types.ts)** (134 lines)
   - Complete TypeScript interfaces for all evaluation metrics
   - ActionClassificationMetrics, RecommendationQualityMetrics, StateAwarenessMetrics
   - RAGPPEvaluationReport with V0 success criteria
   - EvaluationConfig for customization

2. **[ActionClassificationEval.ts](src/evaluation/ActionClassificationEval.ts)** (415 lines)
   - Measures precision, recall, F1 for 4 action types
   - Supports heuristic, LLM, and hybrid methods
   - Includes 30 hand-labeled sample events for testing
   - Multi-label confusion matrix computation
   - **Target**: F1 ≥ 70%

3. **[RecommendationQualityEval.ts](src/evaluation/RecommendationQualityEval.ts)** (508 lines)
   - Interactive and simulated user feedback modes
   - 1-5 relevance scoring ("oh wow" = 5)
   - Confidence calibration measurement
   - Baseline generators (random policy, most-frequent action)
   - Regime-specific feedback tracking
   - **Targets**: 65% relevance rate, 30% "oh wow" rate

4. **[StateAwarenessEval.ts](src/evaluation/StateAwarenessEval.ts)** (568 lines)
   - Regime consistency (same state → same recs)
   - Regime differentiation (different state → different recs)
   - Flag sensitivity (scattered/heavy/pressured → appropriate actions)
   - Phase awareness (day-of-week, time-of-day variations)
   - Explainability scoring (reasoning quality)
   - **Target**: Regime differentiation ≥ 50%
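The action classification metrics can be illustrated with a minimal sketch of per-label precision, recall, and F1 over multi-label events. This is not the code in ActionClassificationEval.ts. The four label names, the `LabeledEvent` shape, and the choice of macro-averaging for the 70% target are all assumptions made for illustration, since the summary names neither the action types nor the averaging method.

```typescript
// Hypothetical label set: the source says "4 action types" but does not name them.
type ActionLabel = "create" | "review" | "plan" | "communicate";
const LABELS: ActionLabel[] = ["create", "review", "plan", "communicate"];

// Assumed event shape: each event may carry several gold and predicted labels.
interface LabeledEvent {
  gold: ActionLabel[];
  predicted: ActionLabel[];
}

interface PRF {
  precision: number;
  recall: number;
  f1: number;
  support: number; // number of events whose gold labels include this label
}

// Division that returns 0 instead of NaN when the denominator is 0.
function safeDiv(n: number, d: number): number {
  return d === 0 ? 0 : n / d;
}

// One-vs-rest counts per label, which is what a multi-label confusion matrix reduces to.
function perLabelMetrics(events: LabeledEvent[]): Record<ActionLabel, PRF> {
  const out = {} as Record<ActionLabel, PRF>;
  for (const label of LABELS) {
    let tp = 0, fp = 0, fn = 0;
    for (const e of events) {
      const inGold = e.gold.includes(label);
      const inPred = e.predicted.includes(label);
      if (inGold && inPred) tp++;
      else if (!inGold && inPred) fp++;
      else if (inGold && !inPred) fn++;
    }
    const precision = safeDiv(tp, tp + fp);
    const recall = safeDiv(tp, tp + fn);
    const f1 = safeDiv(2 * precision * recall, precision + recall);
    out[label] = { precision, recall, f1, support: tp + fn };
  }
  return out;
}

// Unweighted mean of per-label F1 (macro average; an assumption, see lead-in).
function macroF1(metrics: Record<ActionLabel, PRF>): number {
  const values = LABELS.map((l) => metrics[l].f1);
  return values.reduce((a, b) => a + b, 0) / values.length;
}

const F1_TARGET = 0.7; // "F1 ≥ 70%" from the summary

// Made-up events for illustration only; not the 30 hand-labeled samples.
const sampleEvents: LabeledEvent[] = [
  { gold: ["create"], predicted: ["create"] },
  { gold: ["review", "plan"], predicted: ["review"] },
  { gold: ["communicate"], predicted: ["communicate", "plan"] },
  { gold: ["plan"], predicted: ["plan"] },
];

const actionMetrics = perLabelMetrics(sampleEvents);
const actionMacroF1 = macroF1(actionMetrics);
console.log(`macro F1 = ${actionMacroF1.toFixed(3)}, meets target: ${actionMacroF1 >= F1_TARGET}`);
```

Macro-averaging weights every action type equally, so a rare label that is classified poorly pulls the score down as much as a common one. A micro-averaged or support-weighted F1 would be an equally plausible reading of the target.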
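For the recommendation quality targets, the sketch below shows one way the relevance rate, the "oh wow" rate, and a calibration gap could be computed from 1-5 feedback scores. Several parts are assumptions rather than facts from RecommendationQualityEval.ts: the `Feedback` shape, the threshold that counts a score of 4 or higher as "relevant", and the use of expected calibration error (ECE) with equal-width bins as the calibration measure.

```typescript
// Assumed feedback record: a 1-5 score plus the confidence the system reported.
interface Feedback {
  score: 1 | 2 | 3 | 4 | 5; // 5 = "oh wow" per the summary
  confidence: number;       // assumed to lie in [0, 1]
}

const RELEVANT_MIN_SCORE = 4;   // assumption: the summary does not define "relevant"
const RELEVANCE_TARGET = 0.65;  // "65% relevance rate"
const OH_WOW_TARGET = 0.3;      // "30% 'oh wow' rate"

// Fraction of items satisfying a predicate; 0 for an empty list.
function rate(items: Feedback[], pred: (f: Feedback) => boolean): number {
  return items.length === 0 ? 0 : items.filter(pred).length / items.length;
}

// Expected calibration error: the size-weighted mean gap between average
// confidence and observed relevance rate within equal-width confidence bins.
function expectedCalibrationError(items: Feedback[], bins = 5): number {
  if (items.length === 0) return 0;
  const buckets: Feedback[][] = Array.from({ length: bins }, () => []);
  for (const f of items) {
    // Clamp so that confidence === 1 falls into the last bin.
    const idx = Math.min(Math.floor(f.confidence * bins), bins - 1);
    buckets[idx].push(f);
  }
  let ece = 0;
  for (const bucket of buckets) {
    if (bucket.length === 0) continue;
    const meanConf = bucket.reduce((s, f) => s + f.confidence, 0) / bucket.length;
    const relevantRate = rate(bucket, (f) => f.score >= RELEVANT_MIN_SCORE);
    ece += (bucket.length / items.length) * Math.abs(meanConf - relevantRate);
  }
  return ece;
}

// Made-up feedback for illustration only.
const feedbackSample: Feedback[] = [
  { score: 5, confidence: 0.95 },
  { score: 4, confidence: 0.85 },
  { score: 2, confidence: 0.75 },
  { score: 5, confidence: 0.65 },
  { score: 3, confidence: 0.25 },
];

const relevanceRate = rate(feedbackSample, (f) => f.score >= RELEVANT_MIN_SCORE);
const ohWowRate = rate(feedbackSample, (f) => f.score === 5);
const calibrationGap = expectedCalibrationError(feedbackSample);

console.log({
  relevanceRate,
  meetsRelevanceTarget: relevanceRate >= RELEVANCE_TARGET,
  ohWowRate,
  meetsOhWowTarget: ohWowRate >= OH_WOW_TARGET,
  calibrationGap,
});
```

A calibration gap near 0 would mean that recommendations reported with, say, 80% confidence are judged relevant about 80% of the time. With small feedback samples the per-bin rates are noisy, so the bin count is worth keeping low.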
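Regime consistency and regime differentiation can be sketched as set overlap between recommendation lists. The summary does not say how StateAwarenessEval.ts scores them, so the Jaccard similarity used here is a swapped-in technique, and the `RegimeRun` shape, regime names, and action identifiers are hypothetical.

```typescript
// Assumed record: one recommendation run tagged with the regime it was produced under.
interface RegimeRun {
  regime: string;            // hypothetical regime identifier
  recommendations: string[]; // hypothetical recommended action ids
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|; two empty sets count as identical.
function jaccard(a: string[], b: string[]): number {
  const sa = new Set(a);
  const sb = new Set(b);
  if (sa.size === 0 && sb.size === 0) return 1;
  let intersection = 0;
  sa.forEach((x) => {
    if (sb.has(x)) intersection++;
  });
  return intersection / (sa.size + sb.size - intersection);
}

// Consistency: mean similarity over pairs of runs from the same regime.
// Differentiation: mean dissimilarity (1 - similarity) over pairs from different regimes.
function regimeScores(runs: RegimeRun[]): { consistency: number; differentiation: number } {
  let sameSum = 0, sameCount = 0, diffSum = 0, diffCount = 0;
  for (let i = 0; i < runs.length; i++) {
    for (let j = i + 1; j < runs.length; j++) {
      const sim = jaccard(runs[i].recommendations, runs[j].recommendations);
      if (runs[i].regime === runs[j].regime) {
        sameSum += sim;
        sameCount++;
      } else {
        diffSum += 1 - sim;
        diffCount++;
      }
    }
  }
  return {
    consistency: sameCount === 0 ? 0 : sameSum / sameCount,
    differentiation: diffCount === 0 ? 0 : diffSum / diffCount,
  };
}

const DIFFERENTIATION_TARGET = 0.5; // "Regime differentiation ≥ 50%"

// Made-up runs for illustration only.
const regimeRuns: RegimeRun[] = [
  { regime: "scattered", recommendations: ["focus-block", "triage-inbox"] },
  { regime: "scattered", recommendations: ["focus-block", "triage-inbox"] },
  { regime: "pressured", recommendations: ["triage-inbox", "delegate"] },
];

const stateScores = regimeScores(regimeRuns);
console.log({
  ...stateScores,
  meetsDifferentiationTarget: stateScores.differentiation >= DIFFERENTIATION_TARGET,
});
```

Jaccard ignores ranking, so two lists containing the same actions in a different order count as fully consistent. A rank-aware measure would be stricter if the order of recommendations matters.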

Promotion decision

What has to happen next

Attach run IDs, datasets, metrics, and reproduction commands.
