Grand Diomande Research · Full HTML Reader

IRCP & DLMDataLoader Integration - Quick Reference

File Locations

| Component | File Path |
|-----------|-----------|
| **IRCP Trainer** | `packages/ircp/training/icp_trainer.py` |
| **IRCP Database Loader** | `packages/ircp/data/database_loader.py` |
| **IRCP Base Models** | `packages/ircp/core/base_models.py` |
| **DLM Data Loader** | `packages/dlm/core/data_loader.py` |
| **TPO Trainer** | `packages/tpo/training/trainer.py` |
| **Database Enhanced RCP** | `packages/tpo/consolidation/knowledge_base/database_enhanced_rcp.py` |

Key Classes

IRCP Training

ICPTrainer (main trainer class)
  - train()
  - validate_epoch()
  - train_epoch()
  - _compute_loss() [5-component loss]
  - save_checkpoint()
  - export_model()

ICPDataset (PyTorch Dataset)
  - __getitem__() returns dict with:
    - embedding: torch.Tensor
    - coordinates: torch.Tensor (4D)
    - target: torch.Tensor
    - message_id, conversation_id, author

IRCP Data Loading

DatabaseLoader
  - get_conversation_ids()
  - load_conversation()
  - load_conversations_parallel()
  - _load_coordinates_batch()
  - _load_embeddings_batch()
  - create_icp_dataset()

ConversationDataLoader (High-level interface)
  - load_training_data() [returns train/val/test split]
  - load_sample_data()
  - get_statistics()

DLM Data Loading

DLMDataLoader (context manager enabled)
  - get_conversation_ids()
  - load_conversation()
  - load_conversations() [iterator pattern]
  - _load_coordinates_batch() [with caching]
  - _load_embeddings_batch() [with caching]
  - get_statistics()
  - close()

Data Structure Compatibility

### Coordinates
| IRCP DLMCoordinates | DLM DLMCoordinate | Status |
|-------|-------|--------|
| x | x | Direct |
| y | y | Direct |
| z | z | Direct |
| t | t | Direct |
| depth | depth_level | Direct (renamed) |
| sibling_count | n_parts | Semantic map |
| is_linear | (missing) | Need default |
| confidence | confidence | Direct |

### ConversationNode
Both packages define a `ConversationNode`, but each attaches a different coordinate type:
- IRCP uses `DLMCoordinates` (from `ircp/core/base_models.py`)
- DLM uses `DLMCoordinate` (from `dlm/core/coordinates.py`)

Loss Function Components

1. Coordinate Prediction Loss (weight: 1.0) - MSE
2. Embedding Consistency Loss (weight: 0.1) - Cosine similarity
3. Conservation Constraint Loss (weight: 0.05) - Measure preservation
4. Topological Consistency Loss (weight: 0.1) - k-NN preservation
5. L2 Regularization (weight: 1e-5) - Parameter regularization
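
The five weighted terms above combine into a single scalar. The following is a minimal sketch of that weighted sum in plain Python, so it runs without PyTorch. The dictionary keys and the `combine_losses` helper are illustrative names, not taken from `icp_trainer.py`, and whether the trainer applies L2 as a loss term or through the optimizer is not stated in this guide.

```python
# Weights as listed above; the key names are hypothetical.
LOSS_WEIGHTS = {
    "coordinate": 1.0,
    "embedding_consistency": 0.1,
    "conservation": 0.05,
    "topological": 0.1,
    "l2": 1e-5,
}


def combine_losses(components, weights=LOSS_WEIGHTS):
    """Return the weighted sum of the individual loss terms."""
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing loss components: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)
```

Because the coordinate term carries weight 1.0 and the others at most 0.1, the total is dominated by coordinate error unless the auxiliary terms are much larger in magnitude.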

Training Configuration Parameters

python
config = {
    "epochs": 50,
    "batch_size": 32,
    "learning_rate": 1e-4,
    "weight_decay": 1e-5,
    "optimizer": "adamw",  # adamw, adam, sgd
    "scheduler": "cosine",  # cosine, step, exponential
    "max_grad_norm": 1.0,
    "save_checkpoints": True,
    "output_dir": "./checkpoints"
}
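
A configuration like the one above can be checked before training starts. This is a sketch only: `validate_config` is a hypothetical helper, and the allowed optimizer and scheduler names are simply the ones listed in the comments above.

```python
ALLOWED_OPTIMIZERS = {"adamw", "adam", "sgd"}
ALLOWED_SCHEDULERS = {"cosine", "step", "exponential"}


def validate_config(config):
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    if config.get("optimizer") not in ALLOWED_OPTIMIZERS:
        problems.append(f"unknown optimizer: {config.get('optimizer')!r}")
    if config.get("scheduler") not in ALLOWED_SCHEDULERS:
        problems.append(f"unknown scheduler: {config.get('scheduler')!r}")
    for key in ("epochs", "batch_size"):
        value = config.get(key)
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{key} must be a positive integer")
    for key in ("learning_rate", "max_grad_norm"):
        value = config.get(key)
        if isinstance(value, bool) or not isinstance(value, (int, float)) or value <= 0:
            problems.append(f"{key} must be a positive number")
    return problems
```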

Database Schema

### IRCP Expected Schema
- conversations: conversation_id, total_messages
- messages: message_id, conversation_id, parent_id, content, author, create_time, token_count
- dlm_coordinates: message_id, x_coord, y_coord, z_coord, t_coord, depth, sibling_order, sibling_count, is_linear
- embeddings: message_id, embedding_vector

### DLM Expected Schema
- conversations: conversation_id, total_messages
- messages: message_id, conversation_id, parent_id, content, author, create_time, token_count, end_turn, weight
- dlm_coordinates: message_id, x, y, z, t, n_parts, depth_level, sibling_index, confidence
- embeddings: message_id, embedding

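Critical difference: the two loaders expect different column names in the `dlm_coordinates` table (`x_coord` vs. `x`, `depth` vs. `depth_level`, and so on), and the embedding column is `embedding_vector` in the IRCP schema but `embedding` in the DLM schema.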
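
One way to tell which of the two schemas a given database follows is to inspect the column names of `dlm_coordinates`. The sketch below assumes a SQLite database (as the `db.sqlite` paths in this guide suggest) and uses SQLite's `PRAGMA table_info`. The function name `detect_coordinate_schema` is hypothetical.

```python
import sqlite3


def detect_coordinate_schema(conn):
    """Classify the dlm_coordinates table as 'ircp', 'dlm', or 'unknown'."""
    rows = conn.execute("PRAGMA table_info(dlm_coordinates)").fetchall()
    columns = {row[1] for row in rows}  # row[1] is the column name
    if {"x_coord", "y_coord", "z_coord", "t_coord"} <= columns:
        return "ircp"
    if {"x", "y", "z", "t"} <= columns:
        return "dlm"
    return "unknown"  # table missing or columns match neither schema
```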

Integration Challenges

Challenge 1: Coordinate System

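python
from ircp.core.base_models import DLMCoordinates

# Adapter: map a DLM `DLMCoordinate` onto the IRCP `DLMCoordinates` model.
# `default_is_linear` is an assumption: DLM has no is_linear field, so the
# adapter must supply a value.
def dlm_to_ircp_coordinates(dlm_coord, default_is_linear=False) -> DLMCoordinates:
    return DLMCoordinates(
        x=float(dlm_coord.x),  # cast so both loaders yield the same float type
        y=float(dlm_coord.y),
        z=float(dlm_coord.z),
        t=float(dlm_coord.t),
        depth=dlm_coord.depth_level,      # renamed field
        sibling_count=dlm_coord.n_parts,  # semantic mapping, not an exact equivalent
        is_linear=default_is_linear,
        confidence=dlm_coord.confidence,
        metadata={"sibling_index": dlm_coord.sibling_index},
    )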

Challenge 2: ConversationGraph Structure

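python
# IRCP ConversationGraph: explicit adjacency maps
graph.edges = {parent_id: [child_id, ...]}   # parent -> list of children
graph.reverse_edges = {child_id: parent_id}  # child -> its single parent

# DLM ConversationGraph: list of roots plus traversal methods
graph.root_ids = [root_id, ...]
# Methods: get_children(), get_ancestors(), get_depth()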
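
Bridging the two structures amounts to walking the DLM graph from its roots and recording each parent/child pair. The sketch below does this breadth-first. It assumes `get_children` accepts a node ID and returns an iterable of child IDs, which this guide does not confirm. `build_ircp_edges` is a hypothetical helper name.

```python
from collections import deque


def build_ircp_edges(root_ids, get_children):
    """Walk a DLM-style graph from its roots and build IRCP-style edge maps.

    `get_children` is assumed to map a node ID to an iterable of child IDs;
    the real DLM method may have a different signature.
    """
    edges = {}          # parent_id -> [child_ids]
    reverse_edges = {}  # child_id -> parent_id
    queue = deque(root_ids)
    seen = set(root_ids)
    while queue:
        parent_id = queue.popleft()
        children = list(get_children(parent_id))
        if children:
            edges[parent_id] = children
        for child_id in children:
            reverse_edges[child_id] = parent_id
            if child_id not in seen:  # guard against malformed cyclic data
                seen.add(child_id)
                queue.append(child_id)
    return edges, reverse_edges
```

With a real DLM graph, the call would look roughly like `build_ircp_edges(graph.root_ids, graph.get_children)`, subject to the signature assumption above.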

### Challenge 3: Database Column Names
- IRCP: x_coord, y_coord, z_coord, t_coord
- DLM: x, y, z, t
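
One possible way to absorb the naming difference is to alias DLM columns to the IRCP names at query time, so downstream code sees a single set of keys. This sketch assumes SQLite and covers only the columns whose mapping is stated in this guide. `DLM_TO_IRCP_COLUMNS` and `select_coordinates_as_ircp` are hypothetical names.

```python
import sqlite3

# DLM column -> IRCP column, limited to mappings stated in this guide.
DLM_TO_IRCP_COLUMNS = {
    "x": "x_coord",
    "y": "y_coord",
    "z": "z_coord",
    "t": "t_coord",
    "depth_level": "depth",
    "n_parts": "sibling_count",
}


def select_coordinates_as_ircp(conn):
    """Read a DLM-schema dlm_coordinates table and return rows keyed by IRCP names."""
    select_list = ", ".join(
        f"{dlm} AS {ircp}" for dlm, ircp in DLM_TO_IRCP_COLUMNS.items()
    )
    cursor = conn.execute(f"SELECT message_id, {select_list} FROM dlm_coordinates")
    names = [column[0] for column in cursor.description]  # aliased column names
    return [dict(zip(names, row)) for row in cursor]
```

The column names are interpolated from a fixed dictionary rather than user input, so the f-string does not open an injection path here.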

Current Data Flow

IRCP Current:
Database → DatabaseLoader → ConversationGraph → ICPDataPoint
  → ICPDataset → DataLoader → Training Loop

Proposed with DLMDataLoader:
Database → DLMDataLoader → DLM ConversationGraph → Adapter
  → ICPDataPoint → ICPDataset → DataLoader → Training Loop

Integration Benefits

1. Unified caching (DLMDataLoader caches coordinates as well as embeddings)
2. Context manager support (automatic cleanup)
3. Better logging integration
4. Reduced code duplication
5. Iterator pattern for memory efficiency
6. Flexible database schema handling
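
The memory benefit of the iterator pattern only holds if the caller avoids collecting everything into one list. A minimal sketch of consuming any iterator in fixed-size chunks follows. `in_chunks` is a hypothetical helper, not part of either package.

```python
from itertools import islice


def in_chunks(iterable, chunk_size):
    """Yield lists of up to chunk_size items without materializing the whole input."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

Assuming `load_conversations()` returns an iterator as described above, a loop such as `for batch in in_chunks(loader.load_conversations(conv_ids), 64): ...` would keep at most 64 graphs in memory at a time.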

Adapter Layer Tasks

1. Convert DLMCoordinate → DLMCoordinates
2. Convert DLM ConversationGraph → IRCP ConversationGraph
3. Create ICPDataPoint from DLM nodes
4. Handle database schema differences
5. Provide statistics and validation
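
For the validation task, a per-item check along the following lines may be enough to catch adapter mistakes early. The required keys mirror the fields that `ICPDataset.__getitem__()` is described as returning. `validate_data_point` is a hypothetical helper, and the sketch works on any sequence type (lists here, tensors in practice).

```python
import math

REQUIRED_KEYS = (
    "embedding", "coordinates", "target",
    "message_id", "conversation_id", "author",
)


def validate_data_point(point):
    """Return a list of problems found in one data-point dict (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in point]
    coords = point.get("coordinates")
    if coords is not None:
        if len(coords) != 4:
            problems.append(f"expected 4 coordinates, got {len(coords)}")
        if not all(math.isfinite(float(c)) for c in coords):
            problems.append("coordinates contain NaN or infinity")
    embedding = point.get("embedding")
    if embedding is not None and len(embedding) == 0:
        problems.append("embedding is empty")
    return problems
```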

Testing Checklist

  • [ ] Coordinate conversion preserves precision
  • [ ] Embedding arrays identical between loaders
  • [ ] Training loss curves within 1
  • [ ] Backward compatibility with existing code
  • [ ] Performance metrics documented
  • [ ] Edge cases handled (missing data, schema variants)
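
The embedding and precision items in the checklist can be tested by loading the same messages through both loaders and comparing the results element by element. The following is a sketch of such a comparison. `compare_embeddings` is a hypothetical helper that takes plain `{message_id: sequence of floats}` mappings rather than either loader's actual return type.

```python
import math


def compare_embeddings(old, new, abs_tol=0.0):
    """Return the sorted message IDs whose embeddings differ between two loaders.

    An ID is reported if it is missing on one side, if the lengths differ,
    or if any element differs by more than abs_tol.
    """
    mismatched = []
    for message_id in sorted(set(old) | set(new)):
        a, b = old.get(message_id), new.get(message_id)
        if a is None or b is None or len(a) != len(b):
            mismatched.append(message_id)
            continue
        if any(not math.isclose(x, y, rel_tol=0.0, abs_tol=abs_tol)
               for x, y in zip(a, b)):
            mismatched.append(message_id)
    return mismatched
```

With the default `abs_tol=0.0` the check demands exact equality, which matches the "identical" wording of the checklist. A small tolerance is useful when one path casts between float widths.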

Performance Targets

  • Data loading: 10-20
  • Memory: Reduced with iterator pattern
  • Quality: No regression in training results
  • Maintainability: 50

Key Files to Modify

1. Create: `packages/ircp/data/dlm_adapter.py`
2. Create: `packages/ircp/data/dlm_data_loader.py`
3. Optional: Modify `packages/ircp/training/icp_trainer.py` (for loader flexibility)

Example Integration Flow

python
from dlm.core.data_loader import DLMDataLoader
from ircp.data.dlm_adapter import convert_dlm_to_icp_dataset  # proposed module, to be created
from ircp.training.icp_trainer import ICPTrainer

# db_path, model, and config are assumed to be defined by the caller.

# Step 1: Load conversation graphs with DLMDataLoader
with DLMDataLoader(db_path) as loader:
    conv_ids = loader.get_conversation_ids()
    graphs = list(loader.load_conversations(conv_ids[:100]))

# Step 2: Convert to IRCP format
icp_data = convert_dlm_to_icp_dataset(graphs)

# Step 3: Train as normal on an 80/20 train/validation split
split = int(0.8 * len(icp_data))  # proportional, so it holds for any dataset size
trainer = ICPTrainer(model, config)
results = trainer.train(icp_data[:split], icp_data[split:])

Database Query Differences

Getting Coordinates

sql
-- IRCP expects
SELECT x_coord, y_coord, z_coord, t_coord FROM dlm_coordinates;

-- DLM expects
SELECT x, y, z, t FROM dlm_coordinates;

Getting Conversations

sql
-- Both loaders use the same query; :min_messages is a bound parameter
SELECT conversation_id, total_messages FROM conversations
WHERE total_messages >= :min_messages
ORDER BY total_messages DESC;
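
Run from Python, the conversation query above would typically bind `min_messages` as a parameter rather than formatting it into the SQL string. This sketch uses the standard `sqlite3` module. The function is a stand-in for illustration, not the actual `get_conversation_ids()` implementation of either loader.

```python
import sqlite3


def fetch_conversation_ids(conn, min_messages=1):
    """Return conversation IDs with at least min_messages, largest first."""
    rows = conn.execute(
        "SELECT conversation_id, total_messages FROM conversations "
        "WHERE total_messages >= :min_messages "
        "ORDER BY total_messages DESC",
        {"min_messages": min_messages},
    )
    return [conversation_id for conversation_id, _total in rows]
```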

Compatibility Matrix

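| Feature | IRCP | DLM | Compatibility |
|---------|------|-----|---------------|
| Conversation loading | Yes | Yes | 100% |
| Parallel loading | Yes | Yes | 100% |
| Embedding caching | Yes | Yes | 100% |
| Coordinate caching | No | Yes | Improvement |
| Context manager | No | Yes | Improvement |
| Iterator pattern | No | Yes | Improvement |
| Logging integration | Standard | Enhanced | Improvement |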

Common Gotchas

1. Schema Mismatch: The database follows the DLM schema, but the IRCP loader queries different column names (`x_coord` instead of `x`, and so on)
- Solution: Inspect the `dlm_coordinates` columns and use the loader that matches

2. Coordinate Precision: The same field may come back as a float from one loader and an int from the other
- Solution: Cast to one consistent float type in every conversion

3. Memory Issues: Large datasets combined with embedding caching can exhaust memory
- Solution: Iterate over `load_conversations()` instead of collecting every graph into a list

4. Missing Metadata: The `is_linear` field does not exist in DLM coordinates
- Solution: Supply a sensible default in the adapter

5. Timing: The two sources may use different timestamp formats
- Solution: Normalize to float (seconds since epoch) in the adapter
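
For the timestamp gotcha, a normalization helper in the adapter could look like the sketch below. It assumes timestamps arrive as numbers, `datetime` objects, or ISO 8601 strings, and that naive values are UTC. Neither assumption is confirmed by this guide, and `to_epoch_seconds` is a hypothetical name.

```python
from datetime import datetime, timezone


def to_epoch_seconds(value):
    """Normalize a timestamp to float seconds since the Unix epoch."""
    if isinstance(value, (int, float)):
        return float(value)  # already epoch seconds
    if isinstance(value, datetime):
        dt = value
    elif isinstance(value, str):
        dt = datetime.fromisoformat(value)
    else:
        raise TypeError(f"unsupported timestamp type: {type(value).__name__}")
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive values are UTC
    return dt.timestamp()
```

Numeric strings such as `"1700000000.5"` are not handled here. If the database stores them, a `float(value)` attempt would need to come before the ISO parse.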

Command Reference

Load Data (Current)

bash
python -c "
from ircp.data.database_loader import ConversationDataLoader
loader = ConversationDataLoader('path/to/db.sqlite')
train, val, test = loader.load_training_data()
"

Load Data (With DLMDataLoader)

bash
python -c "
from dlm.core.data_loader import DLMDataLoader
from ircp.data.dlm_adapter import convert_dlm_to_icp_dataset
with DLMDataLoader('path/to/db.sqlite') as loader:
    graphs = list(loader.load_conversations())
    data = convert_dlm_to_icp_dataset(graphs)
"

Next Steps

1. Implement adapter layer (Phase 1) - 2-3 hours
2. Test coordinate conversion - 1 hour
3. Test training equivalence - 2-3 hours
4. Create documentation - 1 hour
5. Performance benchmarking - 1-2 hours

Total Estimated Time: 7-10 hours

Source Anchor

Comp-Core/backend/cc-trajectory/legacy/cc-tpo-original/cc-tpo/docs/guides/IRCP_INTEGRATION_QUICK_REFERENCE.md