IRCP Training Infrastructure Analysis & DLMDataLoader Integration Guide
Executive Summary
This document provides a comprehensive analysis of the IRCP training infrastructure and a detailed integration plan for the DLMDataLoader from Phase 3.1. The IRCP framework uses an ICP trainer with a sophisticated multi-component loss function and database-backed data loading. Integration with DLMDataLoader will improve data loading efficiency and provide unified coordinate system support.
---
1. IRCP Training Infrastructure Overview
1.1 Main Trainer Class
File: `[home]/Desktop/Computational Choreography/cc-tpo/packages/ircp/training/icp_trainer.py`
Key Class: `ICPTrainer`
The ICPTrainer is the main training orchestrator for the IRCP framework with the following capabilities:
#### Constructor Parameters
```python
def __init__(self, model, config: Dict[str, Any]):
    # Keys read from `config`, with their defaults:
    #   epochs: int = 50
    #   batch_size: int = 32
    #   learning_rate: float = 1e-4
    #   weight_decay: float = 1e-5
    #   save_checkpoints: bool = True
    #   output_dir: str = "./checkpoints"
    #   optimizer: str = "adamw"     (adamw, adam, sgd)
    #   scheduler: str = "cosine"    (cosine, step, exponential)
    #   max_grad_norm: float = 1.0
    ...
```
#### Core Methods
- `train(train_data: List[ICPDataPoint], val_data: Optional[List[ICPDataPoint]]) -> Dict`
- `validate_epoch(val_loader: DataLoader) -> float`
- `train_epoch(train_loader: DataLoader) -> float`
- `_create_dataloader(data_points: List[ICPDataPoint], mode: str) -> DataLoader`
- `save_checkpoint(epoch: int, val_loss: float, is_best: bool)`
- `load_checkpoint(checkpoint_path: str)`
- `get_training_statistics() -> Dict[str, Any]`
- `export_model(export_path: str, format: str)`
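Because the trainer takes its settings as a plain dictionary, a caller usually supplies only the keys that differ from the defaults. The following is a minimal sketch of how such a dictionary could be merged with the defaults listed above. `DEFAULT_CONFIG`, `ALLOWED_CHOICES`, and `resolve_config` are hypothetical names used for illustration; whether `ICPTrainer` rejects unknown keys is an assumption, not something the source states.

```python
from typing import Any, Dict

# Defaults copied from the constructor parameter list above.
DEFAULT_CONFIG: Dict[str, Any] = {
    "epochs": 50,
    "batch_size": 32,
    "learning_rate": 1e-4,
    "weight_decay": 1e-5,
    "save_checkpoints": True,
    "output_dir": "./checkpoints",
    "optimizer": "adamw",
    "scheduler": "cosine",
    "max_grad_norm": 1.0,
}

# Choices listed in parentheses in the parameter list.
ALLOWED_CHOICES = {
    "optimizer": {"adamw", "adam", "sgd"},
    "scheduler": {"cosine", "step", "exponential"},
}


def resolve_config(user_config: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay user settings on the defaults and validate enumerated options."""
    unknown = set(user_config) - set(DEFAULT_CONFIG)
    if unknown:
        # Assumption: failing early on typos is preferable to ignoring them.
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    merged = {**DEFAULT_CONFIG, **user_config}
    for key, choices in ALLOWED_CHOICES.items():
        if merged[key] not in choices:
            raise ValueError(f"{key} must be one of {sorted(choices)}")
    return merged
```

Calling `resolve_config({"epochs": 5, "optimizer": "sgd"})` keeps every other default untouched, which is a convenient way to build the `config` argument for short experiments.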
1.2 Dataset Class
Class: `ICPDataset(Dataset)`
Handles conversion of ICPDataPoint objects to PyTorch tensors:
```python
class ICPDataset(Dataset):
    def __init__(self, data_points: List[ICPDataPoint], max_length: int = 512, mode: str = "train"): ...

    def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]:
        # Returns:
        #   "embedding": torch.Tensor (embedding_dim,)
        #   "coordinates": torch.Tensor (4,) [x, y, z, t]
        #   "target": torch.Tensor
        #   "message_id": str
        #   "conversation_id": str
        #   "author": str
        ...
```
Data Validation: the dataset keeps a data point only if all of the following hold:
- embedding is not None and has length > 0
- embedding contains no NaN values
- coordinates is not None
1.3 Loss Function Architecture
Method: `_compute_loss(batch: Dict[str, torch.Tensor]) -> torch.Tensor`
The trainer implements a sophisticated multi-component loss function:
#### 1. Coordinate Prediction Loss (MSE)
- Predicts 4D coordinates from embeddings
- Supports three prediction approaches:
- Standard PyTorch forward pass
- Custom `predict_coordinates` method
- Fallback linear transformation layer
#### 2. Embedding Consistency Loss (0.1 weight)
- Ensures similar coordinates have similar embeddings
- Uses cosine similarity for embeddings vs pairwise distances for coordinates
- Metric: `MSE(embedding_distances, normalized_coordinate_distances)`
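The metric above can be illustrated without PyTorch. The sketch below is one possible reading of the description, not the trainer's actual implementation: it assumes the embedding distance is `1 - cosine similarity`, that coordinate distances are Euclidean and normalized by their maximum, and that the loss is the mean squared difference over all pairs. The function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity (assumed 1.0 when either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 1.0


def consistency_loss(
    embeddings: Sequence[Sequence[float]],
    coordinates: Sequence[Sequence[float]],
) -> float:
    """MSE between pairwise embedding distances and normalized coordinate distances."""
    pairs = list(combinations(range(len(embeddings)), 2))
    if not pairs:
        return 0.0
    emb_d = [cosine_distance(embeddings[i], embeddings[j]) for i, j in pairs]
    coord_d = [math.dist(coordinates[i], coordinates[j]) for i, j in pairs]
    max_d = max(coord_d)
    if max_d > 0:
        # Scale coordinate distances into [0, 1] so both sides are comparable.
        coord_d = [d / max_d for d in coord_d]
    return sum((e - c) ** 2 for e, c in zip(emb_d, coord_d)) / len(pairs)
```

Under these assumptions the loss is zero when far-apart coordinates carry orthogonal embeddings and nearby coordinates carry aligned ones, and it grows when identical embeddings are assigned to distant coordinates.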
#### 3. Conservation Constraint Loss (0.05 weight)
- Preserves measure (volume) consistency
- Implements measure-preservation through determinant/trace calculations
- For high-dimensional embeddings: `|log(embed_measure) - log(coord_measure)|`
#### 4. Topological Consistency Loss (0.1 weight)
- Preserves k-NN neighborhood structure in coordinate space
- Normalizes distances to [0,1] range
- Maintains relative distance ordering between embeddings and coordinates
#### 5. L2 Regularization (1e-5 weight)
- Regularizes model parameters to prevent overfitting
1.4 Training Loop
The training process follows this sequence:
For each epoch:
1. train_epoch() - forward pass with backprop and optimization
2. scheduler.step() - learning rate adjustment
3. Logging of epoch statistics
4. Checkpoint saving (every 10 epochs or if best validation loss)
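The checkpoint rule in step 4 can be written as a small predicate. This is a sketch under stated assumptions: epochs are counted from zero, "every 10 epochs" means after the 10th, 20th, ... epoch, and "best" means strictly lower validation loss than any earlier epoch. `should_save_checkpoint` and `simulate` are hypothetical helpers, not methods of `ICPTrainer`.

```python
from typing import List, Tuple


def should_save_checkpoint(
    epoch: int, val_loss: float, best_val_loss: float, every: int = 10
) -> Tuple[bool, bool]:
    """Return (save, is_best) for a 0-based epoch index."""
    is_best = val_loss < best_val_loss
    is_periodic = (epoch + 1) % every == 0
    return (is_best or is_periodic), is_best


def simulate(val_losses: List[float]) -> List[int]:
    """Return the 0-based epochs at which a checkpoint would be written."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses):
        save, is_best = should_save_checkpoint(epoch, loss, best)
        if is_best:
            best = loss
        if save:
            saved.append(epoch)
    return saved
```

The `is_best` flag is returned separately because `save_checkpoint(epoch, val_loss, is_best)` takes it as an argument.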
Training State Tracked:
- train_losses: List[float]
- val_losses: List[float]
- learning_rates: List[float]
- loss_history: Dict with all loss components
- best_val_loss: float (for early stopping logic)
---
2. Current Data Loading Approach
2.1 IRCP DatabaseLoader
File: `[home]/Desktop/Computational Choreography/cc-tpo/packages/ircp/data/database_loader.py`
DatabaseConfig
```python
@dataclass
class DatabaseConfig:
    db_path: str
    min_messages: int = 5
    max_conversations: Optional[int] = None
    require_coordinates: bool = True
    require_embeddings: bool = False
    batch_size: int = 1000
    cache_embeddings: bool = True
    parallel_loading: bool = True
    max_workers: int = 4
```
Key Classes
DatabaseLoader
- `get_conversation_ids() -> List[str]` - Filters conversations by message count and data availability
- `load_conversation(conversation_id: str) -> Optional[ConversationGraph]` - Loads single conversation
- `load_conversations_parallel(conversation_ids: List[str]) -> List[ConversationGraph]` - Parallel loading
- `_load_coordinates_batch(message_ids: List[str]) -> Dict[str, DLMCoordinates]` - Batch coordinate loading
- `_load_embeddings_batch(message_ids: List[str]) -> Dict[str, np.ndarray]` - Batch embedding loading with cache
- `create_icp_dataset(conversation_graphs: List[ConversationGraph]) -> List[ICPDataPoint]` - Converts graphs to ICP data
ConversationDataLoader (High-level interface)
- `load_training_data(train_ratio=0.8, validation_ratio=0.1, test_ratio=0.1) -> Tuple[List[ICPDataPoint], ...]`
- `load_sample_data(n_conversations: int = 10) -> List[ICPDataPoint]`
- `get_statistics() -> Dict[str, Any]`
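The three ratios of `load_training_data` describe a train/validation/test partition. A minimal sketch of such a split follows; the seeded shuffle and the name `split_by_ratio` are assumptions for illustration, since the source does not say whether the real loader shuffles or at which level (conversations or data points) it splits.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_by_ratio(
    items: Sequence[T],
    train_ratio: float = 0.8,
    validation_ratio: float = 0.1,
    test_ratio: float = 0.1,
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle deterministically, then cut into train/validation/test lists."""
    if abs(train_ratio + validation_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same split
    n_train = int(len(shuffled) * train_ratio)
    n_val = int(len(shuffled) * validation_ratio)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, so no item is dropped
    return train, val, test
```

Splitting conversation IDs rather than individual messages keeps all messages of one conversation in the same partition, which avoids leakage between training and validation data.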
#### Database Schema Expected
- `conversations` table: conversation_id, total_messages
- `messages` table: message_id, conversation_id, parent_id, content, author, create_time, token_count
- `dlm_coordinates` table: message_id, x_coord, y_coord, z_coord, t_coord, depth, sibling_order, sibling_count, is_linear
- `embeddings` table: message_id, embedding_vector (pickled numpy array)
- `relationships` table: prompt_id, response_id, confidence, quality_difference, temporal_distance, euclidean_distance
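To make the schema concrete, the sketch below creates the `dlm_coordinates` table with the IRCP column names and loads coordinates for many messages in a few queries rather than one query per message. The column types, the chunk size, and both function names are assumptions; only the table and column names come from the list above.

```python
import sqlite3
from typing import Dict, List, Tuple

Coords = Tuple[float, float, float, float]


def create_ircp_coordinate_table(conn: sqlite3.Connection) -> None:
    """Create dlm_coordinates with the IRCP column names (types are assumed)."""
    conn.execute(
        """
        CREATE TABLE dlm_coordinates (
            message_id TEXT PRIMARY KEY,
            x_coord REAL, y_coord REAL, z_coord REAL, t_coord REAL,
            depth INTEGER, sibling_order INTEGER,
            sibling_count INTEGER, is_linear INTEGER
        )
        """
    )


def load_coordinates_batch(
    conn: sqlite3.Connection, message_ids: List[str], chunk_size: int = 500
) -> Dict[str, Coords]:
    """Fetch (x, y, z, t) for many messages using IN (...) queries."""
    result: Dict[str, Coords] = {}
    for start in range(0, len(message_ids), chunk_size):
        chunk = message_ids[start:start + chunk_size]
        # One "?" placeholder per ID keeps the query parameterized.
        placeholders = ",".join("?" * len(chunk))
        rows = conn.execute(
            "SELECT message_id, x_coord, y_coord, z_coord, t_coord "
            f"FROM dlm_coordinates WHERE message_id IN ({placeholders})",
            chunk,
        )
        for message_id, x, y, z, t in rows:
            result[message_id] = (x, y, z, t)
    return result
```

Chunking keeps each statement below SQLite's bound-parameter limit; IDs without a row are simply absent from the returned dictionary.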
2.2 Data Flow
```
Database (SQLite)
    ↓
DatabaseLoader._load_coordinates_batch()
DatabaseLoader._load_embeddings_batch()
    ↓
ConversationNode objects
    ↓
ConversationGraph
    ↓
DatabaseLoader.create_icp_dataset()
    ↓
List[ICPDataPoint]
    ↓
ICPTrainer._create_dataloader()
    ↓
ICPDataset (PyTorch Dataset)
    ↓
DataLoader (PyTorch DataLoader)
    ↓
Training Loop
```
---
3. DLMDataLoader Overview
File: `[home]/Desktop/Computational Choreography/cc-tpo/packages/dlm/core/data_loader.py`
3.1 Key Classes
ConversationNode
```python
@dataclass
class ConversationNode:
    message_id: str
    content: str
    author: str
    timestamp: float
    parent_id: Optional[str] = None
    coordinates: Optional[DLMCoordinate] = None  # DLM-specific coordinate format
    embedding: Optional[np.ndarray] = None
    token_count: int = 0
    metadata: Dict[str, Any] = field(default_factory=dict)
```
ConversationGraph
```python
@dataclass
class ConversationGraph:
    conversation_id: str = ""
    nodes: Dict[str, ConversationNode] = field(default_factory=dict)
    root_ids: List[str] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)

    # Methods:
    #   add_node(node: ConversationNode)
    #   get_children(message_id: str) -> List[ConversationNode]
    #   get_ancestors(message_id: str) -> List[ConversationNode]
    #   get_depth(message_id: str) -> int
    #   to_dict() -> Dict[str, Any]
```
DLMDataLoader
```python
class DLMDataLoader:
    def __init__(self, db_path: Union[str, Path],
                 config: Optional[DLMConfig] = None,
                 logger: Optional[Any] = None): ...

    # Methods:
    #   get_conversation_ids() -> List[str]
    #   load_conversation(conversation_id: str) -> Optional[ConversationGraph]
    #   load_conversations(conversation_ids=None, max_count=None) -> Iterator[ConversationGraph]
    #   _load_coordinates_batch(message_ids: List[str]) -> Dict[str, DLMCoordinate]
    #   _load_embeddings_batch(message_ids: List[str]) -> Dict[str, np.ndarray]
    #   get_statistics() -> Dict[str, Any]
    #   close()
    #   __enter__() / __exit__()  (context manager support)
```
3.2 DLMCoordinate Format
DLM uses a different coordinate representation:
```python
class DLMCoordinate:
    x: float
    y: float
    z: float
    t: float
    n_parts: int         # Number of conversation parts
    depth_level: int     # Depth in conversation tree
    sibling_index: int   # Position among siblings
    confidence: float = 1.0
```
### 3.3 Database Schema Expected by DLMDataLoader
- `conversations` table: conversation_id, total_messages
- `messages` table: message_id, conversation_id, parent_id, content, author, create_time, token_count, end_turn, weight
- `dlm_coordinates` table: message_id, x, y, z, t, n_parts, depth_level, sibling_index, confidence
- `embeddings` table: message_id, embedding (pickled numpy array)
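The `dlm_coordinates` columns listed here differ from the IRCP names in Section 2.1 (`x` versus `x_coord`, and so on), so code that must work with either database needs a way to tell them apart. One option is to inspect the table with SQLite's `PRAGMA table_info`. The function below is a hypothetical sketch of that idea, not part of either loader.

```python
import sqlite3

# Column names taken from the two schema descriptions in this document.
IRCP_COLUMNS = {"x_coord", "y_coord", "z_coord", "t_coord"}
DLM_COLUMNS = {"x", "y", "z", "t"}


def detect_coordinate_schema(conn: sqlite3.Connection) -> str:
    """Return "ircp", "dlm", or "unknown" based on dlm_coordinates column names."""
    # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
    columns = {row[1] for row in conn.execute("PRAGMA table_info(dlm_coordinates)")}
    if IRCP_COLUMNS <= columns:
        return "ircp"
    if DLM_COLUMNS <= columns:
        return "dlm"
    return "unknown"  # table missing or using some other layout
```

Detecting the schema once per connection avoids relying on a failed query as the signal that the other column naming is in use.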
---
4. Data Structure Compatibility Analysis
4.1 Coordinate Systems
IRCP DLMCoordinates:
```python
@dataclass
class DLMCoordinates:
    x: float  # Depth coordinate
    y: float  # Sibling order coordinate
    z: float  # Homogeneity coordinate
    t: float  # Temporal coordinate
    depth: int = 0
    sibling_count: int = 0
    is_linear: bool = False
    confidence: float = 1.0
    metadata: Dict[str, Any] = field(default_factory=dict)
```
DLM DLMCoordinate:
```python
class DLMCoordinate:
    x: float
    y: float
    z: float
    t: float
    n_parts: int        # vs sibling_count
    depth_level: int    # aligns with depth
    sibling_index: int  # more specific than y
    confidence: float
```
Mapping Strategy:
| IRCP DLMCoordinates | DLM DLMCoordinate | Compatibility |
|-------------------|------------------|----------------|
| x (depth) | x | Direct ✓ |
| y (sibling order) | y | Direct ✓ |
| z (homogeneity) | z | Direct ✓ |
| t (temporal) | t | Direct ✓ |
| depth | depth_level | Direct ✓ |
| sibling_count | n_parts | Semantic difference ⚠ |
| is_linear | (not present) | New attribute ⚠ |
| metadata["sibling_order"] | sibling_index | Extracted ✓ |
Conversion Functions Needed:
```python
def dlm_to_ircp_coordinates(dlm_coord: DLMCoordinate) -> DLMCoordinates:
    """Convert DLMCoordinate to IRCP DLMCoordinates"""
    return DLMCoordinates(
        x=dlm_coord.x,
        y=dlm_coord.y,
        z=dlm_coord.z,
        t=dlm_coord.t,
        depth=dlm_coord.depth_level,
        sibling_count=dlm_coord.n_parts,  # Semantic mapping
        is_linear=False,  # Default, could be stored in DLMCoordinate
        confidence=dlm_coord.confidence,
        metadata={
            # Stored under "sibling_order" so that it matches the mapping table
            # and is read back by ircp_to_dlm_coordinates below.
            "sibling_order": dlm_coord.sibling_index,
            "n_parts": dlm_coord.n_parts,
        },
    )


def ircp_to_dlm_coordinates(ircp_coord: DLMCoordinates) -> DLMCoordinate:
    """Convert IRCP DLMCoordinates to DLMCoordinate"""
    return DLMCoordinate(
        x=ircp_coord.x,
        y=ircp_coord.y,
        z=ircp_coord.z,
        t=ircp_coord.t or 0,  # a missing temporal value becomes 0
        n_parts=ircp_coord.sibling_count,
        depth_level=ircp_coord.depth,
        sibling_index=ircp_coord.metadata.get("sibling_order", 0),
        confidence=ircp_coord.confidence,
    )
```
4.2 Conversation Node Compatibility
Both systems use ConversationNode but with slight differences:
IRCP ConversationNode (in database_loader):
- Uses IRCP DLMCoordinates
- Stored in ConversationGraph from database_loader
DLM ConversationNode (in DLMDataLoader):
- Uses DLM DLMCoordinate
- Stored in DLM ConversationGraph
Strategy: Create adapter to convert between them or modify IRCP trainer to accept DLM nodes.
---
5. Integration Points & Recommendations
5.1 Primary Integration Point: Data Loading in Training
Current Flow:
```python
# IRCP current approach
data_loader = ConversationDataLoader(db_path)
train_data, val_data, test_data = data_loader.load_training_data()
trainer = ICPTrainer(model, config)
results = trainer.train(train_data, val_data)
```
Proposed Flow with DLMDataLoader:
```python
# New approach
from dlm.core.data_loader import DLMDataLoader, ConversationGraph
from ircp.training.icp_trainer import ICPTrainer
# Adapter module and function as proposed in Section 9, Phase 1
from ircp.data.dlm_adapter import create_icp_dataset_from_dlm_graphs

# Use DLMDataLoader (dlm_config is a DLMConfig, distinct from the trainer's config dict)
with DLMDataLoader(db_path, dlm_config) as loader:
    conversation_ids = loader.get_conversation_ids()

    # Split IDs 80/20 into train and validation
    split = int(0.8 * len(conversation_ids))
    train_ids = conversation_ids[:split]
    val_ids = conversation_ids[split:]

    # Load conversations while the connection is open
    train_graphs = list(loader.load_conversations(train_ids))
    val_graphs = list(loader.load_conversations(val_ids))

# Convert to IRCP format
train_data = create_icp_dataset_from_dlm_graphs(train_graphs)
val_data = create_icp_dataset_from_dlm_graphs(val_graphs)

# Training
trainer = ICPTrainer(model, config)
results = trainer.train(train_data, val_data)
```
5.2 Integration Points Where DLMDataLoader Can Replace Current DatabaseLoader
1. Conversation Fetching (100%)
   - `get_conversation_ids()` - Same interface
   - `load_conversation()` - Same interface, returns ConversationGraph
2. Parallel Loading (100%)
   - `load_conversations()` with parallelization already implemented
   - Same ThreadPoolExecutor pattern
3. Coordinate Loading (95%)
   - DLMDataLoader returns DLM DLMCoordinate
   - Needs coordinate system conversion adapter
4. Embedding Loading (100%)
   - Same SQLite schema and pickle deserialization
   - Caching strategy identical
5. Database Connection Optimization (100%)
   - WAL mode, pragma settings identical
   - Connection pooling patterns match
5.3 Compatibility Issues to Address
#### Issue 1: Coordinate System Difference
Problem: IRCP uses DLMCoordinates while DLMDataLoader uses DLMCoordinate
Solutions:
- Option A: Create adapter layer for coordinate conversion (recommended)
- Option B: Modify IRCP DLMCoordinates to inherit from DLM DLMCoordinate
- Option C: Create unified coordinate class in shared location
Recommended: Option A - Adapter pattern for loose coupling
#### Issue 2: Database Column Names
Problem: Different column naming in dlm_coordinates table
- IRCP expects: `x_coord, y_coord, z_coord, t_coord, depth, sibling_order, sibling_count, is_linear`
- DLMDataLoader expects: `x, y, z, t, n_parts, depth_level, sibling_index, confidence`
Solution: Create database migration script or adapter that handles both schemas:
```python
def load_coordinates_flexible(loader, message_ids):
    """Try the DLM schema first, then fall back to the IRCP schema."""
    try:
        # DLM schema (x, y, z, t, ...)
        return loader._load_coordinates_batch(message_ids)
    except Exception:
        # Fall back to the IRCP schema (x_coord, y_coord, ...).
        # load_ircp_coordinates is a helper that still has to be written.
        return load_ircp_coordinates(loader.conn, message_ids)
```
#### Issue 3: ConversationGraph Structure
Problem: Both have ConversationGraph but with different attributes
- IRCP: uses `edges` and `reverse_edges` for parent-child relationships
- DLM: uses `root_ids` and traversal methods
Solution: Create adapter that converts DLM ConversationGraph to IRCP format:
```python
def dlm_graph_to_ircp_graph(dlm_graph: ConversationGraph) -> ConversationGraph:
    """Convert DLM ConversationGraph to IRCP ConversationGraph"""
    # Both packages name their classes ConversationGraph / ConversationNode;
    # here the constructors refer to the IRCP versions.
    ircp_graph = ConversationGraph()
    for node_id, dlm_node in dlm_graph.nodes.items():
        # DLM coordinates are Optional, so convert only when present
        coords = (
            dlm_to_ircp_coordinates(dlm_node.coordinates)
            if dlm_node.coordinates is not None
            else None
        )
        # Convert DLM ConversationNode to IRCP ConversationNode
        ircp_node = ConversationNode(
            message_id=dlm_node.message_id,
            content=dlm_node.content,
            author=dlm_node.author,
            timestamp=dlm_node.timestamp,
            parent_id=dlm_node.parent_id,
            coordinates=coords,
            embedding=dlm_node.embedding,
            token_count=dlm_node.token_count,
            metadata=dlm_node.metadata,
        )
        ircp_graph.add_node(ircp_node)
    return ircp_graph
```
5.4 Advantages of Integration
1. Unified Data Loading: Single DLMDataLoader for all coordinate-based training
2. Better Caching: DLMDataLoader has coordinate cache in addition to embedding cache
3. Improved Parallelization: AsyncIO-ready architecture
4. Reduced Code Duplication: Eliminate duplicate database access logic
5. Better Testing: DLMDataLoader has comprehensive test suite
6. Context Manager Support: Automatic resource cleanup
7. Better Logging: Integrated with DLMConfig logging system
---
6. Coordinate and Embedding Usage in Training
6.1 Coordinates in Training
Usage in ICPDataset (`__getitem__`):
```python
coords = torch.tensor(
    [dp.coordinates.x, dp.coordinates.y, dp.coordinates.z, dp.coordinates.t],
    dtype=torch.float32,
)
```
Usage in Loss Function:
1. Coordinate Prediction Loss: Direct MSE between predicted and actual 4D coordinates
2. Embedding Consistency Loss: Uses coordinate distances (pairwise L2 distance)
3. Conservation Constraint Loss: Covariance matrix of coordinates
4. Topological Consistency Loss: k-nearest neighbor distances in coordinate space
Critical Requirements:
- Coordinates must be valid floats (not NaN)
- All 4 dimensions (x, y, z, t) required
- Value ranges should be normalized for stability
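The source does not say which normalization is intended. As one possibility, the sketch below applies per-dimension min-max scaling so that each of x, y, z, and t falls in [0, 1]; `normalize_coordinates` is a hypothetical helper, and mapping a constant dimension to 0.0 is an assumption.

```python
from typing import List, Sequence


def normalize_coordinates(coords: Sequence[Sequence[float]]) -> List[List[float]]:
    """Min-max scale each coordinate dimension to [0, 1] across the dataset."""
    if not coords:
        return []
    dims = len(coords[0])
    lows = [min(c[d] for c in coords) for d in range(dims)]
    highs = [max(c[d] for c in coords) for d in range(dims)]
    normalized = []
    for c in coords:
        row = []
        for d in range(dims):
            span = highs[d] - lows[d]
            # A dimension with a single value carries no information; map it to 0.
            row.append((c[d] - lows[d]) / span if span > 0 else 0.0)
        normalized.append(row)
    return normalized
```

If this kind of scaling is used, the minimum and maximum should be computed on the training split and reused for validation and test data so that all splits share one scale.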
6.2 Embeddings in Training
Usage in ICPDataset (`__getitem__`):
```python
embedding = torch.tensor(dp.embedding, dtype=torch.float32)
```
Usage in Model Forward Pass:
```python
model_output = self.model(embeddings)  # Shape: [batch_size, 4]
```
Usage in Loss Function:
1. Coordinate Prediction Loss: Model input (embeddings) -> predicted coordinates
2. Embedding Consistency Loss: Cosine similarity between embedding pairs
3. Conservation Constraint Loss: Covariance matrix and determinant
4. Topological Consistency Loss: Pairwise embedding distances (L2)
Critical Requirements:
- Embeddings must be non-empty numpy arrays
- No NaN values in embeddings
- Typical dimensions: 384-768 (varies by encoder)
6.3 Data Quality Validation
Current Validation in ICPDataset:
```python
if (
    dp.embedding is not None
    and len(dp.embedding) > 0
    and not np.any(np.isnan(dp.embedding))
    and dp.coordinates is not None
):
    self.valid_data_points.append(dp)
```
Recommendations for DLMDataLoader Integration:
- Add similar validation in adapter layer
- Log rejection reasons (missing coordinates, NaN embeddings, etc.)
- Provide statistics on data quality
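These recommendations could be met by a filter that applies the same four checks as `ICPDataset` and also records why each point was rejected. The sketch below assumes only that a data point exposes `embedding` and `coordinates` attributes; `PointStub`, `rejection_reason`, and `filter_valid` are hypothetical names, and the reason labels are illustrative.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Any, Iterable, List, Optional, Sequence, Tuple


@dataclass
class PointStub:
    """Stand-in for ICPDataPoint carrying only the validated fields."""
    embedding: Optional[Sequence[float]]
    coordinates: Optional[Any]


def rejection_reason(dp: Any) -> Optional[str]:
    """Return None for a valid point, otherwise a short reason label."""
    if dp.embedding is None:
        return "missing_embedding"
    if len(dp.embedding) == 0:
        return "empty_embedding"
    if any(math.isnan(v) for v in dp.embedding):
        return "nan_embedding"
    if dp.coordinates is None:
        return "missing_coordinates"
    return None


def filter_valid(points: Iterable[Any]) -> Tuple[List[Any], Counter]:
    """Split points into the valid ones and a count of rejection reasons."""
    valid: List[Any] = []
    reasons: Counter = Counter()
    for dp in points:
        reason = rejection_reason(dp)
        if reason is None:
            valid.append(dp)
        else:
            reasons[reason] += 1
    return valid, reasons
```

The returned `Counter` can be logged once per load and doubles as a simple data quality statistic.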
---
7. Existing Database Loader Classes
7.1 IRCP DatabaseLoader Summary
Location: `/packages/ircp/data/database_loader.py`
Strengths:
- Efficient batch loading for coordinates and embeddings
- Caching of embeddings to reduce memory pressure
- Parallel conversation loading with ThreadPoolExecutor
- Relationship data loading
- Quality metrics computation
- Comprehensive statistics gathering
Weaknesses:
- Coupling to IRCP coordinate system (DLMCoordinates)
- No context manager support
- Limited logging (uses standard logging)
- Parallel loading not easily configurable
- No coordinate caching
7.2 TPO ConversationDataLoader Summary
Location: `/packages/tpo/training/trainer.py` (imports from dataset module)
Note: TPO has its own TPODataset class that wraps preferences, not conversations directly. It relies on preference generation from conversations.
7.3 DLMDataLoader Summary
Location: `/packages/dlm/core/data_loader.py`
Strengths:
- Clean dataclass-based design
- Both coordinate and embedding caching
- Context manager support for resource cleanup
- Iterator pattern for memory-efficient loading
- Integrated with DLMConfig system
- Better error handling and logging
- Flexible database schema handling
Weaknesses:
- Newer (less tested in production)
- Coordinate system is different from IRCP's
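The iterator pattern listed among the strengths matters mostly for memory: graphs are produced a few at a time instead of being held in one list. The generator below sketches that idea on top of any `load_conversation`-style callable that returns `None` for missing conversations. `iter_in_chunks` is a hypothetical helper, not a method of DLMDataLoader.

```python
from typing import Callable, Iterator, List, Optional, Sequence, TypeVar

G = TypeVar("G")


def iter_in_chunks(
    conversation_ids: Sequence[str],
    load_fn: Callable[[str], Optional[G]],
    chunk_size: int = 100,
) -> Iterator[List[G]]:
    """Yield lists of loaded graphs, at most chunk_size at a time."""
    for start in range(0, len(conversation_ids), chunk_size):
        chunk_ids = conversation_ids[start:start + chunk_size]
        loaded = (load_fn(cid) for cid in chunk_ids)
        # Conversations that fail to load are skipped rather than yielded as None.
        yield [graph for graph in loaded if graph is not None]
```

A caller can convert each chunk to data points and discard the graphs before the next chunk is loaded, so peak memory depends on `chunk_size` rather than on the size of the database.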
---
8. Training Loop and Data Pipeline Details
8.1 Complete Training Pipeline
```python
# 1. DATA LOADING PHASE
conversation_ids = loader.get_conversation_ids()  # Filter by min_messages, etc.
graphs = loader.load_conversations_parallel(conversation_ids)
icp_data_points = loader.create_icp_dataset(graphs)

# 2. DATA SPLITTING PHASE
train_data = icp_data_points[:split_train]
val_data = icp_data_points[split_train:split_train + split_val]
test_data = icp_data_points[split_train + split_val:]

# 3. DATASET CREATION PHASE
train_dataloader = trainer._create_dataloader(train_data, mode="train")
val_dataloader = trainer._create_dataloader(val_data, mode="val")

# 4. TRAINING LOOP PHASE
for epoch in range(epochs):
    # 4a. Training Phase
    for batch in train_dataloader:
        batch_to_device()
        optimizer.zero_grad()  # clear gradients left over from the previous step
        loss = trainer._compute_loss(batch)
        loss.backward()
        optimizer.step()
    scheduler.step()  # once per epoch, as in Section 1.4

    # 4b. Validation Phase
    with torch.no_grad():
        for batch in val_dataloader:
            loss = trainer._compute_loss(batch)

    # 4c. Checkpoint Saving
    if is_best_epoch:
        save_checkpoint()

# 5. MODEL EXPORT PHASE
trainer.export_model(export_path, format="pytorch")
```
8.2 Batch Processing Details
Batch Creation (ICPDataset.__getitem__):
```python
batch = {
    "embedding": ...,        # torch.Tensor (embedding_dim,), input to model
    "coordinates": ...,      # torch.Tensor (4,), target for loss
    "target": ...,           # torch.Tensor, train/val mode specific
    "message_id": ...,       # str
    "conversation_id": ...,  # str
    "author": ...,           # str
}
```
Model Forward Pass:
```python
output = model(batch["embedding"])
if isinstance(output, dict):
    predicted_coords = output["coordinates"]  # [batch_size, 4]
else:
    predicted_coords = output  # [batch_size, 4]
```
Loss Computation:
```python
# Requires tensors on device; Tensor.to() returns a new tensor, so reassign
embeddings = embeddings.to(device)
coordinates = coordinates.to(device)
targets = targets.to(device)

# Multi-component loss calculation
total_loss = (
    coord_loss * 1.0
    + consistency_loss * 0.1
    + conservation_loss * 0.05
    + topological_loss * 0.1
    + l2_regularization * 1e-5
)
```
8.3 Device Management
Current Approach:
```python
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if hasattr(self.model, "to"):
    self.model.to(self.device)

# In training loop (Tensor.to() is not in-place, so keep the result)
batch_tensor = batch_tensor.to(self.device)
```
Recommendations for DLMDataLoader:
- DLMDataLoader is device-agnostic (returns numpy arrays)
- Conversion to torch tensors happens in ICPDataset.__getitem__
- No changes needed for device management
---
9. Integration Implementation Roadmap
### Phase 1: Adapter Layer (Low Risk)
Duration: 2-3 hours
Files to Create:
1. `/packages/ircp/data/dlm_adapter.py` - Coordinate and graph converters
2. `/packages/ircp/data/dlm_data_loader.py` - Wrapper around DLMDataLoader
Key Functions:
```python
def dlm_coordinate_to_ircp(dlm_coord: DLMCoordinate) -> DLMCoordinates:
    """Convert DLM coordinate to IRCP format"""


def ircp_coordinate_to_dlm(ircp_coord: DLMCoordinates) -> DLMCoordinate:
    """Convert IRCP coordinate to DLM format"""


def create_icp_dataset_from_dlm_graphs(
    graphs: List[ConversationGraph]
) -> List[ICPDataPoint]:
    """Convert DLM conversation graphs to IRCP data points"""


class DLMDataLoaderAdapter:
    """Adapter to use DLMDataLoader with IRCP trainer"""

    # Remaining ratios assumed to mirror ConversationDataLoader.load_training_data
    def load_training_data(self, train_ratio=0.8, validation_ratio=0.1, test_ratio=0.1):
        # Uses DLMDataLoader internally
        # Returns IRCP format (List[ICPDataPoint])
        ...
```
### Phase 2: Trainer Modifications (Medium Risk)
Duration: 2-3 hours
Modifications:
1. Add alternative data loading path in ICPTrainer
2. Add configuration option to switch between loaders
3. Ensure backward compatibility
Code Changes:
```python
class ICPTrainer:
    def train(self,
              train_data: Union[List[ICPDataPoint], DLMDataLoader],
              val_data: Optional[Union[List[ICPDataPoint], DLMDataLoader]] = None,
              use_dlm_loader: bool = False):
        """Enhanced train method supporting both data sources"""
        if use_dlm_loader:
            train_data = self._prepare_dlm_data(train_data)
            val_data = self._prepare_dlm_data(val_data) if val_data is not None else None
        # Rest of training loop remains unchanged
```
### Phase 3: Testing and Validation (Medium Risk)
Duration: 3-4 hours
Test Cases:
1. Verify coordinate conversion preserves values
2. Compare training results with both loaders
3. Check embedding loading and caching
4. Validate performance improvements
5. Test with different database schemas
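Test case 1 can be expressed as a round-trip check: converting a DLM coordinate to the IRCP format and back should reproduce every field. The sketch below uses simplified dataclass stand-ins for the two coordinate classes and its own converters, all of which are assumptions made so the example is self-contained; in the real test the project's classes and adapter functions would be passed in instead.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Callable, Dict


@dataclass
class DLMCoordinate:  # simplified stand-in for the DLM class
    x: float
    y: float
    z: float
    t: float
    n_parts: int
    depth_level: int
    sibling_index: int
    confidence: float = 1.0


@dataclass
class DLMCoordinates:  # simplified stand-in for the IRCP class
    x: float
    y: float
    z: float
    t: float
    depth: int = 0
    sibling_count: int = 0
    is_linear: bool = False
    confidence: float = 1.0
    metadata: Dict[str, Any] = field(default_factory=dict)


def to_ircp(c: DLMCoordinate) -> DLMCoordinates:
    return DLMCoordinates(
        x=c.x, y=c.y, z=c.z, t=c.t,
        depth=c.depth_level, sibling_count=c.n_parts,
        confidence=c.confidence,
        metadata={"sibling_order": c.sibling_index},
    )


def to_dlm(c: DLMCoordinates) -> DLMCoordinate:
    return DLMCoordinate(
        x=c.x, y=c.y, z=c.z, t=c.t,
        n_parts=c.sibling_count, depth_level=c.depth,
        sibling_index=c.metadata.get("sibling_order", 0),
        confidence=c.confidence,
    )


def round_trip_is_lossless(
    coord: DLMCoordinate,
    forward: Callable[[DLMCoordinate], DLMCoordinates] = to_ircp,
    backward: Callable[[DLMCoordinates], DLMCoordinate] = to_dlm,
) -> bool:
    """True when DLM -> IRCP -> DLM reproduces every field exactly."""
    return asdict(backward(forward(coord))) == asdict(coord)
```

Comparing all fields, not only x, y, z, and t, is what catches mismatched metadata keys or dropped attributes such as `sibling_index`.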
### Phase 4: Documentation and Migration (Low Risk)
Duration: 2-3 hours
Deliverables:
1. Migration guide for existing code
2. Usage examples
3. Troubleshooting guide
4. Performance comparison report
---
10. Compatibility Checklist
### Critical (Must Have)
- [ ] Coordinate system conversion with no precision loss
- [ ] Embedding loading produces identical arrays
- [ ] Training results within 1%
- [ ] Backward compatibility maintained (old code still works)
### Important (Should Have)
- [ ] Improved performance metrics documented
- [ ] Better error messages for data issues
- [ ] Comprehensive logging
- [ ] Unit tests for adapter layer
### Nice to Have
- [ ] Configuration in single place
- [ ] Automatic schema detection
- [ ] Performance profiling tools
- [ ] Data quality reports
---
11. Code Examples
Example 1: Using Current IRCP Loader
```python
from ircp.data.database_loader import ConversationDataLoader
from ircp.training.icp_trainer import ICPTrainer

# Current approach
loader = ConversationDataLoader(db_path)
train_data, val_data, test_data = loader.load_training_data()
config = {"epochs": 50, "batch_size": 32}  # plus any other trainer settings
trainer = ICPTrainer(model, config)
results = trainer.train(train_data, val_data)
```
Example 2: Using DLMDataLoader with Adapter
```python
from dlm.core.data_loader import DLMDataLoader
from ircp.training.icp_trainer import ICPTrainer
from ircp.data.dlm_adapter import create_icp_dataset_from_dlm_graphs

# New approach (dlm_config is a DLMConfig, separate from the trainer's config dict)
with DLMDataLoader(db_path, dlm_config) as loader:
    conv_ids = loader.get_conversation_ids()
    n_total = len(conv_ids)
    train_ids = conv_ids[:int(0.8 * n_total)]
    val_ids = conv_ids[int(0.8 * n_total):]
    train_graphs = list(loader.load_conversations(train_ids))
    val_graphs = list(loader.load_conversations(val_ids))

train_data = create_icp_dataset_from_dlm_graphs(train_graphs)
val_data = create_icp_dataset_from_dlm_graphs(val_graphs)
config = {"epochs": 50, "batch_size": 32}  # plus any other trainer settings
trainer = ICPTrainer(model, config)
results = trainer.train(train_data, val_data)
```
Example 3: Flexible Trainer with Both Loaders
```python
# Future unified approach
from ircp.training.icp_trainer import ICPTrainer

trainer = ICPTrainer(model, config)

# Can accept any one of these formats
train_data = icp_data_points_list          # Traditional
train_data = dlm_loader                    # New DLMDataLoader
train_data = adapter.load_training_data()  # Hybrid
results = trainer.train(train_data)
```
---
12. Performance Considerations
### Memory Efficiency
- DLMDataLoader coordinate caching reduces repeated queries
- Both systems implement embedding caching
- Iterator pattern in DLMDataLoader better for large datasets
### Speed Improvements
- Expected 10-20%
- Parallel loading optimized in both systems
- Batch loading of coordinates and embeddings avoids N+1 queries
### Database Query Optimization
Both systems use:
- Batch loading instead of per-record queries
- WAL mode for concurrent access
- Cache pragmas for connection efficiency
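The connection settings named in this list can be applied with standard SQLite pragmas. The sketch below is illustrative only: the source names WAL mode and cache pragmas but not their values, so the cache size, the `synchronous` setting, and the function names are assumptions.

```python
import os
import sqlite3
import tempfile


def open_tuned_connection(db_path: str, cache_kib: int = 65536) -> sqlite3.Connection:
    """Open a SQLite connection with WAL journaling and a larger page cache."""
    conn = sqlite3.connect(db_path)
    # WAL lets readers proceed while a writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    # A negative cache_size is interpreted as a size in KiB rather than pages.
    conn.execute(f"PRAGMA cache_size=-{int(cache_kib)}")
    # NORMAL is a common companion to WAL (assumed here, not taken from the source).
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn


def demo() -> str:
    """Create a throwaway database and report the journal mode in effect."""
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_tuned_connection(os.path.join(tmp, "demo.db"))
        try:
            return conn.execute("PRAGMA journal_mode").fetchone()[0]
        finally:
            conn.close()
```

WAL mode requires a file-backed database; an in-memory database keeps its own journal mode, which is why the demo writes to a temporary directory.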
---
13. Risk Assessment
### Low Risk
- Coordinate conversion (isolated function)
- Embedding loading (identical schema and approach)
- Graph conversion (straightforward mapping)
- Adding configuration option
### Medium Risk
- Changes to trainer input handling
- Database schema variations
- Parallel loading edge cases
- Memory usage with large datasets
### High Risk
- None identified for adapter-based approach
---
14. Success Criteria
1. Functional Equivalence: DLMDataLoader produces equivalent training results
2. Performance: 10%
3. Maintainability: Reduced code duplication between IRCP and DLM loaders
4. Backward Compatibility: Existing code continues to work without changes
5. Testing: 95%
6. Documentation: Clear migration guide and examples
---
15. Dependencies and Version Requirements
### Required
- Python 3.8+
- PyTorch 1.9+
- numpy 1.20+
- sqlite3 (standard library)
### Optional
- matplotlib (for training visualization)
- wandb (for experiment tracking)
- pandas (for statistics)
### Package Dependencies
- `packages/dlm` must be imported successfully
- `packages/ircp` remains primary interface
- No external dependencies added
---
Conclusion
The IRCP training infrastructure is well-designed with sophisticated multi-component loss functions and a clean separation of concerns. The DLMDataLoader from Phase 3.1 can be integrated with minimal risk through an adapter layer, providing improved data loading efficiency and unified coordinate system support without breaking existing code.
Key integration points are:
1. Coordinate conversion (requires adapter)
2. Data loading (can use DLMDataLoader directly with conversion)
3. Training loop (no changes needed)
4. Loss computation (no changes needed)
The recommended approach is to implement the adapter layer first (Phase 1), maintaining full backward compatibility while enabling gradual migration to the unified DLMDataLoader system.
Source Anchor
Comp-Core/backend/cc-trajectory/legacy/cc-tpo-original/cc-tpo/docs/summaries/IRCP_TRAINING_INFRASTRUCTURE_ANALYSIS.md