Grand Diomande Research · Full HTML Reader

IRCP Sentence Transformer Training

This directory contains the training pipeline for fine-tuning sentence transformers with IRCP coordinate-based supervision.

Overview

We fine-tune sentence transformers (such as `all-MiniLM-L6-v2`) to be IRCP-aware by using IRCP coordinate proximity as the similarity signal. The resulting embeddings encode conversation structure, intent depth, and temporal flow.

Why Train a Custom Model?

The default IRCP model freezes the sentence transformer encoder and only trains custom heads. This means:

❌ Embeddings are generic (not IRCP-aware)
❌ Can't learn conversation-specific patterns
❌ Limited by pre-trained semantic similarity

With fine-tuning:

✅ End-to-end training of embeddings
✅ Learn IRCP-specific patterns
✅ Better coordinate prediction
✅ Improved conversation understanding

Quick Start

Option 1: Use the Convenience Script (Recommended)

bash

# From project root
./scripts/train_ircp_sentence_transformer.sh \
    path/to/conversations.db \
    ./models/ircp_st

This will:
1. Prepare training data from your database
2. Train the sentence transformer
3. Test the model
4. Save to `./models/ircp_st`

Option 2: Manual Step-by-Step

Step 1: Prepare Training Data

bash

python -m ircp.training.prepare_sentence_transformer_data \
    --database path/to/conversations.db \
    --output-dir ./ircp_training_data \
    --min-similarity 0.0 \
    --max-similarity 1.0 \
    --sample-rate 1.0 \
    --negative-ratio 0.3 \
    --train-split 0.8 \
    --val-split 0.1

Output:
- `train_pairs.jsonl` - Training pairs with similarity labels
- `val_pairs.jsonl` - Validation pairs
- `test_pairs.jsonl` - Test pairs
- `train_triplets.jsonl` - Contrastive triplets
- `dataset_stats.json` - Dataset statistics
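
With `--train-split 0.8` and `--val-split 0.1`, the remaining 10% of pairs goes to the test file. The sketch below shows one way such a split could work. It assumes a seeded shuffle over individual pairs; the function name `split_pairs` is hypothetical, and the actual script may split differently (for example, by conversation).

```python
import random

def split_pairs(pairs, train_split=0.8, val_split=0.1, seed=42):
    """Shuffle and split pairs into train/val/test; test gets the remainder."""
    if not 0 < train_split + val_split <= 1:
        raise ValueError("train_split + val_split must be in (0, 1]")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_split)
    n_val = int(len(shuffled) * val_split)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Integers stand in for pair records here
train, val, test = split_pairs(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

If pairs from the same conversation end up in different splits, validation scores can look better than they should, so check how the real script groups pairs before relying on them.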

Step 2: Train Sentence Transformer

bash

python -m ircp.training.train_sentence_transformer \
    --base-model "sentence-transformers/all-MiniLM-L6-v2" \
    --max-seq-length 256 \
    --num-epochs 5 \
    --batch-size 16 \
    --learning-rate 2e-5 \
    --warmup-steps 100 \
    --loss-type cosine \
    --train-data ./ircp_training_data/train_pairs.jsonl \
    --val-data ./ircp_training_data/val_pairs.jsonl \
    --output-dir ./models/ircp_sentence_transformer \
    --checkpoint-steps 500 \
    --eval-steps 500 \
    --use-amp

Step 3: Use the Trained Model

python

from sentence_transformers import SentenceTransformer

# Load your fine-tuned model
model = SentenceTransformer('./models/ircp_sentence_transformer')

# Generate embeddings
texts = [
    "How do I deploy my application?",
    "What is the deployment process?",
]
embeddings = model.encode(texts)

# Calculate similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(embeddings[0:1], embeddings[1:2])[0][0]
print(f"Similarity: {similarity:.4f}")

Configuration

Model Options

Base Model	Embedding Dim	Speed	Quality	Recommended For
`all-MiniLM-L6-v2`	384	⚡⚡⚡ Fast	⭐⭐ Good	Production, real-time
`all-mpnet-base-v2`	768	⚡⚡ Medium	⭐⭐⭐ Best	High accuracy needed
`paraphrase-multilingual-MiniLM-L12-v2`	384	⚡⚡ Medium	⭐⭐ Good	Multilingual

Recommendation: Start with `all-MiniLM-L6-v2` for the best speed/quality balance.

Loss Functions

Loss Type	Description	Use When
`cosine`	Regression on similarity scores	You have continuous similarity labels (RECOMMENDED)
`contrastive`	Binary similar/dissimilar	You have binary labels
`triplet`	Anchor, positive, negative	You have (anchor, positive, negative) triplets

Recommendation: Use `cosine` loss with IRCP coordinate-based similarity.

Training Parameters

json

{
  "num_epochs": 5,
  "batch_size": 16,
  "learning_rate": 2e-5,
  "warmup_steps": 100,
  "max_seq_length": 256,
  "use_amp": true
}

- `num_epochs` - More epochs give a better fit but raise the risk of overfitting
- `batch_size` - Larger batches train faster but need more memory
- `learning_rate` - Lower values give more stable training
- `warmup_steps` - Number of steps over which the learning rate ramps up
- `max_seq_length` - Maximum tokens per text
- `use_amp` - Automatic mixed precision (faster on supported GPUs)
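
To see how these values interact, the sketch below computes the step count for the ~10k-pair dataset mentioned under Expected Results and the learning rate at a few points in training. It assumes linear warmup followed by linear decay to zero; the training script's actual scheduler is not specified here, and the function names are hypothetical.

```python
import math

def steps_per_epoch(num_pairs, batch_size):
    """Number of optimizer steps in one pass over the data."""
    return math.ceil(num_pairs / batch_size)

def lr_at_step(step, total_steps, base_lr=2e-5, warmup_steps=100):
    """Linear warmup to base_lr, then linear decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

per_epoch = steps_per_epoch(10_000, 16)  # 625 steps per epoch
total = per_epoch * 5                    # 3125 steps over 5 epochs

for step in (0, 50, 100, total // 2, total):
    print(f"step {step:>4}: lr = {lr_at_step(step, total):.2e}")
```

With these numbers, 100 warmup steps cover about 3% of training. For a much smaller dataset the same setting would take up a larger share of the run, so it is worth scaling `warmup_steps` with the total step count.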

Training Data Format

Pair Format (for cosine/contrastive loss)

jsonl

{"text1": "How do I deploy?", "text2": "What's the deployment process?", "similarity": 0.92, "coordinates1": {"x": 0.5, "y": 0.2, "z": 0.8, "t": 0.1}, "coordinates2": {"x": 0.52, "y": 0.19, "z": 0.79, "t": 0.12}, "metadata": {...}}
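
A malformed line can fail late in training, so it can help to validate records up front. The sketch below checks one line against the fields shown in the example above. The function name `parse_pair` is hypothetical, and `metadata` is treated as optional because its contents are not specified here.

```python
import json

REQUIRED_FIELDS = ("text1", "text2", "similarity", "coordinates1", "coordinates2")
COORD_KEYS = ("x", "y", "z", "t")

def parse_pair(line):
    """Parse one JSONL line and check the fields shown in the example above."""
    record = json.loads(line)
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not 0.0 <= record["similarity"] <= 1.0:
        raise ValueError(f"similarity out of range: {record['similarity']}")
    for field in ("coordinates1", "coordinates2"):
        absent = [k for k in COORD_KEYS if k not in record[field]]
        if absent:
            raise ValueError(f"{field} is missing keys: {absent}")
    return record

# Sample line built from the example above; metadata is left empty here
sample_line = json.dumps({
    "text1": "How do I deploy?",
    "text2": "What's the deployment process?",
    "similarity": 0.92,
    "coordinates1": {"x": 0.5, "y": 0.2, "z": 0.8, "t": 0.1},
    "coordinates2": {"x": 0.52, "y": 0.19, "z": 0.79, "t": 0.12},
    "metadata": {},
})
pair = parse_pair(sample_line)
print(pair["text1"], "|", pair["text2"], "|", pair["similarity"])
```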

Triplet Format (for triplet loss)

jsonl

{"anchor": "How do I deploy?", "positive": "What's the deployment process?", "negative": "I like pizza", "anchor_coords": {...}, "positive_coords": {...}, "negative_coords": {...}}
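
To show what a triplet asks of the embeddings, here is a minimal sketch of the standard margin-based triplet objective: the anchor should sit closer to the positive than to the negative by at least a margin. Toy 2-D vectors stand in for real embeddings, the margin of 1.0 is illustrative, and the function names are hypothetical. This is not the library's implementation.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther away than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor = [0.0, 0.0]
positive = [0.0, 1.0]       # distance 1.0 from the anchor
far_negative = [3.0, 4.0]   # distance 5.0: constraint satisfied
near_negative = [0.0, 1.5]  # distance 1.5: inside the margin

print(triplet_loss(anchor, positive, far_negative))   # 0.0
print(triplet_loss(anchor, positive, near_negative))  # 0.5
```

Only triplets that violate the margin produce a non-zero loss, so negatives that are already far from the anchor (such as the pizza example) contribute little once training is under way.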

How Similarity is Calculated

Similarity is derived from IRCP coordinate proximity using weighted Euclidean distance:

python

import math

def calculate_similarity(coord1, coord2):
    # coord1 and coord2 are dicts with keys x, y, z, t (as in the pair format)
    # Weighted dimensions
    weights = {
        "x": 0.4,  # Intent depth (most important)
        "y": 0.2,  # Branching
        "z": 0.2,  # Consistency
        "t": 0.2,  # Temporal
    }

    # Weighted Euclidean distance
    weighted_dist = math.sqrt(
        weights["x"] * (coord1["x"] - coord2["x"]) ** 2 +
        weights["y"] * (coord1["y"] - coord2["y"]) ** 2 +
        weights["z"] * (coord1["z"] - coord2["z"]) ** 2 +
        weights["t"] * (coord1["t"] - coord2["t"]) ** 2
    )

    # Convert to similarity in (0, 1]; identical coordinates give 1.0
    similarity = math.exp(-weighted_dist)

    return similarity

Messages with similar coordinates get high similarity scores; messages with distant coordinates get low scores.
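
As a worked example, the sketch below applies the formula to the coordinates from the pair example under Training Data Format. It restates the formula compactly so the block runs on its own. The name `ircp_similarity` is hypothetical, and the last case assumes each coordinate lies in [0, 1], which this document does not state.

```python
import math

# Weights as listed above; they sum to 1.0
WEIGHTS = {"x": 0.4, "y": 0.2, "z": 0.2, "t": 0.2}

def ircp_similarity(c1, c2, weights=WEIGHTS):
    """exp(-weighted Euclidean distance) between two coordinate dicts."""
    dist = math.sqrt(sum(w * (c1[k] - c2[k]) ** 2 for k, w in weights.items()))
    return math.exp(-dist)

near_a = {"x": 0.5, "y": 0.2, "z": 0.8, "t": 0.1}
near_b = {"x": 0.52, "y": 0.19, "z": 0.79, "t": 0.12}
corner_lo = dict.fromkeys("xyzt", 0.0)  # assumed minimum on every axis
corner_hi = dict.fromkeys("xyzt", 1.0)  # assumed maximum on every axis

print(f"{ircp_similarity(near_a, near_a):.4f}")        # 1.0000
print(f"{ircp_similarity(near_a, near_b):.4f}")        # 0.9834
print(f"{ircp_similarity(corner_lo, corner_hi):.4f}")  # 0.3679
```

Two things follow from this sketch. The nearby pair scores about 0.98, higher than the 0.92 label in the format example, so the production weights or formula may differ from what is shown here. And if coordinates really are bounded by [0, 1], no pair can score below about 0.37, which is worth keeping in mind when choosing `--min-similarity` or interpreting low labels.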

Integration with IRCP Model

After training, integrate the fine-tuned model into your IRCP pipeline:

Option 1: Replace Encoder in SentenceTransformerICP

python

from ircp.models.sentence_transformer_icp import SentenceTransformerICP
from sentence_transformers import SentenceTransformer

# Load config
config = {...}

# Create IRCP model
model = SentenceTransformerICP(config)

# Replace with fine-tuned encoder
model.sentence_transformer = SentenceTransformer('./models/ircp_sentence_transformer')

# Now the IRCP model uses your fine-tuned embeddings!

Option 2: Use Fine-Tuned Config

Update your config to point to the fine-tuned model:

json

{
  "model_name": "./models/ircp_sentence_transformer",
  "embedding_dim": 384,
  "freeze_encoder": false,
  ...
}

Expected Results

After training on ~10k conversation pairs:

Validation Correlation: 0.75-0.85 (similarity prediction accuracy)
Coordinate Prediction: 15-25
Retrieval Accuracy: 20-30
Training Time: ~30-60 minutes on GPU, 2-4 hours on CPU

Monitoring Training

The trainer logs to console and saves checkpoints every 500 steps:

Epoch 1/5: 100%|████████| 625/625 [12:34<00:00, 1.20s/it]
Evaluation: Spearman: 0.7823, Pearson: 0.7654
✓ New best model saved!

Metrics:
- Spearman Correlation: Ranking quality (-1 to 1, higher is better)
- Pearson Correlation: Linear relationship (-1 to 1, higher is better)
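
The two metrics can disagree, which is why both are logged. The sketch below implements them in plain Python on toy scores to show the difference; it is an illustration, not the trainer's evaluator, and the function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: strength of the linear relationship."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

labels = [0.1, 0.2, 0.3, 0.4, 0.5]
squared = [v ** 2 for v in labels]  # same ordering, nonlinear scale
reversed_scores = labels[::-1]      # opposite ordering

print(f"Pearson  (squared):  {pearson(labels, squared):.4f}")
print(f"Spearman (squared):  {spearman(labels, squared):.4f}")
print(f"Spearman (reversed): {spearman(labels, reversed_scores):.4f}")
```

The squared predictions keep the ordering of the labels, so Spearman is 1.0 while Pearson falls slightly below 1. The reversed predictions give a Spearman of -1.0. For retrieval-style use, where only the ranking matters, Spearman is usually the more relevant of the two.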

Troubleshooting

Out of Memory

bash

# Reduce batch size
--batch-size 8

# Use gradient accumulation
--gradient-accumulation-steps 2

Training Too Slow

bash

# Enable mixed precision
--use-amp

# Use smaller model
--base-model "sentence-transformers/all-MiniLM-L6-v2"

# Reduce max sequence length
--max-seq-length 128

Poor Validation Score

- Increase epochs: `--num-epochs 10`
- Check data quality in `dataset_stats.json`
- Try a different loss: `--loss-type triplet`
- Adjust similarity weights in `prepare_sentence_transformer_data.py`

Advanced: Multi-Task Training

For even better results, train with multiple objectives:

python

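from sentence_transformers import losses

# Assumes `model` is a loaded SentenceTransformer, `dataloader1` yields
# labeled pairs, and `dataloader2` yields (anchor, positive, negative) triplets

# Combine losses
loss1 = losses.CosineSimilarityLoss(model)
loss2 = losses.TripletLoss(model)

# Both objectives update the same encoder
model.fit(
    train_objectives=[
        (dataloader1, loss1),
        (dataloader2, loss2),
    ],
    epochs=5,
)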

Files Created

`prepare_sentence_transformer_data.py` - Data preparation script
`train_sentence_transformer.py` - Training script
`sentence_transformer_config.json` - Default configuration
`../scripts/train_ircp_sentence_transformer.sh` - End-to-end convenience script

References

[Sentence Transformers Documentation](https://www.sbert.net/)
[Training Custom Models](https://www.sbert.net/docs/training/overview.html)
[Loss Functions](https://www.sbert.net/docs/package_reference/losses.html)

Next Steps

1. Train the model using your conversation database
2. Evaluate on held-out test set
3. Integrate into IRCP pipeline
4. Compare performance vs frozen encoder
5. Iterate on hyperparameters if needed

Good luck with training! 🚀

Source Anchor

Comp-Core/backend/cc-trajectory/legacy/cc-tpo-original/cc-tpo/packages/ircp/training/README_SENTENCE_TRANSFORMER.md
