IRCP Sentence Transformer Training
This directory contains the training pipeline for fine-tuning sentence transformers with IRCP coordinate-based supervision.
Overview
We fine-tune sentence transformers (like `all-MiniLM-L6-v2`) to be IRCP-aware by using IRCP coordinate proximity as the similarity signal. This creates embeddings that understand conversation structure, intent depth, and temporal flow.
Why Train a Custom Model?
The default IRCP model freezes the sentence transformer encoder and only trains custom heads. This means:
❌ Embeddings are generic (not IRCP-aware)
❌ Can't learn conversation-specific patterns
❌ Limited by pre-trained semantic similarity
With fine-tuning:
✅ End-to-end training of embeddings
✅ Learn IRCP-specific patterns
✅ Better coordinate prediction
✅ Improved conversation understanding
Quick Start
Option 1: Use the Convenience Script (Recommended)
# From project root
./scripts/train_ircp_sentence_transformer.sh \
path/to/conversations.db \
./models/ircp_st
This will:
1. Prepare training data from your database
2. Train the sentence transformer
3. Test the model
4. Save to `./models/ircp_st`
Option 2: Manual Step-by-Step
Step 1: Prepare Training Data
python -m ircp.training.prepare_sentence_transformer_data \
--database path/to/conversations.db \
--output-dir ./ircp_training_data \
--min-similarity 0.0 \
--max-similarity 1.0 \
--sample-rate 1.0 \
--negative-ratio 0.3 \
--train-split 0.8 \
--val-split 0.1
Output:
- `train_pairs.jsonl` - Training pairs with similarity labels
- `val_pairs.jsonl` - Validation pairs
- `test_pairs.jsonl` - Test pairs
- `train_triplets.jsonl` - Contrastive triplets
- `dataset_stats.json` - Dataset statistics
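A quick way to sanity-check the generated files is to read a pairs file and summarize its labels. The sketch below assumes the pair schema described under Training Data Format (`text1`, `text2`, `similarity`). The helper names `load_pairs` and `summarize` are illustrative and not part of the pipeline, and the demo rows are stand-in data.

```python
import json
import statistics
import tempfile
from pathlib import Path


def load_pairs(path):
    """Read one JSON object per non-empty line from a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def summarize(pairs):
    """Return basic statistics over the `similarity` labels."""
    sims = [p["similarity"] for p in pairs]
    return {
        "count": len(sims),
        "min": min(sims),
        "max": max(sims),
        "mean": statistics.fmean(sims),
    }


# Demo on a tiny stand-in file; point `load_pairs` at
# ./ircp_training_data/train_pairs.jsonl for real data.
demo_rows = [
    {"text1": "How do I deploy?", "text2": "What's the deployment process?", "similarity": 0.92},
    {"text1": "How do I deploy?", "text2": "I like pizza", "similarity": 0.10},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "train_pairs.jsonl"
    demo_path.write_text("\n".join(json.dumps(r) for r in demo_rows), encoding="utf-8")
    stats = summarize(load_pairs(demo_path))

print(stats)
```

A label distribution squeezed into a narrow band is worth noticing before training, since a regression-style loss has little signal to separate near from far pairs in that case.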
Step 2: Train Sentence Transformer
python -m ircp.training.train_sentence_transformer \
--base-model "sentence-transformers/all-MiniLM-L6-v2" \
--max-seq-length 256 \
--num-epochs 5 \
--batch-size 16 \
--learning-rate 2e-5 \
--warmup-steps 100 \
--loss-type cosine \
--train-data ./ircp_training_data/train_pairs.jsonl \
--val-data ./ircp_training_data/val_pairs.jsonl \
--output-dir ./models/ircp_sentence_transformer \
--checkpoint-steps 500 \
--eval-steps 500 \
--use-amp
Step 3: Use the Trained Model
from sentence_transformers import SentenceTransformer
# Load your fine-tuned model
model = SentenceTransformer('./models/ircp_sentence_transformer')
# Generate embeddings
texts = [
"How do I deploy my application?",
"What is the deployment process?",
]
embeddings = model.encode(texts)
# Calculate similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(embeddings[0:1], embeddings[1:2])[0][0]
print(f"Similarity: {similarity:.4f}")
Configuration
Model Options
| Base Model | Embedding Dim | Speed | Quality | Recommended For |
|---|---|---|---|---|
| `all-MiniLM-L6-v2` | 384 | ⚡⚡⚡ Fast | ⭐⭐ Good | Production, real-time |
| `all-mpnet-base-v2` | 768 | ⚡⚡ Medium | ⭐⭐⭐ Best | High accuracy needed |
| `paraphrase-multilingual-MiniLM-L12-v2` | 384 | ⚡⚡ Medium | ⭐⭐ Good | Multilingual |
Recommendation: Start with `all-MiniLM-L6-v2` for best speed/quality balance.
Loss Functions
| Loss Type | Description | Use When |
|---|---|---|
| `cosine` | Regression on similarity scores | You have continuous similarity labels (RECOMMENDED) |
| `contrastive` | Binary similar/dissimilar | You have binary labels |
| `triplet` | Anchor, positive, negative | You want contrastive learning |
Recommendation: Use `cosine` loss with IRCP coordinate-based similarity.
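To make "regression on similarity scores" concrete, here is a minimal sketch of the objective: the mean squared error between the cosine similarity of two embeddings and the target label. This mirrors what `CosineSimilarityLoss` in sentence-transformers computes by default; whether `--loss-type cosine` maps to exactly that class is an assumption. The function names and toy 2-D vectors are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_regression_loss(embedding_pairs, labels):
    """Mean squared error between predicted cosine similarity and the label."""
    errors = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(embedding_pairs, labels)]
    return sum(errors) / len(errors)


# Toy 2-D embeddings: one identical pair, one orthogonal pair.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
perfect = cosine_regression_loss(pairs, labels=[1.0, 0.0])  # labels match geometry
wrong = cosine_regression_loss(pairs, labels=[0.0, 1.0])    # labels contradict it
print(perfect, wrong)
```

Training pushes the encoder so that embedding geometry reproduces the coordinate-derived labels, which is why continuous labels suit this loss better than binary ones.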
Training Parameters
{
"num_epochs": 5, // More epochs = better fit (but risk overfitting)
"batch_size": 16, // Larger = faster, but needs more memory
"learning_rate": 2e-5, // Lower = more stable training
"warmup_steps": 100, // Gradual learning rate warmup
"max_seq_length": 256, // Maximum tokens per text
"use_amp": true // Automatic mixed precision (faster)
}
Training Data Format
Pair Format (for cosine/contrastive loss)
{"text1": "How do I deploy?", "text2": "What's the deployment process?", "similarity": 0.92, "coordinates1": {"x": 0.5, "y": 0.2, "z": 0.8, "t": 0.1}, "coordinates2": {"x": 0.52, "y": 0.19, "z": 0.79, "t": 0.12}, "metadata": {...}}
Triplet Format (for triplet loss)
{"anchor": "How do I deploy?", "positive": "What's the deployment process?", "negative": "I like pizza", "anchor_coords": {...}, "positive_coords": {...}, "negative_coords": {...}}
How Similarity is Calculated
Similarity is derived from IRCP coordinate proximity using weighted Euclidean distance:
import math

def calculate_similarity(coord1, coord2):
    # Per-dimension weights (they sum to 1.0)
    weights = {
        "x": 0.4,  # Intent depth (most important)
        "y": 0.2,  # Branching
        "z": 0.2,  # Consistency
        "t": 0.2,  # Temporal
    }
    # Weighted Euclidean distance
    weighted_dist = math.sqrt(
        weights["x"] * (coord1.x - coord2.x) ** 2
        + weights["y"] * (coord1.y - coord2.y) ** 2
        + weights["z"] * (coord1.z - coord2.z) ** 2
        + weights["t"] * (coord1.t - coord2.t) ** 2
    )
    # Convert distance to similarity in (0, 1]; identical coordinates give 1.0
    similarity = math.exp(-weighted_dist)
    return similarity

Messages with similar coordinates get high similarity scores; messages with different coordinates get low scores.
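As a worked example, the sketch below applies the same formula to the two coordinate sets from the Pair Format example, reading them as plain dictionaries the way they appear in the JSONL files. The function name `similarity_from_dicts` is hypothetical, and the lower-bound check assumes coordinates are normalized to [0, 1], which the text above does not state.

```python
import math

# Weights copied from the formula above.
WEIGHTS = {"x": 0.4, "y": 0.2, "z": 0.2, "t": 0.2}


def similarity_from_dicts(c1, c2, weights=WEIGHTS):
    """exp(-weighted Euclidean distance) over the x, y, z, t keys."""
    squared = sum(w * (c1[k] - c2[k]) ** 2 for k, w in weights.items())
    return math.exp(-math.sqrt(squared))


a = {"x": 0.5, "y": 0.2, "z": 0.8, "t": 0.1}
b = {"x": 0.52, "y": 0.19, "z": 0.79, "t": 0.12}
near = similarity_from_dicts(a, b)

# Assumption: coordinates are normalized to [0, 1]. The weights sum to 1,
# so the largest possible distance is 1 and the lowest score is exp(-1).
lo = {k: 0.0 for k in WEIGHTS}
hi = {k: 1.0 for k in WEIGHTS}
far = similarity_from_dicts(lo, hi)

print(f"near={near:.4f} far={far:.4f}")
```

If the [0, 1] assumption holds, the formula never produces scores below roughly 0.37, so lower labels would need to come from somewhere else, for example explicitly sampled negatives.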
Integration with IRCP Model
After training, integrate the fine-tuned model into your IRCP pipeline:
Option 1: Replace Encoder in SentenceTransformerICP
from ircp.models.sentence_transformer_icp import SentenceTransformerICP
from sentence_transformers import SentenceTransformer
# Load config
config = {...}
# Create IRCP model
model = SentenceTransformerICP(config)
# Replace with fine-tuned encoder
model.sentence_transformer = SentenceTransformer('./models/ircp_sentence_transformer')
# Now the IRCP model uses your fine-tuned embeddings!
Option 2: Use Fine-Tuned Config
Update your config to point to the fine-tuned model:
{
"model_name": "./models/ircp_sentence_transformer",
"embedding_dim": 384,
"freeze_encoder": false,
...
}
Expected Results
After training on ~10k conversation pairs:
- Validation Correlation: 0.75-0.85 (similarity prediction accuracy)
- Coordinate Prediction: 15-25
- Retrieval Accuracy: 20-30
- Training Time: ~30-60 minutes on GPU, 2-4 hours on CPU
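The dataset size and the default hyperparameters determine how many optimizer steps a run takes. The arithmetic below is a rough sketch using the values quoted in this README; the linear warmup shape and the constant rate afterwards are assumptions, since the actual scheduler is not described here.

```python
import math

num_pairs = 10_000      # "~10k conversation pairs" from the list above
batch_size = 16         # default from Training Parameters
num_epochs = 5
warmup_steps = 100
base_lr = 2e-5

steps_per_epoch = math.ceil(num_pairs / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_fraction = warmup_steps / total_steps


def warmup_lr(step):
    """Linear ramp from 0 to base_lr over warmup_steps, then constant.

    Assumption: the real scheduler may also decay the rate after warmup.
    """
    return base_lr * min(1.0, step / warmup_steps)


print(steps_per_epoch, total_steps, f"{warmup_fraction:.1%}")
print(warmup_lr(50), warmup_lr(100), warmup_lr(3000))
```

With these numbers an epoch is 625 steps, so checkpoints every 500 steps fall slightly less than once per epoch, and warmup covers only about 3% of the run.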
Monitoring Training
The trainer logs to console and saves checkpoints every 500 steps:
Epoch 1/5: 100%|████████| 625/625 [12:34<00:00, 1.20s/it]
Evaluation: Spearman: 0.7823, Pearson: 0.7654
✓ New best model saved!
Metrics:
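- Spearman Correlation: Rank agreement between predicted and target similarities (ranges from -1 to 1; higher is better)
- Pearson Correlation: Strength of the linear relationship between predicted and target similarities (ranges from -1 to 1; higher is better)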
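The difference between the two metrics is easiest to see on a small example. The sketch below implements both in plain Python for illustration: Spearman is the Pearson correlation of the ranks. Ties are not averaged here, so a library routine such as `scipy.stats.spearmanr` is the better choice for real evaluation. The sample values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; ties are not averaged in this simplified sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


labels = [0.1, 0.4, 0.6, 0.9]       # target similarities
predicted = [0.2, 0.3, 0.5, 0.99]   # same ordering, nonlinear spacing
reversed_pred = predicted[::-1]     # exactly the wrong ordering

print(spearman(labels, predicted), pearson(labels, predicted))
print(spearman(labels, reversed_pred))
```

Predictions that preserve the ordering score a perfect Spearman even when the spacing is off, which lowers Pearson; reversed predictions drive Spearman to its negative extreme.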
Troubleshooting
Out of Memory
# Reduce batch size
--batch-size 8
# Use gradient accumulation
--gradient-accumulation-steps 2
Training Too Slow
# Enable mixed precision
--use-amp
# Use smaller model
--base-model "sentence-transformers/all-MiniLM-L6-v2"
# Reduce max sequence length
--max-seq-length 128
Poor Validation Score
- Increase epochs: `--num-epochs 10`
- Check data quality in `dataset_stats.json`
- Try different loss: `--loss-type triplet`
- Adjust similarity weights in `prepare_sentence_transformer_data.py`
Advanced: Multi-Task Training
For even better results, train with multiple objectives:
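from sentence_transformers import losses
# `model` is a loaded SentenceTransformer; `dataloader1` yields labeled pairs
# and `dataloader2` yields (anchor, positive, negative) triplets.
# Combine losses
loss1 = losses.CosineSimilarityLoss(model)
loss2 = losses.TripletLoss(model)
model.fit(
    train_objectives=[
        (dataloader1, loss1),
        (dataloader2, loss2),
    ],
    epochs=5,
)
Files Created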
- `prepare_sentence_transformer_data.py` - Data preparation script
- `train_sentence_transformer.py` - Training script
- `sentence_transformer_config.json` - Default configuration
- `../scripts/train_ircp_sentence_transformer.sh` - End-to-end convenience script
References
- [Sentence Transformers Documentation](https://www.sbert.net/)
- [Training Custom Models](https://www.sbert.net/docs/training/overview.html)
- [Loss Functions](https://www.sbert.net/docs/package_reference/losses.html)
Next Steps
1. Train the model using your conversation database
2. Evaluate on held-out test set
3. Integrate into IRCP pipeline
4. Compare performance vs frozen encoder
5. Iterate on hyperparameters if needed
Good luck with training! 🚀
Source Anchor
Comp-Core/backend/cc-trajectory/legacy/cc-tpo-original/cc-tpo/packages/ircp/training/README_SENTENCE_TRANSFORMER.md