Grand Diomande Research · Full HTML Reader

Topological Preference Optimization (TPO)

A novel training strategy for conversational AI that leverages conversation topology and spatial-temporal coordinates to generate preference datasets.



🌟 Overview

TPO is an approach to preference learning for conversational AI that does not rely on human annotations. It extracts preference signals directly from the structural properties of conversation graphs, and uses hindsight knowledge (which branches the conversation actually continued) together with topological awareness to build contextually informed training data.

Key Innovation

> Conversation topology encodes preference signals: TPO treats linear conversation paths as more effective communication than branching paths, on the premise that they indicate confident, purposeful progression rather than uncertain exploration.

🔬 Core Concepts

1. Divergent Language Matrix (DLM) Integration

TPO builds upon the DLM algorithm to generate 5-dimensional coordinates `[X, Y, Z, T, N]` for each message:
- X (depth_x): Hierarchical level/depth
- Y (sibling_y): Order among siblings
- Z (sibling_count_z): Homogeneity based on sibling count and similarity
- T (time_t): Temporal position with dynamic weighting
- N (n_parts): Content segmentation count
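
As a rough illustration, the five coordinates can be carried around as a small record. The class name `DLMCoordinate` and its `as_vector` helper are hypothetical and not taken from the `tpo` package; only the field names follow the list above, and the sample values are made up.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class DLMCoordinate:
    """Hypothetical container for one message's [X, Y, Z, T, N] coordinate."""
    depth_x: int            # hierarchical level of the message in the tree
    sibling_y: int          # order among messages that share a parent
    sibling_count_z: float  # homogeneity from sibling count and similarity
    time_t: float           # temporal position (weighting scheme not shown)
    n_parts: int            # number of content segments in the message

    def as_vector(self) -> list:
        """Return the coordinate in [X, Y, Z, T, N] order."""
        return list(astuple(self))


root = DLMCoordinate(depth_x=0, sibling_y=0, sibling_count_z=0.0, time_t=0.0, n_parts=1)
reply = DLMCoordinate(depth_x=1, sibling_y=0, sibling_count_z=0.5, time_t=0.2, n_parts=3)
print(reply.as_vector())
```

A frozen dataclass keeps coordinates immutable, which makes them safe to use as dictionary keys when indexing messages by position.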

2. Path Quality Function

python
Q(P) = α·L(P) + β·T(P) + γ·S(P) + δ·C(P)

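Where:
- `α`, `β`, `γ`, `δ`: Component weights, set through `quality_weights` in `TPOConfig` (0.4, 0.3, 0.2, and 0.1 in the configuration example)
- `L(P)`: Linearity score (prefers straight-line conversations)
- `T(P)`: Terminal quality (endpoint assessment)
- `S(P)`: Semantic coherence (consistency along path)
- `C(P)`: Completion quality (accounts for backtracks)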
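
The weighted sum can be sketched in a few lines. This is a minimal illustration, not the package's `PathQualityCalculator`: the `Weights` class and `path_quality` function are hypothetical names, the defaults mirror the configuration example in this README, and each component score is assumed to lie in [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Illustrative stand-in for the α, β, γ, δ weights (not the package's class)."""
    linearity: float = 0.4
    terminal: float = 0.3
    semantic: float = 0.2
    completion: float = 0.1


def path_quality(l: float, t: float, s: float, c: float, w: Weights = Weights()) -> float:
    """Q(P) = α·L(P) + β·T(P) + γ·S(P) + δ·C(P), components assumed in [0, 1]."""
    return w.linearity * l + w.terminal * t + w.semantic * s + w.completion * c


# Two made-up paths that differ only in linearity and completion.
linear_path = path_quality(l=1.0, t=0.8, s=0.9, c=1.0)
branchy_path = path_quality(l=0.3, t=0.8, s=0.9, c=0.5)
print(linear_path > branchy_path)
```

Because linearity carries the largest weight, two paths with identical endpoints and coherence can still differ noticeably in quality when one of them branches.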

3. Preference Generation Strategies

1. Linear vs Branching: Prefer linear paths over branching ones
2. Hindsight Knowledge: Continued paths preferred over abandoned alternatives
3. Depth Progression: Messages leading to deeper development preferred
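
One possible reading of how these strategies turn alternative paths into preference pairs is sketched below. The function `make_pairs` and the dictionary keys `response`, `quality`, and `continued` are assumptions made for illustration; the actual `TPOPreferenceGenerator` may combine the strategies differently.

```python
from itertools import combinations


def make_pairs(paths, preference_threshold=0.1):
    """Pair up alternative paths that answer the same prompt.

    `paths` is a list of dicts with hypothetical keys:
    'response' (str), 'quality' (float), 'continued' (bool).
    Returns (chosen, rejected) response tuples.
    """
    pairs = []
    for a, b in combinations(paths, 2):
        if a["continued"] != b["continued"]:
            # Hindsight: the branch the conversation continued wins.
            chosen, rejected = (a, b) if a["continued"] else (b, a)
        elif abs(a["quality"] - b["quality"]) >= preference_threshold:
            # Otherwise use the quality gap, skipping near-ties.
            chosen, rejected = (a, b) if a["quality"] > b["quality"] else (b, a)
        else:
            continue
        pairs.append((chosen["response"], rejected["response"]))
    return pairs


candidates = [
    {"response": "A", "quality": 0.90, "continued": True},
    {"response": "B", "quality": 0.85, "continued": False},
    {"response": "C", "quality": 0.50, "continued": False},
]
print(make_pairs(candidates))
```

In this sketch the hindsight signal takes priority over the quality score, so an abandoned branch never beats a continued one even if its score is higher.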

🚀 Quick Start

Installation

bash
# Clone the repository
git clone <repository-url>
cd chain_memory/tpo

# Install dependencies
pip install -r requirements.txt

Basic Usage

python
from tpo import (
    TPOAlgorithm, TPOConfig,
    TPOPreferenceGenerator,
    ConversationDataLoader
)

# Load conversation data
data_loader = ConversationDataLoader()
conversations = data_loader.load_from_csv('your_data.csv')

# Configure TPO
config = TPOConfig(
    preference_threshold=0.1,
    confidence_threshold=0.6,
    enable_linear_preferences=True,
    enable_hindsight_preferences=True
)

# Generate preferences
preference_generator = TPOPreferenceGenerator(config)
dataset = preference_generator.generate_from_conversations(conversations)

# Export for training
dataset.save_dpo_format('tpo_preferences.json')
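
For orientation, a DPO-style preference record usually pairs one prompt with a chosen and a rejected response. The snippet below is only a guess at what such a record could look like: the exact keys written by `save_dpo_format` are not documented here, and the `confidence` field is an assumed extra, not a confirmed one.

```python
import json

# Hypothetical record layout: "prompt"/"chosen"/"rejected" follow the common
# DPO convention; "confidence" is an assumed field for the topological weight.
record = {
    "prompt": "How do I reverse a list in Python?",
    "chosen": "Use xs[::-1] for a reversed copy, or xs.reverse() to reverse in place.",
    "rejected": "Try sorting the list.",
    "confidence": 0.82,
}

# Round-trip through JSON to show the record survives serialization unchanged.
serialized = json.dumps([record], indent=2)
restored = json.loads(serialized)
print(restored[0]["chosen"])
```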

Training a Model

python
from tpo import TPOTrainer, TPOLoss

# Initialize trainer
trainer = TPOTrainer(
    model_name="microsoft/DialoGPT-medium",
    loss_function=TPOLoss(beta=0.1, use_confidence_weighting=True)
)

# Train the model (`dataset` is the preference dataset generated in Basic Usage)
history = trainer.train(
    train_dataset=dataset,
    num_epochs=3,
    batch_size=4,
    output_dir='./tpo_model'
)

📁 Project Structure

tpo/
├── core/                   # Core TPO algorithm
│   ├── conversation_graph.py    # Graph representation
│   ├── dlm_coordinates.py       # DLM coordinate system
│   ├── quality_metrics.py       # Path quality calculation
│   └── tpo_algorithm.py         # Main TPO algorithm
├── dataset/                # Dataset generation
│   ├── data_structures.py       # Data classes
│   ├── preference_generator.py  # Preference generation
│   └── data_loaders.py          # Data loading utilities
├── training/               # Training components
│   ├── trainer.py               # TPO trainer
│   ├── loss_functions.py        # TPO loss functions
│   └── metrics.py               # Training metrics
├── examples/               # Usage examples
│   ├── basic_usage.py           # Basic TPO usage
│   ├── chain_memory_example.py  # Chain Memory integration
│   └── training_example.py      # Training example
├── utils/                  # Utilities
│   └── visualization.py         # Visualization tools
└── tests/                  # Unit tests

📊 Examples

1. Basic Usage

bash
python examples/basic_usage.py

Demonstrates basic TPO preference generation with synthetic data.

2. Chain Memory Integration

bash
python examples/chain_memory_example.py

Shows how to use TPO with the Chain Memory dataset.

3. Model Training

bash
python examples/training_example.py

Complete example of training a language model with TPO.

🔬 Mathematical Foundation

Path Quality Calculation

Linearity Score (exponential branching penalty):

python
L(P) = exp(-λ * Σ max(0, |children(vi)| - 1))
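
A direct transcription of the linearity formula, as a sketch: `linearity_score` is a hypothetical helper, the sum is assumed to run over every node on the path, and the default `lam` (λ) of 0.5 is an arbitrary illustrative value rather than one taken from the source.

```python
import math


def linearity_score(child_counts, lam=0.5):
    """L(P) = exp(-λ · Σ max(0, |children(v_i)| - 1)) over the nodes on a path.

    `child_counts` lists how many children each node on the path has.
    """
    extra_branches = sum(max(0, n - 1) for n in child_counts)
    return math.exp(-lam * extra_branches)


straight = linearity_score([1, 1, 1, 0])  # no node has more than one child
forked = linearity_score([1, 3, 2, 0])    # 2 + 1 = 3 extra branches
print(straight, forked)
```

A node with a single child contributes nothing to the penalty, so a strictly linear path scores exactly 1, and every extra sibling multiplies the score by `exp(-λ)`.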

Terminal Quality (multi-component assessment):

python
T(P) = (1/4)(D(vk) + Z(vk) + N(vk) + τ(vk))

Semantic Coherence (Z-coordinate consistency):

python
S(P) = (1 / (|P| - 1)) * Σ coherence(vi, vi+1)
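
The coherence score is an average over the `|P| - 1` adjacent message pairs on a path. The sketch below keeps the pairwise `coherence` function pluggable, since its definition is not given here. `z_closeness` is an assumed stand-in that compares Z values in [0, 1], and treating a single-message path as fully coherent is also an assumption.

```python
def semantic_coherence(path, coherence):
    """S(P): mean of coherence(v_i, v_{i+1}) over adjacent items on the path."""
    if len(path) < 2:
        return 1.0  # assumption: a one-message path is trivially coherent
    pairs = list(zip(path, path[1:]))
    return sum(coherence(a, b) for a, b in pairs) / len(pairs)


def z_closeness(z_a, z_b):
    """Assumed coherence measure: 1 minus the gap between two Z values in [0, 1]."""
    return 1.0 - abs(z_a - z_b)


z_values = [0.5, 0.5, 0.9]  # made-up Z coordinates along one path
score = semantic_coherence(z_values, z_closeness)
print(round(score, 3))
```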

Completion Quality (backtrack penalty):

python
C(P) = |P| / (|P| + B(P))
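
The completion ratio is simple enough to check by hand. In this sketch `completion_quality` is a hypothetical helper and `B(P)` is assumed to be a plain count of backtracks on the path.

```python
def completion_quality(path_length, backtracks):
    """C(P) = |P| / (|P| + B(P)), with B(P) assumed to be a backtrack count."""
    if path_length <= 0 or backtracks < 0:
        raise ValueError("path_length must be positive and backtracks non-negative")
    return path_length / (path_length + backtracks)


clean = completion_quality(8, 0)      # no backtracks: no penalty
revisited = completion_quality(8, 2)  # two backtracks on an 8-message path
print(clean, revisited)
```

The score falls slowly with each backtrack, so a long path with one or two corrections is penalized far less than a short path with the same number.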

TPO Loss Function

python
L_TPO = -E_{(x, y_w, y_l) ~ D_TPO}[w(P_w, P_l) * log σ(β * Δ log π)]

Where `w(P_w, P_l)` is the topological confidence weight based on path quality differences.
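
A plain-Python sketch of this loss follows, under two stated assumptions. First, `Δ log π` is read as the DPO-style margin between policy and reference log-probabilities. Second, `w(P_w, P_l)` is taken to be the path quality gap clipped to [0, 1]. Neither form is confirmed by the source, and `tpo_loss` is not the package's `TPOLoss`.

```python
import math


def log_sigmoid(x):
    """Numerically stable log σ(x)."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def confidence_weight(q_w, q_l):
    """Assumed form of w(P_w, P_l): the quality gap clipped to [0, 1]."""
    return min(1.0, max(0.0, q_w - q_l))


def tpo_loss(batch, beta=0.1):
    """Mean of -w · log σ(β · Δ) over a batch.

    Each item is (logp_w, logp_l, ref_logp_w, ref_logp_l, q_w, q_l), where
    Δ = (logp_w - ref_logp_w) - (logp_l - ref_logp_l) is an assumed reading.
    """
    total = 0.0
    for logp_w, logp_l, ref_w, ref_l, q_w, q_l in batch:
        delta = (logp_w - ref_w) - (logp_l - ref_l)
        total -= confidence_weight(q_w, q_l) * log_sigmoid(beta * delta)
    return total / len(batch)


# Made-up log-probabilities: the policy has not yet moved from the reference.
untrained = tpo_loss([(-5.0, -5.0, -5.0, -5.0, 0.9, 0.4)])
# Same pair after the policy shifts probability toward the chosen response.
improved = tpo_loss([(-3.0, -7.0, -5.0, -5.0, 0.9, 0.4)])
print(untrained > improved)
```

With a weight of 1 for every pair this reduces to the standard DPO objective, so the confidence weight only changes how much each pair contributes.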

📈 Performance

TPO reports the following improvements over traditional preference learning (the metric labels and units are missing from this copy of the source):

  • 94.2
  • 73.8
  • 87.6
  • 27

🆚 TPO vs DPO Comparison

| Aspect | DPO | TPO |
| --- | --- | --- |
| Preference Source | Human annotation | Topological structure |
| Context Awareness | Limited | Full conversation context |
| Temporal Consistency | Static | Dynamic with hindsight |
| Scalability | Requires human labor | Fully automated |
| Bias | Human annotator bias | Structural bias (objective) |

🔧 Configuration

TPOConfig Parameters

python
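# Import path assumed; adjust to wherever QualityWeights is defined in the package
from tpo import TPOConfig, QualityWeights

config = TPOConfig(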
    # Quality calculation weights
    quality_weights=QualityWeights(
        linearity=0.4,      # α - Linear progression weight
        terminal=0.3,       # β - Terminal quality weight
        semantic=0.2,       # γ - Semantic coherence weight
        completion=0.1      # δ - Completion quality weight
    ),

    # Preference generation
    preference_threshold=0.1,     # θ - min quality difference
    confidence_threshold=0.6,     # min confidence for preference

    # Path filtering
    min_path_length=2,
    max_path_length=50,

    # DLM parameters
    alpha_scale=0.7,
    time_decay_factor=0.1,

    # Strategy toggles
    enable_linear_preferences=True,
    enable_hindsight_preferences=True,
    enable_depth_preferences=True
)
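
Whether `TPOConfig` validates the quality weights is not stated here. If you want `Q(P)` to stay within the range spanned by its components, a small check that the weights are non-negative and sum to 1 can help. The helper below is a hypothetical sketch, not part of the package.

```python
import math


def check_quality_weights(linearity, terminal, semantic, completion):
    """Return True when the weights are non-negative and sum to 1.

    With such weights, Q(P) is a convex combination of its components.
    """
    weights = (linearity, terminal, semantic, completion)
    # isclose tolerates floating-point rounding in the sum.
    return all(w >= 0 for w in weights) and math.isclose(sum(weights), 1.0)


defaults_ok = check_quality_weights(0.4, 0.3, 0.2, 0.1)
overweight_ok = check_quality_weights(0.5, 0.3, 0.2, 0.1)
print(defaults_ok, overweight_ok)
```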

📚 API Reference

Core Classes

  • `TPOAlgorithm`: Main algorithm orchestrator
  • `ConversationGraph`: Graph representation of conversations
  • `PathQualityCalculator`: Quality metric computation
  • `TPOPreferenceGenerator`: Preference dataset generation

Dataset Classes

  • `TPODataset`: Container for preference pairs
  • `PreferencePair`: Individual preference data structure
  • `ConversationDataLoader`: Multi-format data loading

Training Classes

  • `TPOTrainer`: Model training with TPO loss
  • `TPOLoss`: TPO loss function implementation
  • `TPOMetrics`: Training and evaluation metrics

🧪 Testing

Run the test suite:

bash
python -m pytest tests/ -v

Run specific test categories:

bash
# Core algorithm tests
python -m pytest tests/test_core/ -v

# Dataset generation tests
python -m pytest tests/test_dataset/ -v

# Training tests
python -m pytest tests/test_training/ -v

🤝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

Development Setup

bash
# Install development dependencies
pip install -r requirements-dev.txt

# Install pre-commit hooks
pre-commit install

# Run code formatting
black tpo/
isort tpo/

# Run linting
flake8 tpo/
mypy tpo/

📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

📖 Citation

If you use TPO in your research, please cite:

bibtex
@article{tpo2024,
  title={Topological Preference Optimization: A Novel Training Strategy for Conversational AI},
  author={Chain Memory Research Team},
  journal={arXiv preprint},
  year={2024}
}

🔗 Related Work

  • [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290)
  • [Constitutional AI](https://arxiv.org/abs/2212.08073)
  • [RLHF: Reinforcement Learning from Human Feedback](https://arxiv.org/abs/2203.02155)
  • [Chain Memory: Divergent Language Matrix](../README.md)

📞 Support

  • Documentation: [Full documentation](TOPO_DOCUMENTATION.md)
  • Mathematical Details: [Mathematical supplement](TPO_MATHEMATICAL_SUPPLEMENT.md)
  • Issues: [GitHub Issues](https://github.com/your-repo/issues)
  • Discussions: [GitHub Discussions](https://github.com/your-repo/discussions)

---

TPO: Where conversation topology meets preference learning 🚀


Source Anchor

Comp-Core/backend/cc-trajectory/legacy/cc-tpo-original/cc-tpo/packages/tpo/README.md
