Grand Diomande Research · Full HTML Reader

Getting Started: Build Your Personal AI

You have **289 MB of personal data** across 5 files: - conversations.json (190 MB) - conversations_new.json (64 MB) - conversation_openai.json (8 MB) - notes.json (15 MB) - cc_conversations.json (13 MB)

Agents That Account for Themselves research note experiment writeup candidate score 24 .md

Full Public Reader

Getting Started: Build Your Personal AI

Quick guide to building your personalized AI inference system with full context memory.

Overview

You have 289 MB of personal data across 5 files:
- conversations.json (190 MB)
- conversations_new.json (64 MB)
- conversation_openai.json (8 MB)
- notes.json (15 MB)
- cc_conversations.json (13 MB)

We'll transform this into a personal AI that knows everything about you and your projects.

---

Prerequisites

Install Dependencies

bash
# Navigate to project
cd [home]/Desktop/Computational\ Choreography/cc-tpo

# Install required packages
pip install sentence-transformers scikit-learn tqdm numpy

That's it! The `embeddinggemma-300m` model will download automatically on first run (~600 MB).

---

Step-by-Step Guide

Phase 1: Unify Your Data (5-10 minutes)

Combines all 5 data sources into one unified knowledge base.

bash
python scripts/unify_personal_data.py

What it does:
- ✅ Loads all conversations from all sources
- ✅ Extracts clean message threads
- ✅ Deduplicates content
- ✅ Auto-categorizes by topic (CC, music, business, etc.)
- ✅ Creates `data/unified_knowledge.json`

Expected output:

📊 UNIFIED KNOWLEDGE BASE SUMMARY
   Total Conversations: ~500-800
   Total Messages: ~10,000-15,000
   Total Notes: ~1,000-2,000

---

Phase 2: Generate Embeddings (15-30 minutes)

Uses `sentence-transformers` with Google's `embeddinggemma-300m` model to create semantic embeddings.

bash
python scripts/generate_personal_embeddings.py

What it does:
- ✅ Downloads `embeddinggemma-300m` model (first run only)
- ✅ Generates embeddings for ALL your messages and notes
- ✅ Uses batched processing (~100 texts/second)
- ✅ Saves to `data/embeddings/personal_embeddings.npy`
- ✅ Tests semantic search with example queries

Expected output:

🔮 Generating embeddings for 12,000 texts...
   Batch size: 32
   Embedding dimension: 768
   [████████████████████] 100% complete
   ✅ Generated embeddings with shape: (12000, 768)

Time estimate:
- ~10,000 texts: 15-20 minutes
- ~15,000 texts: 25-30 minutes

---

Phase 3: Test Semantic Search (Immediate!)

Once embeddings are generated, you can immediately start searching your knowledge:

python
import numpy as np
import json
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load model
model = SentenceTransformer("google/embeddinggemma-300m")

# Load embeddings
embeddings = np.load('data/embeddings/personal_embeddings.npy')

# Load metadata
with open('data/embeddings/metadata.json') as f:
    data = json.load(f)
metadata = data['text_metadata']

# Your query
query = "How does LIM-RPS handle gesture detection?"

# Search
query_emb = model.encode([query], normalize_embeddings=True)
similarities = cosine_similarity(query_emb, embeddings)[0]
top_10 = similarities.argsort()[-10:][::-1]

# Results
print(f"Top 10 results for: {query}\n")
for rank, idx in enumerate(top_10, 1):
    meta = metadata[idx]
    score = similarities[idx]

    if meta['type'] == 'conversation_message':
        print(f"{rank}. [{score:.3f}] {meta['conversation_title']}")
    else:
        print(f"{rank}. [{score:.3f}] Note")

Output example:

Top 10 results for: How does LIM-RPS handle gesture detection?

1. [0.892] LIM-RPS overview
2. [0.854] Echelon DAW comparison
3. [0.832] Code implementation update
4. [0.801] Computational choreography explained
...

---

Quick Test

After completing Phase 1 & 2, test your system:

python
from sentence_transformers import SentenceTransformer
import numpy as np
import json
from sklearn.metrics.pairwise import cosine_similarity

# Initialize
model = SentenceTransformer("google/embeddinggemma-300m")
embeddings = np.load('data/embeddings/personal_embeddings.npy')

with open('data/embeddings/metadata.json') as f:
    metadata = json.load(f)['text_metadata']

# Test queries
queries = [
    "computational choreography",
    "music production tips",
    "distributor email",
    "recursive synthesis",
]

for query in queries:
    query_emb = model.encode([query], normalize_embeddings=True)
    similarities = cosine_similarity(query_emb, embeddings)[0]
    best_match_idx = similarities.argmax()

    meta = metadata[best_match_idx]
    score = similarities[best_match_idx]

    print(f"Query: '{query}'")
    print(f"  → Best match: {meta.get('conversation_title', 'Note')} (score: {score:.3f})")
    print()

---

Next Steps

Option A: Full Personal AI System (Week 2-4)

Continue with the complete implementation:
1. Build knowledge topology with I-RCP
2. Extract personal profile
3. Create PersonalAI inference system
4. Add persistent state

See: [PERSONALIZED_AI_SYSTEM_ARCHITECTURE.md](PERSONALIZED_AI_SYSTEM_ARCHITECTURE.md)

Option B: Quick Semantic Search Tool (Today!)

Build a simple search tool right now:

python
# save as: search_my_knowledge.py
from sentence_transformers import SentenceTransformer
import numpy as np
import json
from sklearn.metrics.pairwise import cosine_similarity
import sys

# Load
model = SentenceTransformer("google/embeddinggemma-300m")
embeddings = np.load('data/embeddings/personal_embeddings.npy')

with open('data/embeddings/metadata.json') as f:
    data = json.load(f)

metadata = data['text_metadata']

# Load full conversations
with open('data/unified_knowledge.json') as f:
    knowledge = json.load(f)

# Search function
def search(query, top_k=5):
    query_emb = model.encode([query], normalize_embeddings=True)
    similarities = cosine_similarity(query_emb, embeddings)[0]
    top_indices = similarities.argsort()[-top_k:][::-1]

    print(f"\n🔍 Search: '{query}'")
    print("="*60)

    for rank, idx in enumerate(top_indices, 1):
        meta = metadata[idx]
        score = similarities[idx]

        print(f"\n[{rank}] Score: {score:.3f}")

        if meta['type'] == 'conversation_message':
            # Find the conversation
            conv = next(c for c in knowledge['conversations'] if c['id'] == meta['conversation_id'])
            msg = conv['messages'][meta['message_index']]

            print(f"    Conversation: {meta['conversation_title']}")
            print(f"    Role: {meta['role']}")
            print(f"    Preview: {msg['content'][:200]}...")
        else:
            print(f"    Type: Note")
            print(f"    Source: {meta['source']}")

# CLI
if __name__ == '__main__':
    if len(sys.argv) > 1:
        query = ' '.join(sys.argv[1:])
        search(query)
    else:
        print("Usage: python search_my_knowledge.py <query>")
        print("Example: python search_my_knowledge.py how does LIM-RPS work")

Usage:

bash
python search_my_knowledge.py "explain gesture detection"
python search_my_knowledge.py "distributor email"
python search_my_knowledge.py "LIM-RPS convergence"

---

File Structure After Setup

cc-tpo/
├── data/
│   ├── unified_knowledge.json          # ← Created in Phase 1
│   └── embeddings/
│       ├── personal_embeddings.npy     # ← Created in Phase 2
│       ├── metadata.json               # ← Created in Phase 2
│       └── embeddings_cache.pkl        # ← Created in Phase 2
│
├── scripts/
│   ├── unify_personal_data.py          # ✅ Ready
│   └── generate_personal_embeddings.py # ✅ Ready
│
└── search_my_knowledge.py              # ← Build this for quick search

---

Troubleshooting

"ModuleNotFoundError: No module named 'sentence_transformers'"

bash
pip install sentence-transformers

"OutOfMemoryError" during embedding generation

Reduce batch size in the script:

python
# In generate_personal_embeddings.py, line ~300
generator = PersonalEmbeddingGenerator(
    model_name="google/embeddinggemma-300m",
    batch_size=16  # ← Reduce from 32 to 16 or 8
)

"File not found: unified_knowledge.json"

Make sure you ran Phase 1 first:

bash
python scripts/unify_personal_data.py

---

Performance Expectations

### System Requirements
- CPU: Any modern CPU (M1/M2 Mac works great!)
- RAM: 8 GB minimum, 16 GB recommended
- Storage: ~1 GB for model + embeddings
- GPU: Optional (will use CPU if no GPU)

### Speed
- Embedding generation: ~100 texts/second (CPU), ~500/second (GPU)
- Semantic search: < 100ms for 10,000 embeddings
- Total setup time: 20-40 minutes

### Accuracy
- Google's `embeddinggemma-300m` is state-of-the-art for semantic similarity
- Comparable to OpenAI's embeddings but runs locally
- Perfect for finding relevant conversations in your personal data

---

What You'll Have

After completing Phase 1 & 2:

All your data unified - One clean knowledge base
Semantic search - Find any conversation instantly
Local & private - Everything runs on your machine
Fast queries - < 100ms search time
Foundation ready - Ready for full PersonalAI system

---

Questions?

  • Full architecture: See [PERSONALIZED_AI_SYSTEM_ARCHITECTURE.md](PERSONALIZED_AI_SYSTEM_ARCHITECTURE.md)
  • CC analysis: See [CC_CONVERSATION_ANALYSIS_PLAN.md](CC_CONVERSATION_ANALYSIS_PLAN.md)
  • DLM documentation: See [packages/dlm/response/README.md](packages/dlm/response/README.md)

---

Ready? Start with:

bash
python scripts/unify_personal_data.py

Let's build your personal AI! 🚀

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

Comp-Core/backend/cc-trajectory/legacy/cc-tpo-original/cc-tpo/docs/guides/GETTING_STARTED.md

Detected Structure

Method · Evaluation · Code Anchors · Architecture