Getting Started: Build Your Personal AI
You have **289 MB of personal data** across 5 files: - conversations.json (190 MB) - conversations_new.json (64 MB) - conversation_openai.json (8 MB) - notes.json (15 MB) - cc_conversations.json (13 MB)
Full Public Reader
Getting Started: Build Your Personal AI
Quick guide to building your personalized AI inference system with full context memory.
Overview
You have 289 MB of personal data across 5 files:
- conversations.json (190 MB)
- conversations_new.json (64 MB)
- conversation_openai.json (8 MB)
- notes.json (15 MB)
- cc_conversations.json (13 MB)
We'll transform this into a personal AI that knows everything about you and your projects.
---
Prerequisites
Install Dependencies
# Navigate to project
cd [home]/Desktop/Computational\ Choreography/cc-tpo
# Install required packages
pip install sentence-transformers scikit-learn tqdm numpyThat's it! The `embeddinggemma-300m` model will download automatically on first run (~600 MB).
---
Step-by-Step Guide
Phase 1: Unify Your Data (5-10 minutes)
Combines all 5 data sources into one unified knowledge base.
python scripts/unify_personal_data.pyWhat it does:
- ✅ Loads all conversations from all sources
- ✅ Extracts clean message threads
- ✅ Deduplicates content
- ✅ Auto-categorizes by topic (CC, music, business, etc.)
- ✅ Creates `data/unified_knowledge.json`
Expected output:
📊 UNIFIED KNOWLEDGE BASE SUMMARY
Total Conversations: ~500-800
Total Messages: ~10,000-15,000
Total Notes: ~1,000-2,000---
Phase 2: Generate Embeddings (15-30 minutes)
Uses `sentence-transformers` with Google's `embeddinggemma-300m` model to create semantic embeddings.
python scripts/generate_personal_embeddings.pyWhat it does:
- ✅ Downloads `embeddinggemma-300m` model (first run only)
- ✅ Generates embeddings for ALL your messages and notes
- ✅ Uses batched processing (~100 texts/second)
- ✅ Saves to `data/embeddings/personal_embeddings.npy`
- ✅ Tests semantic search with example queries
Expected output:
🔮 Generating embeddings for 12,000 texts...
Batch size: 32
Embedding dimension: 768
[████████████████████] 100% complete
✅ Generated embeddings with shape: (12000, 768)Time estimate:
- ~10,000 texts: 15-20 minutes
- ~15,000 texts: 25-30 minutes
---
Phase 3: Test Semantic Search (Immediate!)
Once embeddings are generated, you can immediately start searching your knowledge:
import numpy as np
import json
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
# Load model
model = SentenceTransformer("google/embeddinggemma-300m")
# Load embeddings
embeddings = np.load('data/embeddings/personal_embeddings.npy')
# Load metadata
with open('data/embeddings/metadata.json') as f:
data = json.load(f)
metadata = data['text_metadata']
# Your query
query = "How does LIM-RPS handle gesture detection?"
# Search
query_emb = model.encode([query], normalize_embeddings=True)
similarities = cosine_similarity(query_emb, embeddings)[0]
top_10 = similarities.argsort()[-10:][::-1]
# Results
print(f"Top 10 results for: {query}\n")
for rank, idx in enumerate(top_10, 1):
meta = metadata[idx]
score = similarities[idx]
if meta['type'] == 'conversation_message':
print(f"{rank}. [{score:.3f}] {meta['conversation_title']}")
else:
print(f"{rank}. [{score:.3f}] Note")Output example:
Top 10 results for: How does LIM-RPS handle gesture detection?
1. [0.892] LIM-RPS overview
2. [0.854] Echelon DAW comparison
3. [0.832] Code implementation update
4. [0.801] Computational choreography explained
...---
Quick Test
After completing Phase 1 & 2, test your system:
from sentence_transformers import SentenceTransformer
import numpy as np
import json
from sklearn.metrics.pairwise import cosine_similarity
# Initialize
model = SentenceTransformer("google/embeddinggemma-300m")
embeddings = np.load('data/embeddings/personal_embeddings.npy')
with open('data/embeddings/metadata.json') as f:
metadata = json.load(f)['text_metadata']
# Test queries
queries = [
"computational choreography",
"music production tips",
"distributor email",
"recursive synthesis",
]
for query in queries:
query_emb = model.encode([query], normalize_embeddings=True)
similarities = cosine_similarity(query_emb, embeddings)[0]
best_match_idx = similarities.argmax()
meta = metadata[best_match_idx]
score = similarities[best_match_idx]
print(f"Query: '{query}'")
print(f" → Best match: {meta.get('conversation_title', 'Note')} (score: {score:.3f})")
print()---
Next Steps
Option A: Full Personal AI System (Week 2-4)
Continue with the complete implementation:
1. Build knowledge topology with I-RCP
2. Extract personal profile
3. Create PersonalAI inference system
4. Add persistent state
See: [PERSONALIZED_AI_SYSTEM_ARCHITECTURE.md](PERSONALIZED_AI_SYSTEM_ARCHITECTURE.md)
Option B: Quick Semantic Search Tool (Today!)
Build a simple search tool right now:
# save as: search_my_knowledge.py
from sentence_transformers import SentenceTransformer
import numpy as np
import json
from sklearn.metrics.pairwise import cosine_similarity
import sys
# Load
model = SentenceTransformer("google/embeddinggemma-300m")
embeddings = np.load('data/embeddings/personal_embeddings.npy')
with open('data/embeddings/metadata.json') as f:
data = json.load(f)
metadata = data['text_metadata']
# Load full conversations
with open('data/unified_knowledge.json') as f:
knowledge = json.load(f)
# Search function
def search(query, top_k=5):
query_emb = model.encode([query], normalize_embeddings=True)
similarities = cosine_similarity(query_emb, embeddings)[0]
top_indices = similarities.argsort()[-top_k:][::-1]
print(f"\n🔍 Search: '{query}'")
print("="*60)
for rank, idx in enumerate(top_indices, 1):
meta = metadata[idx]
score = similarities[idx]
print(f"\n[{rank}] Score: {score:.3f}")
if meta['type'] == 'conversation_message':
# Find the conversation
conv = next(c for c in knowledge['conversations'] if c['id'] == meta['conversation_id'])
msg = conv['messages'][meta['message_index']]
print(f" Conversation: {meta['conversation_title']}")
print(f" Role: {meta['role']}")
print(f" Preview: {msg['content'][:200]}...")
else:
print(f" Type: Note")
print(f" Source: {meta['source']}")
# CLI
if __name__ == '__main__':
if len(sys.argv) > 1:
query = ' '.join(sys.argv[1:])
search(query)
else:
print("Usage: python search_my_knowledge.py <query>")
print("Example: python search_my_knowledge.py how does LIM-RPS work")Usage:
python search_my_knowledge.py "explain gesture detection"
python search_my_knowledge.py "distributor email"
python search_my_knowledge.py "LIM-RPS convergence"---
File Structure After Setup
cc-tpo/
├── data/
│ ├── unified_knowledge.json # ← Created in Phase 1
│ └── embeddings/
│ ├── personal_embeddings.npy # ← Created in Phase 2
│ ├── metadata.json # ← Created in Phase 2
│ └── embeddings_cache.pkl # ← Created in Phase 2
│
├── scripts/
│ ├── unify_personal_data.py # ✅ Ready
│ └── generate_personal_embeddings.py # ✅ Ready
│
└── search_my_knowledge.py # ← Build this for quick search---
Troubleshooting
"ModuleNotFoundError: No module named 'sentence_transformers'"
pip install sentence-transformers"OutOfMemoryError" during embedding generation
Reduce batch size in the script:
# In generate_personal_embeddings.py, line ~300
generator = PersonalEmbeddingGenerator(
model_name="google/embeddinggemma-300m",
batch_size=16 # ← Reduce from 32 to 16 or 8
)"File not found: unified_knowledge.json"
Make sure you ran Phase 1 first:
python scripts/unify_personal_data.py---
Performance Expectations
### System Requirements
- CPU: Any modern CPU (M1/M2 Mac works great!)
- RAM: 8 GB minimum, 16 GB recommended
- Storage: ~1 GB for model + embeddings
- GPU: Optional (will use CPU if no GPU)
### Speed
- Embedding generation: ~100 texts/second (CPU), ~500/second (GPU)
- Semantic search: < 100ms for 10,000 embeddings
- Total setup time: 20-40 minutes
### Accuracy
- Google's `embeddinggemma-300m` is state-of-the-art for semantic similarity
- Comparable to OpenAI's embeddings but runs locally
- Perfect for finding relevant conversations in your personal data
---
What You'll Have
After completing Phase 1 & 2:
✅ All your data unified - One clean knowledge base
✅ Semantic search - Find any conversation instantly
✅ Local & private - Everything runs on your machine
✅ Fast queries - < 100ms search time
✅ Foundation ready - Ready for full PersonalAI system
---
Questions?
- Full architecture: See [PERSONALIZED_AI_SYSTEM_ARCHITECTURE.md](PERSONALIZED_AI_SYSTEM_ARCHITECTURE.md)
- CC analysis: See [CC_CONVERSATION_ANALYSIS_PLAN.md](CC_CONVERSATION_ANALYSIS_PLAN.md)
- DLM documentation: See [packages/dlm/response/README.md](packages/dlm/response/README.md)
---
Ready? Start with:
python scripts/unify_personal_data.pyLet's build your personal AI! 🚀
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
Comp-Core/backend/cc-trajectory/legacy/cc-tpo-original/cc-tpo/docs/guides/GETTING_STARTED.md
Detected Structure
Method · Evaluation · Code Anchors · Architecture