RAG++ ChatGPT Memory Integration: Hard Specification
Build a conversation-memory substrate for Computational Choreography where ChatGPT export is ingested **without losing DAG truth**, exposed through retrieval that answers four question classes:
Full Public Reader
RAG++ ChatGPT Memory Integration: Hard Specification
Version: 1.0.0
Status: LOCKED - Implementation Grade
Last Updated: 2025-12-27
---
0. Goal Statement
Build a conversation-memory substrate for Computational Choreography where ChatGPT export is ingested without losing DAG truth, exposed through retrieval that answers four question classes:
1. Exact: "What did I say?" (lexical, IDs, timestamps)
2. Semantic: "What else like this?" (dense vectors)
3. Causal: "How did we get here?" (DAG traversal)
4. CC-Specific: "What stabilized last time?" (motifs/invariants/decisions)
### Non-Goals (MVP)
- No agent memory reasoning engine
- No training or online embedding updates (except stats/annotations)
- No perfect phase classification (rule-based first)
---
1. Core Principle: Preserve Truth, Add Views
The ChatGPT export is a DAG, not a transcript.
The `mapping` in conversations.json contains nodes with parent/children links. Multiple children = regenerations/branching. If you flatten to single sequence, you destroy the most valuable property: alternative continuations and exact adjacency relations.
Hard Rule: Canonical store preserves graph exactly. Linear projections are explicitly labeled with declared selection policy.
---
2. Database as Memory Ledger with Update Laws
Field Categories
Category 1: Canonical Facts (IMMUTABLE)
- Identifiers, timestamps, content, graph structure
- Never change after insert
Category 2: Annotations (MUTABLE, VERSIONED)
- Phase labels, salience scores, tags, primary-path membership
- Allowed to change; updates must carry version
Category 3: Outcome Stats (MUTABLE)
- Welford mean/m2/count
- Updated online as CC uses memory
Separation enforced by database triggers.
---
3. Schema (Exact SQL)
3.1 Extensions and Core Tables
-- Enable extensions
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Conversations
CREATE TABLE memory_conversations (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
chatgpt_conversation_id TEXT UNIQUE NOT NULL,
title TEXT,
created_at TIMESTAMPTZ NOT NULL,
updated_at TIMESTAMPTZ NOT NULL,
turn_count INT NOT NULL DEFAULT 0,
metadata JSONB NOT NULL DEFAULT '{}'::jsonb
);
-- Turns (canonical + annotations + stats)
CREATE TABLE memory_turns (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-- Canonical identifiers (IMMUTABLE)
chatgpt_message_id TEXT UNIQUE NOT NULL,
chatgpt_node_id TEXT UNIQUE NOT NULL,
conversation_id UUID NOT NULL REFERENCES memory_conversations(id) ON DELETE CASCADE,
-- Canonical content (IMMUTABLE)
role TEXT NOT NULL CHECK (role IN ('user','assistant','system','tool')),
content_type TEXT,
content_text TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
-- Full-text search (generated)
content_tsv TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', content_text)) STORED,
-- Embeddings (IMMUTABLE once set)
embedding_dim SMALLINT NOT NULL DEFAULT 0 CHECK (embedding_dim >= 0),
embedding VECTOR(768),
-- Derived projections (MUTABLE)
branch_id TEXT,
turn_index_primary INT,
is_primary_path BOOLEAN NOT NULL DEFAULT false,
-- Annotations (MUTABLE, versioned)
phase TEXT CHECK (phase IN ('exploration','consolidation','synthesis','debugging','planning') OR phase IS NULL),
salience_score FLOAT NOT NULL DEFAULT 0.5 CHECK (salience_score >= 0 AND salience_score <= 1),
tags TEXT[] NOT NULL DEFAULT '{}'::text[],
annotation_version JSONB NOT NULL DEFAULT '{}'::jsonb,
-- Outcome stats - Welford (MUTABLE)
outcome_mean FLOAT NOT NULL DEFAULT 0.5,
outcome_m2 FLOAT NOT NULL DEFAULT 0.0, -- Store M2, NOT variance
outcome_count INT NOT NULL DEFAULT 0,
metadata JSONB NOT NULL DEFAULT '{}'::jsonb
);
-- Turn edges (DAG structure - CANONICAL)
CREATE TABLE memory_turn_edges (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
conversation_id UUID NOT NULL REFERENCES memory_conversations(id) ON DELETE CASCADE,
parent_turn_id UUID NOT NULL REFERENCES memory_turns(id) ON DELETE CASCADE,
child_turn_id UUID NOT NULL REFERENCES memory_turns(id) ON DELETE CASCADE,
edge_type TEXT NOT NULL DEFAULT 'unknown',
UNIQUE(conversation_id, parent_turn_id, child_turn_id)
);
-- Feedback
CREATE TABLE memory_feedback (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
turn_id UUID NOT NULL UNIQUE REFERENCES memory_turns(id) ON DELETE CASCADE,
rating TEXT NOT NULL CHECK (rating IN ('thumbs_up','thumbs_down')),
created_at TIMESTAMPTZ NOT NULL
);
-- Media (images + audio)
CREATE TABLE memory_media (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
turn_id UUID NOT NULL REFERENCES memory_turns(id) ON DELETE CASCADE,
media_type TEXT NOT NULL CHECK (media_type IN ('image','audio')),
file_path TEXT NOT NULL,
storage_path TEXT,
-- Image captions
caption_text TEXT,
caption_embedding VECTOR(768),
-- Audio transcripts (REQUIRED for audio)
transcript_text TEXT,
transcript_embedding VECTOR(768),
metadata JSONB NOT NULL DEFAULT '{}'::jsonb
);
-- Motifs (CC-specific, MANDATORY)
CREATE TABLE memory_motifs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
motif_type TEXT NOT NULL CHECK (motif_type IN ('decision','invariant','pattern','architecture','warning','milestone')),
title TEXT NOT NULL,
body_text TEXT NOT NULL,
embedding VECTOR(768),
phase TEXT CHECK (phase IN ('exploration','consolidation','synthesis','debugging','planning') OR phase IS NULL),
salience_score FLOAT NOT NULL DEFAULT 0.8 CHECK (salience_score >= 0 AND salience_score <= 1),
tags TEXT[] NOT NULL DEFAULT '{}'::text[],
source_turn_ids UUID[] NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
metadata JSONB NOT NULL DEFAULT '{}'::jsonb
);3.2 Indices
-- Full-text indices (lexical retrieval)
CREATE INDEX idx_turns_tsv ON memory_turns USING gin (content_tsv);
CREATE INDEX idx_motifs_trgm_title ON memory_motifs USING gin (title gin_trgm_ops);
-- Metadata filters
CREATE INDEX idx_turns_convo ON memory_turns(conversation_id);
CREATE INDEX idx_turns_created_at_desc ON memory_turns(created_at DESC);
CREATE INDEX idx_turns_role ON memory_turns(role);
CREATE INDEX idx_turns_phase ON memory_turns(phase) WHERE phase IS NOT NULL;
CREATE INDEX idx_turns_primary ON memory_turns(is_primary_path);
CREATE INDEX idx_turns_salience_desc ON memory_turns(salience_score DESC);
CREATE INDEX idx_edges_parent ON memory_turn_edges(parent_turn_id);
CREATE INDEX idx_edges_child ON memory_turn_edges(child_turn_id);
CREATE INDEX idx_media_turn ON memory_media(turn_id);
CREATE INDEX idx_media_type ON memory_media(media_type);
-- Vector indices (HNSW - preferred)
CREATE INDEX idx_turns_embedding_hnsw ON memory_turns
USING hnsw (embedding vector_cosine_ops);
CREATE INDEX idx_motifs_embedding_hnsw ON memory_motifs
USING hnsw (embedding vector_cosine_ops);
CREATE INDEX idx_media_caption_embedding_hnsw ON memory_media
USING hnsw (caption_embedding vector_cosine_ops);
CREATE INDEX idx_media_transcript_embedding_hnsw ON memory_media
USING hnsw (transcript_embedding vector_cosine_ops);3.3 Immutability Trigger
CREATE OR REPLACE FUNCTION memory_turns_enforce_immutability()
RETURNS TRIGGER
LANGUAGE plpgsql
AS $$
BEGIN
-- Block updates to canonical identity
IF NEW.chatgpt_message_id IS DISTINCT FROM OLD.chatgpt_message_id THEN
RAISE EXCEPTION 'Immutable field chatgpt_message_id cannot be updated';
END IF;
IF NEW.chatgpt_node_id IS DISTINCT FROM OLD.chatgpt_node_id THEN
RAISE EXCEPTION 'Immutable field chatgpt_node_id cannot be updated';
END IF;
IF NEW.conversation_id IS DISTINCT FROM OLD.conversation_id THEN
RAISE EXCEPTION 'Immutable field conversation_id cannot be updated';
END IF;
IF NEW.role IS DISTINCT FROM OLD.role THEN
RAISE EXCEPTION 'Immutable field role cannot be updated';
END IF;
IF NEW.content_type IS DISTINCT FROM OLD.content_type THEN
RAISE EXCEPTION 'Immutable field content_type cannot be updated';
END IF;
IF NEW.content_text IS DISTINCT FROM OLD.content_text THEN
RAISE EXCEPTION 'Immutable field content_text cannot be updated';
END IF;
IF NEW.created_at IS DISTINCT FROM OLD.created_at THEN
RAISE EXCEPTION 'Immutable field created_at cannot be updated';
END IF;
-- Embedding immutability: once set (dim > 0), cannot change
IF OLD.embedding_dim > 0 THEN
IF NEW.embedding_dim IS DISTINCT FROM OLD.embedding_dim THEN
RAISE EXCEPTION 'Immutable field embedding_dim cannot be updated after set';
END IF;
IF NEW.embedding IS DISTINCT FROM OLD.embedding THEN
RAISE EXCEPTION 'Immutable field embedding cannot be updated after set';
END IF;
ELSE
IF NEW.embedding_dim < 0 THEN
RAISE EXCEPTION 'embedding_dim must be >= 0';
END IF;
END IF;
RETURN NEW;
END;
$$;
CREATE TRIGGER trg_memory_turns_immutability
BEFORE UPDATE ON memory_turns
FOR EACH ROW
EXECUTE FUNCTION memory_turns_enforce_immutability();3.4 Welford Update Function
CREATE OR REPLACE FUNCTION memory_turns_update_outcome(
p_turn_id UUID,
p_outcome FLOAT
)
RETURNS VOID
LANGUAGE plpgsql
AS $$
DECLARE
v_mean FLOAT;
v_m2 FLOAT;
v_n INT;
v_n1 INT;
v_delta FLOAT;
v_delta2 FLOAT;
v_new_mean FLOAT;
v_new_m2 FLOAT;
BEGIN
-- Lock row for update
SELECT outcome_mean, outcome_m2, outcome_count
INTO v_mean, v_m2, v_n
FROM memory_turns
WHERE id = p_turn_id
FOR UPDATE;
v_n1 := v_n + 1;
v_delta := p_outcome - v_mean;
v_new_mean := v_mean + v_delta / v_n1;
v_delta2 := p_outcome - v_new_mean;
v_new_m2 := v_m2 + v_delta * v_delta2;
UPDATE memory_turns
SET outcome_mean = v_new_mean,
outcome_m2 = v_new_m2,
outcome_count = v_n1
WHERE id = p_turn_id;
END;
$$;---
4. Hybrid Retrieval (Real Implementation)
4.1 Search Function
CREATE OR REPLACE FUNCTION search_memory_hybrid(
p_query_text TEXT,
p_query_embedding VECTOR(768),
p_limit INT DEFAULT 20,
-- Filters
p_conversation_id UUID DEFAULT NULL,
p_role TEXT DEFAULT NULL,
p_phase TEXT DEFAULT NULL,
p_min_salience FLOAT DEFAULT 0.0,
p_only_primary_path BOOLEAN DEFAULT FALSE,
p_time_start TIMESTAMPTZ DEFAULT NULL,
p_time_end TIMESTAMPTZ DEFAULT NULL,
-- Fusion weights
p_lexical_weight FLOAT DEFAULT 0.4,
p_semantic_weight FLOAT DEFAULT 0.6,
-- Retrieval params
p_lexical_limit INT DEFAULT 50,
p_semantic_limit INT DEFAULT 50
)
RETURNS TABLE (
turn_id UUID,
content_text TEXT,
role TEXT,
phase TEXT,
salience_score FLOAT,
created_at TIMESTAMPTZ,
conversation_id UUID,
combined_score FLOAT,
lexical_rank INT,
semantic_rank INT
)
LANGUAGE plpgsql
AS $$
DECLARE
v_query_tsquery TSQUERY;
BEGIN
v_query_tsquery := plainto_tsquery('english', p_query_text);
RETURN QUERY
WITH lexical_hits AS (
SELECT
t.id,
ts_rank(t.content_tsv, v_query_tsquery) AS lex_score,
ROW_NUMBER() OVER (ORDER BY ts_rank(t.content_tsv, v_query_tsquery) DESC) AS lex_rank
FROM memory_turns t
WHERE t.content_tsv @@ v_query_tsquery
AND (p_conversation_id IS NULL OR t.conversation_id = p_conversation_id)
AND (p_role IS NULL OR t.role = p_role)
AND (p_phase IS NULL OR t.phase = p_phase)
AND t.salience_score >= p_min_salience
AND (NOT p_only_primary_path OR t.is_primary_path)
AND (p_time_start IS NULL OR t.created_at >= p_time_start)
AND (p_time_end IS NULL OR t.created_at <= p_time_end)
ORDER BY lex_score DESC
LIMIT p_lexical_limit
),
semantic_hits AS (
SELECT
t.id,
1 - (t.embedding <=> p_query_embedding) AS sem_score,
ROW_NUMBER() OVER (ORDER BY t.embedding <=> p_query_embedding) AS sem_rank
FROM memory_turns t
WHERE t.embedding IS NOT NULL
AND (p_conversation_id IS NULL OR t.conversation_id = p_conversation_id)
AND (p_role IS NULL OR t.role = p_role)
AND (p_phase IS NULL OR t.phase = p_phase)
AND t.salience_score >= p_min_salience
AND (NOT p_only_primary_path OR t.is_primary_path)
AND (p_time_start IS NULL OR t.created_at >= p_time_start)
AND (p_time_end IS NULL OR t.created_at <= p_time_end)
ORDER BY t.embedding <=> p_query_embedding
LIMIT p_semantic_limit
),
fused AS (
SELECT
COALESCE(l.id, s.id) AS fused_id,
COALESCE(l.lex_rank, p_lexical_limit + 1) AS lex_r,
COALESCE(s.sem_rank, p_semantic_limit + 1) AS sem_r,
-- Reciprocal Rank Fusion
(p_lexical_weight / (60 + COALESCE(l.lex_rank, p_lexical_limit + 1))) +
(p_semantic_weight / (60 + COALESCE(s.sem_rank, p_semantic_limit + 1))) AS rrf_score
FROM lexical_hits l
FULL OUTER JOIN semantic_hits s ON l.id = s.id
)
SELECT
t.id AS turn_id,
t.content_text,
t.role,
t.phase,
t.salience_score,
t.created_at,
t.conversation_id,
f.rrf_score AS combined_score,
f.lex_r::INT AS lexical_rank,
f.sem_r::INT AS semantic_rank
FROM fused f
JOIN memory_turns t ON t.id = f.fused_id
ORDER BY f.rrf_score DESC
LIMIT p_limit;
END;
$$;4.2 Motif-First Search (CC-Conductor)
CREATE OR REPLACE FUNCTION search_motifs_first(
p_query_embedding VECTOR(768),
p_tags TEXT[] DEFAULT NULL,
p_motif_types TEXT[] DEFAULT NULL,
p_limit INT DEFAULT 10
)
RETURNS TABLE (
motif_id UUID,
motif_type TEXT,
title TEXT,
body_text TEXT,
similarity FLOAT,
source_turn_ids UUID[]
)
LANGUAGE plpgsql
AS $$
BEGIN
RETURN QUERY
SELECT
m.id AS motif_id,
m.motif_type,
m.title,
m.body_text,
1 - (m.embedding <=> p_query_embedding) AS similarity,
m.source_turn_ids
FROM memory_motifs m
WHERE m.embedding IS NOT NULL
AND (p_tags IS NULL OR m.tags && p_tags)
AND (p_motif_types IS NULL OR m.motif_type = ANY(p_motif_types))
ORDER BY m.embedding <=> p_query_embedding
LIMIT p_limit;
END;
$$;---
5. Embedding Policy (Locked)
| Entity | Dimension | Task Type | Model |
|---|---|---|---|
| Turn content | 768 | RETRIEVAL_DOCUMENT | gemini-embedding-001 |
| Motif body | 768 | RETRIEVAL_DOCUMENT | gemini-embedding-001 |
| Image caption | 768 | RETRIEVAL_DOCUMENT | gemini-embedding-001 |
| Audio transcript | 768 | RETRIEVAL_DOCUMENT | gemini-embedding-001 |
| Query | 768 | RETRIEVAL_QUERY | gemini-embedding-001 |
Embed only:
- `memory_turns.content_text` where role IN ('user', 'assistant')
- `memory_motifs.body_text`
- `memory_media.caption_text` (images)
- `memory_media.transcript_text` (audio - REQUIRED)
Do NOT embed: system/tool boilerplate unless useful.
---
6. Media Requirements
### 6.1 Images
- Generate `caption_text` via Gemini 3 Flash
- Embed caption
- Link to parent turn
### 6.2 Audio (REQUIRED - not optional)
- Transcribe all audio via Whisper (API or local whisper.cpp)
- Store `transcript_text`
- Embed transcript
- Link to parent turn
Even rough transcription is better than dead memory.
---
7. Primary Path Policy
Primary path is a derived view, not a deletion.
### Policy v1.0 (Deterministic)
At each branching point, choose child that:
1. Has downstream `thumbs_up` feedback
2. Has longer assistant content (weak proxy)
3. Has fewer error tokens
4. Otherwise: first child by `created_at` then `node_id`
### Storage
- Set `is_primary_path = true`
- Set `branch_id = <computed>`
- Set `turn_index_primary = <sequence>`
Do NOT delete other branches.
---
8. Phase Inference Rules (v1.0)
| Phase | Indicators |
|---|---|
| exploration | Question marks, short turns, topic shifts |
| consolidation | Long responses, code blocks, file references |
| synthesis | "Decision:", "Let's go with", summaries |
| debugging | "Traceback", "Exception", "doesn't work", stack traces |
| planning | "Phase", "roadmap", "steps", numbered lists |
Store with version: `annotation_version.phase_version = "v1.0"`
---
9. Salience Scoring (v1.0)
def compute_salience(turn, feedback, context) -> float:
base = 0.5
# Feedback boost (strongest signal)
if feedback == 'thumbs_up':
base += 0.35
elif feedback == 'thumbs_down':
base -= 0.35
# Phase transition boost
if is_phase_transition(turn, context):
base += 0.10
# Reference boost (capped)
ref_count = count_downstream_references(turn)
base += min(ref_count * 0.05, 0.15)
# Novelty boost
novelty = embedding_distance_from_window(turn, context.recent_window)
base += min(novelty * 0.1, 0.10)
return max(0.0, min(1.0, base))Store with version: `annotation_version.salience_version = "v1.0"`
---
10. CC-Conductor Integration
10.1 Motifs Table (Mandatory)
Motifs are stabilized abstractions that prevent CC from drowning in raw history.
Motif Types:
- `decision`: Architecture/design decisions
- `invariant`: System constraints that must hold
- `pattern`: Recurring successful approaches
- `architecture`: Component designs
- `warning`: Known failure modes/anti-patterns
- `milestone`: Significant progress points
10.2 `what_matters_now()` Contract
def what_matters_now(
current_module: str, # e.g., "cc-window-aligner"
current_objective: str, # e.g., "design invariant"
context_text: str = None
) -> WhatMattersBundle:
"""
Returns:
1. Top motifs (first priority)
2. Supporting turns (second)
3. Anti-pattern warnings (third)
"""
# 1. Query motifs first
motifs = search_motifs_first(
query_embedding=embed(current_objective),
tags=[current_module],
limit=5
)
# 2. Get supporting turns for top motifs
supporting_turns = []
for m in motifs:
turns = get_turns_by_ids(m.source_turn_ids)
supporting_turns.extend(turns)
# 3. Get warnings
warnings = search_motifs_first(
query_embedding=embed(current_objective),
motif_types=['warning'],
limit=3
)
return WhatMattersBundle(
motifs=motifs,
supporting_turns=dedupe(supporting_turns),
warnings=warnings
)---
11. Execution Order
### Session 1: Schema + Canonicalization
1. Apply migrations (tables, triggers, indices)
2. Parse export to canonical JSONL
3. Upload: conversations → turns → edges → feedback → media refs
4. Validate counts match: 310 convos / 21,675 turns / edges / feedback
### Session 2: Embeddings
1. Embed turn content at dim=768
2. Update `memory_turns.embedding` + `embedding_dim`
3. Build HNSW index
4. Smoke test: semantic query returns reasonable neighbors
### Session 3: Media Processing
1. Caption all images → `caption_text` + `caption_embedding`
2. Transcribe all audio → `transcript_text` + `transcript_embedding`
3. Build media indices
### Session 4: Projections + Annotations
1. Compute primary path per conversation
2. Phase labeling (versioned)
3. Salience scoring (versioned)
4. Validate: distribution not degenerate
### Session 5: Motifs + CC Integration
1. Extract motifs from high-salience clusters
2. Implement hybrid retrieval API
3. Implement CC-Conductor bridge:
- `query_motifs()`
- `query_turns()`
- `what_matters_now()`
---
12. Validation Checklist
### Data Integrity
- [ ] Conversations inserted = 310
- [ ] Turns inserted = 21,675
- [ ] Edges inserted = sum of all children counts
- [ ] Feedback inserted = 66
- [ ] Media inserted = 216 images + 100+ audio
### Invariants
- [ ] UPDATE to immutable column fails (trigger test)
- [ ] Embedding dim = 768 for all vectors
- [ ] All audio has transcript_text
### Retrieval
- [ ] Dense query latency < 100ms
- [ ] Lexical query matches exact phrases
- [ ] Hybrid fusion improves recall for proper-noun queries
### CC Integration
- [ ] Motif retrieval returns stable abstractions
- [ ] `what_matters_now()` returns motifs first
- [ ] Known CC topics retrievable: "window aligner", "anticipator", "conductor"
---
13. Approval Conditions
Plan must explicitly state:
1. DAG preserved via `memory_turn_edges` + primary-path is derived view
2. Vector index choice is explicit (HNSW) and not contradictory
3. Immutability enforced in DB with triggers, not documentation
4. Embedding dim strategy locked (768 for turns)
5. True hybrid retrieval (tsvector + GIN + vector + RRF fusion)
6. Audio is transcribed (not reference-only)
7. Motifs table exists and is queried first by CC-Conductor
---
Sources
- [Gemini Embeddings API](https://ai.google.dev/gemini-api/docs/embeddings)
- [pgvector Documentation](https://github.com/pgvector/pgvector)
- [Welford's Online Algorithm](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm)
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
Comp-Core/core/retrieval/cc-rag-plus-plus/SPECIFICATION_HARD.md
Detected Structure
Method · Evaluation · References · Math · Architecture