EVO-CUBE REPORT: CognitiveTwin Pipeline
**Date:** 2025-07-18 **Codebase:** `Desktop/Comp-Core/packages/cognitive-twin/` (93 Python files, ~47K LOC) **Data:** 43K records across 8 expansion stages + combined_v5_v8 final dataset **Target Model:** Kimi-K2-Thinking (MoE-1T, 32B active params)
Full Public Reader
# EVO-CUBE REPORT: CognitiveTwin Pipeline
### CEF + DEP-2 + Evolution — Full Audit
Date: 2025-07-18
Codebase: `Desktop/Comp-Core/packages/cognitive-twin/` (93 Python files, ~47K LOC)
Data: 43K records across 8 expansion stages + combined_v5_v8 final dataset
Target Model: Kimi-K2-Thinking (MoE-1T, 32B active params)
---
TABLE OF CONTENTS
1. [Phase 1 — CEF (Critique–Evil–Find)](#phase-1--cef)
- [Meta-Evil: Pipeline-Level Attacks](#11-meta-evil-pipeline-level-attacks)
- [Chunk-Evil: Per-Module Attacks](#12-chunk-evil-per-module-attacks)
- [Synthesis-Evil: Cross-Stage Data Quality](#13-synthesis-evil-cross-stage-data-quality)
2. [Phase 2 — DEP-2 (6-Level RTD + Fixes)](#phase-2--dep-2)
3. [Phase 3 — Evolution](#phase-3--evolution)
4. [Issue Tracker](#issue-tracker)
5. [Verdict](#verdict)
---
PHASE 1 — CEF
1.1 Meta-Evil: Pipeline-Level Attacks
ATTACK 1: Classification Threshold Regression → Silent Data Corruption
Finding: CRITICAL 🔴
In `corpus_surgery/constants.py`, the stall threshold was lowered from 3 → 1:
# Original: 3, 1, 0 - too strict, only caught 5/2177
STALL_THRESHOLD_UNJUSTIFIED = 1 # Any stalling patternThe comment says "only caught 5/2177" so the threshold was weakened. But lowering to 1 means any message ending with a question mark (stall_score=1) triggers UNJUSTIFIED classification — even messages like "Here is the implementation. Does this approach work?" which scored exec=3 but would now hit stall≥1. The only protection is the `exec_score == 0` check in `is_unjustified()`, but the secondary rule bypasses exec:
# Secondary rule: strong permission phrase + ends with ? + high completeness
if (ends_with_question(text) and
has_strong_permission and
directive_completeness >= DIRECTIVE_HIGH_THRESHOLD):
return TrueThis secondary rule has no exec_score gate — a response that contains "should we" (strong permission phrase) PLUS code PLUS question mark will be classified UNJUSTIFIED even though it executed. This corrupts the corpus surgery stage by flagging legitimate clarifications-after-execution as unjustified.
Impact: Potential false-positive rate on valid assistant turns estimated 8-15
Fix: Add `and exec_score == 0` to the secondary rule. Or restore stall threshold to 2 as a middle ground between 1 and 3.
---
ATTACK 2: FunctionGemma Integration is Dead Code in Production
Finding: HIGH 🟠
The `functiongemma_scorer.py` module is sophisticated (400+ lines) but always falls back to mock mode in production:
def __init__(self, ..., use_mock: bool = False):
...
if self._model_path is None:
logger.warning("No model path provided, using mock mode")
self.use_mock = TrueNo model path is ever configured in any script or config file. The mock parser uses keyword matching to estimate parsability — it's a heuristic pretending to be ML inference. The classifier's fusion logic weights parsability at 60
fused_completeness = (
0.4 * directive_completeness +
0.6 * parsability_score * parsability_info.confidence
)When mock confidence is 0.3-0.7 (hardcoded), this drags the fused score down compared to pure heuristic. The FunctionGemma rule `parsability_score >= 0.8 and stall_score >= 2` can still fire from mock data, creating phantom classifications.
Impact: Classification operates on heuristic-mock data pretending to be ML-scored data. Fusion logic degrades rather than improves accuracy.
Fix: Either (a) ship a real FunctionGemma model and integrate it, or (b) remove the fusion logic and disable the FunctionGemma classification rule when in mock mode. Don't let mock data influence production classifications.
---
ATTACK 3: Stall Score Double-Counting
Finding: MEDIUM 🟡
The `compute_stall_score()` function counts every matching phrase independently, but many phrases overlap:
- "i'm sorry, but i can't" matches BOTH `REFUSAL_PHRASES` (+4) AND `CLARIFICATION_PREAMBLES` ("i'm sorry" → not exact, but close patterns overlap)
- "would you like me to" (+3) can co-occur with "sound good?" (+3 from another strong permission phrase)
A single response saying "I apologize, but I cannot do that. Would you like me to try a different approach? Should I proceed?" would score: 4 (refusal) + 3 (would you like) + 3 (should I) + 1 (question mark) = 11. The threshold is 1. This response is definitely bad, but the score is meaninglessly inflated, making it impossible to use score magnitude for severity calibration or confidence weighting in DPO pairs.
Impact: Score inflation prevents meaningful gradient between "slightly stalling" and "completely refusing." DPO confidence could be calibrated to stall_score magnitude but currently isn't usable for this.
Fix: Either (a) deduplicate phrase matches (only count highest-scoring match per semantic category), or (b) cap stall_score at 10, or (c) normalize stall_score to [0,1] range for downstream use.
---
ATTACK 4: Rewriter Has No Semantic Fidelity Check
Finding: HIGH 🟠
The `rewrite_assistant_turn()` function validates rewrites for:
- No question mark at end ✓
- No permission phrases ✓
- Required artifacts present ✓
- Format compliance ✓
But there is no check that the rewrite is semantically faithful to the user's request. The rewriter calls GPT to transform permission-seeking responses into direct execution — but nothing verifies the generated content is correct. A rewrite could hallucinate a completely wrong implementation, and it would pass validation because it has a code block and doesn't end with a question.
Impact: Hallucinated rewrites enter SFT training data as "gold" examples. The model learns to produce confident-sounding wrong answers instead of correct permission-seeking ones. This is worse than the original problem.
Fix: Add a semantic similarity check between the rewrite and the user's request (embedding cosine similarity > 0.6). Or add a "code compiles" check for code rewrites. Or flag all rewrites as `review_status: auto_rewrite` and weight them lower (0.5) in training.
---
ATTACK 5: No Data Versioning or Lineage Tracking
Finding: MEDIUM 🟡
The pipeline has 8 expansion stages (v1-v8), combined datasets, augmented datasets, and multiple export versions (ctv3_export through ctv3_export_v5). But there is no mechanism to:
- Track which source records end up in which final dataset
- Know if a record was rewritten, augmented, or original
- Reproduce a specific dataset version
- Audit data flow from ingestion → scoring → WORMS → export
The `record_id` field uses random UUIDs, so there's no parent-child linkage. A record generated by conversation_worm has `source.origin = "convo_worm"` but no reference to which original conversation it branched from (the field `source_id` exists but is just the conversation_id, not the specific turn).
Impact: Cannot debug training failures back to data issues. Cannot identify which augmentation stage introduced problematic records. Dataset reproducibility is zero.
Fix: Add a `lineage` field to CTv3Record: `{parent_id: str, stage: str, transform: str}`. Chain parent_ids through pipeline stages. Log every record transformation.
---
1.2 Chunk-Evil: Per-Module Attacks
INGESTION (`v3/ingest/`)
| ID | Finding | Severity | Detail |
|---|---|---|---|
| ING-1 | Claude JSON parser assumes specific export format | 🟡 MEDIUM | `claude_json.py` and `openai_json.py` parse specific export schemas. No schema validation — malformed exports silently drop turns rather than erroring. |
| ING-2 | Deduplicator hash truncation | 🟢 LOW | `sha256.hexdigest()[:16]` gives 64-bit hash space — collision probability rises past ~100K conversations. At 43K records, estimated <0.1 |
| ING-3 | Normalizer doesn't handle multi-modal content | 🟡 MEDIUM | Image/file attachments in conversations are silently dropped. No flag to indicate content was lost. Model trains on incomplete context. |
| ING-4 | Supabase extractor has no pagination | 🟡 MEDIUM | `.limit(limit * 10)` is a guess, not proper cursor-based pagination. Will miss data in large corpora. |
SCORING (`v3/corpus_surgery/`)
| ID | Finding | Severity | Detail |
|---|---|---|---|
| SCR-1 | `ends_with_question()` over-fires | 🟡 MEDIUM | Checks if last sentence starts with "is", "are", "will" — catches declaratives like "This is the implementation." if preceded by a sentence ending with no period. Also: markdown headers like `## What is X` would fire. |
| SCR-2 | `check_missing_input()` false positives | 🟡 MEDIUM | Triggers on "refactor" keyword + no code block, but the code might be in an attachment or referenced by file path. `has_file_path` regex `[/ |
| ]+\.\w+` doesn't match many real paths (e.g., `Desktop/...` with tildes). | |||
| SCR-3 | `directive_completeness` maxes at 0.80 | 🟢 LOW | Max achievable: +0.35 (verb) + 0.25 (format) + 0.20 (inputs) = 0.80. No message can score 1.0 without negative terms canceling out. The 0.7 threshold for `no_questions` policy is reachable but leaves thin margin. |
| SCR-4 | `compute_blocked_score()` can go negative | 🟡 MEDIUM | `FORMAT_SPECIFIED_BONUS = -1` and `USER_ASKED_OPTIONS_BONUS = -2`. Starting from 0, a high-completeness message with format spec + options asked = -3, then `max(0, score)` clips it. But if `check_missing_input` fires (+3) and format specified (-1), the interaction is non-intuitive. |
| SCR-5 | No unit tests for classifier | 🔴 CRITICAL | The classifier has 3 hardcoded test cases in `__main__` but zero pytest tests. Classification logic changed (threshold 3→1, FunctionGemma rule added) with no regression testing. Cannot verify correctness. |
WORMS (`v3/worms/`)
| ID | Finding | Severity | Detail |
|---|---|---|---|
| WRM-1 | ConversationWorm depends on Supabase at runtime | 🟡 MEDIUM | `_load_conversation()` requires Supabase. No local-file fallback for the main processing path. The `process_conversation()` method returns empty if no Supabase client — silently produces zero output. |
| WRM-2 | WORMS multiplier is 1.0x | 🟡 MEDIUM | `worms_stats.json` shows total_sft=3867, orig_sft=3852 — only 15 new conversation branches were generated. The WORMS augmentation pipeline added 0.4 |
| WRM-3 | RepoWorm generated 0 records | 🔴 CRITICAL | `repo_sft: 0, repo_dpo: 0` in worms_stats. The entire code-grounded training data pipeline produced nothing. Either no repos were scanned, or the CodeScanner/TaskGenerator found no tasks. This is an entire dead pipeline. |
| WRM-4 | DPO generator hardcodes "gpt-5.2" as provider | 🟢 LOW | `create_sft_record()` sets `provider: "gpt-5.2"` regardless of actual generation model. Stats show `model: "gemini-2.0-flash"` was used. Metadata lies about provenance. |
| WRM-5 | `_get_original_response()` always returns None | 🟡 MEDIUM | In `ConversationWormPipeline`, this method returns `None` with a comment "This would typically look up the original response from Supabase." Every DPO pair that depends on finding the original dispreferred response fails silently. |
EXPORT & DATASET (`v3/dataset/`)
| ID | Finding | Severity | Detail |
|---|---|---|---|
| EXP-1 | DPO format mismatch between pipeline stages | 🔴 CRITICAL | Corpus surgery exports DPO as `{prompt, chosen, rejected}` (Together AI format). Combined v5_v8 data uses `{input, preferred_output, non_preferred_output}`. The `DataPreparer.prepare_dpo_data()` expects `{candidates.preferred.assistant_content}` (CTv3.1 schema format). These three formats are incompatible. No adapter exists to normalize them. |
| EXP-2 | No format validation on final export | 🟡 MEDIUM | `validate_export.py` exists (724 lines) but there's no evidence it's integrated into the pipeline. Data can be exported in an incompatible format and only discovered when training fails. |
| EXP-3 | 80/10/10 split has no stratification | 🟡 MEDIUM | `DatasetSplit` does random shuffle + split. No stratification by source, domain, task_type, or quality tier. Training set could over-represent one expansion stage and under-represent another. |
| EXP-4 | Eval set has no coverage guarantee | 🟡 MEDIUM | Eval records are exported as-is with no check for coverage of all failure modes, task types, or domains. |
TRAINING (`v3/pipeline.py`, `scripts/`)
| ID | Finding | Severity | Detail |
|---|---|---|---|
| TRN-1 | Training pipeline is Together AI-only | 🟡 MEDIUM | `V3TrainingPipeline` only supports Together AI's fine-tuning API. The target model is Kimi-K2-Thinking (Moonshot AI). Together AI does support Kimi-K2, but the pipeline has no local training path, no Vast.ai integration despite having `deploy/vastai/` scripts, and no fallback. |
| TRN-2 | LoRA config is hardcoded, not optimized | 🟢 LOW | `lora_r=16, lora_alpha=32` are defaults. For a 32B-active MoE model with 43K records, these may be undertrained. No hyperparameter sweep infrastructure. |
| TRN-3 | DPO beta=0.1 is low for 43K records | 🟢 LOW | `dpo_beta=0.1` is standard for small datasets. With 43K records and a MoE model, a higher beta (0.3-0.5) may be needed to prevent reward hacking. |
| TRN-4 | No early stopping | 🟡 MEDIUM | `wait_for_completion()` polls until success/failure with a 2-hour timeout but no eval-loss-based early stopping. Overfitting risk on a small DPO set (882 records). |
| TRN-5 | `submit_training.py` and `train.py` scripts lack error recovery | 🟡 MEDIUM | No checkpointing, no resume capability. A timeout or API error means restarting from scratch. |
EVALUATION (`v3/eval/`)
| ID | Finding | Severity | Detail |
|---|---|---|---|
| EVL-1 | Regression tester only checks string patterns | 🟡 MEDIUM | `_check_constraints()` checks disallowed phrases and question endings. No semantic evaluation, no LLM-as-judge, no behavioral scoring. A model could satisfy all constraints while producing meaningless output. |
| EVL-2 | A/B comparison scoring is simplistic | 🟡 MEDIUM | `_score_response()` only penalizes permission phrases and rewards code blocks. Doesn't evaluate correctness, coherence, or style alignment. |
| EVL-3 | No human eval integration | 🟡 MEDIUM | Zero support for human annotation, blind evaluation, or inter-annotator agreement. |
FRAMEWORK (`framework/`)
| ID | Finding | Severity | Detail |
|---|---|---|---|
| FRM-1 | Framework module is architectural skeleton | 🟢 LOW | `framework/` contains `twin.py`, `trainer.py`, `pattern_memory.py`, etc. These define abstract interfaces (CognitiveTwin, PatternMemory, ReasoningEncoder) but are not used by the actual v3 pipeline. They're aspirational architecture, not production code. This is fine for a v3 system, but they inflate the apparent capability of the codebase. |
| FRM-2 | `_compat.py` imports from non-existent packages | 🟢 LOW | References `FunctionCall`, `ToolSchema`, `FunctionGemmaRuntime` from `cognitive_twin._compat` — these are compatibility stubs that define mock interfaces when the real packages aren't installed. Safe but confusing. |
---
1.3 Synthesis-Evil: Cross-Stage Data Quality
ATTACK 6: The 43K Number is Inflated
Finding: HIGH 🟠
Actual record counts from the data directory:
| Dataset | SFT Train | SFT Val | DPO Train | DPO Val | Total |
|---|---|---|---|---|---|
| ctv3_export_v5 | 38,189 | 4,244 | 666 | 74 | 43,173 |
| combined_v5_v8 | 34,573 | 3,842 | 793 | 89 | 39,297 |
| worms augmented | 3,867 | — | 48 | — | 3,915 |
The "43K" figure comes from ctv3_export_v5, but:
1. The combined_v5_v8 has fewer records (39K) — meaning ~4K records were pruned or deduplicated between v5 export and the merge. But there's no record of what was removed or why.
2. SFT massively outnumbers DPO: 38K SFT vs 740 DPO. The DPO signal is 50:1 underrepresented. For DPO to meaningfully shape behavior, you need thousands of preference pairs, not hundreds.
3. WORMS added only 15 new SFT records — the augmentation system essentially didn't work.
4. Zero repo_worm data — the code-grounded augmentation produced nothing.
Impact: The model will be overwhelmingly shaped by SFT (pattern imitation) with minimal DPO correction signal. The anti-permission-seeking behavior that DPO is supposed to enforce will be underrepresented.
---
ATTACK 7: Data Provenance Chain is Broken
Finding: HIGH 🟠
Following a record through the pipeline:
1. Ingestion: Raw conversations → `extracted/*.jsonl` (source-specific formats)
2. Corpus Surgery: Classified, rewritten → internal schema
3. Expansion v1-v8: LLM-generated augmentations → expansion-specific formats
4. WORMS: Paraphrases, ideal responses → worms_output format
5. Combined: Merged → `combined_v5_v8` (Together AI chat format `{messages: [...]}`)
6. Export: Final → `ctv3_export_v5` (CTv3.1 schema? Or Together AI format?)
The problem: no single record can be traced from final dataset back to source. The combined_v5_v8 records are in `{messages: [...]}` format with no schema_version, no record_id, no source info. All CTv3.1 metadata is stripped in the final export.
Additionally, the DPO records in combined_v5_v8 use `{input, preferred_output, non_preferred_output}` — a format that doesn't match ANY of the three DPO formats defined in the codebase.
---
ATTACK 8: Expansion Stages May Have Data Leakage
Finding: MEDIUM 🟡
Expansion stages v6-v8 use LLMs to generate synthetic data:
- v6: "cross-domain synthesis, pattern recognition, architecture chains"
- v7: "methods, process conversations, evolution self-description, tool building, meta DPO"
- v8: "deep conversations, session mining, RLM enhanced, cross-system"
These stages call LLMs with the user's real conversation data as context and ask for synthetic extensions. If the LLM memorizes and regurgitates specific conversation content, the same information appears in both train and eval — data leakage through generative augmentation.
The deduplicator only catches exact hash matches and session overlaps. Paraphrased duplicates (which are the entire point of v6-v8) would pass through.
---
PHASE 2 — DEP-2
6-Level Recursive Task Decomposition
Level 1: Structure ✅ PASS (with caveats)
- Module organization: Clean v3/ package structure with clear separation: `ingest/`, `corpus_surgery/`, `worms/`, `dataset/`, `eval/`, `pruning/`, `generators/`, `api/`, `tools/`
- Schema definition: CTv3.1 schema is well-defined in `schema.py` with proper dataclasses, enums, serialization
- Import graph: No circular imports detected. `_compat.py` handles optional dependencies gracefully
- ⚠️ Issue: `framework/` module is disconnected from `v3/` pipeline — architectural dead code
Level 2: Compilation ⚠️ CONDITIONAL PASS
- Type hints: Comprehensive throughout. Python 3.10+ syntax used consistently
- Dependencies: `pyproject.toml` declares proper optional dependency groups
- ⚠️ Issue: No `requirements.txt` or `uv.lock` — exact dependency resolution not pinned
- ⚠️ Issue: Many imports are guarded with try/except but fail silently, making it hard to know what's actually available at runtime
Level 3: Integration ❌ FAIL
- DPO format mismatch (EXP-1): Three incompatible DPO formats across pipeline stages
- WORMS Supabase dependency (WRM-1): Pipeline stages have hard runtime dependencies on external services with no fallback
- RepoWorm dead (WRM-3): Entire code-grounded pipeline produced zero output
- FunctionGemma mock masquerading as real (META-EVIL-2): Integration point pretends to work but degrades classification
Required Fixes:
1. Normalize all DPO data to a single format at pipeline entry. Add a `normalize_dpo_record(record: dict) -> dict` function that handles all three schemas
2. Add JSONL-file fallback path for ConversationWorm when Supabase is unavailable
3. Either fix RepoWorm or remove it from the stats/pipeline to avoid confusion
4. Gate FunctionGemma classification rules behind `if not self.use_mock`
Level 4: Content ⚠️ CONDITIONAL PASS
- Classification logic: Sophisticated multi-signal approach (stall/exec/blocked). Core design is sound
- Density scoring: Excellent composite scorer with good dimension separation
- WORMS design: ConversationWorm, RepoWorm, EnhancerAgent architecture is well-thought-out
- ⚠️ Issue: Classifier has no tests (SCR-5)
- ⚠️ Issue: Rewriter has no semantic fidelity check (META-EVIL-4)
- ⚠️ Issue: DPO dataset too small (740 records) to meaningfully train 32B-active model
Required Fixes:
1. Write 50+ test cases for the classifier covering all scoring dimensions and edge cases
2. Add `review_status: "auto_rewrite"` to all rewritten records and weight them at 0.5
3. Generate at minimum 5,000 DPO pairs using the expansion pipeline before training
Level 5: User Journey ⚠️ CONDITIONAL PASS
- CLI interfaces: Every pipeline stage has `argparse` CLI with sensible defaults
- Documentation: Extensive docs/ directory with 8 guides + paper
- README: Production-quality with architecture diagram, API reference, examples
- ⚠️ Issue: No end-to-end `make train` or `just run-all` command
- ⚠️ Issue: No configuration file support — all params are CLI args or hardcoded
Required Fix: Create a `config.yaml` system for pipeline configuration and a single `run_full_pipeline.py` orchestrator script.
Level 6: Deployment ❌ FAIL
- No CI/CD: No GitHub Actions, no automated testing
- No data validation gate: Data can flow to training without format/quality checks
- No model registry: Trained models are referenced by Together AI job IDs with no versioning
- No monitoring: Training metrics are polled but not persisted or alerted on
- Vast.ai scripts exist but are untested/unintegrated: `deploy/vastai/` is aspirational
Required Fixes:
1. Add GitHub Actions workflow: lint → type-check → test → validate-data
2. Add a data validation gate that runs `validate_export.py` before any training submission
3. Create a model registry file that tracks `{version, job_id, model_id, dataset_hash, metrics}`
---
DEP-2 Fix Summary
| Priority | Fix | Effort | Impact |
|---|---|---|---|
| P0 🔴 | Add exec_score gate to secondary unjustified rule | 1 line | Prevents false-positive classification |
| P0 🔴 | Normalize DPO formats across pipeline | 2 hours | Enables actual DPO training |
| P0 🔴 | Write classifier unit tests | 4 hours | Prevents regression on core logic |
| P1 🟠 | Disable FunctionGemma rules in mock mode | 30 min | Removes phantom classifications |
| P1 🟠 | Add semantic fidelity check to rewriter | 2 hours | Prevents hallucinated rewrites entering training |
| P1 🟠 | Generate 5K+ DPO pairs | 8 hours | Makes DPO training viable |
| P1 🟠 | Fix RepoWorm or remove it | 4 hours | Eliminates dead pipeline confusion |
| P2 🟡 | Add data lineage tracking | 4 hours | Enables debugging and reproducibility |
| P2 🟡 | Add JSONL fallback to ConversationWorm | 2 hours | Makes WORMS work without Supabase |
| P2 🟡 | Stratified train/val/test splits | 1 hour | Better evaluation reliability |
| P2 🟡 | Config file system | 3 hours | Better reproducibility |
| P3 🟢 | CI/CD pipeline | 4 hours | Standard engineering hygiene |
| P3 🟢 | Model registry | 2 hours | Track trained models |
---
PHASE 3 — Evolution
3.1 Product Vision: "Train Your Own Cognitive Twin"
The core insight: CognitiveTwin is a framework for creating personalized AI behavior models from conversation history. Today it's Mohamed's personal tool. Tomorrow it could be:
> "Upload your conversation exports. Get a LoRA adapter that makes any LLM think like you."
3.2 Product Architecture
┌─────────────────────────────────────────────────────────────┐
│ CognitiveTwin Cloud │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Upload │──▶│ Corpus │──▶│ WORMS │──▶│ Train │ │
│ │ Portal │ │ Surgery │ │ Augment │ │ Engine │ │
│ │ │ │ │ │ │ │ │ │
│ │ ChatGPT │ │ Classify │ │ Paraphrase│ │ SFT + │ │
│ │ Claude │ │ Rewrite │ │ Extend │ │ DPO │ │
│ │ Discord │ │ Score │ │ DPO Gen │ │ LoRA │ │
│ │ Slack │ │ Quarantine│ │ │ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ │
│ │ Dashboard │ │ Model │ │
│ │ Analytics │ │ Store │ │
│ │ Twin Edit │ │ Deploy │ │
│ └──────────┘ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘3.3 Onboarding Flow
Step 1: Export Your Conversations (5 minutes)
- User exports from ChatGPT (Settings → Data → Export)
- User exports from Claude (Account → Export Data)
- Optional: Discord bot token, Slack export
Step 2: Upload & Analyze (automated, ~2 minutes)
- Drag-and-drop ZIP/JSON files
- Pipeline runs corpus surgery automatically
- Dashboard shows: conversation count, density distribution, personality traits detected, stall patterns found
Step 3: Review Twin Profile (optional, 3 minutes)
- "Your Twin Profile" page shows:
- Communication style (direct vs. exploratory, terse vs. verbose)
- Expertise domains detected (code, research, planning, ops)
- Behavioral patterns (preference for X, tendency to Y)
- Permission-seeking score (how much the AI stalled with you)
- User can toggle: "I WANT my twin to ask clarifying questions sometimes" (adjusts question_policy)
- User can mark conversations as "Not me" (exclude from training)
Step 4: Train (automated, 30-60 minutes)
- Select base model tier:
- Starter (3B params): Fast, cheap, good for simple style matching
- Pro (8-32B active): Full cognitive twin with reasoning patterns
- Enterprise (70B+): Maximum fidelity, multi-domain mastery
- Training runs in cloud (Together AI / Vast.ai / RunPod)
- Real-time progress: loss curves, sample outputs at checkpoints
Step 5: Deploy & Use (immediate)
- API endpoint (OpenAI-compatible)
- Download LoRA adapter for local use (Ollama, vLLM, etc.)
- Integration with:
- Cursor / VS Code extension
- Discord bot
- Chrome extension (reply suggestion)
- Slack bot
- Apple Shortcuts
3.4 Pricing Model
| Tier | Price | What You Get |
|---|---|---|
| Free | $0 | Upload & analyze (dashboard only). No training. |
| Personal | $29/mo | 1 Twin, Starter model, 10K conversations, 3 retrains/month |
| Pro | $99/mo | 3 Twins, Pro model, 100K conversations, unlimited retrains, API access |
| Team | $49/seat/mo | Shared twins, team style guides, role-specific twins (CEO tone, eng tone) |
| Enterprise | Custom | Self-hosted, 70B+ models, SSO, compliance, SLA |
Usage-based add-ons:
- API inference: $0.001/1K tokens (pass-through + 20
- Extra training: $5/retrain (covers compute)
- Custom base model: $50/train (larger model fine-tuning)
3.5 What Makes This a Product (Not Just a Tool)
1. Network Effects: As more people train twins, the corpus surgery classifier gets better (aggregate anonymized stall patterns improve detection)
2. Data Flywheel: Each user's corrections to their twin generate more DPO training data
3. Platform Lock-in: Your twin's adapter only works with CognitiveTwin's deployment infra (or you can export it, but the retraining cadence keeps you subscribed)
4. Upsell Path: Free → dashboard addiction → "I want to actually USE this" → paid tier
3.6 Technical Evolution Requirements
To go from "Mohamed's personal pipeline" to "anyone's cognitive twin":
Must-Have for MVP:
- [ ] Web upload portal (drag-and-drop conversation exports)
- [ ] Multi-tenant pipeline (isolated user data, parallel training jobs)
- [ ] Auto-detect conversation format (ChatGPT vs Claude vs generic chat)
- [ ] Hosted inference endpoint with OpenAI-compatible API
- [ ] Dashboard: conversation analytics, twin profile, style metrics
Must-Have for Launch:
- [ ] User accounts (auth, billing, usage tracking)
- [ ] Data encryption at rest (user conversations are sensitive)
- [ ] GDPR compliance (data deletion, export)
- [ ] Rate limiting and abuse prevention
- [ ] Monitoring and alerting on training jobs
Nice-to-Have (v2):
- [ ] Real-time twin updates (process new conversations incrementally)
- [ ] Twin-to-twin comparison ("How does my twin differ from average?")
- [ ] Style transfer ("Make my twin sound more [professional/casual/technical]")
- [ ] Multi-language support
- [ ] Voice clone integration (text style + ElevenLabs voice = full digital twin)
- [ ] "Twin memories" — RAG++ integration for long-term context
3.7 Competitive Landscape
| Competitor | What They Do | CognitiveTwin Advantage |
|---|---|---|
| Character.ai | Create fictional characters | CT trains on YOUR real conversations. Authentic, not fictional. |
| Ditto AI | AI clone from social media | CT uses deep conversation data, not surface-level posts. |
| Personal.ai | Personal AI assistant | CT produces a LoRA adapter you OWN and can run locally. |
| Fine-tuning APIs (OpenAI, Together) | Raw fine-tuning | CT handles the entire pipeline: ingestion, cleaning, augmentation, training, eval. |
Moat: The corpus surgery + WORMS pipeline is the moat. Anyone can fine-tune a model, but CT's multi-signal classification, friction quarantine, and trajectory-aware DPO create better training data from the same raw conversations. The DATA QUALITY is the product.
3.8 Risk Analysis
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Users upload sensitive/illegal content | HIGH | HIGH | Content filtering on upload, ToS, automated scanning |
| Model outputs harmful content in user's "style" | MEDIUM | HIGH | Output guardrails, content policy enforcement on inference |
| Privacy breach (conversation data leaked) | LOW | CRITICAL | Encryption at rest, SOC2, minimal retention |
| Training costs exceed revenue | MEDIUM | MEDIUM | Start with LoRA-only (cheap), tier pricing covers compute |
| Users expect perfection from small data | HIGH | MEDIUM | Set expectations: "50+ conversations for good results, 500+ for great" |
---
ISSUE TRACKER
| # | Category | Severity | Title | Status |
|---|---|---|---|---|
| 1 | Classification | 🔴 CRITICAL | Secondary unjustified rule missing exec_score gate | OPEN |
| 2 | Integration | 🔴 CRITICAL | DPO format mismatch across 3 pipeline stages | OPEN |
| 3 | Testing | 🔴 CRITICAL | Zero unit tests for classifier | OPEN |
| 4 | WORMS | 🔴 CRITICAL | RepoWorm produced 0 records — dead pipeline | OPEN |
| 5 | Classification | 🟠 HIGH | FunctionGemma mock mode corrupts fusion scoring | OPEN |
| 6 | Rewriter | 🟠 HIGH | No semantic fidelity check on rewrites | OPEN |
| 7 | Data | 🟠 HIGH | Only 740 DPO pairs for 32B model training | OPEN |
| 8 | Data | 🟠 HIGH | WORMS multiplier 1.0x — augmentation barely worked | OPEN |
| 9 | Data | 🟠 HIGH | No lineage tracking — records untraceable | OPEN |
| 10 | Scoring | 🟡 MEDIUM | Stall score double-counting inflates values | OPEN |
| 11 | Scoring | 🟡 MEDIUM | `ends_with_question()` over-fires on declarations | OPEN |
| 12 | Ingestion | 🟡 MEDIUM | Silent turn dropping on malformed exports | OPEN |
| 13 | Ingestion | 🟡 MEDIUM | No pagination in Supabase extractor | OPEN |
| 14 | Export | 🟡 MEDIUM | No stratified train/val/test splits | OPEN |
| 15 | Training | 🟡 MEDIUM | No early stopping — overfitting risk on small DPO | OPEN |
| 16 | Eval | 🟡 MEDIUM | Regression tester only checks string patterns | OPEN |
| 17 | Deploy | 🟡 MEDIUM | No CI/CD, no automated quality gates | OPEN |
| 18 | WORMS | 🟡 MEDIUM | ConversationWorm hard-depends on Supabase | OPEN |
| 19 | WORMS | 🟡 MEDIUM | `_get_original_response()` always returns None | OPEN |
| 20 | Data | 🟡 MEDIUM | Expansion v6-v8 potential data leakage | OPEN |
---
VERDICT
What's Actually Good
1. The classification system design is excellent. Three-signal scoring (stall/exec/blocked) with directive completeness and question policy is a genuinely novel approach. The density scorer's 4-dimension model is well-thought-out. This is publishable research.
2. The CTv3.1 schema is production-quality. Comprehensive type definitions, proper enums, clean serialization. This is the kind of schema that can power a product.
3. The architecture is extensible. The WORMS system (ConversationWorm + RepoWorm + EnhancerAgent) is a great design even though the current implementation underdelivers on data multiplication.
4. Documentation is exceptional. 8 docs + paper + detailed README with architecture diagrams. This is ready for external contributors.
5. The 38K SFT records are real value. Dense, scored, filtered conversation data from actual usage. This is hard to recreate and forms a genuine training corpus.
What Needs Fixing Before Training
1. Fix the classifier (Issues #1, #5, #3): The core classification logic has a false-positive path (secondary rule), mock data influencing production scores, and zero tests. This affects every record.
2. Fix DPO format (Issue #2): Cannot train DPO if the data format doesn't match what the training pipeline expects.
3. Generate more DPO data (Issue #7): 740 pairs is insufficient. Target 5K minimum.
4. Decide on RepoWorm (Issue #4): Either fix it or remove it. Dead pipelines erode trust.
Training Readiness Score: 6/10
The SFT data is solid. The DPO data is critically undersized and format-broken. The classification layer that produces both has bugs that need fixing. Fix Issues #1-7 and this jumps to 8/10.
Product Readiness Score: 3/10
Excellent foundation, but needs web portal, multi-tenancy, auth, billing, and data privacy before it's a product. The technical pipeline is 70
---
Generated by Evo-Cube analysis. Reviewed 93 Python files, 47K LOC, 43K training records, 8 expansion stages, 6 pipeline phases.
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
Comp-Core/packages/cognitive-twin/EVOCUBE_REPORT.md
Detected Structure
Method · Evaluation · References · Figures · Code Anchors · Architecture