
EVO-CUBE REPORT: CognitiveTwin Pipeline

**Date:** 2025-07-18 **Codebase:** `Desktop/Comp-Core/packages/cognitive-twin/` (93 Python files, ~47K LOC) **Data:** 43K records across 8 expansion stages + combined_v5_v8 final dataset **Target Model:** Kimi-K2-Thinking (MoE-1T, 32B active params)



### CEF + DEP-2 + Evolution — Full Audit


---

TABLE OF CONTENTS

1. [Phase 1 — CEF (Critique–Evil–Find)](#phase-1--cef)
   - [Meta-Evil: Pipeline-Level Attacks](#11-meta-evil-pipeline-level-attacks)
   - [Chunk-Evil: Per-Module Attacks](#12-chunk-evil-per-module-attacks)
   - [Synthesis-Evil: Cross-Stage Data Quality](#13-synthesis-evil-cross-stage-data-quality)
2. [Phase 2 — DEP-2 (6-Level RTD + Fixes)](#phase-2--dep-2)
3. [Phase 3 — Evolution](#phase-3--evolution)
4. [Issue Tracker](#issue-tracker)
5. [Verdict](#verdict)

---

PHASE 1 — CEF

1.1 Meta-Evil: Pipeline-Level Attacks

ATTACK 1: Classification Threshold Regression → Silent Data Corruption

Finding: CRITICAL 🔴

In `corpus_surgery/constants.py`, the stall threshold was lowered from 3 → 1:

```python
# Original: 3, 1, 0 - too strict, only caught 5/2177
STALL_THRESHOLD_UNJUSTIFIED = 1  # Any stalling pattern
```

The comment says "only caught 5/2177", so the threshold was weakened. But lowering it to 1 means any message ending with a question mark (stall_score=1) now meets the stall condition, including messages like "Here is the implementation. Does this approach work?" that scored exec=3. The primary rule is still protected by the `exec_score == 0` check in `is_unjustified()`, but the secondary rule bypasses exec entirely:

```python
# Secondary rule: strong permission phrase + ends with ? + high completeness
if (ends_with_question(text) and
    has_strong_permission and
    directive_completeness >= DIRECTIVE_HIGH_THRESHOLD):
    return True
```

This secondary rule has no exec_score gate — a response that contains "should we" (strong permission phrase) PLUS code PLUS question mark will be classified UNJUSTIFIED even though it executed. This corrupts the corpus surgery stage by flagging legitimate clarifications-after-execution as unjustified.

Impact: Potential false-positive rate on valid assistant turns estimated at 8-15%.

Fix: Add `and exec_score == 0` to the secondary rule. Alternatively, set the stall threshold to 2 as a middle ground between 1 and 3.
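
A minimal sketch of the gated rule, assuming `is_unjustified()` receives precomputed scores. The signature, the primary-rule shape, and the `DIRECTIVE_HIGH_THRESHOLD` value of 0.7 are assumptions for illustration; the only change proposed by the fix is that the secondary rule is reached only when `exec_score == 0`.

```python
STALL_THRESHOLD_UNJUSTIFIED = 1  # value quoted from constants.py
DIRECTIVE_HIGH_THRESHOLD = 0.7   # assumed value, for illustration only


def is_unjustified(
    stall_score: int,
    exec_score: int,
    ends_with_q: bool,
    has_strong_permission: bool,
    directive_completeness: float,
) -> bool:
    """Hypothetical gated classifier: both rules require exec_score == 0."""
    # A response that actually executed is never classified UNJUSTIFIED.
    if exec_score != 0:
        return False
    # Primary rule (assumed shape): stalling with no execution.
    if stall_score >= STALL_THRESHOLD_UNJUSTIFIED:
        return True
    # Secondary rule, now only reachable when exec_score == 0.
    return (
        ends_with_q
        and has_strong_permission
        and directive_completeness >= DIRECTIVE_HIGH_THRESHOLD
    )
```

With this gate, a response that contains code, a strong permission phrase, and a trailing question mark is no longer flagged, while the same response with no execution still is.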

---

ATTACK 2: FunctionGemma Integration is Dead Code in Production

Finding: HIGH 🟠

The `functiongemma_scorer.py` module is sophisticated (400+ lines) but always falls back to mock mode in production:

```python
def __init__(self, ..., use_mock: bool = False):
    ...
    if self._model_path is None:
        logger.warning("No model path provided, using mock mode")
        self.use_mock = True
```

No model path is ever configured in any script or config file. The mock parser uses keyword matching to estimate parsability: it is a heuristic standing in for ML inference. The classifier's fusion logic weights parsability at 60%:

```python
fused_completeness = (
    0.4 * directive_completeness +
    0.6 * parsability_score * parsability_info.confidence
)
```

When mock confidence is 0.3-0.7 (hardcoded), this drags the fused score down compared to pure heuristic. The FunctionGemma rule `parsability_score >= 0.8 and stall_score >= 2` can still fire from mock data, creating phantom classifications.

Impact: Classification operates on heuristic-mock data pretending to be ML-scored data. Fusion logic degrades rather than improves accuracy.

Fix: Either (a) ship a real FunctionGemma model and integrate it, or (b) remove the fusion logic and disable the FunctionGemma classification rule when in mock mode. Don't let mock data influence production classifications.
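
Option (b) could look roughly like the sketch below. The function names and signatures are hypothetical; the 0.4/0.6 weights and the `parsability_score >= 0.8 and stall_score >= 2` rule are the ones quoted in this finding.

```python
def fuse_completeness(
    directive_completeness: float,
    parsability_score: float,
    confidence: float,
    use_mock: bool,
) -> float:
    """Return the fused score, ignoring parsability when the scorer is mocked."""
    if use_mock:
        # Mock output is keyword matching, so fall back to the pure heuristic.
        return directive_completeness
    return 0.4 * directive_completeness + 0.6 * parsability_score * confidence


def functiongemma_rule_fires(
    parsability_score: float, stall_score: int, use_mock: bool
) -> bool:
    """FunctionGemma classification rule, disabled in mock mode."""
    return (not use_mock) and parsability_score >= 0.8 and stall_score >= 2
```

For a message with `directive_completeness = 0.8`, `parsability_score = 0.9`, and a mock confidence of 0.5, the unguarded fusion yields 0.32 + 0.27 = 0.59, which illustrates the downward drag described above; the guarded version returns 0.8.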

---

ATTACK 3: Stall Score Double-Counting

Finding: MEDIUM 🟡

The `compute_stall_score()` function counts every matching phrase independently, but many phrases overlap:

  • "i'm sorry, but i can't" matches `REFUSAL_PHRASES` (+4), and near-identical apology patterns also appear in `CLARIFICATION_PREAMBLES` (the match on "i'm sorry" is not exact, but the pattern sets overlap)
  • "would you like me to" (+3) can co-occur with "sound good?" (+3, another strong permission phrase), so one permission-seeking intent is counted twice

A single response saying "I apologize, but I cannot do that. Would you like me to try a different approach? Should I proceed?" would score: 4 (refusal) + 3 (would you like) + 3 (should I) + 1 (question mark) = 11. The threshold is 1. This response is definitely bad, but the score is meaninglessly inflated, making it impossible to use score magnitude for severity calibration or confidence weighting in DPO pairs.

Impact: Score inflation prevents meaningful gradient between "slightly stalling" and "completely refusing." DPO confidence could be calibrated to stall_score magnitude but currently isn't usable for this.

Fix: Either (a) deduplicate phrase matches (only count highest-scoring match per semantic category), or (b) cap stall_score at 10, or (c) normalize stall_score to [0,1] range for downstream use.
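
The three options can be combined. The sketch below assumes phrase matches are available as `(category, weight)` pairs; the category names are hypothetical and not taken from `compute_stall_score()`.

```python
STALL_SCORE_CAP = 10  # cap suggested in option (b)

# One (category, weight) pair per matched phrase; category names are hypothetical.
Match = tuple[str, int]


def dedup_stall_score(matches: list[Match]) -> int:
    """Count only the highest-weight match in each category, then cap."""
    best: dict[str, int] = {}
    for category, weight in matches:
        best[category] = max(best.get(category, 0), weight)
    return min(sum(best.values()), STALL_SCORE_CAP)


def normalized_stall_score(matches: list[Match]) -> float:
    """Map the capped score to [0, 1] for downstream confidence weighting."""
    return dedup_stall_score(matches) / STALL_SCORE_CAP


# The example response above: one refusal, two permission phrases, one question mark.
example = [("refusal", 4), ("permission", 3), ("permission", 3), ("question_mark", 1)]
```

For `example`, the deduplicated score is 4 + 3 + 1 = 8 instead of 11, and the normalized score is 0.8.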

---

ATTACK 4: Rewriter Has No Semantic Fidelity Check

Finding: HIGH 🟠

The `rewrite_assistant_turn()` function validates rewrites for:
- No question mark at end ✓
- No permission phrases ✓
- Required artifacts present ✓
- Format compliance ✓

But there is no check that the rewrite is semantically faithful to the user's request. The rewriter calls GPT to transform permission-seeking responses into direct execution — but nothing verifies the generated content is correct. A rewrite could hallucinate a completely wrong implementation, and it would pass validation because it has a code block and doesn't end with a question.

Impact: Hallucinated rewrites enter SFT training data as "gold" examples. The model learns to produce confident-sounding wrong answers instead of correct permission-seeking ones. This is worse than the original problem.

Fix: Add a semantic similarity check between the rewrite and the user's request (embedding cosine similarity > 0.6). Or add a "code compiles" check for code rewrites. Or flag all rewrites as `review_status: auto_rewrite` and weight them lower (0.5) in training.
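
A minimal sketch of the cheapest two options, assuming rewrites embed Python in fenced blocks tagged `python`. It only checks syntax with `ast.parse`, so it catches malformed code but not wrong implementations. The `train_weight` field and the `needs_review` status are hypothetical names.

```python
import ast
import re

FENCE = "`" * 3  # built programmatically to avoid a literal fence in this example
_PY_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def python_blocks_parse(text: str) -> bool:
    """True if every fenced Python block in `text` is syntactically valid."""
    for block in _PY_BLOCK.findall(text):
        try:
            ast.parse(block)
        except SyntaxError:
            return False
    return True


def tag_rewrite(record: dict, rewrite_text: str) -> dict:
    """Return a copy of `record` marked as an automatic rewrite."""
    tagged = dict(record)
    if python_blocks_parse(rewrite_text):
        tagged["review_status"] = "auto_rewrite"
        tagged["train_weight"] = 0.5  # down-weighted, as proposed in the fix
    else:
        tagged["review_status"] = "needs_review"  # hypothetical status
        tagged["train_weight"] = 0.0
    return tagged
```

A semantic check (for example, embedding similarity against the user's request) would still be needed to catch rewrites that parse cleanly but answer the wrong question.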

---

ATTACK 5: No Data Versioning or Lineage Tracking

Finding: MEDIUM 🟡

The pipeline has 8 expansion stages (v1-v8), combined datasets, augmented datasets, and multiple export versions (ctv3_export through ctv3_export_v5). But there is no mechanism to:
- Track which source records end up in which final dataset
- Know if a record was rewritten, augmented, or original
- Reproduce a specific dataset version
- Audit data flow from ingestion → scoring → WORMS → export

The `record_id` field uses random UUIDs, so there's no parent-child linkage. A record generated by conversation_worm has `source.origin = "convo_worm"` but no reference to which original conversation it branched from (the field `source_id` exists but is just the conversation_id, not the specific turn).

Impact: Cannot debug training failures back to data issues. Cannot identify which augmentation stage introduced problematic records. Dataset reproducibility is zero.

Fix: Add a `lineage` field to CTv3Record: `{parent_id: str, stage: str, transform: str}`. Chain parent_ids through pipeline stages. Log every record transformation.
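
One way the `lineage` field could be shaped is sketched below. The three field names come from the fix above; the deterministic ID derivation is an additional assumption (the current pipeline uses random UUIDs), included because it makes dataset versions reproducible.

```python
import hashlib
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Lineage:
    parent_id: Optional[str]  # None for records taken directly from ingestion
    stage: str                # e.g. "ingest", "corpus_surgery", "convo_worm"
    transform: str            # e.g. "original", "rewrite", "paraphrase"


def derive_record_id(lineage: Lineage, content: str) -> str:
    """Deterministic ID: identical lineage and content give the same ID."""
    payload = "|".join(
        [lineage.parent_id or "", lineage.stage, lineage.transform, content]
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


# Chain a rewritten record back to its ingested parent.
root = Lineage(parent_id=None, stage="ingest", transform="original")
root_id = derive_record_id(root, "user turn text")
child = Lineage(parent_id=root_id, stage="corpus_surgery", transform="rewrite")
child_id = derive_record_id(child, "rewritten assistant turn")
```

Walking `parent_id` links from any exported record then leads back to the ingested source turn.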

---

1.2 Chunk-Evil: Per-Module Attacks

INGESTION (`v3/ingest/`)

| ID | Finding | Severity | Detail |
|---|---|---|---|
| ING-1 | Claude JSON parser assumes specific export format | 🟡 MEDIUM | `claude_json.py` and `openai_json.py` parse specific export schemas. There is no schema validation: malformed exports silently drop turns rather than erroring. |
| ING-2 | Deduplicator hash truncation | 🟢 LOW | `sha256.hexdigest()[:16]` keeps 64 bits of the digest. Collision probability grows with corpus size, but at 43K records it is far below the estimated <0.1% (the birthday bound gives roughly 5 × 10⁻¹¹), and it stays negligible well past ~100K conversations. |
| ING-3 | Normalizer doesn't handle multi-modal content | 🟡 MEDIUM | Image/file attachments in conversations are silently dropped. No flag indicates that content was lost, so the model trains on incomplete context. |
| ING-4 | Supabase extractor has no pagination | 🟡 MEDIUM | `.limit(limit * 10)` is a guess, not proper cursor-based pagination. It will miss data in large corpora. |
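
The ING-2 estimate can be checked with the standard birthday-bound approximation. This is a generic calculation, not code from the deduplicator.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday-bound approximation: 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n_items * (n_items - 1) / float(2 ** (hash_bits + 1))
    return -math.expm1(exponent)  # equals 1 - exp(exponent), stable for tiny values


p_64bit = collision_probability(43_000, 64)  # 16 hex characters
p_32bit = collision_probability(43_000, 32)  # 8 hex characters, for comparison
```

At 43K records the 64-bit truncation gives a probability on the order of 10⁻¹¹, while a 32-bit truncation would already be near 19%, so the current 16-character prefix leaves ample headroom.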

SCORING (`v3/corpus_surgery/`)

| ID | Finding | Severity | Detail |
|---|---|---|---|
| SCR-1 | `ends_with_question()` over-fires | 🟡 MEDIUM | Checks whether the last sentence starts with "is", "are", or "will". This catches declaratives like "This is the implementation." if preceded by a sentence ending with no period. Markdown headers like `## What is X` would also fire. |
| SCR-2 | `check_missing_input()` false positives | 🟡 MEDIUM | Triggers on the "refactor" keyword plus no code block, but the code might be in an attachment or referenced by file path. The `has_file_path` regex `[/\\]+\.\w+` doesn't match many real paths (e.g., `Desktop/...` with tildes). |
| SCR-3 | `directive_completeness` maxes at 0.80 | 🟢 LOW | Max achievable: +0.35 (verb) + 0.25 (format) + 0.20 (inputs) = 0.80. No message can score 1.0, and negative terms only lower the total. The 0.7 threshold for the `no_questions` policy is reachable but leaves a thin margin. |
| SCR-4 | `compute_blocked_score()` can go negative | 🟡 MEDIUM | `FORMAT_SPECIFIED_BONUS = -1` and `USER_ASKED_OPTIONS_BONUS = -2`. Starting from 0, a high-completeness message with format spec + options asked reaches -3 before `max(0, score)` clips it. If `check_missing_input` fires (+3) and a format is specified (-1), the interaction is non-intuitive. |
| SCR-5 | No unit tests for classifier | 🔴 CRITICAL | The classifier has 3 hardcoded test cases in `__main__` but zero pytest tests. Classification logic changed (threshold 3→1, FunctionGemma rule added) with no regression testing, so correctness cannot be verified. |

WORMS (`v3/worms/`)

| ID | Finding | Severity | Detail |
|---|---|---|---|
| WRM-1 | ConversationWorm depends on Supabase at runtime | 🟡 MEDIUM | `_load_conversation()` requires Supabase. There is no local-file fallback for the main processing path. `process_conversation()` returns empty if there is no Supabase client, silently producing zero output. |
| WRM-2 | WORMS multiplier is 1.0x | 🟡 MEDIUM | `worms_stats.json` shows total_sft=3867, orig_sft=3852: only 15 new conversation branches were generated. The WORMS augmentation pipeline added 0.4%. |
| WRM-3 | RepoWorm generated 0 records | 🔴 CRITICAL | `repo_sft: 0, repo_dpo: 0` in worms_stats. The entire code-grounded training data pipeline produced nothing. Either no repos were scanned, or the CodeScanner/TaskGenerator found no tasks. This is an entire dead pipeline. |
| WRM-4 | DPO generator hardcodes "gpt-5.2" as provider | 🟢 LOW | `create_sft_record()` sets `provider: "gpt-5.2"` regardless of the actual generation model. Stats show `model: "gemini-2.0-flash"` was used, so the metadata misstates provenance. |
| WRM-5 | `_get_original_response()` always returns None | 🟡 MEDIUM | In `ConversationWormPipeline`, this method returns `None` with a comment "This would typically look up the original response from Supabase." Every DPO pair that depends on finding the original dispreferred response fails silently. |

EXPORT & DATASET (`v3/dataset/`)

| ID | Finding | Severity | Detail |
|---|---|---|---|
| EXP-1 | DPO format mismatch between pipeline stages | 🔴 CRITICAL | Corpus surgery exports DPO as `{prompt, chosen, rejected}` (Together AI format). Combined v5_v8 data uses `{input, preferred_output, non_preferred_output}`. The `DataPreparer.prepare_dpo_data()` expects `{candidates.preferred.assistant_content}` (CTv3.1 schema format). These three formats are incompatible. No adapter exists to normalize them. |
| EXP-2 | No format validation on final export | 🟡 MEDIUM | `validate_export.py` exists (724 lines) but there's no evidence it's integrated into the pipeline. Data can be exported in an incompatible format and only discovered when training fails. |
| EXP-3 | 80/10/10 split has no stratification | 🟡 MEDIUM | `DatasetSplit` does random shuffle + split. No stratification by source, domain, task_type, or quality tier. Training set could over-represent one expansion stage and under-represent another. |
| EXP-4 | Eval set has no coverage guarantee | 🟡 MEDIUM | Eval records are exported as-is with no check for coverage of all failure modes, task types, or domains. |
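
For EXP-3, a stratified 80/10/10 split can be sketched with the standard library alone. This is not the `DatasetSplit` implementation; the `key` callable and the `stage` field in the usage below are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(records, key, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split each stratum by `ratios` so every split sees every stratum."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    train, val, test = [], [], []
    for group in strata.values():
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to test
    return train, val, test


# Usage with a hypothetical `stage` field: 100 records from v5, 20 from v8.
records = [{"stage": "v5", "i": i} for i in range(100)]
records += [{"stage": "v8", "i": i} for i in range(20)]
train, val, test = stratified_split(records, key=lambda r: r["stage"])
```

Each expansion stage then contributes to train, validation, and test in the same proportion, which a plain shuffle does not guarantee for small strata.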

TRAINING (`v3/pipeline.py`, `scripts/`)

| ID | Finding | Severity | Detail |
|---|---|---|---|
| TRN-1 | Training pipeline is Together AI-only | 🟡 MEDIUM | `V3TrainingPipeline` only supports Together AI's fine-tuning API. The target model is Kimi-K2-Thinking (Moonshot AI). Together AI does support Kimi-K2, but the pipeline has no local training path, no Vast.ai integration despite having `deploy/vastai/` scripts, and no fallback. |
| TRN-2 | LoRA config is hardcoded, not optimized | 🟢 LOW | `lora_r=16, lora_alpha=32` are defaults. For a 32B-active MoE model with 43K records, these may leave the model undertrained. No hyperparameter sweep infrastructure. |
| TRN-3 | DPO beta=0.1 is low for 43K records | 🟢 LOW | `dpo_beta=0.1` is standard for small datasets. With 43K records and a MoE model, a higher beta (0.3-0.5) may be needed to prevent reward hacking. |
| TRN-4 | No early stopping | 🟡 MEDIUM | `wait_for_completion()` polls until success/failure with a 2-hour timeout but no eval-loss-based early stopping. Overfitting risk on a small DPO set (882 records). |
| TRN-5 | `submit_training.py` and `train.py` scripts lack error recovery | 🟡 MEDIUM | No checkpointing, no resume capability. A timeout or API error means restarting from scratch. |

EVALUATION (`v3/eval/`)

| ID | Finding | Severity | Detail |
|---|---|---|---|
| EVL-1 | Regression tester only checks string patterns | 🟡 MEDIUM | `_check_constraints()` checks disallowed phrases and question endings. No semantic evaluation, no LLM-as-judge, no behavioral scoring. A model could satisfy all constraints while producing meaningless output. |
| EVL-2 | A/B comparison scoring is simplistic | 🟡 MEDIUM | `_score_response()` only penalizes permission phrases and rewards code blocks. Doesn't evaluate correctness, coherence, or style alignment. |
| EVL-3 | No human eval integration | 🟡 MEDIUM | Zero support for human annotation, blind evaluation, or inter-annotator agreement. |

FRAMEWORK (`framework/`)

| ID | Finding | Severity | Detail |
|---|---|---|---|
| FRM-1 | Framework module is architectural skeleton | 🟢 LOW | `framework/` contains `twin.py`, `trainer.py`, `pattern_memory.py`, etc. These define abstract interfaces (CognitiveTwin, PatternMemory, ReasoningEncoder) but are not used by the actual v3 pipeline. They're aspirational architecture, not production code. This is fine for a v3 system, but they inflate the apparent capability of the codebase. |
| FRM-2 | `_compat.py` imports from non-existent packages | 🟢 LOW | References `FunctionCall`, `ToolSchema`, `FunctionGemmaRuntime` from `cognitive_twin._compat`. These are compatibility stubs that define mock interfaces when the real packages aren't installed. Safe but confusing. |

---

1.3 Synthesis-Evil: Cross-Stage Data Quality

ATTACK 6: The 43K Number is Inflated

Finding: HIGH 🟠

Actual record counts from the data directory:

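| Dataset | SFT Train | SFT Val | DPO Train | DPO Val | Total |
|---|---|---|---|---|---|
| ctv3_export_v5 | 38,189 | 4,244 | 666 | 74 | 43,173 |
| combined_v5_v8 | 34,573 | 3,842 | 793 | 89 | 39,297 |
| worms augmented | 3,867 | | 48 | | 3,915 |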

The "43K" figure comes from ctv3_export_v5, but:
1. The combined_v5_v8 has fewer records (39K) — meaning ~4K records were pruned or deduplicated between v5 export and the merge. But there's no record of what was removed or why.
2. SFT massively outnumbers DPO: 42,433 SFT records (38,189 train) vs 740 DPO pairs, a ratio of roughly 57:1. For DPO to meaningfully shape behavior, you need thousands of preference pairs, not hundreds.
3. WORMS added only 15 new SFT records — the augmentation system essentially didn't work.
4. Zero repo_worm data — the code-grounded augmentation produced nothing.

Impact: The model will be overwhelmingly shaped by SFT (pattern imitation) with minimal DPO correction signal. The anti-permission-seeking behavior that DPO is supposed to enforce will be underrepresented.
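
The totals and the SFT-to-DPO imbalance can be recomputed directly from the counts in the table above. The dictionary layout is only for illustration.

```python
# Counts copied from the table above: (sft_train, sft_val, dpo_train, dpo_val).
COUNTS = {
    "ctv3_export_v5": (38_189, 4_244, 666, 74),
    "combined_v5_v8": (34_573, 3_842, 793, 89),
}


def summarize(name: str) -> dict:
    """Aggregate train and val counts and compute SFT records per DPO pair."""
    sft_train, sft_val, dpo_train, dpo_val = COUNTS[name]
    sft, dpo = sft_train + sft_val, dpo_train + dpo_val
    return {"sft": sft, "dpo": dpo, "total": sft + dpo, "sft_per_dpo": sft / dpo}


v5 = summarize("ctv3_export_v5")
combined = summarize("combined_v5_v8")
dropped = v5["total"] - combined["total"]  # records lost between export and merge
```

This gives 740 DPO pairs in ctv3_export_v5 and 882 in combined_v5_v8, with 3,876 fewer records overall after the merge.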

---

ATTACK 7: Data Provenance Chain is Broken

Finding: HIGH 🟠

Following a record through the pipeline:

1. Ingestion: Raw conversations → `extracted/*.jsonl` (source-specific formats)
2. Corpus Surgery: Classified, rewritten → internal schema
3. Expansion v1-v8: LLM-generated augmentations → expansion-specific formats
4. WORMS: Paraphrases, ideal responses → worms_output format
5. Combined: Merged → `combined_v5_v8` (Together AI chat format `{messages: [...]}`)
6. Export: Final → `ctv3_export_v5` (CTv3.1 schema? Or Together AI format?)

The problem: no single record can be traced from final dataset back to source. The combined_v5_v8 records are in `{messages: [...]}` format with no schema_version, no record_id, no source info. All CTv3.1 metadata is stripped in the final export.

Additionally, the DPO records in combined_v5_v8 use `{input, preferred_output, non_preferred_output}`, which matches neither the `{prompt, chosen, rejected}` format exported by corpus surgery nor the CTv3.1 schema that `DataPreparer.prepare_dpo_data()` expects (see EXP-1).

---

ATTACK 8: Expansion Stages May Have Data Leakage

Finding: MEDIUM 🟡

Expansion stages v6-v8 use LLMs to generate synthetic data:
- v6: "cross-domain synthesis, pattern recognition, architecture chains"
- v7: "methods, process conversations, evolution self-description, tool building, meta DPO"
- v8: "deep conversations, session mining, RLM enhanced, cross-system"

These stages call LLMs with the user's real conversation data as context and ask for synthetic extensions. If the LLM memorizes and regurgitates specific conversation content, the same information appears in both train and eval — data leakage through generative augmentation.

The deduplicator only catches exact hash matches and session overlaps. Paraphrased duplicates (which are the entire point of v6-v8) would pass through.
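
A lexical near-duplicate check between train and eval could look like the sketch below. It is a brute-force illustration using word-trigram Jaccard similarity; the 0.5 threshold is an assumed starting point, and heavy paraphrases would need an embedding-based comparison instead.

```python
def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams; short texts collapse to a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def leaked_eval_indices(train_texts, eval_texts, threshold=0.5, n=3):
    """Indices of eval texts whose shingle overlap with any train text is high."""
    train_sets = [shingles(t, n) for t in train_texts]
    leaked = []
    for idx, text in enumerate(eval_texts):
        candidate = shingles(text, n)
        if any(jaccard(candidate, t) >= threshold for t in train_sets):
            leaked.append(idx)
    return leaked


train_texts = ["the quick brown fox jumps over the lazy dog today"]
eval_texts = [
    "the quick brown fox jumps over the lazy dog now",
    "completely unrelated sentence about training pipelines and data",
]
flagged = leaked_eval_indices(train_texts, eval_texts)
```

The comparison is quadratic in corpus size, so at 43K records a MinHash or embedding index would be the practical choice; the sketch only shows the criterion.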

---

PHASE 2 — DEP-2

6-Level Recursive Task Decomposition

Level 1: Structure ✅ PASS (with caveats)

  • Module organization: Clean v3/ package structure with clear separation: `ingest/`, `corpus_surgery/`, `worms/`, `dataset/`, `eval/`, `pruning/`, `generators/`, `api/`, `tools/`
  • Schema definition: CTv3.1 schema is well-defined in `schema.py` with proper dataclasses, enums, serialization
  • Import graph: No circular imports detected. `_compat.py` handles optional dependencies gracefully
  • ⚠️ Issue: `framework/` module is disconnected from `v3/` pipeline — architectural dead code

Level 2: Compilation ⚠️ CONDITIONAL PASS

  • Type hints: Comprehensive throughout. Python 3.10+ syntax used consistently
  • Dependencies: `pyproject.toml` declares proper optional dependency groups
  • ⚠️ Issue: No `requirements.txt` or `uv.lock` — exact dependency resolution not pinned
  • ⚠️ Issue: Many imports are guarded with try/except but fail silently, making it hard to know what's actually available at runtime

Level 3: Integration ❌ FAIL

  • DPO format mismatch (EXP-1): Three incompatible DPO formats across pipeline stages
  • WORMS Supabase dependency (WRM-1): Pipeline stages have hard runtime dependencies on external services with no fallback
  • RepoWorm dead (WRM-3): Entire code-grounded pipeline produced zero output
  • FunctionGemma mock masquerading as real (META-EVIL-2): Integration point pretends to work but degrades classification

Required Fixes:
1. Normalize all DPO data to a single format at pipeline entry. Add a `normalize_dpo_record(record: dict) -> dict` function that handles all three schemas
2. Add JSONL-file fallback path for ConversationWorm when Supabase is unavailable
3. Either fix RepoWorm or remove it from the stats/pipeline to avoid confusion
4. Gate FunctionGemma classification rules behind `if not self.use_mock`
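
A sketch of the adapter named in fix 1. The target format `{prompt, chosen, rejected}` is an arbitrary choice for illustration, and for the CTv3.1 branch the `input` and `dispreferred` field names are assumptions, since only `candidates.preferred.assistant_content` is quoted in this report.

```python
def normalize_dpo_record(record: dict) -> dict:
    """Map any of the three observed DPO schemas to {prompt, chosen, rejected}."""
    if {"prompt", "chosen", "rejected"} <= record.keys():
        return {k: record[k] for k in ("prompt", "chosen", "rejected")}
    if {"input", "preferred_output", "non_preferred_output"} <= record.keys():
        return {
            "prompt": record["input"],
            "chosen": record["preferred_output"],
            "rejected": record["non_preferred_output"],
        }
    if "candidates" in record:
        candidates = record["candidates"]
        return {
            # "input" and "dispreferred" are assumed field names.
            "prompt": record["input"],
            "chosen": candidates["preferred"]["assistant_content"],
            "rejected": candidates["dispreferred"]["assistant_content"],
        }
    # Fail loudly so malformed records never reach training.
    raise ValueError(f"Unrecognized DPO schema with keys: {sorted(record)}")
```

Raising on unknown schemas is deliberate: silent pass-through is how the current mismatch went unnoticed until training time.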

Level 4: Content ⚠️ CONDITIONAL PASS

  • Classification logic: Sophisticated multi-signal approach (stall/exec/blocked). Core design is sound
  • Density scoring: Excellent composite scorer with good dimension separation
  • WORMS design: ConversationWorm, RepoWorm, EnhancerAgent architecture is well-thought-out
  • ⚠️ Issue: Classifier has no tests (SCR-5)
  • ⚠️ Issue: Rewriter has no semantic fidelity check (META-EVIL-4)
  • ⚠️ Issue: DPO dataset too small (740 records) to meaningfully train 32B-active model

Required Fixes:
1. Write 50+ test cases for the classifier covering all scoring dimensions and edge cases
2. Add `review_status: "auto_rewrite"` to all rewritten records and weight them at 0.5
3. Generate at minimum 5,000 DPO pairs using the expansion pipeline before training

Level 5: User Journey ⚠️ CONDITIONAL PASS

  • CLI interfaces: Every pipeline stage has `argparse` CLI with sensible defaults
  • Documentation: Extensive docs/ directory with 8 guides + paper
  • README: Production-quality with architecture diagram, API reference, examples
  • ⚠️ Issue: No end-to-end `make train` or `just run-all` command
  • ⚠️ Issue: No configuration file support — all params are CLI args or hardcoded

Required Fix: Create a `config.yaml` system for pipeline configuration and a single `run_full_pipeline.py` orchestrator script.

Level 6: Deployment ❌ FAIL

  • No CI/CD: No GitHub Actions, no automated testing
  • No data validation gate: Data can flow to training without format/quality checks
  • No model registry: Trained models are referenced by Together AI job IDs with no versioning
  • No monitoring: Training metrics are polled but not persisted or alerted on
  • Vast.ai scripts exist but are untested/unintegrated: `deploy/vastai/` is aspirational

Required Fixes:
1. Add GitHub Actions workflow: lint → type-check → test → validate-data
2. Add a data validation gate that runs `validate_export.py` before any training submission
3. Create a model registry file that tracks `{version, job_id, model_id, dataset_hash, metrics}`
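
The registry in fix 3 could be as simple as one JSON line per trained model. The sketch below builds such a line; the function names are hypothetical and the identifiers in the usage are placeholders, not real job or model IDs.

```python
import hashlib
import json


def dataset_hash(jsonl_bytes: bytes) -> str:
    """SHA-256 of the exact bytes submitted for training."""
    return hashlib.sha256(jsonl_bytes).hexdigest()


def registry_entry(version, job_id, model_id, jsonl_bytes, metrics) -> str:
    """Return one JSON line suitable for appending to a registry file."""
    entry = {
        "version": version,
        "job_id": job_id,
        "model_id": model_id,
        "dataset_hash": dataset_hash(jsonl_bytes),
        "metrics": metrics,
    }
    return json.dumps(entry, sort_keys=True)


# Placeholder values for illustration only.
line = registry_entry("v1", "job-123", "model-abc", b'{"a": 1}\n', {"eval_loss": 0.5})
```

Hashing the submitted bytes ties each model to one exact dataset, which also addresses part of the reproducibility gap raised in Attack 5.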

---

DEP-2 Fix Summary

| Priority | Fix | Effort | Impact |
|---|---|---|---|
| P0 🔴 | Add exec_score gate to secondary unjustified rule | 1 line | Prevents false-positive classification |
| P0 🔴 | Normalize DPO formats across pipeline | 2 hours | Enables actual DPO training |
| P0 🔴 | Write classifier unit tests | 4 hours | Prevents regression on core logic |
| P1 🟠 | Disable FunctionGemma rules in mock mode | 30 min | Removes phantom classifications |
| P1 🟠 | Add semantic fidelity check to rewriter | 2 hours | Prevents hallucinated rewrites entering training |
| P1 🟠 | Generate 5K+ DPO pairs | 8 hours | Makes DPO training viable |
| P1 🟠 | Fix RepoWorm or remove it | 4 hours | Eliminates dead pipeline confusion |
| P2 🟡 | Add data lineage tracking | 4 hours | Enables debugging and reproducibility |
| P2 🟡 | Add JSONL fallback to ConversationWorm | 2 hours | Makes WORMS work without Supabase |
| P2 🟡 | Stratified train/val/test splits | 1 hour | Better evaluation reliability |
| P2 🟡 | Config file system | 3 hours | Better reproducibility |
| P3 🟢 | CI/CD pipeline | 4 hours | Standard engineering hygiene |
| P3 🟢 | Model registry | 2 hours | Track trained models |

---

PHASE 3 — Evolution

3.1 Product Vision: "Train Your Own Cognitive Twin"

The core insight: CognitiveTwin is a framework for creating personalized AI behavior models from conversation history. Today it's Mohamed's personal tool. Tomorrow it could be:

> "Upload your conversation exports. Get a LoRA adapter that makes any LLM think like you."

3.2 Product Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    CognitiveTwin Cloud                        │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐ │
│  │  Upload   │──▶│  Corpus  │──▶│  WORMS   │──▶│  Train   │ │
│  │  Portal   │   │  Surgery │   │  Augment │   │  Engine  │ │
│  │           │   │          │   │          │   │          │ │
│  │ ChatGPT   │   │ Classify │   │ Paraphrase│  │ SFT +    │ │
│  │ Claude    │   │ Rewrite  │   │ Extend   │   │ DPO      │ │
│  │ Discord   │   │ Score    │   │ DPO Gen  │   │ LoRA     │ │
│  │ Slack     │   │ Quarantine│  │          │   │          │ │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘ │
│       │                                              │       │
│       ▼                                              ▼       │
│  ┌──────────┐                                 ┌──────────┐  │
│  │ Dashboard │                                │  Model   │  │
│  │ Analytics │                                │  Store   │  │
│  │ Twin Edit │                                │  Deploy  │  │
│  └──────────┘                                 └──────────┘  │
│                                                               │
└─────────────────────────────────────────────────────────────┘
```

3.3 Onboarding Flow

Step 1: Export Your Conversations (5 minutes)
- User exports from ChatGPT (Settings → Data → Export)
- User exports from Claude (Account → Export Data)
- Optional: Discord bot token, Slack export

Step 2: Upload & Analyze (automated, ~2 minutes)
- Drag-and-drop ZIP/JSON files
- Pipeline runs corpus surgery automatically
- Dashboard shows: conversation count, density distribution, personality traits detected, stall patterns found

Step 3: Review Twin Profile (optional, 3 minutes)
- "Your Twin Profile" page shows:
  - Communication style (direct vs. exploratory, terse vs. verbose)
  - Expertise domains detected (code, research, planning, ops)
  - Behavioral patterns (preference for X, tendency to Y)
  - Permission-seeking score (how much the AI stalled with you)
- User can toggle: "I WANT my twin to ask clarifying questions sometimes" (adjusts question_policy)
- User can mark conversations as "Not me" (exclude from training)

Step 4: Train (automated, 30-60 minutes)
- Select base model tier:
  - Starter (3B params): Fast, cheap, good for simple style matching
  - Pro (8-32B active): Full cognitive twin with reasoning patterns
  - Enterprise (70B+): Maximum fidelity, multi-domain mastery
- Training runs in cloud (Together AI / Vast.ai / RunPod)
- Real-time progress: loss curves, sample outputs at checkpoints

Step 5: Deploy & Use (immediate)
- API endpoint (OpenAI-compatible)
- Download LoRA adapter for local use (Ollama, vLLM, etc.)
- Integration with:
  - Cursor / VS Code extension
  - Discord bot
  - Chrome extension (reply suggestion)
  - Slack bot
  - Apple Shortcuts

3.4 Pricing Model

| Tier | Price | What You Get |
|---|---|---|
| Free | $0 | Upload & analyze (dashboard only). No training. |
| Personal | $29/mo | 1 Twin, Starter model, 10K conversations, 3 retrains/month |
| Pro | $99/mo | 3 Twins, Pro model, 100K conversations, unlimited retrains, API access |
| Team | $49/seat/mo | Shared twins, team style guides, role-specific twins (CEO tone, eng tone) |
| Enterprise | Custom | Self-hosted, 70B+ models, SSO, compliance, SLA |

Usage-based add-ons:
- API inference: $0.001/1K tokens (pass-through + 20%)
- Extra training: $5/retrain (covers compute)
- Custom base model: $50/train (larger model fine-tuning)

3.5 What Makes This a Product (Not Just a Tool)

1. Network Effects: As more people train twins, the corpus surgery classifier gets better (aggregate anonymized stall patterns improve detection)
2. Data Flywheel: Each user's corrections to their twin generate more DPO training data
3. Platform Lock-in: Your twin's adapter only works with CognitiveTwin's deployment infra (or you can export it, but the retraining cadence keeps you subscribed)
4. Upsell Path: Free → dashboard addiction → "I want to actually USE this" → paid tier

3.6 Technical Evolution Requirements

To go from "Mohamed's personal pipeline" to "anyone's cognitive twin":

Must-Have for MVP:
- [ ] Web upload portal (drag-and-drop conversation exports)
- [ ] Multi-tenant pipeline (isolated user data, parallel training jobs)
- [ ] Auto-detect conversation format (ChatGPT vs Claude vs generic chat)
- [ ] Hosted inference endpoint with OpenAI-compatible API
- [ ] Dashboard: conversation analytics, twin profile, style metrics

Must-Have for Launch:
- [ ] User accounts (auth, billing, usage tracking)
- [ ] Data encryption at rest (user conversations are sensitive)
- [ ] GDPR compliance (data deletion, export)
- [ ] Rate limiting and abuse prevention
- [ ] Monitoring and alerting on training jobs

Nice-to-Have (v2):
- [ ] Real-time twin updates (process new conversations incrementally)
- [ ] Twin-to-twin comparison ("How does my twin differ from average?")
- [ ] Style transfer ("Make my twin sound more [professional/casual/technical]")
- [ ] Multi-language support
- [ ] Voice clone integration (text style + ElevenLabs voice = full digital twin)
- [ ] "Twin memories" — RAG++ integration for long-term context

3.7 Competitive Landscape

| Competitor | What They Do | CognitiveTwin Advantage |
|---|---|---|
| Character.ai | Create fictional characters | CT trains on YOUR real conversations. Authentic, not fictional. |
| Ditto AI | AI clone from social media | CT uses deep conversation data, not surface-level posts. |
| Personal.ai | Personal AI assistant | CT produces a LoRA adapter you OWN and can run locally. |
| Fine-tuning APIs (OpenAI, Together) | Raw fine-tuning | CT handles the entire pipeline: ingestion, cleaning, augmentation, training, eval. |

Moat: The corpus surgery + WORMS pipeline is the moat. Anyone can fine-tune a model, but CT's multi-signal classification, friction quarantine, and trajectory-aware DPO create better training data from the same raw conversations. The DATA QUALITY is the product.

3.8 Risk Analysis

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Users upload sensitive/illegal content | HIGH | HIGH | Content filtering on upload, ToS, automated scanning |
| Model outputs harmful content in user's "style" | MEDIUM | HIGH | Output guardrails, content policy enforcement on inference |
| Privacy breach (conversation data leaked) | LOW | CRITICAL | Encryption at rest, SOC2, minimal retention |
| Training costs exceed revenue | MEDIUM | MEDIUM | Start with LoRA-only (cheap), tier pricing covers compute |
| Users expect perfection from small data | HIGH | MEDIUM | Set expectations: "50+ conversations for good results, 500+ for great" |

---

ISSUE TRACKER

| # | Category | Severity | Title | Status |
|---|---|---|---|---|
| 1 | Classification | 🔴 CRITICAL | Secondary unjustified rule missing exec_score gate | OPEN |
| 2 | Integration | 🔴 CRITICAL | DPO format mismatch across 3 pipeline stages | OPEN |
| 3 | Testing | 🔴 CRITICAL | Zero unit tests for classifier | OPEN |
| 4 | WORMS | 🔴 CRITICAL | RepoWorm produced 0 records (dead pipeline) | OPEN |
| 5 | Classification | 🟠 HIGH | FunctionGemma mock mode corrupts fusion scoring | OPEN |
| 6 | Rewriter | 🟠 HIGH | No semantic fidelity check on rewrites | OPEN |
| 7 | Data | 🟠 HIGH | Only 740 DPO pairs for 32B model training | OPEN |
| 8 | Data | 🟠 HIGH | WORMS multiplier 1.0x (augmentation barely worked) | OPEN |
| 9 | Data | 🟠 HIGH | No lineage tracking (records untraceable) | OPEN |
| 10 | Scoring | 🟡 MEDIUM | Stall score double-counting inflates values | OPEN |
| 11 | Scoring | 🟡 MEDIUM | `ends_with_question()` over-fires on declarations | OPEN |
| 12 | Ingestion | 🟡 MEDIUM | Silent turn dropping on malformed exports | OPEN |
| 13 | Ingestion | 🟡 MEDIUM | No pagination in Supabase extractor | OPEN |
| 14 | Export | 🟡 MEDIUM | No stratified train/val/test splits | OPEN |
| 15 | Training | 🟡 MEDIUM | No early stopping (overfitting risk on small DPO) | OPEN |
| 16 | Eval | 🟡 MEDIUM | Regression tester only checks string patterns | OPEN |
| 17 | Deploy | 🟡 MEDIUM | No CI/CD, no automated quality gates | OPEN |
| 18 | WORMS | 🟡 MEDIUM | ConversationWorm hard-depends on Supabase | OPEN |
| 19 | WORMS | 🟡 MEDIUM | `_get_original_response()` always returns None | OPEN |
| 20 | Data | 🟡 MEDIUM | Expansion v6-v8 potential data leakage | OPEN |

---

VERDICT

What's Actually Good

1. The classification system design is excellent. Three-signal scoring (stall/exec/blocked) with directive completeness and question policy is a genuinely novel approach. The density scorer's 4-dimension model is well-thought-out. This is publishable research.

2. The CTv3.1 schema is production-quality. Comprehensive type definitions, proper enums, clean serialization. This is the kind of schema that can power a product.

3. The architecture is extensible. The WORMS system (ConversationWorm + RepoWorm + EnhancerAgent) is a great design even though the current implementation underdelivers on data multiplication.

4. Documentation is exceptional. 8 docs + paper + detailed README with architecture diagrams. This is ready for external contributors.

5. The 38K SFT records are real value. Dense, scored, filtered conversation data from actual usage. This is hard to recreate and forms a genuine training corpus.

What Needs Fixing Before Training

1. Fix the classifier (Issues #1, #5, #3): The core classification logic has a false-positive path (secondary rule), mock data influencing production scores, and zero tests. This affects every record.

2. Fix DPO format (Issue #2): Cannot train DPO if the data format doesn't match what the training pipeline expects.

3. Generate more DPO data (Issue #7): 740 pairs is insufficient. Target 5K minimum.

4. Decide on RepoWorm (Issue #4): Either fix it or remove it. Dead pipelines erode trust.

Training Readiness Score: 6/10

The SFT data is solid. The DPO data is critically undersized and format-broken. The classification layer that produces both has bugs that need fixing. Fix Issues #1-7 and this jumps to 8/10.

Product Readiness Score: 3/10

Excellent foundation, but needs web portal, multi-tenancy, auth, billing, and data privacy before it's a product. The technical pipeline is 70%

---

Generated by Evo-Cube analysis. Reviewed 93 Python files, 47K LOC, 43K training records, 8 expansion stages, 6 pipeline phases.


Source Anchor

Comp-Core/packages/cognitive-twin/EVOCUBE_REPORT.md
