Grand Diomande Research · Full HTML Reader

EVO-CUBE REPORT: CognitiveTwin Pipeline

**Date:** 2025-07-18 **Codebase:** `Desktop/Comp-Core/packages/cognitive-twin/` (93 Python files, ~47K LOC) **Data:** 43K records across 8 expansion stages + combined_v5_v8 final dataset **Target Model:** Kimi-K2-Thinking (MoE-1T, 32B active params)

Agents That Account for Themselves research note experiment writeup candidate score 52 .md

Full Public Reader

# EVO-CUBE REPORT: CognitiveTwin Pipeline
### CEF + DEP-2 + Evolution — Full Audit

Date: 2025-07-18
Codebase: `Desktop/Comp-Core/packages/cognitive-twin/` (93 Python files, ~47K LOC)
Data: 43K records across 8 expansion stages + combined_v5_v8 final dataset
Target Model: Kimi-K2-Thinking (MoE-1T, 32B active params)

---

1. [Phase 1 — CEF (Critique–Evil–Find)](#phase-1--cef)
- [Meta-Evil: Pipeline-Level Attacks](#11-meta-evil-pipeline-level-attacks)
- [Chunk-Evil: Per-Module Attacks](#12-chunk-evil-per-module-attacks)
- [Synthesis-Evil: Cross-Stage Data Quality](#13-synthesis-evil-cross-stage-data-quality)
2. [Phase 2 — DEP-2 (6-Level RTD + Fixes)](#phase-2--dep-2)
3. [Phase 3 — Evolution](#phase-3--evolution)
4. [Issue Tracker](#issue-tracker)
5. [Verdict](#verdict)

---

PHASE 1 — CEF

1.1 Meta-Evil: Pipeline-Level Attacks

ATTACK 1: Classification Threshold Regression → Silent Data Corruption

Finding: CRITICAL 🔴

In `corpus_surgery/constants.py`, the stall threshold was lowered from 3 → 1:

python

# Original: 3, 1, 0 - too strict, only caught 5/2177
STALL_THRESHOLD_UNJUSTIFIED = 1  # Any stalling pattern

The comment says "only caught 5/2177" so the threshold was weakened. But lowering to 1 means any message ending with a question mark (stall_score=1) triggers UNJUSTIFIED classification — even messages like "Here is the implementation. Does this approach work?" which scored exec=3 but would now hit stall≥1. The only protection is the `exec_score == 0` check in `is_unjustified()`, but the secondary rule bypasses exec:

python

# Secondary rule: strong permission phrase + ends with ? + high completeness
if (ends_with_question(text) and
    has_strong_permission and
    directive_completeness >= DIRECTIVE_HIGH_THRESHOLD):
    return True

This secondary rule has no exec_score gate — a response that contains "should we" (strong permission phrase) PLUS code PLUS question mark will be classified UNJUSTIFIED even though it executed. This corrupts the corpus surgery stage by flagging legitimate clarifications-after-execution as unjustified.

Impact: Potential false-positive rate on valid assistant turns estimated 8-15

Fix: Add `and exec_score == 0` to the secondary rule. Or restore stall threshold to 2 as a middle ground between 1 and 3.

---

ATTACK 2: FunctionGemma Integration is Dead Code in Production

Finding: HIGH 🟠

The `functiongemma_scorer.py` module is sophisticated (400+ lines) but always falls back to mock mode in production:

python

def __init__(self, ..., use_mock: bool = False):
    ...
    if self._model_path is None:
        logger.warning("No model path provided, using mock mode")
        self.use_mock = True

No model path is ever configured in any script or config file. The mock parser uses keyword matching to estimate parsability — it's a heuristic pretending to be ML inference. The classifier's fusion logic weights parsability at 60

python

fused_completeness = (
    0.4 * directive_completeness +
    0.6 * parsability_score * parsability_info.confidence
)

When mock confidence is 0.3-0.7 (hardcoded), this drags the fused score down compared to pure heuristic. The FunctionGemma rule `parsability_score >= 0.8 and stall_score >= 2` can still fire from mock data, creating phantom classifications.

Impact: Classification operates on heuristic-mock data pretending to be ML-scored data. Fusion logic degrades rather than improves accuracy.

Fix: Either (a) ship a real FunctionGemma model and integrate it, or (b) remove the fusion logic and disable the FunctionGemma classification rule when in mock mode. Don't let mock data influence production classifications.

---

ATTACK 3: Stall Score Double-Counting

Finding: MEDIUM 🟡

The `compute_stall_score()` function counts every matching phrase independently, but many phrases overlap:

"i'm sorry, but i can't" matches BOTH `REFUSAL_PHRASES` (+4) AND `CLARIFICATION_PREAMBLES` ("i'm sorry" → not exact, but close patterns overlap)
"would you like me to" (+3) can co-occur with "sound good?" (+3 from another strong permission phrase)

A single response saying "I apologize, but I cannot do that. Would you like me to try a different approach? Should I proceed?" would score: 4 (refusal) + 3 (would you like) + 3 (should I) + 1 (question mark) = 11. The threshold is 1. This response is definitely bad, but the score is meaninglessly inflated, making it impossible to use score magnitude for severity calibration or confidence weighting in DPO pairs.

Impact: Score inflation prevents meaningful gradient between "slightly stalling" and "completely refusing." DPO confidence could be calibrated to stall_score magnitude but currently isn't usable for this.

Fix: Either (a) deduplicate phrase matches (only count highest-scoring match per semantic category), or (b) cap stall_score at 10, or (c) normalize stall_score to [0,1] range for downstream use.

---

ATTACK 4: Rewriter Has No Semantic Fidelity Check

Finding: HIGH 🟠

The `rewrite_assistant_turn()` function validates rewrites for:
- No question mark at end ✓
- No permission phrases ✓
- Required artifacts present ✓
- Format compliance ✓

But there is no check that the rewrite is semantically faithful to the user's request. The rewriter calls GPT to transform permission-seeking responses into direct execution — but nothing verifies the generated content is correct. A rewrite could hallucinate a completely wrong implementation, and it would pass validation because it has a code block and doesn't end with a question.

Impact: Hallucinated rewrites enter SFT training data as "gold" examples. The model learns to produce confident-sounding wrong answers instead of correct permission-seeking ones. This is worse than the original problem.

Fix: Add a semantic similarity check between the rewrite and the user's request (embedding cosine similarity > 0.6). Or add a "code compiles" check for code rewrites. Or flag all rewrites as `review_status: auto_rewrite` and weight them lower (0.5) in training.

---

ATTACK 5: No Data Versioning or Lineage Tracking

Finding: MEDIUM 🟡

The pipeline has 8 expansion stages (v1-v8), combined datasets, augmented datasets, and multiple export versions (ctv3_export through ctv3_export_v5). But there is no mechanism to:
- Track which source records end up in which final dataset
- Know if a record was rewritten, augmented, or original
- Reproduce a specific dataset version
- Audit data flow from ingestion → scoring → WORMS → export

The `record_id` field uses random UUIDs, so there's no parent-child linkage. A record generated by conversation_worm has `source.origin = "convo_worm"` but no reference to which original conversation it branched from (the field `source_id` exists but is just the conversation_id, not the specific turn).

Impact: Cannot debug training failures back to data issues. Cannot identify which augmentation stage introduced problematic records. Dataset reproducibility is zero.

Fix: Add a `lineage` field to CTv3Record: `{parent_id: str, stage: str, transform: str}`. Chain parent_ids through pipeline stages. Log every record transformation.

---

1.2 Chunk-Evil: Per-Module Attacks

INGESTION (`v3/ingest/`)

ID	Finding	Severity	Detail
ING-1	Claude JSON parser assumes specific export format	🟡 MEDIUM	`claude_json.py` and `openai_json.py` parse specific export schemas. No schema validation — malformed exports silently drop turns rather than erroring.
ING-2	Deduplicator hash truncation	🟢 LOW	`sha256.hexdigest()[:16]` gives 64-bit hash space — collision probability rises past ~100K conversations. At 43K records, estimated <0.1
ING-3	Normalizer doesn't handle multi-modal content	🟡 MEDIUM	Image/file attachments in conversations are silently dropped. No flag to indicate content was lost. Model trains on incomplete context.
ING-4	Supabase extractor has no pagination	🟡 MEDIUM	`.limit(limit * 10)` is a guess, not proper cursor-based pagination. Will miss data in large corpora.

SCORING (`v3/corpus_surgery/`)

ID	Finding	Severity	Detail
SCR-1	`ends_with_question()` over-fires	🟡 MEDIUM	Checks if last sentence starts with "is", "are", "will" — catches declaratives like "This is the implementation." if preceded by a sentence ending with no period. Also: markdown headers like `## What is X` would fire.
SCR-2	`check_missing_input()` false positives	🟡 MEDIUM	Triggers on "refactor" keyword + no code block, but the code might be in an attachment or referenced by file path. `has_file_path` regex `[/
]+\.\w+` doesn't match many real paths (e.g., `Desktop/...` with tildes).
SCR-3	`directive_completeness` maxes at 0.80	🟢 LOW	Max achievable: +0.35 (verb) + 0.25 (format) + 0.20 (inputs) = 0.80. No message can score 1.0 without negative terms canceling out. The 0.7 threshold for `no_questions` policy is reachable but leaves thin margin.
SCR-4	`compute_blocked_score()` can go negative	🟡 MEDIUM	`FORMAT_SPECIFIED_BONUS = -1` and `USER_ASKED_OPTIONS_BONUS = -2`. Starting from 0, a high-completeness message with format spec + options asked = -3, then `max(0, score)` clips it. But if `check_missing_input` fires (+3) and format specified (-1), the interaction is non-intuitive.
SCR-5	No unit tests for classifier	🔴 CRITICAL	The classifier has 3 hardcoded test cases in `__main__` but zero pytest tests. Classification logic changed (threshold 3→1, FunctionGemma rule added) with no regression testing. Cannot verify correctness.

WORMS (`v3/worms/`)

ID	Finding	Severity	Detail
WRM-1	ConversationWorm depends on Supabase at runtime	🟡 MEDIUM	`_load_conversation()` requires Supabase. No local-file fallback for the main processing path. The `process_conversation()` method returns empty if no Supabase client — silently produces zero output.
WRM-2	WORMS multiplier is 1.0x	🟡 MEDIUM	`worms_stats.json` shows total_sft=3867, orig_sft=3852 — only 15 new conversation branches were generated. The WORMS augmentation pipeline added 0.4
WRM-3	RepoWorm generated 0 records	🔴 CRITICAL	`repo_sft: 0, repo_dpo: 0` in worms_stats. The entire code-grounded training data pipeline produced nothing. Either no repos were scanned, or the CodeScanner/TaskGenerator found no tasks. This is an entire dead pipeline.
WRM-4	DPO generator hardcodes "gpt-5.2" as provider	🟢 LOW	`create_sft_record()` sets `provider: "gpt-5.2"` regardless of actual generation model. Stats show `model: "gemini-2.0-flash"` was used. Metadata lies about provenance.
WRM-5	`_get_original_response()` always returns None	🟡 MEDIUM	In `ConversationWormPipeline`, this method returns `None` with a comment "This would typically look up the original response from Supabase." Every DPO pair that depends on finding the original dispreferred response fails silently.

EXPORT & DATASET (`v3/dataset/`)

ID	Finding	Severity	Detail
EXP-1	DPO format mismatch between pipeline stages	🔴 CRITICAL	Corpus surgery exports DPO as `{prompt, chosen, rejected}` (Together AI format). Combined v5_v8 data uses `{input, preferred_output, non_preferred_output}`. The `DataPreparer.prepare_dpo_data()` expects `{candidates.preferred.assistant_content}` (CTv3.1 schema format). These three formats are incompatible. No adapter exists to normalize them.
EXP-2	No format validation on final export	🟡 MEDIUM	`validate_export.py` exists (724 lines) but there's no evidence it's integrated into the pipeline. Data can be exported in an incompatible format and only discovered when training fails.
EXP-3	80/10/10 split has no stratification	🟡 MEDIUM	`DatasetSplit` does random shuffle + split. No stratification by source, domain, task_type, or quality tier. Training set could over-represent one expansion stage and under-represent another.
EXP-4	Eval set has no coverage guarantee	🟡 MEDIUM	Eval records are exported as-is with no check for coverage of all failure modes, task types, or domains.

TRAINING (`v3/pipeline.py`, `scripts/`)

ID	Finding	Severity	Detail
TRN-1	Training pipeline is Together AI-only	🟡 MEDIUM	`V3TrainingPipeline` only supports Together AI's fine-tuning API. The target model is Kimi-K2-Thinking (Moonshot AI). Together AI does support Kimi-K2, but the pipeline has no local training path, no Vast.ai integration despite having `deploy/vastai/` scripts, and no fallback.
TRN-2	LoRA config is hardcoded, not optimized	🟢 LOW	`lora_r=16, lora_alpha=32` are defaults. For a 32B-active MoE model with 43K records, these may be undertrained. No hyperparameter sweep infrastructure.
TRN-3	DPO beta=0.1 is low for 43K records	🟢 LOW	`dpo_beta=0.1` is standard for small datasets. With 43K records and a MoE model, a higher beta (0.3-0.5) may be needed to prevent reward hacking.
TRN-4	No early stopping	🟡 MEDIUM	`wait_for_completion()` polls until success/failure with a 2-hour timeout but no eval-loss-based early stopping. Overfitting risk on a small DPO set (882 records).
TRN-5	`submit_training.py` and `train.py` scripts lack error recovery	🟡 MEDIUM	No checkpointing, no resume capability. A timeout or API error means restarting from scratch.

EVALUATION (`v3/eval/`)

ID	Finding	Severity	Detail
EVL-1	Regression tester only checks string patterns	🟡 MEDIUM	`_check_constraints()` checks disallowed phrases and question endings. No semantic evaluation, no LLM-as-judge, no behavioral scoring. A model could satisfy all constraints while producing meaningless output.
EVL-2	A/B comparison scoring is simplistic	🟡 MEDIUM	`_score_response()` only penalizes permission phrases and rewards code blocks. Doesn't evaluate correctness, coherence, or style alignment.
EVL-3	No human eval integration	🟡 MEDIUM	Zero support for human annotation, blind evaluation, or inter-annotator agreement.

FRAMEWORK (`framework/`)

ID	Finding	Severity	Detail
FRM-1	Framework module is architectural skeleton	🟢 LOW	`framework/` contains `twin.py`, `trainer.py`, `pattern_memory.py`, etc. These define abstract interfaces (CognitiveTwin, PatternMemory, ReasoningEncoder) but are not used by the actual v3 pipeline. They're aspirational architecture, not production code. This is fine for a v3 system, but they inflate the apparent capability of the codebase.
FRM-2	`_compat.py` imports from non-existent packages	🟢 LOW	References `FunctionCall`, `ToolSchema`, `FunctionGemmaRuntime` from `cognitive_twin._compat` — these are compatibility stubs that define mock interfaces when the real packages aren't installed. Safe but confusing.

---

1.3 Synthesis-Evil: Cross-Stage Data Quality

ATTACK 6: The 43K Number is Inflated

Finding: HIGH 🟠

Actual record counts from the data directory:

Dataset	SFT Train	SFT Val	DPO Train	DPO Val	Total
ctv3_export_v5	38,189	4,244	666	74	43,173
combined_v5_v8	34,573	3,842	793	89	39,297
worms augmented	3,867	—	48	—	3,915

The "43K" figure comes from ctv3_export_v5, but:
1. The combined_v5_v8 has fewer records (39K) — meaning ~4K records were pruned or deduplicated between v5 export and the merge. But there's no record of what was removed or why.
2. SFT massively outnumbers DPO: 38K SFT vs 740 DPO. The DPO signal is 50:1 underrepresented. For DPO to meaningfully shape behavior, you need thousands of preference pairs, not hundreds.
3. WORMS added only 15 new SFT records — the augmentation system essentially didn't work.
4. Zero repo_worm data — the code-grounded augmentation produced nothing.

Impact: The model will be overwhelmingly shaped by SFT (pattern imitation) with minimal DPO correction signal. The anti-permission-seeking behavior that DPO is supposed to enforce will be underrepresented.

---

ATTACK 7: Data Provenance Chain is Broken

Finding: HIGH 🟠

Following a record through the pipeline:

1. Ingestion: Raw conversations → `extracted/*.jsonl` (source-specific formats)
2. Corpus Surgery: Classified, rewritten → internal schema
3. Expansion v1-v8: LLM-generated augmentations → expansion-specific formats
4. WORMS: Paraphrases, ideal responses → worms_output format
5. Combined: Merged → `combined_v5_v8` (Together AI chat format `{messages: [...]}`)
6. Export: Final → `ctv3_export_v5` (CTv3.1 schema? Or Together AI format?)

The problem: no single record can be traced from final dataset back to source. The combined_v5_v8 records are in `{messages: [...]}` format with no schema_version, no record_id, no source info. All CTv3.1 metadata is stripped in the final export.

Additionally, the DPO records in combined_v5_v8 use `{input, preferred_output, non_preferred_output}` — a format that doesn't match ANY of the three DPO formats defined in the codebase.

---

ATTACK 8: Expansion Stages May Have Data Leakage

Finding: MEDIUM 🟡

Expansion stages v6-v8 use LLMs to generate synthetic data:
- v6: "cross-domain synthesis, pattern recognition, architecture chains"
- v7: "methods, process conversations, evolution self-description, tool building, meta DPO"
- v8: "deep conversations, session mining, RLM enhanced, cross-system"

These stages call LLMs with the user's real conversation data as context and ask for synthetic extensions. If the LLM memorizes and regurgitates specific conversation content, the same information appears in both train and eval — data leakage through generative augmentation.

The deduplicator only catches exact hash matches and session overlaps. Paraphrased duplicates (which are the entire point of v6-v8) would pass through.

---

PHASE 2 — DEP-2

6-Level Recursive Task Decomposition

Level 1: Structure ✅ PASS (with caveats)

Module organization: Clean v3/ package structure with clear separation: `ingest/`, `corpus_surgery/`, `worms/`, `dataset/`, `eval/`, `pruning/`, `generators/`, `api/`, `tools/`
Schema definition: CTv3.1 schema is well-defined in `schema.py` with proper dataclasses, enums, serialization
Import graph: No circular imports detected. `_compat.py` handles optional dependencies gracefully
⚠️ Issue: `framework/` module is disconnected from `v3/` pipeline — architectural dead code

Level 2: Compilation ⚠️ CONDITIONAL PASS

Type hints: Comprehensive throughout. Python 3.10+ syntax used consistently
Dependencies: `pyproject.toml` declares proper optional dependency groups
⚠️ Issue: No `requirements.txt` or `uv.lock` — exact dependency resolution not pinned
⚠️ Issue: Many imports are guarded with try/except but fail silently, making it hard to know what's actually available at runtime

Level 3: Integration ❌ FAIL

DPO format mismatch (EXP-1): Three incompatible DPO formats across pipeline stages
WORMS Supabase dependency (WRM-1): Pipeline stages have hard runtime dependencies on external services with no fallback
RepoWorm dead (WRM-3): Entire code-grounded pipeline produced zero output
FunctionGemma mock masquerading as real (META-EVIL-2): Integration point pretends to work but degrades classification

Required Fixes:
1. Normalize all DPO data to a single format at pipeline entry. Add a `normalize_dpo_record(record: dict) -> dict` function that handles all three schemas
2. Add JSONL-file fallback path for ConversationWorm when Supabase is unavailable
3. Either fix RepoWorm or remove it from the stats/pipeline to avoid confusion
4. Gate FunctionGemma classification rules behind `if not self.use_mock`

Level 4: Content ⚠️ CONDITIONAL PASS

Classification logic: Sophisticated multi-signal approach (stall/exec/blocked). Core design is sound
Density scoring: Excellent composite scorer with good dimension separation
WORMS design: ConversationWorm, RepoWorm, EnhancerAgent architecture is well-thought-out
⚠️ Issue: Classifier has no tests (SCR-5)
⚠️ Issue: Rewriter has no semantic fidelity check (META-EVIL-4)
⚠️ Issue: DPO dataset too small (740 records) to meaningfully train 32B-active model

Required Fixes:
1. Write 50+ test cases for the classifier covering all scoring dimensions and edge cases
2. Add `review_status: "auto_rewrite"` to all rewritten records and weight them at 0.5
3. Generate at minimum 5,000 DPO pairs using the expansion pipeline before training

Level 5: User Journey ⚠️ CONDITIONAL PASS

CLI interfaces: Every pipeline stage has `argparse` CLI with sensible defaults
Documentation: Extensive docs/ directory with 8 guides + paper
README: Production-quality with architecture diagram, API reference, examples
⚠️ Issue: No end-to-end `make train` or `just run-all` command
⚠️ Issue: No configuration file support — all params are CLI args or hardcoded

Required Fix: Create a `config.yaml` system for pipeline configuration and a single `run_full_pipeline.py` orchestrator script.

Level 6: Deployment ❌ FAIL

No CI/CD: No GitHub Actions, no automated testing
No data validation gate: Data can flow to training without format/quality checks
No model registry: Trained models are referenced by Together AI job IDs with no versioning
No monitoring: Training metrics are polled but not persisted or alerted on
Vast.ai scripts exist but are untested/unintegrated: `deploy/vastai/` is aspirational

Required Fixes:
1. Add GitHub Actions workflow: lint → type-check → test → validate-data
2. Add a data validation gate that runs `validate_export.py` before any training submission
3. Create a model registry file that tracks `{version, job_id, model_id, dataset_hash, metrics}`

---

DEP-2 Fix Summary

Priority	Fix	Effort	Impact
P0 🔴	Add exec_score gate to secondary unjustified rule	1 line	Prevents false-positive classification
P0 🔴	Normalize DPO formats across pipeline	2 hours	Enables actual DPO training
P0 🔴	Write classifier unit tests	4 hours	Prevents regression on core logic
P1 🟠	Disable FunctionGemma rules in mock mode	30 min	Removes phantom classifications
P1 🟠	Add semantic fidelity check to rewriter	2 hours	Prevents hallucinated rewrites entering training
P1 🟠	Generate 5K+ DPO pairs	8 hours	Makes DPO training viable
P1 🟠	Fix RepoWorm or remove it	4 hours	Eliminates dead pipeline confusion
P2 🟡	Add data lineage tracking	4 hours	Enables debugging and reproducibility
P2 🟡	Add JSONL fallback to ConversationWorm	2 hours	Makes WORMS work without Supabase
P2 🟡	Stratified train/val/test splits	1 hour	Better evaluation reliability
P2 🟡	Config file system	3 hours	Better reproducibility
P3 🟢	CI/CD pipeline	4 hours	Standard engineering hygiene
P3 🟢	Model registry	2 hours	Track trained models

---

PHASE 3 — Evolution

3.1 Product Vision: "Train Your Own Cognitive Twin"

The core insight: CognitiveTwin is a framework for creating personalized AI behavior models from conversation history. Today it's Mohamed's personal tool. Tomorrow it could be:

> "Upload your conversation exports. Get a LoRA adapter that makes any LLM think like you."

3.2 Product Architecture

┌─────────────────────────────────────────────────────────────┐
│                    CognitiveTwin Cloud                        │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐ │
│  │  Upload   │──▶│  Corpus  │──▶│  WORMS   │──▶│  Train   │ │
│  │  Portal   │   │  Surgery │   │  Augment │   │  Engine  │ │
│  │           │   │          │   │          │   │          │ │
│  │ ChatGPT   │   │ Classify │   │ Paraphrase│  │ SFT +    │ │
│  │ Claude    │   │ Rewrite  │   │ Extend   │   │ DPO      │ │
│  │ Discord   │   │ Score    │   │ DPO Gen  │   │ LoRA     │ │
│  │ Slack     │   │ Quarantine│  │          │   │          │ │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘ │
│       │                                              │       │
│       ▼                                              ▼       │
│  ┌──────────┐                                 ┌──────────┐  │
│  │ Dashboard │                                │  Model   │  │
│  │ Analytics │                                │  Store   │  │
│  │ Twin Edit │                                │  Deploy  │  │
│  └──────────┘                                 └──────────┘  │
│                                                               │
└─────────────────────────────────────────────────────────────┘

3.3 Onboarding Flow

Step 1: Export Your Conversations (5 minutes)
- User exports from ChatGPT (Settings → Data → Export)
- User exports from Claude (Account → Export Data)
- Optional: Discord bot token, Slack export

Step 2: Upload & Analyze (automated, ~2 minutes)
- Drag-and-drop ZIP/JSON files
- Pipeline runs corpus surgery automatically
- Dashboard shows: conversation count, density distribution, personality traits detected, stall patterns found

Step 3: Review Twin Profile (optional, 3 minutes)
- "Your Twin Profile" page shows:
- Communication style (direct vs. exploratory, terse vs. verbose)
- Expertise domains detected (code, research, planning, ops)
- Behavioral patterns (preference for X, tendency to Y)
- Permission-seeking score (how much the AI stalled with you)
- User can toggle: "I WANT my twin to ask clarifying questions sometimes" (adjusts question_policy)
- User can mark conversations as "Not me" (exclude from training)

Step 4: Train (automated, 30-60 minutes)
- Select base model tier:
- Starter (3B params): Fast, cheap, good for simple style matching
- Pro (8-32B active): Full cognitive twin with reasoning patterns
- Enterprise (70B+): Maximum fidelity, multi-domain mastery
- Training runs in cloud (Together AI / Vast.ai / RunPod)
- Real-time progress: loss curves, sample outputs at checkpoints

Step 5: Deploy & Use (immediate)
- API endpoint (OpenAI-compatible)
- Download LoRA adapter for local use (Ollama, vLLM, etc.)
- Integration with:
- Cursor / VS Code extension
- Discord bot
- Chrome extension (reply suggestion)
- Slack bot
- Apple Shortcuts

3.4 Pricing Model

Tier	Price	What You Get
Free	$0	Upload & analyze (dashboard only). No training.
Personal	$29/mo	1 Twin, Starter model, 10K conversations, 3 retrains/month
Pro	$99/mo	3 Twins, Pro model, 100K conversations, unlimited retrains, API access
Team	$49/seat/mo	Shared twins, team style guides, role-specific twins (CEO tone, eng tone)
Enterprise	Custom	Self-hosted, 70B+ models, SSO, compliance, SLA

Usage-based add-ons:
- API inference: $0.001/1K tokens (pass-through + 20
- Extra training: $5/retrain (covers compute)
- Custom base model: $50/train (larger model fine-tuning)

3.5 What Makes This a Product (Not Just a Tool)

1. Network Effects: As more people train twins, the corpus surgery classifier gets better (aggregate anonymized stall patterns improve detection)
2. Data Flywheel: Each user's corrections to their twin generate more DPO training data
3. Platform Lock-in: Your twin's adapter only works with CognitiveTwin's deployment infra (or you can export it, but the retraining cadence keeps you subscribed)
4. Upsell Path: Free → dashboard addiction → "I want to actually USE this" → paid tier

3.6 Technical Evolution Requirements

To go from "Mohamed's personal pipeline" to "anyone's cognitive twin":

Must-Have for MVP:
- [ ] Web upload portal (drag-and-drop conversation exports)
- [ ] Multi-tenant pipeline (isolated user data, parallel training jobs)
- [ ] Auto-detect conversation format (ChatGPT vs Claude vs generic chat)
- [ ] Hosted inference endpoint with OpenAI-compatible API
- [ ] Dashboard: conversation analytics, twin profile, style metrics

Must-Have for Launch:
- [ ] User accounts (auth, billing, usage tracking)
- [ ] Data encryption at rest (user conversations are sensitive)
- [ ] GDPR compliance (data deletion, export)
- [ ] Rate limiting and abuse prevention
- [ ] Monitoring and alerting on training jobs

Nice-to-Have (v2):
- [ ] Real-time twin updates (process new conversations incrementally)
- [ ] Twin-to-twin comparison ("How does my twin differ from average?")
- [ ] Style transfer ("Make my twin sound more [professional/casual/technical]")
- [ ] Multi-language support
- [ ] Voice clone integration (text style + ElevenLabs voice = full digital twin)
- [ ] "Twin memories" — RAG++ integration for long-term context

3.7 Competitive Landscape

Competitor	What They Do	CognitiveTwin Advantage
Character.ai	Create fictional characters	CT trains on YOUR real conversations. Authentic, not fictional.
Ditto AI	AI clone from social media	CT uses deep conversation data, not surface-level posts.
Personal.ai	Personal AI assistant	CT produces a LoRA adapter you OWN and can run locally.
Fine-tuning APIs (OpenAI, Together)	Raw fine-tuning	CT handles the entire pipeline: ingestion, cleaning, augmentation, training, eval.

Moat: The corpus surgery + WORMS pipeline is the moat. Anyone can fine-tune a model, but CT's multi-signal classification, friction quarantine, and trajectory-aware DPO create better training data from the same raw conversations. The DATA QUALITY is the product.

3.8 Risk Analysis

Risk	Likelihood	Impact	Mitigation
Users upload sensitive/illegal content	HIGH	HIGH	Content filtering on upload, ToS, automated scanning
Model outputs harmful content in user's "style"	MEDIUM	HIGH	Output guardrails, content policy enforcement on inference
Privacy breach (conversation data leaked)	LOW	CRITICAL	Encryption at rest, SOC2, minimal retention
Training costs exceed revenue	MEDIUM	MEDIUM	Start with LoRA-only (cheap), tier pricing covers compute
Users expect perfection from small data	HIGH	MEDIUM	Set expectations: "50+ conversations for good results, 500+ for great"

---

ISSUE TRACKER

#	Category	Severity	Title	Status
1	Classification	🔴 CRITICAL	Secondary unjustified rule missing exec_score gate	OPEN
2	Integration	🔴 CRITICAL	DPO format mismatch across 3 pipeline stages	OPEN
3	Testing	🔴 CRITICAL	Zero unit tests for classifier	OPEN
4	WORMS	🔴 CRITICAL	RepoWorm produced 0 records — dead pipeline	OPEN
5	Classification	🟠 HIGH	FunctionGemma mock mode corrupts fusion scoring	OPEN
6	Rewriter	🟠 HIGH	No semantic fidelity check on rewrites	OPEN
7	Data	🟠 HIGH	Only 740 DPO pairs for 32B model training	OPEN
8	Data	🟠 HIGH	WORMS multiplier 1.0x — augmentation barely worked	OPEN
9	Data	🟠 HIGH	No lineage tracking — records untraceable	OPEN
10	Scoring	🟡 MEDIUM	Stall score double-counting inflates values	OPEN
11	Scoring	🟡 MEDIUM	`ends_with_question()` over-fires on declarations	OPEN
12	Ingestion	🟡 MEDIUM	Silent turn dropping on malformed exports	OPEN
13	Ingestion	🟡 MEDIUM	No pagination in Supabase extractor	OPEN
14	Export	🟡 MEDIUM	No stratified train/val/test splits	OPEN
15	Training	🟡 MEDIUM	No early stopping — overfitting risk on small DPO	OPEN
16	Eval	🟡 MEDIUM	Regression tester only checks string patterns	OPEN
17	Deploy	🟡 MEDIUM	No CI/CD, no automated quality gates	OPEN
18	WORMS	🟡 MEDIUM	ConversationWorm hard-depends on Supabase	OPEN
19	WORMS	🟡 MEDIUM	`_get_original_response()` always returns None	OPEN
20	Data	🟡 MEDIUM	Expansion v6-v8 potential data leakage	OPEN

---

VERDICT

What's Actually Good

1. The classification system design is excellent. Three-signal scoring (stall/exec/blocked) with directive completeness and question policy is a genuinely novel approach. The density scorer's 4-dimension model is well-thought-out. This is publishable research.

2. The CTv3.1 schema is production-quality. Comprehensive type definitions, proper enums, clean serialization. This is the kind of schema that can power a product.

3. The architecture is extensible. The WORMS system (ConversationWorm + RepoWorm + EnhancerAgent) is a great design even though the current implementation underdelivers on data multiplication.

4. Documentation is exceptional. 8 docs + paper + detailed README with architecture diagrams. This is ready for external contributors.

5. The 38K SFT records are real value. Dense, scored, filtered conversation data from actual usage. This is hard to recreate and forms a genuine training corpus.

What Needs Fixing Before Training

1. Fix the classifier (Issues #1, #5, #3): The core classification logic has a false-positive path (secondary rule), mock data influencing production scores, and zero tests. This affects every record.

2. Fix DPO format (Issue #2): Cannot train DPO if the data format doesn't match what the training pipeline expects.

3. Generate more DPO data (Issue #7): 740 pairs is insufficient. Target 5K minimum.

4. Decide on RepoWorm (Issue #4): Either fix it or remove it. Dead pipelines erode trust.

Training Readiness Score: 6/10

The SFT data is solid. The DPO data is critically undersized and format-broken. The classification layer that produces both has bugs that need fixing. Fix Issues #1-7 and this jumps to 8/10.

Product Readiness Score: 3/10

Excellent foundation, but needs web portal, multi-tenancy, auth, billing, and data privacy before it's a product. The technical pipeline is 70

---

Generated by Evo-Cube analysis. Reviewed 93 Python files, 47K LOC, 43K training records, 8 expansion stages, 6 pipeline phases.

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

Comp-Core/packages/cognitive-twin/EVOCUBE_REPORT.md

Detected Structure

Method · Evaluation · References · Figures · Code Anchors · Architecture

Full Public Reader

TABLE OF CONTENTS

PHASE 1 — CEF

1.1 Meta-Evil: Pipeline-Level Attacks

ATTACK 1: Classification Threshold Regression → Silent Data Corruption

ATTACK 2: FunctionGemma Integration is Dead Code in Production

ATTACK 3: Stall Score Double-Counting

ATTACK 4: Rewriter Has No Semantic Fidelity Check

ATTACK 5: No Data Versioning or Lineage Tracking

1.2 Chunk-Evil: Per-Module Attacks

INGESTION (`v3/ingest/`)

SCORING (`v3/corpus_surgery/`)

WORMS (`v3/worms/`)

EXPORT & DATASET (`v3/dataset/`)

TRAINING (`v3/pipeline.py`, `scripts/`)

EVALUATION (`v3/eval/`)

FRAMEWORK (`framework/`)

1.3 Synthesis-Evil: Cross-Stage Data Quality

ATTACK 6: The 43K Number is Inflated

ATTACK 7: Data Provenance Chain is Broken

ATTACK 8: Expansion Stages May Have Data Leakage

PHASE 2 — DEP-2

6-Level Recursive Task Decomposition

Level 1: Structure ✅ PASS (with caveats)

Level 2: Compilation ⚠️ CONDITIONAL PASS

Level 3: Integration ❌ FAIL

Level 4: Content ⚠️ CONDITIONAL PASS

Level 5: User Journey ⚠️ CONDITIONAL PASS

Level 6: Deployment ❌ FAIL

DEP-2 Fix Summary

PHASE 3 — Evolution

3.1 Product Vision: "Train Your Own Cognitive Twin"

3.2 Product Architecture

3.3 Onboarding Flow

3.4 Pricing Model

3.5 What Makes This a Product (Not Just a Tool)

3.6 Technical Evolution Requirements

3.7 Competitive Landscape

3.8 Risk Analysis

ISSUE TRACKER

VERDICT

What's Actually Good

What Needs Fixing Before Training

Training Readiness Score: 6/10

Product Readiness Score: 3/10

Promotion Decision

Source Anchor

Detected Structure