Grand Diomande Research · Full HTML Reader

CognitiveTwin V2 Technical Evaluation Report

Executive Summary

This document presents a comprehensive analysis of the CognitiveTwin V2 fine-tuned language model, comparing its performance characteristics against the base Meta Llama 3.1 8B Instruct model from which it derives. The evaluation methodology encompasses multiple dimensions of model behavior including response structure, coherence patterns, stylistic fidelity, and domain-specific adaptation. The fine-tuning process successfully transferred identifiable patterns from the training corpus into the model's generative behavior, with measurable improvements in characteristic phrase usage, topic consistency, and technical term density. The fine-tuned model demonstrates a distinct response profile characterized by longer sentence structures, increased use of numbered lists over bullet points, and higher concentration of domain-specific terminology compared to the base model.

Introduction

The CognitiveTwin V2 represents an instance of supervised fine-tuning applied to the Meta Llama 3.1 8B Instruct foundation model. The training objective centers on adapting the model's response patterns to reflect the stylistic and structural characteristics observed in a corpus of 979 conversational exchanges extracted from the Supabase memory_turns table. The training procedure employed full-parameter fine-tuning over three epochs using the Together AI infrastructure, with the resulting model deployed as a private endpoint accessible via the Together API. This evaluation seeks to quantify the degree to which the fine-tuning process achieved its intended objectives and to characterize the behavioral differences between the adapted model and its base counterpart.

Methodology

The evaluation framework implements a parallel comparison architecture wherein both the fine-tuned model and the base model receive identical prompts under controlled conditions. Each prompt originates from one of two sources: authentic user interactions extracted from the training data distribution, or synthetically constructed queries designed to probe specific capability domains including code generation, architectural reasoning, technical explanation, and debugging assistance. The evaluation pipeline processes fifteen distinct prompts through both models, collecting the raw response text along with generation timing metrics. Each response undergoes multi-dimensional analysis through a suite of computational linguistics measures that quantify structural, coherence, and stylistic properties without relying on subjective human judgment.
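The parallel comparison described above can be sketched as a small harness. This is a minimal illustration, not the report's actual pipeline: `run_paired`, `PairedResult`, and the two stand-in generator functions are hypothetical names, and a real run would replace the stand-ins with calls to the deployed model endpoints.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PairedResult:
    """One model's response to one prompt, with wall-clock generation time."""
    prompt: str
    model: str
    response: str
    seconds: float


def run_paired(
    prompts: List[str],
    generators: Dict[str, Callable[[str], str]],
) -> List[PairedResult]:
    """Send every prompt to every model and record the text and the timing."""
    results: List[PairedResult] = []
    for prompt in prompts:
        for name, generate in generators.items():
            start = time.perf_counter()
            response = generate(prompt)
            elapsed = time.perf_counter() - start
            results.append(PairedResult(prompt, name, response, elapsed))
    return results


# Hypothetical stand-ins for the two model endpoints (assumption, not the real API).
def fake_finetuned(prompt: str) -> str:
    return f"1. Restating the request: {prompt}"


def fake_base(prompt: str) -> str:
    return f"- Happy to help with: {prompt}"


results = run_paired(
    ["Explain K-means clustering."],
    {"finetuned": fake_finetuned, "base": fake_base},
)
for r in results:
    print(r.model, len(r.response.split()), f"{r.seconds:.6f}s")
```

Keeping the generators as plain callables means the same loop serves both models, so any difference in the collected metrics comes from the models or their hosting rather than from the harness.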

The response metrics subsystem computes word count, sentence count, average sentence length, lexical diversity as measured by type-token ratio, code block frequency, list formatting patterns including both numbered and bulleted styles, header usage, question frequency, unique vocabulary size, average word length, punctuation density, and uppercase character ratio. The coherence analysis module examines topic consistency through repeated significant word analysis, logical connector frequency encompassing terms such as therefore, consequently, because, and however, and transition word usage including specifically, for example, similarly, and in contrast. The style transfer assessment quantifies characteristic phrase frequency derived from patterns identified in the training corpus, formality scoring based on formal and informal marker presence, technical term density computed against a curated domain vocabulary, personal pronoun usage ratio, and imperative sentence frequency.
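The connector and transition counts can be illustrated with a short sketch. The term lists below contain only the examples named in the paragraph above, so they are an assumption about a lexicon that the report does not reproduce in full, and `count_terms` is a hypothetical helper name.

```python
import re
from typing import Iterable

# Only the example terms named in the report; the full lexicons are not given.
LOGICAL_CONNECTORS = ("therefore", "consequently", "because", "however")
TRANSITION_TERMS = ("specifically", "for example", "similarly", "in contrast")


def count_terms(text: str, terms: Iterable[str]) -> int:
    """Count whole-word, case-insensitive occurrences of each term."""
    lowered = text.lower()
    total = 0
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        total += len(re.findall(pattern, lowered))
    return total


sample = (
    "However, the index is stale because the cache was not flushed. "
    "For example, restart the worker."
)
connectors = count_terms(sample, LOGICAL_CONNECTORS)
transitions = count_terms(sample, TRANSITION_TERMS)
print(connectors, transitions)
```

Word-boundary matching is a design choice made here so that a term is not counted inside a longer word; a plain substring count would be simpler but noisier.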

Training Configuration

The fine-tuning process executed on Together AI infrastructure employed the following parameter configuration. The base model specification designated Meta Llama 3.1 8B Instruct Reference as the foundation architecture. The training duration spanned three complete epochs over the full dataset. The training corpus comprised 979 conversational exchanges sourced from the memory_turns table, representing authentic user-assistant interaction patterns. The fine-tuning mode utilized full parameter optimization rather than parameter-efficient approaches such as LoRA or QLoRA, maximizing the model's capacity to absorb training distribution characteristics at the cost of increased computational requirements. The resulting model identifier registered as mo_841e/Meta-Llama-3.1-8B-Instruct-Reference-cognitivetwin-v2-full-04e6c420 within the Together AI model registry.

The training data exhibited specific thematic concentrations reflecting the source application domain. Analysis of representative training samples reveals substantial content related to conversation tree coordinate systems, including discussion of tetrahedral coordinate representations where x-coordinates denote conversation depth, y and z-coordinates encode sibling positioning, and coordinate assignment rules govern hierarchical message placement. Additional training content encompasses algorithm integration patterns exemplified by deferred acceptance matching implementations, error handling strategies for PDF processing pipelines, and structured technical explanations featuring enumerated key takeaways. These thematic elements create expectations for corresponding patterns in the fine-tuned model's output distribution.

Quantitative Results

The comparative analysis reveals distinct behavioral signatures differentiating the fine-tuned model from its base counterpart across most measured dimensions, with the formality score as a notable exception where both models score 0.0. The response metrics comparison shows that the fine-tuned model produces responses averaging 138.33 words compared to the base model's 135.47 words, a marginal increase of 2.87 words per response. More significantly, the fine-tuned model exhibits a substantially longer average sentence length at 24.67 words per sentence versus the base model's 12.91 words per sentence, indicating a preference for more elaborate syntactic structures that pack more information into fewer sentences. This 11.76 word differential in average sentence length represents the most pronounced structural divergence observed in the evaluation.

The lexical diversity metric, computed as the ratio of unique words to total words, measures 0.34 for the fine-tuned model against 0.41 for the base model. This reduction in type-token ratio suggests the fine-tuned model exhibits slightly more repetition in its vocabulary usage, potentially reflecting domain-specific terminology recurrence or stylistic preferences for consistent phrasing. The fine-tuned model generated four code blocks across all responses compared to three for the base model. More dramatically, the fine-tuned model produced 28 numbered list entries against the base model's 17, while generating only 6 bullet list entries compared to the base model's 24. This shift from bullet-point formatting toward numbered enumeration represents a substantive stylistic adaptation likely reflecting patterns present in the training corpus.
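The formatting counts reported here can be sketched from the wording in the Metric Definitions appendix. The function names are hypothetical, and the regular expressions are one reading of "digit-period sequences at line start" and "hash characters followed by space", not the evaluation's actual code.

```python
import re

# Build the fence marker programmatically so it does not clash with this block.
FENCE = "`" * 3


def count_code_blocks(text: str) -> int:
    """Triple-backtick sequences divided by two (opening plus closing fence)."""
    return text.count(FENCE) // 2


def count_numbered_items(text: str) -> int:
    """Lines that start with one or more digits followed by a period."""
    return len(re.findall(r"^\d+\.", text, flags=re.MULTILINE))


def count_bullet_items(text: str) -> int:
    """Newline-hyphen-space and newline-asterisk-space sequences."""
    return text.count("\n- ") + text.count("\n* ")


def count_headers(text: str) -> int:
    """Lines that begin with one or more hash characters followed by a space."""
    return len(re.findall(r"^#+ ", text, flags=re.MULTILINE))


sample = "\n".join([
    "## Steps",
    "1. Load the data",
    "2. Fit the model",
    "- note one",
    "* note two",
    FENCE + "python",
    "print('hi')",
    FENCE,
])
print(
    count_code_blocks(sample),
    count_numbered_items(sample),
    count_bullet_items(sample),
    count_headers(sample),
)
```

Under the newline-based bullet definition, a bullet on the very first line of a response is not counted, which is worth keeping in mind when comparing small totals such as 6 versus 24.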

The coherence analysis reveals the fine-tuned model achieving an average topic consistency score of 0.289 compared to the base model's 0.209, indicating stronger thematic coherence within individual responses. The fine-tuned model utilized 6 logical connector terms across all responses while the base model employed 11, suggesting different argumentative structuring strategies. Conversely, the fine-tuned model produced 9 transition words compared to the base model's 5, indicating greater emphasis on explicit conceptual bridging between ideas.
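A minimal sketch of the topic consistency score follows, based on the definition in the Metric Definitions appendix. It assumes that "significant" words are those longer than four characters and that repeated words are counted within that same set; the appendix wording leaves this open, so treat the function as one plausible reading.

```python
import re
from collections import Counter


def topic_consistency(text: str, min_length: int = 5) -> float:
    """Share of distinct significant words that occur more than once."""
    words = re.findall(r"[a-z]+", text.lower())
    significant = [w for w in words if len(w) >= min_length]
    counts = Counter(significant)
    if not counts:
        return 0.0
    repeated = sum(1 for c in counts.values() if c > 1)
    return repeated / len(counts)


sample = (
    "The coordinate system assigns each message a coordinate. "
    "Depth grows as the conversation tree grows deeper, "
    "and each message keeps its depth."
)
score = topic_consistency(sample)
print(round(score, 3))
```

In the sample, four of the nine distinct significant words (coordinate, message, depth, grows) repeat, giving 4/9. A higher score therefore means a response keeps returning to the same content words.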

The style transfer metrics demonstrate the evaluation's central finding regarding fine-tuning effectiveness. The fine-tuned model incorporated 6 characteristic phrases matching patterns identified in the training corpus, double the 3 characteristic phrases observed in base model outputs. Both models registered identical formality scores of 0.0, indicating neutral register without significant formal or informal marker prevalence. The fine-tuned model exhibited technical term density of 0.308 compared to the base model's 0.211, representing a 46% relative increase.

The generation timing analysis recorded average response latency of 2.54 seconds for the fine-tuned model compared to 1.35 seconds for the base model. This is an increase of approximately 88%.
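The relative differences cited for technical term density and latency can be recomputed from the reported averages. The sketch below uses only figures stated in this section; `relative_change` is a hypothetical helper name.

```python
def relative_change(new: float, old: float) -> float:
    """Fractional change from the base model value to the fine-tuned value."""
    return (new - old) / old


# Averages as reported in this section (fine-tuned value first, base value second).
technical_density = relative_change(0.308, 0.211)
latency = relative_change(2.54, 1.35)
sentence_length_gap = 24.67 - 12.91
phrase_ratio = 6 / 3

print(f"technical term density: {technical_density:+.1%}")
print(f"latency: {latency:+.1%}")
print(f"sentence length gap: {sentence_length_gap:.2f} words")
print(f"characteristic phrase ratio: {phrase_ratio:.1f}x")
```

Rounded to whole percentages, these come to 46 and 88, matching the figures quoted in the text.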

Qualitative Analysis

Examination of response pairs across identical prompts reveals consistent behavioral patterns distinguishing the fine-tuned model from its base counterpart. When presented with the prompt requesting code enhancement without specifying the target code, the fine-tuned model responded with a direct statement acknowledging the incomplete request context before immediately generating a complete code example demonstrating enhanced patterns including type annotations, protocol definitions, and comprehensive logging infrastructure. The base model conversely began with a polite acknowledgment requesting clarification before offering to assist. This differential response strategy reflects the fine-tuned model's apparent training on contexts where assistant responses frequently include proactive code generation regardless of prompt ambiguity.

The K-means clustering prompt elicited structurally similar responses from both models, each providing function implementations with docstrings and appropriate library imports. However, the fine-tuned model's response incorporated explanatory preamble text explicitly restating the interpreted user intent, while the base model proceeded directly to implementation. This pattern of interpretive restatement appears consistently in the fine-tuned model's outputs and likely reflects training data patterns where clarification of intent precedes technical content.

The SVG generation prompt produced visually similar structural approaches from both models, each generating properly formatted SVG elements with viewBox specifications, grouped elements, and appropriate attribute assignments. The fine-tuned model included additional CSS styling within a dedicated style block and employed a container div pattern suggesting web integration context, while the base model produced a more minimal standalone SVG specification. These differences reflect the fine-tuned model's adaptation toward web development contexts prevalent in the training corpus.

The JavaScript property existence checking prompt yielded the most substantive content overlap between models, with both providing hasOwnProperty, in operator, and optional chaining approaches. The fine-tuned model organized its response around progressive complexity with clear conceptual explanations preceding each code example, while the base model employed subsection headers distinguishing JavaScript, TypeScript, and React contexts. Both approaches demonstrate valid structural strategies, with the fine-tuned model favoring narrative flow and the base model emphasizing categorical organization.

Training Corpus Characteristics

Analysis of the training data samples reveals distinctive thematic and structural patterns that manifest in the fine-tuned model's outputs. The first sample demonstrates extended analytical discourse concerning coordinate system interpretation, spanning multiple paragraphs with explicit enumeration of key takeaways and acknowledgment of interpretive uncertainty. The second and third samples continue this coordinate system analysis theme with increasing specificity regarding z-coordinate semantics. The fourth sample presents algorithm integration code with comprehensive docstrings and parameter specifications. The fifth sample addresses error handling patterns with progressive solution strategies.

These training samples exhibit several consistent structural features including paragraph-level organization rather than bullet-point enumeration, explicit enumeration using numbered lists for key points and strategies, substantial explanatory context preceding technical content, acknowledgment of limitations and contextual dependencies, and domain-specific technical terminology. The fine-tuned model's evaluated outputs demonstrate absorption of these patterns, particularly the preference for numbered enumeration over bullets, the extended sentence structures accommodating complex explanations, and the increased technical terminology density.

Interpretation of Results

The evaluation results support the conclusion that the fine-tuning process achieved its primary objective of transferring stylistic and structural patterns from the training corpus into the model's generative behavior. The doubling of characteristic phrase frequency from 3 to 6 instances provides direct evidence of vocabulary pattern transfer. The 46% increase in technical term density, from 0.211 to 0.308, indicates a corresponding shift toward domain vocabulary.

The reduction in lexical diversity and the increase in sentence length together suggest the fine-tuned model favors more elaborate, domain-specific phrasing over the varied vocabulary and shorter sentences characterizing the base model's more general-purpose style. This trade-off represents an expected consequence of domain adaptation wherein generality decreases as specificity increases.

The increase in topic consistency score indicates the fine-tuned model maintains stronger thematic coherence within individual responses, potentially reflecting the training data's emphasis on sustained analytical discourse rather than fragmented response patterns. The reduction in logical connector usage paired with increased transition word usage suggests a stylistic preference for explicit conceptual bridging over formal logical argumentation.

Limitations

Several methodological limitations constrain the interpretability of these results. The embedding similarity analysis intended to provide semantic alignment metrics encountered API availability issues, rendering the training similarity scores unreliable. The characteristic phrase lexicon derives from manual identification rather than computational extraction from training data, potentially omitting significant patterns or including non-distinctive terms. The evaluation prompt set, while incorporating authentic training data prompts, represents a small sample of the possible prompt distribution and may not adequately represent edge cases or adversarial inputs.

The style metrics rely on surface-level lexical analysis without capturing deeper pragmatic or discourse-level patterns that may distinguish the fine-tuned model from its base counterpart. The absence of human evaluation prevents assessment of subjective quality dimensions including helpfulness, accuracy, and naturalness. The generation time comparison conflates model differences with infrastructure differences, limiting its utility for efficiency assessment.

Conclusions

The CognitiveTwin V2 fine-tuned model demonstrates measurable adaptation to the stylistic and structural patterns present in its training corpus. The evaluation identifies successful transfer of characteristic phrases, increased domain-specific terminology usage, preference for numbered enumeration formatting, extended sentence structures, and improved topic consistency relative to the base model. These adaptations collectively produce a response profile distinctly different from the general-purpose base model, reflecting the domain-specific and stylistically particular characteristics of the training data source.

The fine-tuning process achieved its intended objective of creating a model instance that generates responses bearing recognizable patterns from the training distribution. Whether these adaptations improve utility in the target application context requires human evaluation beyond the scope of this computational analysis. The model is suitable for deployment in contexts where the absorbed stylistic and domain patterns align with user expectations, with the caveat that the reduced lexical diversity and longer sentence structures may not suit all interaction contexts.

Appendix: Evaluation Configuration

The evaluation executed on December 31, 2025 at 09:33:52 UTC with the following configuration parameters. The fine-tuned model identifier was mo_841e/Meta-Llama-3.1-8B-Instruct-Reference-cognitivetwin-v2-full-04e6c420. The base model identifier was meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo. The maximum token generation limit was set to 250 tokens per response. The temperature parameter was set to 0.7. The prompt count totaled 15 distinct prompts. Each prompt received a single generation run. The intended embedding model was togethercomputer/m2-bert-80M-8k-retrieval, though embedding computation encountered availability issues during execution.
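For reproduction purposes, the parameters listed above can be captured in a single configuration object. The sketch below is illustrative: the `EvalConfig` class and its field names are hypothetical, while the values are copied from this appendix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalConfig:
    """Evaluation parameters as listed in this appendix (field names are assumed)."""
    finetuned_model: str = (
        "mo_841e/Meta-Llama-3.1-8B-Instruct-Reference-cognitivetwin-v2-full-04e6c420"
    )
    base_model: str = "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"
    max_tokens: int = 250
    temperature: float = 0.7
    prompt_count: int = 15
    runs_per_prompt: int = 1
    embedding_model: str = "togethercomputer/m2-bert-80M-8k-retrieval"


config = EvalConfig()

# Two models, fifteen prompts, one run each.
total_generations = 2 * config.prompt_count * config.runs_per_prompt
print(total_generations)
```

With a single run per prompt at temperature 0.7, each reported average rests on fifteen sampled responses per model, which is useful context when reading small differences between the two.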

Appendix: Domain-Specific Knowledge Transfer Analysis

A targeted evaluation probing domain-specific knowledge transfer reveals significant differentiation between the fine-tuned and base models when presented with prompts closely aligned to training data themes. The z-coordinate conversation tree prompt elicited responses demonstrating distinct interpretive frameworks. The fine-tuned model explained the z-coordinate as representing conversation tree depth, measuring distance from the root node at origin, with values incrementing as traversal proceeds deeper into the tree structure. This interpretation draws on the depth concept present in training samples discussing coordinate assignment rules, although those samples assign depth to the x-coordinate and use the y and z-coordinates for sibling positioning, so the model reproduces the corpus framing rather than its exact axis assignment. The base model conversely interpreted the tetrahedral system through the lens of quaternary branching structures, explaining the z-coordinate as enabling complex non-linear tree traversal rather than depth representation. This interpretive divergence suggests absorption of domain-specific conceptual frameworks from the training corpus, though not of every detail.

The deferred acceptance algorithm prompt produced structurally similar responses from both models, each proposing data structure definitions followed by initialization procedures. The fine-tuned model explicitly enumerated a BloomFilter class supporting membership checking with specified error rates alongside a DeferredAcceptance implementation class, while the base model referenced external library dependencies before proceeding to algorithmic description. Both approaches represent valid implementation strategies, with the fine-tuned model's explicit class enumeration reflecting training patterns emphasizing self-contained code examples.

The trajectory coordinate generation prompt yielded the most substantive evidence of domain adaptation. The fine-tuned model's response employed the exact terminology of trajectory coordinates as a component of conversational AI systems, structuring its explanation around conversation goals, context establishment, and audience targeting. The base model interpreted the prompt through a natural language processing framework emphasizing text preprocessing, tokenization, and linguistic reduction techniques. The fine-tuned model's interpretation aligns precisely with the training corpus's treatment of trajectory coordinates as geometric representations within conversation trees rather than NLP feature vectors. This alignment demonstrates successful transfer of domain-specific conceptual vocabulary and associated semantic interpretations from training data to model behavior.

Appendix: Metric Definitions

The word count metric measures the total number of whitespace-delimited tokens in the response text. The sentence count metric measures the number of segments delimited by period, exclamation, or question mark characters. The average sentence length metric divides word count by sentence count. The lexical diversity metric divides unique word count by total word count after case normalization and alphabetic filtering. The code block count metric divides the count of triple backtick sequences by two. The numbered list count metric counts lines matching the regular expression pattern for digit-period sequences at line start. The bullet list count metric counts newline-hyphen-space and newline-asterisk-space sequences. The header count metric counts lines beginning with one or more hash characters followed by space. The topic consistency metric divides the count of words appearing more than once by the count of unique words longer than four characters. The logical connector count metric sums occurrences of predetermined logical connector terms. The transition word count metric sums occurrences of predetermined transition terms. The characteristic phrase count metric sums occurrences of predetermined characteristic phrase patterns. The formality score metric subtracts informal marker count from formal marker count and divides by word count. The technical term density metric divides technical term occurrence count by word count. The personal pronoun ratio metric divides first and second person pronoun count by total word count.
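Several of the definitions above translate directly into code. The sketch below is a hedged reading of the prose, not the evaluation's source: all function names are hypothetical, and the technical vocabulary and formal and informal marker lists are placeholders, because the report does not reproduce its curated lexicons.

```python
import re
from typing import Iterable, List


def word_count(text: str) -> int:
    """Whitespace-delimited tokens."""
    return len(text.split())


def sentence_count(text: str) -> int:
    """Non-empty segments delimited by period, exclamation, or question mark."""
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])


def average_sentence_length(text: str) -> float:
    sentences = sentence_count(text)
    return word_count(text) / sentences if sentences else 0.0


def alphabetic_words(text: str) -> List[str]:
    """Case-normalized, alphabetic-only words."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_diversity(text: str) -> float:
    """Unique words divided by total words (type-token ratio)."""
    words = alphabetic_words(text)
    return len(set(words)) / len(words) if words else 0.0


def term_density(text: str, vocabulary: Iterable[str]) -> float:
    """Occurrences of vocabulary terms divided by word count."""
    vocab = set(vocabulary)
    total = word_count(text)
    hits = sum(1 for w in alphabetic_words(text) if w in vocab)
    return hits / total if total else 0.0


def formality_score(text: str, formal: Iterable[str], informal: Iterable[str]) -> float:
    """(formal marker count - informal marker count) / word count."""
    total = word_count(text)
    if not total:
        return 0.0
    words = alphabetic_words(text)
    formal_set, informal_set = set(formal), set(informal)
    formal_hits = sum(1 for w in words if w in formal_set)
    informal_hits = sum(1 for w in words if w in informal_set)
    return (formal_hits - informal_hits) / total


sample = "The tree is deep. The tree is wide! Is the tree balanced?"
print(word_count(sample), sentence_count(sample), average_sentence_length(sample))
print(lexical_diversity(sample))
# Placeholder lexicons (assumptions, not the report's lists).
print(term_density(sample, {"tree"}))
print(formality_score(sample, {"therefore", "furthermore"}, {"gonna", "yeah"}))
```

Two properties of these definitions are worth noting when reading the results. Splitting on periods treats list markers such as "1." and decimals such as "0.7" as sentence boundaries, so sentence counts depend on formatting as well as prose. A formality score of exactly 0.0 arises whenever formal and informal markers are equally frequent, including when neither appears.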

Source Anchor

Comp-Core/core/_recovered/retrieval/cc-rag-plus-plus/docs/CognitiveTwin/V2/COGNITIVETWIN_V2_EVALUATION_REPORT.md
