Grand Diomande Research · Full HTML Reader

CognitiveTwin: Architectural Foundations and Empirical Evaluation of Personalized Language Model Adaptation Through Trajectory-Aware Fine-Tuning

The construction of personalized language model instances capable of reproducing individual cognitive patterns, stylistic signatures, and domain-specific conceptual frameworks represents a significant advancement in the development of AI systems that function as cognitive extensions rather than generic tools. This paper presents the CognitiveTwin framework, a comprehensive architecture for creating personalized language model instances through trajectory-aware supervised fine-tuning on conversational interaction hi

Agents That Account for Themselves working paper preprint render candidate score 100 .tex

Full Public Reader

Abstract

The construction of personalized language model instances capable of reproducing individual cognitive patterns, stylistic signatures, and domain-specific conceptual frameworks represents a significant advancement in the development of AI systems that function as cognitive extensions rather than generic tools. This paper presents the CognitiveTwin framework, a comprehensive architecture for creating personalized language model instances through trajectory-aware supervised fine-tuning on conversational interaction histories. The architectural foundation integrates three principal components: a trajectory coordinate system that encodes hierarchical conversation structure through tetrahedral geometric representations, a style signature extraction mechanism that captures recurring linguistic patterns across interaction histories, and a dual-ring memory topology that maintains both episodic and semantic memory traces for context-aware generation. We formalize the mathematical foundations underlying each architectural component, establish theoretical guarantees for style transfer convergence under specified conditions, and introduce a multi-dimensional evaluation framework encompassing lexical, syntactic, semantic, and pragmatic assessment dimensions. Empirical evaluation of a CognitiveTwin instance fine-tuned on 979 conversational exchanges demonstrates statistically significant improvements across all measured dimensions: characteristic phrase frequency increased by 100\

Keywords: Large Language Models, Personalization, Fine-Tuning, Cognitive Architecture, Style Transfer, Trajectory Modeling, Domain Adaptation, Evaluation Methodology

Introduction

The emergence of large language models as general-purpose reasoning and generation systems has fundamentally altered the landscape of human-computer interaction, yet the homogeneity of model outputs across users presents a fundamental limitation for applications requiring personalized cognitive augmentation. Contemporary language models, trained on aggregated corpora spanning billions of tokens from millions of sources, necessarily reflect statistical averages of human linguistic behavior rather than the distinctive patterns characterizing individual cognition. The challenge of creating AI systems that function as personalized cognitive extensions---mirroring the stylistic preferences, conceptual frameworks, and domain expertise of specific individuals---requires architectural innovations beyond conventional fine-tuning approaches and evaluation methodologies capable of quantifying subtle behavioral differences that distinguish successful personalization from superficial adaptation.

The CognitiveTwin framework addresses this challenge through an integrated architecture that treats conversational histories not merely as training data but as structured records of cognitive activity amenable to geometric and topological analysis. Central to this approach is the recognition that conversations exhibit hierarchical structure: user queries spawn assistant responses which may themselves prompt follow-up queries, generating tree-like structures where positional relationships encode semantic and pragmatic relationships beyond what sequential text alone captures. The trajectory coordinate system introduced in this work assigns each conversational turn a position in a tetrahedral coordinate space, enabling the model to condition generation on structural context that reflects the cognitive path leading to the current interaction point.

Motivation and Problem Context

The inadequacy of generic language model outputs for specialized applications manifests across multiple dimensions that collectively motivate the personalization research agenda. First, stylistic mismatch between model outputs and user expectations creates friction in human-AI collaboration: users accustomed to particular communication patterns find generic model outputs jarring or inappropriate for their contexts. Second, domain knowledge gaps emerge when models trained on general corpora encounter specialized terminology, conceptual frameworks, or procedural knowledge characteristic of specific fields or organizations. Third, the absence of historical context means each interaction begins anew, failing to leverage accumulated understanding from prior exchanges that would enable more efficient and contextually appropriate responses.

These limitations become particularly acute in professional contexts where AI systems function as knowledge work augmentation tools. A software architect seeking design feedback expects responses that reflect understanding of the codebase under discussion, the architectural principles previously established, and the stylistic conventions governing technical communication within the organization. Generic model outputs, while potentially correct, fail to integrate with the cognitive context the user brings to the interaction, requiring additional effort to translate between the model's generic framing and the user's specific conceptual vocabulary.

The CognitiveTwin framework operationalizes a vision of AI systems as cognitive extensions: computational entities whose outputs reflect not generic human communication patterns but the specific patterns characterizing their training data sources. This vision requires abandoning the implicit assumption that language model utility derives from approximating population-level linguistic statistics, instead embracing the goal of faithful reproduction of individual-level patterns even when those patterns deviate from population norms.

Technical Challenges

Realizing the CognitiveTwin vision requires addressing several technical challenges that distinguish personalization from conventional fine-tuning. The first challenge concerns data efficiency: personalization corpora necessarily comprise far fewer examples than pre-training corpora, requiring techniques that extract maximal signal from limited data without overfitting. The second challenge concerns generalization: personalized models must reproduce target patterns across novel prompts, not merely memorize training examples. The third challenge concerns evaluation: standard language modeling metrics such as perplexity fail to capture stylistic fidelity, necessitating novel evaluation frameworks specifically designed for personalization assessment.

The architectural challenges are equally substantial. Encoding conversational structure requires representations capable of capturing hierarchical relationships extending across multiple turns, where the meaning of a response depends not only on its immediate prompt but on the entire path of conversation leading to that point. Capturing stylistic patterns requires mechanisms for identifying recurring linguistic features---vocabulary choices, syntactic preferences, formatting conventions---and ensuring their consistent reproduction in generated outputs. Integrating domain knowledge requires representations that encode not merely surface-level terminology but deeper conceptual relationships that govern how terms relate to one another within the target domain.

Contributions

This paper makes the following contributions to the fields of language model personalization, cognitive architecture design, and natural language generation evaluation.

The first contribution is the CognitiveTwin architecture itself, comprising three integrated components: the trajectory coordinate system for encoding hierarchical conversation structure, the style signature mechanism for capturing and reproducing recurring linguistic patterns, and the dual-ring memory topology for maintaining contextual state across interaction sessions. Each component is formalized mathematically, with theoretical analysis establishing conditions under which the architecture achieves its intended objectives.

The second contribution is a comprehensive evaluation framework for personalization assessment, introducing novel metrics spanning response structure, coherence, style transfer, and domain knowledge dimensions. The framework includes formal definitions for each metric, enabling reproducible evaluation across different personalization systems and training corpora.

The third contribution is extensive empirical validation demonstrating the viability of the CognitiveTwin approach. Evaluation of a fine-tuned instance against its base model across multiple prompt categories establishes statistically significant improvements in style transfer indicators while maintaining coherence and factual accuracy.

The fourth contribution is the release of evaluation tooling implementing all described metrics and analysis procedures, enabling the research community to apply the CognitiveTwin evaluation framework to their own personalization experiments.

Theoretical Foundations

Formal Problem Specification

We begin by establishing the formal mathematical framework within which the personalization problem is defined. Let $\vocab$ denote a finite vocabulary and $\vocab^* = \bigcup_{n=0}^{\infty} \vocab^n$ the set of all finite sequences over $\vocab$, including the empty sequence $\epsilon$. A language model is a function $\mathcal{M}: \vocab^* \rightarrow \Delta(\vocab)$ mapping input sequences to probability distributions over next tokens, where $\Delta(\vocab)$ denotes the probability simplex $\{p \in \real^{|\vocab|}_{\geq 0} : \sum_v p_v = 1\}$.

definition: Autoregressive Response Distribution. For model $\mathcal{M}$, prompt $\prompt \in \vocab^*$, and response $\response = (r_1, \ldots, r_n) \in \vocab^*$, the autoregressive response probability is: \begin{equation} P_{\mathcal{M}}(\response | \prompt) = \prod_{t=1}^{n} \mathcal{M}(r_t | \prompt \oplus \response_{<t}) \cdot \mathcal{M}(\texttt{EOS} | \prompt \oplus \response) \end{equation} where $\oplus$ denotes sequence concatenation, $\response_{<t} = (r_1, \ldots, r_{t-1})$ denotes the length-$(t-1)$ prefix of $\response$, and $\texttt{EOS}$ is the end-of-sequence token.
The personalization problem can now be stated precisely. Given a base model $\modelbase$ with parameters $\theta_{\text{base}} \in \real^d$ and a personalization corpus $\dataset_{\text{pers}} = \{(\prompt_i, \response_i)\}_{i=1}^{N}$, we seek parameters $\theta_{\text{ft}}$ minimizing a composite objective: \begin{equation} \theta_{\text{ft}} = \argmin_{\theta} \left[ \loss_{\text{NLL}}(\theta; \dataset_{\text{pers}}) + \lambda \loss_{\text{style}}(\theta; \dataset_{\text{pers}}) + \mu \loss_{\text{reg}}(\theta; \theta_{\text{base}}) \right] \end{equation} where $\loss_{\text{NLL}}$ is the standard negative log-likelihood objective, $\loss_{\text{style}}$ is a style matching objective defined subsequently, and $\loss_{\text{reg}}$ is a regularization term preventing excessive deviation from the base model.

Style as Distributional Signature

The formalization of ``style'' requires careful theoretical development. We adopt a distributional perspective wherein style is characterized by the statistical distribution of linguistic features across a corpus rather than by individual exemplars.

definition: Feature Space. A feature extractor is a measurable function $\phi: \vocab^* \rightarrow \real^k$ mapping texts to $k$-dimensional feature vectors. The feature distribution induced by corpus $\dataset$ is the empirical measure: \begin{equation} \mu_{\dataset}^{\phi} = \frac{1}{|\dataset|} \sum_{(\prompt, \response) \in \dataset} \delta_{\phi(\response)} \end{equation} where $\delta_x$ denotes the Dirac measure centered at $x$.
The choice of feature extractor $\phi$ determines which aspects of text are considered stylistically relevant. A minimalist choice might extract only word frequency statistics; a more sophisticated choice might include syntactic parse features, discourse markers, or semantic embeddings. The CognitiveTwin framework employs a composite feature extractor combining multiple aspects: \begin{equation} \phi(\response) = \left[ \phi_{\text{lex}}(\response), \phi_{\text{syn}}(\response), \phi_{\text{sem}}(\response), \phi_{\text{fmt}}(\response) \right] \end{equation} where $\phi_{\text{lex}}$ captures lexical statistics (vocabulary, word length distributions), $\phi_{\text{syn}}$ captures syntactic patterns (sentence length, clause structure), $\phi_{\text{sem}}$ captures semantic content (topic distributions, technical terminology), and $\phi_{\text{fmt}}$ captures formatting choices (list usage, code block frequency).
definition: Style Distance. The style distance between distributions $\mu$ and $\nu$ in feature space is: \begin{equation} D_{\text{style}}(\mu, \nu) = \int \|\phi\|^2 d(\mu - \nu)(\phi) + \gamma \cdot \text{MMD}_k(\mu, \nu) \end{equation} where $\text{MMD}_k$ is the maximum mean discrepancy with kernel $k$, and $\gamma$ balances moment matching with distributional matching.

theorem: Style Transfer Convergence.
Under standard regularity conditions on $\phi$ and the model family, gradient descent on $\loss_{\text{style}}$ with learning rate $\eta < 1/L$ (where $L$ is the Lipschitz constant of $\nabla \loss_{\text{style}}$) converges to a local minimum of the style distance at rate $O(1/\sqrt{T})$ for $T$ optimization steps.

The proof follows standard convex optimization arguments applied to the feature space representation, with the key insight that the composition of the autoregressive sampling distribution with the feature extractor induces a smooth manifold in feature space under mild conditions on $\phi$.

Trajectory Geometry

The trajectory coordinate system encodes conversational structure through geometric embeddings. Consider a conversation tree $\mathcal{T} = (V, E)$ where vertices $V$ represent conversational turns (alternating user and assistant messages) and directed edges $E$ represent the parent-child relationship between turns.

definition: Trajectory Coordinates. The trajectory coordinate function $\trajectory: V \rightarrow \real^4$ assigns each vertex a position in tetrahedral space according to: \begin{align} \tau_x(v) &= \text{depth}(v) \\ \tau_y(v) &= \frac{\text{sibling\_rank}(v)}{|\text{siblings}(v)|} \\ \tau_z(v) &= -\frac{|\text{siblings}(v)| - 1}{2} + \text{sibling\_order}(v) \\ \tau_w(v) &= \frac{\Delta t(v)}{\Delta t_{\max}} \cdot e^{-\lambda \Delta t(v)} \end{align} where $\text{depth}(v)$ is the length of the path from root to $v$, $\text{sibling\_rank}(v)$ is the ordinal position among siblings, and $\Delta t(v)$ is the temporal gap from the parent message.

The tetrahedral structure emerges from the four independent axes: $x$ encodes vertical depth in the conversation tree, $y$ encodes relative position among siblings, $z$ encodes absolute sibling displacement from the center, and $w$ encodes temporal dynamics with exponential decay. This four-dimensional representation enables the model to distinguish between structurally identical subtrees occurring at different points in the overall conversation.

proposition: Trajectory Injectivity.
For conversation trees without repeated timestamps, the trajectory coordinate function is injective: distinct vertices receive distinct coordinates.

The proof proceeds by observing that vertices at different depths are distinguished by $\tau_x$, vertices at the same depth with different parents are distinguished by the inherited coordinates of their parents, and siblings are distinguished by $\tau_y$, $\tau_z$, and $\tau_w$.

Memory Topology

The dual-ring memory architecture maintains two complementary memory structures: an episodic ring storing recent conversational context with temporal ordering, and a semantic ring storing extracted patterns with similarity-based organization.

definition: Episodic Ring. The episodic ring $\mathcal{R}_E = (\embedding_1, \ldots, \embedding_K)$ is a circular buffer of $K$ embedding vectors representing recent conversational turns, with insertion replacing the oldest element: \begin{equation} \text{insert}(\mathcal{R}_E, \embedding) = (\embedding, \embedding_1, \ldots, \embedding_{K-1}) \end{equation}
definition: Semantic Ring. The semantic ring $\mathcal{R}_S$ is a similarity-indexed structure where insertion is position-dependent: \begin{equation} \text{pos}(\embedding) = \argmax_{i} \left\langle \embedding, \embedding_i \right\rangle \end{equation} and the new embedding is inserted adjacent to its most similar existing element, potentially displacing the least similar neighbor.
The dual-ring topology enables attention mechanisms to access both recent context (episodic ring, temporal locality) and thematically related content (semantic ring, semantic locality). The attention distribution over the combined rings is: \begin{equation} \alpha_i = \frac{\exp(\langle \mathbf{q}, \mathbf{k}_i \rangle / \sqrt{d})}{\sum_j \exp(\langle \mathbf{q}, \mathbf{k}_j \rangle / \sqrt{d})} \end{equation} where query $\mathbf{q}$ derives from the current generation context and keys $\mathbf{k}_i$ derive from ring elements.

System Architecture

The CognitiveTwin architecture integrates multiple components into a coherent system for personalized language generation. This section details the architectural design, component interactions, and implementation considerations.

Architectural Overview

The system comprises five principal layers, each responsible for distinct aspects of personalized generation. The data ingestion layer processes raw conversational histories into structured representations suitable for downstream processing. The trajectory encoding layer computes geometric coordinates for each conversational turn, enabling structure-aware generation. The style extraction layer identifies recurring patterns across the corpus, constructing the style signature that guides generation. The memory management layer maintains episodic and semantic memory structures across interaction sessions. The generation layer produces outputs conditioned on trajectory context, style constraints, and memory state.

The architectural philosophy emphasizes separation of concerns: each layer operates on well-defined inputs and produces well-defined outputs, enabling independent development, testing, and optimization. Inter-layer communication follows a dataflow model where upstream layers produce representations consumed by downstream layers, with minimal back-channel communication to preserve modularity.

Data Ingestion Pipeline

The data ingestion pipeline transforms raw conversational histories from storage systems into the structured representations required by the CognitiveTwin architecture. The pipeline comprises four stages: extraction, normalization, tree construction, and validation.

The extraction stage retrieves conversational data from the source storage system, in this implementation a PostgreSQL database accessed through the Supabase client library. Each conversational turn is represented as a record containing the message content, role identifier (user or assistant), timestamp, and parent reference enabling tree reconstruction. The extraction query retrieves all turns associated with the target personalization corpus, ordered by timestamp to preserve temporal structure.

The normalization stage applies text preprocessing operations to ensure consistent representation across the corpus. Operations include Unicode normalization to canonical form NFC, whitespace standardization collapsing multiple spaces and normalizing line endings, and optional lowercasing for case-insensitive analysis. Critically, normalization preserves formatting elements such as code blocks and lists that carry stylistic significance; aggressive normalization that strips formatting would eliminate important style signals.

The tree construction stage reconstructs the hierarchical conversation structure from flat database records. Each turn's parent reference identifies its antecedent in the conversation, enabling construction of a forest of conversation trees (multiple roots corresponding to independent conversation initiations). The construction algorithm operates in linear time through a single pass that builds a parent pointer structure and then inverts it to child pointers.

The validation stage verifies data integrity and identifies anomalies requiring special handling. Validation checks include confirming that all parent references resolve to existing turns, detecting and handling orphaned subtrees whose roots are missing, verifying temporal consistency where child timestamps exceed parent timestamps, and flagging potential data quality issues such as extremely short or extremely long messages.

Trajectory Encoder

The trajectory encoder computes geometric coordinates for each conversational turn based on its position within the conversation tree structure. The encoder implements the trajectory coordinate function defined in Section~[ref: sec:theory], with additional considerations for numerical stability and edge cases.

The depth computation traverses from each node to the root, counting edges. For efficiency in large trees, the implementation caches depth values during a single depth-first traversal rather than recomputing paths for each node. The root node receives depth zero, and each subsequent generation increments by one.

The sibling computations require access to sibling sets, obtained by grouping children of each parent node. The sibling rank normalizes ordinal position to the unit interval, ensuring consistent scaling regardless of sibling count. The sibling displacement $\tau_z$ centers siblings around zero, with negative values for earlier siblings and positive values for later siblings in the ordering.

The temporal component $\tau_w$ requires special handling for the first message in conversations (no parent timestamp) and for cases where timestamps are missing or corrupted. The implementation assigns $\tau_w = 0$ for root nodes and applies the exponential decay formula for all others, with configurable decay constant $\lambda$ controlling the temporal sensitivity.

The encoder outputs a four-dimensional vector for each conversational turn, which is subsequently projected to the model's hidden dimension through a learned linear transformation: \begin{equation} \hidden_{\trajectory} = \weight_{\trajectory} \cdot \trajectory + \bias_{\trajectory} \end{equation} where $\weight_{\trajectory} \in \real^{d \times 4}$ and $\bias_{\trajectory} \in \real^d$ are trainable parameters.

Style Signature Extractor

The style signature extractor analyzes the personalization corpus to identify recurring patterns that characterize the target style. The extractor operates in two phases: pattern discovery and signature compilation.

The pattern discovery phase applies statistical analysis to identify features distinguishing the target corpus from generic text. For vocabulary patterns, the extractor computes term frequency-inverse document frequency (TF-IDF) scores treating the personalization corpus as one document and a reference corpus as the background, identifying terms with unusually high frequency in the personalization corpus. For syntactic patterns, the extractor computes sentence length distributions, clause depth distributions, and punctuation usage frequencies, comparing against reference statistics. For formatting patterns, the extractor counts occurrences of structural elements including numbered lists, bullet lists, code blocks, and headers, computing normalized frequencies.

The signature compilation phase aggregates discovered patterns into a compact signature representation. The signature comprises multiple components: a characteristic vocabulary set containing terms with TF-IDF exceeding a threshold, a syntactic profile summarizing sentence length mean and variance along with preferred clause structures, a formatting profile specifying expected frequencies of structural elements, and a semantic centroid representing the average embedding of corpus texts. The compiled signature is stored alongside the fine-tuned model and consulted during generation to guide output toward target patterns.

definition: Style Signature. The style signature $\style$ is a tuple: \begin{equation} \style = (\vocab_{\text{char}}, \mu_{\text{syn}}, \Sigma_{\text{syn}}, \mathbf{f}_{\text{fmt}}, \embedding_{\text{sem}}) \end{equation} where $\vocab_{\text{char}} \subset \vocab$ is the characteristic vocabulary, $(\mu_{\text{syn}}, \Sigma_{\text{syn}})$ parameterize the syntactic length distribution, $\mathbf{f}_{\text{fmt}} \in \real^m$ is the formatting frequency vector, and $\embedding_{\text{sem}} \in \real^d$ is the semantic centroid.

Memory Manager

The memory manager maintains episodic and semantic memory structures across interaction sessions, enabling the model to leverage historical context beyond the immediate prompt. The manager implements the dual-ring topology with configurable ring sizes and eviction policies.

The episodic ring maintains a fixed-size buffer of recent conversational turns, represented as embedding vectors obtained by passing turn text through the model's encoder. Ring insertion follows FIFO ordering: new turns enter at the head position, and when capacity is reached, the oldest turn is evicted from the tail. The ring size parameter $K_E$ balances memory cost against historical reach; larger rings provide more context at higher computational cost during attention.

The semantic ring maintains a similarity-indexed structure where insertion position depends on content similarity rather than temporal ordering. When a new embedding arrives, the manager computes cosine similarities against all existing ring elements, identifies the most similar element, and inserts the new embedding adjacent to it. If the ring is at capacity, the element most dissimilar to its neighbors is evicted. This organization clusters semantically related content, enabling efficient retrieval of thematically relevant historical context.

The attention mechanism accesses both rings through a unified interface. Given query vector $\mathbf{q}$ from the current generation context, the manager returns attention-weighted combinations of ring contents: \begin{equation} \mathbf{c}_E = \sum_{i=1}^{K_E} \alpha_i^E \cdot \embedding_i^E, \quad \mathbf{c}_S = \sum_{j=1}^{K_S} \alpha_j^S \cdot \embedding_j^S \end{equation} where attention weights are computed separately for each ring. The context vectors $\mathbf{c}_E$ and $\mathbf{c}_S$ are concatenated and projected to produce the memory context contribution to generation.

Generation Pipeline

The generation pipeline orchestrates the components described above to produce personalized outputs. Given an input prompt, the pipeline executes the following sequence of operations.

First, the prompt is encoded through the base model's tokenizer and embedding layer, producing initial hidden states. The trajectory encoder computes coordinates for the current turn based on its position in the ongoing conversation (or assigns root coordinates for conversation-initial prompts). The trajectory embedding is added to the initial hidden states, injecting structural context.

Second, the memory manager retrieves relevant historical context from both rings. Episodic context provides recent conversational history; semantic context provides thematically related prior exchanges. The context vectors are integrated through cross-attention layers interleaved with the model's self-attention layers.

Third, the style signature constrains the decoding process. The characteristic vocabulary influences the output distribution through vocabulary biasing: terms in $\vocab_{\text{char}}$ receive boosted logits proportional to their TF-IDF scores. Syntactic constraints are enforced through length penalties that discourage outputs deviating excessively from the target sentence length distribution. Formatting preferences are encouraged through special token biasing that increases the probability of list markers and code fence tokens when the formatting profile indicates their expected presence.

Fourth, autoregressive decoding generates the output token sequence. The implementation supports multiple decoding strategies including greedy decoding, beam search, and nucleus sampling, with the latter preferred for its balance between quality and diversity. Temperature and top-p parameters are configurable to control output variability.

Training Methodology

Fine-Tuning Paradigm

The CognitiveTwin training methodology employs full-parameter supervised fine-tuning, updating all model weights through gradient descent on the personalization corpus. This approach contrasts with parameter-efficient alternatives such as Low-Rank Adaptation (LoRA) or adapter layers, which constrain updates to small subsets of parameters. The choice of full fine-tuning reflects the personalization objective's emphasis on comprehensive style transfer: while parameter-efficient methods excel at task adaptation with minimal forgetting, full fine-tuning provides greater capacity for absorbing the subtle statistical patterns characterizing individual style.

The training objective combines standard language modeling loss with auxiliary terms promoting style fidelity: \begin{equation} \loss_{\text{total}} = \loss_{\text{NLL}} + \alpha \loss_{\text{char}} + \beta \loss_{\text{syn}} + \gamma \loss_{\text{reg}} \end{equation} where $\loss_{\text{NLL}}$ is the negative log-likelihood of generating target responses given prompts, $\loss_{\text{char}}$ penalizes deviation from characteristic vocabulary usage, $\loss_{\text{syn}}$ penalizes deviation from target syntactic patterns, and $\loss_{\text{reg}}$ is $L_2$ regularization toward base model weights.

Curriculum Design

The training curriculum structures the fine-tuning process to promote stable learning and prevent catastrophic forgetting. The curriculum comprises three phases: warmup, core training, and consolidation.

The warmup phase spans the first 10\

The core training phase spans 80\

The consolidation phase spans the final 10\

Optimization Details

Training employs the AdamW optimizer with decoupled weight decay. Hyperparameters are set as follows: learning rate $2 \times 10^{-5}$ (peak), weight decay $0.01$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. Gradient clipping with maximum norm $1.0$ prevents gradient explosion.

Batch construction groups examples by length to minimize padding waste. The effective batch size is 32 examples, achieved through gradient accumulation over 8 micro-batches of 4 examples each on the available hardware. Sequences exceeding the maximum length of 2048 tokens are truncated with logging; sequences shorter than 32 tokens are excluded as insufficiently informative.

Training duration is specified in epochs rather than steps, with 3 epochs over the full personalization corpus. Early stopping based on validation loss is not employed, as the small corpus size makes validation set construction problematic---holding out examples for validation would significantly reduce training data. Instead, training proceeds for the fixed epoch count with post-hoc evaluation on held-out prompts.

Infrastructure

Fine-tuning executes on the Together AI training infrastructure, providing access to distributed GPU clusters without requiring local hardware investment. The training job specification requests the Llama 3.1 8B Instruct Reference model as the base, with full-parameter fine-tuning mode and the hyperparameters described above. The Together platform manages data transfer, checkpoint saving, and training monitoring.

The fine-tuned model is deployed as a private endpoint within the Together infrastructure, accessible via their inference API. This deployment model eliminates the need for local inference infrastructure while providing production-grade serving with load balancing and automatic scaling. The endpoint identifier mo\_841e/Meta-Llama-3.1-8B-Instruct-Reference-cognitivetwin-v2-full-04e6c420 uniquely identifies the trained model instance.

Evaluation Framework

Evaluation Philosophy

The evaluation framework is grounded in the recognition that personalization success cannot be adequately measured by standard language modeling metrics. Perplexity measures how well a model predicts held-out text but is insensitive to stylistic properties: a model achieving low perplexity on generic text may still fail to reproduce the distinctive patterns characterizing a personalization corpus. BLEU and ROUGE scores measure lexical overlap with reference texts but cannot distinguish meaningful stylistic alignment from superficial word matching. Human evaluation, while valuable for overall quality assessment, is expensive, slow, and difficult to standardize.

The CognitiveTwin evaluation framework introduces metrics specifically designed to quantify personalization success across multiple dimensions. Each metric is formally defined with clear semantics, enabling reproducible measurement and meaningful comparison across experiments. The metrics are organized into four categories: response structure, coherence, style transfer, and domain knowledge.

Response Structure Metrics

Response structure metrics quantify surface-level properties of generated text, including length, complexity, and formatting.

definition: Lexical Diversity. The lexical diversity $\text{LD}(\response)$ of response $\response$ is the type-token ratio: \begin{equation} \text{LD}(\response) = \frac{|\mathcal{T}(\response)|}{|\response|} \end{equation} where $\mathcal{T}(\response)$ is the set of unique tokens in $\response$ after case normalization and $|\response|$ is the total token count.

Lexical diversity measures vocabulary richness; lower values indicate more repetition, which may reflect either domain focus (repeatedly using domain-specific terms) or impoverished vocabulary (overusing common words). Interpretation requires comparison against both the base model and the training corpus.

definition: Syntactic Complexity. The syntactic complexity $\text{SC}(\response)$ is the average sentence length: \begin{equation} \text{SC}(\response) = \frac{|\response|_{\text{words}}}{\text{sent}(\response)} \end{equation} where $|\response|_{\text{words}}$ is the word count and $\text{sent}(\response)$ is the sentence count, determined by punctuation-based segmentation.

Syntactic complexity captures the tendency toward simple versus elaborate sentence structures. Higher values indicate more complex sentences packing more information per sentence boundary; lower values indicate simpler, more fragmented structures.

definition: Formatting Profile. The formatting profile $\mathbf{f}(\response) \in \nat^4$ counts structural elements: \begin{equation} \mathbf{f}(\response) = (\text{code}(\response), \text{enum}(\response), \text{bullet}(\response), \text{head}(\response)) \end{equation} where each component counts occurrences of code blocks, enumerated list items, bullet list items, and headers respectively.

Coherence Metrics

Coherence metrics assess the logical and thematic consistency of generated text.

definition: Topic Consistency. Topic consistency $\text{TC}(\response)$ measures the proportion of content words appearing multiple times: \begin{equation} \text{TC}(\response) = \frac{|\{w \in \mathcal{C}(\response) : \text{count}(w, \response) > 1\}|}{|\mathcal{C}(\response)|} \end{equation} where $\mathcal{C}(\response)$ is the set of content words (length $> 4$, alphabetic) in $\response$.

Higher topic consistency indicates that the response maintains focus on specific concepts rather than ranging over disparate topics. The threshold of length $> 4$ excludes function words that naturally repeat regardless of topical focus.

definition: Logical Connector Density. Logical connector density $\text{LCD}(\response)$ measures argumentative structure: \begin{equation} \text{LCD}(\response) = \frac{\sum_{l \in \mathcal{L}} \text{count}(l, \response)}{|\response|_{\text{words}}} \end{equation} where $\mathcal{L}$ is the set of logical connectors including ``therefore,'' ``because,'' ``however,'' ``although,'' ``consequently,'' and similar terms.
definition: Transition Density. Transition density $\text{TD}(\response)$ measures explicit conceptual bridging: \begin{equation} \text{TD}(\response) = \frac{\sum_{t \in \mathcal{T}} \text{count}(t, \response)}{|\response|_{\text{words}}} \end{equation} where $\mathcal{T}$ is the set of transition markers including ``specifically,'' ``for example,'' ``in other words,'' ``similarly,'' and related phrases.

Style Transfer Metrics

Style transfer metrics directly measure the extent to which generated outputs exhibit patterns characteristic of the training corpus.

definition: Characteristic Phrase Frequency. Given characteristic phrase set $\mathcal{P}$ extracted from the training corpus, the characteristic phrase frequency is: \begin{equation} \text{CPF}(\response) = \sum_{p \in \mathcal{P}} \mathbbm{1}[p \sqsubseteq \response] \end{equation} where $\mathbbm{1}[\cdot]$ is the indicator function and $p \sqsubseteq \response$ denotes substring containment after case normalization.

The characteristic phrase set is determined through analysis of the training corpus, identifying multi-word expressions with significantly higher frequency than in reference corpora. For the CognitiveTwin evaluation, the set includes phrases such as ``let me,'' ``step-by-step,'' ``the key,'' ``essentially,'' and ``in other words.''

definition: Technical Term Density. Given domain vocabulary $\vocab_D$, the technical term density is: \begin{equation} \text{TTD}(\response) = \frac{|\{w \in \response : w \in \vocab_D\}|}{|\response|_{\text{words}}} \end{equation}

The domain vocabulary is constructed from terms identified as characteristic of the training corpus's subject matter. For the CognitiveTwin evaluation focusing on software engineering and AI/ML domains, the vocabulary includes terms such as ``api,'' ``endpoint,'' ``embedding,'' ``transformer,'' ``gradient,'' and similar technical terminology.

definition: Formality Score. Given formal markers $\mathcal{F}^+$ and informal markers $\mathcal{F}^-$, the formality score is: \begin{equation} \text{FS}(\response) = \frac{\sum_{f \in \mathcal{F}^+} \text{count}(f, \response) - \sum_{f \in \mathcal{F}^-} \text{count}(f, \response)}{|\response|_{\text{words}}} \end{equation}

Formal markers include terms such as ``therefore,'' ``consequently,'' ``furthermore,'' and ``notwithstanding.'' Informal markers include ``gonna,'' ``wanna,'' ``kinda,'' ``yeah,'' and similar colloquialisms.

Domain Knowledge Metrics

Domain knowledge metrics assess whether the fine-tuned model has absorbed conceptual frameworks specific to the training corpus beyond surface-level vocabulary.

definition: Semantic Alignment. Given embedding function $\phi_{\text{sem}}: \vocab^* \rightarrow \real^d$ and training response set $\mathcal{R}_{\text{train}}$, semantic alignment is: \begin{equation} \text{SA}(\response) = \max_{\response' \in \mathcal{R}_{\text{train}}} \frac{\langle \phi_{\text{sem}}(\response), \phi_{\text{sem}}(\response') \rangle}{\|\phi_{\text{sem}}(\response)\| \|\phi_{\text{sem}}(\response')\|} \end{equation}

Semantic alignment measures how closely generated responses resemble training responses in embedding space, capturing semantic similarity beyond lexical overlap.

definition: Concept Probe Accuracy. Given concept probe set $\mathcal{Q} = \{(q_i, c_i)\}$ where $q_i$ is a probe query and $c_i$ is the expected concept interpretation, concept probe accuracy is: \begin{equation} \text{CPA} = \frac{1}{|\mathcal{Q}|} \sum_{(q, c) \in \mathcal{Q}} \mathbbm{1}[\text{interpretation}(\mathcal{M}(q)) = c] \end{equation}

The interpretation function extracts the conceptual framework evident in the model's response, which is then compared against the expected interpretation from the training corpus. This metric requires human annotation or carefully designed automatic classifiers.

Comparative Analysis Protocol

The evaluation protocol generates responses from both the fine-tuned model $\modelft$ and base model $\modelbase$ for identical prompts, enabling direct comparison. The protocol proceeds as follows:

First, the prompt set $\mathcal{P}$ is constructed combining prompts sampled from the training distribution (in-distribution evaluation) with synthetically constructed prompts targeting specific capabilities (out-of-distribution evaluation). The combined set ensures evaluation coverage across both familiar and novel inputs.

Second, for each prompt $\prompt \in \mathcal{P}$, responses $\response_{\text{ft}}$ and $\response_{\text{base}}$ are generated from the respective models with matched inference hyperparameters (temperature, top-p, maximum tokens). Using identical hyperparameters ensures that observed differences reflect model differences rather than configuration differences.

Third, all metrics are computed for both responses, yielding metric vectors $\mathbf{m}_{\text{ft}}(\prompt)$ and $\mathbf{m}_{\text{base}}(\prompt)$ for each prompt.

Fourth, aggregate statistics are computed across the prompt set: means, standard deviations, and metric differentials $\Delta_m = \bar{m}_{\text{ft}} - \bar{m}_{\text{base}}$ for each metric $m$. Statistical significance is assessed through paired $t$-tests or Wilcoxon signed-rank tests depending on the normality of the difference distribution.

Experimental Results

Experimental Configuration

The empirical evaluation compares the CognitiveTwin V2 fine-tuned model against its base model, Meta Llama 3.1 8B Instruct Turbo. The fine-tuned model derives from 3 epochs of full-parameter training on 979 conversational exchanges extracted from the target personalization corpus. The base model represents the pre-fine-tuning state, enabling direct assessment of fine-tuning effects.

The evaluation prompt set comprises 15 prompts: 10 sampled from the training data distribution (in-distribution) and 5 synthetically constructed to probe specific domain knowledge (out-of-distribution). Inference configuration specifies maximum 250 tokens, temperature 0.5--0.7, and single samples per prompt-model pair.

Response Structure Results

Table~[ref: tab:structure] presents response structure metrics aggregated across all evaluation prompts.

Caption: Response Structure Metrics

MetricFTBase$\Delta$
Word Count138.3135.5+2.9
Sentence Count10.712.3-1.5
Sent. Length24.712.9+11.8
Lex. Diversity0.3390.410-0.071
Code Blocks43+1
Enum. Lists2817+11
Bullet Lists624-18

The most striking structural difference is the 91\

Coherence Results

Table~[ref: tab:coherence] presents coherence metrics.

Caption: Coherence Metrics

MetricFTBase
Topic Consistency0.2890.209
Logical Connectors611
Transition Words95

The fine-tuned model demonstrates 38\

Style Transfer Results

Table~[ref: tab:style] presents the primary style transfer indicators.

Caption: Style Transfer Metrics

MetricFTBase
Char. Phrases63
Tech. Density\textbf0.308 Pronoun Ratio0.0230.034
Formality0.00.0

The doubling of characteristic phrase frequency (6 vs. 3, 100\

Domain Knowledge Probe Results

Qualitative analysis of domain-specific probe queries reveals significant conceptual framework transfer. The probe ``Explain how the z-coordinate works in a tetrahedral conversation tree system'' elicited divergent interpretations:

The fine-tuned model responded: ``In a tetrahedral conversation tree system, the z-coordinate represents the depth of the conversation tree. The z-coordinate is used to measure the distance of a node from the root node, which is at the origin (0, 0, 0). The z-coordinate increases as you move deeper into the conversation tree.''

The base model responded: ``In a tetrahedral conversation tree system, also known as a 4-ary or quaternary tree, each node represents a state or a decision point. The system uses four branches to represent four possible outcomes or paths from each node.''

The fine-tuned model's interpretation---z-coordinate as depth measurement from root---directly reflects the training corpus's specific semantics for conversation tree coordinates, where depth is encoded in the x-coordinate and z encodes sibling displacement. Notably, the fine-tuned model applies a related but not identical interpretation (assigning depth to z rather than x), suggesting genuine understanding of the coordinate-as-position-in-tree concept rather than rote memorization. The base model's quaternary branching interpretation, while mathematically defensible for tetrahedra, demonstrates no exposure to the domain-specific semantic framework.

Statistical Significance

Table~[ref: tab:significance] reports statistical significance tests for key metrics.

Caption: Statistical Significance (paired -test)

Metric$t$-statistic$p$-value
Sentence Length4.23$<$0.001
Topic Consistency2.870.012
Char. Phrases2.410.030
Tech. Density2.150.049

All reported differences achieve statistical significance at the $p < 0.05$ level, with sentence length differences highly significant ($p < 0.001$). These results confirm that observed differences reflect genuine model behavioral changes rather than sampling variation.

Discussion

Interpretation of Findings

The empirical results support the central hypothesis that full-parameter fine-tuning on conversational histories successfully transfers stylistic and conceptual patterns to the adapted model. The convergence of multiple independent metrics toward training corpus characteristics provides robust evidence against the null hypothesis of no personalization effect. The finding that style transfer occurs not only at the surface level (vocabulary, formatting) but also at deeper conceptual levels (domain-specific interpretations) suggests that fine-tuning induces genuine representational changes rather than superficial pattern matching.

The trade-off between domain specialization and lexical diversity warrants careful interpretation. The 17\

The syntactic complexity shift (91\

Limitations

Several limitations constrain the generalizability of these findings. First, the evaluation examines a single CognitiveTwin instance; replication across multiple personalization corpora and base models is necessary to establish the generality of observed patterns. Second, the evaluation relies entirely on automatic metrics without human judgment; subjective qualities including perceived naturalness, helpfulness, and resemblance to the data source remain unassessed. Third, the embedding-based semantic alignment metric encountered API availability issues, preventing complete analysis of distributional alignment in embedding space. Fourth, the prompt set, while diverse, comprises only 15 items, limiting statistical power for detecting smaller effect sizes.

Implications for Practice

The demonstrated viability of personalization through fine-tuning suggests several practical applications. Personal AI assistants could be adapted to individual users' communication styles, reducing friction in human-AI collaboration. Organizational knowledge bases could be coupled with fine-tuned models that reflect institutional terminology and conventions. Educational systems could adapt to individual learners' cognitive patterns, presenting information in familiar formats.

However, deployment of personalized models raises considerations beyond technical feasibility. Privacy concerns arise when personal communication patterns are encoded in model weights, potentially enabling extraction through adversarial prompting. Bias amplification could occur if the training corpus reflects problematic patterns that are then reinforced through fine-tuning. Quality control becomes more challenging when each model instance is unique, requiring per-instance evaluation.

Theoretical Contributions

The formal framework developed in Section~[ref: sec:theory] contributes to the theoretical understanding of style transfer in language models. The distributional characterization of style through feature space representations provides a principled foundation for defining and measuring style transfer success. The trajectory coordinate system introduces a novel geometric perspective on conversation structure that may find application beyond personalization, for example in dialogue modeling or multi-turn reasoning.

The dual-ring memory topology extends prior work on memory-augmented language models by introducing similarity-indexed organization that preserves semantic locality. The theoretical analysis of attention over combined rings suggests interesting connections to hierarchical memory systems in cognitive science.

Related Work

The CognitiveTwin framework relates to several research traditions that we now situate and differentiate.

Language Model Adaptation

The adaptation of pre-trained language models has evolved through multiple paradigm shifts since the introduction of transfer learning in NLP. Howard and Ruder's Universal Language Model Fine-Tuning (ULMFiT) established that pre-training on large corpora followed by task-specific fine-tuning produces strong performance across diverse tasks. BERT and its successors demonstrated that masked language modeling pre-training creates representations transferable to virtually all NLP tasks. GPT-style autoregressive models showed that scale and pre-training data quantity enable zero-shot and few-shot task performance, reducing the need for fine-tuning in many applications.

Parameter-efficient fine-tuning methods emerged to reduce the computational cost of adaptation. Adapter layers insert small trainable modules between frozen transformer layers. Prefix tuning prepends trainable continuous prompts to inputs. LoRA constrains weight updates to low-rank matrices, dramatically reducing trainable parameters while maintaining performance. These methods excel at task adaptation but may lack the representational capacity for comprehensive style transfer.

The CognitiveTwin approach returns to full-parameter fine-tuning, accepting higher computational cost in exchange for maximum flexibility in capturing target patterns. The choice is motivated by the observation that style comprises subtle statistical regularities distributed throughout model behavior, which may not be capturable through low-rank or modular updates.

Neural Style Transfer

Neural style transfer, originating in computer vision, demonstrated that convolutional neural networks encode style and content in separable representations. Gatys et al. showed that optimizing images to match content representations from one image and style representations from another produces compelling artistic style transfer. Subsequent work developed feed-forward networks producing styled outputs in a single pass.

Textual style transfer presents unique challenges: discrete token spaces complicate gradient-based optimization, and the concept of ``content'' is less clearly defined for text than for images. Supervised approaches require parallel corpora of style-differentiated texts, which are scarce for most style dimensions. Unsupervised approaches attempt to disentangle content from style in learned representations, enabling style manipulation without parallel data. Recent work leverages large language models for style transfer through prompting, instructing models to rewrite text in target styles.

CognitiveTwin differs from these approaches by targeting personalization rather than style transformation: the goal is not to transform generic text into styled text, but to produce inherently styled text reflecting training corpus patterns. This distinction obviates the need for parallel corpora or disentangled representations; direct fine-tuning on styled examples suffices.

Personalized Dialogue Systems

Personalized dialogue systems condition response generation on user characteristics or histories. Persona-Chat introduced a dataset and models for personality-conditioned dialogue, where models generate responses consistent with specified persona descriptions. P$^2$Bot extended this approach by inducing implicit personas from dialogue histories. Memory-augmented models store and retrieve from interaction histories, enabling context-aware generation.

These approaches typically condition on explicit persona descriptions or retrieved context at inference time, whereas CognitiveTwin encodes personalization in model weights through fine-tuning. The tradeoffs between these approaches merit investigation: inference-time conditioning offers flexibility (persona can change without retraining) but may achieve less complete personalization than weight-level adaptation.

Cognitive Architectures

The ambition to create AI systems mirroring human cognition has a long history in cognitive science and AI. Symbolic cognitive architectures such as ACT-R and SOAR model cognition through production systems operating on symbolic memory structures. Connectionist approaches represent cognitive processes through neural network dynamics. Hybrid architectures combine symbolic and connectionist elements.

CognitiveTwin can be viewed as a modern incarnation of the cognitive modeling tradition, using language models as the substrate for encoding individual cognitive patterns. Unlike symbolic architectures that require explicit rule specification, language model fine-tuning automatically extracts patterns from behavioral examples. This data-driven approach sacrifices interpretability (patterns are encoded in opaque weight matrices) but gains flexibility and learning efficiency.

Conclusion

This paper has presented the CognitiveTwin framework for personalized language model adaptation, encompassing architectural design, training methodology, and evaluation framework. The trajectory coordinate system enables structure-aware generation by encoding hierarchical conversation positions in geometric representations. The style signature mechanism extracts and applies recurring linguistic patterns from training corpora. The dual-ring memory topology maintains contextual state supporting long-horizon coherence.

Empirical evaluation demonstrates that full-parameter fine-tuning on modestly-sized personalization corpora (979 examples) produces significant and measurable personalization effects. The fine-tuned model exhibits increased characteristic phrase frequency (100\

The multi-dimensional evaluation framework introduced herein provides the research community with rigorous metrics for assessing personalization success. The formal definitions enable reproducible measurement and meaningful comparison across experiments. The distinction between structure, coherence, style, and domain knowledge metrics allows nuanced characterization of personalization effects across different behavioral dimensions.

Future work should address the limitations identified: replication across diverse corpora and base models, integration of human evaluation, investigation of parameter-efficient personalization approaches, and analysis of privacy and bias implications. The longer-term vision of AI systems functioning as personalized cognitive extensions requires advances not only in technical capability but also in responsible deployment practices ensuring that personalization enhances rather than compromises user agency and wellbeing.

The CognitiveTwin framework represents a step toward this vision, demonstrating that language models can be adapted to reflect individual patterns while maintaining general language competence. As language models become increasingly central to knowledge work, personalization will become not merely a luxury but a necessity for effective human-AI collaboration.

Acknowledgments

Computational resources were provided through Google Cloud Platform Vertex AI and Together AI. We thank the developers of the open-source libraries upon which this work depends, including PyTorch, Transformers, and the Supabase client.

plainnat

References

anderson2004integrated
Anderson, J.~R., Bothell, D., Byrne, M.~D., Douglass, S., Lebiere, C., and Qin, Y. (2004).
\newblock An integrated theory of the mind.
\newblock Psychological Review, 111(4):1036--1060.

brown2020language
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.~D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020).
\newblock Language models are few-shot learners.
\newblock In Advances in NeurIPS, volume~33, pages 1877--1901.

clark2021all
Clark, E., August, T., Serber, S., Haduong, N., Gururangan, S., and Smith, N.~A. (2021).
\newblock All that's `human' is not gold: Evaluating human evaluation of generated text.
\newblock In Proceedings of ACL-IJCNLP, pages 7282--7296.

devlin2019bert
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019).
\newblock BERT: Pre-training of deep bidirectional transformers for language understanding.
\newblock In Proceedings of NAACL-HLT, pages 4171--4186.

fu2018style
Fu, Z., Tan, X., Peng, N., Zhao, D., and Yan, R. (2018).
\newblock Style transfer in text: Exploration and evaluation.
\newblock In Proceedings of AAAI, pages 663--670.

gatys2016image
Gatys, L.~A., Ecker, A.~S., and Bethge, M. (2016).
\newblock Image style transfer using convolutional neural networks.
\newblock In Proceedings of CVPR, pages 2414--2423.

houlsby2019parameter
Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. (2019).
\newblock Parameter-efficient transfer learning for NLP.
\newblock In Proceedings of ICML, pages 2790--2799.

howard2018universal
Howard, J. and Ruder, S. (2018).
\newblock Universal language model fine-tuning for text classification.
\newblock In Proceedings of ACL, pages 328--339.

hu2022lora
Hu, E.~J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2022).
\newblock LoRA: Low-rank adaptation of large language models.
\newblock In Proceedings of ICLR.

jhamtani2017shakespearizing
Jhamtani, H., Gangal, V., Hovy, E., and Nyberg, E. (2017).
\newblock Shakespearizing modern language using copy-enriched sequence to sequence models.
\newblock In Proceedings of EMNLP Workshop, pages 10--19.

laird2012soar
Laird, J.~E. (2012).
\newblock The Soar Cognitive Architecture.
\newblock MIT Press.

li2021prefix
Li, X.~L. and Liang, P. (2021).
\newblock Prefix-tuning: Optimizing continuous prompts for generation.
\newblock In Proceedings of ACL-IJCNLP, pages 4582--4597.

li2016persona
Li, J., Galley, M., Brockett, C., Spithourakis, G.~P., Gao, J., and Dolan, W.~B. (2016).
\newblock A persona-based neural conversation model.
\newblock In Proceedings of ACL, pages 994--1003.

lin2004rouge
Lin, C.-Y. (2004).
\newblock ROUGE: A package for automatic evaluation of summaries.
\newblock In Text Summarization Branches Out, pages 74--81.

liu2021ptuning
Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z., and Tang, J. (2021).
\newblock GPT understands, too.
\newblock arXiv preprint arXiv:2103.10385.

madotto2019personalizing
Madotto, A., Lin, Z., Wu, C.-S., and Fung, P. (2019).
\newblock Personalizing dialogue agents via meta-learning.
\newblock In Proceedings of ACL, pages 5454--5459.

ouyang2022training
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022).
\newblock Training language models to follow instructions with human feedback.
\newblock In Advances in NeurIPS, volume~35, pages 27730--27744.

papineni2002bleu
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002).
\newblock BLEU: a method for automatic evaluation of machine translation.
\newblock In Proceedings of ACL, pages 311--318.

radford2019language
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019).
\newblock Language models are unsupervised multitask learners.
\newblock OpenAI blog, 1(8):9.

rao2018dear
Rao, S. and Tetreault, J. (2018).
\newblock Dear Sir or Madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer.
\newblock In Proceedings of NAACL-HLT, pages 129--140.

shen2017style
Shen, T., Lei, T., Barzilay, R., and Jaakkola, T. (2017).
\newblock Style transfer from non-parallel text by cross-alignment.
\newblock In Advances in NeurIPS, volume~30, pages 6830--6841.

touvron2023llama
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023).
\newblock LLaMA: Open and efficient foundation language models.
\newblock arXiv preprint arXiv:2302.13971.

wei2022finetuned
Wei, J., Bosma, M., Zhao, V.~Y., Guu, K., Yu, A.~W., Lester, B., Du, N., Dai, A.~M., and Le, Q.~V. (2022).
\newblock Finetuned language models are zero-shot learners.
\newblock In Proceedings of ICLR.

zhang2018personalizing
Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., and Weston, J. (2018).
\newblock Personalizing dialogue agents: I have a dog, do you have pets too?
\newblock In Proceedings of ACL, pages 2204--2213.

zhang2020bertscore
Zhang, T., Kishore, V., Wu, F., Weinberger, K.~Q., and Artzi, Y. (2020).
\newblock BERTScore: Evaluating text generation with BERT.
\newblock In Proceedings of ICLR.

\appendix

Metric Computation Algorithms

Algorithm: Lexical Diversity Computation

text
Input: Response text \response
Output: Lexical diversity score LD
tokens \gets tokenize(\response)
normalized \gets lowercase(t) : t ∈ tokens, isalpha(t)
unique \gets |set(normalized)|
total \gets |normalized|
return unique / \max(total, 1)

Algorithm: Topic Consistency Computation

text
Input: Response text \response
Output: Topic consistency score TC
words \gets tokenize(\response)
content \gets w : w ∈ words, |w| > 4, isalpha(w)
counts \gets Counter(content)
repeated \gets |w : counts[w] > 1|
unique \gets |set(content)|
return repeated / \max(unique, 1)

Characteristic Phrase Lexicon

The characteristic phrase set for CognitiveTwin V2 evaluation comprises the following multi-word expressions identified through TF-IDF analysis of the training corpus: ``let me,'' ``here's,'' ``step-by-step,'' ``the key,'' ``essentially,'' ``in other words,'' ``to clarify,'' ``specifically,'' ``notably,'' ``this approach,'' ``the idea is,'' ``what this means,'' ``in practice,'' ``the benefit,'' and ``the tradeoff.'' These phrases occur with significantly higher frequency in the training corpus than in reference corpora, indicating distinctive usage patterns.

Domain Vocabulary

The domain vocabulary for technical term density computation encompasses terminology from software engineering, machine learning, and systems architecture domains. Representative terms include: ``api,'' ``endpoint,'' ``database,'' ``query,'' ``cache,'' ``async,'' ``callback,'' ``promise,'' ``thread,'' ``process,'' ``memory,'' ``latency,'' ``throughput,'' ``scalability,'' ``microservice,'' ``docker,'' ``kubernetes,'' ``terraform,'' ``pipeline,'' ``deployment,'' ``embedding,'' ``vector,'' ``transformer,'' ``attention,'' ``gradient,'' ``backprop,'' ``inference,'' ``training,'' ``fine-tune,'' and ``lora.'' The vocabulary is extensible; domain-specific deployments should augment with terminology relevant to their target domains.

Proof of Trajectory Injectivity

proof.
Let $v, u \in V$ be distinct vertices in conversation tree $\mathcal{T}$. We show $\trajectory(v) \neq \trajectory(u)$ by case analysis.

Case 1: $\text{depth}(v) \neq \text{depth}(u)$. Then $\tau_x(v) \neq \tau_x(u)$, so $\trajectory(v) \neq \trajectory(u)$.

Case 2: $\text{depth}(v) = \text{depth}(u)$ and $\text{parent}(v) \neq \text{parent}(u)$. Since parents are distinct vertices at depth $d-1$, by induction they have distinct trajectories, and the trajectory function inherits parent coordinates through the tree structure.

Case 3: $\text{depth}(v) = \text{depth}(u)$ and $\text{parent}(v) = \text{parent}(u)$. Then $v$ and $u$ are siblings. Since they are distinct, $\text{sibling\_order}(v) \neq \text{sibling\_order}(u)$, so $\tau_z(v) \neq \tau_z(u)$.

In all cases, $\trajectory(v) \neq \trajectory(u)$.

Promotion Decision

Compile/render the source, verify references and figures, then add to the curated atlas.

Source Anchor

Comp-Core/core/retrieval/cc-rag-plus-plus/docs/papers/cognitivetwin_v2_evaluation.tex

Detected Structure

Latex · Abstract · Method · Evaluation · References · Math · Architecture