AutoResearchClaw Research Extract

**Research Date:** 2026-03-18 **Target Repo:** https://github.com/aiming-lab/AutoResearchClaw **Focus:** Novel patterns worth stealing for CLAW mesh integration

---

Executive Summary

AutoResearchClaw is a 23-stage autonomous research pipeline that transforms a single idea into conference-ready papers. Three patterns stand out as immediately valuable:

1. 4-layer citation verification — prevents LLM hallucination in references (arXiv → DOI → OpenAlex → Semantic Scholar fallback)
2. Cross-run lesson extraction — JSONL append-only failure log → LLM skill generation → prompt overlay injection (25
3. PIVOT/REFINE/PROCEED autonomy — autonomous decision loops with versioned rollbacks and pivot limits

The rest is solid engineering but not novel: multi-source literature search (OpenAlex/S2/arXiv), sandbox execution with AST validation, hardware detection, quality gates with template ratio heuristics.

---

1. Literature Search — Multi-Source Aggregation

APIs & Endpoints

| Source | Endpoint | Rate Limit | Query Method |
|---|---|---|---|
| OpenAlex | `https://api.openalex.org/works` | 10K/day | `search` param + `filter` for dates, `select` for fields, `mailto` for polite pool |
| Semantic Scholar | `https://api.semanticscholar.org/graph/v1/paper/search` | 1 req/s (free), 10 req/s (paid) | `query` param + `year` filter + `fields` selection |
| arXiv | `https://export.arxiv.org/api/query` | 1/3s | `search_query=all:{query}` + Atom XML parsing |

Query Expansion

NONE. They pass a single free-text query to all three sources. No synonym expansion, no boolean operators, no embedding-based expansion. Just a raw string.

Deduplication

Three-tier matching hierarchy:
1. DOI — normalized (lowercase, strip punctuation, collapse whitespace)
2. arXiv ID — exact match after whitespace strip
3. Fuzzy title — Jaccard-like word overlap (set intersection / max length)

Winner selection: entry with higher `citation_count` wins (prefers richer metadata).
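
The three-tier match plus the citation-count tiebreak can be sketched as below. This is a minimal illustration, not the repo's code: the `Paper` dataclass, the `deduplicate` name, the simplified DOI normalization (lowercase and strip only) and the 0.9 title threshold are all assumptions, since the extract does not state the actual cutoff.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Paper:  # hypothetical record type for illustration
    title: str
    doi: str | None = None
    arxiv_id: str | None = None
    citation_count: int = 0


def _words(title: str) -> set[str]:
    # Lowercase, strip punctuation, split on whitespace
    return set(re.sub(r"[^\w\s]", "", title.lower()).split())


def _title_overlap(a: str, b: str) -> float:
    wa, wb = _words(a), _words(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / max(len(wa), len(wb))


def _same_paper(a: Paper, b: Paper, title_threshold: float) -> bool:
    # Tier 1: DOI, Tier 2: arXiv ID, Tier 3: fuzzy title
    if a.doi and b.doi and a.doi.strip().lower() == b.doi.strip().lower():
        return True
    if a.arxiv_id and b.arxiv_id and a.arxiv_id.strip() == b.arxiv_id.strip():
        return True
    return _title_overlap(a.title, b.title) >= title_threshold


def deduplicate(papers: list[Paper], title_threshold: float = 0.9) -> list[Paper]:
    kept: list[Paper] = []
    for paper in papers:
        for i, existing in enumerate(kept):
            if _same_paper(paper, existing, title_threshold):
                # Winner selection: the entry with more citations replaces the other
                if paper.citation_count > existing.citation_count:
                    kept[i] = paper
                break
        else:
            kept.append(paper)
    return kept
```

The pairwise scan is quadratic, which is fine for the few hundred results a single query returns; an index keyed on DOI and arXiv ID would be the obvious upgrade for larger batches.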

Rate Limiting

Sequential execution with delays:
- OpenAlex first (0.5s delay after)
- Semantic Scholar second (1.0s delay after)
- arXiv last (3.1s delay, respects 1/3s minimum)

Circuit breaker pattern (Semantic Scholar + arXiv only):
- CLOSED → normal operation
- OPEN → skip requests, auto-recover after cooldown (120-600s exponential backoff)
- HALF_OPEN → test single probe request
- Trips after 3 consecutive 429s

Graceful fallback: If one source rate-limits, others compensate via cached responses.
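
The breaker states above map onto a small state machine. The sketch below is an assumption-laden illustration (the `CircuitBreaker` class, its method names and the doubling schedule are ours, not the repo's); it uses the numbers from the list: trip after 3 consecutive 429s, cooldown starting at 120s and capped at 600s.

```python
import time


class CircuitBreaker:
    """Hypothetical breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, trip_after=3, base_cooldown=120.0, max_cooldown=600.0,
                 clock=time.monotonic):
        self.trip_after = trip_after
        self.base_cooldown = base_cooldown
        self.max_cooldown = max_cooldown
        self.clock = clock  # injectable so the cooldown can be tested without sleeping
        self.state = "CLOSED"
        self.consecutive_429s = 0
        self.trips = 0
        self.opened_at = 0.0

    def _cooldown(self) -> float:
        # Exponential backoff: 120s, 240s, 480s, then capped at 600s
        return min(self.base_cooldown * 2 ** (self.trips - 1), self.max_cooldown)

    def allow_request(self) -> bool:
        if self.state == "OPEN" and self.clock() - self.opened_at >= self._cooldown():
            self.state = "HALF_OPEN"  # cooldown elapsed: let a probe through
        return self.state != "OPEN"

    def record_success(self) -> None:
        self.state = "CLOSED"
        self.consecutive_429s = 0
        self.trips = 0

    def record_rate_limited(self) -> None:
        self.consecutive_429s += 1
        # A failed probe re-opens immediately; otherwise wait for the trip threshold
        if self.state == "HALF_OPEN" or self.consecutive_429s >= self.trip_after:
            self.state = "OPEN"
            self.trips += 1
            self.opened_at = self.clock()
            self.consecutive_429s = 0
```

One simplification to note: this version does not limit HALF_OPEN to exactly one in-flight probe, which only matters if requests are issued concurrently.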

What's Novel

Nothing. Standard REST API calls with exponential backoff. The deduplication is solid but not groundbreaking. The circuit breaker is a standard resiliency pattern.

Integration Path

We already have RAG++ for retrieval. Add this as a Prefect flow that:
- Queries all three sources in parallel (Prefect task per source)
- Deduplicates via DOI/arXiv/title matching
- Stores in Supabase `research_papers` table with `citation_count`, `year`, `abstract`, `doi`, `arxiv_id`
- Exposes results via `/api/literature/search?q={query}&year_min={year}` endpoint

Effort: 1-2 days. Value: Medium (nice-to-have for research tasks, not critical).

---

2. Citation Verification — 4-Layer System

Architecture

Three-layer sequential fallback (despite "4-layer" marketing):

#### Layer 2: DOI Resolution (First Check)
- Endpoint: CrossRef `/works/{doi}`, fallback to DataCite for arXiv DOIs (10.48550/, 10.5281/)
- Validation: Extract title from response, compute Jaccard similarity
- Scoring:
  - `similarity ≥ 0.80` → VERIFIED (confidence = similarity score)
  - `0.50 ≤ similarity < 0.80` → SUSPICIOUS
  - `similarity < 0.50` → SUSPICIOUS (DOI exists, metadata diverges)
  - HTTP 404 → HALLUCINATED

#### Layer 3a: OpenAlex Title Search (DOI Fails)
- Endpoint: `https://api.openalex.org/works?filter=title.search:{query}`
- Returns: Top 5 results
- Validation: Jaccard word-overlap on titles
- Scoring: Same 0.80/0.50 thresholds as DOI layer

#### Layer 1: arXiv ID Lookup (Last Resort for Preprints)
- Endpoint: `https://export.arxiv.org/api/query?id_list={arxiv_id}`
- Validation: Check for error entries (ID contains `api/errors`)
- Scoring: Same 0.80/0.50 thresholds

#### Layer 3b: Semantic Scholar Fallback (Ultimate Fallback)
- Endpoint: `/graph/v1/paper/search`
- Same logic as OpenAlex

Title Similarity Metric

python
# Jaccard-like, but with a max(len) denominator so a short title cannot inflate the score
similarity = len(word_set_a & word_set_b) / max(len(word_set_a), len(word_set_b))

Preprocessing: lowercase, strip punctuation, remove empty tokens.
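
Put together, the metric and the DOI-layer thresholds might look like the sketch below. The function names are ours; only the formula, the preprocessing steps and the 0.80 cutoff come from the notes above.

```python
import re


def title_similarity(a: str, b: str) -> float:
    """Word overlap divided by the size of the larger word set."""
    def words(title: str) -> set[str]:
        # Lowercase, strip punctuation; split() drops empty tokens
        return set(re.sub(r"[^\w\s]", "", title.lower()).split())

    wa, wb = words(a), words(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / max(len(wa), len(wb))


def classify_doi_match(similarity: float) -> str:
    # DOI layer: a resolvable DOI is never HALLUCINATED, only VERIFIED or SUSPICIOUS
    return "VERIFIED" if similarity >= 0.80 else "SUSPICIOUS"
```

For example, "Deep Learning" against "Deep Learning for Graphs and Other Structures" shares 2 words out of a maximum set size of 7, so it scores 2/7 (about 0.29). A plain Jaccard index would give the same 2/7 here, but the two metrics diverge once each title has words the other lacks.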

Caching

SHA256 hash of normalized title → JSON file in `[home-path]`

Does NOT cache `SKIPPED` status (network failures) to allow retry on next run.
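
A minimal sketch of that cache behaviour, assuming one JSON file per hashed title (the function names and the file-per-key layout are our assumptions; the cache directory is passed in because the real path is elided above):

```python
from __future__ import annotations

import hashlib
import json
import re
from pathlib import Path


def cache_key(title: str) -> str:
    # Normalize first so trivially different spellings share one cache entry
    normalized = " ".join(re.sub(r"[^\w\s]", "", title.lower()).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cache_put(cache_dir: Path, title: str, result: dict) -> bool:
    # SKIPPED means a network failure: do not persist it, so the next run retries
    if result.get("status") == "SKIPPED":
        return False
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(title)}.json"
    path.write_text(json.dumps(result), encoding="utf-8")
    return True


def cache_get(cache_dir: Path, title: str) -> dict | None:
    path = cache_dir / f"{cache_key(title)}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```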

Rate Limiting

  • arXiv: 1.5s between requests
  • CrossRef: 0.3s
  • OpenAlex: 0.2s

What's Novel

THIS IS THE GOLD. The fallback chain with 0.80/0.50 thresholds is the exact pattern we need for preventing LLM reference hallucination. The cache layer prevents redundant API calls.

Integration Path

Build as `citation_verifier.py` service in `[home-path]`:

python
@dataclass
class VerificationResult:
    status: str  # VERIFIED | SUSPICIOUS | HALLUCINATED | SKIPPED
    confidence: float  # 0.0-1.0
    source: str  # doi | openalex | arxiv | s2
    matched_title: str | None
    similarity: float

async def verify_citation(title: str, doi: str | None, arxiv_id: str | None) -> VerificationResult:
    # Layer 2: DOI first
    if doi:
        result = await check_crossref(doi, title)
        if result: return result

    # Layer 3a: OpenAlex
    result = await check_openalex(title)
    if result: return result

    # Layer 1: arXiv ID
    if arxiv_id:
        result = await check_arxiv(arxiv_id, title)
        if result: return result

    # Layer 3b: Semantic Scholar
    result = await check_semantic_scholar(title)
    return result or VerificationResult("HALLUCINATED", 0.9, "none", None, 0.0)

Use cases:
- Pre-publish verification for research outputs
- Real-time citation checking in Obsidian vault writes
- Batch verification flow for existing vault references

Effort: 2-3 days. Value: HIGH (prevents embarrassing hallucinated citations in any research output).

---

3. Novelty Assessment

Scoring Mechanism

Inverse similarity: `novelty_score = 1.0 - max_similarity`

Where `max_similarity` is the highest Jaccard overlap against existing papers in the domain.

Signals

1. Keyword overlap (70%)
2. Title sequence matching (30%)
3. Impact weighting — high-citation papers (≥50 citations) with similarity ≥0.4 get 0.7× penalty multiplier

Ranking

  • Top 5 most-similar papers drive the novelty score
  • Search queries = topic + hypothesis titles + extracted keywords
  • Deduplication by title across sources

Assessment Tiers

| Novelty Score | Tier | Recommendation |
|---|---|---|
| ≥0.70 | high | Proceed |
| ≥0.45 | moderate | Differentiate |
| ≥0.25 | low | Differentiate strongly |
| <0.25 | critical | Abort |
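
For reference, the inverse-similarity score and the tier lookup reduce to a few lines. This sketch covers only those two pieces; how the repo combines the 70/30 signals and applies the 0.7× impact penalty is not shown in this extract, so both are left out rather than guessed. Function names are ours.

```python
def novelty_score(similarities: list[float]) -> float:
    # novelty = 1 - highest similarity; with nothing to compare against, treat as fully novel
    return 1.0 - max(similarities, default=0.0)


def novelty_tier(score: float) -> tuple[str, str]:
    """Map a novelty score to (tier, recommendation) per the table above."""
    if score >= 0.70:
        return ("high", "Proceed")
    if score >= 0.45:
        return ("moderate", "Differentiate")
    if score >= 0.25:
        return ("low", "Differentiate strongly")
    return ("critical", "Abort")
```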

What's Novel

Nothing. Basic inverse similarity with citation weighting. We already have better novelty signals in Evo3 (TIE techniques, diversity metrics, cross-pollination).

Integration Path

Skip. Not worth porting. If we need novelty assessment, use Evo3's existing diversity metrics and cross-pollination system.

---

4. MetaClaw Bridge — Cross-Run Learning

Lesson Extraction

Source: Failed stages, blocked stages, decision pivots/refines, runtime anomalies (NaN/Inf detection)

Categorization: SYSTEM | EXPERIMENT | WRITING | ANALYSIS | LITERATURE | PIPELINE (via keyword matching)

Data Structure:

python
@dataclass
class LessonEntry:
    stage_name: str
    stage_num: int
    category: str  # SYSTEM | EXPERIMENT | ...
    severity: str  # info | warning | error | critical
    description: str
    timestamp: str
    run_id: str

Storage Format

JSONL append-only: `lessons.jsonl`

Each line is a single JSON object. Enables efficient sequential logging without full file rewrites.
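
A minimal sketch of the append and load sides, reusing the `LessonEntry` fields above. `append_lesson` and `load_lessons` are hypothetical names; skipping unparseable lines is our choice, made so a partially written last line does not poison the whole log.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class LessonEntry:
    stage_name: str
    stage_num: int
    category: str
    severity: str
    description: str
    timestamp: str
    run_id: str


def append_lesson(path: Path, lesson: LessonEntry) -> None:
    # Append mode: one JSON object per line, existing content is never rewritten
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(lesson)) + "\n")


def load_lessons(path: Path) -> list[LessonEntry]:
    if not path.exists():
        return []
    lessons = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                lessons.append(LessonEntry(**json.loads(line)))
            except (json.JSONDecodeError, TypeError):
                continue  # corrupt or truncated line: skip it, keep the rest
    return lessons
```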

Lesson → Skill Conversion

Trigger: End of each run, LLM converts high-severity lessons to skills

Severity filtering:

python
_SEVERITY_ORDER = {"info": 0, "warning": 1, "error": 2, "critical": 3}
# Only lessons with severity >= min_severity get converted
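
The filter itself is a one-liner over that ordering. A sketch, assuming lessons are plain dicts and a default floor of `"error"` (the repo's actual default for `min_severity` is not stated here):

```python
_SEVERITY_ORDER = {"info": 0, "warning": 1, "error": 2, "critical": 3}


def filter_by_severity(lessons: list[dict], min_severity: str = "error") -> list[dict]:
    floor = _SEVERITY_ORDER[min_severity]
    # Unknown severities rank lowest, so they never pass a non-trivial floor
    return [
        lesson for lesson in lessons
        if _SEVERITY_ORDER.get(lesson["severity"], 0) >= floor
    ]
```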

LLM prompt pattern:

System: Convert failure lessons from an automated research pipeline into reusable skill guides.

User:
Filtered lessons:
- [severity] [category] [stage] description

Existing skills: [list of skill names to prevent duplicates]

Generate up to {max_skills} skills.

Output format:

python
@dataclass
class Skill:
    name: str  # "arc-" prefix, lowercase-hyphenated slug
    description: str  # usage guidance
    category: str  # mapped from lesson category
    content: str  # markdown with numbered steps

File format: `SKILL.md` with YAML frontmatter + markdown body

Skill Injection (Stage-Specific Retrieval)

Retrieval logic:

python
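from datetime import datetime, timezone

def query_for_stage(stage_name: str, top_k: int = 5) -> list[LessonEntry]:
    lessons = load_lessons_jsonl()

    # Recency weighting: exponential decay (30-day half-life, 90-day max age)
    now = datetime.now(timezone.utc)
    weighted_lessons = []
    for lesson in lessons:
        # LessonEntry.timestamp is an ISO 8601 string, so parse it before subtracting
        recorded = datetime.fromisoformat(lesson.timestamp.replace("Z", "+00:00"))
        if recorded.tzinfo is None:
            recorded = recorded.replace(tzinfo=timezone.utc)  # Treat naive timestamps as UTC
        age_days = (now - recorded).days
        if age_days > 90: continue

        recency_weight = 2 ** (-age_days / 30)  # 30-day half-life
        relevance_weight = 2.0 if lesson.stage_name == stage_name else 1.0
        severity_weight = 1.5 if lesson.severity in ["error", "critical"] else 1.0

        score = recency_weight * relevance_weight * severity_weight
        weighted_lessons.append((score, lesson))

    # Return the top_k lessons by score (the lessons themselves, not the score tuples)
    ranked = sorted(weighted_lessons, key=lambda pair: pair[0], reverse=True)
    return [lesson for _, lesson in ranked[:top_k]]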

Prompt overlay injection:

python
def build_overlay(stage_name: str) -> str:
    # Recent intra-run lessons (current execution)
    recent_lessons = get_recent_intra_run_lessons()

    # Cross-run skills from MetaClaw
    skills = load_skills_for_stage(stage_name)

    return f"""
## Lessons from Previous Runs

{format_lessons(recent_lessons)}

## Relevant Skills

{format_skills(skills)}
"""

Stage-to-Skill Mapping

Data structure: `STAGE_SKILL_MAP` dictionary

python
STAGE_SKILL_MAP = {
    "literature_screen": {
        "task_type": "research",
        "skills": ["paper-relevance-screening"],
        "top_k": 6
    },
    "code_generation": {
        "task_type": "coding",
        "skills": ["experiment-code-gen", "pytorch-best-practices"],
        "top_k": 4
    },
    # ... 22 stages total
}

Retrieval:
- Critical research stages: `top_k=6`
- Standard stages: `top_k=4`
- Automation tasks: `top_k=2`

Fallback: Generic research config if stage unknown (task_type="research", empty skills, top_k=4)
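
The lookup with its fallback is small enough to sketch in full. The two map entries are copied from above; `get_stage_config` and `_DEFAULT_STAGE_CONFIG` are hypothetical names.

```python
STAGE_SKILL_MAP = {
    "literature_screen": {
        "task_type": "research",
        "skills": ["paper-relevance-screening"],
        "top_k": 6,
    },
    "code_generation": {
        "task_type": "coding",
        "skills": ["experiment-code-gen", "pytorch-best-practices"],
        "top_k": 4,
    },
}

# Generic research config used when the stage is unknown
_DEFAULT_STAGE_CONFIG = {"task_type": "research", "skills": [], "top_k": 4}


def get_stage_config(stage_name: str) -> dict:
    config = STAGE_SKILL_MAP.get(stage_name, _DEFAULT_STAGE_CONFIG)
    # Return a copy so callers cannot mutate the shared map or the default
    return {**config, "skills": list(config["skills"])}
```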

What's Novel

The skill-per-stage mapping + recency-weighted retrieval is the pattern worth stealing. The JSONL append-only log is clean. The lesson→skill conversion via LLM is elegant.

But we already have this in KARL. Here's the mapping:

| AutoResearchClaw | KARL Equivalent |
|---|---|
| `lessons.jsonl` | `trajectory_log.jsonl` + `shadow_records.jsonl` |
| Lesson → Skill conversion | Trajectory → Skill annotation via `rank_skills()` |
| Stage-specific retrieval | Skill routing via vector similarity or regex |
| Recency weighting | Not yet implemented (opportunity!) |
| Skill prompt overlay | Skill injection via `enriched_spawn.py` |

Integration Path

Enhance KARL with recency weighting:

Add `recency_weight` to `trajectory_bridge.py`:

python
def get_skills_for_context(cwd: str, prompt: str, top_k: int = 5) -> list[str]:
    # Existing vector/regex routing
    ranked_skills = rank_skills(prompt, cwd)

    # Add recency weighting
    now = datetime.now()
    weighted = []
    for skill, similarity in ranked_skills:
        trajectories = get_trajectories_for_skill(skill)
        if not trajectories: continue

        # Average trajectory age
        avg_age_days = sum((now - t.timestamp).days for t in trajectories) / len(trajectories)
        recency_weight = 2 ** (-avg_age_days / 30)  # 30-day half-life

        final_score = similarity * recency_weight
        weighted.append((skill, final_score))

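    # Sort on the score, not the (skill, score) tuple, which would order by skill name first
    ranked = sorted(weighted, key=lambda pair: pair[1], reverse=True)
    return [skill for skill, _ in ranked[:top_k]]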

Effort: 1 day. Value: MEDIUM (improves KARL routing with temporal decay).

---

5. PIVOT/REFINE/PROCEED — Autonomous Decision Loop

Decision Point

Stage 15 (RESEARCH_DECISION) — autonomous analysis of experiment results

Three Outcomes

| Outcome | Action | Rollback Target |
|---|---|---|
| PROCEED | Continue to result analysis and paper writing | None (forward progress) |
| PIVOT | Discard hypotheses, regenerate from scratch | `Stage.HYPOTHESIS_GEN` |
| REFINE | Keep hypotheses, re-execute experiments | `Stage.ITERATIVE_REFINE` |

Pivot Limit

MAX_DECISION_PIVOTS = 2 — prevents infinite loops

After 2 pivots, the system forces PROCEED even if results are weak.

Rollback Mechanism

1. Decision Recording

Append to `decision_history.json`:

json
{
  "stage": "RESEARCH_DECISION",
  "decision": "PIVOT",
  "attempt": 1,
  "rationale": "Hypothesis 2 showed no improvement over baseline...",
  "rollback_target": "HYPOTHESIS_GEN",
  "timestamp": "2026-03-18T10:30:00Z"
}

2. Version Preservation

`_version_rollback_stages()` renames existing directories:

stage-08/ → stage-08_v1/
stage-09/ → stage-09_v2/

Preserves all previous attempts for audit trail.
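
The repo's `_version_rollback_stages()` is not reproduced in this extract. As a rough sketch of the rename step only, assuming zero-padded `stage-NN` directories under a run directory and a "next free `_vN` suffix" rule (both assumptions, as is the `version_stage_dir` name):

```python
from __future__ import annotations

from pathlib import Path


def version_stage_dir(run_dir: Path, stage_num: int) -> Path | None:
    """Rename stage-NN/ to the next free stage-NN_vK/ and return the new path."""
    current = run_dir / f"stage-{stage_num:02d}"
    if not current.exists():
        return None  # nothing to preserve for this stage

    # Find the first unused version suffix so earlier attempts are never overwritten
    version = 1
    while (run_dir / f"stage-{stage_num:02d}_v{version}").exists():
        version += 1

    target = run_dir / f"stage-{stage_num:02d}_v{version}"
    current.rename(target)
    return target
```

A rollback to a given stage would call this for that stage and every later one before re-executing.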

3. Quality Gating

Before allowing PROCEED after pivots, check:
- Pivot count < MAX_DECISION_PIVOTS
- No consecutive empty metrics (indicates broken experiment)
- Experiment quality score ≥ threshold

4. Recursive Re-execution

python
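def execute_pipeline(start_stage: Stage = Stage.TOPIC_INIT, pivots: int = 0) -> list[StageResult]:
    results = []
    # Slice by position so this also works when Stage is a plain (non-int) Enum
    start_idx = STAGE_SEQUENCE.index(start_stage)
    for stage in STAGE_SEQUENCE[start_idx:]:
        result = execute_stage(stage)
        results.append(result)

        # Once the pivot limit is reached, PIVOT/REFINE are ignored (forced PROCEED)
        if result.decision in ["PIVOT", "REFINE"] and pivots < MAX_DECISION_PIVOTS:
            rollback_target = GATE_ROLLBACK[result.decision]
            _version_rollback_stages(start=rollback_target)

            # Recursive call from rollback point, carrying the pivot count forward
            recursive_results = execute_pipeline(start_stage=rollback_target, pivots=pivots + 1)
            results.extend(recursive_results)
            break

    return results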

What's Novel

The versioned rollback + pivot limit is clean autonomy design. Most systems either (a) require human approval at decision points, or (b) allow infinite retries. This middle path — autonomous decisions with bounded retries — is the sweet spot.

Integration Path

Add to Prefect flows and Agent Teams:

python
# In flow orchestration
@flow
def research_flow(topic: str):
    pivot_count = 0

    # Execute research stages once up front
    lit_review = literature_search(topic)
    hypotheses = generate_hypotheses(lit_review)
    experiments = run_experiments(hypotheses)

    while True:
        # Autonomous decision
        decision = analyze_results(experiments)

        if decision == "PROCEED":
            return write_paper(experiments)
        if pivot_count >= MAX_PIVOTS:
            break  # Retry budget spent: fall through to forced proceed

        # PIVOT and REFINE share one budget here (an assumption) so neither can loop forever
        pivot_count += 1
        if decision == "PIVOT":
            # Version existing work, then regenerate hypotheses from scratch
            version_artifacts(f"pivot_{pivot_count}")
            hypotheses = generate_hypotheses(lit_review)
            experiments = run_experiments(hypotheses)
        else:  # REFINE
            # Keep hypotheses, retry experiments with tweaks
            experiments = refine_experiments(hypotheses, experiments)

    # Forced proceed after max pivots
    return write_paper(experiments, disclaimer="Results below threshold")

Use in KARL/Agent Teams:
- Team lead spawns subtasks
- Aggregator analyzes results
- If quality low, aggregator can PIVOT (spawn new team) or REFINE (retry subtasks with tweaks)
- Track pivot count in `mac_tasks.pivot_count` column

Effort: 2-3 days. Value: HIGH (autonomous quality control for multi-stage workflows).

---

6. Sandbox Execution + Self-Healing

Sandbox Architecture

Subprocess isolation:

python
script_path = self._next_script_path()  # /tmp/experiment_N.py
write_file(script_path, code)
result = subprocess.run(
    ["python3", script_path],
    capture_output=True,
    text=True,
    cwd=self.workdir,
    timeout=300,
    env={"PYTHONUNBUFFERED": "1"}
)

NaN/Inf Detection

Metric extraction: Regex patterns for `"metric: value"` or `"condition=X metric: value"`

Divergence detection:

python
def detect_nan_divergence(stdout: str, stderr: str) -> list[str]:
    issues = []

    # Parse metrics from stdout
    metrics = extract_metrics(stdout)
    for name, value in metrics:
        if not math.isfinite(value):
            issues.append(f"Non-finite metric: {name}={value}")
        if name.endswith("loss") and value > 100:
            issues.append(f"Diverging loss: {name}={value} (>100)")

    # Scan stderr for NaN/Inf warnings
    if re.search(r'\bnan\b', stderr, re.IGNORECASE):
        issues.append("NaN detected in stderr")
    if re.search(r'\binf\b(?!o)', stderr, re.IGNORECASE):  # Avoid "info"
        issues.append("Inf detected in stderr")

    return issues
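
`extract_metrics` is called above but not shown. A minimal sketch, assuming one metric per line in the two formats named earlier (`metric: value` and `condition=X metric: value`); the exact regex the repo uses is unknown, so this pattern is our guess at a reasonable one:

```python
import re

# Optional "condition=X " prefix, then "name: value" filling the rest of the line
_METRIC_RE = re.compile(
    r"^(?:condition=(?P<condition>\S+)\s+)?(?P<name>[A-Za-z_][\w.]*):\s*(?P<value>\S+)\s*$"
)


def extract_metrics(stdout: str) -> list[tuple[str, float]]:
    metrics = []
    for line in stdout.splitlines():
        match = _METRIC_RE.match(line.strip())
        if not match:
            continue
        try:
            # float() accepts "nan" and "inf", which is what divergence detection needs
            value = float(match.group("value"))
        except ValueError:
            continue  # e.g. "epoch: done" is not a numeric metric
        metrics.append((match.group("name"), value))
    return metrics
```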

Self-Healing Loop

NOT in sandbox code. The sandbox only detects issues.

Actual repair happens in executor:

python
def execute_code_generation(context):
    max_attempts = 5
    code = None

    for attempt in range(max_attempts):
        if code is None:
            # First attempt: generate from scratch
            code = llm_generate_code(context)
        else:
            # Repair attempt: provide issues to LLM
            code = llm_repair_code(code, issues)

        # Validate syntax, imports, security
        validation_issues = validate_code(code)
        if validation_issues:
            issues = validation_issues
            continue

        # Execute in sandbox
        result = sandbox.run(code)
        runtime_issues = detect_nan_divergence(result.stdout, result.stderr)

        if not runtime_issues:
            return code, result  # Success

        # Format issues for LLM repair
        issues = format_issues_for_llm(runtime_issues, result)

    # Max attempts exhausted
    raise ExperimentFailure(f"Failed after {max_attempts} repair attempts")

Code Validation (Pre-Execution)

AST-based static analysis:

python
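def validate_code(code: str) -> list[str]:
    issues = []

    # Parse AST
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return [f"Syntax error: {e}"]

    # Security checks: modules are matched on import; eval/exec/os.system are
    # calls rather than imports, so they are matched on ast.Call nodes
    forbidden_modules = {"subprocess"}
    forbidden_calls = {"eval", "exec", "os.system"}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in forbidden_modules:
                    issues.append(f"Forbidden import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            # Also catch "from subprocess import run"
            if (node.module or "").split(".")[0] in forbidden_modules:
                issues.append(f"Forbidden import: {node.module}")
        elif isinstance(node, ast.Call):
            call_name = ast.unparse(node.func)  # Python 3.9+
            if call_name in forbidden_calls:
                issues.append(f"Forbidden call: {call_name}")

    # Check for required structure (e.g., main function)
    has_main = any(
        isinstance(node, ast.FunctionDef) and node.name == "main"
        for node in ast.walk(tree)
    )
    if not has_main:
        issues.append("Missing main() function")

    return issues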

What's Novel

The validation → execute → detect → repair loop is standard but well-executed. The AST validation screens out obviously unsafe imports and calls before execution; it is a static filter, not a security boundary. The NaN/Inf detection via regex is pragmatic.

But there's no novel algorithm here. It's just good engineering: subprocess isolation + static analysis + iterative repair.

Integration Path

We already have sandboxing in `.pulse/enriched_spawn.py` and flow execution. Add NaN/Inf detection:

python
# In flow result analysis
def analyze_flow_result(result: FlowResult) -> dict:
    issues = []

    # Check for NaN/Inf in logged metrics
    for metric_name, values in result.metrics.items():
        for v in values:
            if not math.isfinite(v):
                issues.append(f"Non-finite {metric_name}: {v}")
            if "loss" in metric_name.lower() and v > 100:
                issues.append(f"Diverging {metric_name}: {v}")

    # Scan logs (word-boundary match so "nanoseconds" or "finance" do not trigger)
    logs = result.logs
    if re.search(r"\bnan\b", logs, re.IGNORECASE):
        issues.append("NaN detected in logs")

    return {"healthy": len(issues) == 0, "issues": issues}

Effort: 1 day. Value: LOW (we don't run ML experiments frequently; this is ML-specific).

---

7. Hardware Detection + Adaptive Code Generation

Detection Logic

Sequential checks:

python
def detect_hardware() -> HardwareProfile:
    # 1. Try NVIDIA
    nvidia = _detect_nvidia()
    if nvidia:
        return nvidia

    # 2. Try Apple MPS
    mps = _detect_mps()
    if mps:
        return mps

    # 3. Fallback to CPU
    return HardwareProfile(
        has_gpu=False,
        gpu_type="cpu",
        tier="cpu_only",
        warning="No GPU detected. Experiments may be slow."
    )

NVIDIA detection:

bash
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
# Output: "NVIDIA A100-SXM4-40GB, 40960 MiB"

Parse GPU name and VRAM. Classify:
- VRAM ≥ 8192 MB → "high" tier
- VRAM < 8192 MB → "limited" tier
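
Parsing one line of that `nvidia-smi` output and applying the 8192 threshold could look like this. The `parse_nvidia_smi_line` helper is a hypothetical name, and the sketch assumes the `name, NNNN MiB` shape shown in the sample output above.

```python
def parse_nvidia_smi_line(line: str) -> tuple[str, int, str]:
    """Return (gpu_name, vram_mib, tier) from one CSV line of nvidia-smi output."""
    # Split on the last comma only, in case the GPU name itself contains one
    name, memory = (part.strip() for part in line.rsplit(",", 1))
    vram_mib = int(memory.split()[0])  # "40960 MiB" -> 40960
    tier = "high" if vram_mib >= 8192 else "limited"
    return name, vram_mib, tier
```

On a multi-GPU machine `nvidia-smi` prints one line per device, so a caller would need to decide whether to classify on the first, the largest, or the smallest card.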

MPS detection:

python
if platform.system() == "Darwin" and platform.machine() == "arm64":
    chip = subprocess.check_output(["sysctl", "-n", "machdep.cpu.brand_string"])
    return HardwareProfile(
        has_gpu=True,
        gpu_type="mps",
        tier="limited",  # Shares system memory
        warning="Using Apple MPS. Performance varies by model size."
    )

Adaptive Code Generation

Prompt overlay:

System: Generate PyTorch experiment code.

Hardware Profile:
- GPU: {gpu_type} ({tier})
- Warning: {warning}

{If gpu_type == "cpu":}
Use small batch sizes (≤32). Avoid conv3d. Limit model size to <100M params.

{If gpu_type == "mps":}
Use torch.mps device. Avoid flash-attention. Batch size ≤64.

{If gpu_type == "cuda" and tier == "high":}
Full hardware available. Can use large models, flash-attention, gradient checkpointing.

Conditional PyTorch installation:

python
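def ensure_torch_available(hardware: HardwareProfile) -> bool:
    if hardware.gpu_type == "cpu":
        # Don't install torch for CPU-only (too slow anyway)
        return False

    try:
        # Install into the interpreter running this code, not whichever pip is first on PATH
        subprocess.run([sys.executable, "-m", "pip", "install", "torch"], check=True)
        return True
    except (subprocess.CalledProcessError, OSError):
        # pip exited non-zero or could not be launched
        return False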

What's Novel

Nothing. Standard hardware detection. The adaptive prompt is smart but not groundbreaking.

Integration Path

Add hardware profile to agent context:

python
# In enriched_spawn.py
def get_system_context() -> dict:
    return {
        "cwd": os.getcwd(),
        "git_branch": get_git_branch(),
        "hardware": detect_hardware(),  # Add this
        "mesh_node": get_mesh_device_name()
    }

Use in prompt overlay for code-generation tasks.

Effort: 1 day. Value: LOW (useful for ML tasks but we don't do many).

---

8. Quality Gates + Template Ratio Detection

Quality Check Mechanism

Template ratio heuristic:

python
def compute_template_ratio(text: str) -> float:
    total_chars = len(text)
    template_chars = 0

    # 12 predefined regex patterns
    patterns = [
        r"\[INSERT[^\]]*\]",
        r"\[TODO:[^\]]*\]",
        r"\[PLACEHOLDER[^\]]*\]",
        r"this section will describe",
        r"add your content here",
        r"Lorem ipsum",
        # ... 6 more patterns
    ]

    for pattern in patterns:
        for match in re.finditer(pattern, text, re.IGNORECASE):
            template_chars += len(match.group())

    return template_chars / total_chars if total_chars > 0 else 0.0

Quality Gate

Binary pass/fail:

python
threshold = 0.05  # 5% template content tolerance

def check_quality(text: str) -> tuple[bool, str]:
    ratio = compute_template_ratio(text)

    if ratio <= threshold:
        return (True, f"Quality check passed: template_ratio={ratio:.2%}")
    else:
        matches = find_template_matches(text)
        examples = matches[:5]  # Show up to 5 examples
        return (False, f"Template content detected: ratio={ratio:.2%}, {len(matches)} matches. Examples: {examples}")
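
`find_template_matches` is used above without a definition. A minimal sketch that reuses the six patterns listed earlier (the remaining six are not shown in this extract, so they are omitted here rather than invented):

```python
import re

# The six patterns shown above; the other six from the repo are not reproduced here
TEMPLATE_PATTERNS = [
    r"\[INSERT[^\]]*\]",
    r"\[TODO:[^\]]*\]",
    r"\[PLACEHOLDER[^\]]*\]",
    r"this section will describe",
    r"add your content here",
    r"Lorem ipsum",
]


def find_template_matches(text: str) -> list[str]:
    """Return every placeholder snippet found, grouped in pattern order."""
    matches = []
    for pattern in TEMPLATE_PATTERNS:
        matches.extend(m.group() for m in re.finditer(pattern, text, re.IGNORECASE))
    return matches
```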

Integration with Pipeline

Quality gate stages: 5, 9, 20

Decision flow:

python
if stage in QUALITY_GATE_STAGES:
    passed, message = check_quality(stage_output)
    if not passed:
        return StageResult(
            status="FAILED",
            decision="REVISION",  # Loop internally, don't proceed
            error=message
        )

What's Novel

The template ratio heuristic is clever. Most systems check for placeholder text via simple string matching. This quantifies it as a ratio and enforces a threshold.

But it's very domain-specific (academic paper writing). Not generalizable to code, architecture docs, or other outputs.

Integration Path

Add as quality checker for Obsidian vault writes:

python
# In obsidian_vault_writer/api.py
def validate_note_quality(content: str) -> dict:
    template_ratio = compute_template_ratio(content)

    return {
        "passed": template_ratio <= 0.05,
        "template_ratio": template_ratio,
        "message": "Too much placeholder content" if template_ratio > 0.05 else "OK"
    }

Effort: 1 day. Value: LOW (nice-to-have for vault quality, not critical).

---

9. Knowledge Base — Structured Extraction

Data Structure

python
@dataclass
class KBEntry:
    category: str  # questions | literature | experiments | findings | decisions | reviews
    entry_id: str
    title: str
    content: str  # Markdown body
    source_stage: str
    run_id: str
    evidence_refs: list[str] | None = None  # Links to artifacts
    tags: list[str] | None = None
    links: list[str] | None = None  # Wikilinks for Obsidian

Conversion Process

Stage artifact → KB entry:

python
def stage_to_kb(stage_name: str, stage_num: int, artifact_path: str, run_id: str) -> KBEntry:
    # Read artifact (truncate if >5K chars)
    content = read_file(artifact_path)
    if len(content) > 5000:
        content = content[:5000] + "\n\n[Content truncated...]"

    # Map stage to category
    category = KB_CATEGORY_MAP.get(stage_name, "findings")

    # Auto-tag (stage number zero-padded, e.g. "stage-02")
    tags = [stage_name, f"stage-{stage_num:02d}", f"run-{run_id[:8]}"]

    # Evidence refs point at the artifact inside its stage directory
    artifact_filename = os.path.basename(artifact_path)  # requires `import os`
    evidence_refs = [f"stage-{stage_num:02d}/{artifact_filename}"]

    return KBEntry(
        category=category,
        entry_id=f"{stage_name}-{run_id[:8]}",  # Short run ID, matching the frontmatter example
        title=f"{stage_name.replace('_', ' ').title()} ({run_id[:8]})",
        content=content,
        source_stage=stage_name,
        run_id=run_id,
        evidence_refs=evidence_refs,
        tags=tags,
        links=None  # Populated for Obsidian backend
    )

Storage Backends

Markdown:

markdown
---
entry_id: literature_collect-a1b2c3d4
title: Literature Collect (a1b2c3d4)
category: literature
source_stage: literature_collect
run_id: a1b2c3d4e5f6
tags: [literature_collect, stage-02, run-a1b2c3d4]
evidence_refs: [stage-02/papers.json]
---

# Literature Collect

Content here...

Obsidian:

Same as Markdown but adds:
- Wikilinks: `[[related-entry]]`
- Inline hashtags: `#literature_collect #stage-02`

Weekly Aggregation

Cross-run statistics:

python
def generate_weekly_report(runs: list[RunSummary]) -> dict:
    n = max(len(runs), 1)  # Guard: a week with no runs would otherwise divide by zero
    return {
        "total_runs": len(runs),
        "success_rate": sum(r.status == "COMPLETE" for r in runs) / n,
        "common_failures": Counter(r.failed_stage for r in runs if r.failed_stage).most_common(5),
        "avg_runtime_hours": sum(r.runtime_seconds for r in runs) / n / 3600,
        "pivot_rate": sum(r.pivot_count > 0 for r in runs) / n
    }

What's Novel

The category mapping + evidence refs + auto-tagging is solid. It's a clean way to structure research artifacts for long-term retrieval.

But we already have this in Obsidian vault writer. Here's the mapping:

| AutoResearchClaw | CLAW Equivalent |
|---|---|
| `KBEntry` | Obsidian note with YAML frontmatter |
| `evidence_refs` | Links to source files in vault |
| `tags` | Auto-tags from context (project, skill, pane) |
| Category mapping | Folder structure (Panes, Projects, Daily, Concepts) |
| Weekly aggregation | Memory summarizer flow (not yet weekly, but similar) |

Integration Path

Enhance Obsidian vault writer with evidence refs:

python
# In vault API
@dataclass
class VaultNote:
    title: str
    content: str
    folder: str  # Panes | Projects | Daily | Concepts | Claims
    tags: list[str]
    evidence_refs: list[str] | None = None  # NEW: links to source artifacts

def create_note_with_evidence(note: VaultNote):
    frontmatter = {
        "tags": note.tags,
        "evidence_refs": note.evidence_refs,
        "created": datetime.now().isoformat()
    }

    # Append evidence section if refs exist
    if note.evidence_refs:
        note.content += "\n\n## Evidence\n\n"
        for ref in note.evidence_refs:
            note.content += f"- `{ref}`\n"

    write_vault_note(note.title, note.content, note.folder, frontmatter)

Effort: 1 day. Value: MEDIUM (improves vault traceability).

---

10. Agents — Multi-Agent Orchestration

Base Agent Structure

python
class BaseAgent:
    def __init__(self, llm_client):
        self.llm = llm_client
        self.metrics = {"llm_calls": 0, "tokens": 0}

    def execute(self, context: dict) -> AgentStepResult:
        raise NotImplementedError

    def _chat(self, system: str, user: str, **kwargs) -> str:
        self.metrics["llm_calls"] += 1
        response = self.llm.chat(system=system, user=user, **kwargs)
        self.metrics["tokens"] += response.usage.total_tokens
        return response.content

    def _chat_json(self, system: str, user: str, **kwargs) -> dict:
        response_text = self._chat(system, user, **kwargs)
        return self._parse_json_with_fallback(response_text)
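
`_parse_json_with_fallback` is not shown in this extract. One plausible shape, written as a standalone function, tries progressively looser extractions: the raw text, then a fenced block, then the outermost braces. The order of attempts and the empty-dict fallback are our assumptions.

```python
import json
import re


def parse_json_with_fallback(text: str) -> dict:
    """Best-effort extraction of a JSON object from an LLM response."""
    candidates = [text]

    # A fenced block, with or without a "json" language tag
    fence = "`" * 3
    fenced = re.search(fence + r"(?:json)?\s*(.*?)" + fence, text, re.DOTALL)
    if fenced:
        candidates.append(fenced.group(1))

    # Everything between the first "{" and the last "}"
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        candidates.append(text[start:end + 1])

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return {}  # nothing parseable: callers should treat this as "no structured output"
```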

Orchestrator

python
class AgentOrchestrator:
    def __init__(self, agents: list[BaseAgent], max_iterations: int = 1):
        self.agents = agents
        self.max_iterations = max_iterations
        self.metrics = {"llm_calls": 0, "tokens": 0}

    def orchestrate(self, context: dict) -> OrchestratorResult:
        raise NotImplementedError  # Subclasses define workflow

    def _accumulate(self, agent: BaseAgent):
        self.metrics["llm_calls"] += agent.metrics["llm_calls"]
        self.metrics["tokens"] += agent.metrics["tokens"]

Communication Pattern

Context-based passing:

python
# Sequential example
def orchestrate(self, context: dict) -> OrchestratorResult:
    # Agent 1: Generate hypothesis
    hypothesis = self.hypothesis_agent.execute(context)
    context["hypothesis"] = hypothesis.output

    # Agent 2: Design experiment
    experiment = self.design_agent.execute(context)
    context["experiment"] = experiment.output

    # Agent 3: Review
    review = self.review_agent.execute(context)

    return OrchestratorResult(
        outputs=[hypothesis, experiment, review],
        metrics=self.metrics
    )

NO VOTING OR CONSENSUS. The code shows no multi-agent debate pattern. Agents execute sequentially and pass data via shared context dict.

What's Novel

Nothing. This is just clean OOP with base classes and context passing. No novel orchestration pattern.

The "multi-agent debate" mentioned in the README is likely a marketing term for a sequential agent chain with review steps, not actual debate/voting.

Integration Path

We already have better orchestration in Agent Teams:

  • Parallel subtask execution (AutoResearchClaw is sequential)
  • Team messages for inter-agent communication (AutoResearchClaw has no peer-to-peer messaging)
  • Aggregator for synthesis (AutoResearchClaw has no aggregation step)

Skip. Our agent orchestration is superior.

---

Synthesis — What to Steal

Tier 1: Immediate Value (Implement This Week)

1. Citation Verification (4-layer fallback)
- Path: `[home-path]`
- Endpoints: CrossRef → OpenAlex → arXiv → Semantic Scholar
- Thresholds: 0.80 verified, 0.50 suspicious, <0.50 hallucinated
- Cache: SHA256(title) → JSON in `[home-path]`
- Use case: Pre-publish verification for research outputs, Obsidian vault quality
- Effort: 2-3 days
- VALUE: HIGH

2. PIVOT/REFINE/PROCEED Autonomy
- Add to Prefect flows and Agent Teams
- Decision outcomes: PROCEED (continue), PIVOT (regenerate), REFINE (retry)
- Pivot limit: MAX_PIVOTS=2 (prevents infinite loops)
- Versioning: Snapshot artifacts before rollback (`stage_v1/`, `stage_v2/`)
- Track in `mac_tasks.pivot_count` column
- Effort: 2-3 days
- VALUE: HIGH
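
The decision loop in item 2 can be sketched roughly as follows. Only `MAX_PIVOTS=2`, the three outcomes, and the `stage_v1/`, `stage_v2/` snapshot naming come from the note. The `generate`, `refine`, and `decide` callables, the `max_refines` bound, and the "proceed once the budget is exhausted" policy are assumptions made for illustration.

```python
import shutil
from enum import Enum
from pathlib import Path
from typing import Callable

MAX_PIVOTS = 2  # pivot limit from the note


class Decision(Enum):
    PROCEED = "proceed"
    PIVOT = "pivot"
    REFINE = "refine"


def snapshot(stage_dir: Path, version: int) -> Path:
    """Copy the stage's artifacts to <stage>_v{n}/ before they are overwritten."""
    target = stage_dir.with_name(f"{stage_dir.name}_v{version}")
    shutil.copytree(stage_dir, target, dirs_exist_ok=True)
    return target


def run_stage(
    stage_dir: Path,
    generate: Callable[[], None],    # hypothetical: writes fresh artifacts
    refine: Callable[[], None],      # hypothetical: retries on current artifacts
    decide: Callable[[], Decision],  # hypothetical: evaluates the artifacts
    max_refines: int = 3,            # assumption: not specified in the source
):
    pivots = refines = 0
    version = 1
    generate()
    while True:
        decision = decide()
        if decision is Decision.PROCEED:
            return decision, pivots
        if decision is Decision.PIVOT and pivots < MAX_PIVOTS:
            snapshot(stage_dir, version)
            version += 1
            pivots += 1
            generate()
        elif decision is Decision.REFINE and refines < max_refines:
            snapshot(stage_dir, version)
            version += 1
            refines += 1
            refine()
        else:
            # Budget exhausted: assumed policy is to proceed with what exists.
            return Decision.PROCEED, pivots
```

The returned pivot count is the value that would be written to `mac_tasks.pivot_count`.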

3. KARL Recency Weighting
- Enhance `trajectory_bridge.py` with temporal decay
- Formula: `final_score = similarity * 2 ** (-age_days / 30)`
- 30-day half-life, 90-day max age
- Prevents stale skills from dominating routing
- Effort: 1 day
- VALUE: MEDIUM
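
A quick worked example of the 30-day half-life, as a sketch only. `recency_weight` is an illustrative helper name, not an existing function in `trajectory_bridge.py`.

```python
def recency_weight(age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential decay: the weight halves every `half_life_days`."""
    return 2 ** (-age_days / half_life_days)


for age in (0, 30, 60, 90):
    print(age, recency_weight(age))
# 0 1.0
# 30 0.5
# 60 0.25
# 90 0.125
```

With these weights, a 60-day-old skill at similarity 0.9 scores 0.225. That ranks it below a fresh skill at similarity 0.5.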

Tier 2: Nice-to-Have (Next Month)

4. Multi-Source Literature Search
- Prefect flow querying OpenAlex + Semantic Scholar + arXiv in parallel
- Deduplication via DOI → arXiv ID → title
- Store in Supabase `research_papers` table
- Expose via `/api/literature/search?q={query}&year_min={year}`
- Effort: 1-2 days
- VALUE: MEDIUM
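
The DOI → arXiv ID → title deduplication could look roughly like the sketch below. The `Paper` dataclass and `dedupe` function are hypothetical names, and "first record seen wins" is an assumed merge policy. The DOI in the usage example is made up.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Paper:
    title: str
    doi: Optional[str] = None
    arxiv_id: Optional[str] = None
    source: str = ""


def _norm_title(title: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    title = re.sub(r"[^\w\s]", "", title.lower())
    return re.sub(r"\s+", " ", title).strip()


def dedupe(papers: list) -> list:
    """Keep the first paper seen for each DOI, arXiv ID, or normalized title."""
    seen = set()
    unique = []
    for p in papers:
        keys = []
        if p.doi:
            keys.append(("doi", p.doi.strip().lower()))
        if p.arxiv_id:
            keys.append(("arxiv", p.arxiv_id.strip().lower()))
        keys.append(("title", _norm_title(p.title)))
        if any(k in seen for k in keys):
            continue  # duplicate of an earlier record
        seen.update(keys)
        unique.append(p)
    return unique


if __name__ == "__main__":
    merged = dedupe([
        Paper("Attention Is All You Need", doi="10.5555/ABC", source="openalex"),
        Paper("Attention is all you need.", arxiv_id="1706.03762", source="arxiv"),
        Paper("Something Else", source="s2"),
    ])
    print([p.source for p in merged])  # ['openalex', 's2']
```

Registering every key of a kept paper lets a later record match by title even when it carries no DOI.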

5. Evidence Refs in Vault
- Add `evidence_refs: list[str]` to Obsidian notes
- Append "## Evidence" section with links to source artifacts
- Improves traceability for research notes
- Effort: 1 day
- VALUE: MEDIUM
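
A minimal sketch of the "## Evidence" append step. `append_evidence` is a hypothetical helper, and rendering each ref as an Obsidian wikilink is an assumption about how the links should appear in the vault.

```python
def append_evidence(note_md: str, evidence_refs: list) -> str:
    """Return the note with an '## Evidence' section listing each ref."""
    if not evidence_refs:
        return note_md
    section = ["## Evidence", ""] + [f"- [[{ref}]]" for ref in evidence_refs]
    return note_md.rstrip("\n") + "\n\n" + "\n".join(section) + "\n"


if __name__ == "__main__":
    print(append_evidence("# Note\n\nBody.\n", ["runs/a.json"]))
```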

6. Template Ratio Quality Check
- Add to Obsidian vault writer
- Compute ratio of placeholder content (regex patterns)
- Threshold: 5
- Effort: 1 day
- VALUE: LOW
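
One way the placeholder ratio might be computed, as a sketch only. The patterns below are hypothetical because the note does not list the actual regexes, and counting by line is an assumed unit. The threshold is left to the caller.

```python
import re

# Hypothetical placeholder patterns; the source does not list the real ones.
PLACEHOLDER_PATTERNS = [
    re.compile(r"\bTODO\b", re.IGNORECASE),
    re.compile(r"\bTBD\b", re.IGNORECASE),
    re.compile(r"lorem ipsum", re.IGNORECASE),
    re.compile(r"\[(?:insert|placeholder)[^\]]*\]", re.IGNORECASE),
]


def template_ratio(text: str) -> float:
    """Fraction of non-empty lines that match a placeholder pattern."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    flagged = sum(
        1 for ln in lines if any(p.search(ln) for p in PLACEHOLDER_PATTERNS)
    )
    return flagged / len(lines)


if __name__ == "__main__":
    note = "Intro paragraph.\n\nTODO: write results\nMethod details.\n[insert figure here]\n"
    print(template_ratio(note))  # 0.5
```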

Tier 3: Skip

  • Novelty assessment — Evo3 is better
  • Agent orchestration — Agent Teams is superior
  • Hardware detection — Not ML-focused enough to justify
  • Sandbox execution — Already have in enriched_spawn.py
  • Knowledge base structure — Already have in vault

---

Code Snippets — Ready to Implement

Citation Verifier

python
# [home-path]

import hashlib
import json
import re
import urllib.parse
from dataclasses import dataclass
from pathlib import Path
import httpx

CACHE_DIR = Path.home() / ".cache" / "claw" / "citations"
CACHE_DIR.mkdir(parents=True, exist_ok=True)

@dataclass
class VerificationResult:
    status: str  # VERIFIED | SUSPICIOUS | HALLUCINATED | SKIPPED
    confidence: float  # 0.0-1.0
    source: str  # doi | openalex | arxiv | s2 | none
    matched_title: str | None
    similarity: float

def normalize_title(title: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    title = title.lower()
    title = re.sub(r'[^\w\s]', '', title)
    title = re.sub(r'\s+', ' ', title).strip()
    return title

def title_similarity(a: str, b: str) -> float:
    """Jaccard-like with max(len) denominator."""
    words_a = set(normalize_title(a).split())
    words_b = set(normalize_title(b).split())
    if not words_a or not words_b:
        return 0.0
    intersection = len(words_a & words_b)
    max_len = max(len(words_a), len(words_b))
    return intersection / max_len

async def check_crossref(doi: str, expected_title: str) -> VerificationResult | None:
    """Layer 2: DOI resolution via CrossRef."""
    try:
        url = f"https://api.crossref.org/works/{urllib.parse.quote(doi)}"
        async with httpx.AsyncClient() as client:
            resp = await client.get(url, timeout=10)
            if resp.status_code == 404:
                return VerificationResult("HALLUCINATED", 0.9, "doi", None, 0.0)
            resp.raise_for_status()

            data = resp.json()
            actual_title = data["message"]["title"][0]
            similarity = title_similarity(expected_title, actual_title)

            if similarity >= 0.80:
                return VerificationResult("VERIFIED", similarity, "doi", actual_title, similarity)
            # The DOI resolved, so the record exists. Any title similarity
            # below 0.80 is therefore SUSPICIOUS rather than HALLUCINATED.
            return VerificationResult("SUSPICIOUS", similarity, "doi", actual_title, similarity)
    except Exception as e:
        print(f"CrossRef error: {e}")
        return None

async def check_openalex(expected_title: str) -> VerificationResult | None:
    """Layer 3a: OpenAlex title search."""
    try:
        query = urllib.parse.quote(expected_title)
        url = f"https://api.openalex.org/works?filter=title.search:{query}&per_page=5&mailto=[email]"

        async with httpx.AsyncClient() as client:
            resp = await client.get(url, timeout=10)
            resp.raise_for_status()

            data = resp.json()
            if not data["results"]:
                return VerificationResult("HALLUCINATED", 0.7, "openalex", None, 0.0)

            # Find best match
            best_sim = 0.0
            best_title = None
            for work in data["results"]:
                actual_title = work.get("title") or ""  # OpenAlex may return a null title
                sim = title_similarity(expected_title, actual_title)
                if sim > best_sim:
                    best_sim = sim
                    best_title = actual_title

            if best_sim >= 0.80:
                return VerificationResult("VERIFIED", best_sim, "openalex", best_title, best_sim)
            elif best_sim >= 0.50:
                return VerificationResult("SUSPICIOUS", best_sim, "openalex", best_title, best_sim)
            else:
                return VerificationResult("HALLUCINATED", 0.7, "openalex", best_title, best_sim)
    except Exception as e:
        print(f"OpenAlex error: {e}")
        return None

async def check_arxiv(arxiv_id: str, expected_title: str) -> VerificationResult | None:
    """Layer 1: arXiv ID lookup."""
    try:
        url = f"https://export.arxiv.org/api/query?id_list={arxiv_id}"
        async with httpx.AsyncClient() as client:
            resp = await client.get(url, timeout=10)
            resp.raise_for_status()

            # Parse Atom XML
            from xml.etree import ElementTree as ET
            root = ET.fromstring(resp.content)
            ns = {"atom": "http://www.w3.org/2005/Atom"}

            entries = root.findall("atom:entry", ns)
            if not entries:
                return VerificationResult("HALLUCINATED", 0.9, "arxiv", None, 0.0)

            entry = entries[0]
            # Check for error entry
            entry_id = entry.find("atom:id", ns).text
            if "api/errors" in entry_id:
                return VerificationResult("HALLUCINATED", 0.9, "arxiv", None, 0.0)

            actual_title = entry.find("atom:title", ns).text
            actual_title = re.sub(r'\s+', ' ', actual_title).strip()

            similarity = title_similarity(expected_title, actual_title)

            if similarity >= 0.80:
                return VerificationResult("VERIFIED", similarity, "arxiv", actual_title, similarity)
            # The arXiv ID resolved, so the record exists. Any title similarity
            # below 0.80 is therefore SUSPICIOUS rather than HALLUCINATED.
            return VerificationResult("SUSPICIOUS", similarity, "arxiv", actual_title, similarity)
    except Exception as e:
        print(f"arXiv error: {e}")
        return None

async def check_semantic_scholar(expected_title: str) -> VerificationResult | None:
    """Layer 3b: Semantic Scholar fallback."""
    try:
        url = "https://api.semanticscholar.org/graph/v1/paper/search"
        params = {
            "query": expected_title,
            "limit": 5,
            "fields": "title"
        }

        async with httpx.AsyncClient() as client:
            resp = await client.get(url, params=params, timeout=10)
            resp.raise_for_status()

            data = resp.json()
            if not data.get("data"):
                return VerificationResult("HALLUCINATED", 0.7, "s2", None, 0.0)

            # Find best match
            best_sim = 0.0
            best_title = None
            for paper in data["data"]:
                actual_title = paper.get("title") or ""  # Guard against a null title
                sim = title_similarity(expected_title, actual_title)
                if sim > best_sim:
                    best_sim = sim
                    best_title = actual_title

            if best_sim >= 0.80:
                return VerificationResult("VERIFIED", best_sim, "s2", best_title, best_sim)
            elif best_sim >= 0.50:
                return VerificationResult("SUSPICIOUS", best_sim, "s2", best_title, best_sim)
            else:
                return VerificationResult("HALLUCINATED", 0.7, "s2", best_title, best_sim)
    except Exception as e:
        print(f"Semantic Scholar error: {e}")
        return None

async def verify_citation(
    title: str,
    doi: str | None = None,
    arxiv_id: str | None = None
) -> VerificationResult:
    """4-layer verification with caching."""

    # Check cache
    cache_key = hashlib.sha256(normalize_title(title).encode()).hexdigest()
    cache_file = CACHE_DIR / f"{cache_key}.json"
    if cache_file.exists():
        data = json.loads(cache_file.read_text())
        if data["status"] != "SKIPPED":  # Don't cache network failures
            return VerificationResult(**data)

    # Layer 1: arXiv ID lookup
    if arxiv_id:
        result = await check_arxiv(arxiv_id, title)
        if result and result.status != "SKIPPED":
            cache_file.write_text(json.dumps(result.__dict__))
            return result

    # Layer 2: DOI resolution via CrossRef
    if doi:
        result = await check_crossref(doi, title)
        if result and result.status != "SKIPPED":
            cache_file.write_text(json.dumps(result.__dict__))
            return result

    # Layer 3a: OpenAlex title search
    result = await check_openalex(title)
    if result and result.status != "SKIPPED":
        cache_file.write_text(json.dumps(result.__dict__))
        return result

    # Layer 3b: Semantic Scholar fallback
    result = await check_semantic_scholar(title)
    if result and result.status != "SKIPPED":
        cache_file.write_text(json.dumps(result.__dict__))
        return result

    # All layers failed
    return VerificationResult("SKIPPED", 0.0, "none", None, 0.0)

# CLI for testing
if __name__ == "__main__":
    import asyncio
    import sys

    async def main():
        if len(sys.argv) < 2:
            print("Usage: python citation_verifier.py 'Paper Title' [--doi DOI] [--arxiv ARXIV_ID]")
            sys.exit(1)

        title = sys.argv[1]
        doi = None
        arxiv_id = None

        for i, arg in enumerate(sys.argv[2:]):
            if arg == "--doi" and i+3 < len(sys.argv):
                doi = sys.argv[i+3]
            elif arg == "--arxiv" and i+3 < len(sys.argv):
                arxiv_id = sys.argv[i+3]

        result = await verify_citation(title, doi, arxiv_id)
        print(f"Status: {result.status}")
        print(f"Confidence: {result.confidence:.2%}")
        print(f"Source: {result.source}")
        print(f"Matched: {result.matched_title}")
        print(f"Similarity: {result.similarity:.2%}")

    asyncio.run(main())

Test:

bash
python [home-path] "Attention Is All You Need" --arxiv 1706.03762
# Expected: VERIFIED, arxiv, 0.90+ similarity

python [home-path] "Totally Fake Paper About Dragons"
# Expected: HALLUCINATED, none or low similarity
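
To make the thresholds concrete, here is a small offline check of the title-similarity metric. It reuses the `normalize_title` and `title_similarity` logic from the verifier above. `classify` is a hypothetical helper that applies the title-search thresholds, and the second candidate title is invented for illustration.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    title = re.sub(r"[^\w\s]", "", title.lower())
    return re.sub(r"\s+", " ", title).strip()


def title_similarity(a: str, b: str) -> float:
    """Shared words divided by the larger word-set size."""
    words_a = set(normalize_title(a).split())
    words_b = set(normalize_title(b).split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / max(len(words_a), len(words_b))


def classify(similarity: float) -> str:
    """Title-search thresholds: >=0.80 verified, >=0.50 suspicious."""
    if similarity >= 0.80:
        return "VERIFIED"
    if similarity >= 0.50:
        return "SUSPICIOUS"
    return "HALLUCINATED"


cited = "Attention Is All You Need"
for candidate in (
    "Attention is all you need.",
    "Attention Is All You Need for Vision Transformers",
    "Totally Fake Paper About Dragons",
):
    sim = title_similarity(cited, candidate)
    print(f"{sim:.3f} {classify(sim)}  {candidate}")
# 1.000 VERIFIED  Attention is all you need.
# 0.625 SUSPICIOUS  Attention Is All You Need for Vision Transformers
# 0.000 HALLUCINATED  Totally Fake Paper About Dragons
```

Because the denominator is the larger word set rather than the union, the score is never lower than true Jaccard similarity on the same word sets.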

---

KARL Recency Weighting

python
# Add to [home-path]

# Assumes rank_skills, Trajectory, and KARL_DIR are already defined in this module.
import json
from datetime import datetime, timezone

def get_skills_with_recency(prompt: str, cwd: str, top_k: int = 5) -> list[str]:
    """Rank skills with recency weighting (30-day half-life)."""

    # Existing vector/regex routing
    ranked_skills = rank_skills(prompt, cwd)  # Returns list[(skill, similarity)]

    # Add recency weighting
    now = datetime.now(timezone.utc)
    weighted = []

    for skill, similarity in ranked_skills:
        # Get trajectories for this skill
        trajectories = get_trajectories_by_skill(skill)
        if not trajectories:
            # No historical data, use neutral recency weight
            recency_weight = 1.0
        else:
            # Average trajectory age
            ages = [(now - t.timestamp).days for t in trajectories]
            avg_age_days = sum(ages) / len(ages)

            # Exponential decay: 30-day half-life, cap at 90 days
            if avg_age_days > 90:
                recency_weight = 0.1  # Ancient skills get minimal weight
            else:
                recency_weight = 2 ** (-avg_age_days / 30)

        final_score = similarity * recency_weight
        weighted.append((skill, final_score))

    # Return top K by final score
    weighted.sort(key=lambda x: x[1], reverse=True)
    return [skill for skill, _ in weighted[:top_k]]

def get_trajectories_by_skill(skill: str) -> list[Trajectory]:
    """Load all trajectories annotated with this skill."""
    trajectories = []

    # Load trajectory log
    log_path = KARL_DIR / "trajectory_log.jsonl"
    if not log_path.exists():
        return []

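    with open(log_path) as f:
        for line in f:
            if not line.strip():
                continue  # Skip blank lines in the JSONL log
            traj = json.loads(line)
            if traj.get("skill") == skill:
                timestamp = datetime.fromisoformat(traj["timestamp"])
                if timestamp.tzinfo is None:
                    # Assumption: naive timestamps are UTC. Without a tzinfo they
                    # cannot be subtracted from the timezone-aware `now` above.
                    timestamp = timestamp.replace(tzinfo=timezone.utc)
                trajectories.append(Trajectory(
                    session_id=traj["session_id"],
                    timestamp=timestamp,
                    skill=skill,
                    # ... other fields
                ))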

    return trajectories

Test:

bash
cd [home-path]
python -c "
from trajectory_bridge import get_skills_with_recency
skills = get_skills_with_recency('Deploy to production', '/home/user/project')
print('Skills with recency:', skills)
"

---

Final Assessment

What's worth stealing:

1. ✅ Citation verification (4-layer) — Prevents LLM hallucination in references
2. ✅ PIVOT/REFINE/PROCEED autonomy — Bounded retry loops with versioned rollbacks
3. ✅ KARL recency weighting — Temporal decay for skill routing

What's not novel:

  • Literature search (standard REST APIs)
  • Novelty assessment (basic inverse similarity)
  • Agent orchestration (sequential context passing, no debate/voting)
  • Sandbox execution (subprocess + AST validation)
  • Hardware detection (nvidia-smi + platform checks)
  • Quality gates (template ratio heuristic)
  • Knowledge base (YAML frontmatter + markdown)

Overall verdict: AutoResearchClaw is solid engineering with a few gems. The citation verification system is production-ready. The autonomous decision loop is elegant. The rest is well-executed but not groundbreaking.

Recommended action: Implement citation verifier and PIVOT/REFINE/PROCEED this week. Add KARL recency weighting next week. Skip the rest.

Source Anchor

evo-cube-output/autoresearchclaw-research.md