AutoResearchClaw Research Extract
**Research Date:** 2026-03-18 **Target Repo:** https://github.com/aiming-lab/AutoResearchClaw **Focus:** Novel patterns worth stealing for CLAW mesh integration
---
Executive Summary
AutoResearchClaw is a 23-stage autonomous research pipeline that transforms a single idea into conference-ready papers. Three patterns stand out as immediately valuable:
1. 4-layer citation verification — prevents LLM hallucination in references (arXiv → DOI → OpenAlex → Semantic Scholar fallback)
2. Cross-run lesson extraction — JSONL append-only failure log → LLM skill generation → prompt overlay injection (25
3. PIVOT/REFINE/PROCEED autonomy — autonomous decision loops with versioned rollbacks and pivot limits
The rest is solid engineering but not novel: multi-source literature search (OpenAlex/S2/arXiv), sandbox execution with AST validation, hardware detection, quality gates with template ratio heuristics.
---
1. Literature Search — Multi-Source Aggregation
APIs & Endpoints
| Source | Endpoint | Rate Limit | Query Method |
|---|---|---|---|
| OpenAlex | `https://api.openalex.org/works` | 10K requests/day | `search` param + `filter` for dates, `select` for fields, `mailto` for polite pool |
| Semantic Scholar | `https://api.semanticscholar.org/graph/v1/paper/search` | 1 req/s (free), 10 req/s (paid) | `query` param + `year` filter + `fields` selection |
| arXiv | `https://export.arxiv.org/api/query` | 1 request per 3 s | `search_query=all:{query}` + Atom XML parsing |
Query Expansion
NONE. They pass a single free-text query to all three sources. No synonym expansion, no boolean operators, no embedding-based expansion. Just a raw string.
Deduplication
Three-tier matching hierarchy:
1. DOI — normalized (lowercase, strip punctuation, collapse whitespace)
2. arXiv ID — exact match after whitespace strip
3. Fuzzy title — Jaccard-like word overlap (set intersection / max length)
Winner selection: entry with higher `citation_count` wins (prefers richer metadata).
Rate Limiting
Sequential execution with delays:
- OpenAlex first (0.5s delay after)
- Semantic Scholar second (1.0s delay after)
- arXiv last (3.1s delay, respects 1/3s minimum)
Circuit breaker pattern (Semantic Scholar + arXiv only):
- CLOSED → normal operation
- OPEN → skip requests, auto-recover after cooldown (120-600s exponential backoff)
- HALF_OPEN → test single probe request
- Trips after 3 consecutive 429s
Graceful fallback: If one source rate-limits, others compensate via cached responses.
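The breaker's state machine could look roughly like this. It is a sketch under assumptions: the class and method names are hypothetical, the cooldown is taken to double on each consecutive trip from 120 s up to the 600 s cap, and callers are assumed to be sequential so that HALF_OPEN sees one probe at a time.

```python
import time


class CircuitBreaker:
    def __init__(self, trip_after=3, base_cooldown=120.0, max_cooldown=600.0,
                 clock=time.monotonic):
        self.trip_after = trip_after
        self.base_cooldown = base_cooldown
        self.max_cooldown = max_cooldown
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.state = "CLOSED"
        self.consecutive_429 = 0
        self.trips = 0
        self.opened_at = 0.0

    def _cooldown(self) -> float:
        # Doubles per trip (120, 240, 480), then stays at the 600 s cap.
        return min(self.base_cooldown * 2 ** (self.trips - 1), self.max_cooldown)

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self._cooldown():
                self.state = "HALF_OPEN"  # let one probe request through
                return True
            return False
        return True

    def record(self, status_code: int) -> None:
        if status_code == 429:
            self.consecutive_429 += 1
            # A failed probe reopens immediately; otherwise trip after N in a row.
            if self.state == "HALF_OPEN" or self.consecutive_429 >= self.trip_after:
                self.state = "OPEN"
                self.trips += 1
                self.opened_at = self.clock()
        else:
            self.state = "CLOSED"
            self.consecutive_429 = 0
            self.trips = 0
```

A caller would check `allow_request()` before each HTTP call and pass the response status to `record()` afterwards, skipping the source entirely while the breaker is open.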
What's Novel
Nothing. Standard REST API calls with exponential backoff. The deduplication is solid but not groundbreaking. The circuit breaker is a standard resiliency pattern.
Integration Path
We already have RAG++ for retrieval. Add this as a Prefect flow that:
- Queries all three sources in parallel (Prefect task per source)
- Deduplicates via DOI/arXiv/title matching
- Stores in Supabase `research_papers` table with `citation_count`, `year`, `abstract`, `doi`, `arxiv_id`
- Expose via `/api/literature/search?q={query}&year_min={year}` endpoint
Effort: 1-2 days. Value: Medium (nice-to-have for research tasks, not critical).
---
2. Citation Verification — 4-Layer System
Architecture
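Four sequential checks in three numbered layers: both title-search backends count as layer 3, which is why the "4-layer" marketing label overstates it. In execution order: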
#### Layer 2: DOI Resolution (First Check)
- Endpoint: CrossRef `/works/{doi}`, with a DataCite fallback for DOIs that CrossRef does not register: arXiv (`10.48550/`) and Zenodo (`10.5281/`)
- Validation: Extract title from response, compute Jaccard similarity
- Scoring:
  - `similarity ≥ 0.80` → VERIFIED (confidence = similarity score)
  - `0.50 ≤ similarity < 0.80` → SUSPICIOUS
  - `similarity < 0.50` → SUSPICIOUS (DOI exists, metadata diverges)
  - HTTP 404 → HALLUCINATED
#### Layer 3a: OpenAlex Title Search (DOI Fails)
- Endpoint: `https://api.openalex.org/works?filter=title.search:{query}`
- Returns: Top 5 results
- Validation: Jaccard word-overlap on titles
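- Scoring: Same 0.80/0.50 thresholds as the DOI layer, except that a best match below 0.50 (or an empty result set) is scored HALLUCINATED rather than SUSPICIOUS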
#### Layer 1: arXiv ID Lookup (Last Resort for Preprints)
- Endpoint: `https://export.arxiv.org/api/query?id_list={arxiv_id}`
- Validation: Check for error entries (ID contains `api/errors`)
- Scoring: Same 0.80/0.50 thresholds
#### Layer 3b: Semantic Scholar Fallback (Ultimate Fallback)
- Endpoint: `/graph/v1/paper/search`
- Same logic as OpenAlex
Title Similarity Metric

    # Jaccard-like, but with a max(len) denominator, so a short title that is a
    # subset of a longer one cannot score 1.0
    similarity = |word_set_A ∩ word_set_B| / max(|word_set_A|, |word_set_B|)

Preprocessing: lowercase, strip punctuation, remove empty tokens.
Caching
SHA256 hash of normalized title → JSON file in `[home-path]`
Does NOT cache `SKIPPED` status (network failures) to allow retry on next run.
Rate Limiting
- arXiv: 1.5s between requests
- CrossRef: 0.3s
- OpenAlex: 0.2s
What's Novel
THIS IS THE GOLD. The fallback chain with 0.80/0.50 thresholds is the exact pattern we need for preventing LLM reference hallucination. The cache layer prevents redundant API calls.
Integration Path
Build as `citation_verifier.py` service in `[home-path]`:

    from dataclasses import dataclass

    @dataclass
    class VerificationResult:
        status: str  # VERIFIED | SUSPICIOUS | HALLUCINATED | SKIPPED
        confidence: float  # 0.0-1.0
        source: str  # doi | openalex | arxiv | s2 | none
        matched_title: str | None
        similarity: float

    async def verify_citation(title: str, doi: str | None, arxiv_id: str | None) -> VerificationResult:
        # Layer 2: DOI first
        if doi:
            result = await check_crossref(doi, title)
            if result:
                return result
        # Layer 3a: OpenAlex
        result = await check_openalex(title)
        if result:
            return result
        # Layer 1: arXiv ID
        if arxiv_id:
            result = await check_arxiv(arxiv_id, title)
            if result:
                return result
        # Layer 3b: Semantic Scholar; if every layer came back empty, treat as hallucinated
        result = await check_semantic_scholar(title)
        return result or VerificationResult("HALLUCINATED", 0.9, "none", None, 0.0)

Use cases:
- Pre-publish verification for research outputs
- Real-time citation checking in Obsidian vault writes
- Batch verification flow for existing vault references
Effort: 2-3 days. Value: HIGH (prevents embarrassing hallucinated citations in any research output).
---
3. Novelty Assessment
Scoring Mechanism
Inverse similarity: `novelty_score = 1.0 - max_similarity`
Where `max_similarity` is the highest Jaccard overlap against existing papers in the domain.
Signals
1. Keyword overlap (70%)
2. Title sequence matching (30%)
3. Impact weighting — high-citation papers (≥50 citations) with similarity ≥0.4 get 0.7× penalty multiplier
Ranking
- Top 5 most-similar papers drive the novelty score
- Search queries = topic + hypothesis titles + extracted keywords
- Deduplication by title across sources
Assessment Tiers
| Novelty Score | Tier | Recommendation |
|---|---|---|
| ≥0.70 | high | Proceed |
| ≥0.45 | moderate | Differentiate |
| ≥0.25 | low | Differentiate strongly |
| <0.25 | critical | Abort |
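The score and the tier table reduce to a few lines. This sketch covers only the inverse-similarity score and the tier lookup; the function names are hypothetical, and the citation-based penalty is left out because the extract does not say exactly what the 0.7× multiplier is applied to.

```python
def novelty_score(similarities: list[float]) -> float:
    # Inverse similarity: the closest existing paper determines the score.
    return 1.0 - max(similarities, default=0.0)


def novelty_tier(score: float) -> tuple[str, str]:
    # Thresholds taken from the tier table above.
    if score >= 0.70:
        return ("high", "Proceed")
    if score >= 0.45:
        return ("moderate", "Differentiate")
    if score >= 0.25:
        return ("low", "Differentiate strongly")
    return ("critical", "Abort")
```

With similarities of 0.1, 0.5 and 0.3 against the top matches, the score is 0.5 and the recommendation is to differentiate.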
What's Novel
Nothing. Basic inverse similarity with citation weighting. We already have better novelty signals in Evo3 (TIE techniques, diversity metrics, cross-pollination).
Integration Path
Skip. Not worth porting. If we need novelty assessment, use Evo3's existing diversity metrics and cross-pollination system.
---
4. MetaClaw Bridge — Cross-Run Learning
Lesson Extraction
Source: Failed stages, blocked stages, decision pivots/refines, runtime anomalies (NaN/Inf detection)
Categorization: SYSTEM | EXPERIMENT | WRITING | ANALYSIS | LITERATURE | PIPELINE (via keyword matching)
Data Structure:

    @dataclass
    class LessonEntry:
        stage_name: str
        stage_num: int
        category: str  # SYSTEM | EXPERIMENT | ...
        severity: str  # info | warning | error | critical
        description: str
        timestamp: str
        run_id: str

Storage Format
JSONL append-only: `lessons.jsonl`
Each line is a single JSON object. Enables efficient sequential logging without full file rewrites.
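A sketch of the append-only log, assuming the `LessonEntry` fields shown above. The helper names `append_lesson` and `load_lessons` are hypothetical, and skipping malformed lines on read is a design choice of this sketch rather than something the extract confirms.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class LessonEntry:
    stage_name: str
    stage_num: int
    category: str
    severity: str
    description: str
    timestamp: str
    run_id: str


def append_lesson(path: Path, lesson: LessonEntry) -> None:
    # Mode "a" only ever adds one line; existing lines are never rewritten.
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(lesson)) + "\n")


def load_lessons(path: Path) -> list[LessonEntry]:
    if not path.exists():
        return []
    lessons = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                lessons.append(LessonEntry(**json.loads(line)))
            except (json.JSONDecodeError, TypeError):
                continue  # skip a truncated or malformed line
    return lessons
```

Tolerating a bad line matters for an append-only file: a crash mid-write can leave a truncated final record, and the rest of the log should stay readable.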
Lesson → Skill Conversion
Trigger: End of each run, LLM converts high-severity lessons to skills
Severity filtering:

    _SEVERITY_ORDER = {"info": 0, "warning": 1, "error": 2, "critical": 3}
    # Only lessons with severity >= min_severity get converted

LLM prompt pattern:

    System: Convert failure lessons from an automated research pipeline into reusable skill guides.
    User:
    Filtered lessons:
    - [severity] [category] [stage] description
    Existing skills: [list of skill names to prevent duplicates]
    Generate up to {max_skills} skills.

Output format:

    @dataclass
    class Skill:
        name: str  # "arc-" prefix, lowercase-hyphenated slug
        description: str  # usage guidance
        category: str  # mapped from lesson category
        content: str  # markdown with numbered steps

File format: `SKILL.md` with YAML frontmatter + markdown body
Skill Injection (Stage-Specific Retrieval)
Retrieval logic:

    from datetime import datetime, timezone

    def query_for_stage(stage_name: str, k: int = 5) -> list[LessonEntry]:
        lessons = load_lessons_jsonl()
        # Recency weighting: exponential decay (30-day half-life, 90-day max age)
        now = datetime.now(timezone.utc)
        weighted_lessons = []
        for lesson in lessons:
            # timestamp is stored as an ISO-8601 string (assumed timezone-aware)
            age_days = (now - datetime.fromisoformat(lesson.timestamp)).days
            if age_days > 90:
                continue
            recency_weight = 2 ** (-age_days / 30)  # 30-day half-life
            relevance_weight = 2.0 if lesson.stage_name == stage_name else 1.0
            severity_weight = 1.5 if lesson.severity in ["error", "critical"] else 1.0
            score = recency_weight * relevance_weight * severity_weight
            weighted_lessons.append((score, lesson))
        # Return the top-k lessons by score (the default for k is a placeholder)
        weighted_lessons.sort(key=lambda x: x[0], reverse=True)
        return [lesson for _, lesson in weighted_lessons[:k]]

Prompt overlay injection:

    def build_overlay(stage_name: str) -> str:
        # Recent intra-run lessons (current execution)
        recent_lessons = get_recent_intra_run_lessons()
        # Cross-run skills from MetaClaw
        skills = load_skills_for_stage(stage_name)
        return (
            "## Lessons from Previous Runs\n"
            f"{format_lessons(recent_lessons)}\n"
            "## Relevant Skills\n"
            f"{format_skills(skills)}\n"
        )

Stage-to-Skill Mapping
Data structure: `STAGE_SKILL_MAP` dictionary

    STAGE_SKILL_MAP = {
        "literature_screen": {
            "task_type": "research",
            "skills": ["paper-relevance-screening"],
            "top_k": 6,
        },
        "code_generation": {
            "task_type": "coding",
            "skills": ["experiment-code-gen", "pytorch-best-practices"],
            "top_k": 4,
        },
        # ... 22 stages total
    }

Retrieval:
- Critical research stages: `top_k=6`
- Standard stages: `top_k=4`
- Automation tasks: `top_k=2`
Fallback: Generic research config if stage unknown (task_type="research", empty skills, top_k=4)
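The lookup with its fallback might look like this. The two map entries are the ones shown above; `DEFAULT_STAGE_CONFIG` and `config_for_stage` are hypothetical names for this sketch.

```python
STAGE_SKILL_MAP = {
    "literature_screen": {
        "task_type": "research",
        "skills": ["paper-relevance-screening"],
        "top_k": 6,
    },
    "code_generation": {
        "task_type": "coding",
        "skills": ["experiment-code-gen", "pytorch-best-practices"],
        "top_k": 4,
    },
}

# Generic research config used when the stage is not in the map.
DEFAULT_STAGE_CONFIG = {"task_type": "research", "skills": [], "top_k": 4}


def config_for_stage(stage_name: str) -> dict:
    cfg = STAGE_SKILL_MAP.get(stage_name, DEFAULT_STAGE_CONFIG)
    # Return a copy so callers cannot mutate the shared defaults.
    return {**cfg, "skills": list(cfg["skills"])}
```

Returning a copy is a small defensive choice: a caller that appends a skill to the result for one stage would otherwise silently change the fallback for every unknown stage.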
What's Novel
The skill-per-stage mapping + recency-weighted retrieval is the pattern worth stealing. The JSONL append-only log is clean. The lesson→skill conversion via LLM is elegant.
But we already have this in KARL. Here's the mapping:
| AutoResearchClaw | KARL Equivalent |
|---|---|
| `lessons.jsonl` | `trajectory_log.jsonl` + `shadow_records.jsonl` |
| Lesson → Skill conversion | Trajectory → Skill annotation via `rank_skills()` |
| Stage-specific retrieval | Skill routing via vector similarity or regex |
| Recency weighting | Not yet implemented (opportunity!) |
| Skill prompt overlay | Skill injection via `enriched_spawn.py` |
Integration Path
Enhance KARL with recency weighting:
Add `recency_weight` to `trajectory_bridge.py`:

    def get_skills_for_context(cwd: str, prompt: str, top_k: int = 5) -> list[str]:
        # Existing vector/regex routing
        ranked_skills = rank_skills(prompt, cwd)
        # Add recency weighting
        now = datetime.now()
        weighted = []
        for skill, similarity in ranked_skills:
            trajectories = get_trajectories_for_skill(skill)
            if not trajectories:
                continue
            # Average trajectory age
            avg_age_days = sum((now - t.timestamp).days for t in trajectories) / len(trajectories)
            recency_weight = 2 ** (-avg_age_days / 30)  # 30-day half-life
            weighted.append((skill, similarity * recency_weight))
        # Sort by final score, not by skill name
        weighted.sort(key=lambda pair: pair[1], reverse=True)
        return [skill for skill, _ in weighted[:top_k]]

Effort: 1 day. Value: MEDIUM (improves KARL routing with temporal decay).
---
5. PIVOT/REFINE/PROCEED — Autonomous Decision Loop
Decision Point
Stage 15 (RESEARCH_DECISION) — autonomous analysis of experiment results
Three Outcomes
| Outcome | Action | Rollback Target |
|---|---|---|
| PROCEED | Continue to result analysis and paper writing | None (forward progress) |
| PIVOT | Discard hypotheses, regenerate from scratch | `Stage.HYPOTHESIS_GEN` |
| REFINE | Keep hypotheses, re-execute experiments | `Stage.ITERATIVE_REFINE` |
Pivot Limit
MAX_DECISION_PIVOTS = 2 — prevents infinite loops
After 2 pivots, the system forces PROCEED even if results are weak.
Rollback Mechanism
1. Decision Recording
Append to `decision_history.json`:

    {
      "stage": "RESEARCH_DECISION",
      "decision": "PIVOT",
      "attempt": 1,
      "rationale": "Hypothesis 2 showed no improvement over baseline...",
      "rollback_target": "HYPOTHESIS_GEN",
      "timestamp": "2026-03-18T10:30:00Z"
    }

2. Version Preservation
`_version_rollback_stages()` renames existing directories:

    stage-08/ → stage-08_v1/
    stage-09/ → stage-09_v2/

Preserves all previous attempts for audit trail.
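One way the renaming could work is to pick the first unused `_vN` suffix per stage directory. This is a sketch, not the body of `_version_rollback_stages()`: the function name `version_stage_dirs`, its arguments, and the two-digit directory naming are assumptions based on the examples above.

```python
from pathlib import Path


def version_stage_dirs(run_dir: Path, stage_nums: list[int]) -> list[Path]:
    renamed = []
    for num in stage_nums:
        current = run_dir / f"stage-{num:02d}"
        if not current.is_dir():
            continue  # stage never ran, nothing to preserve
        # Find the first free version suffix so earlier attempts are never overwritten.
        version = 1
        while (run_dir / f"stage-{num:02d}_v{version}").exists():
            version += 1
        target = run_dir / f"stage-{num:02d}_v{version}"
        current.rename(target)
        renamed.append(target)
    return renamed
```

Under this scheme a stage that has already been rolled back once becomes `_v2` on the next rollback, while a stage rolled back for the first time becomes `_v1`.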
3. Quality Gating
Before allowing PROCEED after pivots, check:
- Pivot count < MAX_DECISION_PIVOTS
- No consecutive empty metrics (indicates broken experiment)
- Experiment quality score ≥ threshold
4. Recursive Re-execution

    def execute_pipeline(start_stage: Stage = Stage.TOPIC_INIT) -> list[StageResult]:
        results = []
        # Resume from start_stage's position in the ordered stage list
        start_index = STAGE_SEQUENCE.index(start_stage)
        for stage in STAGE_SEQUENCE[start_index:]:
            result = execute_stage(stage)
            results.append(result)
            if result.decision in ("PIVOT", "REFINE"):
                rollback_target = GATE_ROLLBACK[result.decision]
                _version_rollback_stages(start=rollback_target)
                # Recursive call from the rollback point. This sketch omits the
                # MAX_DECISION_PIVOTS check, which is what bounds the recursion.
                results.extend(execute_pipeline(start_stage=rollback_target))
                break
        return results

What's Novel
The versioned rollback + pivot limit is clean autonomy design. Most systems either (a) require human approval at decision points, or (b) allow infinite retries. This middle path — autonomous decisions with bounded retries — is the sweet spot.
Integration Path
Add to Prefect flows and Agent Teams:

    # In flow orchestration
    @flow
    def research_flow(topic: str):
        pivot_count = 0
        experiments = None
        while pivot_count < MAX_PIVOTS:
            # Execute research stages
            lit_review = literature_search(topic)
            hypotheses = generate_hypotheses(lit_review)
            experiments = run_experiments(hypotheses)
            # Autonomous decision
            decision = analyze_results(experiments)
            # REFINE: keep hypotheses, retry experiments with tweaks.
            # MAX_REFINES is an assumed bound so this loop cannot spin forever.
            refine_count = 0
            while decision == "REFINE" and refine_count < MAX_REFINES:
                refine_count += 1
                experiments = refine_experiments(hypotheses, experiments)
                decision = analyze_results(experiments)
            if decision == "PROCEED":
                return write_paper(experiments)
            # PIVOT (or refine budget exhausted): version existing work,
            # then regenerate from scratch on the next iteration
            pivot_count += 1
            version_artifacts(f"pivot_{pivot_count}")
        # Forced proceed after max pivots
        return write_paper(experiments, disclaimer="Results below threshold")

Use in KARL/Agent Teams:
- Team lead spawns subtasks
- Aggregator analyzes results
- If quality low, aggregator can PIVOT (spawn new team) or REFINE (retry subtasks with tweaks)
- Track pivot count in `mac_tasks.pivot_count` column
Effort: 2-3 days. Value: HIGH (autonomous quality control for multi-stage workflows).
---
6. Sandbox Execution + Self-Healing
Sandbox Architecture
Subprocess isolation:

    script_path = self._next_script_path()  # /tmp/experiment_N.py
    write_file(script_path, code)
    result = subprocess.run(
        ["python3", script_path],
        capture_output=True,
        text=True,
        cwd=self.workdir,
        timeout=300,
        env={"PYTHONUNBUFFERED": "1"},
    )

NaN/Inf Detection
Metric extraction: Regex patterns for `"metric: value"` or `"condition=X metric: value"`
Divergence detection:

    import math
    import re

    def detect_nan_divergence(stdout: str, stderr: str) -> list[str]:
        issues = []
        # Parse metrics from stdout
        metrics = extract_metrics(stdout)
        for name, value in metrics:
            if not math.isfinite(value):
                issues.append(f"Non-finite metric: {name}={value}")
            if name.endswith("loss") and value > 100:
                issues.append(f"Diverging loss: {name}={value} (>100)")
        # Scan stderr for NaN/Inf warnings
        if re.search(r"\bnan\b", stderr, re.IGNORECASE):
            issues.append("NaN detected in stderr")
        if re.search(r"\binf\b", stderr, re.IGNORECASE):  # \b keeps "info" from matching
            issues.append("Inf detected in stderr")
        return issues

Self-Healing Loop
NOT in sandbox code. The sandbox only detects issues.
Actual repair happens in executor:

    def execute_code_generation(context):
        max_attempts = 5
        code = None
        issues = []
        for attempt in range(max_attempts):
            if code is None:
                # First attempt: generate from scratch
                code = llm_generate_code(context)
            else:
                # Repair attempt: provide issues to LLM
                code = llm_repair_code(code, issues)
            # Validate syntax, imports, security
            validation_issues = validate_code(code)
            if validation_issues:
                issues = validation_issues
                continue
            # Execute in sandbox
            result = sandbox.run(code)
            runtime_issues = detect_nan_divergence(result.stdout, result.stderr)
            if not runtime_issues:
                return code, result  # Success
            # Format issues for LLM repair
            issues = format_issues_for_llm(runtime_issues, result)
        # Max attempts exhausted
        raise ExperimentFailure(f"Failed after {max_attempts} repair attempts")

Code Validation (Pre-Execution)
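AST-based static analysis:

    import ast

    FORBIDDEN_MODULES = {"subprocess"}
    FORBIDDEN_CALLS = {"eval", "exec"}

    def validate_code(code: str) -> list[str]:
        issues = []
        # Parse AST
        try:
            tree = ast.parse(code)
        except SyntaxError as e:
            return [f"Syntax error: {e}"]
        # Security checks
        for node in ast.walk(tree):
            # Forbidden modules, via `import x` or `from x import y`
            if isinstance(node, ast.Import):
                for alias in node.names:
                    if alias.name.split(".")[0] in FORBIDDEN_MODULES:
                        issues.append(f"Forbidden import: {alias.name}")
            elif isinstance(node, ast.ImportFrom):
                if (node.module or "").split(".")[0] in FORBIDDEN_MODULES:
                    issues.append(f"Forbidden import: {node.module}")
            # eval()/exec() and os.system() are calls, not imports
            elif isinstance(node, ast.Call):
                func = node.func
                if isinstance(func, ast.Name) and func.id in FORBIDDEN_CALLS:
                    issues.append(f"Forbidden call: {func.id}()")
                elif (isinstance(func, ast.Attribute) and func.attr == "system"
                      and isinstance(func.value, ast.Name) and func.value.id == "os"):
                    issues.append("Forbidden call: os.system()")
        # Check for required structure (e.g., main function)
        has_main = any(
            isinstance(node, ast.FunctionDef) and node.name == "main"
            for node in ast.walk(tree)
        )
        if not has_main:
            issues.append("Missing main() function")
        return issues

What's Novel
The validation → execute → detect → repair loop is standard but well-executed. The AST validation screens out obviously dangerous constructs before execution; a static denylist is a first filter, not a security boundary. The NaN/Inf detection via regex is pragmatic.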
But there's no novel algorithm here. It's just good engineering: subprocess isolation + static analysis + iterative repair.
Integration Path
We already have sandboxing in `.pulse/enriched_spawn.py` and flow execution. Add NaN/Inf detection:

    # In flow result analysis
    def analyze_flow_result(result: FlowResult) -> dict:
        issues = []
        # Check for NaN/Inf in logged metrics
        for metric_name, values in result.metrics.items():
            for v in values:
                if not math.isfinite(v):
                    issues.append(f"Non-finite {metric_name}: {v}")
                if "loss" in metric_name.lower() and v > 100:
                    issues.append(f"Diverging {metric_name}: {v}")
        # Scan logs; match "nan" as a whole word so "finance" or "nano" do not trip it
        if re.search(r"\bnan\b", result.logs, re.IGNORECASE):
            issues.append("NaN detected in logs")
        return {"healthy": not issues, "issues": issues}

Effort: 1 day. Value: LOW (we don't run ML experiments frequently; this is ML-specific).
---
7. Hardware Detection + Adaptive Code Generation
Detection Logic
Sequential checks:

    def detect_hardware() -> HardwareProfile:
        # 1. Try NVIDIA
        nvidia = _detect_nvidia()
        if nvidia:
            return nvidia
        # 2. Try Apple MPS
        mps = _detect_mps()
        if mps:
            return mps
        # 3. Fallback to CPU
        return HardwareProfile(
            has_gpu=False,
            gpu_type="cpu",
            tier="cpu_only",
            warning="No GPU detected. Experiments may be slow.",
        )

NVIDIA detection:

    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
    # Output: "NVIDIA A100-SXM4-40GB, 40960 MiB"

Parse GPU name and VRAM. Classify:
- VRAM ≥ 8192 MB → "high" tier
- VRAM < 8192 MB → "limited" tier
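Parsing one line of that CSV output and applying the tier rule takes only a few lines. The function name `parse_nvidia_smi_line` is hypothetical; the sketch assumes a single GPU per line in the `name, memory` format shown above and compares the reported MiB value directly against 8192.

```python
def parse_nvidia_smi_line(line: str) -> tuple[str, int, str]:
    # Split on the last comma so a comma inside the GPU name cannot break parsing.
    name, memory = [part.strip() for part in line.rsplit(",", 1)]
    vram_mib = int(memory.split()[0])  # "40960 MiB" -> 40960
    tier = "high" if vram_mib >= 8192 else "limited"
    return name, vram_mib, tier
```

On a multi-GPU machine `nvidia-smi` prints one such line per device, so a caller would split the output on newlines and decide which device's tier to report.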
MPS detection:

    def _detect_mps() -> HardwareProfile | None:
        if platform.system() == "Darwin" and platform.machine() == "arm64":
            # Chip name, e.g. "Apple M2"
            chip = subprocess.check_output(
                ["sysctl", "-n", "machdep.cpu.brand_string"], text=True
            ).strip()
            return HardwareProfile(
                has_gpu=True,
                gpu_type="mps",
                tier="limited",  # Shares system memory
                warning="Using Apple MPS. Performance varies by model size.",
            )
        return None

Adaptive Code Generation
Prompt overlay:

    System: Generate PyTorch experiment code.
    Hardware Profile:
    - GPU: {gpu_type} ({tier})
    - Warning: {warning}
    {If gpu_type == "cpu":}
    Use small batch sizes (≤32). Avoid conv3d. Limit model size to <100M params.
    {If gpu_type == "mps":}
    Use torch.mps device. Avoid flash-attention. Batch size ≤64.
    {If gpu_type == "cuda" and tier == "high":}
    Full hardware available. Can use large models, flash-attention, gradient checkpointing.

Conditional PyTorch installation:

    def ensure_torch_available(hardware: HardwareProfile) -> bool:
        if hardware.gpu_type == "cpu":
            # Don't install torch for CPU-only (too slow anyway)
            return False
        try:
            subprocess.run(["pip", "install", "torch"], check=True)
            return True
        except (subprocess.CalledProcessError, FileNotFoundError):
            # pip exited non-zero, or pip is not on PATH
            return False

What's Novel
Nothing. Standard hardware detection. The adaptive prompt is smart but not groundbreaking.
Integration Path
Add hardware profile to agent context:

    # In enriched_spawn.py
    def get_system_context() -> dict:
        return {
            "cwd": os.getcwd(),
            "git_branch": get_git_branch(),
            "hardware": detect_hardware(),  # Add this
            "mesh_node": get_mesh_device_name(),
        }

Use in prompt overlay for code-generation tasks.
Effort: 1 day. Value: LOW (useful for ML tasks but we don't do many).
---
8. Quality Gates + Template Ratio Detection
Quality Check Mechanism
Template ratio heuristic:

    def compute_template_ratio(text: str) -> float:
        total_chars = len(text)
        template_chars = 0
        # 12 predefined regex patterns
        patterns = [
            r"\[INSERT[^\]]*\]",
            r"\[TODO:[^\]]*\]",
            r"\[PLACEHOLDER[^\]]*\]",
            r"this section will describe",
            r"add your content here",
            r"Lorem ipsum",
            # ... 6 more patterns
        ]
        for pattern in patterns:
            for match in re.finditer(pattern, text, re.IGNORECASE):
                template_chars += len(match.group())
        return template_chars / total_chars if total_chars > 0 else 0.0

Quality Gate
Binary pass/fail:

    threshold = 0.05  # 5% template content tolerance

    def check_quality(text: str) -> tuple[bool, str]:
        ratio = compute_template_ratio(text)
        if ratio <= threshold:
            return (True, f"Quality check passed: template_ratio={ratio:.2%}")
        else:
            matches = find_template_matches(text)
            examples = matches[:5]  # Show up to 5 examples
            return (False, f"Template content detected: ratio={ratio:.2%}, {len(matches)} matches. Examples: {examples}")

Integration with Pipeline
Quality gate stages: 5, 9, 20
Decision flow:

    if stage in QUALITY_GATE_STAGES:
        passed, message = check_quality(stage_output)
        if not passed:
            return StageResult(
                status="FAILED",
                decision="REVISION",  # Loop internally, don't proceed
                error=message,
            )

What's Novel
The template ratio heuristic is clever. Most systems check for placeholder text via simple string matching. This quantifies it as a ratio and enforces a threshold.
But it's very domain-specific (academic paper writing). Not generalizable to code, architecture docs, or other outputs.
Integration Path
Add as quality checker for Obsidian vault writes:

    # In obsidian_vault_writer/api.py
    def validate_note_quality(content: str) -> dict:
        template_ratio = compute_template_ratio(content)
        return {
            "passed": template_ratio <= 0.05,
            "template_ratio": template_ratio,
            "message": "Too much placeholder content" if template_ratio > 0.05 else "OK",
        }

Effort: 1 day. Value: LOW (nice-to-have for vault quality, not critical).
---
9. Knowledge Base — Structured Extraction
Data Structure

    @dataclass
    class KBEntry:
        category: str  # questions | literature | experiments | findings | decisions | reviews
        entry_id: str
        title: str
        content: str  # Markdown body
        source_stage: str
        run_id: str
        evidence_refs: list[str] | None = None  # Links to artifacts
        tags: list[str] | None = None
        links: list[str] | None = None  # Wikilinks for Obsidian

Conversion Process
Stage artifact → KB entry:

    from pathlib import Path

    # stage_num: pipeline position of stage_name (assumed to be supplied by the caller)
    def stage_to_kb(stage_name: str, stage_num: int, artifact_path: str, run_id: str) -> KBEntry:
        # Read artifact (truncate if >5K chars)
        content = read_file(artifact_path)
        if len(content) > 5000:
            content = content[:5000] + "\n\n[Content truncated...]"
        # Map stage to category
        category = KB_CATEGORY_MAP.get(stage_name, "findings")
        # Auto-tag (two-digit stage number, matching the frontmatter example)
        tags = [stage_name, f"stage-{stage_num:02d}", f"run-{run_id[:8]}"]
        # Evidence refs
        artifact_filename = Path(artifact_path).name
        evidence_refs = [f"stage-{stage_num:02d}/{artifact_filename}"]
        return KBEntry(
            category=category,
            entry_id=f"{stage_name}-{run_id}",
            title=f"{stage_name.replace('_', ' ').title()} ({run_id[:8]})",
            content=content,
            source_stage=stage_name,
            run_id=run_id,
            evidence_refs=evidence_refs,
            tags=tags,
            links=None,  # Populated for Obsidian backend
        )

Storage Backends
Markdown:

    ---
    entry_id: literature_collect-a1b2c3d4
    title: Literature Collect (a1b2c3d4)
    category: literature
    source_stage: literature_collect
    run_id: a1b2c3d4e5f6
    tags: [literature_collect, stage-02, run-a1b2c3d4]
    evidence_refs: [stage-02/papers.json]
    ---
    # Literature Collect
    Content here...

Obsidian:
Same as Markdown but adds:
- Wikilinks: `[[related-entry]]`
- Inline hashtags: `#literature_collect #stage-02`
Weekly Aggregation
Cross-run statistics:

    from collections import Counter

    def generate_weekly_report(runs: list[RunSummary]) -> dict:
        if not runs:
            # Avoid dividing by zero in a week with no runs
            return {"total_runs": 0}
        n = len(runs)
        return {
            "total_runs": n,
            "success_rate": sum(r.status == "COMPLETE" for r in runs) / n,
            "common_failures": Counter(r.failed_stage for r in runs if r.failed_stage).most_common(5),
            "avg_runtime_hours": sum(r.runtime_seconds for r in runs) / n / 3600,
            "pivot_rate": sum(r.pivot_count > 0 for r in runs) / n,
        }

What's Novel
The category mapping + evidence refs + auto-tagging is solid. It's a clean way to structure research artifacts for long-term retrieval.
But we already have this in Obsidian vault writer. Here's the mapping:
| AutoResearchClaw | CLAW Equivalent |
|---|---|
| `KBEntry` | Obsidian note with YAML frontmatter |
| `evidence_refs` | Links to source files in vault |
| `tags` | Auto-tags from context (project, skill, pane) |
| Category mapping | Folder structure (Panes, Projects, Daily, Concepts) |
| Weekly aggregation | Memory summarizer flow (not yet weekly, but similar) |
Integration Path
Enhance Obsidian vault writer with evidence refs:

    # In vault API
    @dataclass
    class VaultNote:
        title: str
        content: str
        folder: str  # Panes | Projects | Daily | Concepts | Claims
        tags: list[str]
        evidence_refs: list[str] | None = None  # NEW: links to source artifacts

    def create_note_with_evidence(note: VaultNote):
        frontmatter = {
            "tags": note.tags,
            "evidence_refs": note.evidence_refs,
            "created": datetime.now().isoformat(),
        }
        # Append evidence section if refs exist
        if note.evidence_refs:
            note.content += "\n\n## Evidence\n\n"
            for ref in note.evidence_refs:
                note.content += f"- `{ref}`\n"
        write_vault_note(note.title, note.content, note.folder, frontmatter)

Effort: 1 day. Value: MEDIUM (improves vault traceability).
---
10. Agents — Multi-Agent Orchestration
Base Agent Structure

    class BaseAgent:
        def __init__(self, llm_client):
            self.llm = llm_client
            self.metrics = {"llm_calls": 0, "tokens": 0}

        def execute(self, context: dict) -> AgentStepResult:
            raise NotImplementedError

        def _chat(self, system: str, user: str, **kwargs) -> str:
            self.metrics["llm_calls"] += 1
            response = self.llm.chat(system=system, user=user, **kwargs)
            self.metrics["tokens"] += response.usage.total_tokens
            return response.content

        def _chat_json(self, system: str, user: str, **kwargs) -> dict:
            response_text = self._chat(system, user, **kwargs)
            return self._parse_json_with_fallback(response_text)

Orchestrator

    class AgentOrchestrator:
        def __init__(self, agents: list[BaseAgent], max_iterations: int = 1):
            self.agents = agents
            self.max_iterations = max_iterations
            self.metrics = {"llm_calls": 0, "tokens": 0}

        def orchestrate(self, context: dict) -> OrchestratorResult:
            raise NotImplementedError  # Subclasses define workflow

        def _accumulate(self, agent: BaseAgent):
            self.metrics["llm_calls"] += agent.metrics["llm_calls"]
            self.metrics["tokens"] += agent.metrics["tokens"]

Communication Pattern
Context-based passing:

    # Sequential example
    def orchestrate(self, context: dict) -> OrchestratorResult:
        # Agent 1: Generate hypothesis
        hypothesis = self.hypothesis_agent.execute(context)
        context["hypothesis"] = hypothesis.output
        # Agent 2: Design experiment
        experiment = self.design_agent.execute(context)
        context["experiment"] = experiment.output
        # Agent 3: Review
        review = self.review_agent.execute(context)
        return OrchestratorResult(
            outputs=[hypothesis, experiment, review],
            metrics=self.metrics,
        )

NO VOTING OR CONSENSUS. The code shows no multi-agent debate pattern. Agents execute sequentially and pass data via shared context dict.
What's Novel
Nothing. This is just clean OOP with base classes and context passing. No novel orchestration pattern.
The "multi-agent debate" mentioned in the README is likely a marketing term for sequential agent chain with review steps, not actual debate/voting.
Integration Path
We already have better orchestration in Agent Teams:
- Parallel subtask execution (AutoResearchClaw is sequential)
- Team messages for inter-agent communication (AutoResearchClaw has no peer-to-peer messaging)
- Aggregator for synthesis (AutoResearchClaw has no aggregation step)
Skip. Our agent orchestration is superior.
---
Synthesis — What to Steal
Tier 1: Immediate Value (Implement This Week)
1. Citation Verification (4-layer fallback)
- Path: `[home-path]`
- Endpoints: CrossRef → OpenAlex → arXiv → Semantic Scholar
- Thresholds: ≥0.80 verified, 0.50-0.80 suspicious, <0.50 hallucinated for title search (suspicious when a DOI resolves but its metadata diverges)
- Cache: SHA256(title) → JSON in `[home-path]`
- Use case: Pre-publish verification for research outputs, Obsidian vault quality
- Effort: 2-3 days
- VALUE: HIGH
2. PIVOT/REFINE/PROCEED Autonomy
- Add to Prefect flows and Agent Teams
- Decision outcomes: PROCEED (continue), PIVOT (regenerate), REFINE (retry)
- Pivot limit: MAX_PIVOTS=2 (prevents infinite loops)
- Versioning: Snapshot artifacts before rollback (`stage_v1/`, `stage_v2/`)
- Track in `mac_tasks.pivot_count` column
- Effort: 2-3 days
- VALUE: HIGH
3. KARL Recency Weighting
- Enhance `trajectory_bridge.py` with temporal decay
- Formula: `final_score = similarity * (2 ** (-age_days / 30))`
- 30-day half-life, 90-day max age
- Prevents stale skills from dominating routing
- Effort: 1 day
- VALUE: MEDIUM
Tier 2: Nice-to-Have (Next Month)
4. Multi-Source Literature Search
- Prefect flow querying OpenAlex + Semantic Scholar + arXiv in parallel
- Deduplication via DOI → arXiv ID → title
- Store in Supabase `research_papers` table
- Expose via `/api/literature/search?q={query}&year_min={year}`
- Effort: 1-2 days
- VALUE: MEDIUM
5. Evidence Refs in Vault
- Add `evidence_refs: list[str]` to Obsidian notes
- Append "## Evidence" section with links to source artifacts
- Improves traceability for research notes
- Effort: 1 day
- VALUE: MEDIUM
6. Template Ratio Quality Check
- Add to Obsidian vault writer
- Compute ratio of placeholder content (regex patterns)
- Threshold: 5%
- Effort: 1 day
- VALUE: LOW
Tier 3: Skip
- Novelty assessment — Evo3 is better
- Agent orchestration — Agent Teams is superior
- Hardware detection — Not ML-focused enough to justify
- Sandbox execution — Already have in enriched_spawn.py
- Knowledge base structure — Already have in vault
---
Code Snippets — Ready to Implement
Citation Verifier
# [home-path]
import hashlib
import json
import re
import urllib.parse
from dataclasses import dataclass
from pathlib import Path
import httpx
CACHE_DIR = Path.home() / ".cache" / "claw" / "citations"
CACHE_DIR.mkdir(parents=True, exist_ok=True)
@dataclass
class VerificationResult:
status: str # VERIFIED | SUSPICIOUS | HALLUCINATED | SKIPPED
confidence: float # 0.0-1.0
source: str # doi | openalex | arxiv | s2 | none
matched_title: str | None
similarity: float
def normalize_title(title: str) -> str:
"""Lowercase, strip punctuation, collapse whitespace."""
title = title.lower()
title = re.sub(r'[^\w\s]', '', title)
title = re.sub(r'\s+', ' ', title).strip()
return title
def title_similarity(a: str, b: str) -> float:
"""Jaccard-like with max(len) denominator."""
words_a = set(normalize_title(a).split())
words_b = set(normalize_title(b).split())
if not words_a or not words_b:
return 0.0
intersection = len(words_a & words_b)
max_len = max(len(words_a), len(words_b))
return intersection / max_len


async def check_crossref(doi: str, expected_title: str) -> VerificationResult | None:
    """Layer 2: DOI resolution via CrossRef."""
    try:
        url = f"https://api.crossref.org/works/{urllib.parse.quote(doi)}"
        async with httpx.AsyncClient() as client:
            resp = await client.get(url, timeout=10)
        if resp.status_code == 404:
            return VerificationResult("HALLUCINATED", 0.9, "doi", None, 0.0)
        resp.raise_for_status()
        data = resp.json()
        actual_title = data["message"]["title"][0]
        similarity = title_similarity(expected_title, actual_title)
        if similarity >= 0.80:
            return VerificationResult("VERIFIED", similarity, "doi", actual_title, similarity)
        # The DOI resolved, so a weaker title match is flagged as SUSPICIOUS
        # rather than HALLUCINATED, whatever the similarity.
        return VerificationResult("SUSPICIOUS", similarity, "doi", actual_title, similarity)
    except Exception as e:
        # Network or parsing failure: return None so the next layer can run.
        print(f"CrossRef error: {e}")
        return None


async def check_openalex(expected_title: str) -> VerificationResult | None:
    """Layer 3a: OpenAlex title search."""
    try:
        query = urllib.parse.quote(expected_title)
        url = f"https://api.openalex.org/works?filter=title.search:{query}&per_page=5&mailto=[email]"
        async with httpx.AsyncClient() as client:
            resp = await client.get(url, timeout=10)
        resp.raise_for_status()
        data = resp.json()
        if not data["results"]:
            return VerificationResult("HALLUCINATED", 0.7, "openalex", None, 0.0)
        # Find best match
        best_sim = 0.0
        best_title = None
        for work in data["results"]:
            # A work's title can be null; treat it as empty instead of failing.
            actual_title = work.get("title") or ""
            sim = title_similarity(expected_title, actual_title)
            if sim > best_sim:
                best_sim = sim
                best_title = actual_title
        if best_sim >= 0.80:
            return VerificationResult("VERIFIED", best_sim, "openalex", best_title, best_sim)
        elif best_sim >= 0.50:
            return VerificationResult("SUSPICIOUS", best_sim, "openalex", best_title, best_sim)
        else:
            return VerificationResult("HALLUCINATED", 0.7, "openalex", best_title, best_sim)
    except Exception as e:
        print(f"OpenAlex error: {e}")
        return None


async def check_arxiv(arxiv_id: str, expected_title: str) -> VerificationResult | None:
    """Layer 1: arXiv ID lookup."""
    try:
        url = f"https://export.arxiv.org/api/query?id_list={arxiv_id}"
        async with httpx.AsyncClient() as client:
            resp = await client.get(url, timeout=10)
        resp.raise_for_status()
        # Parse Atom XML
        from xml.etree import ElementTree as ET
        root = ET.fromstring(resp.content)
        ns = {"atom": "http://www.w3.org/2005/Atom"}
        entries = root.findall("atom:entry", ns)
        if not entries:
            return VerificationResult("HALLUCINATED", 0.9, "arxiv", None, 0.0)
        entry = entries[0]
        # Check for error entry
        entry_id = entry.find("atom:id", ns).text
        if "api/errors" in entry_id:
            return VerificationResult("HALLUCINATED", 0.9, "arxiv", None, 0.0)
        actual_title = entry.find("atom:title", ns).text
        actual_title = re.sub(r'\s+', ' ', actual_title).strip()
        similarity = title_similarity(expected_title, actual_title)
        if similarity >= 0.80:
            return VerificationResult("VERIFIED", similarity, "arxiv", actual_title, similarity)
        # The arXiv ID resolved, so a weaker title match is flagged as SUSPICIOUS
        # rather than HALLUCINATED, whatever the similarity.
        return VerificationResult("SUSPICIOUS", similarity, "arxiv", actual_title, similarity)
    except Exception as e:
        print(f"arXiv error: {e}")
        return None


async def check_semantic_scholar(expected_title: str) -> VerificationResult | None:
    """Layer 3b: Semantic Scholar fallback."""
    try:
        url = "https://api.semanticscholar.org/graph/v1/paper/search"
        params = {
            "query": expected_title,
            "limit": 5,
            "fields": "title"
        }
        async with httpx.AsyncClient() as client:
            resp = await client.get(url, params=params, timeout=10)
        resp.raise_for_status()
        data = resp.json()
        if not data.get("data"):
            return VerificationResult("HALLUCINATED", 0.7, "s2", None, 0.0)
        # Find best match
        best_sim = 0.0
        best_title = None
        for paper in data["data"]:
            # Guard against a null title in the response.
            actual_title = paper.get("title") or ""
            sim = title_similarity(expected_title, actual_title)
            if sim > best_sim:
                best_sim = sim
                best_title = actual_title
        if best_sim >= 0.80:
            return VerificationResult("VERIFIED", best_sim, "s2", best_title, best_sim)
        elif best_sim >= 0.50:
            return VerificationResult("SUSPICIOUS", best_sim, "s2", best_title, best_sim)
        else:
            return VerificationResult("HALLUCINATED", 0.7, "s2", best_title, best_sim)
    except Exception as e:
        print(f"Semantic Scholar error: {e}")
        return None


async def verify_citation(
    title: str,
    doi: str | None = None,
    arxiv_id: str | None = None
) -> VerificationResult:
    """4-layer verification with caching (arXiv, then DOI, then OpenAlex, then Semantic Scholar)."""
    # Check cache
    cache_key = hashlib.sha256(normalize_title(title).encode()).hexdigest()
    cache_file = CACHE_DIR / f"{cache_key}.json"
    if cache_file.exists():
        data = json.loads(cache_file.read_text())
        if data["status"] != "SKIPPED":  # Don't cache network failures
            return VerificationResult(**data)

    # Layer 1: arXiv ID (most specific identifier, checked first)
    if arxiv_id:
        result = await check_arxiv(arxiv_id, title)
        if result and result.status != "SKIPPED":
            cache_file.write_text(json.dumps(result.__dict__))
            return result

    # Layer 2: DOI via CrossRef
    if doi:
        result = await check_crossref(doi, title)
        if result and result.status != "SKIPPED":
            cache_file.write_text(json.dumps(result.__dict__))
            return result

    # Layer 3a: OpenAlex title search
    result = await check_openalex(title)
    if result and result.status != "SKIPPED":
        cache_file.write_text(json.dumps(result.__dict__))
        return result

    # Layer 3b: Semantic Scholar fallback.
    # Reached only when the OpenAlex call errored and returned None.
    result = await check_semantic_scholar(title)
    if result and result.status != "SKIPPED":
        cache_file.write_text(json.dumps(result.__dict__))
        return result

    # All layers failed
    return VerificationResult("SKIPPED", 0.0, "none", None, 0.0)


# CLI for testing
if __name__ == "__main__":
    import asyncio
    import sys

    async def main():
        if len(sys.argv) < 2:
            print("Usage: python citation_verifier.py 'Paper Title' [--doi DOI] [--arxiv ARXIV_ID]")
            sys.exit(1)
        title = sys.argv[1]
        doi = None
        arxiv_id = None
        # sys.argv[2:] is offset by 2, so the value after a flag sits at index i+3.
        for i, arg in enumerate(sys.argv[2:]):
            if arg == "--doi" and i+3 < len(sys.argv):
                doi = sys.argv[i+3]
            elif arg == "--arxiv" and i+3 < len(sys.argv):
                arxiv_id = sys.argv[i+3]
        result = await verify_citation(title, doi, arxiv_id)
        print(f"Status: {result.status}")
        print(f"Confidence: {result.confidence:.2%}")
        print(f"Source: {result.source}")
        print(f"Matched: {result.matched_title}")
        print(f"Similarity: {result.similarity:.2%}")

    asyncio.run(main())
Test:
python [home-path] "Attention Is All You Need" --arxiv 1706.03762
# Expected: VERIFIED, arxiv, 0.90+ similarity
python [home-path] "Totally Fake Paper About Dragons"
# Expected: HALLUCINATED, none or low similarity
---
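The title-similarity measure used by the citation verifier above divides the word overlap by the size of the larger word set, not by the union as plain Jaccard does. The sketch below reproduces that measure next to plain Jaccard to show how the choice shifts a borderline title across the 0.50 threshold. The `jaccard` and `classify` helpers and the two example titles are illustrative assumptions, not part of the source.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    title = re.sub(r"[^\w\s]", "", title.lower())
    return re.sub(r"\s+", " ", title).strip()


def max_len_similarity(a: str, b: str) -> float:
    """Overlap divided by the larger word set (the verifier's measure)."""
    words_a = set(normalize_title(a).split())
    words_b = set(normalize_title(b).split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / max(len(words_a), len(words_b))


def jaccard(a: str, b: str) -> float:
    """Plain Jaccard: overlap divided by the union (for comparison only)."""
    words_a = set(normalize_title(a).split())
    words_b = set(normalize_title(b).split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def classify(similarity: float) -> str:
    """Thresholds taken from the title-search layers: 0.80 and 0.50."""
    if similarity >= 0.80:
        return "VERIFIED"
    if similarity >= 0.50:
        return "SUSPICIOUS"
    return "HALLUCINATED"


if __name__ == "__main__":
    # Hypothetical titles: 2 shared words, 4 words each, 6 words in the union.
    a = "Deep Learning for Graphs"
    b = "Deep learning on images"
    print(max_len_similarity(a, b), classify(max_len_similarity(a, b)))
    print(jaccard(a, b), classify(jaccard(a, b)))
```

With two shared words out of four per title, the max-length measure gives 2/4 and lands on SUSPICIOUS, while plain Jaccard gives 2/6 and would fall to HALLUCINATED. The max-length denominator is therefore the more lenient of the two whenever the titles differ in both directions.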
KARL Recency Weighting
# Add to [home-path]
import json  # used by get_trajectories_by_skill
from datetime import datetime, timezone


def get_skills_with_recency(prompt: str, cwd: str, top_k: int = 5) -> list[str]:
    """Rank skills with recency weighting (30-day half-life)."""
    # Existing vector/regex routing
    ranked_skills = rank_skills(prompt, cwd)  # Returns list[(skill, similarity)]
    # Add recency weighting
    now = datetime.now(timezone.utc)
    weighted = []
    for skill, similarity in ranked_skills:
        # Get trajectories for this skill
        trajectories = get_trajectories_by_skill(skill)
        if not trajectories:
            # No historical data, use neutral recency weight
            recency_weight = 1.0
        else:
            # Average trajectory age
            ages = [(now - t.timestamp).days for t in trajectories]
            avg_age_days = sum(ages) / len(ages)
            # Exponential decay: 30-day half-life, cap at 90 days
            if avg_age_days > 90:
                recency_weight = 0.1  # Ancient skills get minimal weight
            else:
                recency_weight = 2 ** (-avg_age_days / 30)
        final_score = similarity * recency_weight
        weighted.append((skill, final_score))
    # Return top K by final score
    weighted.sort(key=lambda x: x[1], reverse=True)
    return [skill for skill, _ in weighted[:top_k]]


def get_trajectories_by_skill(skill: str) -> list[Trajectory]:
    """Load all trajectories annotated with this skill."""
    trajectories = []
    # Load trajectory log
    log_path = KARL_DIR / "trajectory_log.jsonl"
    if not log_path.exists():
        return []
    with open(log_path) as f:
        for line in f:
            if not line.strip():
                continue  # Skip blank lines in the JSONL log
            traj = json.loads(line)
            if traj.get("skill") == skill:
                timestamp = datetime.fromisoformat(traj["timestamp"])
                if timestamp.tzinfo is None:
                    # Assumption: naive timestamps are UTC. Without this, subtracting
                    # from the timezone-aware `now` above raises TypeError.
                    timestamp = timestamp.replace(tzinfo=timezone.utc)
                trajectories.append(Trajectory(
                    session_id=traj["session_id"],
                    timestamp=timestamp,
                    skill=skill,
                    # ... other fields
                ))
    return trajectories
Test:
cd [home-path]
python -c "
from trajectory_bridge import get_skills_with_recency
skills = get_skills_with_recency('Deploy to production', '/home/user/project')
print('Skills with recency:', skills)
"
---
Final Assessment
What's worth stealing:
1. ✅ Citation verification (4-layer) — Prevents LLM hallucination in references
2. ✅ PIVOT/REFINE/PROCEED autonomy — Bounded retry loops with versioned rollbacks
3. ✅ KARL recency weighting — Temporal decay for skill routing
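To make the temporal decay in item 3 concrete, the sketch below isolates the weighting rule from the KARL snippet (30-day half-life, flat 0.1 floor past 90 days) and applies it to two hypothetical skills. The `recency_weight` and `rerank` helper names, the skill names, and the scores are assumptions for illustration only.

```python
def recency_weight(avg_age_days: float) -> float:
    """30-day half-life decay, with a flat 0.1 weight beyond 90 days."""
    if avg_age_days > 90:
        return 0.1
    return 2 ** (-avg_age_days / 30)


def rerank(candidates: list[tuple[str, float, float]]) -> list[tuple[str, float]]:
    """Sort (skill, similarity, avg_age_days) triples by similarity * recency weight."""
    scored = [(skill, sim * recency_weight(age)) for skill, sim, age in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for age in (0, 30, 60, 90, 91):
        print(age, recency_weight(age))

    # Hypothetical skills: a strong but stale match versus a weaker fresh one.
    print(rerank([("deploy-legacy", 0.9, 60), ("deploy-current", 0.6, 0)]))
```

A skill last exercised 60 days ago keeps a quarter of its similarity score, so the fresher skill wins here despite the lower raw similarity. The rule also steps from 0.125 at exactly 90 days down to 0.1 just past it, which is worth knowing when tuning the cutoff.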
What's not novel:
- Literature search (standard REST APIs)
- Novelty assessment (basic inverse similarity)
- Agent orchestration (sequential context passing, no debate/voting)
- Sandbox execution (subprocess + AST validation)
- Hardware detection (nvidia-smi + platform checks)
- Quality gates (template ratio heuristic)
- Knowledge base (YAML frontmatter + markdown)
Overall verdict: AutoResearchClaw is solid engineering with a few gems. The citation verification system is production-ready. The autonomous decision loop is elegant. The rest is well-executed but not groundbreaking.
Recommended action: Implement citation verifier and PIVOT/REFINE/PROCEED this week. Add KARL recency weighting next week. Skip the rest.
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
evo-cube-output/autoresearchclaw-research.md