Grand Diomande Research · Full HTML Reader

Substack — The Ghost Agent Exorcism

I woke up this morning and ran my usual audit of the overnight AI pipeline. Everything looked fine. The dashboard was calm. No error alerts. The system hummed along exactly as designed.

Agents That Account for Themselves research note experiment writeup candidate score 22 .md

Full Public Reader

# Substack — The Ghost Agent Exorcism
Moment ID: 1473408896759627826

---

The Ghost Agent Exorcism

On silent failures, polite automation, and the bugs that look like features

---

I woke up this morning and ran my usual audit of the overnight AI pipeline. Everything looked fine. The dashboard was calm. No error alerts. The system hummed along exactly as designed.

Except it had dispatched 172 ghost agents into the void since midnight.

Every five minutes, like clockwork, my chain engine would wake up, select a task, spin up an agent, and dispatch it. The agent would fail. The system would note the failure, wait five minutes, and try again. No escalation. No alert. No log entry worth reading.

Just infinite, optimistic persistence.

---

The Archeology of Silent Failure

Here's the thing about automation bugs: the dangerous ones don't crash. They persist. They wrap their failures in retry logic and backoff algorithms. They consume resources quietly. They maintain the appearance of progress.

When I finally dug into the logs, I found not one bug, but five—each one masking the others, each one contributing to a system that looked healthy but was quietly burning API calls and compute cycles on nothing.

Bug 1: The Tunnel Vision Problem

The chain dispatch engine was designed to orchestrate multiple parallel build chains. Four different feature branches, each with its own CI pipeline, each ready to dispatch.

But the dispatcher only fired one chain per tick.

Four builds waiting? Only one would launch. The others sat patient, invisible, waiting for a turn that came far slower than designed.

Bug 2: The Infinite Retry Loop

One of my chains was configured to dispatch to tooling that didn't exist yet. A placeholder. A "we'll build this later" note that somehow made it into production config.

No circuit breaker. No "give up after N attempts." Just hope, every five minutes, forever.

Bug 3: The Bash Octal Trap

This one was beautiful in its obscurity.

My thread reporter script used bash associative arrays with task IDs like `8A-1` as keys. Perfectly reasonable, except: bash interprets any number starting with `0` or `8` as an octal literal when certain operations run.

The script would hit task `8A-1`, bash would try to interpret `8` as octal, find it invalid, and crash. Silently. No error message. The script just... stopped.

Bug 4: The Recovery Ceiling

Agent recovery was capped at 2 retries. Seemed reasonable during development. But real-world agents sometimes need 3, 4, 5 attempts as auth tokens expire, GPU instances cycle, and rate limits reset.

Two tries wasn't enough. Agents would fail twice, get marked as "unrecoverable," and disappear from the queue—even if the third attempt would have succeeded.

Bug 5: The Permission Wall

Discord bot tokens can't post directly to threads. I knew this. I'd even written code to handle it. But somewhere in a refactor, the thread-posting path started using the wrong API call.

Status updates were being sent. Discord was receiving them. They were just going into the void—a valid API response, no error, but no visible result.

---

The Meta-Lesson

Each of these bugs, in isolation, was minor. Fixable in 20 minutes. But together, they created something worse than a crash: they created the illusion of functionality.

The system was running. Tasks were dispatching. Retries were happening. Everything was fine.

Except nothing was actually completing.

Automation doesn't fail loudly. It fails politely.

It wraps failure in exponential backoff. It logs at DEBUG level instead of ERROR. It consumes resources while maintaining composure. The most dangerous bugs are the ones that look like features working slowly.

---

The Exorcism

The fixes were surgical:

Chain engine got true multi-chain dispatch (fan-out, not round-robin)
The non-existent tooling chain got a `.disabled` flag
Thread reporter was rewritten in Python (no bash octal traps)
Recovery limits went from 2 to 5
Thread posting got a queue-based system with proper permission handling

Total time: about 3 hours of investigation, 45 minutes of fixes.

But the real lesson isn't about the bugs. It's about the silence.

When your automation is suspiciously quiet, when everything seems to be working but results aren't appearing, when the dashboard is green but your inbox is empty—that's not peace.

That's a ghost party.

And sometimes, the job isn't building new features. It's exorcising the spirits that haunt your infrastructure.

---

Sometimes the most productive day is the one where you don't ship anything new—you just make sure what you already shipped is actually running.

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

content-pipeline/substack/2026-02-17-ghost-agent-exorcism.md

Detected Structure

Method · Evaluation · Figures · Architecture