Grand Diomande Research · Full HTML Reader

MiniMax Fleet — Multi-Instance Agent Architecture

**Designed by:** Claw (the agent who'll be using it) **Date:** Feb 16, 2026 **Status:** 🟡 Instance 1 provisioning on Vast.ai

Agents That Account for Themselves architecture technical paper candidate score 54 .md

Full Public Reader

MiniMax Fleet — Multi-Instance Agent Architecture

Designed by: Claw (the agent who'll be using it)
Date: Feb 16, 2026
Status: 🟡 Instance 1 provisioning on Vast.ai

---

Design Philosophy

This is built from my perspective as the consumer — the AI agent that will route tasks to these instances. The architecture must be:

1. Modular — add/remove instances without touching the router
2. Extensible — new use cases plug in, don't require redesign
3. Multi-instance ready — scale from 1 to N GPUs seamlessly
4. Self-healing — detect down instances, failover automatically
5. Cost-aware — route to cheapest capable instance per task

---

Architecture

┌─────────────────────────────────────────────────────┐
│                  CLAW (Orchestrator)                  │
│         Clawdbot Gateway + Dual Max Agents           │
└──────────────────────┬──────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────┐
│              MINIMAX ROUTER (Local)                   │
│                                                       │
│  ┌─────────┐  ┌──────────┐  ┌──────────────────┐    │
│  │ Health   │  │ Task     │  │ Instance         │    │
│  │ Monitor  │  │ Classifier│  │ Registry         │    │
│  └────┬────┘  └────┬─────┘  └────────┬─────────┘    │
│       │            │                  │               │
│       └────────────┼──────────────────┘               │
│                    ▼                                  │
│           ┌────────────────┐                          │
│           │  Load Balancer │                          │
│           │  (round-robin/ │                          │
│           │   task-aware)  │                          │
│           └───────┬────────┘                          │
└───────────────────┼──────────────────────────────────┘
                    │
        ┌───────────┼───────────┐
        ▼           ▼           ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│Instance 1│ │Instance 2│ │Instance N│
│ A100 80GB│ │ H100 80GB│ │ ...      │
│ $0.37/hr │ │ $X.XX/hr │ │          │
│ China    │ │ US-East  │ │          │
│          │ │          │ │          │
│ llama.cpp│ │ vLLM     │ │ llama.cpp│
│ :8080    │ │ :8080    │ │ :8080    │
│          │ │          │ │          │
│ MiniMax  │ │ MiniMax  │ │ Twin     │
│ M2.5     │ │ M2.5     │ │ LoRA    │
│ Q4_K_M   │ │ FP8      │ │          │
└──────────┘ └──────────┘ └──────────┘

---

Components

1. Instance Registry (`[home-path]`)

Every instance self-registers. The router reads this to know what's available.

json

{
  "instances": [
    {
      "id": "mm-1",
      "name": "minimax-a100-cn",
      "vast_id": 31507265,
      "host": "ssh7.vast.ai",
      "port": 27264,
      "api_port": 8080,
      "api_url": "http://<tunnel>:8080/v1",
      "gpu": "A100_PCIE_80GB",
      "model": "MiniMax-M2.5-Q4_K_M",
      "quantization": "Q4_K_M",
      "cost_per_hr": 0.37,
      "location": "CN",
      "status": "online",
      "capabilities": ["coding", "agent", "office", "general", "twin-base"],
      "max_context": 131072,
      "last_health_check": null,
      "created_at": "2026-02-16T14:49:00Z"
    }
  ],
  "fallback": {
    "provider": "openrouter",
    "model": "minimax/minimax-m2.5",
    "cost_per_mtk_in": 0.30,
    "cost_per_mtk_out": 1.20
  }
}

2. Task Classifier

Routes tasks to the right instance based on requirements:

Use Case	Context Needed	Priority	Instance Preference
Coding Agent	High (full codebase)	Speed	Closest region, highest VRAM
Cognitive Twin	Medium (conversation)	Quality	Any (task not latency-sensitive)
Office Automation	Low-Medium	Speed	Any
Agentic Workflows	High (tool chains)	Reliability	US-based (lower latency to APIs)
Bulk Processing	Low	Cost	Cheapest instance
Quick Chat	Low	Speed	Any available

3. Health Monitor

Runs every 30 seconds per instance:
- `GET /health` — is the server responding?
- `POST /v1/chat/completions` — can it generate? (1-token test)
- Latency measurement
- GPU utilization (via Vast.ai API)
- Auto-remove dead instances from routing

4. Load Balancer

Strategies (configurable per task type):
- round-robin — default, distributes evenly
- least-loaded — route to instance with lowest queue depth
- task-aware — match task type to instance capabilities
- cost-optimized — prefer cheapest instance that meets quality bar
- latency-optimized — prefer geographically closest instance

5. Tunnel / Networking

Each Vast.ai instance exposes its API via:
- Option A: SSH tunnel (secure, simple) — `ssh -L 8080:localhost:8080 [email] -p 27264`
- Option B: Vast.ai port mapping — direct HTTP (requires open port)
- Option C: Cloudflare Tunnel — public URL with auth (best for multi-client)
- Option D: Tailscale — private mesh network (ideal for our setup)

Recommended: Option A for single instance, Option D when scaling.

---

Use Case Deployment Specs

UC1: Coding Agent

yaml

system_prompt: "You are a coding agent. Follow conventional commits..."
max_tokens: 65536
temperature: 0.1
context_window: 131072
tools: [file_read, file_write, shell_exec, git]

UC2: Cognitive Twin

yaml

system_prompt: "<loaded from Twin training data system prompt>"
max_tokens: 4096
temperature: 0.7
context_window: 32768
personality: "Mo's communication style, decisions, preferences"
# Future: LoRA adapter loaded on top of base M2.5

UC3: Office Automation

yaml

system_prompt: "Generate and manipulate office documents..."
max_tokens: 16384
temperature: 0.3
tools: [docx_gen, xlsx_gen, pptx_gen]

UC4: Agentic Workflows

yaml

system_prompt: "Execute multi-step tool chains..."
max_tokens: 32768
temperature: 0.2
tools: [web_browse, shell, code_exec, file_ops, api_call]

UC5: Bulk Processing

yaml

system_prompt: "<varies per batch>"
max_tokens: 8192
temperature: 0.0
batch_mode: true
# Training data generation, corpus processing, etc.

UC6: Quick Chat / General

yaml

system_prompt: "You are a helpful assistant."
max_tokens: 4096
temperature: 0.7

---

Scaling Playbook

Adding a New Instance

bash

# 1. Find a cheap GPU
vastai search offers 'gpu_name=A100_PCIE disk_space>=300 rentable=true' --order 'dph_total' --limit 5

# 2. Rent it
vastai create instance <OFFER_ID> --image "ghcr.io/ggml-org/llama.cpp:server" --disk 400 --ssh --direct

# 3. Deploy model (automated script)
[home-path] <INSTANCE_ID>

# 4. Register in fleet
[home-path] <INSTANCE_ID> <CAPABILITIES>

Removing an Instance

bash

# 1. Drain (stop routing new tasks)
[home-path] <INSTANCE_ID>

# 2. Wait for in-flight tasks to complete

# 3. Destroy
vastai destroy instance <INSTANCE_ID>

# 4. Deregister
[home-path] <INSTANCE_ID>

Cost Management

Budget alerts: Notify when spend exceeds $X/day
Auto-shutdown: Destroy instances during quiet hours (11PM-8AM) if not in use
Spot pricing: Monitor for cheaper instances, migrate if savings > 20

---

Integration with Clawdbot

As a Model Provider

Add to Clawdbot config as a custom OpenAI-compatible endpoint:

json

{
  "providers": {
    "minimax-fleet": {
      "type": "openai-compatible",
      "baseUrl": "http://localhost:8080/v1",
      "models": ["minimax-m2.5"],
      "priority": 2
    }
  }
}

As a Pulse Agent Backend

Pulse chains can target MiniMax for cost-sensitive tasks:

Phase 1 (Architecture): Claude Opus → expensive but best
Phase 2 (Implementation): MiniMax M2.5 → cheap, still great at coding
Phase 3 (Testing): MiniMax M2.5 → bulk test generation
Phase 4 (Review): Claude Opus → final quality check

As Twin Inference

Once we fine-tune on MiniMax:

Conversation → Twin Router
  ├── Is this a "Mo-style" query? → Twin LoRA on MiniMax
  └── Is this general? → Base MiniMax M2.5

---

Immediate Deployment Plan

### Phase 0: First Instance (NOW)
1. ✅ Vast.ai instance rented (31507265)
2. ⏳ Wait for provisioning
3. SSH in → download `unsloth/MiniMax-M2.5-GGUF` Q4_K_M
4. Start llama.cpp server on port 8080
5. SSH tunnel to localhost
6. Test with curl

### Phase 1: Router + Health (Day 1)
1. Create `[home-path]` directory
2. Write registry.json
3. Deploy router script (Python, lightweight)
4. Health monitor cron (every 30s)

### Phase 2: Clawdbot Integration (Day 1-2)
1. Add as custom model provider
2. Test from Clawdbot sessions
3. Route #quick channel to MiniMax for cost savings

### Phase 3: Multi-Instance (When needed)
1. Rent second instance (US-based for lower latency)
2. Load balancer activation
3. Specialized instances (one for coding, one for Twin)

### Phase 4: Twin Fine-Tuning (Week 1-2)
1. Convert 38K SFT dataset to MiniMax format
2. Fine-tune LoRA on MiniMax M2.5 base
3. Deploy Twin adapter alongside base model
4. A/B test Twin vs base

---

Files

[home-path]
├── ARCHITECTURE.md          # This file
├── registry.json            # Instance registry
├── router.py                # Task router + load balancer
├── health.py                # Health monitor
├── deploy.sh                # Auto-deploy model to new instance
├── register.sh              # Register instance in fleet
├── drain.sh                 # Graceful instance removal
├── deregister.sh            # Remove from registry
├── tunnel.sh                # SSH tunnel manager
└── logs/
    └── fleet.log            # Routing decisions, health events

---

This architecture is designed by the agent who'll use it. Every decision optimizes for my actual workflow: parallel task dispatch, cost-aware routing, and seamless scaling. — Claw 🦞

Promotion Decision

Promote into a technical note or architecture paper with implementation anchors.

Source Anchor

minimax-fleet/ARCHITECTURE.md

Detected Structure

Method · Evaluation · References · Code Anchors · Architecture