Grand Diomande Research Β· Full HTML Reader

MiniMax Fleet β€” Multi-Instance Agent Architecture

**Designed by:** Claw (the agent who'll be using it) **Date:** Feb 16, 2026 **Status:** 🟑 Instance 1 provisioning on Vast.ai

Agents That Account for Themselves architecture technical paper candidate score 54 .md

Full Public Reader

MiniMax Fleet β€” Multi-Instance Agent Architecture

Designed by: Claw (the agent who'll be using it)
Date: Feb 16, 2026
Status: 🟑 Instance 1 provisioning on Vast.ai

---

Design Philosophy

This is built from my perspective as the consumer β€” the AI agent that will route tasks to these instances. The architecture must be:

1. Modular β€” add/remove instances without touching the router
2. Extensible β€” new use cases plug in, don't require redesign
3. Multi-instance ready β€” scale from 1 to N GPUs seamlessly
4. Self-healing β€” detect down instances, failover automatically
5. Cost-aware β€” route to cheapest capable instance per task

---

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  CLAW (Orchestrator)                  β”‚
β”‚         Clawdbot Gateway + Dual Max Agents           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                       β”‚
                       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              MINIMAX ROUTER (Local)                   β”‚
β”‚                                                       β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚ Health   β”‚  β”‚ Task     β”‚  β”‚ Instance         β”‚    β”‚
β”‚  β”‚ Monitor  β”‚  β”‚ Classifierβ”‚  β”‚ Registry         β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚       β”‚            β”‚                  β”‚               β”‚
β”‚       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜               β”‚
β”‚                    β–Ό                                  β”‚
β”‚           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                          β”‚
β”‚           β”‚  Load Balancer β”‚                          β”‚
β”‚           β”‚  (round-robin/ β”‚                          β”‚
β”‚           β”‚   task-aware)  β”‚                          β”‚
β”‚           β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β–Ό           β–Ό           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚Instance 1β”‚ β”‚Instance 2β”‚ β”‚Instance Nβ”‚
β”‚ A100 80GBβ”‚ β”‚ H100 80GBβ”‚ β”‚ ...      β”‚
β”‚ $0.37/hr β”‚ β”‚ $X.XX/hr β”‚ β”‚          β”‚
β”‚ China    β”‚ β”‚ US-East  β”‚ β”‚          β”‚
β”‚          β”‚ β”‚          β”‚ β”‚          β”‚
β”‚ llama.cppβ”‚ β”‚ vLLM     β”‚ β”‚ llama.cppβ”‚
β”‚ :8080    β”‚ β”‚ :8080    β”‚ β”‚ :8080    β”‚
β”‚          β”‚ β”‚          β”‚ β”‚          β”‚
β”‚ MiniMax  β”‚ β”‚ MiniMax  β”‚ β”‚ Twin     β”‚
β”‚ M2.5     β”‚ β”‚ M2.5     β”‚ β”‚ LoRA    β”‚
β”‚ Q4_K_M   β”‚ β”‚ FP8      β”‚ β”‚          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

---

Components

1. Instance Registry (`[home-path]`)

Every instance self-registers. The router reads this to know what's available.

json
{
  "instances": [
    {
      "id": "mm-1",
      "name": "minimax-a100-cn",
      "vast_id": 31507265,
      "host": "ssh7.vast.ai",
      "port": 27264,
      "api_port": 8080,
      "api_url": "http://<tunnel>:8080/v1",
      "gpu": "A100_PCIE_80GB",
      "model": "MiniMax-M2.5-Q4_K_M",
      "quantization": "Q4_K_M",
      "cost_per_hr": 0.37,
      "location": "CN",
      "status": "online",
      "capabilities": ["coding", "agent", "office", "general", "twin-base"],
      "max_context": 131072,
      "last_health_check": null,
      "created_at": "2026-02-16T14:49:00Z"
    }
  ],
  "fallback": {
    "provider": "openrouter",
    "model": "minimax/minimax-m2.5",
    "cost_per_mtk_in": 0.30,
    "cost_per_mtk_out": 1.20
  }
}

2. Task Classifier

Routes tasks to the right instance based on requirements:

Use CaseContext NeededPriorityInstance Preference
Coding AgentHigh (full codebase)SpeedClosest region, highest VRAM
Cognitive TwinMedium (conversation)QualityAny (task not latency-sensitive)
Office AutomationLow-MediumSpeedAny
Agentic WorkflowsHigh (tool chains)ReliabilityUS-based (lower latency to APIs)
Bulk ProcessingLowCostCheapest instance
Quick ChatLowSpeedAny available

3. Health Monitor

Runs every 30 seconds per instance:
- `GET /health` β€” is the server responding?
- `POST /v1/chat/completions` β€” can it generate? (1-token test)
- Latency measurement
- GPU utilization (via Vast.ai API)
- Auto-remove dead instances from routing

4. Load Balancer

Strategies (configurable per task type):
- round-robin β€” default, distributes evenly
- least-loaded β€” route to instance with lowest queue depth
- task-aware β€” match task type to instance capabilities
- cost-optimized β€” prefer cheapest instance that meets quality bar
- latency-optimized β€” prefer geographically closest instance

5. Tunnel / Networking

Each Vast.ai instance exposes its API via:
- Option A: SSH tunnel (secure, simple) β€” `ssh -L 8080:localhost:8080 [email] -p 27264`
- Option B: Vast.ai port mapping β€” direct HTTP (requires open port)
- Option C: Cloudflare Tunnel β€” public URL with auth (best for multi-client)
- Option D: Tailscale β€” private mesh network (ideal for our setup)

Recommended: Option A for single instance, Option D when scaling.

---

Use Case Deployment Specs

UC1: Coding Agent

yaml
system_prompt: "You are a coding agent. Follow conventional commits..."
max_tokens: 65536
temperature: 0.1
context_window: 131072
tools: [file_read, file_write, shell_exec, git]

UC2: Cognitive Twin

yaml
system_prompt: "<loaded from Twin training data system prompt>"
max_tokens: 4096
temperature: 0.7
context_window: 32768
personality: "Mo's communication style, decisions, preferences"
# Future: LoRA adapter loaded on top of base M2.5

UC3: Office Automation

yaml
system_prompt: "Generate and manipulate office documents..."
max_tokens: 16384
temperature: 0.3
tools: [docx_gen, xlsx_gen, pptx_gen]

UC4: Agentic Workflows

yaml
system_prompt: "Execute multi-step tool chains..."
max_tokens: 32768
temperature: 0.2
tools: [web_browse, shell, code_exec, file_ops, api_call]

UC5: Bulk Processing

yaml
system_prompt: "<varies per batch>"
max_tokens: 8192
temperature: 0.0
batch_mode: true
# Training data generation, corpus processing, etc.

UC6: Quick Chat / General

yaml
system_prompt: "You are a helpful assistant."
max_tokens: 4096
temperature: 0.7

---

Scaling Playbook

Adding a New Instance

bash
# 1. Find a cheap GPU
vastai search offers 'gpu_name=A100_PCIE disk_space>=300 rentable=true' --order 'dph_total' --limit 5

# 2. Rent it
vastai create instance <OFFER_ID> --image "ghcr.io/ggml-org/llama.cpp:server" --disk 400 --ssh --direct

# 3. Deploy model (automated script)
[home-path] <INSTANCE_ID>

# 4. Register in fleet
[home-path] <INSTANCE_ID> <CAPABILITIES>

Removing an Instance

bash
# 1. Drain (stop routing new tasks)
[home-path] <INSTANCE_ID>

# 2. Wait for in-flight tasks to complete

# 3. Destroy
vastai destroy instance <INSTANCE_ID>

# 4. Deregister
[home-path] <INSTANCE_ID>

Cost Management

  • Budget alerts: Notify when spend exceeds $X/day
  • Auto-shutdown: Destroy instances during quiet hours (11PM-8AM) if not in use
  • Spot pricing: Monitor for cheaper instances, migrate if savings > 20

---

Integration with Clawdbot

As a Model Provider

Add to Clawdbot config as a custom OpenAI-compatible endpoint:

json
{
  "providers": {
    "minimax-fleet": {
      "type": "openai-compatible",
      "baseUrl": "http://localhost:8080/v1",
      "models": ["minimax-m2.5"],
      "priority": 2
    }
  }
}

As a Pulse Agent Backend

Pulse chains can target MiniMax for cost-sensitive tasks:

Phase 1 (Architecture): Claude Opus β†’ expensive but best
Phase 2 (Implementation): MiniMax M2.5 β†’ cheap, still great at coding
Phase 3 (Testing): MiniMax M2.5 β†’ bulk test generation
Phase 4 (Review): Claude Opus β†’ final quality check

As Twin Inference

Once we fine-tune on MiniMax:

Conversation β†’ Twin Router
  β”œβ”€β”€ Is this a "Mo-style" query? β†’ Twin LoRA on MiniMax
  └── Is this general? β†’ Base MiniMax M2.5

---

Immediate Deployment Plan

### Phase 0: First Instance (NOW)
1. βœ… Vast.ai instance rented (31507265)
2. ⏳ Wait for provisioning
3. SSH in β†’ download `unsloth/MiniMax-M2.5-GGUF` Q4_K_M
4. Start llama.cpp server on port 8080
5. SSH tunnel to localhost
6. Test with curl

### Phase 1: Router + Health (Day 1)
1. Create `[home-path]` directory
2. Write registry.json
3. Deploy router script (Python, lightweight)
4. Health monitor cron (every 30s)

### Phase 2: Clawdbot Integration (Day 1-2)
1. Add as custom model provider
2. Test from Clawdbot sessions
3. Route #quick channel to MiniMax for cost savings

### Phase 3: Multi-Instance (When needed)
1. Rent second instance (US-based for lower latency)
2. Load balancer activation
3. Specialized instances (one for coding, one for Twin)

### Phase 4: Twin Fine-Tuning (Week 1-2)
1. Convert 38K SFT dataset to MiniMax format
2. Fine-tune LoRA on MiniMax M2.5 base
3. Deploy Twin adapter alongside base model
4. A/B test Twin vs base

---

Files

[home-path]
β”œβ”€β”€ ARCHITECTURE.md          # This file
β”œβ”€β”€ registry.json            # Instance registry
β”œβ”€β”€ router.py                # Task router + load balancer
β”œβ”€β”€ health.py                # Health monitor
β”œβ”€β”€ deploy.sh                # Auto-deploy model to new instance
β”œβ”€β”€ register.sh              # Register instance in fleet
β”œβ”€β”€ drain.sh                 # Graceful instance removal
β”œβ”€β”€ deregister.sh            # Remove from registry
β”œβ”€β”€ tunnel.sh                # SSH tunnel manager
└── logs/
    └── fleet.log            # Routing decisions, health events

---

This architecture is designed by the agent who'll use it. Every decision optimizes for my actual workflow: parallel task dispatch, cost-aware routing, and seamless scaling. β€” Claw 🦞

Promotion Decision

Promote into a technical note or architecture paper with implementation anchors.

Source Anchor

minimax-fleet/ARCHITECTURE.md

Detected Structure

Method Β· Evaluation Β· References Β· Code Anchors Β· Architecture