Agent Harnesses: How LLMs Become Autonomous Software Engineers
A deep dive into OpenClaw, Claude Code, Codex, and the academic foundations powering today's coding agents — from ReAct loops to sandboxed execution and multi-agent coordination
Summary
Large language models can generate code, but they cannot do software engineering alone. The missing piece is the agent harness: the runtime system that gives a model eyes (file reading, code search), hands (file editing, shell execution), and a feedback loop (test results, linter output, error traces). This review maps the landscape of agent harnesses across three production systems and ten academic papers, tracing how the field evolved from the ReAct paper's think-act-observe loop (2022) through purpose-built coding agents like SWE-agent (2024) to today's full-featured harnesses: OpenClaw's persistent daemon architecture, Claude Code's permission-gated tool system with multi-agent teams, and Codex's sandboxed App Server with formal JSON-RPC protocols.
The central design tension is autonomy vs. safety. More capable harnesses give agents broader tool access and longer execution loops, but every additional capability is an additional attack surface. OpenClaw learned this the hard way with 190 security advisories in its first month. Claude Code and Codex take different approaches: Claude Code uses a layered permission model with optional human-in-the-loop confirmation, while Codex enforces OS-level sandboxing (macOS Seatbelt, Linux bubblewrap+seccomp) with a formal approval policy system. The academic literature contributes a parallel insight: SWE-agent showed that how the agent interacts with its tools (the Agent-Computer Interface) matters as much as the model itself, while Agentless demonstrated that simpler pipelines can match complex harnesses at a fraction of the cost.
Three architectural patterns dominate: (1) the ReAct loop — reason, act, observe, repeat — is universal across all systems; (2) context engineering — compaction, caching, hierarchical instruction files — is the primary scaling bottleneck; (3) multi-agent coordination — from MetaGPT's SOP-driven assembly lines to Claude Code's agent teams — is the frontier, with no consensus on the right abstraction.
Researcher Notes
The harness is the product, not the model. This is the single most important takeaway. OpenClaw, Claude Code, and Codex all use frontier LLMs as their reasoning engine, but the engineering that makes them useful — tool execution, permission enforcement, context management, sandbox isolation — lives entirely in the harness. The model proposes; the harness disposes. When OpenAI published 'Unrolling the Codex agent loop,' they explicitly separated 'the model' from 'the harness' as two distinct architectural components. When Anthropic's Claude Code source was leaked in March 2026, the 59.8MB codebase revealed a 46,000-line QueryEngine.ts orchestration layer — the model API call is a single function within it.
Context engineering is the real bottleneck, not model capability. Every harness has elaborate machinery for managing the context window: Claude Code has five fallback compaction strategies (time-based clearing, conversation summarization, session memory extraction, full history summarization, oldest-message truncation). Codex uses prompt caching to avoid quadratic data transfer costs and an encrypted compaction endpoint for unlimited conversation length. OpenClaw's 'Guard Context' step dynamically summarizes or truncates. The academic papers largely ignore this problem because they evaluate on single-turn benchmarks, but production harnesses spend more engineering effort on context management than on tool execution.
The security story is far from settled. OpenClaw's 190 security advisories in February 2026 were a wake-up call, but the underlying problem is fundamental: an agent harness gives an LLM access to a shell, a filesystem, and network resources. Every tool is a potential attack vector. The response has been layered defense: Codex uses OS-level sandboxing (Seatbelt on macOS, bubblewrap+seccomp on Linux) with a formal approval policy system. Claude Code uses a permission model where a second LLM call evaluates whether the user would approve each action. OpenClaw relies on community-built wrappers (IronClaw for WASM isolation, NemoClaw for kernel sandboxing). None of these are provably secure — the field is in the 'defense in depth' phase.
Multi-agent is the frontier, but the abstractions are immature. MetaGPT showed that SOP-driven role assignment works for structured workflows. AutoGen proved that conversational multi-agent systems can solve problems single agents cannot. But production harnesses are still experimenting: Claude Code's agent teams use file-based mailboxes and shared task lists with file locking — fundamentally a distributed systems problem being solved with filesystem primitives. Codex's subagent model is simpler (fan-out/fan-in with depth limits), trading coordination flexibility for reliability. The academic community hasn't converged on a theory of multi-agent coordination for coding tasks.
Agentless is the important counterpoint. While the industry races toward more complex harnesses, Xia et al.'s Agentless paper showed that a three-phase pipeline (localize → generate patch → filter with tests) achieves competitive results at $0.70/task with no agent loop at all. This is a healthy reminder that complexity must earn its keep. The best harness for a given task might be no harness at all.
Foundations: The ReAct Loop and Why Harnesses Exist
Every agent harness, from a 200-line script to OpenClaw's five-component daemon, implements some variant of the same core loop: the model reasons about what to do, acts by calling a tool, observes the result, and repeats. This pattern was formalized as ReAct by Yao et al. (ICLR 2023), though practitioners had been building ad-hoc versions since GPT-3.5 gained function-calling capabilities.
The insight behind ReAct is deceptively simple: interleaving reasoning traces with actions outperforms either alone. Pure reasoning (chain-of-thought) hallucinates when it lacks external information. Pure acting (tool calls without reasoning) makes incoherent sequences of actions. ReAct weaves them together: the model writes a thought ('I need to find the auth middleware file'), generates an action (search the codebase), observes the result (file found at src/middleware/auth.ts), reasons about what to do next ('The bug is in the token validation logic on line 47'), and acts again (edit the file). This think-act-observe cycle is now the beating heart of every production coding agent.
But ReAct alone is not enough to build a useful agent. The paper used simple text-based environments (web search, interactive fiction). Real software engineering requires: (1) tool execution — safely running shell commands, editing files, invoking compilers and test suites; (2) context management — keeping track of relevant code, conversation history, and project structure within a finite context window; (3) permission enforcement — preventing the agent from deleting production databases or pushing malicious code; (4) state persistence — maintaining memory across interactions. The agent harness is the engineering layer that provides all of this around the core ReAct loop.
CodeAct (Wang et al., ICML 2024) pushed the action space further by replacing JSON tool calls with executable Python code. Instead of the model outputting a structured function call that the harness routes to a predefined tool, the model writes arbitrary Python that runs in a persistent interpreter. This enables dynamic logic — loops, conditionals, error handling, self-debugging — that JSON schemas cannot express. CodeAct showed up to 20% improvement over JSON-based actions on agent benchmarks, and the pattern was adopted by OpenHands, one of the most prominent open-source coding agent platforms.
The evolution from ReAct to CodeAct to production harnesses follows a clear trajectory: give the model more expressive ways to interact with the environment, while building increasingly sophisticated safety guardrails around those interactions.
ReAct
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao — Princeton University, Google Research
ICLR 2023
ReAct: Synergizing Reasoning and Acting in Language Models
Interleaves reasoning traces with actions in a single LLM output loop. The model thinks, acts, observes the result, and repeats — grounding generation in external observations to reduce hallucination.
Key Innovation
Formalized the think-act-observe paradigm that now underlies every production agent harness, showing that combining reasoning and acting outperforms either alone
Limitations
- •
Original evaluation on relatively simple environments (web search, text games)
- •
No mechanism for managing long context or multi-file codebases
- •
Single-agent only — no coordination protocol
CodeAct
Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji — University of Illinois Urbana-Champaign, UC San Diego, MIT
ICML 2024
Executable Code Actions Elicit Better LLM Agents
Replaces JSON tool calls with executable Python code as the sole action type. The agent writes Python that runs in a persistent interpreter; stdout/stderr becomes the next observation.
Key Innovation
Code-based actions enable dynamic logic, loops, and self-debugging that JSON schemas cannot express — up to 20% improvement over JSON-based actions on agent benchmarks
Limitations
- •
Requires a persistent interpreter with security implications
- •
Python-centric — less natural for non-Python workflows
- •
Arbitrary code execution increases attack surface
Comparison
| System | Action Space | Reasoning | Environment | Key Contribution |
|---|---|---|---|---|
ReAct (2023) | Text-based tool calls | Interleaved thought traces | Web search, text games | Formalized the agent loop paradigm |
CodeAct (2024) | Executable Python code | Interleaved thought traces | Persistent Python interpreter | Showed code > JSON for agent actions |
Measuring Agents: SWE-bench, SWE-agent, and the Agent-Computer Interface
Before you can build a good harness, you need to measure what 'good' means. SWE-bench (Jimenez et al., ICLR 2024) established the gold standard: 2,294 real GitHub issues from 12 popular Python repositories. Given a codebase and an issue description, the agent must produce a patch that passes the repository's existing test suite. No synthetic tasks — every issue and test is real. At release, GPT-4 solved fewer than 2% of tasks; the benchmark instantly revealed how far models were from practical software engineering.
SWE-bench spawned an ecosystem of variants — SWE-bench Lite (300 tasks), SWE-bench Verified (500 human-validated tasks), and extensions to other languages — and became the universal yardstick for coding agent evaluation. Every harness in this review reports SWE-bench numbers: Codex at 72.1% (pass@1 on Verified), Claude Code at 80.9%, OpenHands at 53%.
SWE-agent (Yang et al., NeurIPS 2024) made the crucial observation that benchmark performance depends not just on the model but on the Agent-Computer Interface (ACI) — the set of tools and interaction patterns the harness provides. The authors designed a custom ACI with purpose-built commands: a windowed file viewer (showing 100 lines at a time with line numbers), a search tool that returns context around matches, and a structured edit command that validates syntax before applying changes. These tools are optimized for LLM consumption — concise, unambiguous, with structured error feedback.
The result was striking: ACI design mattered more than model size. SWE-agent achieved 12.5% on SWE-bench Full (SOTA at publication) and the gap between good and bad ACIs was larger than the gap between model generations. This finding has profound implications for harness engineering: the tools you give the model are as important as the model itself. Every production harness has internalized this lesson — Claude Code's ~40 tools with structured output formats, Codex's sandboxed filesystem utilities with streaming output, OpenClaw's skill-based tool injection system.
Agentless (Xia et al., 2024) provided the essential counterpoint. Rather than building a complex agent loop, Agentless uses a simple three-phase pipeline: (1) hierarchical localization (narrow from repository to file to function), (2) patch generation, (3) filtering candidates by running tests. No tool use, no iterative loop, no agent state. The result: 32% on SWE-bench Lite at $0.70 per task — competitive with agent-based systems that cost orders of magnitude more. Agentless forces the field to justify every layer of harness complexity: if a pipeline without an agent loop can match your agent, your harness is adding cost without adding capability.
OpenHands (Wang et al., ICLR 2025) synthesized these insights into a comprehensive open-source platform. Agents run inside sandboxed Docker containers with access to shell commands, file editing, web browsing, and APIs. The platform uses CodeAct as its action layer and supports 15 benchmarks for evaluation. OpenHands achieved #1 on SWE-bench Full (29%) and 53% on Verified at publication, demonstrating that combining a well-designed ACI with a robust execution sandbox and the code-action paradigm produces a highly capable open-source coding agent.
SWE-bench
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan — Princeton University
ICLR 2024
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
A benchmark of 2,294 real GitHub issues from 12 popular Python repositories. The agent receives a codebase and issue description and must produce a patch passing the repository's test suite.
Key Innovation
Established the gold-standard benchmark for end-to-end coding agent evaluation using real-world issues and tests, revealing that GPT-4 solved fewer than 2% at release
Limitations
- •
Python-only (extensions to other languages came later)
- •
Test suites vary in quality across repositories
- •
Pass@1 evaluation may undervalue agents that generate correct patches in multiple attempts
SWE-agent
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, Ofir Press — Princeton University
NeurIPS 2024
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
Argues that the Agent-Computer Interface (ACI) — the set of tools and interaction patterns — is as important as the model. Designs custom file viewing, searching, and editing commands optimized for LLM consumption.
Key Innovation
Demonstrated that ACI design matters more than model size for coding agent performance; custom tools with structured error feedback dramatically improve agent effectiveness
Limitations
- •
Custom ACI requires significant engineering effort per domain
- •
Evaluation focused on Python repositories
- •
Single-agent architecture — no built-in multi-file coordination
Agentless
Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, Lingming Zhang — University of Illinois Urbana-Champaign
arXiv 2024
Agentless: Demystifying LLM-based Software Engineering Agents
A three-phase pipeline — hierarchical localization, patch generation, test-based filtering — that solves coding tasks without an agent loop. No tool use, no iterative reasoning, no agent state.
Key Innovation
Showed that a non-agentic pipeline achieves 32% on SWE-bench Lite at $0.70/task, competitive with far more complex agent systems — forcing the field to justify harness complexity
Limitations
- •
No iterative debugging — if the initial patch is wrong, there is no recovery
- •
Localization accuracy is a hard ceiling on overall performance
- •
Cannot handle tasks requiring multi-step environmental interaction
OpenHands
Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Graham Neubig — University of Illinois Urbana-Champaign, All Hands AI, Carnegie Mellon University
ICLR 2025
OpenHands: An Open Platform for AI Software Developers as Generalist Agents
A comprehensive open-source platform where agents run inside sandboxed Docker containers with access to shell, file editing, web browsing, and APIs. Uses CodeAct as the action layer and supports 15 benchmarks.
Key Innovation
Unified platform covering the full agent lifecycle — sandboxed execution, event-stream state management, multi-turn interactions, and benchmark evaluation — achieving #1 on SWE-bench Full at publication
Limitations
- •
Docker dependency adds setup complexity
- •
Event-stream architecture can be memory-intensive for long sessions
- •
Community-driven — API stability varies
Comparison
| System | Type | SWE-bench Score | Key Insight | Cost/Task |
|---|---|---|---|---|
SWE-bench | Benchmark | GPT-4: <2% at release | Real-world issues are far harder than synthetic benchmarks | — (benchmark, not a system) |
SWE-agent | Agent + ACI | 12.5% Full (2024 SOTA) | Tool design matters more than model size | ~$2-4 |
Agentless | Pipeline (no agent loop) | 32% Lite | Simple pipelines can match complex agents | $0.70 |
OpenHands | Open platform | 29% Full, 53% Verified | CodeAct + sandbox + good ACI = strong open-source agent | ~$1-3 |
OpenClaw: The Persistent Daemon Agent
OpenClaw (originally Clawdbot, then briefly Moltbot) is the most popular open-source agent harness by GitHub stars (~247K as of March 2026). Created by Austrian developer Peter Steinberger, it takes a fundamentally different architectural approach from coding-focused agents: OpenClaw treats the agent as a persistent operating system process — a daemon that runs continuously, reacts to messages from multiple channels, and autonomously executes tasks on a schedule.
The Five-Component Architecture. OpenClaw's design has five clearly delineated components:
-
Gateway — The central daemon process and single source of truth for sessions and routing. It normalizes inputs from multiple channels (Slack, Discord, WhatsApp, Telegram, etc.) into a unified internal message format, authenticates inbound messages, and proxies outbound API calls to the configured LLM provider.
-
Brain (Agent Runner) — The orchestration layer that implements a five-step pipeline: (a) select the correct agent for the incoming session, (b) resolve the cheapest model that satisfies context requirements (with automatic key rotation on rate limits), (c) build the prompt dynamically from SOUL.md, skills, memories, and tool definitions, (d) guard context by enforcing the model's context window limit via summarization or truncation, (e) execute the ReAct loop until the model produces a terminal response.
-
Memory — All persistent context is stored as plain Markdown files on local disk (~/.openclaw/). No embedded database, no vector store. The directory structure includes preferences.md, contacts.md, projects.md, learnings.md, and per-agent SOUL.md files. This is a deliberate trade-off: human-auditable and portable at the cost of retrieval sophistication.
-
Skills — OpenClaw's plugin/tool system. Each skill is a Markdown file with YAML frontmatter declaring a name, trigger pattern, and available tools. Skills are loaded at agent startup and injected into the system prompt. The community Skills Registry has grown to 5,700+ entries.
-
Heartbeat — The autonomy mechanism. Every 30 minutes (configurable), the Gateway sends the agent a scheduled trigger. The agent reads HEARTBEAT.md — a checklist of standing tasks — and decides whether any item requires action. If nothing needs doing, the agent responds with HEARTBEAT_OK and the reply is silently dropped. This built-in scheduling is unique among agent frameworks — competitors require external cron or workflow triggers.
The Serial Lane Queue. Messages within a session are processed one at a time through a Lane Queue. This prevents tool conflicts and keeps session history consistent, but limits throughput. Parallelism is opt-in and restricted to explicitly marked low-risk tasks. This is a significant architectural choice that prioritizes correctness over speed.
The Security Problem. OpenClaw's rapid growth exposed serious security issues. Between January 31 and February 25, 2026, 190 security advisories were filed — including combinations that compose into unauthenticated remote code execution paths. Four arXiv papers appeared in March 2026 analyzing the vulnerabilities. The flat permission model (agents have broad tool access by default) creates substantial attack surface when LLM reasoning is connected to host execution. The ecosystem responded with third-party hardening wrappers: IronClaw (Rust/WASM isolation by NEAR AI), NemoClaw (kernel-level sandboxing by NVIDIA), and KubeClaw (Kubernetes JIT RBAC). These are downstream fixes, not upstream architecture changes.
Identity via SOUL.md. OpenClaw uses a Markdown file called SOUL.md as an identity layer injected into every system prompt. The official docs distinguish it from a system prompt: 'System prompts tell models what to do; soul files tell them who to be.' Companion files handle style (STYLE.md), operating modes (SKILL.md), and session memory (MEMORY.md).
OpenClaw
Peter Steinberger — Independent (now OpenAI)
Open-source (MIT license), released November 2025
Open-source project (no published paper)
A persistent daemon agent framework with five components: Gateway (routing), Brain (ReAct loop), Memory (Markdown files), Skills (plugin system), and Heartbeat (autonomous scheduling). Runs as a background process reacting to messages and scheduled triggers.
Key Innovation
First major framework to treat the agent as a persistent OS process with built-in heartbeat scheduling and multi-channel message routing — other frameworks require external infrastructure for scheduling and message delivery
Limitations
- •
Flat permission model led to 190 security advisories in one month
- •
Markdown-based memory lacks retrieval sophistication (no vector search)
- •
Serial lane queue limits throughput
- •
Multi-agent coordination requires third-party wrappers
IronClaw
NEAR AI team — NEAR AI
Open-source, 2026
Open-source project by NEAR AI (no published paper)
OpenClaw-inspired agent framework rewritten in Rust with all tools running in WASM containers using capability-based permissions. Zero telemetry by design.
Key Innovation
WASM-based tool isolation with capability-based permissions — each tool runs in its own sandbox with explicitly granted capabilities, preventing the lateral movement attacks that plagued OpenClaw
Limitations
- •
Smaller ecosystem than OpenClaw (fewer skills, less community support)
- •
WASM sandbox adds overhead for I/O-heavy tools
- •
Newer and less battle-tested
Comparison
| System | Language | Sandboxing | Scheduling | Multi-Agent | Community Skills |
|---|---|---|---|---|---|
OpenClaw | Node.js | Flat permissions (third-party wrappers available) | Built-in Heartbeat (30 min default) | Channel-to-agent bindings; no peer-to-peer | 5,700+ |
IronClaw | Rust | WASM containers with capability-based permissions | Inherited from OpenClaw pattern | Similar to OpenClaw | Growing |
NemoClaw | Python wrapper | Kernel-level (OS-level, not container) | Wraps OpenClaw's Heartbeat | Wraps OpenClaw | OpenClaw-compatible |
Claude Code: Permission-Gated Tool Orchestration and Agent Teams
Claude Code is Anthropic's agentic coding system — a harness that wraps Claude models with ~40 tools, a layered permission model, and multi-agent coordination capabilities. Released as a limited research preview in February 2025 and reaching general availability in May 2025, it rapidly became one of the most commercially successful AI products, surpassing $1B in annualized revenue by late 2025.
The Three-Phase Loop. Claude Code's core execution model iterates through three phases: (1) gather context — read files, search codebases, inspect error logs; (2) take action — edit files, run commands, call external services; (3) verify results — run tests, check outputs, course-correct. The orchestrator's behavior is defined in natural language in the system prompt, not in branching code logic — enabling behavioral iteration without redeployment.
Tool Architecture. Approximately 40 tools are organized into categories: file operations (Read, Write, Edit, Glob), search (Grep, pattern-based file search), execution (Bash), web (WebSearch, WebFetch), and orchestration (Agent for spawning subagents, AskUserQuestion). Read-only operations run concurrently; mutating operations execute serially to prevent conflicts. Each tool is independently sandboxed with configurable access controls. External tools connect via the Model Context Protocol (MCP), with tool definitions deferred by default to save context space — only tool names are loaded until a tool is actually used.
The Permission Model. This is Claude Code's most distinctive design choice. Four permission modes provide graduated autonomy: (1) Default — Claude asks before file edits and shell commands; (2) Auto-accept edits — file edits auto-approved, commands still require confirmation; (3) Plan mode — read-only tools only, generates a plan for user approval; (4) Auto mode (research preview) — a second, separate LLM call evaluates each proposed action to predict whether the user would approve it. This is architecturally significant: the safety evaluation is not the same inference that generated the action. It is a separate model call specifically for permission prediction.
Context Engineering. The QueryEngine (the core orchestration engine, revealed in a March 2026 source leak to be ~46,000 lines of TypeScript) manages five fallback context compaction strategies deployed sequentially: time-based clearing of outdated tool outputs, conversation summarization, session memory extraction, full history summarization, and oldest-message truncation. Project context enters via CLAUDE.md files (project-specific instructions loaded at session start) and an auto-memory system that saves learnings as work progresses.
Subagents. Claude Code can spawn subagent instances via the Agent tool. Each subagent receives its own system prompt, a restricted tool subset, and the project's CLAUDE.md, but does not receive the parent's conversation history or system prompt. Only the subagent's final message returns to the parent — intermediate tool calls stay isolated, preventing context contamination. Three execution models exist: Fork (byte-identical parent context copy), Teammate (separate terminal pane with file-based mailbox), and Worktree (git worktree with isolated branches). Subagents cannot spawn their own subagents, limiting recursion depth.
Agent Teams. The experimental multi-agent feature (enabled via feature flag) is architecturally distinct from subagents. A team lead spawns teammates — each a full Claude Code instance with its own context window — that coordinate through a shared task list (~/.claude/tasks/{team-name}/) and a file-based mailbox messaging system. Task claiming uses file locking to prevent race conditions. Teammates can be required to plan before implementing; the lead reviews and approves plans autonomously. This is fundamentally a distributed systems problem being solved with filesystem primitives — simple and pragmatic, but with known limitations around session resumption and task status consistency.
Hooks. External code can intercept the agent loop at 25+ lifecycle points: PreToolUse, PostToolUse, SessionStart, SessionEnd, UserPromptSubmit, SubagentStart, and more. Hooks can allow, deny, or defer tool calls, inject additional context, or modify tool inputs. When multiple hooks conflict, deny takes priority over ask, which takes priority over allow. This creates a deterministic interception layer that enables organizational policy enforcement without modifying the agent's core behavior.
Cowork. Claude Code Cowork brings the same harness architecture into Claude Desktop, running inside a sandboxed VM (Ubuntu 22.04, 4 CPU cores, 4GB RAM) using Apple's Virtualization framework. Multi-layer egress control (syscall blocking via gVisor, MITM proxy with per-boot CA certificates, domain allowlist) provides stronger isolation than the CLI. Rather than sandboxing a browser inside the VM, Cowork controls the host Chrome browser via a native messaging extension — providing real browsing with the user's actual cookies and sessions, but operating outside VM egress controls.
Claude Code
Anthropic — Anthropic
Commercial product, GA May 2025; open-source CLI (MIT license)
Commercial product by Anthropic (no published paper)
An agentic coding harness with ~40 tools, a four-tier permission model (including a separate LLM call for auto-approval evaluation), layered context compaction, and multi-agent coordination via subagents and agent teams.
Key Innovation
Permission model where a second LLM call evaluates each proposed action independently of the generating model; hooks system enabling deterministic organizational policy enforcement; three subagent execution models (fork, teammate, worktree)
Limitations
- •
No formally published wire protocol (unlike Codex's JSON-RPC spec)
- •
Agent teams are experimental with known limitations (no session resumption, no nested teams)
- •
Auto mode permission evaluation is non-deterministic
Claude Code Cowork
Anthropic — Anthropic
Commercial product, shipped January 2026
Commercial product by Anthropic (no published paper)
Brings the Claude Code harness architecture into Claude Desktop, running inside a sandboxed Linux VM with multi-layer egress control. Extends beyond coding to knowledge work: spreadsheets, presentations, research, reports.
Key Innovation
VM-level isolation with gVisor syscall blocking, per-boot ephemeral CA certificates for HTTPS inspection, and host browser control via native messaging extension — stronger isolation than CLI-based harnesses
Limitations
- •
Fixed VM resources (4 CPU cores, 4GB RAM)
- •
Ephemeral storage — no persistence across sessions
- •
Host browser control operates outside VM egress controls, creating a split security perimeter
Comparison
| Feature | Claude Code CLI | Claude Code Cowork |
|---|---|---|
Execution Environment | Host OS with permission gates | Sandboxed Ubuntu 22.04 VM |
Isolation | Per-tool permission model | VM boundary + gVisor + MITM proxy + domain allowlist |
Multi-Agent | Subagents + Agent Teams | Parent-child orchestration |
Browser Access | Via MCP tools | Host Chrome via native messaging extension |
Persistence | Session files on local disk | Ephemeral (formatted fresh each boot) |
Target Use Case | Software engineering in terminal | General knowledge work in desktop app |
Codex: The Sandboxed App Server with Formal Protocol
OpenAI's Codex (2025 — not the 2021 code completion model) is a multi-surface coding agent powered by the codex-1 model family (o3 fine-tuned via reinforcement learning on real-world coding tasks). It ships as a cloud product (in ChatGPT), an open-source CLI (Rust + TypeScript, MIT license), IDE extensions, a macOS app, and a GitHub Action. The architectural centerpiece is the App Server: a single stable binary that provides a formal JSON-RPC protocol for all client surfaces.
The Model/Harness Separation. OpenAI explicitly separates two components: the model (stateless reasoning engine) and the harness (execution layer). The harness executes tool calls, collects outputs, manages permissions, enforces sandbox policies, and decides when the agent loop terminates. The model proposes actions; the harness disposes of them. OpenAI published dedicated blog posts about this separation — 'Unrolling the Codex agent loop' and 'Unlocking the Codex harness' — establishing the model/harness distinction as a first-class architectural concept.
The App Server Protocol. The App Server uses JSON-RPC 2.0 streamed as newline-delimited JSON (JSONL) over stdio, with an experimental WebSocket transport. Three core primitives structure all interactions: Items (atomic units of I/O with started/delta/completed lifecycle), Turns (groups of items from one agent work unit, supporting mid-turn steering and interruption), and Threads (durable conversation containers supporting creation, resumption, forking, archival, rollback, and compaction). The protocol is fully bidirectional: the server can initiate requests (e.g., approval prompts) and pause the agent turn pending client response. This formal protocol — with versioned schemas and generated client bindings in Go, Python, TypeScript, Swift, and Kotlin — is Codex's most distinctive engineering contribution.
Two-Layer Security. Codex enforces security through two independent layers. Layer 1 (Sandbox Mode) controls what the agent can do technically: read-only, workspace-write (read/edit within workspace, run commands), or full access. Protected paths (.git/, .agents/, .codex/) remain read-only even in writable modes. Layer 2 (Approval Policy) controls when the user must confirm: on-request (approve sandbox violations and network access), untrusted (auto-run safe reads, approve state mutations), never (all approvals disabled), or granular (per-category control). OS-level enforcement uses macOS Seatbelt profiles on Mac and bubblewrap+seccomp on Linux — real kernel-level sandboxing, not process-level controls.
Codex Cloud runs tasks in isolated containers with a two-phase runtime: a setup phase with network access for installing dependencies, then an agent phase that is offline by default. Internet access must be explicitly enabled. Tasks run asynchronously and multiple tasks can run in parallel.
Context Management. Codex addresses the quadratic context growth problem (n turns requires resending all prior context) through two mechanisms: prompt caching (new content is appended to an identical prefix, enabling KV computation reuse) and conversation compaction (replacing full history with an encrypted compressed representation via the /responses/compact endpoint). The compaction output is opaque — it preserves the model's understanding without exposing raw text. Hierarchical project instructions enter via AGENTS.md files loaded from global override, global default, and per-directory levels.
Subagents. Codex can spawn specialized subagents in parallel using a fan-out/fan-in pattern. The orchestrator waits for all results before returning a consolidated response. Resource controls limit concurrent threads (default: 6) and recursive delegation depth (default: 1). Built-in agent types include default (general-purpose), worker (execution-focused), and explorer (read-heavy analysis). Custom agents are configured via config.toml with specific instructions, model, and sandbox policies.
The Rust Rewrite. The CLI is being rewritten from TypeScript to Rust (codex-rs/) for zero-dependency installation, native security bindings, and elimination of GC pauses. The TypeScript version is maintained in parallel pending Rust feature parity. The Rust rewrite also enables the App Server to be embedded directly in client applications, eliminating the need for a separate process.
Codex as MCP Server. Codex can expose itself as an MCP server (codex mcp-server), enabling other agents to use Codex as a coding tool. Two endpoints are exposed: codex (initiates a session) and codex-reply (continues a session). This composability — where Codex is both an agent and a tool that other agents can call — represents a maturing understanding of agent-to-agent interaction.
Codex (2025)
OpenAI — OpenAI
Commercial product, launched May 2025; open-source CLI (MIT license)
Commercial product by OpenAI (no published paper)
A multi-surface coding agent with a formal JSON-RPC App Server protocol, two-layer security (sandbox mode + approval policy with OS-level enforcement), and a clear model/harness architectural separation.
Key Innovation
Formal JSON-RPC protocol with versioned schemas enabling multi-language client implementations; OS-level sandboxing (Seatbelt, bubblewrap+seccomp); encrypted conversation compaction for unlimited session length; composability as both agent and MCP server
Limitations
- •
No published training paper for the codex-1 RL process
- •
Rapid model naming changes (codex-1 → gpt-5.x-codex series)
- •
Rust rewrite still catching up to TypeScript feature parity
- •
App Server protocol is not an open standard (versioned per binary release)
Comparison
| Dimension | Codex | Claude Code | OpenClaw |
|---|---|---|---|
Wire Protocol | JSON-RPC 2.0 (JSONL/stdio, WebSocket) | No published protocol | Internal (undocumented) |
Sandbox Enforcement | OS-level: Seatbelt (macOS), bubblewrap+seccomp (Linux) | Permission model + optional Auto mode (2nd LLM call) | Flat permissions (third-party wrappers) |
Context Compaction | Encrypted compaction endpoint + prompt caching | 5-stage fallback (time clearing → summarization → memory → full summary → truncation) | Guard Context step (summarize/truncate) |
Subagent Model | Fan-out/fan-in, max 6 threads, depth 1 | Fork/Teammate/Worktree, no recursion | Channel-to-agent bindings (no native subagents) |
Open Source | MIT (CLI) | MIT (CLI) | MIT (full framework) |
SWE-bench Verified | 72.1% (pass@1) | 80.9% | Not evaluated (general-purpose agent) |
Multi-Agent Coordination: From Academic Frameworks to Production Teams
Single-agent harnesses hit practical limits when tasks require parallel work, role specialization, or cross-cutting coordination. The multi-agent paradigm addresses this by splitting work across multiple agent instances that communicate and coordinate. The academic foundations were laid by MetaGPT and AutoGen; production harnesses are now building their own implementations.
MetaGPT (Hong et al., ICLR 2024 Oral) introduced SOP-driven multi-agent coordination. Instead of letting agents chat freely, MetaGPT encodes human Standard Operating Procedures into the pipeline: a Product Manager agent generates a PRD, an Architect produces a design doc, Engineers write code, and QA runs tests. Each role passes structured artifacts to the next. This assembly-line model dramatically reduces the incoherent outputs common in free-form multi-agent conversation, because each agent's input is a well-typed artifact (not an open-ended chat message) and each agent's role is precisely scoped. MetaGPT was ranked #1 in the LLM Agent category at ICLR 2024.
AutoGen (Wu et al., COLM 2024, from Microsoft Research) took the opposite approach: fully conversational multi-agent systems. The core abstraction is the 'conversable agent' — an entity that can be backed by an LLM, human input, a tool executor, or any combination. Agents pass messages to each other in programmable conversation patterns: sequential (agent A talks to agent B), group chat (multiple agents discuss), or nested (a conversation within a conversation). The power comes from flexibility: the same framework handles math problem-solving, coding, operations research, and human-in-the-loop workflows. The cost is that conversational coordination is less predictable than SOP-driven pipelines.
ToolLLM (Qin et al., ICLR 2024 Spotlight) addressed a different aspect of multi-agent capability: tool discovery and planning across massive API landscapes. The framework includes ToolBench (16,464 real-world APIs across 49 categories), ToolLLaMA (a fine-tuned model for tool use), and a Depth-First Search Decision Tree (DFSDT) planner that explores multiple tool-call paths and backtracks when a path fails. DFSDT is the first systematic tree-search approach to tool planning, handling API failures via backtracking rather than linear retry — a critical capability when coordinating across unreliable external services.
Production implementations are converging on simpler patterns. Claude Code's agent teams use file-based mailboxes and shared task lists with file locking — a straightforward filesystem-based coordination mechanism. Codex uses fan-out/fan-in with hard limits on concurrent threads and recursion depth. OpenClaw relies on channel-to-agent bindings with shared filesystem state. None of these approach the sophistication of MetaGPT's SOP pipelines or AutoGen's conversational topologies, reflecting a production engineering preference for simplicity and debuggability over coordination expressiveness.
The Scaffolding/Harness taxonomy paper by Bui (arXiv, March 2026) provides the clearest published framework for understanding these systems. It distinguishes scaffolding (static assembly before the first prompt: system prompt, tool schemas, subagent registry) from harness (runtime orchestration: tool dispatch, context compaction, safety enforcement, state management). This vocabulary helps explain why production multi-agent systems look so different from academic ones: academic systems focus on scaffolding (agent roles, communication topologies) while production systems spend most of their engineering on harness (context management, permission enforcement, crash recovery).
MetaGPT
Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zhengying Chen, Steven Zheng, Juergen Schmidhuber — DeepWisdom, KAUST, Xiamen University
ICLR 2024 (Oral)
MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework
Encodes human Standard Operating Procedures into a multi-agent pipeline with distinct roles (PM, Architect, Engineer, QA) passing structured artifacts. An assembly-line model where each agent's output is a typed artifact consumed by the next.
Key Innovation
SOP-driven coordination with structured artifact passing reduces incoherent outputs compared to free-form multi-agent chat; ranked #1 in LLM Agent category at ICLR 2024
Limitations
- •
Rigid pipeline — not suitable for tasks requiring dynamic role reassignment
- •
Each role requires careful prompt engineering
- •
Artifact schemas must be defined upfront
AutoGen
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen White, Doug Burger, Chi Wang — Microsoft Research, Penn State University
COLM 2024
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
A framework where customizable 'conversable agents' pass messages in programmable conversation patterns (sequential, group chat, nested). Each agent can be backed by an LLM, human, tool executor, or combination.
Key Innovation
Decoupled agent role from communication topology — the same framework handles math, coding, QA, and human-in-the-loop workflows via different conversation patterns
Limitations
- •
Conversational coordination is less predictable than pipeline-based systems
- •
Group chat can produce unfocused discussion
- •
Token costs scale with number of agents and conversation length
ToolLLM
Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Zhiyuan Liu, Maosong Sun — Tsinghua University, OpenBMB
ICLR 2024 (Spotlight)
ToolLLM: Facilitating Large Language Models to Master 16,000+ Real-World APIs
A complete tool-use framework with ToolBench (16,464 real APIs), ToolLLaMA (fine-tuned model), and a Depth-First Search Decision Tree (DFSDT) planner for multi-step tool planning with backtracking.
Key Innovation
DFSDT planner — first systematic tree-search approach to tool-use that handles API failures via backtracking rather than linear retry; ToolLLaMA matches ChatGPT on tool use while being open-source
Limitations
- •
API instability — real-world APIs change and break over time
- •
DFSDT search can be slow for deeply nested tool sequences
- •
Evaluation assumes correct API documentation is available
OpenDev Harness Paper
Nghi D. Q. Bui — OpenDev (independent)
arXiv, March 2026
Building Effective AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned
Distinguishes scaffolding (static assembly: system prompt, tool schemas, subagent registry) from harness (runtime orchestration: tool dispatch, context compaction, safety enforcement). Describes a six-phase ReAct loop implementation.
Key Innovation
Provides the clearest published taxonomy distinguishing scaffolding from harness — terminology now appearing in follow-on work; documents practical context engineering techniques absent from academic papers
Limitations
- •
Practitioner report rather than peer-reviewed paper
- •
Single-system focus (OpenDev)
- •
No formal evaluation on standard benchmarks
Comparison
| System | Coordination Model | Agent Communication | Best For |
|---|---|---|---|
MetaGPT | SOP pipeline with typed artifacts | Structured artifact passing (PRD → design → code → tests) | Well-defined workflows with clear role boundaries |
AutoGen | Conversable agents with programmable patterns | Free-form message passing (sequential, group chat, nested) | Flexible multi-agent tasks requiring dynamic interaction |
ToolLLM | Tree-search planner with backtracking | Single agent with DFSDT planning over API sequences | Complex multi-step API orchestration with failure recovery |
Claude Code Teams | Shared task list + file-based mailbox | File-system primitives with file locking | Parallel coding tasks with simple coordination |
Codex Subagents | Fan-out/fan-in with depth limits | Result-only communication (no peer-to-peer) | Focused subtasks with consolidated results |
Cross-Cutting Comparison: Five Agent Harnesses Head-to-Head
| Dimension | OpenClaw | Claude Code | Codex | OpenHands | SWE-agent |
|---|---|---|---|---|---|
| Primary Use Case | General-purpose autonomous agent | Software engineering | Software engineering | Software engineering | Software engineering (research) |
| Architecture | Persistent daemon (5 components) | CLI with ~40 tools + QueryEngine | Multi-surface App Server (JSON-RPC) | Docker-sandboxed platform | Custom ACI + agent loop |
| Agent Loop | ReAct via Brain component | Three-phase (gather → act → verify) | Model/harness iterative loop | CodeAct (executable Python) | Custom ACI-based loop |
| Sandboxing | Flat permissions (third-party wrappers) | Permission model (4 modes) + optional 2nd LLM | OS-level (Seatbelt, bubblewrap+seccomp) | Docker containers | Docker containers |
| Context Management | Guard Context (summarize/truncate) | 5-stage fallback compaction | Prompt caching + encrypted compaction | Event-stream state | Windowed file viewer |
| Multi-Agent | Channel bindings (no native peer-to-peer) | Subagents + Agent Teams (experimental) | Subagents (fan-out/fan-in, max 6) | Single agent | Single agent |
| Open Source | Yes (MIT, 247K stars) | Yes (MIT) | Yes (MIT, CLI) | Yes (MIT) | Yes (MIT) |
| Wire Protocol | Internal | None published | JSON-RPC 2.0 (JSONL) | Event stream | Custom |
| SWE-bench Verified | Not evaluated (general-purpose) | 80.9% | 72.1% | 53% | 12.5% Full (2024 SOTA) |
| Scheduling | Built-in Heartbeat | External | External (GitHub Action) | External | External |