Vol. 2 | Saturday, April 4, 2026

Agent Harnesses: How LLMs Become Autonomous Software Engineers

A deep dive into OpenClaw, Claude Code, Codex, and the academic foundations powering today's coding agents — from ReAct loops to sandboxed execution and multi-agent coordination

agent-harness, react-loop, tool-use, sandboxing, multi-agent, context-engineering, coding-agents, agent-computer-interface

Summary

Large language models can generate code, but they cannot do software engineering alone. The missing piece is the agent harness: the runtime system that gives a model eyes (file reading, code search), hands (file editing, shell execution), and a feedback loop (test results, linter output, error traces). This review maps the landscape of agent harnesses across three production systems and ten academic papers, tracing how the field evolved from the ReAct paper's think-act-observe loop (2022) through purpose-built coding agents like SWE-agent (2024) to today's full-featured harnesses: OpenClaw's persistent daemon architecture, Claude Code's permission-gated tool system with multi-agent teams, and Codex's sandboxed App Server with formal JSON-RPC protocols.

The central design tension is autonomy vs. safety. More capable harnesses give agents broader tool access and longer execution loops, but every additional capability is an additional attack surface. OpenClaw learned this the hard way when 190 security advisories were filed against it in under a month in early 2026. Claude Code and Codex take different approaches: Claude Code uses a layered permission model with optional human-in-the-loop confirmation, while Codex enforces OS-level sandboxing (macOS Seatbelt, Linux bubblewrap+seccomp) with a formal approval policy system. The academic literature contributes a parallel insight: SWE-agent showed that how the agent interacts with its tools (the Agent-Computer Interface) matters as much as the model itself, while Agentless demonstrated that simpler pipelines can match complex harnesses at a fraction of the cost.

Three architectural patterns dominate: (1) the ReAct loop — reason, act, observe, repeat — is universal across all systems; (2) context engineering — compaction, caching, hierarchical instruction files — is the primary scaling bottleneck; (3) multi-agent coordination — from MetaGPT's SOP-driven assembly lines to Claude Code's agent teams — is the frontier, with no consensus on the right abstraction.

Researcher Notes

The harness is the product, not the model. This is the single most important takeaway. OpenClaw, Claude Code, and Codex all use frontier LLMs as their reasoning engine, but the engineering that makes them useful — tool execution, permission enforcement, context management, sandbox isolation — lives entirely in the harness. The model proposes; the harness disposes. When OpenAI published 'Unrolling the Codex agent loop,' they explicitly separated 'the model' from 'the harness' as two distinct architectural components. When Anthropic's Claude Code source was leaked in March 2026, the 59.8MB codebase revealed a 46,000-line QueryEngine.ts orchestration layer — the model API call is a single function within it.

Context engineering is the real bottleneck, not model capability. Every harness has elaborate machinery for managing the context window: Claude Code has five fallback compaction strategies (time-based clearing, conversation summarization, session memory extraction, full history summarization, oldest-message truncation). Codex uses prompt caching to avoid quadratic recomputation costs and an encrypted compaction endpoint for unlimited conversation length. OpenClaw's 'Guard Context' step dynamically summarizes or truncates. The academic papers largely ignore this problem because they evaluate on single-task benchmarks where every episode starts from a fresh context, but production harnesses spend more engineering effort on context management than on tool execution.

The security story is far from settled. OpenClaw's 190 security advisories in February 2026 were a wake-up call, but the underlying problem is fundamental: an agent harness gives an LLM access to a shell, a filesystem, and network resources. Every tool is a potential attack vector. The response has been layered defense: Codex uses OS-level sandboxing (Seatbelt on macOS, bubblewrap+seccomp on Linux) with a formal approval policy system. Claude Code uses a permission model where a second LLM call evaluates whether the user would approve each action. OpenClaw relies on community-built wrappers (IronClaw for WASM isolation, NemoClaw for kernel sandboxing). None of these are provably secure — the field is in the 'defense in depth' phase.

Multi-agent is the frontier, but the abstractions are immature. MetaGPT showed that SOP-driven role assignment works for structured workflows. AutoGen proved that conversational multi-agent systems can solve problems single agents cannot. But production harnesses are still experimenting: Claude Code's agent teams use file-based mailboxes and shared task lists with file locking — fundamentally a distributed systems problem being solved with filesystem primitives. Codex's subagent model is simpler (fan-out/fan-in with depth limits), trading coordination flexibility for reliability. The academic community hasn't converged on a theory of multi-agent coordination for coding tasks.

Agentless is the important counterpoint. While the industry races toward more complex harnesses, Xia et al.'s Agentless paper showed that a three-phase pipeline (localize → generate patch → filter with tests) achieves competitive results at $0.70/task with no agent loop at all. This is a healthy reminder that complexity must earn its keep. The best harness for a given task might be no harness at all.

Foundations: The ReAct Loop and Why Harnesses Exist

Every agent harness, from a 200-line script to OpenClaw's five-component daemon, implements some variant of the same core loop: the model reasons about what to do, acts by calling a tool, observes the result, and repeats. This pattern was formalized as ReAct by Yao et al. (arXiv 2022, ICLR 2023). It predates native function calling in model APIs, so early implementations prompted the model with examples and parsed actions out of its free-form text.

The insight behind ReAct is deceptively simple: interleaving reasoning traces with actions outperforms either alone. Pure reasoning (chain-of-thought) hallucinates when it lacks external information. Pure acting (tool calls without reasoning) makes incoherent sequences of actions. ReAct weaves them together: the model writes a thought ('I need to find the auth middleware file'), generates an action (search the codebase), observes the result (file found at src/middleware/auth.ts), reasons about what to do next ('The bug is in the token validation logic on line 47'), and acts again (edit the file). This think-act-observe cycle is now the beating heart of every production coding agent.
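
The think-act-observe cycle described above can be sketched in a few lines. This is a minimal illustration, not the ReAct authors' code: the `Model` signature, the `finish` action name, and the `scripted_model` and `fake_search` stand-ins are all assumptions made so the example runs without an LLM.

```python
from typing import Callable, Dict, List, Tuple

# A "model" here is any function from the transcript so far to
# (thought, action_name, action_argument). A real harness would call an LLM.
Model = Callable[[List[str]], Tuple[str, str, str]]
Tool = Callable[[str], str]


def react_loop(model: Model, tools: Dict[str, Tool], task: str, max_steps: int = 8) -> List[str]:
    """Run think -> act -> observe until the model emits the 'finish' action."""
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        thought, action, arg = model(transcript)
        transcript.append(f"Thought: {thought}")
        transcript.append(f"Action: {action}[{arg}]")
        if action == "finish":
            break
        tool = tools.get(action)
        # Unknown tools become an observation, so the model can recover.
        observation = tool(arg) if tool else f"error: unknown tool '{action}'"
        transcript.append(f"Observation: {observation}")
    return transcript


def scripted_model(transcript: List[str]) -> Tuple[str, str, str]:
    """Hypothetical stand-in for an LLM: replays the example from the text."""
    seen = [line for line in transcript if line.startswith("Observation:")]
    if not seen:
        return ("I need to find the auth middleware file", "search", "auth middleware")
    return ("The file is located; nothing more to look up", "finish", "src/middleware/auth.ts")


def fake_search(query: str) -> str:
    """Hypothetical tool that always reports the same file."""
    return "file found at src/middleware/auth.ts"


trace = react_loop(scripted_model, {"search": fake_search}, "fix the token validation bug")
```

The `max_steps` cap is the simplest possible termination guard; production harnesses layer permissions, context limits, and cost budgets on top of the same loop.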

But ReAct alone is not enough to build a useful agent. The paper used simple text-based environments (web search, interactive fiction). Real software engineering requires: (1) tool execution — safely running shell commands, editing files, invoking compilers and test suites; (2) context management — keeping track of relevant code, conversation history, and project structure within a finite context window; (3) permission enforcement — preventing the agent from deleting production databases or pushing malicious code; (4) state persistence — maintaining memory across interactions. The agent harness is the engineering layer that provides all of this around the core ReAct loop.

CodeAct (Wang et al., ICML 2024) pushed the action space further by replacing JSON tool calls with executable Python code. Instead of the model outputting a structured function call that the harness routes to a predefined tool, the model writes arbitrary Python that runs in a persistent interpreter. This enables dynamic logic — loops, conditionals, error handling, self-debugging — that JSON schemas cannot express. CodeAct showed up to 20% improvement over JSON-based actions on agent benchmarks, and the pattern was adopted by OpenHands, one of the most prominent open-source coding agent platforms.
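
To make the contrast with JSON tool calls concrete, here is a rough sketch of a persistent interpreter whose captured output becomes the next observation. The class name and structure are assumptions for illustration and are not taken from the CodeAct or OpenHands codebases; it also has no sandboxing, which a real harness would need.

```python
import contextlib
import io
import traceback


class PersistentInterpreter:
    """Runs code actions in one shared namespace, so state survives across steps."""

    def __init__(self) -> None:
        self.namespace: dict = {}

    def run(self, code: str) -> str:
        """Execute a code action; return captured stdout (or the traceback) as the observation."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)  # no isolation here: illustration only
        except Exception:
            buffer.write(traceback.format_exc())
        return buffer.getvalue()


interp = PersistentInterpreter()
first = interp.run("totals = [n * n for n in range(4)]\nprint(sum(totals))")
second = interp.run("print(len(totals))")      # reuses state from the first action
failed = interp.run("print(undefined_name)")   # the error text is fed back to the model
```

Because the traceback is returned rather than raised, the model sees its own failure as an observation, which is what makes self-debugging possible.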

The evolution from ReAct to CodeAct to production harnesses follows a clear trajectory: give the model more expressive ways to interact with the environment, while building increasingly sophisticated safety guardrails around those interactions.

ReAct

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao (Princeton University, Google Research)

ICLR 2023

ReAct: Synergizing Reasoning and Acting in Language Models

Interleaves reasoning traces with actions in a single LLM output loop. The model thinks, acts, observes the result, and repeats — grounding generation in external observations to reduce hallucination.

Key Innovation

Formalized the think-act-observe paradigm that now underlies every production agent harness, showing that combining reasoning and acting outperforms either alone

Limitations

  • Original evaluation on relatively simple environments (web search, text games)

  • No mechanism for managing long context or multi-file codebases

  • Single-agent only — no coordination protocol

react, agent-loop, reasoning, tool-use, foundational

CodeAct

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji (University of Illinois Urbana-Champaign, UC San Diego, MIT)

ICML 2024

Executable Code Actions Elicit Better LLM Agents

Replaces JSON tool calls with executable Python code as the sole action type. The agent writes Python that runs in a persistent interpreter; stdout/stderr becomes the next observation.

Key Innovation

Code-based actions enable dynamic logic, loops, and self-debugging that JSON schemas cannot express — up to 20% improvement over JSON-based actions on agent benchmarks

Limitations

  • Requires a persistent interpreter with security implications

  • Python-centric — less natural for non-Python workflows

  • Arbitrary code execution increases attack surface

code-actions, agent-loop, tool-use, python, self-debugging

Comparison

| System | Action Space | Reasoning | Environment | Key Contribution |
|---|---|---|---|---|
| ReAct (2023) | Text-based tool calls | Interleaved thought traces | Web search, text games | Formalized the agent loop paradigm |
| CodeAct (2024) | Executable Python code | Interleaved thought traces | Persistent Python interpreter | Showed code > JSON for agent actions |

Measuring Agents: SWE-bench, SWE-agent, and the Agent-Computer Interface

Before you can build a good harness, you need to measure what 'good' means. SWE-bench (Jimenez et al., ICLR 2024) established the gold standard: 2,294 real GitHub issues from 12 popular Python repositories. Given a codebase and an issue description, the agent must produce a patch that passes the repository's existing test suite. No synthetic tasks — every issue and test is real. At release, GPT-4 solved fewer than 2% of tasks; the benchmark instantly revealed how far models were from practical software engineering.

SWE-bench spawned an ecosystem of variants, including SWE-bench Lite (300 tasks), SWE-bench Verified (500 human-validated tasks), and extensions to other languages, and became the universal yardstick for coding agent evaluation. Most coding systems in this review report SWE-bench numbers: Codex at 72.1% (pass@1 on Verified), Claude Code at 80.9%, OpenHands at 53%.

SWE-agent (Yang et al., NeurIPS 2024) made the crucial observation that benchmark performance depends not just on the model but on the Agent-Computer Interface (ACI) — the set of tools and interaction patterns the harness provides. The authors designed a custom ACI with purpose-built commands: a windowed file viewer (showing 100 lines at a time with line numbers), a search tool that returns context around matches, and a structured edit command that validates syntax before applying changes. These tools are optimized for LLM consumption — concise, unambiguous, with structured error feedback.
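
As an illustration of what "optimized for LLM consumption" can mean, the sketch below renders a 100-line window with line numbers and tells the model how much of the file lies outside it. The function name and output format are assumptions for this example, not SWE-agent's actual command.

```python
def view_window(lines: list[str], start: int, window: int = 100) -> str:
    """Render `window` lines beginning at 1-based line `start`, with line numbers
    and a note about how much of the file lies outside the window."""
    total = len(lines)
    start = max(1, min(start, max(total, 1)))
    end = min(start + window - 1, total)
    header = f"[lines {start}-{end} of {total}]"
    body = [f"{n}: {lines[n - 1]}" for n in range(start, end + 1)]
    footer = []
    if start > 1:
        footer.append(f"({start - 1} lines above)")
    if end < total:
        footer.append(f"({total - end} lines below)")
    return "\n".join([header, *body, *footer])


source = [f"line_{i}" for i in range(1, 251)]   # a fake 250-line file
page = view_window(source, start=101)
```

The header and footer cost a few tokens but save the model from guessing whether it has seen the whole file, which is the kind of structured feedback the ACI argument is about.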

The result was striking: ACI design mattered more than model size. SWE-agent achieved 12.5% on SWE-bench Full (SOTA at publication) and the gap between good and bad ACIs was larger than the gap between model generations. This finding has profound implications for harness engineering: the tools you give the model are as important as the model itself. Every production harness has internalized this lesson — Claude Code's ~40 tools with structured output formats, Codex's sandboxed filesystem utilities with streaming output, OpenClaw's skill-based tool injection system.

Agentless (Xia et al., 2024) provided the essential counterpoint. Rather than building a complex agent loop, Agentless uses a simple three-phase pipeline: (1) hierarchical localization (narrow from repository to file to function), (2) patch generation, (3) filtering candidates by running tests. No tool use, no iterative loop, no agent state. The result: 32% on SWE-bench Lite at $0.70 per task, competitive with agent-based systems that cost several times more per task (roughly $1-4 for the agents compared in this review). Agentless forces the field to justify every layer of harness complexity: if a pipeline without an agent loop can match your agent, your harness is adding cost without adding capability.
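
The shape of such a pipeline is easy to show, even though the real phases are LLM calls and test runs. In this sketch the three callables and the `fake_*` stand-ins are hypothetical, and returning the first surviving patch is a simplification of how candidates would actually be ranked.

```python
from typing import Callable, List, Optional


def agentless_pipeline(
    issue: str,
    localize: Callable[[str], List[str]],
    generate_patches: Callable[[str, List[str]], List[str]],
    passes_tests: Callable[[str], bool],
) -> Optional[str]:
    """Fixed three-phase pipeline: no loop, no tool selection, no agent state."""
    locations = localize(issue)                              # phase 1: repo -> file -> function
    candidates = generate_patches(issue, locations)          # phase 2: sample candidate patches
    survivors = [p for p in candidates if passes_tests(p)]   # phase 3: test-based filter
    return survivors[0] if survivors else None


def fake_localize(issue: str) -> List[str]:
    return ["src/middleware/auth.ts::validate_token"]


def fake_generate(issue: str, locations: List[str]) -> List[str]:
    return ["patch-a", "patch-b", "patch-c"]


def fake_tests(patch: str) -> bool:
    return patch == "patch-b"


chosen = agentless_pipeline("token validation fails", fake_localize, fake_generate, fake_tests)
nothing = agentless_pipeline("token validation fails", fake_localize, fake_generate, lambda p: False)
```

Each phase runs exactly once, which is why cost is predictable and also why there is no recovery path when every candidate fails.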

OpenHands (Wang et al., ICLR 2025) synthesized these insights into a comprehensive open-source platform. Agents run inside sandboxed Docker containers with access to shell commands, file editing, web browsing, and APIs. The platform uses CodeAct as its action layer and supports 15 benchmarks for evaluation. OpenHands achieved #1 on SWE-bench Full (29%) and 53% on Verified at publication, demonstrating that combining a well-designed ACI with a robust execution sandbox and the code-action paradigm produces a highly capable open-source coding agent.

SWE-bench

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan (Princeton University)

ICLR 2024

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

A benchmark of 2,294 real GitHub issues from 12 popular Python repositories. The agent receives a codebase and issue description and must produce a patch passing the repository's test suite.

Key Innovation

Established the gold-standard benchmark for end-to-end coding agent evaluation using real-world issues and tests, revealing that GPT-4 solved fewer than 2% at release

Limitations

  • Python-only (extensions to other languages came later)

  • Test suites vary in quality across repositories

  • Pass@1 evaluation may undervalue agents that generate correct patches in multiple attempts

benchmark, software-engineering, github-issues, evaluation

SWE-agent

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, Ofir Press (Princeton University)

NeurIPS 2024

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

Argues that the Agent-Computer Interface (ACI) — the set of tools and interaction patterns — is as important as the model. Designs custom file viewing, searching, and editing commands optimized for LLM consumption.

Key Innovation

Demonstrated that ACI design matters more than model size for coding agent performance; custom tools with structured error feedback dramatically improve agent effectiveness

Limitations

  • Custom ACI requires significant engineering effort per domain

  • Evaluation focused on Python repositories

  • Single-agent architecture — no built-in multi-file coordination

agent-computer-interface, coding-agent, tool-design, software-engineering

Agentless

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, Lingming Zhang (University of Illinois Urbana-Champaign)

arXiv 2024

Agentless: Demystifying LLM-based Software Engineering Agents

A three-phase pipeline — hierarchical localization, patch generation, test-based filtering — that solves coding tasks without an agent loop. No tool use, no iterative reasoning, no agent state.

Key Innovation

Showed that a non-agentic pipeline achieves 32% on SWE-bench Lite at $0.70/task, competitive with far more complex agent systems — forcing the field to justify harness complexity

Limitations

  • No iterative debugging — if the initial patch is wrong, there is no recovery

  • Localization accuracy is a hard ceiling on overall performance

  • Cannot handle tasks requiring multi-step environmental interaction

pipeline, non-agentic, cost-efficient, localization, counter-narrative

OpenHands

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Graham Neubig (University of Illinois Urbana-Champaign, All Hands AI, Carnegie Mellon University)

ICLR 2025

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

A comprehensive open-source platform where agents run inside sandboxed Docker containers with access to shell, file editing, web browsing, and APIs. Uses CodeAct as the action layer and supports 15 benchmarks.

Key Innovation

Unified platform covering the full agent lifecycle — sandboxed execution, event-stream state management, multi-turn interactions, and benchmark evaluation — achieving #1 on SWE-bench Full at publication

Limitations

  • Docker dependency adds setup complexity

  • Event-stream architecture can be memory-intensive for long sessions

  • Community-driven — API stability varies

open-source, coding-agent, docker-sandbox, codeact, platform

Comparison

| System | Type | SWE-bench Score | Key Insight | Cost/Task |
|---|---|---|---|---|
| SWE-bench | Benchmark | GPT-4: <2% at release | Real-world issues are far harder than synthetic benchmarks | n/a (benchmark, not a system) |
| SWE-agent | Agent + ACI | 12.5% Full (2024 SOTA) | Tool design matters more than model size | ~$2-4 |
| Agentless | Pipeline (no agent loop) | 32% Lite | Simple pipelines can match complex agents | $0.70 |
| OpenHands | Open platform | 29% Full, 53% Verified | CodeAct + sandbox + good ACI = strong open-source agent | ~$1-3 |

OpenClaw: The Persistent Daemon Agent

OpenClaw (originally Clawdbot, then briefly Moltbot) is the most popular open-source agent harness by GitHub stars (~247K as of March 2026). Created by Austrian developer Peter Steinberger, it takes a fundamentally different architectural approach from coding-focused agents: OpenClaw treats the agent as a persistent operating system process — a daemon that runs continuously, reacts to messages from multiple channels, and autonomously executes tasks on a schedule.

The Five-Component Architecture. OpenClaw's design has five clearly delineated components:

  1. Gateway — The central daemon process and single source of truth for sessions and routing. It normalizes inputs from multiple channels (Slack, Discord, WhatsApp, Telegram, etc.) into a unified internal message format, authenticates inbound messages, and proxies outbound API calls to the configured LLM provider.

  2. Brain (Agent Runner) — The orchestration layer that implements a five-step pipeline: (a) select the correct agent for the incoming session, (b) resolve the cheapest model that satisfies context requirements (with automatic key rotation on rate limits), (c) build the prompt dynamically from SOUL.md, skills, memories, and tool definitions, (d) guard context by enforcing the model's context window limit via summarization or truncation, (e) execute the ReAct loop until the model produces a terminal response.

  3. Memory — All persistent context is stored as plain Markdown files on local disk (~/.openclaw/). No embedded database, no vector store. The directory structure includes preferences.md, contacts.md, projects.md, learnings.md, and per-agent SOUL.md files. This is a deliberate trade-off: human-auditable and portable at the cost of retrieval sophistication.

  4. Skills — OpenClaw's plugin/tool system. Each skill is a Markdown file with YAML frontmatter declaring a name, trigger pattern, and available tools. Skills are loaded at agent startup and injected into the system prompt. The community Skills Registry has grown to 5,700+ entries.

  5. Heartbeat — The autonomy mechanism. Every 30 minutes (configurable), the Gateway sends the agent a scheduled trigger. The agent reads HEARTBEAT.md — a checklist of standing tasks — and decides whether any item requires action. If nothing needs doing, the agent responds with HEARTBEAT_OK and the reply is silently dropped. This built-in scheduling is unique among agent frameworks — competitors require external cron or workflow triggers.
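
The Heartbeat step in the list above reduces to a small piece of control flow. The following is a sketch under assumptions: the prompt wording and the `agent` callable are invented for illustration, and only the `HEARTBEAT_OK` sentinel comes from the description above.

```python
from typing import Callable, Optional

HEARTBEAT_OK = "HEARTBEAT_OK"


def run_heartbeat(checklist: str, agent: Callable[[str], str]) -> Optional[str]:
    """Send the standing checklist to the agent; drop the reply if nothing needs doing."""
    reply = agent(f"Scheduled heartbeat. Review this checklist:\n{checklist}").strip()
    if reply == HEARTBEAT_OK:
        return None   # silently dropped: nothing is delivered to any channel
    return reply


# Hypothetical agents standing in for a full ReAct run.
quiet = run_heartbeat("- check backups", lambda prompt: "HEARTBEAT_OK\n")
busy = run_heartbeat("- check backups", lambda prompt: "Backup failed overnight; rerunning.")
```

A scheduler would call `run_heartbeat` at the configured interval and forward only non-`None` results, so idle ticks cost a model call but produce no user-visible noise.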

The Serial Lane Queue. Messages within a session are processed one at a time through a Lane Queue. This prevents tool conflicts and keeps session history consistent, but limits throughput. Parallelism is opt-in and restricted to explicitly marked low-risk tasks. This is a significant architectural choice that prioritizes correctness over speed.
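
One way to picture the serial guarantee is a lock per session, so that messages within a session never overlap while separate sessions still interleave. This is an illustrative sketch in Python with `asyncio`; OpenClaw itself is a Node.js project, and the class below is an assumption about the behavior, not its implementation.

```python
import asyncio
from collections import defaultdict


class LaneQueue:
    """One lock per session: messages in a session run one at a time,
    while different sessions may interleave."""

    def __init__(self) -> None:
        self._locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
        self.log: list[str] = []

    async def handle(self, session: str, message: str) -> None:
        async with self._locks[session]:
            self.log.append(f"start {session}:{message}")
            await asyncio.sleep(0.01)   # stands in for model and tool calls
            self.log.append(f"end {session}:{message}")


async def demo() -> list[str]:
    lanes = LaneQueue()
    await asyncio.gather(
        lanes.handle("a", "m1"),
        lanes.handle("a", "m2"),
        lanes.handle("b", "m1"),
    )
    return lanes.log


log = asyncio.run(demo())
```

In the resulting log, session `a` finishes `m1` before starting `m2`, while session `b` starts without waiting for either.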

The Security Problem. OpenClaw's rapid growth exposed serious security issues. Between January 31 and February 25, 2026, 190 security advisories were filed — including combinations that compose into unauthenticated remote code execution paths. Four arXiv papers appeared in March 2026 analyzing the vulnerabilities. The flat permission model (agents have broad tool access by default) creates substantial attack surface when LLM reasoning is connected to host execution. The ecosystem responded with third-party hardening wrappers: IronClaw (Rust/WASM isolation by NEAR AI), NemoClaw (kernel-level sandboxing by NVIDIA), and KubeClaw (Kubernetes JIT RBAC). These are downstream fixes, not upstream architecture changes.

Identity via SOUL.md. OpenClaw uses a Markdown file called SOUL.md as an identity layer injected into every system prompt. The official docs distinguish it from a system prompt: 'System prompts tell models what to do; soul files tell them who to be.' Companion files handle style (STYLE.md), operating modes (SKILL.md), and session memory (MEMORY.md).

OpenClaw

Peter Steinberger, Independent (now OpenAI)

Open-source (MIT license), released November 2025

Open-source project (no published paper)

A persistent daemon agent framework with five components: Gateway (routing), Brain (ReAct loop), Memory (Markdown files), Skills (plugin system), and Heartbeat (autonomous scheduling). Runs as a background process reacting to messages and scheduled triggers.

Key Innovation

First major framework to treat the agent as a persistent OS process with built-in heartbeat scheduling and multi-channel message routing — other frameworks require external infrastructure for scheduling and message delivery

Limitations

  • Flat permission model led to 190 security advisories in one month

  • Markdown-based memory lacks retrieval sophistication (no vector search)

  • Serial lane queue limits throughput

  • Multi-agent coordination requires third-party wrappers

agent-harness, persistent-daemon, heartbeat, multi-channel, open-source

IronClaw

NEAR AI team NEAR AI

Open-source, 2026

Open-source project by NEAR AI (no published paper)

OpenClaw-inspired agent framework rewritten in Rust with all tools running in WASM containers using capability-based permissions. Zero telemetry by design.

Key Innovation

WASM-based tool isolation with capability-based permissions — each tool runs in its own sandbox with explicitly granted capabilities, preventing the lateral movement attacks that plagued OpenClaw

Limitations

  • Smaller ecosystem than OpenClaw (fewer skills, less community support)

  • WASM sandbox adds overhead for I/O-heavy tools

  • Newer and less battle-tested

rust, wasm, capability-based-security, openclaw-derivative

Comparison

| System | Language | Sandboxing | Scheduling | Multi-Agent | Community Skills |
|---|---|---|---|---|---|
| OpenClaw | Node.js | Flat permissions (third-party wrappers available) | Built-in Heartbeat (30 min default) | Channel-to-agent bindings; no peer-to-peer | 5,700+ |
| IronClaw | Rust | WASM containers with capability-based permissions | Inherited from OpenClaw pattern | Similar to OpenClaw | Growing |
| NemoClaw | Python wrapper | Kernel-level (OS-level, not container) | Wraps OpenClaw's Heartbeat | Wraps OpenClaw | OpenClaw-compatible |

Claude Code: Permission-Gated Tool Orchestration and Agent Teams

Claude Code is Anthropic's agentic coding system — a harness that wraps Claude models with ~40 tools, a layered permission model, and multi-agent coordination capabilities. Released as a limited research preview in February 2025 and reaching general availability in May 2025, it rapidly became one of the most commercially successful AI products, surpassing $1B in annualized revenue by late 2025.

The Three-Phase Loop. Claude Code's core execution model iterates through three phases: (1) gather context — read files, search codebases, inspect error logs; (2) take action — edit files, run commands, call external services; (3) verify results — run tests, check outputs, course-correct. The orchestrator's behavior is defined in natural language in the system prompt, not in branching code logic — enabling behavioral iteration without redeployment.

Tool Architecture. Approximately 40 tools are organized into categories: file operations (Read, Write, Edit, Glob), search (Grep, pattern-based file search), execution (Bash), web (WebSearch, WebFetch), and orchestration (Agent for spawning subagents, AskUserQuestion). Read-only operations run concurrently; mutating operations execute serially to prevent conflicts. Each tool is independently sandboxed with configurable access controls. External tools connect via the Model Context Protocol (MCP), with tool definitions deferred by default to save context space — only tool names are loaded until a tool is actually used.

The Permission Model. This is Claude Code's most distinctive design choice. Four permission modes provide graduated autonomy: (1) Default — Claude asks before file edits and shell commands; (2) Auto-accept edits — file edits auto-approved, commands still require confirmation; (3) Plan mode — read-only tools only, generates a plan for user approval; (4) Auto mode (research preview) — a second, separate LLM call evaluates each proposed action to predict whether the user would approve it. This is architecturally significant: the safety evaluation is not the same inference that generated the action. It is a separate model call specifically for permission prediction.
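
The four modes can be read as a small decision table. The sketch below encodes that reading; the enum values, the action kinds, and the choice to fall back to asking the user when the classifier says no are all assumptions, and the `classifier` callable merely stands in for the separate LLM call.

```python
from enum import Enum
from typing import Callable, Optional


class Mode(Enum):
    DEFAULT = "default"
    ACCEPT_EDITS = "accept-edits"
    PLAN = "plan"
    AUTO = "auto"


READ_ONLY_KINDS = {"read", "search"}


def decide(mode: Mode, kind: str, classifier: Optional[Callable[[str], bool]] = None) -> str:
    """Return 'allow', 'ask', or 'deny' for an action of kind read/search/edit/command."""
    if kind in READ_ONLY_KINDS:
        return "allow"
    if mode is Mode.PLAN:
        return "deny"            # plan mode exposes read-only tools only
    if mode is Mode.ACCEPT_EDITS and kind == "edit":
        return "allow"
    if mode is Mode.AUTO and classifier is not None:
        # A separate call predicts whether the user would approve this action.
        return "allow" if classifier(kind) else "ask"
    return "ask"
```

Keeping the classifier behind a plain callable mirrors the point made above: the permission check is a different inference from the one that proposed the action.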

Context Engineering. The QueryEngine (the core orchestration engine, revealed in a March 2026 source leak to be ~46,000 lines of TypeScript) manages five fallback context compaction strategies deployed sequentially: time-based clearing of outdated tool outputs, conversation summarization, session memory extraction, full history summarization, and oldest-message truncation. Project context enters via CLAUDE.md files (project-specific instructions loaded at session start) and an auto-memory system that saves learnings as work progresses.
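
A sequential fallback chain of this kind might look like the sketch below. Only two of the five strategies are modeled, both as crude stand-ins, and word counts replace real token counts; none of the names come from the Claude Code source.

```python
from typing import Callable, List

Messages = List[str]
Strategy = Callable[[Messages], Messages]


def size(messages: Messages) -> int:
    """Crude size proxy: whitespace-separated words rather than real tokens."""
    return sum(len(m.split()) for m in messages)


def clear_old_tool_outputs(messages: Messages) -> Messages:
    """Stand-in for the cheapest strategy: blank tool outputs outside the last two messages."""
    head, tail = messages[:-2], messages[-2:]
    return [("tool: [cleared]" if m.startswith("tool:") else m) for m in head] + tail


def truncate_oldest(budget: int) -> Strategy:
    """Stand-in for the last resort: drop messages from the front until the rest fits."""
    def strategy(messages: Messages) -> Messages:
        while len(messages) > 1 and size(messages) > budget:
            messages = messages[1:]
        return messages
    return strategy


def compact(messages: Messages, budget: int, strategies: List[Strategy]) -> Messages:
    """Try strategies in order and stop as soon as the history fits the budget."""
    for strategy in strategies:
        if size(messages) <= budget:
            break
        messages = strategy(messages)
    return messages


history = [
    "user: fix the login bug",
    "tool: " + "x " * 50,
    "assistant: found it",
    "user: now add a test",
]
mild = compact(history, 20, [clear_old_tool_outputs, truncate_oldest(20)])
harsh = compact(history, 8, [clear_old_tool_outputs, truncate_oldest(8)])
```

With a budget of 20 the cheap strategy suffices and all four messages survive; with a budget of 8 the chain falls through to truncation and only the last two remain. Ordering strategies from least to most lossy is the design point.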

Subagents. Claude Code can spawn subagent instances via the Agent tool. By default, each subagent receives its own system prompt, a restricted tool subset, and the project's CLAUDE.md, but not the parent's conversation history or system prompt. Only the subagent's final message returns to the parent; intermediate tool calls stay isolated, preventing context contamination. Three execution models exist: Fork (byte-identical parent context copy), Teammate (separate terminal pane with file-based mailbox), and Worktree (git worktree with isolated branches). Subagents cannot spawn their own subagents, limiting recursion depth.

Agent Teams. The experimental multi-agent feature (enabled via feature flag) is architecturally distinct from subagents. A team lead spawns teammates — each a full Claude Code instance with its own context window — that coordinate through a shared task list (~/.claude/tasks/{team-name}/) and a file-based mailbox messaging system. Task claiming uses file locking to prevent race conditions. Teammates can be required to plan before implementing; the lead reviews and approves plans autonomously. This is fundamentally a distributed systems problem being solved with filesystem primitives — simple and pragmatic, but with known limitations around session resumption and task status consistency.
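
The text does not say which locking primitive is used, so the following is only one plausible way to claim a task without races: atomically creating a lock file with `O_CREAT | O_EXCL`. The `.lock` naming and the `try_claim` helper are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def try_claim(task_dir: Path, task_id: str, owner: str) -> bool:
    """Claim a task by atomically creating its lock file; False if already claimed."""
    lock_path = task_dir / f"{task_id}.lock"
    try:
        # O_CREAT | O_EXCL fails if the file exists, so only one claimant can win.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(owner)
    return True


with tempfile.TemporaryDirectory() as tmp:
    tasks = Path(tmp)
    first = try_claim(tasks, "task-1", "teammate-a")
    second = try_claim(tasks, "task-1", "teammate-b")
    winner = (tasks / "task-1.lock").read_text()
```

The sketch also hints at the limitations mentioned above: if a teammate crashes after claiming, the lock file stays behind and something else has to decide when it is stale.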

Hooks. External code can intercept the agent loop at 25+ lifecycle points: PreToolUse, PostToolUse, SessionStart, SessionEnd, UserPromptSubmit, SubagentStart, and more. Hooks can allow, deny, or defer tool calls, inject additional context, or modify tool inputs. When multiple hooks conflict, deny takes priority over ask, which takes priority over allow. This creates a deterministic interception layer that enables organizational policy enforcement without modifying the agent's core behavior.
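
The precedence rule for conflicting hooks is small enough to state as code. This sketch assumes each hook returns one of the three strings named above and that no hooks means no opinion; the function name is made up for illustration.

```python
from typing import List, Optional

# Higher number wins: deny beats ask, ask beats allow.
PRIORITY = {"deny": 3, "ask": 2, "allow": 1}


def resolve_hooks(decisions: List[str]) -> Optional[str]:
    """Combine the decisions of several hooks into one outcome."""
    if not decisions:
        return None   # assumption: with no hooks, the normal permission flow applies
    return max(decisions, key=lambda d: PRIORITY[d])
```

For example, `resolve_hooks(["allow", "ask", "deny"])` yields `"deny"`, so a single restrictive policy hook cannot be overridden by a permissive one.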

Cowork. Claude Code Cowork brings the same harness architecture into Claude Desktop, running inside a sandboxed VM (Ubuntu 22.04, 4 CPU cores, 4GB RAM) using Apple's Virtualization framework. Multi-layer egress control (syscall blocking via gVisor, MITM proxy with per-boot CA certificates, domain allowlist) provides stronger isolation than the CLI. Rather than sandboxing a browser inside the VM, Cowork controls the host Chrome browser via a native messaging extension — providing real browsing with the user's actual cookies and sessions, but operating outside VM egress controls.

Claude Code

Anthropic Anthropic

Commercial product, GA May 2025

Commercial product by Anthropic (no published paper)

An agentic coding harness with ~40 tools, a four-tier permission model (including a separate LLM call for auto-approval evaluation), layered context compaction, and multi-agent coordination via subagents and agent teams.

Key Innovation

Permission model where a second LLM call evaluates each proposed action independently of the generating model; hooks system enabling deterministic organizational policy enforcement; three subagent execution models (fork, teammate, worktree)

Limitations

  • No formally published wire protocol (unlike Codex's JSON-RPC spec)

  • Agent teams are experimental with known limitations (no session resumption, no nested teams)

  • Auto mode permission evaluation is non-deterministic

agent-harness, permission-model, multi-agent, hooks, context-engineering

Claude Code Cowork

Anthropic Anthropic

Commercial product, shipped January 2026

Commercial product by Anthropic (no published paper)

Brings the Claude Code harness architecture into Claude Desktop, running inside a sandboxed Linux VM with multi-layer egress control. Extends beyond coding to knowledge work: spreadsheets, presentations, research, reports.

Key Innovation

VM-level isolation with gVisor syscall blocking, per-boot ephemeral CA certificates for HTTPS inspection, and host browser control via native messaging extension — stronger isolation than CLI-based harnesses

Limitations

  • Fixed VM resources (4 CPU cores, 4GB RAM)

  • Ephemeral storage — no persistence across sessions

  • Host browser control operates outside VM egress controls, creating a split security perimeter

vm-isolation, desktop-agent, knowledge-work, gvisor, sandboxing

Comparison

| Feature | Claude Code CLI | Claude Code Cowork |
|---|---|---|
| Execution Environment | Host OS with permission gates | Sandboxed Ubuntu 22.04 VM |
| Isolation | Per-tool permission model | VM boundary + gVisor + MITM proxy + domain allowlist |
| Multi-Agent | Subagents + Agent Teams | Parent-child orchestration |
| Browser Access | Via MCP tools | Host Chrome via native messaging extension |
| Persistence | Session files on local disk | Ephemeral (formatted fresh each boot) |
| Target Use Case | Software engineering in terminal | General knowledge work in desktop app |

Codex: The Sandboxed App Server with Formal Protocol

OpenAI's Codex (2025 — not the 2021 code completion model) is a multi-surface coding agent powered by the codex-1 model family (o3 fine-tuned via reinforcement learning on real-world coding tasks). It ships as a cloud product (in ChatGPT), an open-source CLI (Rust + TypeScript, MIT license), IDE extensions, a macOS app, and a GitHub Action. The architectural centerpiece is the App Server: a single stable binary that provides a formal JSON-RPC protocol for all client surfaces.

The Model/Harness Separation. OpenAI explicitly separates two components: the model (stateless reasoning engine) and the harness (execution layer). The harness executes tool calls, collects outputs, manages permissions, enforces sandbox policies, and decides when the agent loop terminates. The model proposes actions; the harness disposes of them. OpenAI published dedicated blog posts about this separation — 'Unrolling the Codex agent loop' and 'Unlocking the Codex harness' — establishing the model/harness distinction as a first-class architectural concept.

The App Server Protocol. The App Server uses JSON-RPC 2.0 streamed as newline-delimited JSON (JSONL) over stdio, with an experimental WebSocket transport. Three core primitives structure all interactions: Items (atomic units of I/O with started/delta/completed lifecycle), Turns (groups of items from one agent work unit, supporting mid-turn steering and interruption), and Threads (durable conversation containers supporting creation, resumption, forking, archival, rollback, and compaction). The protocol is fully bidirectional: the server can initiate requests (e.g., approval prompts) and pause the agent turn pending client response. This formal protocol — with versioned schemas and generated client bindings in Go, Python, TypeScript, Swift, and Kotlin — is Codex's most distinctive engineering contribution.
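
To make the Item lifecycle concrete, here is a minimal sketch of a client folding a JSONL stream of JSON-RPC 2.0 notifications into completed items. The method names (`item/started`, `item/delta`, `item/completed`) and the `itemId` and `delta` fields are assumptions chosen for illustration, not the App Server's published schema.

```python
import json
from typing import Dict, Iterable, List


def encode(method: str, params: dict) -> str:
    """Serialize one JSON-RPC 2.0 notification as a single JSONL line."""
    return json.dumps({"jsonrpc": "2.0", "method": method, "params": params}) + "\n"


def fold_items(lines: Iterable[str]) -> List[dict]:
    """Fold started/delta/completed notifications into finished items."""
    open_items: Dict[str, dict] = {}
    done: List[dict] = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between messages
        msg = json.loads(line)
        params = msg.get("params", {})
        item_id = params.get("itemId")  # hypothetical field name
        if msg.get("method") == "item/started":
            open_items[item_id] = {"id": item_id, "text": ""}
        elif msg.get("method") == "item/delta":
            open_items[item_id]["text"] += params["delta"]
        elif msg.get("method") == "item/completed":
            done.append(open_items.pop(item_id))
    return done


# A hand-written stream standing in for what a server might emit over stdio.
stream = [
    encode("item/started", {"itemId": "a1"}),
    encode("item/delta", {"itemId": "a1", "delta": "Hello, "}),
    encode("item/delta", {"itemId": "a1", "delta": "world"}),
    encode("item/completed", {"itemId": "a1"}),
]
items = fold_items(stream)
```

Because every message occupies exactly one line, a client can read stdio line by line and render deltas as they arrive, without waiting for the item to complete.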

Two-Layer Security. Codex enforces security through two independent layers. Layer 1 (Sandbox Mode) controls what the agent can do technically: read-only, workspace-write (read/edit within workspace, run commands), or full access. Protected paths (.git/, .agents/, .codex/) remain read-only even in writable modes. Layer 2 (Approval Policy) controls when the user must confirm: on-request (approve sandbox violations and network access), untrusted (auto-run safe reads, approve state mutations), never (all approvals disabled), or granular (per-category control). OS-level enforcement uses macOS Seatbelt profiles on Mac and bubblewrap+seccomp on Linux, so restrictions are enforced by the operating system kernel rather than by checks inside the agent process.
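
The interaction between the two layers can be sketched as a small decision function. This is an illustrative model only: `Action`, `within_sandbox`, and `decide` are hypothetical names, the granular policy is omitted, and the exact rules Codex applies may differ from this reading of the description above.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

PROTECTED_DIRS = {".git", ".agents", ".codex"}  # protected paths named in the text


@dataclass(frozen=True)
class Action:
    kind: str       # "read", "write", or "network"
    path: str = ""  # workspace-relative; "../x" or "/x" means outside the workspace


def within_sandbox(action: Action, sandbox: str) -> bool:
    """Layer 1: is the action technically permitted by the sandbox mode?"""
    if action.kind == "read":
        return True
    if action.kind == "network":
        return sandbox == "full-access"
    parts = PurePosixPath(action.path).parts
    if parts and parts[0] in PROTECTED_DIRS:
        return False  # protected paths stay read-only in writable modes
    if sandbox == "full-access":
        return True
    inside_workspace = ".." not in parts and not action.path.startswith("/")
    return sandbox == "workspace-write" and inside_workspace


def decide(action: Action, sandbox: str, policy: str) -> str:
    """Layer 2: combine the sandbox verdict with the approval policy."""
    allowed = within_sandbox(action, sandbox)
    if policy == "untrusted":
        # Assumed reading: reads run automatically, every mutation is confirmed.
        return "allow" if action.kind == "read" else "ask"
    if policy == "on-request":
        return "allow" if allowed else "ask"
    if policy == "never":
        return "allow" if allowed else "deny"  # nobody to ask, so violations fail
    raise ValueError(f"unsupported policy: {policy}")
```

For example, `decide(Action("write", ".git/config"), "workspace-write", "on-request")` yields `"ask"`, while the same write under the `never` policy yields `"deny"`. Keeping the layers separate means the approval policy can be loosened without widening what the sandbox technically permits.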

Codex Cloud runs tasks in isolated containers with a two-phase runtime: a setup phase with network access for installing dependencies, then an agent phase that is offline by default. Internet access must be explicitly enabled. Tasks run asynchronously and multiple tasks can run in parallel.

Context Management. Codex addresses the quadratic context growth problem (each turn resends all prior context, so the total tokens processed over n turns grows roughly as n²) through two mechanisms: prompt caching (new content is appended to an identical prefix, enabling KV computation reuse) and conversation compaction (replacing full history with an encrypted compressed representation via the /responses/compact endpoint). The compaction output is opaque: it preserves the model's understanding without exposing raw text. Hierarchical project instructions enter via AGENTS.md files loaded from global override, global default, and per-directory levels.
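
A short calculation shows why the growth is quadratic and what prefix caching buys. The sketch below is a deliberately simplified cost model: it assumes every turn adds the same number of tokens, that a cached prefix costs nothing to reuse, and that the cache never misses. None of these hold exactly in practice.

```python
def tokens_processed(turns: int, tokens_per_turn: int, cached_prefix: bool) -> int:
    """Total input tokens the model must freshly compute over a conversation."""
    total = 0
    context = 0
    for _ in range(turns):
        context += tokens_per_turn  # history grows by one turn
        # Without caching the whole context is recomputed each turn; with an
        # identical prefix only the newly appended tokens need fresh work.
        total += tokens_per_turn if cached_prefix else context
    return total


uncached = tokens_processed(100, 1_000, cached_prefix=False)
cached = tokens_processed(100, 1_000, cached_prefix=True)
```

With 100 turns of 1,000 tokens each, the uncached total is 1,000 × (100 × 101 / 2) = 5,050,000 tokens, against 100,000 when the prefix is fully reused. Caching reduces recomputation but not the size of the context itself, which is why compaction is still needed once the window fills.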

Subagents. Codex can spawn specialized subagents in parallel using a fan-out/fan-in pattern. The orchestrator waits for all results before returning a consolidated response. Resource controls limit concurrent threads (default: 6) and recursive delegation depth (default: 1). Built-in agent types include default (general-purpose), worker (execution-focused), and explorer (read-heavy analysis). Custom agents are configured via config.toml with specific instructions, model, and sandbox policies.
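
The fan-out/fan-in pattern with both resource limits can be sketched with `asyncio`. This is an illustration of the pattern, not Codex's implementation: `fan_out` and `explorer` are hypothetical names, and the stand-in worker returns a fixed string where a real subagent would call a model.

```python
import asyncio
from typing import Awaitable, Callable, List

MAX_THREADS = 6  # default concurrency limit cited above
MAX_DEPTH = 1    # default delegation depth cited above


async def fan_out(
    tasks: List[str],
    worker: Callable[[str, int], Awaitable[str]],
    depth: int = 0,
) -> List[str]:
    """Run one subagent per task, at most MAX_THREADS at once, then gather."""
    if depth >= MAX_DEPTH:
        raise RuntimeError("delegation depth limit reached")
    gate = asyncio.Semaphore(MAX_THREADS)

    async def run(task: str) -> str:
        async with gate:  # blocks while MAX_THREADS workers are already running
            return await worker(task, depth + 1)

    # Fan-in: gather waits for every subagent and preserves input order.
    return list(await asyncio.gather(*(run(t) for t in tasks)))


async def explorer(task: str, depth: int) -> str:
    """Stand-in for a real subagent."""
    await asyncio.sleep(0)
    return f"done:{task}"


results = asyncio.run(fan_out([f"t{i}" for i in range(10)], explorer))
```

Ten tasks are submitted but only six run at a time. A worker that tried to call `fan_out` again with the depth it was handed (1) would hit the depth check, which is how a depth limit of 1 rules out recursive delegation.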

The Rust Rewrite. The CLI is being rewritten from TypeScript to Rust (codex-rs/) for zero-dependency installation, native security bindings, and elimination of GC pauses. The TypeScript version is maintained in parallel pending Rust feature parity. The Rust rewrite also enables the App Server to be embedded directly in client applications, eliminating the need for a separate process.

Codex as MCP Server. Codex can expose itself as an MCP server (codex mcp-server), enabling other agents to use Codex as a coding tool. Two tools are exposed: codex (initiates a session) and codex-reply (continues a session). This composability, where Codex is both an agent and a tool that other agents can call, represents a maturing understanding of agent-to-agent interaction.
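
From the calling agent's side, invoking codex and codex-reply would go through MCP's standard `tools/call` method. The sketch below builds the two requests. The argument names (`prompt`, `sessionId`) and the placeholder session value are assumptions for illustration and should be checked against the tool schemas the server actually advertises.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids must be unique per connection


def tools_call(tool: str, arguments: dict) -> str:
    """Build an MCP `tools/call` request (JSON-RPC 2.0) for the named tool."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


# Argument names below are hypothetical.
start = tools_call("codex", {"prompt": "Add a unit test for parse_config"})
follow_up = tools_call(
    "codex-reply",
    {"sessionId": "<id returned by the first call>", "prompt": "Now run the tests"},
)
```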

Codex (2025)

OpenAI

Commercial product, launched May 2025; open-source CLI (MIT license)

Commercial product by OpenAI (no published paper)

A multi-surface coding agent with a formal JSON-RPC App Server protocol, two-layer security (sandbox mode + approval policy with OS-level enforcement), and a clear model/harness architectural separation.

Key Innovation

Formal JSON-RPC protocol with versioned schemas enabling multi-language client implementations; OS-level sandboxing (Seatbelt, bubblewrap+seccomp); encrypted conversation compaction for long-running sessions; composability as both agent and MCP server

Limitations

  • No published training paper for the codex-1 RL process

  • Rapid model naming changes (codex-1 → gpt-5.x-codex series)

  • Rust rewrite still catching up to TypeScript feature parity

  • App Server protocol is not an open standard (versioned per binary release)

agent-harness, json-rpc, sandboxing, app-server, rust, mcp-server

Comparison

| Dimension | Codex | Claude Code | OpenClaw |
|---|---|---|---|
| Wire Protocol | JSON-RPC 2.0 (JSONL/stdio, WebSocket) | No published protocol | Internal (undocumented) |
| Sandbox Enforcement | OS-level: Seatbelt (macOS), bubblewrap+seccomp (Linux) | Permission model + optional Auto mode (2nd LLM call) | Flat permissions (third-party wrappers) |
| Context Compaction | Encrypted compaction endpoint + prompt caching | 5-stage fallback (time clearing → summarization → memory → full summary → truncation) | Guard Context step (summarize/truncate) |
| Subagent Model | Fan-out/fan-in, max 6 threads, depth 1 | Fork/Teammate/Worktree, no recursion | Channel-to-agent bindings (no native subagents) |
| Open Source | MIT (CLI) | MIT (CLI) | MIT (full framework) |
| SWE-bench Verified | 72.1% (pass@1) | 80.9% | Not evaluated (general-purpose agent) |

Multi-Agent Coordination: From Academic Frameworks to Production Teams

Single-agent harnesses hit practical limits when tasks require parallel work, role specialization, or cross-cutting coordination. The multi-agent paradigm addresses this by splitting work across multiple agent instances that communicate and coordinate. The academic foundations were laid by MetaGPT and AutoGen; production harnesses are now building their own implementations.

MetaGPT (Hong et al., ICLR 2024 Oral) introduced SOP-driven multi-agent coordination. Instead of letting agents chat freely, MetaGPT encodes human Standard Operating Procedures into the pipeline: a Product Manager agent generates a PRD, an Architect produces a design doc, Engineers write code, and QA runs tests. Each role passes structured artifacts to the next. This assembly-line model dramatically reduces the incoherent outputs common in free-form multi-agent conversation, because each agent's input is a well-typed artifact (not an open-ended chat message) and each agent's role is precisely scoped. MetaGPT was ranked #1 in the LLM Agent category at ICLR 2024.
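
The assembly-line idea can be sketched as a chain of functions whose inputs and outputs are typed artifacts. This is a toy illustration, not MetaGPT's API: every class and function name is hypothetical, and each role returns canned data where a real role would call an LLM.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PRD:
    goal: str
    requirements: List[str]


@dataclass(frozen=True)
class Design:
    prd: PRD
    modules: List[str]


@dataclass(frozen=True)
class Code:
    design: Design
    files: Dict[str, str]


@dataclass(frozen=True)
class QAReport:
    passed: bool
    checked: List[str]


def product_manager(idea: str) -> PRD:
    return PRD(goal=idea, requirements=[f"support {idea}"])


def architect(prd: PRD) -> Design:
    return Design(prd=prd, modules=["core", "cli"])


def engineer(design: Design) -> Code:
    files = {f"{m}.py": f"# implements {m}\n" for m in design.modules}
    return Code(design=design, files=files)


def qa(code: Code) -> QAReport:
    checked = sorted(code.files)
    # Toy check: every file must be non-empty.
    return QAReport(passed=all(code.files[f] for f in checked), checked=checked)


def run_sop(idea: str) -> QAReport:
    """Each role consumes exactly the artifact type the previous role produced."""
    return qa(engineer(architect(product_manager(idea))))


report = run_sop("todo list")
```

Because `architect` accepts only a `PRD` and `engineer` only a `Design`, a role cannot be handed an open-ended chat message. That constraint is what the scoped-input argument above relies on.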

AutoGen (Wu et al., COLM 2024, from Microsoft Research) took the opposite approach: fully conversational multi-agent systems. The core abstraction is the 'conversable agent' — an entity that can be backed by an LLM, human input, a tool executor, or any combination. Agents pass messages to each other in programmable conversation patterns: sequential (agent A talks to agent B), group chat (multiple agents discuss), or nested (a conversation within a conversation). The power comes from flexibility: the same framework handles math problem-solving, coding, operations research, and human-in-the-loop workflows. The cost is that conversational coordination is less predictable than SOP-driven pipelines.
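
A minimal sketch of the conversable-agent idea, in plain Python rather than AutoGen's actual API: an agent is a name plus a reply function, and the reply function can be backed by anything. The class, the `converse` helper, and both backends are hypothetical stand-ins. Only the sequential pattern is shown.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Reply = Callable[[str], Optional[str]]  # returning None ends the conversation


@dataclass
class ConversableAgent:
    name: str
    reply: Reply
    inbox: List[Tuple[str, str]] = field(default_factory=list)

    def receive(self, sender: str, message: str) -> Optional[str]:
        self.inbox.append((sender, message))
        return self.reply(message)


def converse(a: ConversableAgent, b: ConversableAgent, opening: str,
             max_turns: int = 10) -> List[Tuple[str, str]]:
    """Sequential pattern: the agents alternate until one declines to reply."""
    transcript = [(a.name, opening)]
    speaker, listener, message = a, b, opening
    for _ in range(max_turns):
        response = listener.receive(speaker.name, message)
        if response is None:
            break
        transcript.append((listener.name, response))
        speaker, listener, message = listener, speaker, response
    return transcript


def calculator(message: str) -> Optional[str]:
    """Tool-executor backend: handles 'add x y', otherwise stays silent."""
    parts = message.split()
    if len(parts) == 3 and parts[0] == "add":
        return str(int(parts[1]) + int(parts[2]))
    return None


def assistant(message: str) -> Optional[str]:
    """Scripted stand-in for an LLM backend."""
    return f"The answer is {message}." if message.isdigit() else None


transcript = converse(
    ConversableAgent("assistant", assistant),
    ConversableAgent("calculator", calculator),
    "add 2 3",
)
```

Swapping `assistant` for a function that prompts a human, or `calculator` for one that calls a model, changes nothing in `converse`. The agent's backing is independent of the conversation pattern.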

ToolLLM (Qin et al., ICLR 2024 Spotlight) addressed an adjacent problem: tool discovery and planning across massive API landscapes. The framework includes ToolBench (16,464 real-world APIs across 49 categories), ToolLLaMA (a fine-tuned model for tool use), and a Depth-First Search Decision Tree (DFSDT) planner that explores multiple tool-call paths and backtracks when a path fails. The authors present DFSDT as a systematic tree-search approach to tool planning that handles API failures via backtracking rather than linear retry, a critical capability when coordinating across unreliable external services.
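
The backtracking behavior can be illustrated with a plain depth-first search over a toy API graph. This is a sketch of the search pattern only: in the actual planner an LLM proposes and evaluates the candidate calls, whereas here the candidates, the failing API, and the goal are hard-coded hypothetical values.

```python
from typing import Dict, List, Optional

# Hypothetical toy graph: each state lists candidate tool calls, tried in order.
CANDIDATES: Dict[str, List[str]] = {
    "start": ["search_flights_v1", "search_flights_v2"],
    "search_flights_v1": ["book_v1"],
    "search_flights_v2": ["book_v2"],
}
FAILING = {"book_v1"}  # simulates an API that returns an error
GOAL = {"book_v2"}


def call_api(name: str) -> bool:
    """Stand-in for a real API call; returns False on failure."""
    return name not in FAILING


def dfs_plan(state: str, path: Optional[List[str]] = None,
             max_depth: int = 5) -> Optional[List[str]]:
    """Depth-first search over tool calls; failures and dead ends backtrack."""
    path = path or []
    if state in GOAL:
        return path
    if len(path) >= max_depth:
        return None
    for tool in CANDIDATES.get(state, []):
        if not call_api(tool):
            continue  # failed call: try the next sibling instead of retrying
        found = dfs_plan(tool, path + [tool], max_depth)
        if found is not None:
            return found
    return None  # every child exhausted: backtrack to the caller


plan = dfs_plan("start")
```

The search first descends through `search_flights_v1`, finds that `book_v1` fails, backtracks, and succeeds through the second branch. A linear retry loop would instead keep calling the failing API.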

Production implementations are converging on simpler patterns. Claude Code's agent teams use file-based mailboxes and shared task lists with file locking — a straightforward filesystem-based coordination mechanism. Codex uses fan-out/fan-in with hard limits on concurrent threads and recursion depth. OpenClaw relies on channel-to-agent bindings with shared filesystem state. None of these approach the sophistication of MetaGPT's SOP pipelines or AutoGen's conversational topologies, reflecting a production engineering preference for simplicity and debuggability over coordination expressiveness.
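
A filesystem-based task list with locking can be sketched in a few lines. The file layout, the lock-file scheme, and the function names here are assumptions for illustration. They show the general technique, not Claude Code's actual on-disk format.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def claim_next_task(board: Path, agent: str) -> Optional[str]:
    """Atomically claim the first unowned task on a shared JSON task list."""
    lock = board.with_suffix(".lock")
    try:
        # O_EXCL makes creation fail if another agent already holds the lock.
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None  # busy: the caller may retry later
    try:
        tasks = json.loads(board.read_text())
        for task in tasks:
            if task["owner"] is None:
                task["owner"] = agent
                board.write_text(json.dumps(tasks))
                return task["id"]
        return None
    finally:
        os.close(fd)
        os.unlink(lock)  # release the lock even if reading or writing failed


def send(mailbox_dir: Path, recipient: str, body: str) -> None:
    """Append one message to the recipient's mailbox (one JSON object per line)."""
    with open(mailbox_dir / f"{recipient}.jsonl", "a") as f:
        f.write(json.dumps({"body": body}) + "\n")


workdir = Path(tempfile.mkdtemp())
board = workdir / "tasks.json"
board.write_text(json.dumps([{"id": "t1", "owner": None}, {"id": "t2", "owner": None}]))
first = claim_next_task(board, "agent-a")
second = claim_next_task(board, "agent-b")
send(workdir, "agent-b", f"{first} is taken")
```

Everything an agent did is visible as ordinary files, so a stuck team can be inspected with `cat` and `ls`. This is the debuggability that the paragraph above credits for the preference for simple mechanisms.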

The Scaffolding/Harness taxonomy paper by Bui (arXiv, March 2026) provides the clearest published framework for understanding these systems. It distinguishes scaffolding (static assembly before the first prompt: system prompt, tool schemas, subagent registry) from harness (runtime orchestration: tool dispatch, context compaction, safety enforcement, state management). This vocabulary helps explain why production multi-agent systems look so different from academic ones: academic systems focus on scaffolding (agent roles, communication topologies) while production systems spend most of their engineering on harness (context management, permission enforcement, crash recovery).

MetaGPT

Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zhengying Chen, Steven Zheng, Juergen Schmidhuber DeepWisdom, KAUST, Xiamen University

ICLR 2024 (Oral)

MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework

Encodes human Standard Operating Procedures into a multi-agent pipeline with distinct roles (PM, Architect, Engineer, QA) passing structured artifacts. An assembly-line model where each agent's output is a typed artifact consumed by the next.

Key Innovation

SOP-driven coordination with structured artifact passing reduces incoherent outputs compared to free-form multi-agent chat; ranked #1 in LLM Agent category at ICLR 2024

Limitations

  • Rigid pipeline — not suitable for tasks requiring dynamic role reassignment

  • Each role requires careful prompt engineering

  • Artifact schemas must be defined upfront

multi-agent, sop-driven, role-based, structured-artifacts, software-engineering

AutoGen

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen White, Doug Burger, Chi Wang Microsoft Research, Penn State University

COLM 2024

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

A framework where customizable 'conversable agents' pass messages in programmable conversation patterns (sequential, group chat, nested). Each agent can be backed by an LLM, human, tool executor, or combination.

Key Innovation

Decoupled agent role from communication topology — the same framework handles math, coding, QA, and human-in-the-loop workflows via different conversation patterns

Limitations

  • Conversational coordination is less predictable than pipeline-based systems

  • Group chat can produce unfocused discussion

  • Token costs scale with number of agents and conversation length

multi-agent, conversable-agents, conversation-patterns, human-in-the-loop

ToolLLM

Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Zhiyuan Liu, Maosong Sun Tsinghua University, OpenBMB

ICLR 2024 (Spotlight)

ToolLLM: Facilitating Large Language Models to Master 16,000+ Real-World APIs

A complete tool-use framework with ToolBench (16,464 real APIs), ToolLLaMA (fine-tuned model), and a Depth-First Search Decision Tree (DFSDT) planner for multi-step tool planning with backtracking.

Key Innovation

DFSDT planner — first systematic tree-search approach to tool-use that handles API failures via backtracking rather than linear retry; ToolLLaMA matches ChatGPT on tool use while being open-source

Limitations

  • API instability — real-world APIs change and break over time

  • DFSDT search can be slow for deeply nested tool sequences

  • Evaluation assumes correct API documentation is available

tool-use, api-planning, tree-search, backtracking, fine-tuning

OpenDev Harness Paper

Nghi D. Q. Bui OpenDev (independent)

arXiv, March 2026

Building Effective AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned

Distinguishes scaffolding (static assembly: system prompt, tool schemas, subagent registry) from harness (runtime orchestration: tool dispatch, context compaction, safety enforcement). Describes a six-phase ReAct loop implementation.

Key Innovation

Provides the clearest published taxonomy distinguishing scaffolding from harness — terminology now appearing in follow-on work; documents practical context engineering techniques absent from academic papers

Limitations

  • Practitioner report rather than peer-reviewed paper

  • Single-system focus (OpenDev)

  • No formal evaluation on standard benchmarks

taxonomy, scaffolding, harness, context-engineering, practitioner-report

Comparison

| System | Coordination Model | Agent Communication | Best For |
|---|---|---|---|
| MetaGPT | SOP pipeline with typed artifacts | Structured artifact passing (PRD → design → code → tests) | Well-defined workflows with clear role boundaries |
| AutoGen | Conversable agents with programmable patterns | Free-form message passing (sequential, group chat, nested) | Flexible multi-agent tasks requiring dynamic interaction |
| ToolLLM | Tree-search planner with backtracking | Single agent with DFSDT planning over API sequences | Complex multi-step API orchestration with failure recovery |
| Claude Code Teams | Shared task list + file-based mailbox | File-system primitives with file locking | Parallel coding tasks with simple coordination |
| Codex Subagents | Fan-out/fan-in with depth limits | Result-only communication (no peer-to-peer) | Focused subtasks with consolidated results |

Cross-Cutting Comparison: Five Agent Harnesses Head-to-Head

| Dimension | OpenClaw | Claude Code | Codex | OpenHands | SWE-agent |
|---|---|---|---|---|---|
| Primary Use Case | General-purpose autonomous agent | Software engineering | Software engineering | Software engineering | Software engineering (research) |
| Architecture | Persistent daemon (5 components) | CLI with ~40 tools + QueryEngine | Multi-surface App Server (JSON-RPC) | Docker-sandboxed platform | Custom ACI + agent loop |
| Agent Loop | ReAct via Brain component | Three-phase (gather → act → verify) | Model/harness iterative loop | CodeAct (executable Python) | Custom ACI-based loop |
| Sandboxing | Flat permissions (third-party wrappers) | Permission model (4 modes) + optional 2nd LLM | OS-level (Seatbelt, bubblewrap+seccomp) | Docker containers | Docker containers |
| Context Management | Guard Context (summarize/truncate) | 5-stage fallback compaction | Prompt caching + encrypted compaction | Event-stream state | Windowed file viewer |
| Multi-Agent | Channel bindings (no native peer-to-peer) | Subagents + Agent Teams (experimental) | Subagents (fan-out/fan-in, max 6) | Single agent | Single agent |
| Open Source | Yes (MIT, 247K stars) | Yes (MIT) | Yes (MIT, CLI) | Yes (MIT) | Yes (MIT) |
| Wire Protocol | Internal | None published | JSON-RPC 2.0 (JSONL) | Event stream | Custom |
| SWE-bench Verified | Not evaluated (general-purpose) | 80.9% | 72.1% | 53% | 12.5% Full (2024 SOTA) |
| Scheduling | Built-in Heartbeat | External | External (GitHub Action) | External | External |
