The five roles inside real agent stacks in 2026
Practitioners aren't picking one model for agents. They're routing across five roles. Here's which models fill each slot and why.
The era of picking a single model for your agent framework is over. Practitioner-reported usage patterns across OpenClaw, Cline, Roo Code, Aider, and similar tools point to a consistent five-role architecture: a primary driver for orchestration and judgment, a planner for large-context reasoning, an executor/coder optimized on cost, a background worker for disposable tasks, and a local/open-source fallback for privacy or budget constraints. The models filling each slot are converging faster than the benchmarks would predict.
The role map
| Model | Creator | Common agent role | Main strength | Typical failure mode | Best-fit workload |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | Anthropic | Primary driver | Reliable multi-tool chains, strong judgment | Higher cost per session | Long orchestration loops |
| Gemini 2.5 Pro | Google | Planner | Very large context, architecture reasoning | Loops, bloated edits, context growth | Feature definition, codebase-wide planning |
| GPT-5.4 Mini | OpenAI | Executor/coder | Strong coding per dollar | Less autonomy on ambiguous tasks | Batch coding, scoped execution |
| GPT-5.2-Codex | OpenAI | Executor/coder | High throughput coding at 105 tok/s | Narrower general reasoning | Code generation pipelines |
| Qwen3-Coder | Alibaba | Local/open-source fallback | Best open-source Act-mode option | Breaks on long multi-tool loops | Local coding, cheap execution |
| Gemini 2.5 Flash | Google | Background worker | Speed, low cost | Poor orchestration judgment | Heartbeats, summaries, context condensing |
| Claude Haiku 4.5 | Anthropic | Background worker | Fast, cheap, predictable | Too thin for complex decisions | Cron jobs, simple checks |
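One way to make the role map concrete is to encode it as a lookup table that an orchestrator consults per task. The sketch below is illustrative only: the `Role` enum, `ROLE_MAP`, and `pick_model` are hypothetical names, the model strings are plain labels copied from the table rather than provider API identifiers, and the order within each list is an assumed preference, not a measured ranking.

```python
from enum import Enum


class Role(Enum):
    DRIVER = "primary driver"
    PLANNER = "planner"
    EXECUTOR = "executor/coder"
    BACKGROUND = "background worker"
    LOCAL_FALLBACK = "local/open-source fallback"


# Labels taken from the role map table; list order is an assumption.
ROLE_MAP = {
    Role.DRIVER: ["Claude Sonnet 4.6"],
    Role.PLANNER: ["Gemini 2.5 Pro"],
    Role.EXECUTOR: ["GPT-5.4 Mini", "GPT-5.2-Codex"],
    Role.BACKGROUND: ["Gemini 2.5 Flash", "Claude Haiku 4.5"],
    Role.LOCAL_FALLBACK: ["Qwen3-Coder"],
}


def pick_model(role, unavailable=frozenset()):
    """Return the first candidate for a role that is not marked unavailable."""
    for name in ROLE_MAP[role]:
        if name not in unavailable:
            return name
    raise LookupError(f"no available model for role: {role.value}")


if __name__ == "__main__":
    print(pick_model(Role.EXECUTOR))
    print(pick_model(Role.BACKGROUND, unavailable=frozenset({"Gemini 2.5 Flash"})))
```

Keeping the map as data rather than scattering model names through the code makes it cheap to swap a slot when a model's price or reliability changes.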
Why Claude Sonnet 4.6 keeps winning the driver seat
Claude Sonnet 4.6 (Anthropic) scores 51.7 on quality index at $6.00/M tokens. That's not the cheapest option. But in OpenClaw-style agent benchmarks, it repeatedly hits 5/5 on task completion where cheaper models collapse. One practitioner-reported benchmark had Sonnet 4.6 and o4-mini both at 5/5, Grok 4.1 Fast at 3/5, Gemini 2.5 Flash at 1/5, and DeepSeek V3.2 at 0/5.
The pattern is consistent: Sonnet 4.6 makes better decisions after the fifth or sixth tool call in a chain. It doesn't hallucinate tool arguments as often, doesn't silently stall, and recovers from ambiguous tool outputs more gracefully. Practitioners report it "feels much better" than Flash or DeepSeek V3 even on simple tasks, and that models like Kimi K2.5 tend to give up partway through.
The tradeoff is real. At $6.00/M tokens, running Sonnet 4.6 as your only model burns budget fast on tasks that don't need its judgment.
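To put rough numbers on that tradeoff, here is a minimal cost sketch. The per-million-token prices are the ones quoted in this article for Sonnet 4.6 and GPT-5.4 Mini; the 2M-token session size and the 25/75 split between driver and executor are invented purely for illustration, and `session_cost` is a hypothetical helper.

```python
# Prices per million tokens as quoted in this article.
PRICE_PER_M = {
    "Claude Sonnet 4.6": 6.00,
    "GPT-5.4 Mini": 1.69,
}


def session_cost(tokens_by_model):
    """Sum the dollar cost of a session given token counts per model."""
    return sum(
        PRICE_PER_M[model] * tokens / 1_000_000
        for model, tokens in tokens_by_model.items()
    )


# Assumed workload: 2M tokens in one long agent session.
single_model = session_cost({"Claude Sonnet 4.6": 2_000_000})

# Assumed split: the driver keeps 25% of the tokens, the executor takes 75%.
routed = session_cost({
    "Claude Sonnet 4.6": 500_000,
    "GPT-5.4 Mini": 1_500_000,
})

if __name__ == "__main__":
    print(f"single model: ${single_model:.2f}")
    print(f"routed:       ${routed:.2f}")
```

Under these assumptions the routed session costs a little under half of the single-model one. The real saving depends entirely on how much of the work can be delegated without losing reliability.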
Gemini 2.5 Pro as the planning layer
In Cline-style workflows, Gemini 2.5 Pro (Google) is overrepresented in the planning phase. Its massive context window makes it the natural choice for ingesting entire codebases, writing feature specs, and producing architecture plans. Quality index sits at 48.4 with output at 119 tok/s, fast enough for interactive planning sessions.
But practitioners also report real friction: weird loops where the model repeats itself, bloated diffs with unnecessary changes, odd tool-call behavior, and context windows that grow unpredictably. Several users describe a pattern of starting with Gemini for planning, then switching back to Sonnet for execution after hitting reliability issues. Gemini plans well; it doesn't always execute cleanly.
GPT-5 Mini and Codex: the cost-efficient executors
GPT-5.4 Mini (OpenAI) at 48.1 quality and $1.69/M tokens is the workhorse model in many Roo Code setups. In independent minimal-agent SWE-bench testing, GPT-5 Mini gave up only about 5 percentage points versus full GPT-5 while costing roughly one-fifth as much. That's a compelling quality-for-cost tradeoff for scoped coding tasks.
GPT-5.2-Codex (OpenAI) pushes 105 tok/s at $4.81/M tokens with 49.0 quality, a strong option when throughput matters more than the cost floor. Both models show up frequently as the "hands" in agent stacks where Sonnet or Gemini Pro handles the "brain."
Qwen3-Coder: the open-source model people actually use
In fresh community benchmarking on GitHub tasks, Qwen3-Coder was described as the strongest open-source performer in the tested set. It's the model practitioners actually deploy in Act mode for local setups and cheap execution loops.
But "strongest open-source" still means weaker than cloud frontier models on sustained multi-tool orchestration. Qwen3-Coder works for narrow coding tasks and single-shot generation. It struggles with the kind of persistent-memory, multi-step loops that define real agent sessions. It matters because it's genuinely competitive on cost and privacy, but it doesn't replace Sonnet 4.6 as a primary driver in most setups.
The cheap background layer
Gemini 2.5 Flash and Claude Haiku 4.5 fill the same niche: disposable compute for tasks where failure is cheap. Heartbeat checks, cron-triggered summaries, context condensing between agent steps, lightweight sub-agent calls. These models run at high tok/s and low cost, which is exactly what you want when you're making dozens of calls per orchestration loop that don't require judgment.
Where budget models still fail
The gap between "good benchmark score" and "reliable agent behavior" is widest in three areas. First, tool-call hallucination: cheaper models invent function arguments or call tools that don't exist, which causes cascading failures in multi-step loops. Second, silent stalls: the model stops making progress but doesn't signal failure, burning tokens on empty loops. Third, fake completion: the model reports a task as done when it isn't, which is worse than an honest error because the orchestrator moves on.
The practitioner-reported OpenClaw-style benchmark cited earlier is instructive. Gemini 2.5 Flash scored 1/5 and DeepSeek V3.2 scored 0/5, not because they can't generate code, but because they can't sustain reliable tool use across a full task. The failure mode isn't "bad code." It's "broken agency."
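The three failure modes in this section (tool-call hallucination, silent stalls, fake completion) can each be met with a small orchestrator-side guard. The sketch below is a minimal illustration under stated assumptions, not how any of the tools named here implement it: `ToolSpec`, `validate_tool_call`, `StallDetector`, and `accept_completion` are hypothetical names, and the "fingerprint" is assumed to be any hashable summary of task state, such as a hash of the working tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical description of one registered tool's arguments."""
    required: frozenset
    optional: frozenset = frozenset()


def validate_tool_call(name, args, registry):
    """Guard 1: return a list of problems; an empty list means well-formed."""
    spec = registry.get(name)
    if spec is None:
        return [f"unknown tool: {name}"]
    given = set(args)
    problems = [f"missing argument: {a}" for a in sorted(spec.required - given)]
    problems += [
        f"unexpected argument: {a}"
        for a in sorted(given - spec.required - spec.optional)
    ]
    return problems


@dataclass
class StallDetector:
    """Guard 2: flag a run whose observable state stops changing."""
    max_idle_steps: int = 3
    _last: object = None
    _idle: int = 0

    def observe(self, fingerprint):
        """Return True once the fingerprint has repeated max_idle_steps times."""
        if fingerprint == self._last:
            self._idle += 1
        else:
            self._last, self._idle = fingerprint, 0
        return self._idle >= self.max_idle_steps


def accept_completion(claimed_done, verify):
    """Guard 3: trust a 'done' claim only if an independent check also passes."""
    return bool(claimed_done) and bool(verify())
```

In use, the orchestrator would reject a call before dispatching it, abort or escalate when the detector fires, and run something like a test suite as the `verify` callback before moving on. These guards only catch malformed calls and visible lack of progress; they do not improve the underlying model's judgment.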
Local-first stacks: useful but bounded
Local models handle routing decisions, summarization, and narrow coding tasks well. For practitioners who need data to stay on-premises, Qwen3-Coder and similar open-weight models are functional for scoped work. But long multi-tool loops with persistent memory still expose the gap. Context management, error recovery, and tool-call reliability all degrade faster on local models than on cloud frontier options. The practical pattern is hybrid: local for what you can, cloud for what you must.
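The hybrid pattern can be expressed as a small routing rule. This is a sketch under assumptions: the `Task` fields, the `choose_tier` function, and the threshold of three tool calls are all hypothetical and would need tuning against real sessions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task descriptor used only for this illustration."""
    must_stay_local: bool          # data may not leave the premises
    expected_tool_calls: int       # rough estimate of chain length
    needs_persistent_memory: bool  # spans multiple steps with shared state


def choose_tier(task, local_tool_call_limit=3):
    """Return "local" or "cloud" for a task; the limit is an assumed cutoff."""
    if task.must_stay_local:
        # The privacy constraint wins even if reliability may suffer.
        return "local"
    if task.needs_persistent_memory:
        return "cloud"
    if task.expected_tool_calls > local_tool_call_limit:
        return "cloud"
    return "local"


if __name__ == "__main__":
    print(choose_tier(Task(False, 2, False)))   # short, stateless task
    print(choose_tier(Task(False, 10, False)))  # long tool chain
    print(choose_tier(Task(True, 10, True)))    # privacy constraint wins
```

Putting the privacy check first means a sensitive task never silently escalates to the cloud; the cost is that hard, sensitive tasks run on the weaker tier and may need closer supervision.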
Recommendation matrix
| User profile | Recommended stack | Why it works | Main tradeoff |
|---|---|---|---|
| Solo dev on a budget | GPT-5.4 Mini (driver) + Gemini 2.5 Flash (background) + Qwen3-Coder (local fallback) | $1.69/M primary cost, strong coding quality, open-source option for offline work | Less reliable on long orchestration chains than Sonnet 4.6 |
| Power user, long agent sessions | Claude Sonnet 4.6 (driver) + Gemini 2.5 Pro (planner) + GPT-5.4 Mini (executor) + Haiku 4.5 (background) | Best reliability on multi-tool chains, large-context planning, cost-efficient execution layer | Higher total spend; Gemini planning layer needs monitoring for loops |
| Privacy-sensitive / local-first | Qwen3-Coder (primary) + local summarizer (background) + cloud fallback for complex tasks | Data stays on-premises for most work | Noticeably weaker on sustained agentic loops; cloud fallback needed for hard tasks |
The right model depends on the role it's filling, not its headline benchmark. Use FindLLM's LLM Selector to filter by cost, speed, and coding quality for each slot in your stack, or compare models side-by-side to match specific workload requirements.