Mar 2026 · 10 min read

the model is not the product

AI Agents · LLMs · Engineering

claude opus 4.5 scores 42% on core-bench with one scaffold. 78% with another. same model, same benchmark, same weights. the only variable is the code that wraps it.

if that number doesn't bother you, you haven't thought hard enough about what you're actually building when you build an agent.

the ai discourse has a fixation problem. every week there's a new model drop, a new benchmark, a new leaderboard reshuffle, and a wave of people treating capability improvements as the primary lever for making agents work. it isn't. the companies actually shipping production agents (cursor, manus, cognition with devin, anthropic with claude code) have all quietly converged on a different conclusion: the harness is the product. the model is a component.

a harness is everything around the model that makes it useful. the execution loop, the tool definitions, how errors get handled, how state gets managed, and most importantly, what information reaches the model and when. strip away the marketing and every production agent is the same core structure:

while (model returns tool calls):
  execute tool → capture result → append to context → call model again

that's it. cursor's architecture fits in that loop. so does claude code. so does manus. the engineering is in what surrounds it.
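that loop is short enough to write down. here's a minimal sketch in python, with a made-up `call_model` contract and plain-function tools (no vendor's actual api implied):

```python
# minimal agent loop: call model, execute tool calls, feed results back.
# `call_model` and its reply shape are hypothetical, not any vendor's api.
def run_agent(call_model, tools, user_message, max_turns=20):
    context = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        reply = call_model(context)
        if "tool" not in reply:                            # plain text: loop ends
            return reply["text"]
        result = tools[reply["tool"]](**reply["args"])     # execute tool
        context.append({"role": "assistant", "tool_call": reply})
        context.append({"role": "tool", "content": str(result)})  # capture result
    return None  # turn budget exhausted
```

everything the rest of this post discusses (lazy tool loading, logit masking, checklists, lint gates) lives in what surrounds these ten lines.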

diagram 01 — the agent execution loop
fig 01 — every production agent (claude code, cursor, manus) runs this same loop. the engineering is in the harness around it.

context is the product, not an afterthought

the specific engineering decision that separates working agents from demos is something borrowed from 1980s ui/ux research. john carroll at ibm and later jakob nielsen called it progressive disclosure: show only what is needed now, reveal complexity on demand. collapsible menus exist because dumping every possible option on a user at once degrades their ability to make decisions. the same dynamic applies to language models, except instead of cognitive load the problem is attention fragmentation.

liu et al. documented this in a 2024 tacl paper: llm performance follows a u-shaped curve based on where relevant information sits in the input. highest accuracy when the relevant content is at the beginning or end. degraded, measurably and consistently, when it's buried in the middle. even with long-context models. even with 2 million token windows. the architecture of attention hasn't been solved by making the window bigger; it's just been made more expensive to fill badly.

diagram 02 — the u-shaped attention curve
fig 02 — the u-shaped attention curve. performance peaks when relevant info is at the start or end of the context. harnesses that load context progressively exploit this directly.

cursor ran an a/b test on this directly. their mcp servers include tools with long definitions, most of which go unused in any given session. they tried giving the agent only tool names as static context and fetching full definitions on-demand. token usage dropped 46.9%. that's not a rounding error. that's nearly half the tokens, gone, by not frontloading information the agent probably won't need.

vercel's case study is more dramatic. they removed 80% of their agent's tools. token usage went from 145,463 to 67,483. steps from 100 to 19. latency from 724 seconds to 141. the agent went from failing tasks to completing them. less capability, literally better performance. the information overload was the bug.

cloudflare found the same thing from a different angle. their code mode approach, letting the agent write code against a typed sdk rather than invoking individual mcp tools, cuts token usage by 32% for simple single tasks and 81% for complex multi-step operations. for their full api surface, the naïve mcp approach would consume 1.17 million tokens before the agent does anything. code mode brings that to roughly 1,000. same api coverage. 99.9% fewer tokens. the model's fluency in code turns out to be a compression algorithm.
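the shape of code mode is easy to illustrate. in this sketch the sdk and the execution environment are invented (cloudflare's real implementation runs model-written code in a sandboxed isolate against a typed api): n operations collapse into one program, and only the final value re-enters the model's context.

```python
# code-mode sketch: the model writes a program against a small sdk instead
# of issuing one tool call per operation. intermediate results stay inside
# the program; only `result` flows back into context.
class FakeSDK:
    def list_zones(self):
        return ["zone-a", "zone-b"]

    def purge_cache(self, zone):
        return f"purged {zone}"

def run_agent_code(code, sdk):
    namespace = {"sdk": sdk, "result": None}
    exec(code, namespace)        # a real harness would sandbox this
    return namespace["result"]
```

in the naïve tool-call version, each `purge_cache` is a full model round-trip with its result dumped into context; here the loop over zones never touches the context window at all.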

diagram 03 — token cost: before vs. after harness optimization
fig 03 — token reduction from harness-level context discipline. vercel: −54%. cursor lazy tool loading: −47%. cloudflare code mode (complex tasks): −81%.

two valid architectures, one principle

manus has rewritten their framework five times. each rewrite removed things. their current architecture keeps all ~29 tools loaded permanently, not because they want the model to see all of them, but because removing tools near the front of the context invalidates the kv-cache for all subsequent tokens. so instead of toggling tool availability through definitions, they control it through logit masking: constraining which output tokens the model can generate during decoding. same effect, cache preserved. this is the kind of optimization that only emerges when you've watched your infra bill in real time.
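logit masking is conceptually tiny. a toy version over a dict of tool-name logits (real implementations mask token ids inside the decoder, and manus applies the mask per agent state; this is just the shape of the idea):

```python
import math

# availability by masking: every tool definition stays in context (so the
# kv-cache stays valid), but disallowed tools can never be sampled.
def mask_logits(logits, allowed):
    return {tok: (val if tok in allowed else -math.inf)
            for tok, val in logits.items()}

def pick(logits):
    # greedy decode stand-in: highest-logit token wins
    return max(logits, key=logits.get)
```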

cursor made the opposite choice: lazy load tool definitions. both work. the right answer probably depends on your specific token economics, which is itself an argument for treating harness design as a first-class engineering discipline with its own benchmarks, not a footnote to model selection.

diagram 04 — two valid answers to tool overload
fig 04 — opposite approaches to tool overload: manus uses logit masking (all tools loaded, outputs constrained), cursor uses lazy loading (definitions fetched on demand). both work. the right choice depends on your token economics.

claude code's approach is worth examining closely because anthropic has been unusually transparent about it. the system prompt compiles ~110+ conditional strings. tool results carry injected system reminders (fixed text appended after every tool execution) which achieve higher behavioral adherence than system-prompt-only instructions because the text repeats on every call, not just at session start. the todowrite tool does nothing functionally. it's a no-op that forces the agent to articulate and track its plan, a harness-level trick for keeping a model coherent over long trajectories. langchain's analysis of deep agents calls it out explicitly: fake planning tools for real behavioral anchoring.
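a todowrite-style tool fits in a dozen lines. the interface below is invented, not claude code's actual schema, but it shows the trick: the tool "does" nothing except echo the current plan back, so the plan re-enters the context window on every call.

```python
# a planning tool that is functionally a no-op: it records the plan and
# returns it verbatim, anchoring the agent's behavior without side effects.
class TodoTool:
    def __init__(self):
        self.items = []   # list of (text, done) pairs

    def write(self, todos):
        self.items = list(todos)
        return "\n".join(
            f"[{'x' if done else ' '}] {text}" for text, done in self.items
        )
```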

the skill loading pattern is equally deliberate. skills are stored as .claude/skills/ files and are not preloaded into every conversation. unlike claude.md which loads every session, skills load only when claude detects relevance. a project with dozens of skill files doesn't inject all of them into every context. they're fetched on demand. progressive disclosure implemented at the filesystem level.
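the same pattern fits in a few lines of filesystem code. the relevance check here is a crude substring match purely for illustration; claude's actual skill selection is model-driven, not keyword-driven.

```python
from pathlib import Path

# progressive disclosure at the filesystem level: skill files sit on disk,
# and only the ones judged relevant to the current task are read into context.
def load_relevant_skills(skills_dir, task, max_skills=2):
    loaded = []
    for path in sorted(Path(skills_dir).glob("*.md")):
        topic = path.stem.replace("-", " ")
        if topic in task.lower():          # stand-in for a real relevance check
            loaded.append(path.read_text())
            if len(loaded) == max_skills:
                break
    return loaded
```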

anthropic's published work on long-running agents adds another layer. when they tried to build a claude.ai clone using claude opus 4.5, the agent kept failing in two ways: trying to build everything at once, and losing track of its own state across context windows. the fix wasn't a smarter model. it was a different harness. they added an initializer agent that wrote out 200+ individual feature requirements before any coding started. every feature marked "failing." each subsequent coding session had a concrete checklist to work from. the model was the same. the harness gave it memory it didn't have before.
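the pattern generalizes: externalize state the model can't hold. a sketch of that checklist as a json file on disk (the schema is invented; anthropic's post describes the pattern, not this format):

```python
import json
from pathlib import Path

# harness-level memory: every feature starts "failing" in a file on disk,
# and each new session reads the file instead of relying on recall.
def init_checklist(path, features):
    Path(path).write_text(json.dumps({f: "failing" for f in features}, indent=2))

def mark_passing(path, feature):
    state = json.loads(Path(path).read_text())
    state[feature] = "passing"
    Path(path).write_text(json.dumps(state, indent=2))

def still_failing(path):
    state = json.loads(Path(path).read_text())
    return [f for f, status in state.items() if status == "failing"]
```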

diagram 05 — progressive disclosure in context loading
fig 05 — naive context loading vs. progressive disclosure. the same information, staged differently. claude-mem research shows static loading at 25K tokens achieves 0.8% signal efficiency; progressive disclosure achieves 100% at 955 tokens.

the benchmark evidence

the research synthesis from langchain is what makes the scaffold-over-model argument hardest to dismiss. their deepagents-cli went from 52.8% to 66.5% on terminalbench 2.0 by changing only the harness. 13.7 points. on the same model. sonnet 4 went from 33% to 47% under different scaffolds. sonnet 4.5 from 44% to 62%.

the swe-bench numbers tell a similar story from a different direction. models that score above 70% on swe-bench verified drop to below 25% on swe-bench pro, a harder benchmark using enterprise codebases. the scaffold stays the same; the task gets more realistic. gpt-5 scores 23.3%. claude opus 4.1 hits 23.1%. on real enterprise codebases, below 20%. the benchmark inflation from 2024 to 2025 wasn't models getting radically smarter. it was scaffolding getting better and tasks staying artificially clean.

openai's own preparedness framework documented this: gpt-4's performance on swe-bench lite ranged from 2.7% with an early rag-based scaffold to 28.3% with the CodeR scaffold. a ten-fold gap. same model. the gap between the best and worst scaffolds for any given model frequently exceeds the gap between models.

diagram 06 — same model, different scaffolds: performance variance
fig 06 — scaffold changes, not model upgrades, drive these performance swings. the gap between worst and best scaffold for a model frequently exceeds the gap between models.

dex horthy, who developed the 12 factor agents methodology, puts a specific threshold on this: push past 40% of the model's input capacity and you enter what he calls the "dumb zone." signal-to-noise degrades, attention fragments, and agents start making mistakes that look like reasoning failures but are actually a poorly designed harness drowning the model in irrelevant context.
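that threshold is trivially enforceable at the harness level. a sketch (everything here beyond the 40% figure, including the compaction strategy, is illustrative):

```python
# budget guard around the ~40% "dumb zone" threshold: once the context
# crosses it, compact before calling the model again.
DUMB_ZONE_FRACTION = 0.40

def over_budget(tokens_used, context_window):
    return tokens_used / context_window > DUMB_ZONE_FRACTION

def compact(messages, keep_recent=5):
    # illustrative compaction: fold old turns into a one-line summary,
    # keep the recent tail verbatim (fresh info stays at the end)
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = f"[{len(old)} earlier turns summarized]"
    return [summary] + recent if old else recent
```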

the enterprise picture adds another dimension entirely. salesforce's 2026 connectivity benchmark found that the average enterprise now runs 12 ai agents, but only 27% of them are connected to the rest of the stack. 73% are shadow agents: unmonitored, ungoverned, no harness. microsoft's telemetry says over 80% of fortune 500 companies have active ai agents, many built by teams that never coordinated with platform engineering. the problem isn't building agents anymore. it's building the infrastructure that controls them.


the companies actually winning

aider's approach to context is probably the most technically novel thing happening in this space. they built a pagerank-based repository map using tree-sitter: parse the codebase to extract definitions, build a graph where files are nodes and dependencies are edges, run pagerank to rank symbols by importance, use binary search to fit the most critical content within a token budget. it's a fundamentally different answer to the same question: how do you give an agent the context it needs without giving it everything. but it's still progressive disclosure. the mechanism is just graph theory instead of lazy loading.
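a toy version of the idea: plain power-iteration pagerank over a file-level import graph, with a greedy fill standing in for aider's binary search, and symbols coarsened to whole files. the graph and token counts are made up.

```python
# rank files by pagerank over the dependency graph, then fill a token
# budget highest-rank first. aider's real map ranks tree-sitter symbols
# and binary-searches the detail level; this is the file-level toy.
def pagerank(graph, damping=0.85, iters=50):
    nodes = list(graph)
    rank = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, targets in graph.items():
            share = damping * rank[n] / (len(targets) or len(nodes))
            for m in (targets or nodes):   # dangling nodes spread evenly
                new[m] += share
        rank = new
    return rank

def repo_map(graph, token_cost, budget):
    ranks = pagerank(graph)
    chosen, used = [], 0
    for f in sorted(graph, key=ranks.get, reverse=True):
        if used + token_cost[f] <= budget:
            chosen.append(f)
            used += token_cost[f]
    return chosen
```

the heavily-imported utility file outranks the entrypoint that imports it, which is exactly the inversion you want: the agent sees the code everything depends on first.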

swe-agent's linter-gated edits are simpler and just as instructive. when the agent issues an edit command, a linter runs automatically. syntactically incorrect code gets rejected; the agent must retry. one harness-level guardrail, 3% performance improvement. no model change. the harness doesn't trust the model to get it right; it checks and enforces.
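the guardrail itself is a few lines. here `ast.parse` stands in for swe-agent's actual linter, and the interface is invented: broken syntax never reaches the file, and the error goes back to the agent as the tool result.

```python
import ast
from pathlib import Path

# lint-gated edit: syntactically broken python is rejected before the
# write happens; the agent gets the error back and must retry.
def gated_write(path, new_contents):
    try:
        ast.parse(new_contents)              # the guardrail
    except SyntaxError as e:
        return f"edit rejected (line {e.lineno}): {e.msg} -- fix and retry"
    Path(path).write_text(new_contents)
    return "edit applied"
```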

the uncomfortable implication of all this is that the companies currently winning the agent race aren't winning because of proprietary model access. they're winning because they've invested deeply in harness engineering, which is harder to copy than it looks and almost completely absent from the benchmarks people use to evaluate model capability. there's no standard benchmark for comparing harness designs. cursor's 46.9% token reduction is one of the very few published numbers. everyone is optimizing in the dark, which is partly why the teams who've been doing it longest are so far ahead.

the teams shipping the best agents keep simplifying. manus: five rewrites, each one removed things. anthropic designs claude code's scaffold to shrink as models improve. replit went from one agent to three but each individual agent got simpler. over-engineering is the default failure mode, and it's also invisible on leaderboards, which is why it persists.

the model is the engine. the harness is the car. and right now, most of the ai field is standing in a showroom debating horsepower.


sources: langchain deep agents report · cursor dynamic context discovery · manus context engineering blog · anthropic effective harnesses post · liu et al. tacl 2024 · hal leaderboard / princeton · cloudflare code mode blog · swe-bench pro (scale ai) · verdant swe-bench verified report · salesforce 2026 connectivity benchmark · horthy 12 factor agents · vercel agent case study (phil schmid)