The agent harness: why the model is the smallest part of an AI agent
Here is a thought experiment. Take a working AI agent — one that books travel, triages tickets, or fixes code — and quietly swap its language model for a different frontier model. Most of the time, very little changes. Now instead swap its harness: the loop, the tools, the memory, the rules around the model. Everything changes. That asymmetry is the most important and least understood fact about building agents today — the model is the smallest part of the system.
What an "agent harness" actually is
A language model does exactly one thing: given some text, it predicts more text. That is not an agent. An agent is what you get when you wrap that model in software that lets it take actions, see the results, remember what happened, and keep going until a goal is met. The harness is all of that wrapping — every piece of code, configuration and execution logic that is not the model itself.
If the model is a brain in a jar, the harness is the body and nervous system: the senses that bring information in, the hands that act on the world, the memory, and the reflexes that stop it doing something dangerous. Most descriptions of a harness break it into four core parts:
- An agent loop — the cycle that runs the model again and again until the task is done.
- A tool interface — the defined set of actions the agent is allowed to take.
- Context management — deciding what the model gets to see on each turn.
- Control — the validation, permissions and limits that govern what actually runs.
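One way to picture how the four parts fit together is as fields on a single object, with the model passed in from outside. This is only an illustrative sketch: the names Harness, Action, build_context, is_allowed and turn are hypothetical and do not come from any particular framework.

```python
from dataclasses import dataclass
from typing import Any, Callable

Action = dict[str, Any]  # assumed shape: {"tool": "echo", "args": {"text": "hi"}}


@dataclass
class Harness:
    """Hypothetical container for the four parts; the model is deliberately not a field."""
    tools: dict[str, Callable[..., Any]]            # tool interface
    build_context: Callable[[str, list[str]], str]  # context management
    is_allowed: Callable[[Action], bool]            # control
    max_turns: int = 10                             # upper bound for the agent loop

    def turn(self, model: Callable[[str], Action], goal: str, history: list[str]) -> Any:
        """One pass of the loop: build context, consult the model, check, execute."""
        action = model(self.build_context(goal, history))
        if not self.is_allowed(action):
            return "refused"
        return self.tools[action["tool"]](**action["args"])


def echo(text: str) -> str:
    return text


harness = Harness(
    tools={"echo": echo},
    build_context=lambda goal, history: "\n".join([goal, *history]),
    is_allowed=lambda action: action["tool"] == "echo",
)


# Two stand-in "models": swapping one for the other leaves the harness untouched.
def model_a(context: str) -> Action:
    return {"tool": "echo", "args": {"text": "hi"}}


def model_b(context: str) -> Action:
    return {"tool": "delete_all", "args": {}}
```

Here harness.turn(model_a, "greet", []) runs the echo tool, while the same call with model_b is refused by the control check before any tool is looked up.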
The loop at the centre of everything
A single model call answers a question. An agent pursues a goal, and that requires iteration: plan a step, take it, look at what happened, decide the next step. This plan–act–observe cycle is the engine of every agent — and crucially, the harness runs the loop, not the model. The model is simply consulted once per turn.
    # the agent loop, in spirit
    while not done:
        step = model(context)                # model proposes ONE action
        result = harness.run(step)           # harness validates + executes it
        context = remember(context, result)  # observe, then loop again
Everything that makes an agent feel intelligent — persistence, recovery from mistakes, multi-step reasoning — lives in how well this loop is designed, not in the model weights.
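The loop sketched above can be made runnable in a few lines. The version below is a minimal sketch under stated assumptions: the model is any callable that takes the context and returns one action as a dict, a "finish" action means the goal is met, and run_agent, scripted_model and run_step are hypothetical names. A turn limit and error capture are added because a real loop needs some way to stop and some way to recover.

```python
from typing import Any, Callable

Action = dict[str, Any]


def run_agent(
    model: Callable[[list[str]], Action],
    run_step: Callable[[Action], str],
    goal: str,
    max_turns: int = 5,
) -> list[str]:
    """Run plan-act-observe until the model finishes or the turn budget is spent."""
    context = [f"goal: {goal}"]
    for _ in range(max_turns):
        step = model(context)               # the model proposes ONE action
        if step["action"] == "finish":      # the model signals the goal is met
            context.append(f"final: {step['answer']}")
            break
        try:
            result = run_step(step)         # the harness executes it
        except Exception as exc:            # failures become observations, not crashes
            result = f"error: {exc}"
        context.append(f"{step['action']} -> {result}")  # observe, then loop again
    return context


def scripted_model(context: list[str]) -> Action:
    """Stand-in for a real model call: look something up once, then finish."""
    if len(context) == 1:
        return {"action": "lookup", "query": "order 42"}
    return {"action": "finish", "answer": "order 42 has shipped"}


def run_step(step: Action) -> str:
    """Stand-in for tool execution."""
    if step["action"] == "lookup":
        return "shipped"
    raise ValueError(f"unknown action: {step['action']}")


transcript = run_agent(scripted_model, run_step, goal="check order 42")
```

Note that the model function is called exactly once per turn and never calls run_step itself; persistence and recovery both come from the surrounding for loop.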
Never let the model touch the tools directly
The single most important rule of harness design: the model never executes anything itself. It only proposes a structured action, such as "call search_orders with these arguments." The harness then validates that request against a schema, checks whether it is permitted, runs it, and feeds the result back. This control plane between intention and action is what turns a text generator into a system you can trust, much as an operating system sits between a program and the hardware.
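As a rough illustration of that validate, check, run, feed back sequence, the sketch below treats a proposal as a plain dict. Only the tool name search_orders comes from the text; its arguments, the cancel_order tool, the TOOLS and PERMITTED tables and the execute function are assumptions made for the example, and the type check is a simple stand-in for a real schema validator.

```python
from typing import Any, Callable


def search_orders(customer_id: int, status: str = "open") -> list[str]:
    """Stand-in tool; a real one would query an order system."""
    return [f"order for customer {customer_id} ({status})"]


def cancel_order(order_id: int) -> str:
    """Registered, but not permitted for this agent."""
    return f"cancelled {order_id}"


# name -> (argument types, required argument names, implementation)
TOOLS: dict[str, tuple[dict[str, type], set[str], Callable[..., Any]]] = {
    "search_orders": ({"customer_id": int, "status": str}, {"customer_id"}, search_orders),
    "cancel_order": ({"order_id": int}, {"order_id"}, cancel_order),
}
PERMITTED = {"search_orders"}


def execute(proposal: dict[str, Any]) -> dict[str, Any]:
    """Validate a proposed action, then run it; always return something the model can read."""
    name = proposal.get("tool")
    args = proposal.get("args", {})
    if not isinstance(name, str) or name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    if name not in PERMITTED:
        return {"ok": False, "error": f"not permitted: {name}"}
    if not isinstance(args, dict):
        return {"ok": False, "error": "args must be an object"}
    types, required, func = TOOLS[name]
    missing = sorted(required - set(args))
    if missing:
        return {"ok": False, "error": f"missing arguments: {missing}"}
    for key, value in args.items():
        if key not in types:
            return {"ok": False, "error": f"unexpected argument: {key}"}
        if not isinstance(value, types[key]):
            return {"ok": False, "error": f"wrong type for {key}"}
    return {"ok": True, "result": func(**args)}  # only the harness ever calls the tool


good = execute({"tool": "search_orders", "args": {"customer_id": 7}})
bad_type = execute({"tool": "search_orders", "args": {"customer_id": "7"}})
blocked = execute({"tool": "cancel_order", "args": {"order_id": 1}})
```

Rejections are returned as data rather than raised, so the loop can hand the error back to the model as an observation and let it try again.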
Context engineering: the art of what the model sees
A model has no memory between calls. Every turn, the harness rebuilds what the agent knows from scratch and packs it into a limited context window. Choosing what goes in — the goal, the recent steps, retrieved facts, tool results — and what gets summarised or dropped is its own discipline, often called context engineering. Get it wrong and the agent forgets its objective halfway through, or drowns in irrelevant detail. Most "the agent went off the rails" stories are really context failures.
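A small sketch of that packing step is shown below. It rests on simplifying assumptions: size is counted in whitespace-separated words as a crude stand-in for tokens, older steps are replaced by a one-line omission note rather than a real summary, and pack_context and NOTE_WORDS are hypothetical names.

```python
NOTE_WORDS = 4  # words reserved for the "[N earlier step(s) omitted]" note


def size(text: str) -> int:
    """Crude size estimate: number of whitespace-separated words."""
    return len(text.split())


def pack_context(goal: str, steps: list[str], budget: int) -> str:
    """Rebuild the prompt for one turn: goal first, then as many recent steps as fit."""
    header = f"GOAL: {goal}"              # the goal is always kept, even on a tiny budget
    remaining = budget - size(header) - NOTE_WORDS
    kept: list[str] = []
    for step in reversed(steps):          # walk from newest to oldest
        if size(step) > remaining:
            break                         # everything older than this is dropped too
        kept.append(step)
        remaining -= size(step)
    kept.reverse()                        # restore chronological order
    dropped = len(steps) - len(kept)
    lines = [header]
    if dropped:
        lines.append(f"[{dropped} earlier step(s) omitted]")
    return "\n".join(lines + kept)


steps = [
    "searched orders: found order 42",
    "checked policy: refund allowed within 30 days",
    "asked customer: confirmed refund",
]
prompt = pack_context("refund order 42", steps, budget=20)
```

With a budget of 20 words the oldest step is dropped and replaced by the note, while the goal and the two most recent steps survive. Keeping the goal unconditionally is the design choice that guards against the agent forgetting its objective halfway through.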
Why most agent failures are harness failures
When teams study why production agents break, the model is rarely the culprit. The recurring offenders are harness defects: context drift (the model slowly loses the thread), schema misalignment (a tool call that does not match what the tool expects), and state degradation (memory that grows stale or inconsistent). Industry analyses through 2026 attribute the majority of enterprise agent failures to exactly these glue problems — not to the intelligence of the model.
That is also why model choice has stopped being the differentiator. Frontier models are now close enough in raw capability that swapping one for another rarely decides whether an agent works. What decides it is the harness — the loop, the tools, the context discipline and the guardrails. The model is the smallest, most interchangeable part; the engineering that makes an agent reliable is everything around it.
That layer — the loop, the tools, the retrieval and the controls that turn a clever model into a dependable product — is exactly where we build. If you are moving an agent from an impressive demo to something you can put in front of customers, the harness is where that work happens.