What Is an Agent Harness? The Real Work in AI Agents
If you’ve been following the AI space even remotely over the past few years, you’ll know that conversations have mostly focused on the model. And while debating which ones are the smartest, least hallucinatory, and best at code is genuinely useful, we do ourselves a disservice by not considering the wider system.
While the topic is still nascent, the field seems to have converged onto a simple formula noted first by Viv Trivedy: Agent = Model + Harness.
What is an agent harness? It’s everything that wraps around a language model to make it a functioning agent. That includes the system prompts, tools it can call, memory it reads, orchestration logic, context management, and the constraints that prevent it from doing the wrong thing. The model generates tokens while the harness decides what the model sees, what happens with the output, and what runs next.
Addy Osmani says it best: “A decent model with a great harness beats a great model with a bad one.” That is where most of the engineering leverage sits right now, and “harness engineering” has become one of the fastest-growing areas of discussion in the AI practitioner community in 2026.
The rise of harness engineering
In February 2026, Mitchell Hashimoto, creator of Vagrant and Terraform, published “My AI Adoption Journey.” Step 5 was titled “Engineer the Harness”:
“I’ve grown to calling this ‘harness engineering.’ It is the idea that anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again.”
Two months later, Addy Osmani published “Agent Harness Engineering,” the most thorough public treatment of the subject to date:
“A coding agent is the model plus everything you build around it. Harness engineering treats that scaffolding as a real artifact, and it tightens every time the agent slips.”
Viv Trivedy’s sums it up best with a formula that’s oft quoted across the industry:
Agent = Model + Harness. If you’re not the model, you’re the harness.
Claude Code, Cursor, Codex, Aider are all harnesses built on top of similar (and sometimes identical) underlying models. Yet ask any practitioner in the community, and they’ll swear by their favorite tool as the most reliable, easy-to-use, or consistent. That’s because the behavior people experience is shaped by the harness.
What do agent harnesses solve for?
The topic clearly resonated with AI enthusiasts.
On May 2, 2026, a post called “The agent harness belongs outside the sandbox” hit the front page of Hacker News with 181 points and 121 comments. Practitioners across the community spent two hours arguing about where harness logic should live relative to execution environments.
Viv Trivedy’s breakdown of harness anatomy on X has drawn over 2,000 likes (and counting). On April 30, shell implementation of a complete agent harness called Pu.sh hit HN with a fair amount of excitement and debate, solidifying the point that a full harness doesn’t have to be a framework. It can exist within 400 lines of code.
And if you read through the endless conversations on X, Reddit, Hacker News, you’ll see three recurring AI pains that agent harnesses are meant to address:
Context rot. As context windows fill during long tasks, model output quality significantly degrade and the signal-to-noise ratio in what the model is reading drops. Good harnesses address this with compaction, selective tool output, and context resets between subtasks. They make it a known infrastructure problem rather than model problem.
Prompt drift. Run today’s prompt tomorrow (or the day after that), and the output will inevitably drift. Bump model versions as new ones release and your old behavior will eventually break. That’s the nature of non-deterministic LLM, alongside the quickly changing, ever improving models that get released at breakneck speeds. The harness helps deliver consistent and reliable results, regardless of the model, by locking the agent’s behavior in more deterministic configuration rather than floating in a prompt.
Vulnerabilities: Security researcher Simon Willison calls it the “lethal trifecta”: an agent with access to private data, exposure to untrusted content, and the ability to communicate externally. An attacker who controls the untrusted content channel can direct the agent to exfiltrate private data. The harness is the only layer where all three risks are addressed together.
What does the agent harness market look like?
The harness market, while still nascent, organized itself around those three problems. And as with any market, we’re slowly seeing the space differentiate itself into four clear categories that each serve different needs:
Code-first frameworks
Some of the earliest versions of an agent harness, code-first frameworks like LangChain, LlamaIndex, AutoGen, CrewAI are Python libraries for developers who want to build agents from the ground up. These frameworks are typically expressive and composable, but require deep expertise, weeks to months of assembly, and constant maintenance and management to.
Visual and low-code builders
Visual and low code tools like n8n, Langflow, and Flowise are drag-and-drop interfaces for connecting models to tools and data. But many practitioners hit a ceiling. Complex conditional logic, proper error handling, and multi-step agent behavior get unwieldy fast in visual editors. And tools like n8n works well for automations that follow a fixed path, but for dynamic paths, where the agent needs to decide what to do next, visual tools tend to be too brittle to work in production.
Cloud lab wrappers
Cloud lab wrappers include products like OpenAI Agents SDK, Anthropic’s Claude Agent SDK, and Google’s Gemini-based agent tooling. These give you a managed runtime backed by the provider’s own infrastructure, but the tradeoff is model lock-in. Your harness is designed around one provider’s APIs, and moving is expensive. And with how quickly the model space has expanded past the frontier models, developers need the flexibility to mix and match across LLM providers.
Off-the-shelf, integrated platforms
We’ve seen momentum building for off-the-shelf platforms that don’t require deep developer knowledge or brittle workflows to deliver agentic results. Tools like Friday Studio sit in this category; these platforms that give you a full harness runtime (memory, MCP tools, signals, orchestration, scheduling) with a conversation-based interface for configuring it, and a durable, deterministic config format that owns the result.
The appeal is the same reason people use Rails instead of assembling a web stack from scratch. You don’t want to wire up a message bus, a credential store, a scheduler, a context manager, and an MCP client one-by-one. You want the harness primitives to already exist so you can spend your effort on understanding what’s specific to your task. You can get something working in an afternoon, and reliability features like retry logic, observability, integrations, and human-in-the-loop approvals are already in the platform, rather than on a to-do list to build.
The line between good integrated platforms and bad ones is whether the config they produce is legible and visible. Workflows built on a platform that generates opaque state that you can’t inspect, version, or hand to a teammate can’t realistically be used in a production environment; they’re no better than a demo. What you need is a platform that produces a readable config file that you can diff, share, and run the same way on any machine.
What makes a good agent harness?
Osmani has a better break down of this than I can provide, which I’ll summarize here:
Earn every component. Every piece of a harness should trace to a failure it prevents or a behavior it enables. Hashimoto’s rule is that every line in a good AGENTS.md should trace back to something that went wrong.
Manage context on purpose. The agent only knows what is in its context window. A harness that loads every tool, every skill, every doc at startup degrades performance before the agent takes a single action. Good harnesses load skills when the task calls for them, offload large tool outputs to the filesystem, and compact on long runs.
Enforce rather than guidelines. Prompts that say “never do X” don’t act as enforcement. Enforcement means hooks that run before and after tool calls, block destructive commands, andrequire approval before external writes. A real production system needs to enforce.
Treat every failure as a configuration problem. Osmani and HumanLayer both make the point that most agent failures are configuration problems. If the agent doesn’t know a convention, then add it. If the agent runs a destructive command, then block it with a hook. A bad run calls for a harness improvement rather than blindly retrying against the model.
Produce configuration. A harness that stores its logic in floating prompts or a platform-specific database creates lock-in and fragility. A harness that produces a versioned, readable config file that behaves the same way regardless of which machine runs it.
Where Friday fits
Friday Studio is a complete agent harness. It builds in memory, MCP tool integrations, a scheduler, credential management, FSM-backed job orchestration, and a signal system into a single platform. You don’t have to spend hours wiring those pieces together yourself because they’re already there, out-of-the-box.
Two things separate Friday from other options in the space:
The first is that configuration is the output. You describe what you want in chat (“Every morning, triage my inbox, draft replies, file real asks as Linear tickets”) and Friday generates a workspace.yml that specifies the agents, the tools they can call, the signals that trigger them, and the FSM jobs that orchestrate the work. That file is readable, diffable, version-controllable, and portable. You can hand it to a teammate and it runs exactly the same on their machine with their own accounts and tools. Most integrated platforms trap their config in a GUI canvas or a proprietary database, but Friday produces a file you own.
The second is that you do not need to be a developer to use it. Friday is built for the tightening loop Hashimoto and Osmani describe, where each failure becomes a configuration fix. All you need to do is tell Friday what went wrong (or ask it to diagnose itself) and it’ll update the config. The YAML is there if you want make configuration changes yourself, but you do not have to touch it to get a production-grade harness.
That combination of a complete harness runtime, generated from conversation, producing config you own is what makes Friday the shortest path from ‘I want an agent that does X’ to an agent that actually reliably does X, on schedule, every time.
Friday is open source on GitHub and available as a one-click installer for macOS at hellofriday.ai.
Where agent harnesses are headed
Better models will never make harnesses obsolete; instead, they raise the ceiling for a satisfactory solution that meets all of a user’s needs. For example, scaffolding that handled context problems six months ago is now dead code because the models solved that problem. But the tasks now reachable have their own failure modes, and those require new harness layers.
Three directions have real momentum right now:
Multi-agent coordination. Single-agent pipelines hit limits on complex tasks. The pattern gaining traction is specialized agents: planner, executor, and reviewer, each scoped tightly and coordinated by the harness. Friday’s FSM-backed jobs and JetStream message bus handle this. You define which agents run, in what order and with what tools, and the runtime fans the work out and brings it back.
Just-in-time context assembly. Loading everything at startup is one of the most common production mistakes. Friday’s skill system works the other way: agents load skills on demand, tool access is whitelisted per agent, and each agent in a job sees only what its role requires.
Self-improving harnesses. The Meta-agent project ran an LLM-driven loop that proposed targeted harness updates from failed traces, validated against a holdout set, moving a customer-service agent from 67% to 87% task accuracy. Friday doesn’t close this loop automatically yet, but durable, readable config is the prereq, and its the direction Friday is headed.
Osmani points toward harnesses that act closer to a compiler, generating optimal scaffolding from a task spec at runtime. Friday’s conversation-to-config model is the nearest current step in that direction.
The practitioners driving this conversation are mostly developers today. But the pattern: every mistake becomes a rule, config ships the behavior, the harness tightens over time, doesn’t have to stay that way. The work is in the design.
Friday is source available on GitHub and available as a one-click installer for macOS at hellofriday.ai.


