
Pluggable Code Execution in the New OpenAI Agents SDK: Our Early Access Findings

Author: Romain Bourboulou

Making it easier to build locally, run on the remote provider of your choice, and even switch providers without changing the core workflow.

Executive Summary

  • Thinner harnesses and code-executing agents are better suited to open-ended tasks, where overly rigid orchestration can limit model performance.
  • As a result, code execution and sandboxing are now core architectural concerns for agentic systems.
  • The new Agents SDK reduces the complexity and code required to build code-executing agents, by as much as 6x in our tests.

For a long time, progress in agent systems came from improving orchestration: better prompting, tool interfaces, context management, and tighter control flow. But as coding agents become more capable, that balance is starting to shift.

In many open-ended workflows, the bottleneck is no longer the agent loop itself but the execution layer: the sandbox where the model writes code, runs commands, inspects outputs, and iterates. As more task-level reasoning moves into that environment, the surrounding orchestration needs to become simpler to let the model express its full capability.

That shift is exactly what the new version of the Agents SDK allows. From our testing during early access, we found that rather than adding another layer of framework logic, it makes the execution layer more modular and composable so the rest of the system can stay thin.

The Shift

It’s become trendy in harness engineering to reduce the harness to its minimum effective form. At a high level, the harness is the software around the model: the layer that manages context, tools, control flow, and feedback loops so the model can perform work reliably.

Over the past few years, many improvements in agent performance came from strengthening that layer. Better tools, better memory and retrieval, more explicit decomposition, and tighter orchestration often made systems more reliable and more capable. In that paradigm, progress largely meant encoding more task logic into the software around the model.

That pattern is now weakening, at least for a class of open-ended tasks. A growing number of projects and papers suggest that performance does not always improve when the harness becomes more prescriptive. In assisted coding, long-running tasks, browser use and long-context tasks, the same pattern appears repeatedly: once the model is intelligent enough, forcing too much task structure into the surrounding software can become a constraint rather than an advantage.

The role of the harness is thus changing. Instead of trying to anticipate the task in advance through rigid orchestration, the harness increasingly serves to provide a clean execution surface: a sandbox where the model can inspect state, run code, recover from errors, and adapt its own approach while remaining bounded by the system’s interfaces and safeguards. This is close to the shift Andrej Karpathy described in Software 3.0: some of the logic that previously lived in software moves upward into the “prompt”.

The lesson is not that agent systems should remove structure everywhere. Many tasks still benefit from explicit workflows, heuristics, and deterministic guardrails, especially when the task is narrow, high-volume, or has a clear success criterion. As we argued in our previous post on Heuristics for Agentic System Design, strong orchestration still matters when a reliable logic flow is both possible and desirable.

For open-ended tasks, the focus is shifting. The challenge is less about designing ever more elaborate orchestration layers, and more about building execution environments that are simple, observable, and modular enough for the model to work effectively inside them.

Offloading The Complexity From The Harness To The Execution Layer

Once an agent can read files, write code, run shell commands, and spin up long-running tasks, the engineering challenge changes. The hard part is no longer prompt optimisation or tool routing alone. Operating on a real system makes these agents significantly more powerful, but also more sensitive: they expose a larger safety and security surface. For example, an agent that can execute code can take harmful actions if its environment is poorly isolated (see the Sandbox Bench by the AISI).

Sandboxing is thus becoming a critical concern in agent frameworks. In earlier systems, execution was often treated as an add-on: a tool bolted onto the harness. But once execution becomes stateful, long-running, or remote, that approach starts to break down. Managing the sandbox itself, its lifecycle, its state, its interfaces, and the way it articulates with the agent loop, quickly becomes a system-design problem of its own. That is one of the reasons why a growing number of providers now offer managed environments for code execution, including OpenAI’s Container API and shell tool, Modal, Cloudflare, Daytona, and E2B.

This boundary matters because code execution requires stronger isolation and tighter runtime control than the rest of the harness. In practice, poorly implemented code-executing agents can introduce three business-critical risks: uncontrolled compute spend, destructive actions on internal systems, and exposure of sensitive information. With proper containerisation, isolation, and runtime safeguards, these risks can be contained to a level that is acceptable for real-world deployments.

One way to think about it is to imagine giving the agent its own sealed workspace rather than the keys to the entire office. It can still do useful work inside that space, but only within clearly defined boundaries. You can cap how much compute it uses, limit what systems and files it can touch, and control what information is available to it in the first place.
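Those caps can be made concrete with a toy sketch. The code below (our own illustration, assuming a POSIX system and only the Python standard library) runs model-written code in a throwaway working directory with hard CPU and memory limits and a stripped environment; real deployments would layer container- or VM-level isolation on top:

```python
import os
import resource
import subprocess
import sys
import tempfile

def run_sealed(code: str, cpu_seconds: int = 5, mem_bytes: int = 512 * 2**20) -> str:
    """Run model-written Python in a child process with hard resource caps.

    A minimal illustration of the 'sealed workspace' idea, not a full
    sandbox: it caps CPU time and address space (POSIX only), runs in a
    throwaway directory, and strips the inherited environment. Filesystem
    and network isolation still require containers, microVMs, or a managed
    provider.
    """
    def _limits() -> None:
        # Kernel-enforced caps: the child is killed if it exceeds them.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))

    with tempfile.TemporaryDirectory() as workdir:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            cwd=workdir,               # the agent only sees its scratch space
            env={"PATH": os.defpath},  # no inherited secrets or tokens
            capture_output=True,
            text=True,
            timeout=cpu_seconds + 5,   # wall-clock backstop for sleeps and IO
            preexec_fn=_limits,
        )
    return proc.stdout
```

With this shape, an infinite loop submitted by the agent is killed by the CPU limit rather than hanging the harness, and the child never sees the parent process's environment variables.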

That does not eliminate risk entirely, but it changes the problem from “an agent loose in your infrastructure” to “an agent operating inside a controlled environment.” If this layer is going to become a standard part of agent systems, it needs first-class support in the framework itself. The sandbox then becomes a modular execution layer with portable primitives that developers can adopt quickly, swap across providers, and scale without constantly reworking agent logic.

Why This Needs Better Agent Framework Support

Once an agent executes code, the sandbox itself needs orchestration. Moving from a local proof-of-concept to remote execution, multiple backends, or long-running sessions sharply increases the operational burden. You need a consistent way to create environments, stop them, pause and resume them, snapshot state, reconnect later, and manage all of this across providers.
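To make that lifecycle concrete, here is a sketch of the kind of provider-agnostic surface involved. The names (`start`, `exec`, `snapshot`, `resume`, `stop`) are our own illustrative choices, not the Agents SDK's actual API, and the toy in-memory backend exists only to exercise the interface:

```python
from dataclasses import dataclass, field
from typing import Protocol

class Sandbox(Protocol):
    """Hypothetical lifecycle surface a harness needs from any backend."""
    def start(self) -> None: ...
    def exec(self, command: str) -> str: ...
    def snapshot(self) -> str: ...
    def resume(self, snapshot_id: str) -> None: ...
    def stop(self) -> None: ...

@dataclass
class LocalSandbox:
    """Toy in-process backend: just enough state to exercise the interface."""
    state: dict = field(default_factory=dict)
    _snapshots: dict = field(default_factory=dict)
    running: bool = False

    def start(self) -> None:
        self.running = True

    def exec(self, command: str) -> str:
        # Stand-in for running a real command inside an isolated environment:
        # here a command like "x=1" just writes into a state dict.
        key, _, value = command.partition("=")
        self.state[key] = value
        return f"set {key}"

    def snapshot(self) -> str:
        sid = f"snap-{len(self._snapshots)}"
        self._snapshots[sid] = dict(self.state)  # copy, so later exec()s don't leak in
        return sid

    def resume(self, snapshot_id: str) -> None:
        self.state = dict(self._snapshots[snapshot_id])
        self.running = True

    def stop(self) -> None:
        self.running = False
```

Because the harness only depends on the `Sandbox` protocol, a remote provider's client can replace `LocalSandbox` without the agent loop changing, which is the portability property discussed above.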

None of this is conceptually glamorous, but it matters in practice. This is exactly the kind of infrastructure that becomes painful when every team rebuilds an agentic pipeline from scratch, especially when it is not integrated into the agent framework.

This is where better framework support becomes important. We had early access to the newer OpenAI Agents SDK and used it to build sandboxed agents ourselves. What stood out was the shift in architectural emphasis: the SDK treats execution as a first-class layer rather than as a peripheral tool. In practice, that means you can spin up a sandboxed agent, snapshot a sandbox, or resume execution with less code (roughly 6x less in some of our tests), then switch backends without rewriting the surrounding agent logic.

This cleaner separation of concerns lets the harness stay focused on reasoning, context, and workflow, while the execution layer focuses on isolation, portability, and runtime state. The abstraction makes it easier to build coding agents that are both more capable and easier to evolve: agents that can move between local and remote execution, support longer-running tasks, and change execution backends without forcing a redesign of the whole system.
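That separation can be seen in miniature below. The agent-side function depends only on an `exec` surface, so swapping a local stand-in for a remote one touches no agent logic. Both backend classes are hypothetical stand-ins of our own, not real provider clients:

```python
class LocalBackend:
    """Toy stand-in for an in-process execution backend."""
    def exec(self, cmd: str) -> str:
        return f"local ran: {cmd}"

class RemoteBackend:
    """Toy stand-in for a managed remote sandbox client."""
    def exec(self, cmd: str) -> str:
        return f"remote ran: {cmd}"

def agent_step(sandbox, observation: str) -> str:
    # The harness side: reasoning and workflow only, no provider details.
    return sandbox.exec(f"inspect {observation}")

# Switching providers changes one constructor call, nothing else.
print(agent_step(LocalBackend(), "test failures"))   # → local ran: inspect test failures
print(agent_step(RemoteBackend(), "test failures"))  # → remote ran: inspect test failures
```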

Key Takeaway

As more task-level logic moves from the harness into the model, some of the system complexity moves with it, down into the execution layer. Code execution and sandboxing are now core architectural concerns for agentic systems, especially for coding-heavy and open-ended tasks. Designing the environment in which the agent can act safely, reliably, and over time is now as important as designing the agentic pipeline itself.

This is why higher-level abstractions around sandboxed execution matter. The newer OpenAI Agents SDK moves in that direction by treating execution as a modular layer of the system: portable across backends, stateful across long-running tasks, and simple enough to use without rebuilding the same infrastructure for each new setup.

The broader lesson is that the next generation of agent frameworks will likely be defined less by how much orchestration logic they add, and more by how well they structure the execution environments agents increasingly depend on.